tensormonk.detection

Implementations may differ from the referenced papers, as the intention was not to replicate them exactly but to have the flexibility to combine concepts across several papers.

AnchorDetector

class AnchorDetector(config: tensormonk.detection.config.CONFIG)[source]

A common detection module on top of a base network, with NoFPN, BiFPN, FPN, and PAFPN body options.

Base is the backbone network (a pretrained or a custom one)

    Ex: ResNet-18
    1x3x224x224     1x64x56x56   1x128x28x28   1x256x14x14   1x512x7x7
       input    ->      o     ->      o     ->      o     ->      o
                        x1            x2            x3            x4
    Lets call x1, x2, x3, x4 as levels.

Base2Body has one 1x1 convolutional layer per level to convert the
depth of (x1, x2, x3, x4) to a constant depth (config.encoding_depth)

Ex: config.encoding_depth = 60
Base2Body((x1, x2, x3, x4))[0].shape == [1, 60, 56, 56]
Base2Body((x1, x2, x3, x4))[1].shape == [1, 60, 28, 28]
Base2Body((x1, x2, x3, x4))[2].shape == [1, 60, 14, 14]
Base2Body((x1, x2, x3, x4))[3].shape == [1, 60,  7,  7]

Body can have stacks of NoFPN/FPN/BiFPN/PAFPN layers. Essentially,
these act as context layers that are interconnected across levels
(exception is NoFPN layer).
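For illustration, a minimal sketch of the Base2Body idea (the module below is illustrative, not the library's internal implementation): one 1x1 convolution per level maps each backbone output to config.encoding_depth.

import torch
import torch.nn as nn

# Illustrative only: map the ResNet-18 level depths (64, 128, 256, 512)
# shown above to a constant encoding_depth of 60 with 1x1 convolutions.
encoding_depth = 60
base2body = nn.ModuleList(
    [nn.Conv2d(d, encoding_depth, 1) for d in (64, 128, 256, 512)])

x1 = torch.rand(1, 64, 56, 56)
x2 = torch.rand(1, 128, 28, 28)
x3 = torch.rand(1, 256, 14, 14)
x4 = torch.rand(1, 512, 7, 7)
levels = tuple(conv(x) for conv, x in zip(base2body, (x1, x2, x3, x4)))
# levels[0].shape == (1, 60, 56, 56) ... levels[3].shape == (1, 60, 7, 7)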
batch_detect(p_label: torch.Tensor, p_boxes: torch.Tensor, p_point: torch.Tensor)[source]

A list of Responses from detect.

Parameters
  • p_label (Tensor) – label predictions at each pixel for all levels

  • p_boxes (Tensor) – box predictions at each pixel for all levels

  • p_point (Tensor) – point predictions at each pixel for all levels

  • p_label.size – must match self.centers.size(0)

Return type

[tensormonk.detection.Responses, tensormonk.detection.Responses, …]

batch_encode(r_label: Union[list, tuple], r_boxes: Union[list, tuple], r_point: Union[list, tuple])[source]

Encodes raw labels, boxes and points of a batch of images.

Parameters
  • r_label (list/tuple) – list/tuple of tensors to encode. See encode for more information

  • r_boxes (list/tuple) – list/tuple of tensors to encode. See encode for more information

  • r_point (list/tuple) – list/tuple of tensors to encode. See encode for more information

Return type

tensormonk.detection.Responses

detect(p_label: torch.Tensor, p_boxes: torch.Tensor, p_point: torch.Tensor)[source]

Detects labels, boxes and points of a single image.

Parameters
  • p_label (Tensor) – label predictions at each pixel for all levels

  • p_boxes (Tensor) – box predictions at each pixel for all levels

  • p_point (Tensor) – point predictions at each pixel for all levels

  • p_label.size – must match self.centers.size(0)

Return type

tensormonk.detection.Responses

encode(r_label: torch.Tensor, r_boxes: torch.Tensor, r_point: torch.Tensor)[source]

Encodes raw labels, boxes and points of a single image.

Parameters
  • r_label (Tensor) – label for each object (0 is background)

  • r_boxes (Tensor) – ltrb boxes of each object (pixel coordinates without any normalization)

  • r_point (Tensor) – x, y, x, y, … for each object (pixel coordinates without any normalization). NaN values are ignored in loss computation.

Return type

tensormonk.detection.Responses
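For illustration, raw inputs for a single image with two objects might look as follows (the values are made up; the call at the end assumes `detector` is an initialized AnchorDetector):

import torch

# two objects; label 0 is reserved for background, so object labels start at 1
r_label = torch.tensor([1, 2])
# ltrb boxes in pixel coordinates (no normalization)
r_boxes = torch.tensor([[40., 60., 120., 180.],
                        [200., 80., 300., 220.]])
# one (x, y) point per object; missing points can be float("nan")
r_point = torch.tensor([[80., 120.],
                        [float("nan"), float("nan")]])
# responses = detector.encode(r_label, r_boxes, r_point)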

predict(tensor: torch.Tensor)[source]

Calls AnchorDetector.batch_detect with no grads.

Parameters

tensor (torch.Tensor) – input tensor in BCHW

Return type

tensormonk.detection.Responses

Block

class Block(encoding_depth: int, n_features: int, fusion: str = 'softmax')[source]

DepthWiseSeparable + FeatureFusion or FeatureFusion + DepthWiseSeparable. (EfficientDet: Scalable and Efficient Object Detection)

Parameters
  • encoding_depth (int, required) – depth of all the input tensors

  • n_features (int, required) – #Features to fuse. When n_features = 1, FeatureFusion is performed with the input and the output of the DepthWiseSeparable layer. Otherwise, FeatureFusion is performed on all the inputs, followed by DepthWiseSeparable layers.

  • fusion (str, optional) – fusion logic after resizing all the tensors to match the first tensor in the list/tuple/args using bilinear interpolation. Options - "sum", "fast-normalize", "softmax". (default = "softmax")

# TODO: More options for Block
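A hedged usage sketch (assumptions: Block is importable as tensormonk.detection.Block and its forward accepts a list/tuple of same-depth tensors, which are resized to the first tensor before fusion as described above):

import torch
from tensormonk.detection import Block

block = Block(encoding_depth=64, n_features=2, fusion="softmax")
a = torch.rand(1, 64, 40, 40)
b = torch.rand(1, 64, 20, 20)  # resized to a's 40x40 before fusion
o = block((a, b))              # assumed call signature; depth stays 64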

Classifier

class Classifier(config: tensormonk.detection.config.CONFIG)[source]

Classifier layer to predict labels, boxes, points, objectness and centerness.

Parameters

config (CONFIG) – See tensormonk.detection.CONFIG for more details.

Return type

tensormonk.detection.Responses

CONFIG

class CONFIG(name: str)[source]

CONFIG is used to configure all the options for object detection tasks.

Example: Assume an object detection model that is trained on 320x320 images to detect dogs and cats.

import tensormonk

config = tensormonk.detection.CONFIG("mnas_bifpn_dogs_cats")

# Define input size
config.t_size = (1, 3, 320, 320)

# Use pretrained MNAS model as base network
config.base_network = "mnas_100"
config.base_network_pretrained = True
# Given the above config and input size of (4, 3, 320, 320), base
# network will return a tuple of tensor's of shape
# ((4, 24, 80, 80), (4, 40, 40, 40), (4, 96, 20, 20), (4, 320, 10, 10))
# By using base_network_forced_stride, base network will return a tuple
# of tensor's of shape
# ((4, 24, 40, 40), (4, 40, 20, 20), (4, 96, 10, 10), (4, 320, 5, 5)).
config.base_network_forced_stride = True

# All the outputs from base network are encoded to have constant depth
# (96) using a 1x1 convolution per level.
# Essentially, the base network output with tensor shapes
# ((4, 24, 40, 40), (4, 40, 20, 20), (4, 96, 10, 10) and (4, 320, 5, 5))
# is converted to
# ((4, 96, 40, 40), (4, 96, 20, 20), (4, 96, 10, 10) and (4, 96, 5, 5))
config.encoding_depth = 96

# Define a body network with 4 "bifpn" layers.
config.body_network = "bifpn"
config.body_network_depth = 4

# Define number of labels (labels to detect + background)
config.n_label = 2 + 1
config.label_loss_fn = tensormonk.loss.LabelLoss
config.label_loss_kwargs = {
    "method": "ce_with_negative_mining",
    "pos_to_neg_ratio": 1 / 3.,
    "reduction": "mean"}

# Define loss function and encoding for bounding box
config.is_boxes = True
config.boxes_loss_fn = tensormonk.loss.BoxesLoss
config.boxes_loss_kwargs = {
    "method": "smooth_l1", "reduction": "mean"}
config.boxes_encode_format = "normalized_offset"

# Enable objectness and disable centerness
config.is_point = False
config.is_objectness = True
config.is_centerness = False

# Define encode_iou
# minimum iou required for a prior to set a location as non background
config.encode_iou = 0.5
# Define detect_iou - iou_threshold for non-maximal suppression
config.detect_iou = 0.2
# Define score_threshold - minimum score required to label an anchor as
# non background during inference.
config.score_threshold = 0.46
# Define ignore_base - As a pretrained base network is used in this
# example, disable the gradients to reach base_network for 5000
# iterations.
config.ignore_base = 5000

# Define anchors
config.anchors_per_layer = (
    # anchors at 40x40
    (config.an_anchor(32,   32), config.an_anchor(46,   46)),
    # anchors at 20x20
    (config.an_anchor(64,   64), config.an_anchor(90,   90)),
    # anchors at 10x10
    (config.an_anchor(128, 128), config.an_anchor(180, 180)),
    # anchors at 5x5
    (config.an_anchor(256, 256), config.an_anchor(320, 320)))
print(config)
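Continuing the example, a sketch of wiring the config into AnchorDetector and running inference (illustrative usage; the input shape follows config.t_size):

import torch

detector = tensormonk.detection.AnchorDetector(config)
detector.eval()

tensor = torch.rand(4, 3, 320, 320)   # BCHW, matching config.t_size
responses = detector.predict(tensor)  # calls batch_detect with no grads
# see Responses for the available fields (label, score, boxes, ...)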
an_anchor(w: int, h: int, offset: int = 0)[source]

A namedtuple with the w and h of an anchor.

property anchors_per_layer

All anchors per layer. A list/tuple of list/tuple of config.an_anchor’s.

Parameters

value (list/tuple, required) – a list/tuple of list/tuple of config.an_anchor's, one inner list/tuple per level.

property base_network

Base network for anchor detector (str/nn.Module).

Parameters

value (str, optional) – Current options are "mnas_050", "mnas_100", and "mobilev2". See tensormonk.architectures.MNAS and tensormonk.architectures.MobileNetV2 for more information. Also accepts a custom network. default = "mnas_050".

Example custom network:

import torch
import torch.nn as nn
import torch.nn.functional as F
import tensormonk


class Tiny(torch.nn.Module):
    def __init__(self, **kwargs):
        super(Tiny, self).__init__()
        self._layer_0 = torch.nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.PReLU())
        self._layer_1 = torch.nn.Sequential(
            nn.Conv2d(16, 24, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(24, 24, 3, stride=1, padding=1), nn.PReLU())
        self._layer_2 = torch.nn.Sequential(
            nn.Conv2d(24, 32, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.PReLU())
        self._layer_3 = torch.nn.Sequential(
            nn.Conv2d(32, 48, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(48, 48, 3, stride=1, padding=1), nn.PReLU())

    def forward(self, tensor: torch.Tensor):
        x0 = self._layer_0(tensor)
        x1 = self._layer_1(x0)
        x2 = self._layer_2(x1)
        x3 = self._layer_3(x2)
        return (x1, x2, x3)


config = tensormonk.detection.CONFIG("tiny")
config.base_network = Tiny
property base_network_forced_stride

Used when base_network is "mnas_050", "mnas_100", or "mobilev2" to add an additional stride in the second or third convolution layer.

Parameters

value (bool, optional) – default = False

property base_network_pretrained

Used when base_network is "mnas_050", "mnas_100", or "mobilev2" to load pretrained weights.

Parameters

value (bool, optional) – default = True

property body_fpn_fusion

Fusion scheme used by FPN and NoFPN. See tensormonk.layers.FeatureFusion and tensormonk.detection.Block for more information.

Parameters

value (str, optional) – default = "softmax". See tensormonk.layers.FeatureFusion for all available options.

property body_network

Body network options are

Parameters

value (str, optional) – "bifpn", "fpn", "nofpn", and "pafpn". default = "bifpn".

"bifpn" = tensormonk.detection.BiFPNLayer

"fpn" = tensormonk.detection.FPNLayer

"nofpn" = tensormonk.detection.NoFPNLayer

"pafpn" = tensormonk.detection.PAFPNLayer

property body_network_depth

Number of FPN or NoFPN layers to stack. Below is an example config of body network that has 6 "bifpn" layers:

import tensormonk
config = tensormonk.detection.CONFIG("mnas_bifpn")
config.base_network = "mnas_050"
config.encoding_depth = 96
config.body_network = "bifpn"
config.body_network_depth = 6
Parameters

value (int, optional) – default = 2.

property body_network_return_responses

When True, compute_loss in tensormonk.detection.AnchorDetector also returns the responses from body network.

Parameters

value (bool, optional) – default = False.

property boxes_encode_format

Boxes encoding format. See tensormonk.detection.ObjectUtils for more options.

Note: IOU-based loss functions require "normalized_offset".

Parameters

value (str, optional) – Options "normalized_gcxcywh" or "normalized_offset". default = "normalized_gcxcywh".

property boxes_encode_var1

Variance used to encode boxes - SSD: Single Shot MultiBox Detector (https://arxiv.org/pdf/1512.02325.pdf).

Parameters

value (float, optional) – default = 0.1.

property boxes_encode_var2

Variance used to encode boxes - SSD: Single Shot MultiBox Detector (https://arxiv.org/pdf/1512.02325.pdf).

Parameters

value (float, optional) – default = 0.2.

property boxes_loss_fn

Loss function to compute loss given p_boxes and t_boxes. This function is initialized in tensormonk.detection.AnchorDetector. A custom loss function can be used as long as it is an nn.Module and all the boxes_loss_kwargs are set.

Parameters

value (nn.Module, optional) – default = tensormonk.loss.BoxesLoss

property boxes_loss_kwargs

Dictionary of parameters required to initialize config.boxes_loss_fn.

Parameters

value (dict, required) – See tensormonk.loss.BoxesLoss for more information if config.boxes_loss_fn is tensormonk.loss.BoxesLoss.

property detect_iou

IOU threshold used by non-maximal suppression to filter boxes during detection.

Parameters

value (float, optional) – default = 0.5.

property encode_iou

Minimum IOU required for a box to be mapped to an anchor.

Parameters

value (float, optional) – default = 0.5.

property encode_iou_max_background

IOU below which an anchor is considered background.

Parameters

value (float, optional) – default = 0.5.
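A conceptual sketch of how the two thresholds can partition anchors (this mirrors common anchor-matching practice and the hypothetical thresholds below are for illustration only; it is not necessarily the library's exact implementation):

import torch

encode_iou = 0.5                 # minimum IOU for a positive match
encode_iou_max_background = 0.4  # below this, an anchor is background

# hypothetical IOUs between one ground-truth box and four anchors
iou = torch.tensor([0.70, 0.45, 0.20, 0.55])
positive = iou >= encode_iou                  # anchors assigned to the object
background = iou < encode_iou_max_background  # anchors treated as background
ignored = ~(positive | background)            # anchors left out of the loss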

property encoding_depth

Encoding depth to convert all the base network outputs to a constant depth in order to enable FPN and NoFPN layers.

Parameters

value (int, required) – See the example in tensormonk.detection.AnchorDetector for more information.

property hard_encode

Eliminates boxes with centers that are not within pix2pix_delta.

Parameters

value (bool, optional) – default = False.

property ignore_base

Gradients are not propagated to the base network for the first ignore_base iterations. Useful when fine-tuning with a pretrained base network.

Parameters

value (int, optional) – default = 0.

property is_boxes

Flag to enable bounding box detection. Not used in the current implementation (default = True); it will become relevant when a segmentation task is added.

Parameters

value (bool, optional) – default = True

property is_centerness

Enables centerness as defined in FCOS: Fully Convolutional One-Stage Object Detection (https://arxiv.org/pdf/1904.01355.pdf).

Parameters

value (bool, optional) – default = False.

property is_objectness

Enables objectness as defined in YOLOv3: An Incremental Improvement (https://pjreddie.com/media/files/papers/YOLOv3.pdf).

Parameters

value (bool, optional) – default = False.

property is_pad

Used for computing centers.

Parameters

value (bool, optional) – default = True.

property is_point

Flag to enable point localization within a bounding box.

Parameters

value (bool, optional) – default = False

property label_loss_fn

Loss function to compute loss given p_label and t_label. This function is initialized in tensormonk.detection.AnchorDetector. A custom loss function can be used as long as it is an nn.Module and all the label_loss_kwargs are set.

Parameters

value (nn.Module, optional) – default = tensormonk.loss.LabelLoss

property label_loss_kwargs

Dictionary of parameters required to initialize config.label_loss_fn.

Parameters

value (dict, required) – See tensormonk.loss.LabelLoss for more information if config.label_loss_fn is tensormonk.loss.LabelLoss.

property n_label

Number of labels (including background) to predict.

Parameters

value (int, required) – Must be >= 2.

property n_point

Number of points to detect in an object. This is relevant to tasks like identifying body parts/joints in person detection, facial landmarks in face detection, etc.

Parameters

value (int, optional) – Must be >= 1 and set when config.is_point is True.

property point_encode_format

Point encoding format. See tensormonk.detection.ObjectUtils for more information.

Parameters

value (str, optional) – default = "normalized_xy_offsets".

property point_encode_var

Normalization variance used to encode points (as in SSD).

Parameters

value (float, optional) – default = 0.5.

property point_loss_fn

Loss function to compute loss given p_point and t_point. This function is initialized in tensormonk.detection.AnchorDetector. A custom loss function can be used as long as it is an nn.Module and all the point_loss_kwargs are set.

Parameters

value (nn.Module, optional) – default = tensormonk.loss.PointLoss

property point_loss_kwargs

Dictionary of parameters required to initialize config.point_loss_fn.

Parameters

value (dict, required) – See tensormonk.loss.PointLoss for more information if config.point_loss_fn is tensormonk.loss.PointLoss.

property score_threshold

Score threshold used to filter boxes during detection.

Parameters

value (float, optional) – default = 0.5.

property single_classifier_head

Flag to enable single classifier head in tensormonk.detection.Classifier.

Parameters

value (bool, optional) – default = False. See tensormonk.detection.Classifier for more information.

property t_size

Input tensor size in BCHW. Also used to precompute centers, anchor_wh and pix2pix_delta.

Parameters

value (tuple, required) – Input tensor shape in BCHW (None/any integer >0, channels, height, width).

FPN Layers

All FPN layers use DepthWiseSeparable convolutions (with BatchNorm2d and Swish) and a FeatureFusion layer.

BiFPNLayer

class BiFPNLayer(config: tensormonk.detection.config.CONFIG)[source]

A modified version of BiFPN compatible with CONFIG. Upscale/downscale is done with bilinear interpolation. (EfficientDet: Scalable and Efficient Object Detection).

Parameters

config (CONFIG, required) – See CONFIG for more details.

Logic: n_scales = 4
-------------------
low-resolution    o ------> o ->
                   _\_____  ^
                  |  \    \ |
                  o -> o -> o ->
                   ___ | _  ^
                  |    v  \ |
                  o -> o -> o ->
                         \  ^
                          \ |
high-resolution   o ------> o ->
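A conceptual, torch-only sketch of the bidirectional flow in the diagram above (plain summation instead of the configurable FeatureFusion, bilinear resizing in both directions; purely illustrative, not the BiFPNLayer implementation):

import torch
import torch.nn.functional as F

def bifpn_sketch(levels):
    # levels: tensors ordered high- to low-resolution, all with the same depth
    td = list(levels)
    # top-down pass: add upsampled coarser context to each finer level
    for i in range(len(td) - 2, -1, -1):
        td[i] = td[i] + F.interpolate(td[i + 1], size=td[i].shape[2:],
                                      mode="bilinear", align_corners=False)
    # bottom-up pass: add downscaled finer detail and the original input
    out = list(td)
    for i in range(1, len(out)):
        out[i] = out[i] + levels[i] + F.interpolate(
            out[i - 1], size=out[i].shape[2:], mode="bilinear",
            align_corners=False)
    return tuple(out)

levels = (torch.rand(1, 96, 40, 40), torch.rand(1, 96, 20, 20),
          torch.rand(1, 96, 10, 10), torch.rand(1, 96, 5, 5))
fused = bifpn_sketch(levels)  # same shapes, context shared across levels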

FPNLayer

class FPNLayer(config: tensormonk.detection.config.CONFIG)[source]

A modified version of FPN compatible with CONFIG. Upscale/downscale is done with bilinear interpolation. (Feature Pyramid Networks for Object Detection).

Parameters

config (CONFIG, required) – See CONFIG for more details.

n_scales = 3           Ex: Base with single FPN layer
------------           ------------------------------
    -> o ->            o -> o -> low-resolution
       |               ^    |
       v               |    v
    -> o ->            o -> o ->
       |               ^    |
       v               |    v
    -> o ->            o -> o -> high-resolution
                       ^
                       |
                       o
                       ^
                       |
                     input

NoFPNLayer

class NoFPNLayer(config: tensormonk.detection.config.CONFIG)[source]

A residual DepthWiseSeparable block is used as the base block.

Parameters

config (CONFIG, required) – See CONFIG for more details.

n_scales = 3
------------
Ex: Base with a single NoFPN layer

Pretrained | Detection Layers
Ex: ResNet | with anchors
-----------|-----------------
    o      |   -> o
    ^      |
    |      |
    o      |   -> o
    ^      |
    |      |
    o      |   -> o
    ^      |
    |      |
    o      |
    ^      |
    |      |
           |
  input    |

PAFPNLayer

class PAFPNLayer(config: tensormonk.detection.config.CONFIG)[source]

A modified version of PAFPN compatible with CONFIG. Upscale/downscale is done with bilinear interpolation. (Path Aggregation Network for Instance Segmentation).

Parameters

config (CONFIG, required) – See CONFIG for more details.

Logic:  n_scales = 3
--------------------
low-resolution    -> o -> o ->
                     |    ^
                     v    |
                  -> o -> o ->
                     |    ^
                     v    |
high-resolution   -> o -> o ->

Responses

class Responses(label: torch.Tensor, score: torch.Tensor, boxes: torch.Tensor, point: torch.Tensor, objectness: torch.Tensor, centerness: torch.Tensor)[source]

An object with all the predictions from the anchor detector. The properties are (label, score, boxes, point, objectness, centerness).

Parameters
  • label (torch.Tensor/None) – Predicted labels.

  • score (torch.Tensor/None) – Predicted scores.

  • boxes (torch.Tensor/None) – Predicted boxes (encoded).

  • point (torch.Tensor/None) – Predicted points (encoded).

  • objectness (torch.Tensor/None) – Predicted objectness.

  • centerness (torch.Tensor/None) – Predicted centerness.
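A small sketch of reading the fields (the Responses object below is built by hand with made-up values just to show the properties; in practice it comes from AnchorDetector.detect or predict):

import torch
from tensormonk.detection import Responses

responses = Responses(
    label=torch.tensor([1, 2]),
    score=torch.tensor([0.91, 0.67]),
    boxes=torch.tensor([[10., 20., 110., 160.],
                        [60., 40., 200., 220.]]),
    point=None, objectness=None, centerness=None)

for label, score, box in zip(responses.label, responses.score,
                             responses.boxes):
    print(int(label), float(score), box.tolist())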

Sample

class Sample(image: str, labels: numpy.ndarray, boxes: numpy.ndarray, points: Optional[numpy.ndarray] = None)[source]

Sample is an object that contains the image path, labels, bounding boxes and points for object detection tasks that can localize landmarks. It can augment data (random 90/180/270 rotations, random padding and random cropping) during training – boxes and points are adjusted accordingly. The image can be resized along with boxes and points if Sample.OSIZE is initialized.

Attributes (set once):

INVALID (float): In cases where some points are not available, set the value to float("nan"). This allows those points to be tracked after augmentation (they must be filtered during loss computation – tensormonk.loss.PointLoss handles this automatically). default = float("nan")

OSIZE (tuple): (width, height) of the output image. When not set, the image is returned without resizing, along with its attributes (boxes and points), after augmentation. default = None

RESIZE (bool): When True and OSIZE != None, resizes the image during augmentation and adjusts the boxes and points to the new image size.

ROTATE_90 (bool): Enables random rotation (90/180/270). default = True

ROTATE_90_PROBS (tuple): Probabilities for ROTATE_90. default = (0.4, 0.6, 0.8), i.e. 40%, 20%, 20% and 20% probability of rotating 0, 90, 180 and 270 degrees respectively.

PAD (bool): Does random padding. default = True

PAD_PERCENTAGE (float): Maximum percentage of height and width that is padded. Must satisfy 0 < PAD_PERCENTAGE < 1. default = 0.1

CROP (bool): Does random cropping. default = True

CROP_MIN_SIDE_PERCENTAGE (float): Minimum percentage of the size that must be retained. Must satisfy 0 < CROP_MIN_SIDE_PERCENTAGE < 1. default = 0.3

CROP_MIN_OBJECT_SIDE (int): Minimum side of an object that has to be maintained after crop and resize. In case of multiple objects, at least one object will have min(w, h) >= CROP_MIN_OBJECT_SIDE. Must satisfy 0 < CROP_MIN_OBJECT_SIDE < min(Sample.OSIZE). default = 16

CROP_N_ATTEMPTS (int): Number of attempts to find a random crop; when all attempts fail, one object is randomly selected and a crop is extracted around it. Depends on CPU (a larger number can slow down the dataloader). default = 16

RETAIN_AREA (float): An object is retained only if original area * RETAIN_AREA >= visible area after a crop. default = 0.5

Parameters
  • image (str, required) – Full path to the image (does not accept an ndarray or PIL image, since large datasets cannot fit in memory)

  • labels (list/tuple/np.ndarray, required) – labels of all the objects in the image. In order to use LabelLoss, use 0 for background.

  • boxes (list/tuple/np.ndarray, required) – bounding boxes for all the labels. Must be in pixel coordinates and ltrb form (left, top, right, bottom)

  • points (list/tuple/np.ndarray, optional) – [x, y, x, y, …] points of all the bounding boxes. If points for some objects are missing, use float("nan") and keep the same number of points for all labels. When not required, use None.

import torch
from tensormonk.detection import Sample
from torchvision import transforms

Sample.OSIZE = 320, 320
Sample.RESIZE = True
Sample.ROTATE_90 = False
Sample.PAD = False
Sample.CROP = True
Sample.CROP_MIN_SIDE_PERCENTAGE = 0.3
Sample.CROP_MIN_OBJECT_SIDE = 16
Sample.CROP_N_ATTEMPTS = 8

data = [["./image1.jpg", [1], [[4, 6, 4, 6]]],
        ["./image2.jpg", [4, 6], [[4, 6, 4, 6], [2, 6, 3, 6]]]]


class SomeDB(object):
    def __init__(self, data, osize: tuple):

        self.samples = []
        for x in data:
            self.samples.append(
                Sample(image=x[0], labels=x[1], boxes=x[2],
                       points=None))

        self.transforms = transforms.Compose(
            [transforms.RandomApply(
                [transforms.ColorJitter(0.1, 0.1, 0.1, 0.1)], p=0.5),
             transforms.RandomGrayscale(p=0.25),
             transforms.ToTensor()])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, labels, boxes, points = self.samples[idx].augmented()
        tensor = self.transforms(image)
        labels = torch.from_numpy(labels).long()
        boxes = torch.from_numpy(boxes).float()
        if points is None:
            return tensor, labels, boxes

        points = torch.from_numpy(points).float()
        return tensor, labels, boxes, points


dataset = SomeDB(data, (320, 320))
# To check how augmentation is working use the following to visualize
dataset.samples[0].annotate_augmented()
# To visualize original data
dataset.samples[0].annotate()
annotate(ids: list = [], image: Optional[PIL.Image.Image] = None, boxes: Optional[numpy.ndarray] = None, points: Optional[numpy.ndarray] = None)[source]

Annotates boxes and points on the image.

annotate_augmented()[source]

To visualize augmented data.

augmented()[source]

Provides augmented data.

avoid_nans_to_visualize(points: numpy.ndarray)[source]

Removes NaNs from the points.

property boxes

A property that returns a copy of all boxes (np.ndarray) on the image in ltrb format.

property boxes_cxcywh

A property that returns a copy of all boxes (np.ndarray) on the image in cxcywh format.

data()[source]

Provides a copy of original data.

property image

A property that returns the PIL image (read from disk every time).

property image_name

A property that returns image full path.

property is_boxes

A property that returns True when boxes are available.

property is_points

A property that returns True when points are available.

property labels

A property that returns a copy of all labels (np.ndarray) on the image.

property points

A property that returns a copy of all points (np.ndarray) on the image in pixel coordinates.