
Project Vision

Capstone Project for Extensive Vision AI Program



Table of Contents

  1. Problem Statement
  2. Model
  3. Dataset
  4. Model Development
  5. Set up Model Training
  6. Training
  7. Detection
  8. Future Scope
  9. Leaving Note
  10. Contact

Problem Statement

The assignment is to create a network that can perform 3 tasks simultaneously:

  1. Detect boots, PPE, hardhats, and masks, if any are present in the image
  2. Predict the depth map of the image
  3. Predict the planar surfaces in the image

The strategy is to use pre-trained networks and treat their outputs as the ground-truth data.

Model

The network uses an encoder-decoder architecture.

vision

Dataset

The data used to train the model can be downloaded below.

Download Data

The high-level steps taken to create the dataset are as follows:

  1. Collect images from different websites of people wearing hardhats, masks, PPE, and boots.
  2. For object detection, use a YoloV3 annotation tool to draw bounding boxes for the labels.
  3. Use MidasNet by Intel to generate depth maps for the images.
  4. Use Planercnn to generate plane segmentations of the images.

A detailed explanation and the code can be found in this Repo.

Issues faced

Additional Data

Download link

Model Development

In this section I will explain the steps taken to reach the final trainable model. A significant amount of time was invested initially in reading the research papers of each model and understanding their architectures, which made it possible to split each encoder from its decoder.

  1. Step 1: Define the high-level outline of the final model, then give a definition for each of its components
  2. Step 2: Define Encoder Block
    • The three networks use three different encoder blocks:
      • MidasNet - ResNext101_32x8d_wsl
      • Planercnn - ResNet101
      • Yolov3 - Darknet-53
    • My initial thought was to use Darknet as the base encoder, since it offers accuracy similar to ResNet and is almost 2x faster based on ImageNet benchmarks. The downside is that it is comparatively complex to separate only the Darknet config from the Yolov3 config and then reuse the same code blocks from the Yolov3 model definition and forward method; this would mean recreating those code blocks so that only the Darknet encoder is processed. Since the encoder and decoder of Yolov3 are tightly coupled in code, I decided against using it.
    • Of the other two options, I tried both separately as the encoder block. Based on published benchmarks, ResNext-101 performs better than ResNet-101, and the ResNext WSL models maintained by Facebook are pre-trained in a weakly-supervised fashion on 940 million public images with 1.5K hashtags matching 1000 ImageNet1K synsets, followed by fine-tuning on ImageNet1K. Hence the ResNext block below is used as the encoder with the pretrained weights (see the encoder sketch after this list):
          resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
    
  3. Step 3: Define Depth decoder block
  4. Step 4: Define Object detection decoder block
  5. Step 5: Define Plane segmentation decoder block
    • Planercnn is built on the MaskRCNN network, which uses resnet101 as the backbone feature extractor, followed by the FPN, RPN, and the remaining detection layers
    • The first five layers (C1 - C5) of the FPN come directly from the resnet101 block; I changed them to connect to the corresponding layers of our custom encoder block (note: C1 & C2 together form layer 1 of our ResNext101 encoder)
      • Encoder layer 1 output –> FPN C1 layer
      • Encoder layer 2 output –> FPN C2 layer
      • Encoder layer 3 output –> FPN C3 layer
      • Encoder layer 4 output –> FPN C4 layer
    • A key point in the Planercnn integration is that its default NMS and ROI Align are compiled against torch 0.4, which is incompatible with the other decoder modules that use the latest torch version; to handle this, the default NMS was replaced with the NMS from torchvision, and a ROI Align built on pytorch (link) was used
    • One key issue faced during training was gradient explosion after a single training iteration. After significant debugging time, the cause turned out to be replacing resnet101 directly with the custom encoder blocks; the solution was to retain the resnet101 structure but replace the values of the corresponding FPN layers with the encoder layer outputs in the forward method, as sketched after this list
  6. Step 6: The trainable model
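
To make the encoder split in Step 2 concrete, below is a minimal sketch of how the pretrained ResNext101 WSL backbone can be wrapped as the shared encoder, exposing its intermediate stage outputs for the three decoders. The class and attribute names are illustrative, not the exact ones used in the repo.

    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        """Shared encoder built from the pretrained ResNext101 32x8d WSL backbone.
        Exposes the four intermediate stage outputs so each decoder can tap into them.
        (Illustrative sketch; the actual repo code may differ.)"""

        def __init__(self):
            super().__init__()
            resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
            # Stem + first residual stage kept together as "layer 1"
            self.layer1 = nn.Sequential(
                resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1
            )
            self.layer2 = resnet.layer2
            self.layer3 = resnet.layer3
            self.layer4 = resnet.layer4

        def forward(self, x):
            l1 = self.layer1(x)   # 256 channels
            l2 = self.layer2(l1)  # 512 channels
            l3 = self.layer3(l2)  # 1024 channels
            l4 = self.layer4(l3)  # 2048 channels
            return l1, l2, l3, l4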
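
For the gradient-explosion fix in Step 5, the idea is to keep the Planercnn FPN structure intact but feed it the shared encoder outputs in the forward method instead of recomputing the resnet101 stages. A rough sketch of such a forward pass is below; the layer names and channel widths are assumptions for illustration, not the repo's exact code.

    import torch.nn as nn
    import torch.nn.functional as F

    class FPNTopDown(nn.Module):
        """Top-down FPN path that consumes the shared encoder outputs in place of
        the values the original resnet101 stages would have produced.
        (Illustrative sketch with assumed names and channel widths.)"""

        def __init__(self, out_channels=256):
            super().__init__()
            # Lateral 1x1 convs; input widths follow the ResNext101 stage outputs.
            self.lateral2 = nn.Conv2d(256, out_channels, kernel_size=1)
            self.lateral3 = nn.Conv2d(512, out_channels, kernel_size=1)
            self.lateral4 = nn.Conv2d(1024, out_channels, kernel_size=1)
            self.lateral5 = nn.Conv2d(2048, out_channels, kernel_size=1)

        def forward(self, encoder_feats):
            # encoder_feats = (l1, l2, l3, l4) from the shared encoder.
            c2, c3, c4, c5 = encoder_feats
            p5 = self.lateral5(c5)
            p4 = self.lateral4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
            p3 = self.lateral3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
            p2 = self.lateral2(c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
            return p2, p3, p4, p5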

Set up Model Training

  1. Step 1: Define input parameters for training
    • As each of the 3 networks has its own set of default parameters for decoder configuration and data preprocessing, I combined the argument parsers of all 3 decoders into a single file, options.py (see the options sketch after this list)
    • This lets us pass the required input parameters, including the weights path, for each of the decoders separately
  2. Step 2: Define Skeleton
    • The Midasnet repo is defined only for inference; training on custom data is not part of it
    • The Planercnn repo is quite large, and the structure of its training dataset and the process for generating it are complex and time-consuming
    • The Yolov3 repo has a well-defined training process
    • Hence the Yolov3 training code was used as the reference for training the model
  3. Step 3: Data loader
    • Defined separate train and test datasets from the train.txt and test.txt split files of the dataset
    • Included the data transformations from the 3 decoders
    • Additionally, for Planercnn training, the segmentation_final.png, plane_parameters.npy, and plane_masks.npy files were loaded; similarly, for depth training, the depth map images were loaded (see the data loader sketch after this list)
    • The Planercnn code blocks are written to work only with a batch size of 1, so any batch size > 1 will not work with the current code base
  4. Step 4: Loss function
    • Object detection - the compute loss method of the yolov3 code base is used to calculate the loss for the object detection network; the complete loss equation is shown below

    yolo_loss

    • Depth Estimation - to compare the predicted depth map with the target image, a combination of the two losses below is used
      • RMSE (Root Mean Square Error): RMSE penalizes large errors in terms of differences in the pixel intensities of the image rmse
      • SSIM (Structural Similarity Index Measure): SSIM measures the structural differences between the predicted and the actual depths and also penalizes noise in the prediction
      • Depth_loss = RMSE + SSIM
    • Plane Segmentation - the loss for plane segmentation is defined as follows
      • The predefined loss function in planercnn uses cross_entropy loss to compare rpn_class and rpn_bbox
      • MSE (Mean Squared Error) is used to directly compare plane_parameters.npy & plane_masks.npy with the predicted arrays; MSE performs better at pixel-level comparison
      • SSIM is also used to compare the predicted segmentation image with the target image
      • Plane_loss = computed_loss + MSE_Loss + SSIM
    • Overall loss (see the loss sketch after this list)

    all_loss = (add_plane_loss * plane_loss) + (add_yolo_loss * yolo_loss) + (add_midas_loss * depth_loss)

  5. Step 5: Optimizer
    • Stochastic Gradient Descent (SGD) is used as the default optimizer with the parameters below (see the optimizer sketch after this list)
      • start lr : 0.01
      • Final lr : 0.0005
      • momentum : 0.937
      • weight_decay : 0.000484
    • Scheduler : Lambda lr
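
For Step 1, a minimal sketch of what a combined options.py could look like: a single parser carrying a few shared training parameters plus the per-decoder weight paths and loss weights. The argument names and defaults here are illustrative; the real options.py merges the full argument sets of the three repos.

    import argparse

    def get_options():
        """Single parser combining inputs needed by the three decoders (illustrative subset)."""
        parser = argparse.ArgumentParser(description="Multi-task vision model training")
        # Shared training parameters
        parser.add_argument("--epochs", type=int, default=50)
        parser.add_argument("--batch-size", type=int, default=1)  # Planercnn path supports only 1
        parser.add_argument("--img-size", type=int, default=512)
        # Per-decoder weight paths (hypothetical defaults)
        parser.add_argument("--yolo-weights", type=str, default="weights/yolov3.pt")
        parser.add_argument("--midas-weights", type=str, default="weights/midas.pt")
        parser.add_argument("--planercnn-weights", type=str, default="weights/planercnn.pth")
        # Loss weights used in the overall loss
        parser.add_argument("--add-yolo-loss", type=float, default=1.0)
        parser.add_argument("--add-midas-loss", type=float, default=1.0)
        parser.add_argument("--add-plane-loss", type=float, default=1.0)
        return parser.parse_args()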
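
For Step 3, a rough sketch of a dataset that returns one sample with the targets for all three decoders; the file names follow the ones mentioned above, while the directory layout, label format, and transforms are assumptions.

    import os
    import numpy as np
    from PIL import Image
    from torch.utils.data import Dataset

    class MultiTaskDataset(Dataset):
        """Returns the input image plus targets for all three decoders.
        (Sketch; directory layout and label format are assumptions.)"""

        def __init__(self, list_file, root, img_size=512):
            with open(list_file) as f:
                self.image_paths = [line.strip() for line in f if line.strip()]
            self.root = root
            self.img_size = img_size

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            name = os.path.splitext(os.path.basename(self.image_paths[idx]))[0]
            image = Image.open(self.image_paths[idx]).convert("RGB")
            # Yolo-style label file: class x_center y_center w h (normalized)
            labels = np.loadtxt(os.path.join(self.root, "labels", name + ".txt")).reshape(-1, 5)
            # Depth map generated by MidasNet
            depth = Image.open(os.path.join(self.root, "depth", name + ".png"))
            # Planercnn targets
            segmentation = Image.open(os.path.join(self.root, "planes", name + "_segmentation_final.png"))
            plane_parameters = np.load(os.path.join(self.root, "planes", name + "_plane_parameters.npy"))
            plane_masks = np.load(os.path.join(self.root, "planes", name + "_plane_masks.npy"))
            # The resize / tensor transforms of the three repos would be applied here.
            return image, labels, depth, (segmentation, plane_parameters, plane_masks)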
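
For Step 4, a condensed sketch of the depth loss (RMSE + SSIM) and the overall weighted loss. It assumes an ssim_fn helper (e.g. from a package such as pytorch-msssim or kornia) and takes the per-decoder losses as already-computed tensors; in the actual code the object detection loss comes from the yolov3 compute loss and the plane loss from the planercnn terms described above.

    import torch

    def depth_loss(pred_depth, target_depth, ssim_fn):
        """Depth loss = RMSE + SSIM term (sketch)."""
        rmse = torch.sqrt(torch.mean((pred_depth - target_depth) ** 2))
        # ssim_fn returns similarity in [0, 1]; use (1 - SSIM) so that lower is better.
        ssim_term = 1.0 - ssim_fn(pred_depth, target_depth)
        return rmse + ssim_term

    def overall_loss(plane_loss, yolo_loss, midas_loss,
                     add_plane_loss=1.0, add_yolo_loss=1.0, add_midas_loss=1.0):
        """Weighted sum matching the all_loss equation above (sketch)."""
        return (add_plane_loss * plane_loss
                + add_yolo_loss * yolo_loss
                + add_midas_loss * midas_loss)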
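
For Step 5, a sketch of the SGD optimizer with the listed hyper-parameters and a LambdaLR schedule decaying the learning rate from the start value of 0.01 towards the final value of 0.0005; the exact lambda used in the repo may differ (a cosine-shaped decay is assumed here).

    import math
    import torch.optim as optim
    from torch.optim.lr_scheduler import LambdaLR

    def build_optimizer(model, epochs=50, start_lr=0.01, final_lr=0.0005):
        optimizer = optim.SGD(
            model.parameters(),
            lr=start_lr,
            momentum=0.937,
            weight_decay=0.000484,
        )
        # Cosine-shaped decay from start_lr to final_lr (assumed lambda shape).
        lf = lambda epoch: ((1 + math.cos(epoch * math.pi / epochs)) / 2) \
            * (1 - final_lr / start_lr) + final_lr / start_lr
        scheduler = LambdaLR(optimizer, lr_lambda=lf)
        return optimizer, scheduler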

Training

  1. One key issue faced during training was frequently running out of memory; the steps below were used to handle it (see the sketch at the end of this section)
    • Clear the torch CUDA cache at the end of each epoch
    • Run the Python garbage collector at the end of each training iteration to free variables that are no longer required
  2. Training on small-resolution images - the initial few epochs were performed on 64x64 images, but this could only be done for object detection and depth, as planercnn accepts images with a minimum size of 256
  3. The optimum resolution at which the entire model could train is 512 x 512; most of the epochs were run at this resolution
  4. The additional data was used to train the planercnn model separately
  5. The time taken per epoch was initially around 1.15 hours; this was reduced after standardising the image scale in all decoders to 512 (Planercnn used 480x640 and midas worked with 384x384). The time taken per epoch is now 40-50 min

  6. In Part 1 of training, the overall loss decreased as the model trained; the initial loss at epoch 0 was at 21. (notebook)

part1

  7. In Part 2 of training, the overall loss reduced further in subsequent epochs to 7. (notebook)

part2
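
Finally, a minimal sketch of the memory housekeeping described in point 1: freeing unused Python objects after each iteration and clearing the CUDA cache after each epoch. The loop structure and the training_step helper are illustrative only.

    import gc
    import torch

    def train(model, dataloader, optimizer, scheduler, epochs):
        """Training-loop skeleton showing only the memory housekeeping (sketch)."""
        for epoch in range(epochs):
            for batch in dataloader:
                optimizer.zero_grad()
                loss = model.training_step(batch)  # hypothetical helper returning all_loss
                loss.backward()
                optimizer.step()
                # Drop references and collect garbage so large intermediate
                # tensors can be freed before the next iteration.
                del batch, loss
                gc.collect()
            scheduler.step()
            # Release cached, unused GPU memory back to the driver at epoch end.
            torch.cuda.empty_cache()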

Detection

python detection.py

Future Scope

Leaving Note

Contact

For any further clarification or support, kindly check my GitHub repo or contact me.