Object Detection Using Deep Learning From Very High Resolution Imagery
*Vijayalakshmi S., Mopidevi Omkar*
Assistant Professor, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Thandalam, Chennai, Tamilnadu, India - 602 105
UG Scholar, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Thandalam, Chennai, Tamilnadu, India - 602 105
Abstract – Recent technological developments such as autonomous vehicles, smart video surveillance, facial detection and crowd monitoring applications have created demand for specialized object detection algorithms. These algorithms involve not only recognizing and classifying every object in an image, but also localizing each object by drawing the appropriate bounding box around it. This makes object detection a significantly more complex task than traditional image classification and computer vision acquisition. This paper presents a comprehensive survey of newer algorithms introduced for object detection using deep learning.
Keywords: object detection, image classification, computer vision, deep learning
Object detection approaches are enhancements of image classification models. Recently, Google released a new object detection API for TensorFlow with pre-built architectures for models such as:
Region-Based Fully Convolutional Networks
This paper presents a comprehensive survey of the above models and identifies the better-performing models for object detection.
R-CNN, or Region-based Convolutional Neural Network, consists of three simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, which generates approximately 2000 region proposals.
Run a CNN on top of each of the regions proposed in the previous step.
Take the output of each CNN and feed it into an SVM to classify the region, and into a linear regressor to tighten the bounding box of each object.
The three steps stated above are illustrated in Fig.1.
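The three steps can be sketched in outline form. The sketch below uses stand-in stub functions (`selective_search`, `cnn_features`, `svm_classify` are hypothetical placeholders, not a real library API) purely to show the data flow: ~2000 proposals, one CNN pass per proposal, then an SVM on the features.

```python
import numpy as np

def selective_search(image):
    """Stand-in for Selective Search: the real algorithm returns
    ~2000 region proposals as (x1, y1, x2, y2) boxes."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, w // 2, size=2000)
    y1 = rng.integers(0, h // 2, size=2000)
    x2 = x1 + rng.integers(1, w // 2, size=2000)
    y2 = y1 + rng.integers(1, h // 2, size=2000)
    return np.stack([x1, y1, x2, y2], axis=1)

def cnn_features(crop):
    """Stand-in for the CNN: maps a warped crop to a feature vector."""
    return np.array([crop.mean(), crop.std()])

def svm_classify(feat):
    """Stand-in for the per-class SVMs: returns a class label."""
    return int(feat[0] > 0.5)

image = np.random.rand(256, 256, 3)
proposals = selective_search(image)           # step 1: ~2000 proposals
for (x1, y1, x2, y2) in proposals[:5]:        # steps 2-3 (a few, for brevity)
    crop = image[y1:y2, x1:x2]
    label = svm_classify(cnn_features(crop))  # SVM on CNN features
```

The key point the sketch makes concrete is the cost: the CNN runs once per proposal, i.e. roughly 2000 forward passes per image.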
In other words, the proposed regions are extracted first, and those regions are then classified based on their features. R-CNN is very intuitive, but very slow when high-resolution imagery is used.
Fast R-CNN is an enhanced version of R-CNN that improves detection speed through two key insights:
Feature extraction is performed over the whole image before regions are proposed, so only one CNN is run instead of 2000 CNNs over 2000 regions.
A softmax layer replaces the SVM, extending the neural network itself to make the predictions instead of creating a separate model.
As the figure shows, regions are now proposed from the feature map and not from the original image. In addition, there is one softmax layer which directly outputs the class probabilities. Fast R-CNN performs much better in terms of speed, but one bottleneck remains: the Selective Search algorithm used for proposing regions.
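The second insight, replacing the per-class SVMs with a softmax layer, amounts to normalizing the raw class scores into a probability distribution. A minimal numpy sketch (the three class names are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over class scores."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw scores for one region: [background, cat, dog]
scores = np.array([1.0, 3.0, 0.5])
probs = softmax(scores)              # sums to 1.0
predicted = int(np.argmax(probs))    # index 1 -> "cat"
```

Because the softmax is just another layer, classification is trained jointly with the rest of the network rather than in a separate SVM fitting stage.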
The main insight of Faster R-CNN is to replace the slow Selective Search algorithm with a fast neural network: it introduces the Region Proposal Network (RPN). The RPN works as follows:
A sliding window moves across the feature map of the last layer of an initial CNN and maps each window to a lower-dimensional representation.
For each sliding window position, it generates multiple possible regions based on k fixed-ratio anchor boxes.
Each region proposal consists of an objectness score for the region and four coordinates representing its bounding box.
Fig.3 depicts the working of one sliding window. At each location in the feature map, k different anchor boxes centred on it are considered: a tall box, a wide box, a large box, and so on. For each anchor box, an output is predicted based on the presence of an object.
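Anchor generation is easy to make concrete. The sketch below builds k = 6 anchors (the particular scales and aspect ratios are illustrative assumptions, not the paper's values) centred on one feature-map location; the square-root split keeps each anchor's area equal to scale² while varying its shape:

```python
import numpy as np

def anchors_at(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor boxes centred at
    (cx, cy), each as (x1, y1, x2, y2). Ratio is height / width."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # wide box when r < 1
            h = s * np.sqrt(r)   # tall box when r > 1
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

boxes = anchors_at(100, 100)  # 6 anchors: 2 scales x 3 aspect ratios
```

The RPN then predicts, for each of these k boxes at each location, an objectness score and four bounding-box offsets.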
The RPN outputs bounding box coordinates; it does not try to classify any potential objects. Any anchor box whose objectness score is above a threshold value is passed forward as a region proposal. Once the regions have been extracted, they are fed into what is essentially a Fast R-CNN: a pooling layer, some fully connected layers, and finally a softmax classification layer with bounding box regression. In short, Faster R-CNN = RPN + Fast R-CNN. Fig.4 shows how Faster R-CNN works.
Thus the algorithm achieves better speed and accuracy. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it gives higher accuracy than the other models. A good real-world example is TensorFlow's Faster R-CNN with Inception ResNet, which is their slowest but most accurate model.
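The hand-off from RPN to the Fast R-CNN stage is a simple filter: only anchors whose objectness score clears the threshold survive. A sketch under assumed values (the boxes, scores, and the 0.7 cut-off are all illustrative):

```python
import numpy as np

# Hypothetical RPN output: anchor boxes with objectness scores.
boxes = np.array([[10, 10, 50, 50],
                  [20, 20, 60, 60],
                  [200, 200, 240, 240]], dtype=float)
scores = np.array([0.9, 0.3, 0.8])

threshold = 0.7                  # assumed objectness cut-off
keep = scores > threshold
proposals = boxes[keep]          # only high-scoring anchors go forward
```

In a full implementation, non-maximum suppression would additionally prune overlapping survivors before they reach the classification head.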
Recall that Fast R-CNN improves on the original detection speed by sharing one CNN's computation across all proposed regions. R-FCN (Region-based Fully Convolutional Network) follows the same idea, increasing speed by maximizing shared computation.
When classifying an object, a model should be location invariant: regardless of where the cat appears in the image, we want to classify it as a cat. On the other hand, when localizing the object, location variance is needed: if the cat is in the top left-hand corner, we want to draw a box in the top left-hand corner. As a compromise between location invariance and location variance, R-FCN proposes position-sensitive score maps. Each position-sensitive score map represents one relative position of one object class. R-FCN works as follows:
Run a CNN over the given input image.
Add a fully convolutional layer to generate a score bank of position-sensitive score maps. There should be k²(C+1) score maps, with k² representing the number of relative positions used to divide an object and C+1 representing the number of classes plus the background.
Run a fully convolutional Region Proposal Network (RPN) to generate regions of interest (RoIs).
For each RoI, divide it into the same k² bins (subregions) as the score maps.
For each bin, check the score bank to see whether that bin matches the corresponding position of some object.
Once each of the k² bins has an object-match value for each class, average the bins to get a single score per class.
Classify the RoI with a softmax over the resulting C+1-dimensional vector.
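The last two steps, averaging the k² bins and applying a softmax, can be sketched in a few lines. This is a simplified illustration under the assumption that the position-sensitive pooling has already produced one score per bin per class (the actual pooling from the score maps is omitted for brevity):

```python
import numpy as np

def classify_roi(bin_scores):
    """bin_scores: shape (k*k, C+1) - one pooled score per bin and class
    for a single RoI. Average over bins, then softmax over classes."""
    per_class = bin_scores.mean(axis=0)       # average the k*k bins
    e = np.exp(per_class - per_class.max())   # numerically stable softmax
    return e / e.sum()

k, C = 3, 2                                   # 3x3 grid, 2 classes + background
rng = np.random.default_rng(1)
bin_scores = rng.random((k * k, C + 1))
probs = classify_roi(bin_scores)              # C+1 class probabilities
label = int(np.argmax(probs))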
In this way, R-FCN addresses location variance and invariance simultaneously: every region proposal refers back to the same bank of score maps, and those score maps must learn to classify an object as that object regardless of where it appears. Finally, R-FCN is several times faster than Faster R-CNN and achieves comparable accuracy.
The Single-Shot Detector (SSD), like R-FCN, provides enormous speed gains over Faster R-CNN. As discussed above, the first two models perform region proposal and region classification in two distinct stages: first a region proposal network calculates the regions of interest, then fully connected layers or a CNN classifies those regions. SSD accomplishes both tasks in a single shot, simultaneously predicting the bounding box and the class as it processes the image. Fig.6 shows the architecture of SSD.
SSD works as follows:
Initially, the image is passed through a series of convolutional layers, yielding several sets of feature maps.
For each feature map, a 3×3 convolutional filter generates a small set of default bounding boxes, equivalent to the anchor boxes in Faster R-CNN.
For each box, bounding box offsets and class probabilities are predicted simultaneously.
During training, the ground truth box is matched against the predicted boxes based on IoU (Intersection over Union). The best match is labelled positive, along with all other boxes whose IoU with the ground truth exceeds a threshold of 0.5.
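The IoU matching used in the training step above can be sketched directly. IoU is the overlap area of two boxes divided by the area of their union; the example boxes below are illustrative:

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (10, 10, 50, 50)
predicted = [(12, 12, 52, 52),      # heavy overlap -> positive
             (40, 40, 80, 80),      # slight overlap -> negative
             (100, 100, 140, 140)]  # no overlap -> negative
positives = [box for box in predicted if iou(box, ground_truth) > 0.5]
```

Boxes labelled positive contribute to both the localization and classification losses; the rest are treated as background.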
Thus SSD classifies and draws bounding boxes from every single position in the image, using boxes of multiple shapes and at multiple scales. As a result, it generates a far greater number of bounding boxes than the other models. Finally, it also contains extra feature layers that scale down in size, which helps capture objects of different sizes. In essence, SSD simply skips the preliminary region proposal step and instead considers every single bounding box and location simultaneously while classifying. Because SSD does everything in one shot, it is the fastest of the three models, and still performs quite comparably.
Table.1 shows a comparison of the models.
Model          Full name                                          Remarks
R-CNN          Region-based Convolutional Neural Network          Very slow when high-resolution imagery is used
Fast R-CNN     Fast Region-based Convolutional Neural Network     Performs much better in speed
Faster R-CNN   Faster Region-based Convolutional Neural Network   Slowest but most accurate model
R-FCN          Region-based Fully Convolutional Network           Achieves accuracy comparable to Faster R-CNN
SSD            Single-Shot Detector                               Fastest; performs quite comparably
Table.1 Performance comparison of the models
This paper has surveyed various models for object detection using deep learning and CNNs, and how these models perform relative to one another. Faster R-CNN, R-FCN and SSD are the three models most widely used currently. Other related models likewise rely on deep neural networks for classification and object detection.
 A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
 K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012).
 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
 T.-Y. Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, Springer International Publishing, 2014, pp. 740-755.