Ⅰ. INTRODUCTION
According to Xiao et al. (2015), there are an estimated 600,000 surveillance cameras in Tianjin, China, which together produce around 50 petabytes of video data every day. The high resolution of modern cameras and the large volume of video accumulated over long time spans place heavy pressure on data storage. For this reason, innovation in camera systems is needed if we are to find or detect an object within 50 petabytes of video generated in a single day.
A huge number of CCTV camera systems still require human supervision. Dramatic efficiency gains could be achieved by embedding recent advances in computer vision and artificial intelligence into those camera systems to identify people and support public security tasks such as crime prevention and investigation, accident monitoring, protection of people, and guarding of public property.
Parking management is needed across a wide variety of industries, including universities, entertainment venues, hospitals, airports, convention centers, and public office buildings. Many systems (Hasegawa et al., 1994; Chen et al., 2010; Taghyaeeyan and Rajamani, 2014) therefore exist to tackle the counting problem in such large public venues, yet introducing a new management system or upgrading an existing one can incur significant expenses in servers, sensors, and network infrastructure. Additionally, the traditional sensors used to collect the data tend to be excessively costly, unreliable, or demanding of extensive design work.
By utilizing modern deep CNNs, we can rely entirely on the images captured by the camera. This offers several benefits. First, expensive sensors can be eliminated from the system. Second, removing sensors also removes the complexity of integrating different data sources and of managing resource sharing, computational power, battery life, and weight. Finally, relying solely on the camera provides a great deal of installation flexibility: installing a new camera counting system, moving a counting zone, or repointing the camera to define a new zone is roughly as simple as adjusting the software to suit the current camera view, whereas traditional in-ground sensors or beam-break devices generally have to be uninstalled and then reinstalled at the new location. Deep CNNs are thus an effective and efficient way to provide an accurate, cost- and energy-saving counting solution.
For this reason, in this study we use deep CNNs together with our proposed models to detect cars, trucks, and pedestrians, so that the models can be embedded in a surveillance camera monitoring a parking lot.
Ⅱ. RELATED WORK
1. Traditional Approach
Traditionally, machine learning researchers approached image classification and detection tasks by extracting features with global descriptors such as Local Binary Patterns (LBP) (Pietikäinen, 2010), Histograms of Oriented Gradients (HOG) (Dalal and Triggs, 2005), and color histograms, as well as local descriptors such as the Scale-Invariant Feature Transform (SIFT) (Lindeberg, 2012), Speeded-Up Robust Features (SURF) (Bay et al., 2006), and Oriented FAST and Rotated BRIEF (ORB) (Rublee et al., 2011). However, these are hand-crafted features that require domain expertise to design. Furthermore, the high variability in natural images, such as scale, illumination, rotation, deformation, occlusion, and viewpoint, remains an obstacle, motivating research into new algorithms that can outperform these traditional approaches.
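As a brief illustration of this hand-crafted pipeline, the sketch below extracts ORB keypoints and descriptors with OpenCV; the image path is a placeholder, and a separate classifier (for example, an SVM over a bag-of-visual-words encoding) would then be trained on the descriptors:

# Hand-crafted feature extraction with ORB (OpenCV); the image path is a placeholder.
import cv2

img = cv2.imread("parking_frame.jpg", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=500)  # FAST keypoints + rotated BRIEF descriptors
keypoints, descriptors = orb.detectAndCompute(img, None)
# Each descriptor is a 32-byte binary vector; a downstream classifier,
# not the feature extractor, does the actual recognition.
print(len(keypoints), None if descriptors is None else descriptors.shape)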
2. Deep CNN-based Object Detection
In recent years, deep learning has become best known for its ability to learn from data and has been applied to complex problems. Notably, deep convolutional neural networks (CNNs) have made tremendous progress in large-scale object recognition (He et al., 2016; Krizhevsky et al., 2012; Szegedy et al., 2015) and in detection problems (Ren et al., 2015; Liu et al., 2016; Redmon and Farhadi, 2017).
In the endeavor to build fully autonomous vehicles, many researchers have applied deep CNNs to extract information about the road and to understand the environment surrounding the vehicle, ranging from detecting pedestrians (Angelova et al., 2015), cars (Zhou et al., 2016), and bicycles to detecting road signs (John et al., 2014) and obstacles (Hadsell et al., 2009).
CNNs have also been applied to the problem of counting objects in images. Onoro-Rubio and López-Sastre (2016) proposed a convolutional neural network called the Counting CNN (CCNN) to estimate the number of vehicles in traffic congestion and to count people in very crowded scenes. The CCNN works as a regression model that learns to map the appearance of image patches to their corresponding object density maps. Zhang et al. (2015) also proposed a CNN architecture that predicts density maps for crowd counting via a switchable learning process.
Ⅲ. METHODOLOGY
1. YOLO Object Detection
YOLO, short for You Only Look Once (Redmon et al., 2016), is an object detector focused on real-time processing. It takes a different approach from networks that use region proposals or sliding windows: it re-frames object detection as a single regression problem. YOLO looks at the input image just once and divides it into an S × S grid of cells. Each grid cell predicts B bounding boxes together with a confidence score that reflects both how likely it is that the predicted box contains an object and how accurate the box is:

$\text{Confidence} = \Pr(\text{Object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$ (1)

where $\mathrm{IOU}_{\text{pred}}^{\text{truth}}$ denotes the intersection over union between the predicted box and the ground truth. Each cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. Multiplying the confidence score by the class prediction yields one final score: the probability that the bounding box contains a specific type of object.
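For concreteness, the following is a minimal sketch of the IOU computation for two boxes given as (x1, y1, x2, y2) corner coordinates (the function name and example boxes are ours):

# Minimal IOU computation between two boxes in (x1, y1, x2, y2) corner format.
def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Example: two partially overlapping boxes.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.143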
2. YOLOv2 Architecture (S1)
YOLOv2 (Redmon and Farhadi, 2017) is the second version of YOLO, with significant improvements in accuracy and speed. We used two different model architectures in our training. Our first experiment was based on the Darknet architecture of YOLOv2 <see Table 1>. YOLOv2 has 31 layers in total, of which 23 are convolutional layers, each with a batch normalization layer before the leaky ReLU activation, and a max-pooling layer after the 1st, 3rd, 7th, 11th, and 17th layers. To train the network on our own dataset, we first reinitialize the final convolutional layer so that it outputs a tensor of shape 13 × 13 × 30, where 30 = 5 bounding boxes × (4 coordinates + 1 confidence value + 1 class probability).
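The 30-channel depth follows directly from this layout; a quick sketch of the arithmetic (variable names are ours):

# Depth of the final detection layer's output (names are illustrative).
num_anchors = 5
num_coords = 4    # x, y, width, height offsets
num_conf = 1      # objectness confidence
num_classes = 1   # one class probability, matching the layout above
depth = num_anchors * (num_coords + num_conf + num_classes)
print(depth)      # 30, i.e. a 13 × 13 × 30 output tensor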
3. Our Proposed Architecture (S2)
To reduce the computational cost and model size of the network, we propose a second model (the S2 architecture), shown in <Table 2>. S2 is a modified version of the S1 architecture: it requires just 18 million parameters across 27 layers, whereas S1 needs roughly 48 million. The final results show that the average precision and recall of the S2 architecture are similar to those of S1, while S2 is smaller and computationally faster.
1) Anchor Box Model
Generally, in object detection techniques, only one object can be detected in each grid cell. A problem arises when more than one object falls within a single cell. The solution is the idea of anchor boxes: instead of predicting a single vector of length 5 + num_of_classes per cell, the network predicts (5 + num_of_classes) × num_of_anchor_boxes values. Each anchor box is designed to detect objects of a different aspect ratio and size; for example, box 1 might detect large, tall objects, whereas box 2 detects small, square objects, and so on.
YOLOv2 divides the entire image into a 13 × 13 grid and places 5 anchor boxes in each cell. Bounding box and class predictions are then made for each anchor box. The appropriate anchor is the one with the highest intersection over union (IOU) with the ground-truth box. A useful feature of YOLOv2 is that the anchor boxes can be tailored to the dataset being trained on. The anchors originally provided by YOLOv2 were derived for general objects in the Visual Object Classes (VOC) dataset, not for the shapes and sizes of "car", "truck", or "pedestrian". For this reason, we ran k-means clustering on our training set to generate five anchor boxes tailored to our dataset, as outlined below.
Input: K, the number of cluster centroids, and the training set {x(1), ..., x(m)}
Randomly initialize the K cluster centroids
repeat until convergence {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
        μk := average (mean) of the points assigned to cluster k
}
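As a runnable sketch of this step, the following Python implementation clusters the (width, height) of the ground-truth boxes using d = 1 - IOU as the distance, the metric YOLOv2 uses for anchor generation (function and variable names are ours):

# k-means over (width, height) pairs with 1 - IOU as the distance,
# as used to derive YOLOv2-style anchor boxes. Names are illustrative.
import numpy as np

def wh_iou(boxes, centroids):
    # IOU between (width, height) pairs, treating boxes as co-centered.
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] \
            + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k=5, iters=100, seed=0):
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.RandomState(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Nearest centroid = highest IOU (distance d = 1 - IOU).
        assign = np.argmax(wh_iou(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

# boxes would hold the (width, height) of every ground-truth box in the training set.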
Although choosing a larger value of K (more centroids) gives a higher average IOU, it slows down the model, since more detectors are needed in each grid cell. YOLOv2 chooses 5 anchors as a good trade-off between recall and model complexity.
2) Denser Grid Model
The dataset we used comes at a high resolution of 1920 × 1200. The YOLOv2 network operates at a resolution of 416 × 416; after its convolutional layers downsample the image by a factor of 32, the output feature map is a 13 × 13 grid <see Fig. 2>. Unlike the square input expected by the inspiring model (S1), our dataset images are large and wide. Because the full resolution is very high, we chose an input resolution of 960 × 608, roughly half the original, which produces a 30 × 19 grid after downsampling (960 / 32 = 30 and 608 / 32 = 19) <see Fig. 3>.
3) S2-Anchor and Den-S2-Anchor
We made a few modifications to S1 to build the S2 architecture:
∙The 23rd and 24th layers of S1 consume more than 18 million parameters, and layer 29 alone requires more than 11 million; removing these three convolutional layers saves about 30 million parameters.
∙We give S2's 23rd layer a depth of 2048 by modifying S1's 26th convolutional layer from 64 filters to 256 filters.
∙The reorganized 25th layer has a depth of 1024, obtained by reorganizing the 24th layer's output from 26 × 26 × 256 to 13 × 13 × 1024 (see the reorg sketch after this list).
∙Route layer 26 concatenates layer 25 (13 × 13 × 1024) with the output of the 22nd convolutional layer (13 × 13 × 1024), producing a 13 × 13 × 2048 tensor.
∙We developed two variants of S2. The first, S2-Anchor, uses the Anchor Box Model: we modified the widths and heights of S1's anchor boxes. The second, Den-S2-Anchor, applies the Denser Grid Model on top of S2-Anchor.
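The reorganization in the third bullet is a space-to-depth transform; a minimal numpy sketch under a stride-2 assumption (Darknet's reorg may order elements slightly differently, but the shape change is the same):

# Space-to-depth ("reorg") with stride 2: 26 × 26 × 256 -> 13 × 13 × 1024.
import numpy as np

def reorg(x, stride=2):
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each 2 × 2 spatial patch together
    return x.reshape(h // stride, w // stride, c * stride * stride)

print(reorg(np.zeros((26, 26, 256))).shape)  # (13, 13, 1024)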
Ⅳ. EXPERIMENTS
1. Data Collection and Processing
We use the annotated driving dataset provided by Udacity's self-driving car project (Udacity, 2018), which consists of driving in Mountain View, California, and neighboring cities during daylight conditions. It contains over 65,000 labels of cars, trucks, and pedestrians across 9,423 frames collected from Point Grey research cameras running at a full resolution of 1920 × 1200. For our study, we used 7,423 images for the training set and 1,000 images for the testing set. <Table 3, 4> summarize our training and testing sets.
2. Training
Our implementation of the S1, S2-Anchor, and Den-S2-Anchor architectures is based on an open-source YOLO framework called Darkflow (Thtrieu, 2016). We trained all three models from weights pre-trained on PASCAL VOC and/or Microsoft Common Objects in Context (MS COCO): because these weights were trained for general-purpose object detection, the network already knows how to extract low- and mid-level features such as corners, edges, colors, and shapes in a hierarchical fashion.
Training was performed on a GeForce GTX 1080 with 10 GB of RAM. In all training sessions we used the Adam optimizer because of its tendency toward fast convergence. <Table 5>
Training the S1 model started with a learning rate of 1e-5 to reduce the loss quickly. After training for 30 epochs, we validated the model on test images it had never seen; performance was poor, with a number of false-positive and false-negative bounding boxes. We therefore lowered the learning rate to 1e-6 for another 30 epochs to ensure finer-grained updates and proper convergence.
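As an illustration, a training run of this kind can be launched through Darkflow's Python interface; the following is a minimal sketch, where the cfg file name and data paths are placeholders rather than our actual files:

# Minimal sketch of launching training through Darkflow's Python API.
# The cfg name and data paths are placeholders, not our actual files.
from darkflow.net.build import TFNet

options = {
    "model": "cfg/s1-udacity.cfg",   # hypothetical cfg for our 3-class setup
    "load": "bin/yolov2.weights",    # pre-trained weights to start from
    "train": True,
    "annotation": "data/annotations/",
    "dataset": "data/images/",
    "gpu": 1.0,
    "lr": 1e-5,                      # lowered to 1e-6 for the second 30 epochs
    "epoch": 30,
}
TFNet(options).train()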
Like S1, we trained the S2-Anchor and Den-S2-Anchor models at the same learning rates of 1e-5 and 1e-6, for 30 epochs each. Based on the validation results, we trained Den-S2-Anchor for another 30 epochs at the 1e-6 learning rate; however, its performance got slightly worse from overfitting, so we reverted to the previous checkpoint. Training all three models took about four days. <Fig. 4-6> show the training loss of the three models.
3. Parking Lot Environment Simulation
The parking lot environment <see Fig. 7> was built using Lego bricks, motors, and Lego Mindstorms EV3 to power the conveyor belt. We chose Lego Mindstorms EV3 because it makes building this kind of simulation environment simple, fast, and flexible, and programming and commanding it is relatively efficient (Valk, 2014).
For the detection mechanism, we used ArUco markers to verify whether an object (car, truck, or pedestrian) had entered the counting area. ArUco (OpenCV, 2018) is a library for camera pose estimation using square markers, each composed of a wide black border and an inner binary matrix that determines its identifier (id). ArUco is extremely fast, as the black border facilitates quick detection in the image. Since cars and trucks are wide, we used two different ArUco markers mounted on two Lego poles placed directly in front of the camera, with the conveyor belt carrying the objects running between them.
The camera was programmed to detect the ArUco markers in real time: whenever an object fully blocks both markers, the model embedded in the camera fires to detect the object, and the count is updated if the object is a car, truck, or pedestrian.
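A minimal sketch of this marker-occlusion check, using the classic cv2.aruco API from opencv-contrib (the marker ids, dictionary, and camera index are illustrative choices):

# Watch the two ArUco markers; when both disappear, an object is blocking
# the counting zone, and the embedded detector should run on that frame.
import cv2
import cv2.aruco as aruco

aruco_dict = aruco.Dictionary_get(aruco.DICT_4X4_50)
params = aruco.DetectorParameters_create()
cap = cv2.VideoCapture(0)  # camera index is illustrative

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = aruco.detectMarkers(gray, aruco_dict, parameters=params)
    visible = set(ids.flatten()) if ids is not None else set()
    if not ({0, 1} & visible):  # both markers occluded
        pass                    # run the object detector and update the count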
However, to guarantee an accurate count of available parking spaces, the counting system is programmed to count only those vehicles that could potentially take a parking spot (cars, trucks) and to ignore objects that may pass through the counting area but cannot occupy a parking space (people, animals, bicycles, motorcycles, and so on).
Ⅴ. RESULTS
We evaluated the performance of the networks using precision and recall scores, computed with the following formulas:

$\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i}$ (2)

$\text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FN_i}$ (3)
Here TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and n represents the total number of testing images. In the testing phase we evaluated each model against 1,000 images and computed the average precision for all models.
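A small sketch of this evaluation, assuming the per-image TP, FP, and FN counts have already been tallied into lists (names are ours):

# Average precision and recall over the n test images, per Eqs. (2) and (3).
# tp, fp, fn are lists of per-image counts (illustrative names).
def avg_precision_recall(tp, fp, fn):
    n = len(tp)
    precision = sum(t / (t + f) if (t + f) else 0.0 for t, f in zip(tp, fp)) / n
    recall = sum(t / (t + m) if (t + m) else 0.0 for t, m in zip(tp, fn)) / n
    return precision, recall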
Based on the results table, the Den-S2-Anchor model produced the highest mean average precision across all three object classes. Although S1 (YOLOv2) is a promising model, it does not perform well on real-life, high-resolution images. S2-Anchor, on the other hand, obtained a mean average precision similar to the inspiring model (S1) with far fewer parameters. S2-Anchor processed images the fastest at 32 FPS (about 0.031 s per image); S1 came second at 23 FPS (about 0.043 s), while Den-S2-Anchor came last at 21 FPS (about 0.048 s).
These results show that, despite a small compromise on speed, our proposed Den-S2-Anchor model achieves better performance with a large reduction in computational complexity and model size. <see Fig. 8-9>
There are a couple of things to note about these results. First, even though S2-Anchor has far fewer parameters and layers than S1, it still produces comparable precision. This is due to the use of pre-trained weights in the training phase, which let the models learn the low- and mid-level features well. Second, "car" achieves higher accuracy than "truck" and "pedestrian". The reason is that our dataset has imbalanced object instances: cars, trucks, and pedestrians account for 87.5%, 5.8%, and 6.7% of the object instances in the training set, respectively, and a similar pattern holds in the testing set (78.7% cars, 4% trucks, and 17.3% pedestrians). This imbalance likely biases the system toward higher accuracy on the "car" class.
Ⅵ. CONCLUSION AND FUTURE WORK
We presented a deep-learning architecture designed to detect cars, trucks, and pedestrians for use in a real-life application, and we built a simulated parking lot environment for the experiments. The proposed models shrink the number of parameters by a large margin, with a notable gain in performance and speed. We demonstrated that deep-learning-based vehicle counting is accurate, intelligent, and easier to deploy than traditional sensors. In spite of this study's limitations, our research could provide useful groundwork for the further development of parking lot management.
In terms of future work, we would first address the imbalance of object instances across classes in the dataset. Second, we could train the system to count vehicles by other attributes, such as vehicle type and brand, if such a dataset becomes available. We could also improve our model with other architectural approaches and apply it to other vehicle datasets with various levels of real-life difficulty. Finally, the same camera used to collect the counting data could also serve as a video feed to the security team to enhance the overall parking lot management system.