Building a high-quality annotated image library to improve object detection
CCTV cameras in cities are used for community safety and crime prevention applications, but much of the time these cameras are not in active use.
We are interested in harnessing the redundancy within this ready-made sensor network so that decision makers can better understand how people are using public spaces and can better manage resource constraints.
In this long-read blog, Luis Serra and Maralbek Zeinullin explain how and why we are using annotated images to train computer vision models to be deployed on CCTV cameras.
What is a computer vision model?
Computer vision models are algorithms developed to automate tasks typically associated with the human visual system – including the detection and localisation of objects of interest in images. To be successful, such algorithms first need to learn which objects to search for.
In supervised machine learning (ML), we show a model the correct answers to a given problem by labelling the objects of interest within the dataset. In this way, algorithms learn the measurable characteristics associated with those objects (such as shape and colour), equipping them to recognise other examples they are presented with in future. Another important factor in a model’s ability to learn these patterns is the amount of training data: usually, the more labelled examples used to train a model the better, but ultimately it depends on the complexity of the project in question.
The most common type of object annotation in images is the bounding box: a rectangle drawn tightly around an object of interest. Typically, the task of annotating objects in an image is performed by humans (known as “annotators” or “labellers”) who visually inspect the image, recognise the objects and draw a box around each.

Figure 1: Example of annotated objects in a picture. Notice the different colours of boxes used for different types of objects. Photo by Wladislav Glad on Unsplash.
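In data terms, each bounding box is simply four numbers locating the rectangle within the image, plus a class label. Here is a minimal, hypothetical sketch of what one annotated frame might look like (the field names and file name are purely illustrative; the exact structure depends on the annotation tool’s export format):

```python
# One annotated frame, sketched as a Python dict. "bbox" follows the common
# (x, y, width, height) convention in pixels, measured from the top-left
# corner of the image. All names and values are illustrative.
annotated_frame = {
    "image": "frame_000123.jpg",
    "objects": [
        {"label": "Pedestrian", "bbox": [412, 188, 34, 92]},
        {"label": "Cyclist", "bbox": [640, 205, 58, 110]},
    ],
}
```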
Why develop a new computer vision model to be deployed on CCTV cameras?
Widely available off-the-shelf object detection models have not been trained with CCTV imagery. Instead, their training datasets typically consist of photographs captured by pedestrians, with a lower point of view and inconsistent image quality. As a result, when faced with relatively unfamiliar CCTV imagery, these models perform less effectively.
Furthermore, the off-the-shelf models that we have evaluated are usually incapable of detecting cyclists, one of the objects of interest for this project. While some may be able to detect bicycles, it is difficult to distinguish stationary parked bikes from those being ridden within static imagery.
Also, a reliable system that can monitor city activity 24/7, 365 days of the year, provides a greater breadth of data than the conventional (and costly) periodic counts of city activity performed by humans (e.g., cyclist cordon counts).
Our project objectives
For this type of deployment on CCTV cameras, it is widely accepted among ML practitioners that models tend to achieve greater accuracy when trained with CCTV-like images (e.g., similar camera height, direction and orientation, and similar image properties).
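As an illustration of how such training could work, here is a minimal sketch of fine-tuning a pre-trained detector on CCTV-like frames with torchvision’s Faster R-CNN. This is a common transfer-learning recipe, sketched under our own assumptions; it is not necessarily the architecture or setup the project used:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on COCO-style photographs...
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# ...and replace its classification head for our 9 classes (+1 background).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=10)

# One illustrative training step on a dummy 1280x720 CCTV-resolution frame.
model.train()
images = [torch.rand(3, 720, 1280)]
targets = [{
    "boxes": torch.tensor([[100.0, 200.0, 150.0, 300.0]]),  # xmin, ymin, xmax, ymax
    "labels": torch.tensor([1]),  # e.g., class index 1 = Pedestrian
}]
losses = model(images, targets)  # in train mode, returns a dict of losses
sum(losses.values()).backward()
```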
The objectives of our project required us to annotate persons, cyclists, and vehicles on at least 10,000 CCTV-like images which we collected from the city centres of Glasgow, Newcastle, Manchester, and Sheffield.
Capturing imagery from several cities provides a broad spectrum of environments, which facilitates the development of more generalisable computer vision models, especially across cities in the UK. A rule of thumb when collecting ML data is to gather data covering the entire range of inputs for which one aims to develop the predictive model. Models trained solely on Glasgow imagery would probably be biased towards the specificities of Glasgow.
A key challenge for this project was the difficulty in finding a persistent and diverse set of CCTV-like imagery to train a model, as a result of (among other things) privacy concerns.
Conventions for annotated objects
In addition to various forms of motorised vehicles, UBDC’s annotation project put a special emphasis on active forms of travel, e.g., walking and cycling. Consequently, we decided to annotate the following classes of objects: Bus, Car, Cyclist, Crowd, Lorry, Motorcycle, Pedestrian, Taxi and Van.
You might assume that, although time-consuming, image annotation would be a straightforward and routine process. Indeed, that is often the case. However, early in the project the team faced several situations where the choice of how or where to apply labels was not clear or involved an element of subjective interpretation. How should persons who are not pedestrians be labelled? How should a motorbike rider or pillion passenger be handled? This could be particularly problematic where labelling was being done by several individuals, who may each have their own assumptions about how to approach such situations, and would therefore introduce inconsistency into the training dataset. To resolve these uncertainties, the team agreed on a set of conventions to be applied uniformly. For example, a person pushing a bike by hand is still considered a cyclist and should be labelled as such.
The annotation process
The annotation work started on the 5th of January 2022 and ended on the 31st of March 2022.
The videos for the annotations were collected by CTS Traffic and Transportation Ltd between 1 February and 19 March 2022 in the four cities mentioned above. At each location, the company collected 14 consecutive hours of video, between 6am and 8pm. The videos were produced in colour, with a spatial resolution of 1280 x 720 pixels and a frame rate of 25 frames per second. At every site, a CCTV-like camera was mounted on top of a pole between three and three and a half metres tall.
Prior to the CTS collection, the project labelled images from the Glasgow cordon counts, collected in September 2021, since the CTS imagery was not available at the start of the annotation work. The cordon counts videos were produced in colour and at the same spatial resolution as the CTS videos, although at a lower quality due to the way they were captured.
Annotations are made on still images rather than videos, so before annotation we first had to extract frames from the videos. To ensure a suitable diversity of imagery and objects, frames were randomly selected across varying hours of the day and weather conditions.
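For instance, random frames can be drawn from a video with a few lines of OpenCV. A minimal sketch (file names and the sample size are illustrative):

```python
import random
import cv2  # OpenCV

def sample_frames(video_path: str, n_frames: int, out_dir: str) -> None:
    """Save n_frames randomly chosen still images from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample frame indices without replacement so no frame is extracted twice.
    for idx in sorted(random.sample(range(total), n_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
    cap.release()

# At 25 fps, a 14-hour recording holds 25 * 3600 * 14 = 1,260,000 frames,
# from which only a small random sample is kept for annotation.
sample_frames("site01.mp4", n_frames=100, out_dir="frames")
```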
In ML, images are annotated using specialised annotation platforms. After comparing several of the available tools, the technical team decided to use CVAT, since it is open-source and provides a convenient set of features for organising, annotating and reviewing images.
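One convenience of CVAT is that its exports are easy to work with downstream. For example, its XML-based “CVAT for images” export can be parsed in a few lines; a minimal sketch (attribute names follow CVAT’s documented format, but are worth checking against your CVAT version):

```python
import xml.etree.ElementTree as ET

# Walk a CVAT "for images" XML export and print each bounding box.
root = ET.parse("annotations.xml").getroot()
for image in root.iter("image"):
    print(image.get("name"))
    for box in image.iter("box"):
        label = box.get("label")  # e.g., "Pedestrian", "Cyclist"
        xtl, ytl = float(box.get("xtl")), float(box.get("ytl"))  # top-left
        xbr, ybr = float(box.get("xbr")), float(box.get("ybr"))  # bottom-right
        print(f"  {label}: ({xtl}, {ytl}) -> ({xbr}, {ybr})")
```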
The annotation team was composed of five annotators and two reviewers. The five annotators were all MSc students (thanks to Mingkang Wang, Zoe Watts, Miles Peterson, Aliza Aijaz and Amy Russel), while the two reviewers were part of the UBDC Data Science Team. Every annotation was double-checked by a reviewer to minimise the amount of mislabelled data.
In addition to image representativeness and labelling consistency, we also put great effort into labelling all objects of interest within an image (completeness) and into tightly enclosing the entirety of each labelled object (positional accuracy), thus achieving a high quality standard for the annotated dataset.
Finally, we established procedures to minimise errors and ensure consistency among annotators, such as enforcing regular breaks.
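Positional accuracy in particular lends itself to a simple quantitative check: the intersection over union (IoU) between an annotator’s box and a reviewer’s corrected box approaches 1 as the two agree. A minimal sketch of the calculation (an illustration of the idea, not a description of our review tooling):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A perfectly tight box scores 1.0; a box that is slightly off scores lower.
print(iou((10, 10, 50, 80), (12, 10, 52, 82)))  # ~0.88
```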
Results
The project labelled 10,446 images with 99,246 unique objects from the CTS dataset and 10,313 images with 40,628 unique objects from the cordon counts dataset. Overall, the project labelled 139,874 unique objects.
Object | CTS dataset | Cordon counts dataset |
---|---|---|
Bus | 3,241 | 1,684 |
Car | 9,697 | 20,202 |
Cyclist | 1,937 | 936 |
Crowd | 645 | 1 |
Lorry | 318 | 582 |
Motorcycle | 52 | 183 |
Pedestrian | 81,336 | 11,391 |
Taxi | 387 | 399 |
Van | 1,633 | 5,250 |
Total | 99,246 | 40,628 |
For the CTS dataset, the label “Pedestrian” comprises almost 82% of the total number of labelled objects, as shown in the pie chart below. Despite a great effort to choose city locations and hours of the day where cyclists were likely to appear, the “Cyclist” label accounts for less than 2% of the total number of labels.

Figure 2: Proportion of labelled classes in the CTS dataset.
In the case of the cordon counts dataset, the most labelled class of object was “Car”, accounting for almost 50% of the total number of labels. Still, cyclists contribute roughly the same share as in the CTS dataset, at 2.3%.

Figure 3: Proportion of labelled classes in the cordon counts dataset.
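The percentages quoted above can be reproduced directly from the table; for instance:

```python
# Class counts copied from the table above.
cts = {"Bus": 3241, "Car": 9697, "Cyclist": 1937, "Crowd": 645, "Lorry": 318,
       "Motorcycle": 52, "Pedestrian": 81336, "Taxi": 387, "Van": 1633}
cordon = {"Bus": 1684, "Car": 20202, "Cyclist": 936, "Crowd": 1, "Lorry": 582,
          "Motorcycle": 183, "Pedestrian": 11391, "Taxi": 399, "Van": 5250}

for name, counts in [("CTS", cts), ("Cordon", cordon)]:
    total = sum(counts.values())
    for label, n in counts.items():
        print(f"{name} {label}: {100 * n / total:.1f}%")
# e.g., CTS Pedestrian: 82.0%; CTS Cyclist: 2.0%;
#       Cordon Car: 49.7%; Cordon Cyclist: 2.3%
```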
An invaluable resource
Collecting and annotating a diverse range of images that represent most city environments in the UK, while minimising the likelihood of mislabelled data, has been a huge effort.
In addition to using the annotated images within its own CCTV automation projects, UBDC will make them available to other academic developers wanting to develop or validate their own models.
We hope the labelled dataset we have produced will prove an invaluable resource for developing better models for CCTV-like cameras.