Counting Dense Crowds with Computer Vision AI Models

CityData has studied, analyzed, and evaluated the universe of computer vision AI models so you don't have to!

1. Executive Summary

The accurate and automated estimation of crowd sizes, particularly in scenarios involving thousands of individuals captured from aerial platforms such as drones, helicopters, or satellites, presents a formidable challenge in computer vision. Key difficulties include extreme occlusion of individuals, vast variations in scale due to perspective and altitude, the minuscule size of individual objects in high-altitude imagery, and complex environmental factors like diverse viewpoints, background clutter, and fluctuating illumination. Traditional object detection methods, which aim to delineate each person with a bounding box, are inherently ill-suited for such dense and occluded environments, often leading to significant inaccuracies.

The most effective paradigm that has emerged to address these limitations is density map regression. This approach involves training deep learning models to predict a continuous density map, where pixel intensity corresponds to the concentration of individuals. The total crowd count is then derived by integrating the values across this map. A significant advantage of density maps is their ability to provide not only the total count but also crucial spatial distribution information, which is invaluable for applications such as public safety management, urban planning, and real-time surveillance.

For accurately counting very dense crowds from drone or helicopter snapshot pictures, the most effective computer vision models are predominantly sophisticated deep Convolutional Neural Networks (CNNs) specifically engineered to handle multi-scale features and integrate contextual information. The leading methodologies include:

  • Density Map Estimation: The overarching paradigm that converts point annotations into continuous density maps, allowing for counting by summing pixel values.
  • Multi-Column CNNs (MCNN): Pioneering architectures that use parallel networks with different receptive fields to handle scale variations.
  • Single-Column CNNs (CSRNet): Streamlined networks employing dilated convolutions to capture large receptive fields while preserving high resolution, excelling in highly congested scenes.
  • YOLO-based Frameworks (e.g., YOLOv8 with Context Enrichment Module - CEM): Object detection models adapted for crowd counting in aerial images, enhanced to detect tiny targets and integrate multi-scale context.
  • Point-based Networks (e.g., P2PNet & CrowdSat-Net): Models that directly predict individual head points, offering precise localization alongside counting, particularly effective for satellite imagery (CrowdSat-Net) or general dense crowds (P2PNet).

Model performance is rigorously evaluated using standard metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE), which quantify counting accuracy and model robustness, respectively. Other metrics like F1-score, Precision, and Recall are also employed for specific tasks like crowd detection. Evaluations are conducted on highly challenging benchmark datasets tailored for aerial and dense crowd scenarios, including NWPU-Crowd (distinguished by its expansive density range, 0-20,033 individuals), UCF-QNRF (known for its large scale and high resolution), VisDrone-CC (comprising drone-captured images), and CrowdSat (featuring VFR satellite imagery). Quantitative results vary across datasets and models, reflecting their specialized strengths. For instance, FPNCC achieved an MAE of 11.66 on the VisDrone-CC2020 dataset, while CrowdSat-Net demonstrated an F1-score of 66.12% and Precision of 73.23% on the CrowdSat dataset. These figures underscore the robust performance of these models in handling the complexities of dense aerial crowd images.
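For reference, both headline metrics are computed over per-image counts. A minimal sketch in Python (note that many crowd-counting papers report the square root of the mean squared error under the name "MSE"):

```python
import numpy as np

def counting_metrics(pred_counts, gt_counts):
    """MAE and MSE over a test set of per-image crowd counts.

    Following the convention in much of the crowd-counting literature,
    'MSE' here is the root of the mean squared error.
    """
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse
```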

2. Introduction to Dense Crowd Counting from Aerial Imagery

The Critical Need for Automated Crowd Counting

The accurate and automated estimation of crowd sizes in unconstrained environments is a paramount capability with wide-ranging applications. This includes enhancing public safety by identifying potential overcrowding risks, optimizing video surveillance systems, informing urban planning decisions, facilitating disaster response and emergency management, and streamlining large-scale event organization. By providing precise counts and distributions, these systems enable proactive crowd management, significantly reducing the likelihood of accidents such as crowd crushes and security incidents. Furthermore, the analysis of historical crowd patterns offers invaluable information for long-term urban development and infrastructure optimization.

Unique Challenges of Counting Dense Crowds from Drones/Helicopters/Satellites

Counting dense crowds from aerial platforms presents several unique and formidable challenges that distinguish it from traditional object counting tasks.

  • Extreme Occlusions: A primary hurdle in dense crowd counting is the severe occlusion of individuals. In highly congested scenes, people are often so tightly packed that their bodies and even heads overlap significantly, making it exceedingly difficult for algorithms to distinguish and delineate individual entities.
  • Scale Variation: Aerial imagery inherently captures subjects at vastly different distances from the camera. This results in extreme variations in object size within a single image, with individuals appearing as anything from a few pixels to hundreds of pixels in diameter. This dynamic range poses a significant challenge for models, particularly for traditional detection-based methods.
  • Small Objects: From an elevated aerial perspective, individuals often appear as remarkably tiny blobs. In extremely dense areas, a person's head might occupy as little as a single pixel. This extreme miniaturization complicates both the precise localization and subsequent tracking of individuals.
  • Wide Viewpoints and Dynamic Camera Perspective: Unlike fixed surveillance cameras, drone and helicopter platforms introduce wide viewpoint variations and dynamic camera movements. These changing perspectives distort object appearances and densities in unpredictable ways, demanding highly adaptable counting algorithms.
  • Background Clutter and Illumination Variation: Aerial scenes are prone to complex backgrounds that may contain objects structurally similar to human heads (e.g., rocks, foliage, patterns on the ground), leading to false positives. Additionally, varying light conditions—from bright sunny days to cloudy or nighttime scenes—significantly alter image characteristics, further complicating accurate detection and counting.
  • Annotation Burden: The manual annotation of bounding boxes for thousands of individuals in dense aerial images is an incredibly labor-intensive and time-consuming process. This practical constraint makes point annotations, where only the center of a head is marked, and subsequent density regression, a far more practical and scalable approach for ground truth generation.

Evolution from Detection-Based to Density Map Regression

Early attempts at crowd counting primarily employed object detection frameworks. These methods typically utilized sliding windows to identify visible body parts, such as heads or shoulders, and then aggregated these detections to estimate the crowd size.

The fundamental objective of traditional object detection is to precisely delineate each object with a bounding box. However, in the context of extremely dense crowds, this approach encounters severe limitations. Firstly, individual objects become minuscule, heavily occluded, and often visually indistinguishable from one another. This makes the accurate placement of distinct bounding boxes around each person practically impossible. Secondly, the sheer volume of individuals—potentially thousands in a single image—imposes an overwhelming annotation burden for generating bounding box ground truths. These critical failure modes of detection-based methods, including poor scalability, difficulty in learning consistent features, and resolution issues for tiny blobs, directly necessitated a paradigm shift. The inability to reliably pinpoint individual persons led researchers to pivot towards density map regression, which treats the crowd as a continuous "regional feature" rather than a collection of discrete, detectable objects. This is a clear cause-and-effect relationship: the inherent shortcomings of one computer vision approach directly spurred the innovation and adoption of a more suitable alternative for dense crowd scenarios.

Density map regression was developed to overcome these limitations. This method involves converting sparse point annotations (typically marking the center of a person's head) into a continuous density map through convolution with a Gaussian kernel. The sum of pixel values within this generated density map directly yields the total crowd count, while simultaneously providing valuable information about the spatial distribution of the crowd. This approach effectively bypasses the need for precise individual detection and bounding box annotation in highly occluded and dense scenes.

3. Core Methodologies for Aerial Crowd Counting

Density Map Estimation: The Dominant Paradigm

At its core, density map estimation for crowd counting operates on the principle of transforming discrete point annotations—manually marked center points of individuals' heads—into a continuous, smooth representation of crowd density. This transformation is typically achieved by convolving each point annotation with a Gaussian kernel. The resulting density map is a grayscale image where pixel intensity directly correlates with the density of people in that region. The total crowd count for an image is then derived by simply summing the pixel values across the entire density map.

This methodology offers distinct advantages, particularly in the context of dense and highly occluded crowds, which are common in aerial imagery. Firstly, it elegantly circumvents the fundamental failure mode of traditional object detection methods in such conditions, where individual objects are too small or occluded to be accurately bounded. Secondly, beyond merely providing a numerical count, the density map inherently preserves and conveys the spatial distribution of the crowd. This distribution information is critical for diverse applications, such as identifying high-risk congestion points, analyzing crowd flow patterns, or even monitoring adherence to social distancing guidelines. The standard deviation (σ) of the Gaussian kernel is a crucial parameter, influencing the spread of the probability density distribution; a smaller σ results in a more concentrated distribution around the head center. To account for the varying apparent sizes of individuals due to perspective distortion in images, adaptive Gaussian kernels are often employed. These kernels dynamically adjust their σ parameter based on the local crowd density, for instance, by considering the average distance to a person's k-nearest neighbors.

The quality of the ground truth density maps is paramount for effectively training these models. These maps are meticulously generated by convolving the manually labeled point annotations with carefully selected Gaussian kernels.
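As an illustrative sketch (not taken from any particular paper's released code), the following Python snippet generates such a ground-truth density map from head-point annotations using geometry-adaptive Gaussian kernels, where σ is derived from k-nearest-neighbor distances. The β = 0.3 and k = 3 values follow the common convention popularized by the MCNN paper; the fixed fallback σ for very sparse images is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def density_map(points, shape, beta=0.3, k=3):
    """Build a ground-truth density map from (x, y) head annotations.

    Each point is convolved with a Gaussian whose sigma adapts to the
    mean distance of its k nearest neighbors (geometry-adaptive kernel).
    """
    dmap = np.zeros(shape, dtype=np.float32)
    if len(points) == 0:
        return dmap
    tree = KDTree(points)
    for i, (x, y) in enumerate(points):
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[int(y), int(x)] = 1.0
        if len(points) > k:
            # query returns the point itself first, so ask for k + 1 hits
            dists, _ = tree.query(points[i], k=k + 1)
            sigma = beta * dists[1:].mean()
        else:
            sigma = 15.0  # assumed fixed fallback for very sparse images
        dmap += gaussian_filter(impulse, sigma)
    return dmap

# Total count = integral of the map:
# count = density_map(points, image.shape[:2]).sum()
```

Because each Gaussian kernel integrates to one, summing the map recovers the annotated count (up to truncation at image boundaries).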

Leading Models for Dense Aerial Crowd Counting

1. Multi-Column CNNs (MCNN)

The Multi-column Convolutional Neural Network (MCNN) pioneered the use of parallel network structures to address the challenge of scale variation. It typically comprises three parallel CNN columns, each designed with filters possessing different sizes of local receptive fields (e.g., large, medium, small). This multi-column design enables the features learned by each column to adapt specifically to the varying sizes of people or heads, which is a common phenomenon in images due to perspective distortion. A significant innovation in MCNN was replacing traditional fully connected layers with 1x1 convolution layers, enabling the network to accept input images of arbitrary size without distortion. The feature maps generated by all columns are then concatenated and mapped to the final density map.
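A minimal PyTorch sketch of this three-column design (simplified to one kernel size per column; the channel widths are illustrative, loosely following the original paper, and this is not the official implementation):

```python
import torch
import torch.nn as nn

def column(channels, k):
    """One MCNN column: stacked convs with kernel size k and two poolings."""
    c1, c2, c3, c4 = channels
    return nn.Sequential(
        nn.Conv2d(3, c1, k, padding=k // 2), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(c1, c2, k, padding=k // 2), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(c2, c3, k, padding=k // 2), nn.ReLU(inplace=True),
        nn.Conv2d(c3, c4, k, padding=k // 2), nn.ReLU(inplace=True),
    )

class MCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Large / medium / small receptive fields via 9x9, 7x7, 5x5 kernels.
        self.branch_l = column((16, 32, 16, 8), 9)
        self.branch_m = column((20, 40, 20, 10), 7)
        self.branch_s = column((24, 48, 24, 12), 5)
        # A 1x1 conv fuses the concatenated columns into a 1-channel density map.
        self.fuse = nn.Conv2d(8 + 10 + 12, 1, kernel_size=1)

    def forward(self, x):
        features = torch.cat(
            [self.branch_l(x), self.branch_m(x), self.branch_s(x)], dim=1)
        return self.fuse(features)  # density map at 1/4 input resolution

# Predicted count = MCNN()(image_batch).sum()
```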

MCNN proved highly robust to significant variations in people/head size and perspective effects. Its architecture also demonstrated good performance and a notable ease of transferability across different datasets. Despite its effectiveness, multi-column CNNs are computationally intensive and demand substantial memory for training due to the parallel processing of multiple networks. This architecture can also lead to the generation of redundant features, and often necessitates an initial density level classifier, which further increases complexity.

Performance:

  • ShanghaiTech Part A: MAE 110.2, MSE 173.2.
  • ShanghaiTech Part B: MAE 26.4, MSE 41.3.

Visual Examples: The original MCNN paper includes figures that illustrate original crowd images and their corresponding ground truth density maps, which are generated by convolving point annotations with geometry-adaptive Gaussian kernels. It also shows examples of estimated density maps produced by the MCNN model for test images from datasets like ShanghaiTech Part A. These density maps visually represent crowd concentration, with brighter pixels indicating higher density. Other research also describes predicted density maps where crowded locations have brighter pixels.

2. Single-Column CNNs (CSRNet)

CSRNet (Congested Scene Recognition Network) represents a significant advancement by adopting a streamlined single-column architecture, specifically optimized for highly congested scenes. It typically consists of a pre-trained VGG16 network as its front-end for initial feature extraction, followed by a back-end composed of dilated CNN layers. The ingenious use of dilated kernels in the back-end allows the network to achieve a large receptive field, crucial for capturing broader context, without resorting to pooling operations that would reduce spatial resolution. This design preserves fine-grained spatial information, which is critical for accurately counting small, dense objects. This end-to-end approach is also known for its relative ease of training.
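A sketch of this front-end/back-end split in PyTorch (the dilation rate of 2 matches the back-end configuration the CSRNet paper reports as strongest; treat this as an approximation rather than the official implementation):

```python
import torch.nn as nn
from torchvision import models

class CSRNet(nn.Module):
    """Sketch of CSRNet: VGG16 front-end + dilated-convolution back-end."""

    def __init__(self):
        super().__init__()
        # First 10 conv layers of VGG16 (through conv4_3, output stride 8).
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.frontend = nn.Sequential(*list(vgg.features.children())[:23])

        # Dilated convs enlarge the receptive field without further pooling,
        # preserving spatial resolution for the density map.
        def dconv(cin, cout):
            return [nn.Conv2d(cin, cout, 3, padding=2, dilation=2),
                    nn.ReLU(inplace=True)]

        self.backend = nn.Sequential(
            *dconv(512, 512), *dconv(512, 512), *dconv(512, 512),
            *dconv(512, 256), *dconv(256, 128), *dconv(128, 64),
            nn.Conv2d(64, 1, kernel_size=1),  # 1-channel density map
        )

    def forward(self, x):
        return self.backend(self.frontend(x))
```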

CSRNet has consistently achieved strong performance, particularly excelling in highly dense crowd scenarios. Its dilated convolutions are key to effectively capturing long-range dependencies within the image while maintaining the high resolution necessary for precise density map estimation.

Performance:

  • ShanghaiTech Part A: MAE 68.2, MSE 115.0.
  • NWPU-Crowd: MAE 121.3, MSE 387.8.
  • UCF-QNRF: MAE 110.6, MSE 190.1.

Visual Examples: CSRNet outputs predicted density maps that closely resemble the ground truth density maps. These maps represent the objects of interest (people) through pixel intensity. While specific aerial image examples are not directly provided in the text, the model is designed to produce density maps whose intensity reflects the crowd distribution.

3. YOLO-based Frameworks (e.g., YOLOv8 with Context Enrichment Module - CEM)

While YOLO (You Only Look Once) is primarily an object detection model, improved frameworks based on YOLOv8 have been specifically tailored for crowd counting in aerial images. These models address the challenges of small target size (often just a few pixels) and lack of distinctive contextual cues in drone imagery. A key enhancement is the introduction of a Context Enrichment Module (CEM), which significantly improves the model's ability to detect and localize tiny targets by capturing multi-scale contextual information and differentiating them from complex backgrounds. This modified YOLOv8 framework is capable of accurately detecting, localizing, and counting individuals in complex environments with varying crowd densities and altitudes.

Traditional YOLO models, being detection-based, struggle in highly dense environments where objects are small, crowded, and occluded, leading to inaccuracies. However, the adapted YOLOv8 with CEM aims to overcome these limitations by enhancing its feature extraction for small objects and leveraging contextual information.

Performance (YOLOv8 with CEM on VisDrone-CC2020): The VisDrone-CC2020 dataset provides dot annotations, which are converted into four-tuple bounding box annotations to be compatible with YOLOv8 for training. While specific MAE/MSE results for the YOLOv8 CEM model on VisDrone-CC2020 are not explicitly provided in the snippets, the framework is stated to be applied to this challenging dataset to illustrate its efficacy. The VisDrone-CC2020 challenge uses MAE as the primary metric for ranking methods.
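The exact conversion rule is not specified in the source material, but a typical approach is to expand each dot into a fixed-size pseudo-box in YOLO's normalized label format. In the sketch below, the 12-pixel box size is purely an assumed head extent for illustration:

```python
def dots_to_yolo_boxes(points, img_w, img_h, box_size=12):
    """Turn head-center dot annotations into YOLO-format pseudo-boxes:
    (class_id, cx, cy, w, h), all normalized to [0, 1].

    box_size (pixels) is an assumed head extent at drone altitude.
    """
    boxes = []
    for x, y in points:
        boxes.append((0,                      # single 'person' class
                      x / img_w, y / img_h,   # normalized center
                      box_size / img_w, box_size / img_h))
    return boxes
```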

Visual Examples: The VisDrone-CC2020 dataset, used for evaluating YOLOv8 CEM, contains sample images from diverse scenes captured by drones, representing comprehensive coverage of different environments and conditions. While direct visual outputs of YOLOv8 CEM's density maps or bounding box predictions on dense aerial crowds are not provided in the snippets, the framework is designed to detect and localize individuals. Other YOLO-based applications demonstrate counting objects in polygon zones in scenarios like shopping mall alleys, subway stations, and market squares.

4. Point-based Networks (P2PNet & CrowdSat-Net)

Point-based networks represent a distinct approach that directly predicts a set of point proposals to represent individual heads, aligning closely with human annotation practices (which often use point annotations for heads in dense crowds). This framework discards superfluous steps like density map generation and directly outputs points to locate individuals, benefiting from the high-precision localization property of point representation and its relatively cheaper annotation cost.

P2PNet (Point to Point Network): P2PNet is an intuitive solution within this point-based framework. It directly receives a set of annotated head points for training and predicts points during inference. It utilizes a VGG16 backbone to obtain fine-grained deep feature maps and employs two branches to simultaneously predict point proposals and their confidence scores. A key aspect of P2PNet is its one-to-one matching strategy between predicted points and ground truth points, which is beneficial for accuracy.
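The one-to-one matching step can be sketched with the Hungarian algorithm, pairing each ground-truth head with at most one proposal. The cost below mixes pixel distance with prediction confidence; the tau weight is an assumed stand-in for the paper's tuned trade-off:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(pred_pts, pred_scores, gt_pts, tau=0.5):
    """One-to-one matching of predicted head points to ground truth.

    Cost mixes Euclidean distance with (1 - confidence); tau is an
    assumed weighting standing in for the paper's tuned trade-off.
    """
    pred_pts = np.asarray(pred_pts, dtype=float)   # (P, 2)
    gt_pts = np.asarray(gt_pts, dtype=float)       # (G, 2)
    scores = np.asarray(pred_scores, dtype=float)  # (P,)
    dist = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    cost = dist + tau * (1.0 - scores)[:, None]
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return rows, cols  # prediction rows[i] matched to ground truth cols[i]
```

Proposals left unmatched are treated as negatives during training, which discourages duplicate predictions on the same head.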

Performance: P2PNet has achieved state-of-the-art performance on several challenging datasets with various densities.

  • ShanghaiTech Part A: nAP_δ 64.4% / 70.3% (δ = 0.05 / 0.25)
  • ShanghaiTech Part B: nAP_δ 76.3% / 84.2% (δ = 0.05 / 0.25)
  • UCF_CC_50: nAP_δ 54.3% / 54.5% (δ = 0.05 / 0.25)
  • UCF-QNRF: nAP_δ 53.1% / 55.4% (δ = 0.05 / 0.25)
  • NWPU-Crowd: nAP_δ 65.0% / 71.3% (δ = 0.05 / 0.25)

Note that these figures are localization scores (normalized Average Precision at matching thresholds δ) rather than MAE/MSE values, reflecting P2PNet's promising localization accuracy alongside its counting performance.

Visual Examples: P2PNet directly predicts a set of points to represent the locations of individuals. While specific embedded images are not provided, the research mentions "Visualized demos for P2PNet" and describes how it directly predicts head points in images during inference. It aims to overcome the inaccuracies of density map learning (which fails to provide exact locations) and detection-based methods (which can have missing detections) by directly predicting points.

CrowdSat-Net (for VFR Satellite Imagery): CrowdSat-Net is a groundbreaking point-based CNN specifically engineered for very-fine-resolution (VFR) satellite imagery (~0.3 meters spatial resolution). This model addresses the unique challenges of satellite data, such as the blurring and loss of small object signals and high-frequency information during processing. It incorporates a Dual-Context Progressive Attention Network (DCPAN) to improve small-object feature representation and a High-Frequency Guided Deformable Upsampler (HFGDU) to restore lost high-frequency details. CrowdSat-Net transforms initial point labels into Focal Inverse Distance Transform (FIDT) maps for overlap-free head localization.
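A compact sketch of FIDT map generation from point labels (the α, β, and C constants follow values commonly cited for FIDT; whether CrowdSat-Net uses exactly these values is an assumption):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fidt_map(points, shape, alpha=0.02, beta=0.75, c=1.0):
    """Focal Inverse Distance Transform map from (x, y) head points.

    Each pixel stores 1 / (d^(alpha*d + beta) + c), where d is the
    Euclidean distance to the nearest annotated head. Heads become
    sharp, non-overlapping unit peaks, unlike Gaussian density blobs.
    Constants follow values commonly cited for FIDT (an assumption here).
    """
    mask = np.ones(shape, dtype=bool)
    for x, y in points:
        mask[int(y), int(x)] = False       # distance is zero at head centers
    d = distance_transform_edt(mask)       # distance to nearest head
    return 1.0 / (np.power(d, alpha * d + beta) + c)
```

Because the response decays sharply away from each head, nearby individuals remain separable, which is what makes FIDT maps attractive for localization in dense scenes.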

Performance (CrowdSat Dataset): CrowdSat-Net demonstrated superior performance on the CrowdSat dataset, outperforming other state-of-the-art point-based methods.

  • F1-score: 66.12%
  • Precision: 73.23%

It shows consistent robustness in moderate-to-high crowd density scenarios but some degradation in extremely sparse (1-5 individuals) and extremely dense (800+ individuals) crowds, due to false positives from background clutter or severe occlusion, respectively.

Visual Examples: The research describes figures showing examples of VFR satellite imagery used for crowd detection, where individuals are discernible to the naked eye. It also illustrates the model's performance across different crowd densities and provides visual comparisons between CrowdSat-Net and other methods across various scenes (e.g., traffic junctions, snowfields, dense urban regions). Visual localization performance in unseen foreign regions is also described. These descriptions indicate the model's ability to generate visual representations of crowd distribution and localization.

4. Key Datasets for Aerial Dense Crowd Counting

Importance of Specialized Datasets

The significant and rapid progress observed in crowd counting methodologies over recent years is largely attributable to two synergistic factors: advancements in deep Convolutional Neural Networks (CNNs) and, equally importantly, the increasing availability of diverse and challenging public crowd counting datasets. Datasets specifically curated for aerial imagery are absolutely crucial because they inherently capture the unique challenges of this domain, such as wide viewpoints, the extremely small size of objects, and complex background clutter, which are not adequately represented in ground-level datasets.

The trajectory of crowd counting dataset development directly mirrors the increasing sophistication of both the models and the real-world applications they aim to serve. Initially, datasets were relatively small and primarily captured from ground-level surveillance. As deep learning models matured and their capabilities expanded, the inherent limitations of these earlier datasets—particularly in handling extreme densities and aerial perspectives—became apparent. This realization directly spurred the creation of progressively more challenging and specialized datasets. NWPU-Crowd emerged with an unprecedented scale and density range, pushing models to handle extreme variability. Subsequently, drone-specific datasets like VisDrone-CC and then satellite-specific datasets like CrowdSat were developed. Each new dataset systematically introduces increasingly complex attributes, such as higher resolutions, wider viewpoints, more extreme crowd densities, and data from novel platforms. This continuous innovation in dataset design is not merely a collection of more data; it actively drives the development of more robust, generalizable, and specialized computer vision models, demonstrating a clear and dynamic feedback loop where dataset challenges propel methodological advancements.

Overview of Benchmark Datasets

NWPU-Crowd

NWPU-Crowd stands as a large-scale benchmark dataset specifically designed for both crowd counting and localization. It comprises an impressive 5,109 images, featuring a cumulative total of over 2.1 million annotated heads, with annotations provided as both points and bounding boxes. A defining characteristic of NWPU-Crowd is its exceptionally wide density range, spanning from 0 to an astonishing 20,033 individuals per image, making it the dataset with the largest density variation currently available. The dataset also includes a rich variety of illumination scenes (normal, extreme brightness, low-luminance) and uniquely incorporates 351 "negative samples"—scenes devoid of people but containing densely arranged objects that might be mistaken for crowds (e.g., animal migrations, sculptures). Images in this dataset are typically high-resolution, with an average resolution of 2191x3209 pixels, and some reaching up to 4028x19044 pixels.

This dataset directly addresses the limitations of earlier, smaller crowd counting datasets that often led to overfitting in deep learning models, by providing a substantially larger scale for training robust CNN-based algorithms. Its accompanying benchmark website ensures impartial evaluation of different methods, fostering fair comparisons and accelerating methodological advancements. Furthermore, experiments conducted on NWPU-Crowd have been instrumental in identifying new, critical challenges in handling extreme density variations, unseen data, and complex background regions, thereby guiding future research directions.

UCF-QNRF

UCF-QNRF is a large-scale, high-resolution dataset specifically developed to overcome the shortcomings of previous crowd counting datasets (including its much smaller, older predecessor, UCF_CC_50). It contains 1,535 images and approximately 1.25 million annotated heads. This dataset is notable for its wider variety of scenes, diverse viewpoints, broad range of densities, and significant lighting variations, making it a highly challenging benchmark.

By providing high-quality, high-resolution images and extensive annotations, UCF-QNRF is particularly well-suited for training very deep Convolutional Neural Networks for dense crowd counting, density map estimation, and localization tasks. Its comprehensive nature pushes the boundaries of model generalization.

VisDrone-CC (VisDrone-CC2020/2021)

The VisDrone-CC dataset was specifically collected for the Vision Meets Drone Crowd Counting Challenge, directly addressing the unique challenges posed by drone-captured video frames, such as small object inference, background clutter, and wide viewpoints. The VisDrone-CC2020 dataset comprises 3,360 images, each with a resolution of 1920x1080 pixels, captured by various drone-mounted cameras across 70 distinct scenarios in four different Chinese cities. It categorizes crowd density into two levels: "Crowded" (more than 150 objects per frame) and "Sparse" (fewer than 150 objects per frame). The subsequent VisDrone-CC2021 expanded the dataset to 5,468 images, uniquely including paired RGB and thermal imagery.

This dataset and its associated challenge have been instrumental in promoting advancements in drone-based crowd counting by providing a dedicated, realistic benchmark. Its specific focus on drone-captured video sequences makes it exceptionally relevant to counting crowds in aerial snapshot pictures.

CrowdSat

CrowdSat is a pioneering dataset, marking the first-ever crowd detection dataset specifically built upon Very-Fine-Resolution (VFR) satellite imagery, typically offering spatial resolutions around ~0.3 meters. It encompasses over 120,000 manually labeled individuals sourced from multiple satellite platforms (including Google Earth, Beijing-3N, and Jilin-1 Gaofen-04A) across various regions in China. The dataset features diverse geographical landscapes, including built-up urban areas, snowy regions, beaches, and deserts.

CrowdSat unlocks unprecedented opportunities for large-scale crowd activity analysis, enabling more continuous monitoring and historical trend analysis compared to the often-infrequent updates of aerial imagery. It is designed to specifically facilitate research into large-scale crowd analysis and the discovery of historical human movement patterns, bridging a significant gap in available benchmarks.

5. Conclusions and Recommendations

For counting very dense crowds with thousands of people from drone or helicopter snapshot pictures, the most effective computer vision models are those based on density map regression or direct point prediction using deep Convolutional Neural Networks (CNNs). These approaches fundamentally address the limitations of traditional object detection in highly occluded and scale-variant aerial imagery.

For drone imagery, the FPNCC (Feature Pyramid Network for Crowd Counting) model stands out, having demonstrated superior performance in the VisDrone-CC2020 Challenge with an MAE of 11.66. Its success is attributed to its Learning to Scale (L2S) module, which effectively mitigates density imbalance, and its ability to handle diverse attributes like scale, illumination, and density levels.

For very-fine-resolution (VFR) satellite imagery, CrowdSat-Net is the recommended model. It is specifically designed for this novel data source, achieving a high F1-score (66.12%) and Precision (73.23%) on the CrowdSat dataset. Its Dual-Context Progressive Attention Network (DCPAN) and High-Frequency Guided Deformable Upsampler (HFGDU) modules are crucial for enhancing small-object feature representation and restoring high-frequency information, which are critical for satellite-based crowd detection.

Other strong contenders, such as MCNN and CSRNet, have demonstrated robust performance on various challenging datasets (ShanghaiTech, NWPU-Crowd, UCF-QNRF) by leveraging advanced CNN architectures, including multi-column designs and dilated convolutions, respectively. The adaptation of YOLO-based frameworks with modules like CEM also shows promise for aerial crowd counting by enhancing small target detection and contextual understanding.

Finally, P2PNet offers a direct point-based approach for precise localization and counting in dense crowds.

The continuous development of specialized datasets, such as NWPU-Crowd, UCF-QNRF, VisDrone-CC, and CrowdSat, remains critical in pushing the boundaries of model performance and addressing the complex challenges of aerial dense crowd counting.
