Monitoring avian biodiversity in regions such as Singapore and other Southeast Asian biodiversity hotspots is a fundamental yet challenging task, particularly when relying on acoustic data. Bird sound classification models are critical for passive acoustic monitoring (PAM) as they provide cost‐effective, scalable, and non‐invasive means to assess ecosystem health. However, the development of accurate audio‐based species classifiers is impeded by a paucity of high-quality labeled regional recordings, the occurrence of sensitive and endangered species with restricted datasets, and high species diversity coupled with acoustic similarity among coexisting birds. This literature review investigates machine learning methods applied to train bird sound classification models in data-sparse regional contexts. It places special emphasis on the challenges encountered in regions such as Singapore and explores approaches that mitigate data sparsity by leveraging transfer learning, semi-supervised or weakly supervised techniques, data augmentation, and region-specific fine-tuning of models like BirdNET (Bellafkir et al., 2023, Jamil et al., 2023).
A variety of machine learning paradigms have been adopted for bird sound classification, with deep learning frameworks emerging as the prevailing approach due to their ability to learn complex, non-linear representations from spectrogram images of audio recordings.
Deep convolutional neural networks (CNNs) are among the most widely used architectures for bird sound classification. CNNs have been successfully applied to spectrogram representations, which capture both the temporal and frequency-domain features of bird calls. The use of CNNs on spectrograms is motivated by their capacity to automatically learn robust features, reducing the need for manual feature engineering (Stowell et al., 2019). Advanced CNN architectures, including AlexNet, VGG16, ResNet50, and DenseNet, have been evaluated on both large-scale and limited regional datasets. For instance, recent methods that pair deep CNNs with attention mechanisms have demonstrated efficacy in fine-grained bird call classification despite the challenges posed by weak labels and environmental noise (Bellafkir et al., 2023).
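To make the spectrogram input concrete, the sketch below computes a log-magnitude spectrogram of a synthetic tone with a numpy-only short-time Fourier transform. The 2 kHz sine wave standing in for a bird call, and the window and hop sizes, are illustrative choices, not values prescribed by the reviewed studies (which typically use Mel-scaled spectrograms via dedicated audio libraries).

```python
import numpy as np

def stft_spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform (numpy only)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T  # (freq_bins, time_frames)

# A synthetic 2 kHz tone at a 22.05 kHz sample rate stands in for a bird call.
sr = 22050
t = np.arange(sr) / sr  # one second of audio
tone = np.sin(2 * np.pi * 2000 * t)

spec = stft_spectrogram(tone)
log_spec = np.log1p(spec)  # log compression, as commonly fed to CNNs
print(log_spec.shape)  # -> (257, 85)
```

The resulting (frequency x time) image is what a CNN classifier would consume; the tone's energy concentrates near frequency bin 2000 * 512 / 22050 ≈ 46.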
Transfer learning is a critical strategy for addressing data scarcity: feature extractors pre-trained on large datasets (often from other domains, such as image classification on ImageNet) are repurposed for bird sound classification. Because spectrograms are image-like, such models bring general-purpose feature representations that can be adapted to regional datasets with minimal labeled data (Das et al., 2023). The BirdNET framework exemplifies this approach: CNN models trained on global bird sound datasets are subsequently adapted to local recordings, improving classification accuracy even when training data are limited (Kahl et al., 2021, Zhong et al., 2021).
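A minimal sketch of the feature-extraction style of transfer learning follows: a frozen, stand-in "pre-trained" embedding is kept fixed and only a new classification head is trained on a small labeled set. The embedding function, the toy two-species data, and the learning rate are all illustrative stand-ins, not the actual BirdNET pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_embed(x):
    """Stand-in for a frozen pre-trained feature extractor (e.g. a CNN trained
    on a large global corpus); its weights are fixed and never updated here."""
    W = np.ones((x.shape[-1], 8)) * 0.1  # frozen projection (illustrative)
    return np.tanh(x @ W)

# Tiny labeled regional set: two toy "species" separable in feature space.
X = np.concatenate([rng.normal(-1, 0.3, (50, 16)), rng.normal(1, 0.3, (50, 16))])
y = np.array([0] * 50 + [1] * 50)

# Train only a new logistic-regression head on the frozen embeddings.
Z = pretrained_embed(X)
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Z @ w + b)))  # sigmoid head
    w -= 0.5 * (Z.T @ (p - y) / len(y))  # gradient of the cross-entropy loss
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((Z @ w + b) > 0) == (y == 1))
print(f"head-only accuracy: {acc:.2f}")
```

Training only the head is what makes the approach viable with minimal labeled data; full fine-tuning of the backbone, as in region-adapted BirdNET models, follows the same pattern with more parameters unfrozen.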
In regions with sparse labeled data, semi-supervised and weakly-supervised learning approaches have emerged as viable alternatives. These techniques utilize a mixture of labeled and abundant unlabeled audio data to create robust classifiers. For example, FixMatch, a semi-supervised learning algorithm, has been applied to bird sound classification by generating pseudo-labels on unlabeled recordings and improving performance when only a small portion of the dataset is annotated (Caprioli, 2022). Likewise, methods based on weak supervision leverage incomplete or imprecise labels by incorporating expert knowledge to fine-tune the model and to mitigate the noise and inconsistencies that often arise in crowdsourced audio recordings (Conde et al., 2021).
Data augmentation plays a crucial role in combating overfitting and enhancing the generalization capabilities of classifiers trained on limited regional data. Techniques such as time and pitch shifting, spectrogram axis shifting, mixup, and the addition of background noise (including, for instance, external bird audio recordings) have been applied to artificially expand the dataset. These augmentations help simulate variations encountered in real-world recordings, particularly in acoustically complex environments typical of tropical regions (Ansar et al., 2024, Nshimiyimana, 2024). By generating synthetic variations, data augmentation also alleviates the challenges of class imbalance, which is especially pertinent when rare species are represented by only a few samples (Das et al., 2023).
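The augmentations named above can be sketched in a few lines of numpy; the spectrogram shapes, noise scale, and shift range below are arbitrary illustrative values, and the circular roll is a cheap surrogate for a true time shift.

```python
import numpy as np

rng = np.random.default_rng(42)

def mixup(spec_a, spec_b, label_a, label_b, alpha=0.4):
    """Blend two spectrograms and their one-hot labels; the mixing ratio
    lam is drawn from a Beta(alpha, alpha) distribution."""
    lam = rng.beta(alpha, alpha)
    return lam * spec_a + (1 - lam) * spec_b, lam * label_a + (1 - lam) * label_b

def add_noise(spec, scale=0.05):
    """Inject background noise to mimic field recording conditions."""
    return spec + scale * rng.standard_normal(spec.shape)

def time_shift(spec, max_frames=10):
    """Circularly shift along the time axis (a cheap time-shift surrogate)."""
    return np.roll(spec, rng.integers(-max_frames, max_frames + 1), axis=1)

# Two toy (freq x time) spectrograms for different "species".
a, b = rng.random((64, 128)), rng.random((64, 128))
ya, yb = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mixed_spec, mixed_label = mixup(a, b, ya, yb)
augmented = time_shift(add_noise(mixed_spec))
print(mixed_label, augmented.shape)
```

Note that mixup softens the labels as well as the inputs, which is one reason it helps with class imbalance: rare-species examples contribute to many blended training samples.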
Training bird sound classification models in regions such as Singapore presents unique challenges that can be grouped into three main areas: limited availability of high-quality labeled audio data, the sensitive nature of datasets concerning endangered species, and high acoustic similarity among diverse, co-occurring species.
One of the predominant challenges in data-sparse regional contexts is the scarcity of high-quality labeled audio recordings. In many biodiversity hotspots, the production of expert-verified annotated datasets is both time-consuming and resource-intensive. For example, studies like SiulMalaya have addressed these challenges by combining citizen science data with expert annotations, but even then, classification accuracies remain modest due to the limited amount of available training data (Jamil et al., 2023). With only sparse examples to learn from, and with recordings contaminated by diverse environmental noise, effective bird sound classification becomes significantly more difficult.
The conservation status of many avian species necessitates careful handling of the related acoustic data. Some rare or endangered species appear infrequently in recordings, resulting in severely imbalanced datasets. This imbalance not only raises issues of overfitting during model training but also increases the likelihood of false negatives, which could impede conservation efforts. As many of these species are of high ecological significance, models must be carefully designed to maintain sensitivity to rare calls while avoiding misclassifications caused by background noise or overlapping vocalizations (Zhong et al., 2021, Caprioli, 2022).
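One common countermeasure to the imbalance described above is to weight the loss inversely to class frequency, so that the few clips of a rare species carry more gradient signal. The sketch below shows this weighting scheme; the 95/5 split is a hypothetical example, not data from any cited study.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class loss weights inversely proportional to class frequency,
    so rare (e.g. endangered) species contribute more per example."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1))

# Hypothetical split: 95 clips of a common species vs. 5 of a rare one.
labels = np.array([0] * 95 + [1] * 5)
w = inverse_frequency_weights(labels, 2)
print(w)  # the rare class is weighted 19x the common one
```

Such weights are typically passed to the training loss (e.g. weighted cross-entropy), trading a little precision on common species for recall on rare ones, which matters when false negatives would hide an endangered species from conservation monitoring.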
Southeast Asian ecosystems, particularly in regions like Singapore, are characterized by high species diversity. This diversity is accompanied by significant acoustic similarity among species, especially those that co-occur in dense habitats such as urban parks and rainforests. The overlap in frequency ranges and temporal patterns among bird calls increases the complexity of classification, necessitating models capable of discerning subtle differences. In these contexts, even minor variations introduced by different recording conditions or the presence of ambient noise can lead to misclassifications, further complicating the task (Bellafkir et al., 2023, Tang et al., 2024).
Researchers have developed several strategies to mitigate the inherent difficulties in training accurate bird sound classifiers using limited regional data. These strategies leverage advancements in transfer learning, semi-supervised learning, and data augmentation, among other techniques.
The concept of transfer learning involves the adaptation of models pre-trained on large global datasets to the specific conditions found in a regional context. For bird sound classification, large-scale datasets such as those collected via Xeno-canto or the BirdCLEF challenges serve as a robust foundation from which models can learn general acoustic features. These models can then be fine-tuned with available regional data to capture local environmental and species-specific characteristics. For instance, models such as ResNet50 and EfficientNet have been successfully adapted from global datasets to recognize calls in regionally limited contexts (Das et al., 2023, Kahl et al., 2021). BirdNET further exemplifies this strategy by employing region-specific fine-tuning that incorporates quality-based loss weighting and threshold calibration to adapt to the local acoustic domain, thereby improving performance on limited and noisy regional audio samples (Bellafkir et al., 2023, Ansar et al., 2024).
When labeled data are sparse, semi-supervised and weakly-supervised methods allow models to leverage the abundance of unlabeled recordings. Semi-supervised algorithms, such as FixMatch, combine a small, high-quality labeled dataset with a larger pool of unlabeled data by generating pseudo-labels and filtering them using confidence thresholds. This approach has been shown to improve accuracy by effectively expanding the training data without incurring the high costs of manual annotation (Caprioli, 2022). Similarly, weakly supervised techniques address label noise by working with incomplete or imprecise labels, adapting to the inherent uncertainty in field-collected recordings, and thereby enabling the training of robust classifiers even with limited labeled data (Conde et al., 2021).
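The core pseudo-labeling step of this family of methods can be sketched as below. This shows only the confidence-threshold selection; full FixMatch additionally enforces consistency between weakly and strongly augmented views of each clip. The prediction matrix is made up for illustration.

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """FixMatch-style selection: keep unlabeled examples whose maximum
    predicted class probability exceeds the confidence threshold, and
    assign them the argmax class as a pseudo-label."""
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return np.argmax(probs[keep], axis=1), keep

# Illustrative model predictions over 4 unlabeled clips and 3 classes.
probs = np.array([
    [0.97, 0.02, 0.01],  # confident -> pseudo-labeled as class 0
    [0.50, 0.30, 0.20],  # uncertain -> discarded
    [0.01, 0.96, 0.03],  # confident -> pseudo-labeled as class 1
    [0.40, 0.35, 0.25],  # uncertain -> discarded
])
labels, keep = pseudo_label(probs)
print(labels, keep)
```

The accepted clips are added to the training set with their pseudo-labels; the high threshold is what keeps label noise from the discarded, ambiguous recordings out of training.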
Data augmentation strategies serve as another cornerstone for mitigating the challenges associated with limited regional datasets. By applying transformations such as time stretching, pitch shifting, adding synthetic background noise, and mixup methods, researchers can artificially expand the available dataset and introduce variability that helps the model generalize better to unseen data. Such techniques have been particularly effective in overcoming issues related to environmental variability and class imbalance (Ansar et al., 2024, Nshimiyimana, 2024). Additionally, the generation of synthetic data through proxy species or simulated audio environments can further enrich training datasets, thereby compensating for the scarcity of examples for rare or sensitive species.
An effective way to bridge the gap between global models and local conditions is through region-specific fine-tuning. Models that are initially pre-trained on large and diverse datasets capture general acoustic patterns that are then refined using limited regional recordings. Fine-tuning of models such as BirdNET on localized data allows the classifier to account for differences in environmental acoustics, species behavior, and recording equipment. This process not only improves recognition accuracy but also helps in adjusting model sensitivity to variations that are specific to regions like Singapore (Zhong et al., 2021, Rajan & Noumida, 2021).
In addition to traditional acoustic features derived from spectrograms, incorporating auxiliary meta information—such as textual descriptions of bird calls, ecological traits, and life-history data—can further bolster classification performance. Meta-information allows models to contextualize audio data by leveraging additional cues about species' habitats, morphology, and behavior. Recent studies have shown that concatenating AVONET features (including ecological and morphological traits) with life-history characteristics can improve zero-shot audio classification performance, thereby enhancing the model's ability to generalize from global datasets to region-specific contexts (Gebhard et al., 2024).
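The concatenation of audio and trait features can be sketched as follows. The embedding dimension, the trait values, and the per-block z-scoring are illustrative assumptions; the cited work explores several ways of combining such meta information, of which plain concatenation is the simplest.

```python
import numpy as np

def fuse_audio_and_traits(audio_emb, trait_vec):
    """Concatenate an audio embedding with per-species trait features
    (e.g. AVONET-style morphological/ecological traits), z-scoring each
    block so neither modality dominates purely by scale."""
    def z(v):
        return (v - v.mean()) / (v.std() + 1e-8)
    return np.concatenate([z(audio_emb), z(trait_vec)])

audio_emb = np.random.default_rng(1).random(128)  # from some audio encoder
traits = np.array([42.0, 15.3, 1.0, 0.0, 3.2])    # hypothetical trait values
fused = fuse_audio_and_traits(audio_emb, traits)
print(fused.shape)  # -> (133,)
```

A downstream classifier trained on the fused vector can then exploit trait cues (habitat, morphology) to separate species whose calls alone are nearly indistinguishable.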
The convergence of methods such as transfer learning, semi-supervised techniques, and robust data augmentation offers a promising path forward for developing accurate bird sound classifiers in data-sparse regional contexts, particularly in Southeast Asia and Singapore. In practice, these methods can be integrated into a comprehensive pipeline that begins with the acquisition and pre-processing of raw audio data, followed by the application of advanced CNN architectures. Pre-processing steps include cropping long recordings into manageable time windows, converting the audio into Mel spectrograms, and applying noise reduction filters tailored to the local acoustic environment (Bellafkir et al., 2023, LeBien et al., 2020).
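Two of these pre-processing steps can be sketched directly: windowing a long recording and a crude spectral-subtraction denoiser. The 3-second window, 50% overlap, and percentile-based noise floor are illustrative defaults, not parameters taken from the cited pipelines.

```python
import numpy as np

def crop_windows(audio, sr, win_s=3.0, hop_s=1.5):
    """Slice a long recording into overlapping fixed-length windows."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]

def denoise_spectrogram(spec, percentile=20):
    """Crude spectral subtraction: estimate a per-frequency noise floor
    from a low percentile over time, then subtract and clip at zero."""
    floor = np.percentile(spec, percentile, axis=1, keepdims=True)
    return np.maximum(spec - floor, 0.0)

sr = 22050
audio = np.random.default_rng(7).standard_normal(sr * 10)  # 10 s toy recording
windows = crop_windows(audio, sr)

toy_spec = np.abs(np.random.default_rng(8).standard_normal((257, 85)))
clean = denoise_spectrogram(toy_spec)
print(len(windows), bool(clean.min() >= 0))
```

Each window then becomes one spectrogram fed to the classifier; the percentile floor adapts automatically to the steady background (traffic, insects, rain) of the local soundscape.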
Initial training on large-scale global datasets enables the model to learn generic acoustic features, which are then transferred to the target domain through fine-tuning with regional audio. This stage is critical in regions such as Singapore where environmental conditions, recording equipment quality, and species vocalizations vary significantly from those found in global datasets (Das et al., 2023, Kahl et al., 2021). Following the transfer learning stage, semi-supervised and weakly supervised techniques further expand the training dataset through pseudo-labeling of unlabeled recordings, thereby mitigating the effects of scarce annotated data (Caprioli, 2022, Conde et al., 2021).
Data augmentation remains an integral enhancement in such pipelines, where techniques ranging from simple time-pitch shifting to complex mixup training are utilized to simulate the variability of natural soundscapes. This is particularly important for training robust classifiers capable of distinguishing between acoustically similar species in biodiverse regions (Ansar et al., 2024, Nshimiyimana, 2024). Moreover, integrating meta-information associated with the birds, such as ecological attributes and life-history traits, can offer an additional layer of context. This approach is crucial when distinguishing between species with high acoustic similarity, ensuring that the classifier accounts for subtle yet biologically meaningful differences (Gebhard et al., 2024).
In scenarios where the acoustic recordings include data for rare or endangered species, methods such as proxy species generation and the use of synthetic data can further compensate for the limited number of examples available. Not only do these methods enrich the overall dataset, but they also help in training models that are resilient to the variances in call frequency and intensity found among rare species (Rajan & Noumida, 2021, Stowell et al., 2019).
Drawing from the diverse approaches detailed in the reviewed literature, it is evident that the combination of modern deep learning techniques with domain-specific adaptations is key to overcoming data sparsity in bird sound classification tasks. The effectiveness of transfer learning is particularly notable—pre-trained models that are fine-tuned with regional data can bridge the gap between heterogeneous audio domains and provide high classification accuracy despite sparse labeled examples (Das et al., 2023, Kahl et al., 2021).
Moving forward, several research directions appear promising. One area of active research is the further development of self-supervised learning algorithms that do not require any annotated labels at all during pre-training. With the advent of self-supervised models in other domains, there is potential to harness large volumes of unlabeled audio recordings from biodiversity hotspots, thus reducing reliance on manual annotations (Caprioli, 2022). Researchers might also explore architectures that combine CNN and recurrent neural network (RNN) layers to capture both spectral and temporal features more effectively, as these hybrid models have demonstrated improved performance in recognizing overlapping and sequential bird vocalizations (Stowell et al., 2019).
Furthermore, the integration of active learning strategies—where the model selectively queries the most uncertain examples for expert labeling—could help optimize the annotation process in data-sparse environments. Active learning has the potential to greatly reduce the labeling effort required, ensuring that only the most impactful data points are annotated, thereby enhancing overall classifier performance (Clink et al., 2024).
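A minimal form of this query strategy is entropy-based uncertainty sampling, sketched below on made-up predictions; real active-learning systems combine such scores with diversity criteria and batching.

```python
import numpy as np

def entropy_query(probs, k=2):
    """Uncertainty sampling: rank unlabeled clips by predictive entropy
    and return the indices of the k most uncertain for expert labeling."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]

# Illustrative predictions over 4 unlabeled clips and 3 classes.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> low priority for labeling
    [0.34, 0.33, 0.33],  # near-uniform -> highest entropy
    [0.70, 0.20, 0.10],
    [0.45, 0.45, 0.10],  # ambiguous between two classes
])
query = entropy_query(probs)
print(query)
```

Only the clips the model finds genuinely ambiguous are sent to the expert, which is exactly the economy the paragraph above describes: annotation effort concentrated where it changes the classifier most.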
Region-specific studies are essential to validate and refine these methodologies. In the context of Singapore, a densely populated and ecologically complex urban setting, factors such as urban noise, seasonal changes, and the influence of diverse habitat types must be considered. Future work should involve collecting more localized audio data and implementing fine-tuning steps that emphasize these unique acoustic properties while incorporating community-driven citizen science initiatives to build larger, high-quality annotated datasets (Jamil et al., 2023, LeBien et al., 2020).
Another promising avenue is the use of ensemble modeling techniques, where multiple classifiers are combined to improve overall accuracy and robustness. Ensemble approaches have been shown to reduce the variance inherent in single-model predictions, thereby yielding more reliable results in challenging acoustic environments (Henkel & Singer, 2021, Rajan & Noumida, 2021).
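The simplest such ensemble averages the class-probability outputs of the member models, as sketched below on toy predictions; weighted averaging, stacking, and rank averaging are common refinements.

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Average the class-probability outputs of several classifiers;
    optional weights let stronger models count for more."""
    stacked = np.stack(prob_list)  # (n_models, n_clips, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list)) / len(prob_list)
    avg = np.tensordot(weights, stacked, axes=1)
    return np.argmax(avg, axis=1), avg

# Three toy models; they disagree on clip 0, and averaging resolves the vote.
m1 = np.array([[0.6, 0.4], [0.2, 0.8]])
m2 = np.array([[0.4, 0.6], [0.1, 0.9]])
m3 = np.array([[0.7, 0.3], [0.3, 0.7]])
preds, avg = ensemble_predict([m1, m2, m3])
print(preds)  # -> [0 1]
```

Averaging cancels the idiosyncratic errors of individual models, which is the variance reduction the cited work relies on in noisy soundscapes.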
Integration of multi-modal data also presents a strong opportunity. For instance, coupling audio data with environmental and visual metadata could provide additional discriminative power. This multi-modal approach is particularly relevant for habitats where bird calls are acoustically similar. By aligning acoustic signals with spatial, temporal, and ecological metadata, classifiers could achieve improved differentiation among species that are otherwise challenging to distinguish based solely on audio (Gebhard et al., 2024, Tang et al., 2024).
Based on the literature reviewed, researchers and practitioners working in data-sparse regional contexts such as Singapore should consider the following recommendations for developing robust bird sound classification systems:
Initiate model training on extensive, established global bird acoustic databases and fine-tune on a curated subset of local recordings. Pre-trained CNN architectures, especially those adapted into frameworks such as BirdNET, should form the backbone of regional classifiers due to their demonstrated ability to generalize across diverse acoustic domains (Das et al., 2023, Kahl et al., 2021).
Exploit the abundance of unlabeled audio data available via passive acoustic monitoring networks by adopting semi-supervised learning methods such as FixMatch or alternative weakly supervised algorithms. This approach will help bridge the gap caused by limited expert annotations, enabling models to learn from both high-quality labeled recordings and a larger pool of unlabeled data (Caprioli, 2022, Conde et al., 2021).
Use a comprehensive suite of data augmentation techniques to enhance training dataset diversity. Time and pitch shifting, noise injection, and mixup training can simulate various recording conditions encountered in urban and forested areas in Singapore, improving the model's ability to generalize under varying environmental conditions (Ansar et al., 2024, Nshimiyimana, 2024).
After transfer learning from global datasets, dedicate resources to fine-tuning the model on locally collected data. Incorporate region-specific acoustic characteristics and environmental factors into the training process. This step is crucial in ensuring that the model remains sensitive to the subtle inter-species variations and local noise conditions characteristic of Southeast Asian soundscapes (Zhong et al., 2021, Rajan & Noumida, 2021).
Supplement audio features with relevant meta information such as species morphological traits, ecological data, and temporal recording metadata. This integrative approach can help disambiguate calls from species with high acoustic similarity, enabling more precise classification in biodiversity-rich regions where multiple, overlapping signals are common (Gebhard et al., 2024, Tang et al., 2024).
Combine multiple machine learning models to form an ensemble that reduces variance and enhances the robustness of predictions. Additionally, implement active learning strategies to prioritize labeling of the most ambiguous recordings, thereby optimizing the use of limited expert resources (Henkel & Singer, 2021, Clink et al., 2024).
The literature reviewed herein clearly demonstrates that modern deep learning techniques—augmented by transfer learning, semi-supervised methods, and robust data augmentation—offer a promising solution to the problem of training bird sound classification models in data-sparse regional contexts. Regions like Singapore, characterized by high biodiversity and complex, acoustically noisy environments, present unique challenges that can be effectively addressed through the integration of global models with region-specific fine-tuning, active learning, and meta-data fusion. While the scarcity of high-quality labeled data remains a significant obstacle, the combination of these approaches promises to enhance monitoring capabilities, support conservation efforts for endangered species, and ultimately contribute to a more informed understanding of Southeast Asian biodiversity (Bellafkir et al., 2023, Das et al., 2023, Jamil et al., 2023, Stowell et al., 2019).
Future research is encouraged to expand on these methodologies by incorporating emerging techniques such as self-supervised representation learning and the further exploration of multi-modal data fusion. Such advancements are essential for the evolution of PAM systems that can overcome the inherent limitations of data-sparse environments. Researchers should also prioritize the creation of large-scale, expert-annotated regional datasets—potentially through collaborations that combine citizen science efforts with standardized recording protocols—to further refine model performance and generalization in tropical urban and forest settings (LeBien et al., 2020, Tang et al., 2024, Zhong et al., 2021).
In summary, the integration of transfer learning, semi-supervised learning, data augmentation, and region-specific fine-tuning constitutes the cornerstone of effective bird sound classification in Southeast Asia. This approach not only addresses the challenge of limited annotated data but also facilitates the recognition of highly similar and overlapping bird calls in complex soundscapes, ultimately contributing to robust and scalable biodiversity monitoring systems. Such systems are essential for informing conservation strategies and ensuring that even rare or endangered species are appropriately monitored, thereby supporting the broader goals of ecosystem management and biodiversity conservation (Das et al., 2023, Kahl et al., 2021, Rajan & Noumida, 2021).
By adopting and further refining these strategies, practitioners in regions like Singapore can transform sparse audio datasets into valuable, actionable insights that drive the next generation of environmental monitoring and conservation efforts. This body of work, drawing on advances from both global and region-specific studies, offers a comprehensive framework for overcoming data limitations and achieving high-performance bird sound classification in challenging ecological contexts.
In conclusion, accurate bird sound classification in data-sparse regional environments is not only feasible but also essential for effective biodiversity monitoring. The successful integration of advanced machine learning techniques with contextual domain knowledge ensures that even in regions with limited data, reliable and scalable models can be developed. Researchers must continue to explore and refine these integrated approaches, as they hold the key to bridging the gap between global methodologies and local environmental challenges, ultimately fostering a deeper understanding of the rich and diverse avifauna in Southeast Asia (Bellafkir et al., 2023, Ansar et al., 2024, Stowell et al., 2019).
Through collaborative efforts that combine expertise in machine learning, ecology, and local field studies, the future of bird sound classification in biodiversity hotspots is bright. Such collaborative initiatives will pave the way for more effective deployment of PAM systems, ensuring that limited regional datasets are leveraged to their fullest potential, thereby contributing significantly to conservation and ecological research in dynamic and complex urban and natural environments.
Ansar, W., Chatterjee, A., Goswami, S., & Chakrabarti, A. (2024). An EfficientNet-based ensemble for bird-call recognition with enhanced noise reduction. SN Computer Science, 5, 265.
Bellafkir, H., Vogelbacher, M., Schneider, D., Kizik, V., Mühling, M., & Freisleben, B. (2023). Bird species recognition in soundscapes with self-supervised pre-training. Communications in Computer and Information Science, 60-74.
Caprioli, E. (2022). A semi-supervised approach to bird song classification. Master's thesis, Norwegian University of Science and Technology (NTNU).
Clink, D. J., Cross-Jaya, H., Kim, J., Ahmad, A. H., Hong, M., Sala, R., Birot, H., Agger, C., Vu, T. T., Thi, H. N., Chi, T. N., & Klinck, H. (2024). Benchmarking automated detection and classification approaches for monitoring of endangered species: a case study on gibbons from Cambodia. bioRxiv.
Conde, M. V., Shubham, K., & Agnihotri, P. (2021). Weakly-supervised classification and detection of bird sounds in the wild: a BirdCLEF 2021 solution. arXiv.
Das, N., Padhy, N., Dey, N., Bhattacharya, S., & Tavares, J. M. R. S. (2023). Deep transfer learning-based automated identification of bird song. International Journal of Interactive Multimedia and Artificial Intelligence, 8, 33.
Gebhard, A., Triantafyllopoulos, A., Bez, T., Christ, L., Kathan, A., & Schuller, B. W. (2024). Exploring meta information for audio-based zero-shot bird classification. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1211-1215.
Henkel, C., & Singer, P. (2021). Recognizing bird species in diverse soundscapes under weak supervision. arXiv.
Jamil, N., Norali, A. N., Ramli, M. I., Shah, A. K. M. K., & Mamat, I. (2023). SiulMalaya: an annotated bird audio dataset of Malaysia lowland forest birds for passive acoustic monitoring. Bulletin of Electrical Engineering and Informatics, 12, 2269-2281.
Kahl, S., Wood, C. M., Eibl, M., & Klinck, H. (2021). BirdNET: A deep learning solution for avian diversity monitoring. Ecological Informatics, 61, 101236.
LeBien, J., Zhong, M., Campos-Cerqueira, M., Velev, J. P., Dodhia, R., Ferres, J. L., & Aide, T. M. (2020). A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network. Ecological Informatics, 59, 101113.
Nshimiyimana, A. (2024). Acoustic data augmentation for small passive acoustic monitoring datasets. Multimedia Tools and Applications, 83, 63397-63415.
Rajan, R., & Noumida, A. (2021). Multi-label bird species classification using transfer learning. International Conference on Communication, Control and Information Sciences (ICCISc), 1-5.
Stowell, D., Wood, M. D., Pamuła, H., Stylianou, Y., & Glotin, H. (2019). Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods in Ecology and Evolution, 10, 368-380.
Tang, Y., Liu, C., & Yuan, X. (2024). Recognition of bird species with birdsong records using machine learning methods. PLOS ONE, 19, e0297988.
Zhong, M., Taylor, R., Bates, N., Christey, D., Basnet, H., Flippin, J., Palkovitz, S., Dodhia, R., & Ferres, J. L. (2021). Acoustic detection of regionally rare bird species through deep convolutional neural networks. Ecological Informatics, 64, 101333.