Soundscapes in the environment contain a vast amount of information. By listening to bird calls, you can learn which species live in an area, roughly how many individuals are there, and, by proxy, the health of struggling ecosystems that may be hidden from traditional monitoring techniques like camera traps and radio tracking. At least, that's the idea in theory. In 2019, scientists from the San Diego Zoo Wildlife Alliance placed 35 microphones in the Peruvian Amazon for about a month, collecting over 1,500 hours of audio. On average, a person takes at least three seconds to manually label one second of audio, so labeling the full dataset would take about 4,500 hours, or roughly 187 days of nonstop work. Manual labeling is therefore a very expensive and time-consuming process.

To automate it, we can use recent developments in machine learning and bird classification. However, most bird classifiers are trained largely on European and North American species rather than Amazonian ones. To use these models, we could retrain them to better identify Amazonian birds. But retraining requires labeled data, which is exactly what we lack, since producing those labels is what we are building the ML pipeline for in the first place. So how can you retrain models with minimal human effort while maintaining quality? There is a way around this problem: publicly available data. Citizen science platforms like Xeno-canto have huge repositories of audio recordings organized by species. However, this audio is weakly labeled, with one species name for an entire clip. A weak label does not tell us when or how often the bird calls, and there may be multiple species in a single recording, which could confuse a machine learning algorithm. What we need is strongly labeled data, where the time of each individual call is known. Here is the key observation: if we run a binary detector that flags bird-like sounds, then more often than not, the flagged sounds belong to the weakly labeled species. This yields a large number of moderate-quality labels to train on. This summer, we worked on determining whether this is a viable strategy for model training.

Before we can use the models, we need some ground truth to test them against, meaning human labels. This is where Pyrenote, our web-based manual labeling platform, comes in. We selected 10 random Amazonian species with a large number of clips on Xeno-canto and sampled 2,000 seconds of audio for each species. We then had 21 volunteers from the 2022 UCSC COSMOS machine learning cohort label this data with the weakly labeled species or, if they were unsure, generic bird labels. In total, they created about 500 annotations per species. This data serves as our ground truth.

Our team looked at a few well-performing bird detectors: BirdNET, Microfaune, and TweetyNet. Even if they cannot classify species from remote regions well, they can still detect clear enough bird vocalizations to label data with. To assign a species to each detection, we can either apply the clip's weak label or use BirdNET to classify the species it already knows, producing more accurate labels. With this newly strong-labeled data, we can train models that go directly from unlabeled audio to strong labels. Pairing a detector with a classifier in this way creates a weak-to-strong (WTS) labeling pipeline. We developed and tested six such pipelines and applied them to all of the data we had for these species. We can then compare the labeled COSMOS sample to the weak-to-strong labels to evaluate each pipeline's performance.
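To make the WTS idea concrete, here is a minimal sketch of the core step: promoting a clip-level weak label to call-level strong labels. The `detect_bird_segments` argument is a hypothetical stand-in for any binary detector (for example, Microfaune or TweetyNet wrapped to return time intervals), not a real library call:

```python
from dataclasses import dataclass

@dataclass
class StrongLabel:
    file: str
    start_s: float   # onset of the detected call, in seconds
    end_s: float     # offset of the detected call, in seconds
    species: str     # inherited from the clip's weak label

def weak_to_strong(clip_path, weak_species, detect_bird_segments):
    """Promote a clip-level (weak) species label to call-level (strong) labels.

    Core WTS assumption: a segment the binary detector flags as bird-like
    is, more often than not, a call from the weakly labeled species.
    """
    segments = detect_bird_segments(clip_path)  # -> [(start_s, end_s), ...]
    return [StrongLabel(clip_path, s, e, weak_species) for s, e in segments]

# Example with a trivial stand-in detector and a placeholder species name:
fake_detector = lambda path: [(0.5, 1.8), (3.2, 4.0)]
print(weak_to_strong("XC12345.wav", "species_a", fake_detector))
```

The six pipelines differ mainly in which detector fills that role and in whether the clip's weak label is applied directly or refined by BirdNET's classifier.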
Our key evaluation metrics are precision, recall, and ROC curves. Higher precision means that when the model identifies a bird, it is usually correct. Higher recall means the model detects most of the bird calls in a clip. Finally, ROC curves give a sense of overall accuracy: a model guessing at random has a 50% chance of being right and traces the diagonal, while an ideal model's curve bends toward the top-left corner. The higher the AUC score, or area under the curve, the better the classifier.

The weak-to-strong pipelines show high precision and low recall for each species of interest, across the board. This means few false positives but many false negatives. When we train models on the WTS output, the data is less likely to be mislabeled, but some calls will be missed. Many of our pipelines have AUC scores around 0.6 to 0.8, indicating they are average classifiers. Ideally, we want to show that our retrained models achieve higher AUC scores than the WTS pipelines themselves.

Now that we have pipelines to create training data, let's train some models. We ran each of the six WTS pipelines over all of the Xeno-canto data we had for our 10 species of interest, labeling tens of thousands of clips automatically. We then trained separate instances of OpenSoundscape, an open-source retrainable bioacoustic model, one per pipeline. This way, we can compare the resulting models' performance and determine how good each pipeline is at generating high-quality training data.

Once the models are retrained, we can revisit our evaluation metrics. The ROC curves and AUC scores of the retrained models are much better than those of the WTS pipelines, which is promising. However, precision dropped and recall rose, implying the retrained models produce more false positives and leaving room for future improvement. This experiment suggests that our pipeline can train a decently performing model on data that humans have never strongly labeled. The only human time spent was citizen scientists labeling their own clips on Xeno-canto and a one-hour session with the COSMOS students, a significant decrease compared to manually labeling the entire dataset. Further, we now have a model we can apply to new data, reducing the future need for human labeling, with the exception of gathering more ground truth.

To improve model accuracy, we are looking to improve the quality of our annotations; the following work is still in progress. This summer, I worked on using embeddings, numerical representations of short audio clips output by an existing model, to increase the precision of annotations produced by other models. This would make new training data stronger by including fewer falsely identified bird calls. So far, I have used a clustering model that groups the embeddings in a multi-dimensional space and filtered out all of the embeddings labeled as noise. By filtering out noisy annotations, we can train on large amounts of data while keeping falsely identified bird calls to a minimum, improving the performance of the trained models.
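To ground the steps above in code, the next three sketches illustrate them in Python. First, the evaluation metrics: assuming the interval annotations have been rasterized into per-time-chunk binary vectors (one common way to compare them), scoring a pipeline against the ground truth might look like this, with placeholder numbers throughout:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical example vectors: one entry per time chunk of audio.
y_true  = np.array([0, 1, 1, 0, 1, 0, 0, 1])           # human (Pyrenote) labels
y_pred  = np.array([0, 1, 0, 0, 1, 0, 0, 0])           # thresholded pipeline output
y_score = np.array([.1, .9, .4, .2, .8, .3, .1, .45])  # pipeline confidence scores

print("precision:", precision_score(y_true, y_pred))  # few false positives -> high
print("recall:   ", recall_score(y_true, y_pred))     # few missed calls    -> high
print("ROC AUC:  ", roc_auc_score(y_true, y_score))   # 0.5 = chance, 1.0 = ideal
```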
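Next, the retraining step. This follows the CNN class pattern from recent OpenSoundscape releases; the exact API varies by version, so treat it as a template rather than our verbatim training code:

```python
import pandas as pd
from opensoundscape import CNN  # assumed import path for recent versions

species = ["species_a", "species_b"]  # placeholder class names

# One-hot label DataFrame indexed by (file, start_time, end_time),
# built from WTS pipeline output rather than human annotation.
train_df = pd.DataFrame(
    {"species_a": [1, 0], "species_b": [0, 1]},
    index=pd.MultiIndex.from_tuples(
        [("clip1.wav", 0.0, 3.0), ("clip2.wav", 3.0, 6.0)],
        names=["file", "start_time", "end_time"],
    ),
)

model = CNN(architecture="resnet18", classes=species, sample_duration=3.0)
# A real run would hold out a separate validation split; train_df is
# reused here only to keep the sketch self-contained.
model.train(train_df, train_df, epochs=10, batch_size=64)
```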
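Finally, the in-progress embedding filter. The clustering algorithm is not named above, so DBSCAN is assumed here because it explicitly labels low-density points as noise (cluster label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder: 500 annotations, each represented by a 320-dim embedding
# (e.g., pulled from an existing model such as BirdNET).
rng = np.random.default_rng(0)
embeddings = rng.random((500, 320))

# DBSCAN assigns -1 to low-density points, i.e. likely noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embeddings)

keep = labels != -1            # drop annotations clustered as noise
filtered = embeddings[keep]    # cleaner set to train on
print(f"kept {keep.sum()} of {len(labels)} annotations")
```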
Another thing we can do to get better models is to ensure our ground truth is of better and better quality. Since manually validating human-labeled annotations is a slow process, it would be nice to have an automated system in Pyrenote that can measure the quality of the annotations produced. One way is to have multiple users label the same clip and see how much they agree. If they disagree, we can have additional users label the clip so we can better identify the ground truth. This summer, I experimented with pairwise agreement metrics. I tested three different methods: clustering annotations based on their start and end times, majority vote on time chunks, and intersection over union (a minimal sketch of the IoU computation appears at the end of this section). I found that majority vote and intersection over union tend to report lower user agreement because they are skewed by outliers. Clustering handles outliers much more reliably, giving accurate user agreement metrics. Hence, clustering will be implemented in Pyrenote to help us improve the quality of the data we label.

The project is still ongoing, so we have more work to do. We hope to make the models more accurate via BirdNET embeddings, retrain other models like TweetyNet and BirdNET, and use the user agreement methods in Pyrenote to improve our ground truth. By continuing to flesh out this methodology, we hope to apply our models to field data this year and label it accurately. In doing so, we make it continuously easier to monitor the rapidly changing biodiversity of our planet and let these species be heard.
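As a closing footnote, here is the IoU agreement sketch referenced above. Annotations are (start, end) intervals in seconds, and all values are placeholders; the greedy best-match scoring shows how a single unmatched annotation drags the average down, which is exactly the outlier sensitivity we observed:

```python
def interval_iou(a, b):
    """IoU of two (start_s, end_s) intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Two users' annotations of the same clip (placeholder values, in seconds).
user_1 = [(1.0, 2.5), (4.0, 5.0)]
user_2 = [(1.2, 2.4), (7.0, 8.0)]

# Greedy best-match scoring: each of user 1's annotations takes its best
# IoU against user 2's. The unmatched (4.0, 5.0) interval scores 0 and
# pulls the mean down, illustrating how outliers skew this metric.
scores = [max(interval_iou(a, b) for b in user_2) for a in user_1]
print("mean pairwise IoU:", sum(scores) / len(scores))
```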