BirdCLEF 2021 - Birdcall Identification

BirdCLEF 2021 is a data science competition for identifying bird calls in soundscape recordings. The goal is to be able to identify the species of bird making a call in a 5-second window in a soundscape audio track. The competition started on April 1, 2021, and the final submission deadline will be May 31, 2021.

The LifeCLEF 2021 conference will be held later this year from September 21-24, 2021. Placing in this contest could result in some exciting things.

2021-04-06

I’ve done a little bit of research into the topic. The training dataset contains over 350 different bird species, where each species has over one hundred one-minute audio clips that feature the bird call. Since the goal is to identify a 5-second clip, it seems like this problem boils down to building a cleaned-up training set from the audio. My approach will go something like this:

It will take some time to build a simple model, but I aim to spend about a month on this task. I have several other things going on outside of work that I need to dedicate a little bit of time to. I think this should take about 20-30 hours to get to my first submission.

I’ve done some initial investigation into motif-mining for time-series data. I’m going to take the approach of using the Matrix Profile, which has a lot of desirable properties. In particular:

The first two links describe SiMPle, which adapts the Matrix Profile to music analysis. It uses chroma energy normalized statistics (CENS) to compute the profile, at a rate of 2 to 10 CENS frames per second depending on the task.

In the last link, section 3.3.3 on Music Processing Case Study is the most relevant. It uses mSTAMP (a multi-dimensional generalization of the Matrix Profile) to discover multi-dimensional motifs from the Mel-spectrogram taken with 46ms STFT, 23ms STFT hop, and 32 Mel-scale triangular filters. It is able to distinguish both the chorus and the drum-beat, depending on the dimensionality of the pattern.
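For orientation, here is roughly what those settings look like in librosa, assuming its default 22050 Hz sample rate (so 1024 samples is about 46 ms and 512 samples about 23 ms); the file path is just one of the training clips:

import librosa

# Approximate the Mel-spectrogram settings from the case study: ~46 ms STFT
# window, ~23 ms hop, 32 Mel bands (values assume a 22050 Hz sample rate).
path = "../input/birdclef-2021/train_short_audio/acafly/XC109605.ogg"
data, sr = librosa.load(path)
mel = librosa.feature.melspectrogram(y=data, sr=sr, n_fft=1024, hop_length=512, n_mels=32)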

I did a little bit of testing using librosa and matrixprofile. I tried this out in a notebook to see what would come out.

import librosa
import matrixprofile as mp

%matplotlib inline

path = "../input/birdclef-2021/train_short_audio/acafly/XC109605.ogg"
data, sample_rate = librosa.load(path)

# CENS features over the whole clip (36 chroma bins instead of the default 12)
cens = librosa.feature.chroma_cens(y=data, sr=sample_rate, n_chroma=36)
# matrixprofile only handles 1-D series, so analyze a single chroma band
profile, figures = mp.analyze(cens[0])

The matrixprofile library is fairly complete, but it doesn’t contain algorithms for either mSTAMP or SiMPle. The R library tsmp has implementations for both of these, but R’s audio-processing libraries are limited (either audio or tuneR). I may end up doing the audio processing with librosa and the matrix profile computation and motif mining in R as a pipeline. If this gets out of hand, or when I need to run the process in Kaggle, I can re-implement the SiMPle algorithm in Python.
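If I do end up reimplementing it, a naive version of the SiMPle self-join isn’t much code. The sketch below computes plain Euclidean distances between flattened CENS windows in quadratic time rather than using the FFT-based MASS trick that the real simple_fast relies on, so it only illustrates the idea:

import numpy as np

def simple_self_join(cens, window_size):
    """Naive SiMPle-style self-join over a CENS matrix (n_chroma x n_frames).

    For every window, find the smallest distance to any other window outside
    an exclusion zone around itself. Quadratic time; simple_fast is faster.
    """
    n_chroma, n_frames = cens.shape
    n_subs = n_frames - window_size + 1
    excl = max(1, window_size // 2)
    # flatten every window into a vector: shape (n_subs, n_chroma * window_size)
    subs = np.stack([cens[:, i:i + window_size].ravel() for i in range(n_subs)])
    profile = np.empty(n_subs)
    index = np.empty(n_subs, dtype=int)
    for i in range(n_subs):
        dist = np.linalg.norm(subs - subs[i], axis=1)
        lo, hi = max(0, i - excl), min(n_subs, i + excl + 1)
        dist[lo:hi] = np.inf  # ignore trivial matches next to i
        index[i] = np.argmin(dist)
        profile[i] = dist[index[i]]
    return profile, index

The motif pair would then be the window with the smallest profile value together with its recorded nearest neighbor.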

2021-04-07

est: 1-2 hours

I spent my time today getting the environment set up on my local machine. I downloaded the 32GB dataset last night and tried unpacking it on my 1TB HDD, before eventually giving up since it was unpacking at a rate of around 13 MB/s. I moved this to my SSD.

figure: unzipping time

I set up my environment, which is Jupyter Lab configured with Python and R. On the Python end, I have a small script that dumps the .ogg files into serialized NumPy matrices (.npy). I wanted to reproduce the setup used with audio spectrograms via CENS at a rate of 2-10 frames per second. It turns out that the hop_length has to be a multiple of 2^6.

This function takes care of quantizing to the nearest integer multiple.

def cens_per_sec(sample_rate, target):
    """Ensure the hop length is a multiple of 2**6"""
    return (sample_rate // (target * (2 ** 6))) * (2 ** 6)
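For example, to target roughly 10 CENS frames per second at librosa's default 22050 Hz sample rate (the file and output paths here are just illustrative):

import numpy as np
import librosa

path = "../input/birdclef-2021/train_short_audio/acafly/XC109605.ogg"
data, sample_rate = librosa.load(path)
hop = cens_per_sec(sample_rate, 10)  # 22050 // (10 * 64) * 64 = 2176
cens = librosa.feature.chroma_cens(y=data, sr=sample_rate, hop_length=hop)
np.save("XC109605.npy", cens)        # serialized matrix for the R side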

I also set up R with Jupyter Lab by following this guide for installing the R kernel. I had to add the directory containing R.exe to my PATH and ensure that the current shell had my virtual environment configured. Then:

install.packages("devtools")
devtools::install_github("IRkernel/IRkernel")
IRkernel::installspec()

Afterwards, I can just open an R notebook in Jupyter Lab and run some pre-existing Matrix Profile algorithms. I am able to load the NumPy-serialized data through RcppCNPy.

2021-04-08

I only made minor progress today. I plot the data after running it through the SiMPle algorithm:

library(RcppCNPy)
library(tsmp)

fmat <- npyLoad("../data/cens/train_short_audio/acafly/XC109605.npy")
smp <- simple_fast(t(fmat), window_size=41, verbose=0)
plot(smp)

figure: plot of the SiMPle matrix profile

It turns out that the object is not usable with the majority of the tsmp library because of a typo in the class name, despite it having a perfectly good Matrix Profile and Profile Index for determining motifs and discords.

SiMPle Matrix Profile
---------------------
Profile size = 606
Dimensions = 12
Window size = 41
Exclusion zone = 21

Looking at the source code for find_motif in the tsmp library, we see that the class type is enforced by string names…

  if (!("MatrixProfile" %in% class(.mp))) {
    stop("First argument must be an object of class `MatrixProfile`.")
  }

🤦 There’s no space in the name. I forked the library to acmiyaguchi/tsmp and will make a PR to fix up this behavior for this algorithm. I’d rather not have to deal with some of the common functionality by hand, even if it’s simple, like finding the pairs of indices that correspond to a motif.

I’ve also been thinking about how to use the motifs to search for 5-second clips that I can use to build up my training dataset. Since I’ll only be getting a single motif per audio track, I’m going to extract motifs across the whole species dataset so that I have one motif per track. I can then analyze these motifs to see how well they stack up against each other.

After I find these initial motifs, I can comb through the rest of the audio to determine what other clips can be found. I can use the median absolute deviation to effectively slide over all of the tracks and find the positions where there are likely matches in the audio.
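Here is a rough sketch of that idea; the function name and the threshold are made up, and the distance is a plain Euclidean distance over CENS windows:

import numpy as np

def find_candidate_matches(motif, track_cens, k=3.0):
    """Slide a motif (n_chroma x window) over a track's CENS matrix and flag
    positions whose distance dips well below the track's typical distance,
    measured with the median absolute deviation."""
    window = motif.shape[1]
    n_subs = track_cens.shape[1] - window + 1
    dist = np.array([
        np.linalg.norm(track_cens[:, i:i + window] - motif)
        for i in range(n_subs)
    ])
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return np.where(dist < med - k * mad)[0]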

2021-04-09

While the mechanism wasn’t exactly what I thought it would be, it was similar enough. I learned a little bit about the inheritance mechanisms in R.

devtools::test(filter="simple")

Afterwards, I created a PR and installed my package from GitHub (after killing all existing kernels to avoid permission issues).

devtools::install_github("acmiyaguchi/tsmp", ref="simple")

Unfortunately, this did not work on a realistic matrix because the find_motif method also does nearest-neighbor computation, which requires calling MASS to determine which indices are most related to each motif. The SiMPle MASS algorithm is domain specific, so it did not fit directly into the existing API for other Matrix Profiles :(.

I decided to abandon the find_motif helper and just use the motif corresponding to the minimum in the profile. I can get to this using min_mp_index, which works fine on SiMPle. I would have used the thumbnail that I extracted from the motif pair (e.g. the 5 seconds of audio corresponding to the motif) to find nearest neighbors, but when I try to compute the join matrix-profile, it raises an index error:

Error in last_product[1:(data_size - window_size), j]: subscript out of bounds

This is a distraction at this point. Since I get two examples per track and there are roughly a hundred tracks per species, this should be enough to build an initial model. Once I get an initial model, I can try out a few different methods to get more training data. I really wish I had access to the join matrix-profile, but I can try other methods like training a classifier on the spectrograms using a sliding window of data.

I built an R script that I’ll be calling via Python to find the index of the motifs. These links were relevant as I put it together.
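The glue on the Python side is just a subprocess call to Rscript; the script name, arguments, and output format below are placeholders rather than the actual interface:

import json
import subprocess

def find_motif_indices(npy_path, window_size=50):
    """Call the R script and parse whatever it prints (assumed to be JSON)."""
    result = subprocess.run(
        ["Rscript", "find_motif.R", npy_path, str(window_size)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)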

Then I was able to build a notebook to listen to the motifs. Jupyter has a nice API to embed audio in a notebook. Here is a pair of motifs from audio of the Acadian Flycatcher.

Listen to the motif:
Listen to the motif's pair:
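For reference, embedding the audio is only a couple of lines; the path below is illustrative of wherever the extracted clips end up:

import librosa
from IPython.display import Audio, display

data, sr = librosa.load("../data/motifs/acafly/XC109605/motif.0.ogg")
display(Audio(data, rate=sr))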

2021-04-10

I scripted together the above to extract motifs from the acafly directory. For each track, I have the following files:

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         4/11/2021  12:31 AM            362 metadata.json
-a----         4/11/2021  12:31 AM         435320 motif.0.npy
-a----         4/11/2021  12:31 AM          32156 motif.0.ogg
-a----         4/11/2021  12:31 AM         435320 motif.1.npy
-a----         4/11/2021  12:31 AM          31789 motif.1.ogg

The ogg files contain the 5-second clips for the motif pair. The numpy files are the CENS-transformed data for those clips. The metadata file contains information about the different parameters used to extract the motif:

{
  "source_name": "train_short_audio/acafly/XC533302.ogg",
  "cens_sample_rate": 10,
  "matrix_profile_window": 50,
  "cens_0": 50,
  "cens_1": 1,
  "motif_0_i": 108797,
  "motif_0_j": 217595,
  "motif_1_i": 2175,
  "moitif_1_j": 110973,
  "sample_rate": 22050,
  "duration_seconds": 17.07,
  "duration_cens": 173,
  "duration_samples": 376441
}
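My reading of the metadata is that motif_0_i / motif_0_j (and the motif_1 fields) are sample offsets of the motif and its pair within the source audio, so recovering a clip should look roughly like this (an assumption, not a guarantee):

import json
import librosa

with open("metadata.json") as fp:
    meta = json.load(fp)

# Assumes the motif indices are sample offsets into the source recording.
data, sr = librosa.load("../input/birdclef-2021/" + meta["source_name"],
                        sr=meta["sample_rate"])
start = meta["motif_0_i"]
clip = data[start:start + 5 * sr]  # 5-second window starting at the motif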

It looks like I’ll have to go through each file by hand to determine whether it’s a legitimate bird call or not. I’ve noticed that some clips have a motif that is noise or another natural sound. It seems I’m going to become a bird-call novice. It should be straightforward to build a web application to do the labeling of the data. How much of the code I’ll be able to re-use is up in the air, but I’d rather not have to deal with all the information in a spreadsheet.

Once I can deal with these particular audio files, I can start to build an even more extensive dataset to determine whether a file has a bird chirp in it or not. I’ll aim for something like an extra 100 examples randomly sampled from the set that I can trawl through. Building a simple classifier will be good practice for building more complex ones later.
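As a starting point, the simple classifier could be as plain as logistic regression on flattened CENS features; the file paths and array layout here are assumptions about how I’ll export the labels:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each row of X is a flattened CENS matrix for a labeled 5-second clip;
# y is 1 for "bird call" and 0 for "noise/other sound". Paths are assumed.
X = np.load("../data/labels/features.npy")
y = np.load("../data/labels/labels.npy")

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())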

I think once I build this initial dataset, it will be useful to come back and reimplement (or fix) the SiMPle algorithm so I can do two things:

The classifier I build on this first one will help filter out the noisy values, while the latter will help me find more examples throughout the dataset. This approach should scale well for finding good examples.

2021-04-11

I took the motifs that I generated for acafly and osprey and created a web application that I’m going to use to explore and label the data. I used SvelteKit to build this, since I’ve had experience with both Svelte and Sapper. I’m pleased with the template and how little frill is involved. It’s going to take some time before I get used to the paradigm.

figure: screenshot of the current page

There are minor issues, like not being able to import the node-glob module and having to explicitly set preload="none" on the audio elements, but otherwise it’s been smooth sailing.

I think creating a training dataset based on these motifs and the system for labeling the data will be my main contribution to the competition. I’m not so confident that I’ll be able to build better models than the previous year’s entries. I’m probably going to rely on an ensemble model that uses many simple linear models under the hood, because I’m not sure I can figure out the intricacies of CNN models.

I also read a couple of the previous BirdCLEF papers, which I’ll link here for reference.

2021-04-11 Adding metadata to the website

est: 2.5 hours

Did a little more on the site today. First, I took a look at the training metadata and created documents that I can use for building the site. In the info.json:

{
  "primary_label": "tropew1",
  "scientific_name": "Contopus cinereus",
  "common_name": "Tropical Pewee"
}

And an entry from the metadata.json:

{
  "name": "XC11419",
  "secondary_labels": [],
  "type": ["call"],
  "date": "2004-06-05",
  "latitude": -22.5001,
  "longitude": -42.7167,
  "rating": 4.5,
  "url": "https://www.xeno-canto.org/11419"
}
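Both documents are derived from the competition’s training metadata; the generation step looks something like the sketch below, assuming train_metadata.csv has columns matching the fields shown above and an output layout of my own choosing:

import json
from pathlib import Path
import pandas as pd

meta = pd.read_csv("../input/birdclef-2021/train_metadata.csv")

# One info.json per species; the output directory layout is an assumption.
for label, group in meta.groupby("primary_label"):
    row = group.iloc[0]
    out_dir = Path("../app/static") / label
    out_dir.mkdir(parents=True, exist_ok=True)
    info = {
        "primary_label": label,
        "scientific_name": row["scientific_name"],
        "common_name": row["common_name"],
    }
    (out_dir / "info.json").write_text(json.dumps(info, indent=2))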

Then, I went about joining these pieces of information in the application. I brought in a table component so the number of things being rendered on the page at once is limited to 10 or so entries.

figure: screenshot of the table component

The next thing I need to do is add a way to label whether an entry is good or bad, along with a way to export those labels. I want to quickly start building a classifier to determine whether a clip is a bird call or not, so I can work toward my first submission.