In this article, we'll talk about audio processing in Python. Let's diverge a little from the natural language processing and text analysis side of Python and machine learning. Today, I'm going to discuss a Python audio processing library called librosa.
What is librosa?
Librosa is a Python package for music and audio analysis. It provides the building blocks needed to create music information retrieval systems.
Audio Processing in Python
Now that you know which library we're going to use for our audio processing task, let's move ahead and process an mp3 audio file with it.
1. Installing Librosa for Audio Processing in Python
We can easily install librosa with the pip command:
pip install librosa
Let’s load in a short mp3 file (You can use any mp3 file for this demonstration):
import librosa

y, sr = librosa.load('/content/Kids Cheering - Gaming Sound Effect (HD) (128 kbps).mp3')
2. Processing audio as time series
In the line above, the load function reads the mp3 audio as a time series. Here, sr stands for sampling rate.
If you want a refresher on time series, go here: Time Series Data and Machine Learning.
- The time series is represented as a one-dimensional NumPy array.
- The sample rate is the number of samples per second of audio.
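To make the relationship between samples, sample rate, and duration concrete, here is a small NumPy sketch; the synthetic sine wave is a stand-in for the array that librosa.load would return from a real file:

```python
import numpy as np

sr = 22050                    # samples per second (librosa's default rate)
duration_s = 3.0              # three seconds of audio
t = np.arange(int(sr * duration_s)) / sr    # time axis in seconds
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)     # 440 Hz sine wave as a stand-in signal

# The time series is just a 1-D float array; its length divided by the
# sample rate recovers the duration in seconds.
print(y.shape)        # (66150,)
print(len(y) / sr)    # 3.0
```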
Audio is mixed down to mono and resampled to 22050 Hz by default at load time. This behavior can be overridden by passing additional arguments to librosa.load (for example, sr=None to keep the file's native sampling rate, or mono=False to keep the stereo channels).
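To see what that default behavior amounts to, here is a rough NumPy sketch of mixing stereo down to mono and resampling by linear interpolation. Librosa uses a higher-quality resampler internally, so treat this only as a conceptual approximation, with made-up random data standing in for a real recording:

```python
import numpy as np

orig_sr, target_sr = 44100, 22050

# Fake one second of stereo audio: shape (channels, n_samples).
stereo = np.random.default_rng(0).uniform(-1, 1, size=(2, orig_sr))

# Mixing to mono: average the two channels sample by sample.
mono = stereo.mean(axis=0)

# Naive resampling: linearly interpolate onto a coarser time grid.
old_t = np.arange(len(mono)) / orig_sr
new_t = np.arange(int(len(mono) * target_sr / orig_sr)) / target_sr
resampled = np.interp(new_t, old_t, mono)

print(mono.shape)       # (44100,)
print(resampled.shape)  # (22050,)
```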
3. Retrieve the features of an audio file
There are some important features of an audio sample, that we’ll quickly discuss:
Some forms of musical patterns have a very simple fundamental rhythm, while others have a more nuanced or implied one.
- Tempo: the pace at which your patterns repeat. Tempo is measured in beats per minute (BPM). So when we say a piece of music is at 120 BPM, we mean there are 120 beats (pulses) every minute.
- Beat: a basic unit of time. It is essentially the pulse you clap along to in a song. In 4/4 time, for instance, you get four beats per bar.
- Bar: a logical grouping of beats. Bars usually contain 3 or 4 beats, although other groupings are possible.
- Step: I typically see this in composition software such as step sequencers. It is common to have a sequence of notes of equal length, such as 8 sixteenth notes; the interval between consecutive notes is the step. You usually set the step to sixteenth notes, eighth notes, triplets, or quarter notes.
- Rhythm: the pattern of musical sounds over time. Take all the notes in a phrase, and their timing is the rhythm.
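The tempo and bar definitions above reduce to simple arithmetic; a quick sketch, using the 120 BPM figure from the text:

```python
bpm = 120                     # beats per minute
seconds_per_beat = 60 / bpm   # 0.5 s between pulses at 120 BPM

beats_per_bar = 4             # 4/4 time: four beats per bar
seconds_per_bar = seconds_per_beat * beats_per_bar

print(seconds_per_beat)  # 0.5
print(seconds_per_bar)   # 2.0
```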
We can get the tempo and beats from the audio:
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
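beat_track returns the estimated tempo in BPM and the beat positions as frame indices. Librosa provides librosa.frames_to_time to convert these to seconds, and the conversion itself is just frames × hop_length ÷ sr, as this NumPy sketch shows (the frame indices here are made up for illustration; 512 is librosa's default hop length):

```python
import numpy as np

sr = 22050          # sample rate used when loading the audio
hop_length = 512    # librosa's default hop (samples between frames)

# Hypothetical beat frame indices, like those returned by beat_track.
beat_frames = np.array([43, 86, 129, 172])

beat_times = beat_frames * hop_length / sr  # positions in seconds
print(np.round(beat_times, 3))  # [0.998 1.997 2.995 3.994]
```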
4. Mel Frequency Cepstral Coefficients (MFCC)
Mel Frequency Cepstral Coefficients are among the most important features in audio processing. MFCCs are a topic of their own, so instead, here's the Wikipedia page for you to refer to.
The MFCC is a matrix of values that captures the timbral aspects of a sound, like how wooden guitars and metal guitars sound a little different. Because the mel scale approximates human hearing, MFCCs capture qualities that other measures miss.
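The "similar to human hearing" part comes from the mel scale, which spaces frequencies the way listeners perceive pitch distances. One common formula is the HTK variant sketched below (librosa's default actually uses the slightly different Slaney formulation, so this is illustrative rather than librosa's exact math):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz shrink on the mel scale as frequency grows,
# mirroring how hearing is more sensitive to low-frequency differences.
print(round(hz_to_mel(700.0), 1))   # 781.2
print(round(hz_to_mel(8000.0), 1))  # 2840.0
```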
import seaborn as sns

hop_length = 512  # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
sns.heatmap(mfcc_delta)
Here we are creating a heatmap from the MFCC delta values, which gives us the output below:
Computing a chromagram from the harmonic component of the signal, we get:

y_harmonic, y_percussive = librosa.effects.hpss(y)
chromagram = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)
sns.heatmap(chromagram)
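A chromagram folds all frequencies down into the 12 pitch classes (C, C#, ..., B), discarding octave information. The mapping from a frequency to its pitch class is just a log-base-2 calculation, sketched here (C0 ≈ 16.352 Hz is the standard reference pitch):

```python
import math

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F',
                 'F#', 'G', 'G#', 'A', 'A#', 'B']
C0_HZ = 16.352  # frequency of C0, the reference pitch

def pitch_class(f_hz: float) -> str:
    """Map a frequency to its chroma bin (pitch class, octave ignored)."""
    semitones = round(12 * math.log2(f_hz / C0_HZ))
    return PITCH_CLASSES[semitones % 12]

print(pitch_class(440.0))   # A  (A4)
print(pitch_class(880.0))   # A  (A5 folds onto the same chroma bin)
print(pitch_class(261.63))  # C  (middle C)
```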
I hope this gave you some ideas about how to extract audio features for use with different deep learning algorithms.
Continue to follow our machine learning in Python tutorials; we have a lot more coming in the near future. If you are a beginner in Python and accidentally landed here (you won't be the first!), take a look at the Python tutorial for beginners.