
Music genre classification with the GTZAN dataset

Genres are convenient for browsing music libraries. Generally speaking, songs belonging to the same musical genre feature similar instrumentation, rhythmic/harmonic structure and lyrical themes. If we allow ourselves to keep the genres simple, we can train a classifier to predict track-level genre membership purely from audio content. In this notebook we'll do the "hello world" of genre tagging, popularized by the seminal 2002 paper "Musical genre classification of audio signals" by George Tzanetakis and Perry Cook. I've quoted their abstract below if you're interested:

Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members. These characteristics typically are related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into an hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content and pitch content are proposed. The performance and relative importance of the proposed features is investigated by training statistical pattern recognition classifiers using real-world audio collections. Both whole file and real-time frame-based classification schemes are described. Using the proposed feature sets, classification of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.

In [ ]:
import os
import enum
import glob

import tqdm
import tensorflow as tf
import librosa as lr
import numpy as np
import pandas as pd

assert tf.__version__ == "2.0.0" and tf.test.is_gpu_available()

Exploratory data analysis (EDA)

In [ ]:
def load_metadata() -> pd.DataFrame:
    paths = glob.glob("../input/gtzan-genre-collection/genres/*/*.au")
    df = pd.DataFrame({"path": paths})
    df["genre"] = df.path.apply(lambda x: x.rsplit("/")[-2])
    df["duration"] = df.path.apply(lambda x: lr.get_duration(filename=x))
    df["samplerate"] = df.path.apply(lr.get_samplerate)
    return df


metadata = load_metadata()
groups = metadata.groupby("genre")
pd.DataFrame({
    'Tracks': groups.path.count(),
    'Duration': groups.duration.sum()
}).plot.bar(subplots=True, legend=None);

As we can see above, the dataset is perfectly balanced, so we don't have to worry about majority/minority genres. Let's immediately set aside a test set so we can estimate performance on unseen tracks, which is ultimately how we would expect our classifier to be used in production. We will not touch these test tracks during development.

In [ ]:
test = metadata.groupby("genre").head(10)
train = metadata.drop(index=test.index)
assert set(test.path).isdisjoint(train.path)
d = test.groupby("genre").duration
pd.DataFrame({'Total Duration': d.sum(), 'Total Tracks': d.count()})

ETL

Historically, many hand-crafted audio features have been proposed for tasks like this. Today, however, traditional features such as the zero-crossing rate or the spectral centroid have largely been replaced by automatic feature extraction from filtered spectrograms (or even directly from waveforms, at the bleeding edge of research).
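
For reference, here's a quick sketch of what a couple of those traditional features look like when computed with librosa; the crop length and the summary statistics are arbitrary choices of mine, not part of the pipeline we'll actually build.

row = metadata.sample().to_dict('records')[0]
waveform, samplerate = lr.load(row['path'], duration=10.0)

# Frame-level features, each with shape (n_coefficients, n_frames).
zcr = lr.feature.zero_crossing_rate(waveform)
centroid = lr.feature.spectral_centroid(y=waveform, sr=samplerate)
mfcc = lr.feature.mfcc(y=waveform, sr=samplerate, n_mfcc=13)

# Collapse each one into a single track-level statistic.
pd.Series({
    'zcr_mean': zcr.mean(),
    'centroid_mean_hz': centroid.mean(),
    'mfcc_1_mean': mfcc[0].mean(),
})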

Based on what everyone else seems to be doing today, my empirical hunch is that a spectrogram representation is a good starting point for genre classification. We'll let the model "blur/sharpen" adjacent time-frequency coefficients as deemed beneficial by gradient descent while it minimizes misclassifications.

While learning mid-level features on the fly like this can be a bit hard to digest, it's the core principle of deep learning: we let the model itself choose feature representations that are useful for its end goal. There are naturally many best practices and details involved within this framework, but we'll keep things simple for now.

In [ ]:
from librosa.display import specshow


def load(path: str, frames=128) -> np.ndarray:
    """Load .au file into a log-scaled mel spectrogram."""
    duration = None
    offset = None
    if frames is not None:
        duration = (512 * frames - 1) / 22050
        offset = np.random.uniform(0.0, lr.get_duration(filename=path) - duration)
    waveform, samplerate = lr.load(path, offset=offset, duration=duration)
    assert samplerate == 22050
    mel = lr.feature.melspectrogram(waveform)
    logmel = lr.power_to_db(mel)
    return logmel


row = metadata.sample().to_dict('records')[0]
spectrogram = load(row['path'])
specshow(spectrogram, x_axis='time', y_axis='hz')
row['path'], spectrogram.min(), spectrogram.max(), spectrogram.mean()
In [ ]:
Genre = enum.Enum("Genre", metadata.genre.unique().tolist())


def embed(genre):
    try:
        genre = genre.decode()
    except AttributeError:
        pass
    i = Genre[genre].value - 1
    return tf.one_hot(i, len(Genre))


for x in Genre:
    print(embed(x.name), x.name)
In [ ]:
@tf.function(input_signature=[tf.TensorSpec([], tf.string)])
def embed_tf(genre):
    onehot = tf.numpy_function(embed, [genre], tf.float32)
    onehot.set_shape((10,))
    return onehot


@tf.function(input_signature=[tf.TensorSpec([], tf.string)])
def load_tf(path):
    spectrogram = tf.numpy_function(load, (path,), tf.float32)
    spectrogram.set_shape((128, 128))
    return spectrogram


def load_example(row):
    row['spectrogram'] = load_tf(row["path"])
    row['genre'] = embed_tf(row["genre"])
    return row


def create_dataset(metadata: pd.DataFrame) -> tf.data.Dataset:
    df = metadata.copy()
    df['index'] = df.index
    dataset = (
        tf.data.Dataset.from_tensor_slices(dict(df))
        .shuffle(len(metadata))
        .map(load_example, -1)
        .prefetch(-1)
    )
    return dataset


# Materialize dataset in RAM for later reuse.
random_crops = 10
cached_dataset = (
    create_dataset(metadata)
    .repeat(random_crops)
    .cache()
)
for row in tqdm.tqdm(cached_dataset):
    pass

As a performance trick we'll reuse a global, RAM-cached dataset for every epoch/subset. I wouldn't recommend this for larger datasets, but since GTZAN is so small this is totally fine. There is a substantial speedup after the first epoch due to the memory cache (instead of roughly 2 minutes of audio decoding into logmel spectrograms, we spend around one second for the full dataset). For bigger datasets, we should render TFRecords to S3 with Apache Beam instead; a rough sketch of that follows after the timing cell below.

In [ ]:
for row in tqdm.tqdm(cached_dataset):
    pass
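
As a rough sketch of the TFRecord route (using plain tf.io here rather than a Beam pipeline, and writing to a made-up local filename instead of S3), each precomputed spectrogram and its genre label could be serialized into a tf.train.Example like this:

def to_example(spectrogram: np.ndarray, genre: str) -> tf.train.Example:
    # Serialize the spectrogram tensor and the genre string into protobuf features.
    feature = {
        'spectrogram': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(spectrogram).numpy()])),
        'genre': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[genre.encode()])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# 'gtzan-train.tfrecord' is a hypothetical local path; in practice we would
# shard the records and write them to S3 from the Beam pipeline.
with tf.io.TFRecordWriter('gtzan-train.tfrecord') as writer:
    for _, row in train.iterrows():
        writer.write(to_example(load(row.path), row.genre).SerializeToString())

Reading the records back would then be a matter of tf.data.TFRecordDataset plus tf.io.parse_single_example and tf.io.parse_tensor, which I'll leave out here.
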
In [ ]:
@tf.function
def collate(row):
    features = row['spectrogram']
    labels = row['genre']
    return features, labels


def split_dataset(training_index, validation_index):

    def isin(row, rows):
        i = tf.cast(row['index'], tf.int64)
        return tf.reduce_any(tf.equal(i, rows))

    def is_train(row):
        return isin(row, training_index)

    def is_val(row):
        return isin(row, validation_index)

    training_dataset = (
        cached_dataset
        .filter(is_train)
        .shuffle(4096)
        .batch(32)
        .map(collate)
    )

    validation_dataset = (
        cached_dataset
        .filter(is_val)
        .batch(32)
        .map(collate)
    )

    return training_dataset, validation_dataset


datasets = split_dataset(metadata.index, metadata.index)
for dataset in datasets:
    for row in tqdm.tqdm(dataset):
        pass

Model training

In [ ]:
from tensorflow.keras.layers import *


def create_model(name: str) -> tf.keras.Model:
    x = tf.keras.Input([128, 128], name="spectrogram")
    inputs = [x]

    x = tf.expand_dims(x, axis=-1)
    for i in range(5):
        x = Conv2D(32*2**i, 3, use_bias=False)(x)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)
        x = MaxPool2D()(x)
    x = GlobalAveragePooling2D()(x)
    x = Dense(10, activation="softmax", name="genre")(x)
    outputs = [x]

    model = tf.keras.Model(inputs, outputs, name=name)
    model.compile(
        optimizer=tf.optimizers.Adam(), 
        loss=tf.losses.CategoricalCrossentropy(),
        metrics=[tf.metrics.CategoricalAccuracy()])
    return model


create_model('test').summary()
In [ ]:
from tensorflow.keras.callbacks import *
from sklearn.model_selection import StratifiedKFold

folder = StratifiedKFold(5, shuffle=True)
for i, (t, v) in enumerate(folder.split(train.index, train.genre)):
    training_index = train.iloc[t].index
    validation_index = train.iloc[v].index

    dataset, validation_dataset = split_dataset(training_index, validation_index)
    model = create_model(name=f'genre_classifier_{i}')
    model.fit(
        dataset,
        validation_data=validation_dataset,
        epochs=100,
        callbacks=[
            ReduceLROnPlateau(factor=0.5, patience=3),
            EarlyStopping(patience=5),
            ModelCheckpoint(model.name, save_weights_only=True, save_best_only=True),
        ],
    )

Making predictions

In [ ]:
models = [create_model(f'genre_classifier_{i}') for i in range(folder.get_n_splits())]
for model in models:
    model.load_weights(model.name)
In [ ]:
random_crops = 8
batch_size = 128
test_dataset = create_dataset(test).repeat(random_crops).batch(batch_size)

results = []
for minibatch in tqdm.tqdm(test_dataset):
    features, labels = collate(minibatch)
    paths = minibatch['path'].numpy()

    for model in models:
        y = model.predict_on_batch(features).numpy()
        for y_true, y_pred, path in zip(labels.numpy(), y, paths):
            result = {
                'model': model.name,
                'path': path,
                'y_pred': y_pred,
                'y_true': y_true,
            }
            results.append(result)

results = pd.DataFrame(results)
In [ ]:
y_pred = results.groupby(['model', 'path']).y_pred.apply(np.mean).apply(np.argmax).groupby('path').median()
y_true = results.groupby('path').y_true.apply(np.sum).apply(np.argmax)
In [ ]:
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=[x.name for x in Genre]))
In [ ]:
from seaborn import heatmap
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
ax = heatmap(cm, annot=True, square=True, xticklabels=[x.name for x in Genre], yticklabels=[x.name for x in Genre], cmap='hot')
ax.set_xlabel("Predicted")
ax.set_ylabel("Labeled")

Inspecting misclassifications

As a final quality check, let's listen to the "misclassified" songs. Since this dataset is small enough, we can afford ourselves the luxury of inspecting every mistake on the test set. However, when doing this we have to be very careful about revisiting our modelling and exploiting any insights, as we would then be unable to estimate our actual performance on unseen data unless we gathered a fresh test set.

Anyhow, to my ears many of these are very reasonable mistakes, and say more about the hopelessness of assigning songs to mutually exclusive genre buckets. It's probably better to treat music genres as multi-label tags, or even as a ranking problem, since genres aren't equidistant categories.

Qualitatively, we do seem to underperform slightly on reggae, however. More folds, common augmentation strategies like time-stretching, pitch-shifting and reverberation, and of course more training examples would potentially fix this.
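
For illustration, a minimal waveform-level augmentation along those lines could look like the sketch below; the stretch and pitch ranges are arbitrary guesses of mine, not tuned values, and the augmented waveform would then go through the same melspectrogram pipeline as in load() above.

def augment(waveform: np.ndarray, samplerate: int = 22050) -> np.ndarray:
    # Randomly slow down/speed up and transpose the waveform a little.
    rate = np.random.uniform(0.9, 1.1)     # roughly +/- 10% tempo change
    steps = np.random.uniform(-2.0, 2.0)   # up to +/- 2 semitones
    augmented = lr.effects.time_stretch(waveform, rate=rate)
    return lr.effects.pitch_shift(augmented, sr=samplerate, n_steps=steps)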

Note also that tracks often belong to several genres, and sometimes different parts of a track belong to different genres. Genres are also often considered part of a hierarchy, sometimes in a historical context and sometimes in a musical one. This again points towards treating genre classification as a ranking problem over non-exclusive genre tags.
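
If we had multi-label tag annotations (which GTZAN does not provide), the change to the model would be small: swap the softmax head for independent sigmoids trained with binary cross-entropy. A sketch, with a hypothetical num_tags, might look like this:

def create_multilabel_model(name: str, num_tags: int) -> tf.keras.Model:
    x = tf.keras.Input([128, 128], name="spectrogram")
    inputs = [x]

    # Same convolutional trunk as create_model() above.
    x = tf.expand_dims(x, axis=-1)
    for i in range(5):
        x = Conv2D(32 * 2 ** i, 3, use_bias=False)(x)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)
        x = MaxPool2D()(x)
    x = GlobalAveragePooling2D()(x)

    # One independent sigmoid per tag instead of a single softmax over genres.
    x = Dense(num_tags, activation="sigmoid", name="tags")(x)
    outputs = [x]

    model = tf.keras.Model(inputs, outputs, name=name)
    model.compile(
        optimizer=tf.optimizers.Adam(),
        loss=tf.losses.BinaryCrossentropy(),
        metrics=[tf.metrics.AUC()])
    return model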

Relatedly, the GTZAN dataset itself has well-documented faults (repetitions, mislabelings, distortions), catalogued in https://arxiv.org/abs/1306.1461, whose abstract I've quoted below:

The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.

Two things are worth noting in closing.

  1. With DNNs on spectrograms we can beat the original 61%.
  2. The dataset is balanced and all genres are treated as equidistant classes, which is probably not what a human would want. If the track was labeled as a metal song, it's less wrong to predict rock than jazz.

Anyway, let's wrap up by listening to the misclassified tracks.

In [ ]:
from IPython.display import Audio

mistakes = y_true != y_pred
misclassified = pd.DataFrame({'y_true': y_true[mistakes], 'y_pred': y_pred[mistakes]})

for row in misclassified.itertuples():
    name = os.path.basename(row.Index.decode())
    waveform, samplerate = lr.load(row.Index, duration=5.0)
    guessed_genre = Genre(row.y_pred + 1).name
    labeled_genre = Genre(row.y_true + 1).name
    display(
        f"{name}: predicted {guessed_genre} but should have been {labeled_genre}.",
        Audio(waveform, rate=samplerate),
    )
