noise2music-inspired-automatic-music-captioning

2024-12-09 23:56

Carl Thomé

noise2music-inspired automatic music captioning¶

In noise2music, the training dataset is created by pseudo-labeling a vast collection of unlabeled music audio using two advanced deep learning models. A large language model generates a diverse set of general music-related descriptive sentences to serve as potential captions. These captions are then matched to individual music clips through zero-shot classification, leveraging a pre-trained joint embedding model designed for music and text.

So being curious, let's try the following:

Generate a lot of music descriptions with a Meta Llama 3.2 LLM.
Embed the generated music descriptions with a LAION CLAP text encoder.
Index the text embeddings for nearest neighbor retrieval with FAISS.
Use the corresponding audio encoder to embed an audio example.
Use the audio embedding as search query for retrieving text embeddings.

Could this simple method produce reasonable audio captions?

In [1]:

pip install -q datasets faiss-cpu

In [2]:

import torch
import faiss
import transformers
import datasets
import polars as pl
import librosa as lr
import numpy as np
import tqdm.auto as tqdm
import seaborn as sns

# Configure plotting.
pl.Config.set_fmt_str_lengths(256)
sns.set_style("ticks")
sns.set_theme("notebook")

# Download some example audio files.
dataset = datasets.load_dataset("marsyas/gtzan", trust_remote_code=True)

# Download a pretrained text generation model.
text_generator = transformers.pipeline(
    task="text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Download a pretrained CLAP model.
clap_model = transformers.ClapModel.from_pretrained("laion/larger_clap_general")
clap_processor = transformers.ClapProcessor.from_pretrained("laion/larger_clap_general")

Device set to use cuda:0

In [6]:

descriptions = pl.read_parquet("music_descriptions.parquet")
descriptions

Out[6]:

shape: (100, 1)

generated_text
str
"The piece I'm describing is a hauntingly beautiful electronic pop track that blends elements of ambient, synth-pop, and indie electronic music. It begins with a delicate"
"This piece of music has a prominent, pulsing electronic beat, accompanied by the thumping bassline of a drum machine. The synths provide a bright,"
"The piece I'm describing is a blend of electronic dance music and hip-hop, with a prominent bassline and driving beat. The instrumentation features a combination of synthes"
"The piece in question is a fusion of electronic and rock elements, featuring a prominent bass line, driving drum patterns, and intricate synthesizer work. The instrumentation is"
"This piece of music is a blend of indie rock and electronic elements. It begins with a driving beat, accompanied by the prominent use of synthesizers and a puls"
…
"Imagine a mesmerizing piece of music that combines elements of electronic dance music, indie rock, and world music. The sound is a dynamic blend of pulsating syn"
"Imagine a contemporary electronic dance track with a haunting, atmospheric quality. The foundation is provided by a prominent bassline, driving a rhythmic pattern that's both driving"
"This piece of music is a high-energy electronic dance track. It features a prominent synthesizer riff, often accompanied by a driving kick drum and energetic percussion. The"
"The piece I've chosen is a 2010 electronic dance track by Swedish DJ, Avicii. This song has a typical electronic dance music (ED"
"This piece of music is a fusion of electronic and orchestral elements, featuring a prominent synthesizer as the main instrument. It begins with a soft, pulsing"

In [4]:

# Load existing descriptions.
try:
    descriptions = pl.read_parquet("music_descriptions.parquet")
except FileNotFoundError:
    pass

# Generate a lot of music descriptions.
messages = [
    {"role": "system", "content": "You are a music reviewer who is specific, brief and accurate. You work with helping people find music."},
    {"role": "user", "content": "Imagine any random piece of popular music and describe how it sounds in passive voice. Mention instruments, genres and vibes. Don't mention titles or artists."},
]
for _ in tqdm.trange(100, desc="Generating descriptions"):
    descriptions = text_generator(
        messages,
        num_return_sequences=100,
        return_full_text=False,
        do_sample=True,
        num_beams=1,
        max_new_tokens=32,
    )


descriptions = pl.DataFrame(descriptions)

# Save the descriptions to file.
descriptions.write_parquet("music_descriptions.parquet")

# Show a few examples.
descriptions.sample(n=3)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Out[4]:

shape: (3, 1)

generated_text
str
"This piece of music is a blend of electronic dance music (EDM) and indie rock, with a dash of synth-pop. It features a prominent, puls"
"This piece of music is a fusion of electronic and indie rock elements. It starts with a prominent piano melody, accompanied by a minimalist drum pattern and a soft,"
"This piece of music is a fusion of electronic and orchestral elements, featuring a prominent synthesizer as the main instrument. It begins with a soft, pulsing"

In [ ]:

num_dimensions = clap_model.config.projection_dim
index = faiss.IndexHNSWFlat(num_dimensions)

In [ ]:

# Tokenize text descriptions.
inputs = clap_processor(text=descriptions["generated_text"].to_list(), return_tensors="pt", padding=True)

# Populate local vector database.
batch_size = 8
for i in tqdm.trange(0, len(inputs["input_ids"]), batch_size, desc="Indexing descriptions"):
    input_ids = inputs["input_ids"][i:i + batch_size]
    attention_mask = inputs["attention_mask"][i:i + batch_size]

    # Embed the tokens.
    text_embeddings = clap_model.get_text_features(input_ids, attention_mask)

    # Add embeddings to the index.
    index.add(text_embeddings.numpy(force=True))

In [ ]:

# Load an example audio file.
audio_file = lr.example("trumpet")
waveform, samplerate = lr.load(audio_file, sr=clap_processor.feature_extractor.sampling_rate)

# Compute audio embedding.
inputs = clap_processor(audios=waveform, return_tensors="pt", sampling_rate=clap_processor.feature_extractor.sampling_rate)
audio_embedding = clap_model.get_audio_features(**inputs)
audio_embedding.shape

In [ ]:

# Search for similar embeddings.
max_results = 1000
similarities, neighbor_ids = index.search(audio_embedding.numpy(force=True), k=max_results)
similarities.shape

In [ ]:

# Lookup underlying text descriptions.
matches = descriptions[neighbor_ids[0][:max_results]].with_columns(pl.Series("scores", similarities[0][:max_results]))
matches.top_k(5, by="scores")

In [ ]:

sns.histplot(similarities[0], bins=max_results//100);