audioldm-stable-audio-open-musicgen

2024-09-09 08:09

Carl Thomé

Tiny test of recent text-to-music (TTM) models¶

To run this notebook you need to do three things:

1. Make sure the Colab runtime has an NVIDIA GPU available, because CUDA is assumed.
2. Request access to Stable Audio Open and create a corresponding access token to paste into the Hugging Face login screen below.
3. Pray to the software dependency gods that the pip install below still works.
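
Rather than relying on prayer alone, the first and third requirements can be checked before running anything heavy. The following is a minimal sketch, not part of the original notebook: the `REQUIRED` list is an assumption based on the imports used in the cells below, and `missing_packages` and `cuda_available` are hypothetical helper names.

```python
import importlib.util

# Top-level import names used in this notebook (assumed from the cells below).
REQUIRED = [
    "diffusers",
    "transformers",
    "torchsde",
    "torch",
    "soundfile",
    "pandas",
    "huggingface_hub",
    "IPython",
]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing packages:", missing_packages(REQUIRED) or "none")
    print("CUDA available:", cuda_available())
```

If anything is listed as missing, or CUDA is reported as unavailable, the model cells further down are likely to fail.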

Setup¶

In [1]:

pip install diffusers transformers torchsde

Requirement already satisfied: diffusers in /usr/local/lib/python3.10/dist-packages (0.30.2)
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.44.2)
Requirement already satisfied: torchsde in /usr/local/lib/python3.10/dist-packages (0.2.6)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.10/dist-packages (from diffusers) (8.4.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from diffusers) (3.15.4)
Requirement already satisfied: huggingface-hub>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from diffusers) (0.24.6)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from diffusers) (1.26.4)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from diffusers) (2024.5.15)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from diffusers) (2.32.3)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from diffusers) (0.4.4)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from diffusers) (9.4.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.5)
Requirement already satisfied: scipy>=1.5 in /usr/local/lib/python3.10/dist-packages (from torchsde) (1.13.1)
Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from torchsde) (2.4.0+cu121)
Requirement already satisfied: trampoline>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from torchsde) (0.1.2)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.2->diffusers) (2024.6.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.2->diffusers) (4.12.2)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (1.13.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (3.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (3.1.4)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata->diffusers) (3.20.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (3.8)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (2024.8.30)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.6.0->torchsde) (2.1.5)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.6.0->torchsde) (1.3.0)

In [2]:

import numpy
import scipy
import torch
import pandas as pd
import soundfile as sf
import IPython.display as ipd

In [3]:

from huggingface_hub import login

login()

Models¶

In [4]:

# This prompt is used for all models below so we can compare how they sound.
prompt = "relaxing piano music with a banjo solo and lo-fi beats"

AudioLDM¶

In [5]:

from diffusers import AudioLDMPipeline

repo_id = "cvssp/audioldm-s-full-v2"
audioldm = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
audioldm = audioldm.to("cuda")

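# Few diffusion steps keep this test fast; more steps generally improve quality.
audio = audioldm(prompt, num_inference_steps=10, audio_length_in_s=30.0).audios[0]

# AudioLDM generates 16 kHz mono audio.
sf.write("audioldm.ogg", audio, samplerate=16000)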
ipd.Audio("audioldm.ogg")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(

Out[5]:

MusicGen¶

In [6]:

from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
musicgen = musicgen.to("cuda")

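inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
# 1503 new tokens is roughly 30 seconds of audio at MusicGen's 50 Hz token rate.
# [0].T takes the first batch item and transposes it to (samples, channels) for soundfile.
audio = musicgen.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=1503)[0].T

# MusicGen generates 32 kHz audio.
sf.write("musicgen.ogg", audio.numpy(force=True), 32000)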
ipd.Audio("musicgen.ogg")

/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
/usr/local/lib/python3.10/dist-packages/transformers/models/encodec/modeling_encodec.py:120: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
`torch.nn.functional.scaled_dot_product_attention` does not support having an empty attention mask. Falling back to the manual attention implementation. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.Note that this probably happens because `guidance_scale>1` or because you used `get_unconditional_inputs`. See https://github.com/huggingface/transformers/issues/31189 for more information.

Out[6]:
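
The `max_new_tokens=1503` setting in the MusicGen cell maps to clip length through the token frame rate of the audio codec. A small sketch of that arithmetic follows. The 50 Hz frame rate is an assumption taken from the Hugging Face MusicGen documentation rather than from this notebook, and the helper names are hypothetical.

```python
# Assumption: MusicGen's audio codec emits 50 token frames per second of audio.
MUSICGEN_FRAME_RATE_HZ = 50


def tokens_to_seconds(num_tokens, frame_rate_hz=MUSICGEN_FRAME_RATE_HZ):
    """Approximate clip duration in seconds for a given number of generated tokens."""
    return num_tokens / frame_rate_hz


def seconds_to_tokens(seconds, frame_rate_hz=MUSICGEN_FRAME_RATE_HZ):
    """Approximate number of tokens needed for a clip of the given duration."""
    return round(seconds * frame_rate_hz)


print(tokens_to_seconds(1503))  # 30.06
print(seconds_to_tokens(30.0))  # 1500
```

Under that assumption, 1503 tokens gives a clip of about 30 seconds, in line with the 30-second settings used for AudioLDM and Stable Audio Open.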

Stable Audio Open¶

In [7]:

from diffusers import StableAudioPipeline

stableaudio = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
stableaudio = stableaudio.to("cuda")

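# Seed the generator so the output is reproducible.
generator = torch.Generator("cuda").manual_seed(0)

# Generates three candidate waveforms; only the first one is saved below.
audio = stableaudio(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_end_in_s=30.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# Convert the first waveform from (channels, samples) float16 to (samples, channels) float32.
output = audio[0].T.float().numpy(force=True)
# Stable Audio Open generates 44.1 kHz stereo audio.
sf.write("stableaudio.ogg", output, 44100)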
ipd.Audio("stableaudio.ogg")

/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:608: UserWarning: Should have tb<=t1 but got tb=500.00006103515625 and t1=500.0.
  warnings.warn(f"Should have {tb_name}<=t1 but got {tb_name}={tb} and t1={self._end}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:599: UserWarning: Should have ta>=t0 but got ta=0.29999998211860657 and t0=0.3.
  warnings.warn(f"Should have ta>=t0 but got ta={ta} and t0={self._start}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:599: UserWarning: Should have ta>=t0 but got ta=0.0 and t0=0.3.
  warnings.warn(f"Should have ta>=t0 but got ta={ta} and t0={self._start}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:602: UserWarning: Should have tb>=t0 but got tb=0.29999998211860657 and t0=0.3.
  warnings.warn(f"Should have {tb_name}>=t0 but got {tb_name}={tb} and t0={self._start}.")

Out[7]:

Comparison table¶

In [8]:

import base64


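def embed_audio(src):
    """Return an HTML audio player with the file at `src` inlined as a base64 data URI."""
    with open(src, "rb") as f:
        data = f.read()
    code = base64.b64encode(data).decode()
    # <audio> is not a void element, so it needs an explicit closing tag.
    html = f'<audio controls src="data:audio/ogg;base64,{code}"></audio>'
    return html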


df = pd.DataFrame([
    {"model": "AudioLDM", "prompt": prompt, "audio": embed_audio("audioldm.ogg")},
    {"model": "Stable Audio Open", "prompt": prompt, "audio": embed_audio("stableaudio.ogg")},
    {"model": "MusicGen", "prompt": prompt, "audio": embed_audio("musicgen.ogg")}
])

ipd.HTML(df.to_html(escape=False))

Out[8]:

	model	prompt	audio
0	AudioLDM	relaxing piano music with a banjo solo and lo-fi beats	[embedded audio player]
1	Stable Audio Open	relaxing piano music with a banjo solo and lo-fi beats	[embedded audio player]
2	MusicGen	relaxing piano music with a banjo solo and lo-fi beats	[embedded audio player]