audioldm-stable-audio-open-musicgen

Tiny test of recent text-to-music (TTM) models

To run this notebook you need to do three things:

  1. Make sure the Colab runtime has an NVIDIA GPU available, because the code below assumes CUDA.
  2. Request access to Stable Audio Open and create a corresponding access token to paste into the Hugging Face login screen below.
  3. Pray to the software dependency gods that the pip install below still works.

Setup

In [1]:
pip install diffusers transformers torchsde
Requirement already satisfied: diffusers in /usr/local/lib/python3.10/dist-packages (0.30.2)
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.44.2)
Requirement already satisfied: torchsde in /usr/local/lib/python3.10/dist-packages (0.2.6)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.10/dist-packages (from diffusers) (8.4.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from diffusers) (3.15.4)
Requirement already satisfied: huggingface-hub>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from diffusers) (0.24.6)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from diffusers) (1.26.4)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from diffusers) (2024.5.15)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from diffusers) (2.32.3)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from diffusers) (0.4.4)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from diffusers) (9.4.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.5)
Requirement already satisfied: scipy>=1.5 in /usr/local/lib/python3.10/dist-packages (from torchsde) (1.13.1)
Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from torchsde) (2.4.0+cu121)
Requirement already satisfied: trampoline>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from torchsde) (0.1.2)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.2->diffusers) (2024.6.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.2->diffusers) (4.12.2)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (1.13.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (3.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->torchsde) (3.1.4)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata->diffusers) (3.20.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (3.8)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers) (2024.8.30)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.6.0->torchsde) (2.1.5)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.6.0->torchsde) (1.3.0)
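A quick way to confirm that the install worked is to check that each package can be found before running the heavier cells. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the list assumes the import names match the packages used in this notebook.

```python
import importlib.util

# Assumed import names: the three pip-installed packages plus the
# libraries imported later in the notebook.
REQUIRED = ["diffusers", "transformers", "torchsde", "torch", "soundfile", "pandas"]


def missing_packages(names):
    """Return the top-level package names the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that the packages can be located, not that their versions are compatible with each other.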
In [2]:
import torch                   # tensors and CUDA device handling
import pandas as pd            # comparison table
import soundfile as sf         # writing audio files
import IPython.display as ipd  # inline audio and HTML display
In [3]:
from huggingface_hub import login

login()

Models

In [4]:
# This prompt is used for all models below so we can compare how they sound.
prompt = "relaxing piano music with a banjo solo and lo-fi beats"

AudioLDM

In [5]:
from diffusers import AudioLDMPipeline

repo_id = "cvssp/audioldm-s-full-v2"
audioldm = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
audioldm = audioldm.to("cuda")

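# Only 10 denoising steps keep this test fast, at some cost in quality.
audio = audioldm(prompt, num_inference_steps=10, audio_length_in_s=30.0).audios[0]

# AudioLDM produces mono audio at 16 kHz.
sf.write("audioldm.ogg", audio, samplerate=16000)
ipd.Audio("audioldm.ogg")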
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]
  0%|          | 0/10 [00:00<?, ?it/s]
Out[5]:

MusicGen

In [6]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
musicgen = musicgen.to("cuda")

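inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
# 1503 new tokens correspond to roughly 30 seconds of audio.
# [0] selects the first batch item; .T gives (samples, channels) as soundfile expects.
audio = musicgen.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=1503)[0].T

# MusicGen produces audio at 32 kHz.
sf.write("musicgen.ogg", audio.numpy(force=True), 32000)
ipd.Audio("musicgen.ogg")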
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
/usr/local/lib/python3.10/dist-packages/transformers/models/encodec/modeling_encodec.py:120: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
`torch.nn.functional.scaled_dot_product_attention` does not support having an empty attention mask. Falling back to the manual attention implementation. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.Note that this probably happens because `guidance_scale>1` or because you used `get_unconditional_inputs`. See https://github.com/huggingface/transformers/issues/31189 for more information.
Out[6]:
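The value `max_new_tokens=1503` is easier to read as a duration. The sketch below converts between tokens and seconds, assuming MusicGen's audio tokens run at 50 frames per second; that frame rate is an assumption stated here, not something the notebook reads from the model, and the helper names are illustrative.

```python
# Assumption: MusicGen's audio codec emits 50 token frames per second.
MUSICGEN_FRAME_RATE_HZ = 50


def tokens_to_seconds(num_tokens, frame_rate_hz=MUSICGEN_FRAME_RATE_HZ):
    """Approximate audio duration for a given number of generated tokens."""
    return num_tokens / frame_rate_hz


def seconds_to_tokens(seconds, frame_rate_hz=MUSICGEN_FRAME_RATE_HZ):
    """Approximate number of tokens needed for a target duration."""
    return round(seconds * frame_rate_hz)


print(tokens_to_seconds(1503))  # 30.06
print(seconds_to_tokens(30.0))  # 1500
```

Under that assumption, 1503 tokens come out at just over 30 seconds, which matches the 30-second length requested from the other two models.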

Stable Audio Open

In [7]:
from diffusers import StableAudioPipeline

stableaudio = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
stableaudio = stableaudio.to("cuda")

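# Fixed seed so the result is reproducible.
generator = torch.Generator("cuda").manual_seed(0)

audio = stableaudio(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_end_in_s=30.0,
    num_waveforms_per_prompt=3,  # three candidates are generated
    generator=generator,
).audios

# Only the first of the three waveforms is saved. Transpose it to
# (samples, channels) and cast float16 to float32 before writing at 44.1 kHz.
output = audio[0].T.float().numpy(force=True)
sf.write("stableaudio.ogg", output, 44100)
ipd.Audio("stableaudio.ogg")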
Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]
  0%|          | 0/200 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:608: UserWarning: Should have tb<=t1 but got tb=500.00006103515625 and t1=500.0.
  warnings.warn(f"Should have {tb_name}<=t1 but got {tb_name}={tb} and t1={self._end}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:599: UserWarning: Should have ta>=t0 but got ta=0.29999998211860657 and t0=0.3.
  warnings.warn(f"Should have ta>=t0 but got ta={ta} and t0={self._start}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:599: UserWarning: Should have ta>=t0 but got ta=0.0 and t0=0.3.
  warnings.warn(f"Should have ta>=t0 but got ta={ta} and t0={self._start}.")
/usr/local/lib/python3.10/dist-packages/torchsde/_brownian/brownian_interval.py:602: UserWarning: Should have tb>=t0 but got tb=0.29999998211860657 and t0=0.3.
  warnings.warn(f"Should have {tb_name}>=t0 but got {tb_name}={tb} and t0={self._start}.")
Out[7]:

Comparison table

In [8]:
import base64


def embed_audio(src):
    with open(src, "rb") as f:
        data = f.read()
    code = base64.b64encode(data).decode()
    html = f'<audio controls src="data:audio/ogg;base64,{code}"></audio>'
    return html


df = pd.DataFrame([
    {"model": "AudioLDM", "prompt": prompt, "audio": embed_audio("audioldm.ogg")},
    {"model": "Stable Audio Open", "prompt": prompt, "audio": embed_audio("stableaudio.ogg")},
    {"model": "MusicGen", "prompt": prompt, "audio": embed_audio("musicgen.ogg")}
])

ipd.HTML(df.to_html(escape=False))
Out[8]:
   model              prompt                                                  audio
0  AudioLDM           relaxing piano music with a banjo solo and lo-fi beats  [embedded audio player]
1  Stable Audio Open  relaxing piano music with a banjo solo and lo-fi beats  [embedded audio player]
2  MusicGen           relaxing piano music with a banjo solo and lo-fi beats  [embedded audio player]
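The table works because each audio file is inlined as a base64 data URI, so the players keep working when the notebook is shared without the `.ogg` files. The sketch below checks that round trip with placeholder bytes rather than real audio. The `mime` parameter, the explicit closing tag, and the `extract_payload` helper are illustrative additions, not part of the cell above.

```python
import base64
import os
import re
import tempfile


def embed_audio(src, mime="audio/ogg"):
    """Return an <audio> tag with the file inlined as a base64 data URI."""
    with open(src, "rb") as f:
        code = base64.b64encode(f.read()).decode("ascii")
    return f'<audio controls src="data:{mime};base64,{code}"></audio>'


def extract_payload(html):
    """Decode the base64 payload back out of the generated tag."""
    match = re.search(r'base64,([A-Za-z0-9+/=]*)"', html)
    return base64.b64decode(match.group(1))


# Round-trip check with placeholder bytes (not a valid Ogg file).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "placeholder.ogg")
    payload = b"OggS placeholder bytes"
    with open(path, "wb") as f:
        f.write(payload)
    html = embed_audio(path)
    roundtrip_ok = extract_payload(html) == payload

print(roundtrip_ok)
```

Base64 encoding makes the embedded data about a third larger than the file on disk, so this approach suits short clips like the ones in this test.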