audioldm-stable-audio-open-musicgen
Tiny test of recent text-to-music (TTM) models
To run this notebook you need to do three things:
- Make sure the Colab runtime has an NVIDIA GPU available, because the code below assumes CUDA.
- Request access to Stable Audio Open and create a corresponding access token to paste into the Hugging Face login screen below.
- Pray to the software dependency gods that the pip install below still works.
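A quick way to confirm the first requirement before loading any model is sketched below. The helper name cuda_status is made up for this example; it only relies on torch.cuda.is_available() and torch.cuda.get_device_name(), and it falls back to a plain message if PyTorch is not installed.

```python
import importlib.util


def cuda_status():
    """Return a short message saying whether PyTorch can see a CUDA GPU.

    Hypothetical helper for this notebook, not part of any library.
    """
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this runtime."
    import torch

    if not torch.cuda.is_available():
        return "No CUDA GPU visible. In Colab, switch the runtime type to a GPU."
    return f"CUDA GPU found: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

If the message reports no GPU, the .to("cuda") calls further down will fail, so it is worth fixing the runtime first.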
Setup
In [1]:
pip install diffusers transformers torchsde
In [2]:
import numpy
import scipy
import torch
import pandas as pd
import soundfile as sf
import IPython.display as ipd
In [3]:
from huggingface_hub import login
login()
Models
In [4]:
# This prompt is used for all models below so we can compare how they sound.
prompt = "relaxing piano music with a banjo solo and lo-fi beats"
AudioLDM
In [5]:
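from diffusers import AudioLDMPipeline
# Load the AudioLDM checkpoint in half precision and move it to the GPU.
repo_id = "cvssp/audioldm-s-full-v2"
audioldm = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
audioldm = audioldm.to("cuda")
# Generate 30 seconds of audio; .audios is a batch, so take the first waveform.
audio = audioldm(prompt, num_inference_steps=10, audio_length_in_s=30.0).audios[0]
# AudioLDM produces 16 kHz audio.
sf.write("audioldm.ogg", audio, samplerate=16000)
ipd.Audio("audioldm.ogg")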
Out[5]:
MusicGen
In [6]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
musicgen = musicgen.to("cuda")
inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
# 1503 new tokens correspond to roughly 30 seconds of audio.
# generate() returns (batch, channels, samples); [0].T gives (samples, channels) for soundfile.
audio = musicgen.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=1503)[0].T
# MusicGen produces 32 kHz audio.
sf.write("musicgen.ogg", audio.numpy(force=True), 32000)
ipd.Audio("musicgen.ogg")
Out[6]:
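The max_new_tokens=1503 value above is easier to adjust with a small conversion helper. The sketch below assumes a frame rate of 50 tokens per second for facebook/musicgen-small; that number is an assumption here and is best read from musicgen.config.audio_encoder.frame_rate on the loaded model. The function names are made up for this example.

```python
def seconds_to_tokens(seconds, frame_rate=50):
    """Approximate number of new tokens needed for a clip of the given length.

    frame_rate is assumed to be 50 tokens per second; check the loaded
    model's config rather than trusting this default.
    """
    return round(seconds * frame_rate)


def tokens_to_seconds(tokens, frame_rate=50):
    """Approximate clip length in seconds for a given number of new tokens."""
    return tokens / frame_rate


print(seconds_to_tokens(30))
print(tokens_to_seconds(1503))
```

Under that assumption, 1503 tokens is only slightly more than 30 seconds, which matches the 30-second clips requested from the other two models.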
Stable Audio Open
In [7]:
from diffusers import StableAudioPipeline
stableaudio = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
stableaudio = stableaudio.to("cuda")
# Fix the seed so the result is reproducible.
generator = torch.Generator("cuda").manual_seed(0)
audio = stableaudio(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_end_in_s=30.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios
# Three waveforms are generated, but only the first one is saved.
# Transpose to (samples, channels) and cast from float16 to float32 for soundfile.
output = audio[0].T.float().numpy(force=True)
# Stable Audio Open produces 44.1 kHz audio.
sf.write("stableaudio.ogg", output, 44100)
ipd.Audio("stableaudio.ogg")
Out[7]:
Comparison table
In [8]:
import base64

def embed_audio(src):
    # Inline the OGG file as a base64 data URI so the table works without external files.
    with open(src, "rb") as f:
        data = f.read()
    code = base64.b64encode(data).decode()
    html = f'<audio controls src="data:audio/ogg;base64,{code}"></audio>'
    return html

df = pd.DataFrame([
    {"model": "AudioLDM", "prompt": prompt, "audio": embed_audio("audioldm.ogg")},
    {"model": "Stable Audio Open", "prompt": prompt, "audio": embed_audio("stableaudio.ogg")},
    {"model": "MusicGen", "prompt": prompt, "audio": embed_audio("musicgen.ogg")}
])
# escape=False keeps the <audio> tags as HTML instead of showing them as text.
ipd.HTML(df.to_html(escape=False))
Out[8]: