<a href="https://colab.research.google.com/gist/carlthome/5f96f8d5777597982dd0fedc532f1647/audioldm-stable-audio-open-musicgen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tiny test of recent text-to-music (TTM) models

To run this notebook you need to do three things:
1. Make sure the Colab runtime has an NVIDIA GPU available, because the code below assumes CUDA.
1. Request access to [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) and create a corresponding [access token](https://huggingface.co/settings/tokens) to paste into the Hugging Face login screen below.
1. Pray to the software dependency gods that the `pip` install below still works[.](https://nixos.org/)
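
The first requirement can be checked before anything heavy is installed. The snippet below is a minimal sketch, not part of the original notebook: `has_nvidia_gpu` is a hypothetical helper that assumes the NVIDIA driver ships the `nvidia-smi` tool on the `PATH`, as it does on Colab GPU runtimes. Once `torch` is imported, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Heuristic (hypothetical helper): True if `nvidia-smi -L` lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tooling on PATH, so assume there is no usable GPU.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one "GPU <index>: <name> ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


if not has_nvidia_gpu():
    print("No NVIDIA GPU detected. In Colab, choose a GPU under Runtime > Change runtime type.")
```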

## Setup

In [1]:
%pip install diffusers transformers torchsde



In [2]:
import numpy
import scipy
import torch
import pandas as pd
import soundfile as sf
import IPython.display as ipd

In [3]:
from huggingface_hub import login

login()


## Models

In [4]:
# This prompt is used for all models below so we can compare how they sound.
prompt = "relaxing piano music with a banjo solo and lo-fi beats"

### AudioLDM

In [5]:
from diffusers import AudioLDMPipeline

repo_id = "cvssp/audioldm-s-full-v2"
audioldm = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
audioldm = audioldm.to("cuda")

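# Ten denoising steps keep this test fast; more steps usually sound better but take longer.
audio = audioldm(prompt, num_inference_steps=10, audio_length_in_s=30.0).audios[0]

# AudioLDM generates audio at 16 kHz.
sf.write("audioldm.ogg", audio, samplerate=16000)
ipd.Audio("audioldm.ogg")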



### MusicGen

In [6]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
musicgen = musicgen.to("cuda")

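inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
# generate() returns a tensor of shape (batch, channels, samples). Take the first
# item and transpose it to (samples, channels), the layout soundfile expects.
audio = musicgen.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=1503)[0].T

# MusicGen generates audio at 32 kHz.
sf.write("musicgen.ogg", audio.numpy(force=True), 32000)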
ipd.Audio("musicgen.ogg")
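
The `max_new_tokens=1503` value above sets the clip length in codec frames rather than seconds. The arithmetic below is a rough sketch under the assumption that MusicGen's audio codec runs at 50 frames per second; the helper names are hypothetical, and the loaded model's configuration is the authoritative source for the real frame rate.

```python
import math

# Assumption: 50 codec frames (token steps) per second of generated audio.
FRAME_RATE_HZ = 50


def tokens_for_seconds(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Token steps needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate_hz)


def seconds_for_tokens(tokens: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Approximate audio duration produced by `tokens` token steps."""
    return tokens / frame_rate_hz


print(tokens_for_seconds(30.0))
print(seconds_for_tokens(1503))
```

Under that assumption, 1503 tokens is just over 30 seconds, which matches the 30-second clips requested from the other two models.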


### Stable Audio Open

In [7]:
from diffusers import StableAudioPipeline

stableaudio = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
stableaudio = stableaudio.to("cuda")

generator = torch.Generator("cuda").manual_seed(0)

audio = stableaudio(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_end_in_s=30.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

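# Three waveforms are generated, but only the first is saved. Transpose it from
# (channels, samples) to (samples, channels) and cast float16 to float32 for soundfile.
output = audio[0].T.float().numpy(force=True)
# Stable Audio Open generates audio at 44.1 kHz.
sf.write("stableaudio.ogg", output, 44100)
ipd.Audio("stableaudio.ogg")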



## Comparison table



In [8]:
import base64


def embed_audio(src):
    with open(src, "rb") as f:
        data = f.read()
    code = base64.b64encode(data).decode()
    html = f'<audio controls src="data:audio/ogg;base64,{code}"></audio>'
    return html


df = pd.DataFrame([
    {"model": "AudioLDM", "prompt": prompt, "audio": embed_audio("audioldm.ogg")},
    {"model": "Stable Audio Open", "prompt": prompt, "audio": embed_audio("stableaudio.ogg")},
    {"model": "MusicGen", "prompt": prompt, "audio": embed_audio("musicgen.ogg")}
])

ipd.HTML(df.to_html(escape=False))

| | model | prompt | audio |
|---|---|---|---|
| 0 | AudioLDM | relaxing piano music with a banjo solo and lo-fi beats | |
| 1 | Stable Audio Open | relaxing piano music with a banjo solo and lo-fi beats | |
| 2 | MusicGen | relaxing piano music with a banjo solo and lo-fi beats | |
