AudioMuse-AI-DCLAP

AudioMuse-AI Distilled CLAP (DCLAP) distill the LAION CLAP audio tower into a tiny 7 M‑parameter model while retaining its 512‑dimensional embedding space. The result is a very fast text‑to‑music search engine that can run even on a Raspberry PI 5 8GB ram and SSD.

only the audio branch is distilled; the original text tower is reused (exported from LAION CLAP in .onnx).

Distillation

The goal is simple: distill LAION CLAP's behaviour while cutting the audio tower from ~80 M params to ~7 M and running 5–6× faster (performance test done on a Raspberry PI 5 8GB ram and SSD). A pre‑trained CLAP (music_audioset_epoch_15_esc_90.14.pt) supervises a tiny student, initialised with mn10as weights from EfficientAT. When this student plateaus on validation cosine, we freeze it and add a second, even smaller student seeded with EdgeNeXt weights. The newcomer is trained to push the cosine higher until no more gain shows.

The two students are finally fused via a gate that outputs a 512‑dimensional weight vector, deciding—for each feature—how much to trust the second model versus the frozen first.

The dataset for the distillation proces consists of various Creative Commons and public‑domain songs; see the dataset_license directory for full details.

The code used for training is in student_clap folder.

Quick Start

This command are tested on Linux and MacOS, to be executed on windows they will require some adaptation.

First create a python envirorment and install the dependencies:

python3 -m venv venv && source venv/bin/activate 
pip install onnxruntime librosa numpy transformers

Also get the required data from the release of this project:

File	Description
`model_epoch_36.onnx`	Student audio encoder
`model_epoch_36.onnx.data`	External weights — must be in same folder
`clap_text_model.onnx`	Teacher text encoder, from LAION CLAP exported in .onnx

Using this model the output expected is a 512-dim L2-normalized embeddings on both text and songs audio on which you can run the similarity with dot product.

The songs must be splitted in 10s segment with 50% overlap. Each segment is inferenced singularly and then all the segment avaraged together before the similarity check against the text.

The text model is exported in .onnx to have all on the same technology but no change was made. The tokenizer is directly downloaded in the code with AutoTokenizer.from_pretrained("laion/clap-htsat-unfused") that is a roberta-base tokenizer

Below a code example run on CPU:

import numpy as np
import librosa
import onnxruntime as ort
from transformers import AutoTokenizer

# === CONFIGURATION ===
AUDIO_PATH = "your_audio.wav"          # Path to your audio file
TEXT_QUERY = "Calm piano song"         # Your playlist/mood query string
STUDENT_AUDIO_MODEL = "model_epoch_36.onnx"    # Student audio encoder
TEXT_MODEL = "clap_text_model.onnx"            # Paired ONNX text encoder
SR = 48000
SEGMENT_LENGTH = 480000   # 10 seconds at 48 kHz
HOP_LENGTH = 240000       # 50% overlap
N_MELS = 128
N_FFT = 2048
HOP_LENGTH_MELS = 480

# -- Use the SAME tokenizer as used at student training time --
tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")  # roberta-base tokenizer

def extract_segmented_logmels(audio_path: str):
    y, _ = librosa.load(audio_path, sr=SR, mono=True)

    # Quantize to int16 and back (matching PyTorch CLAP preprocessing)
    y = np.clip(y, -1.0, 1.0)
    y = (y * 32767.0).astype(np.int16)
    y = (y / 32767.0).astype(np.float32)

    total = len(y)
    segments = []

    if total <= SEGMENT_LENGTH:
        # Short audio: zero-pad to exactly 10 s
        segments.append(np.pad(y, (0, SEGMENT_LENGTH - total)))
    else:
        for start in range(0, total - SEGMENT_LENGTH + 1, HOP_LENGTH):
            segments.append(y[start:start + SEGMENT_LENGTH])
        # Capture the tail segment so the end of the track is never dropped
        last_start = len(segments) * HOP_LENGTH
        if last_start < total:
            segments.append(y[-SEGMENT_LENGTH:])

    mel_segments = []
    for segment in segments:
        mel = librosa.feature.melspectrogram(
            y=segment, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH_MELS,
            win_length=N_FFT, window='hann', center=True, pad_mode='reflect',
            power=2.0, n_mels=N_MELS, fmin=0, fmax=14000,
        )
        log_mel = librosa.power_to_db(mel, ref=1.0, amin=1e-10, top_db=None)
        mel_segments.append(log_mel[np.newaxis, np.newaxis, :, :].astype(np.float32))  # (1, 1, n_mels, time)

    return mel_segments

def cosine_similarity(v1, v2):
    v1 = v1.reshape(-1)
    v2 = v2.reshape(-1)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# -- Load Student ONNX Models --
audio_model = ort.InferenceSession(STUDENT_AUDIO_MODEL)
text_model = ort.InferenceSession(TEXT_MODEL)

# -- AUDIO INFERENCE: feed one segment at a time (model has fixed batch=1) --
mel_segments = extract_segmented_logmels(AUDIO_PATH)
audio_embs = []
for mel_spec in mel_segments:
    emb = audio_model.run(None, {"mel_spectrogram": mel_spec})[0]  # (1, embed_dim)
    audio_embs.append(emb[0])
avg_audio_emb = np.mean(audio_embs, axis=0)                        # (embed_dim,)
avg_audio_emb = avg_audio_emb / (np.linalg.norm(avg_audio_emb) + 1e-9)  # L2 normalize

# -- TEXT ENCODER INFERENCE --
# CLAP encoders require exactly length=77
tok = tokenizer(TEXT_QUERY, padding="max_length", truncation=True, max_length=77, return_tensors="np")
text_emb = text_model.run(None, {
    "input_ids": tok["input_ids"].astype(np.int64),
    "attention_mask": tok["attention_mask"].astype(np.int64),
})[0]  # (1, embed_dim)

# -- MATCH AUDIO & TEXT --
sim = cosine_similarity(avg_audio_emb, text_emb).item()
print(f"Text Query: {TEXT_QUERY}")
print(f"Audio segments processed: {len(mel_segments)}")
print(f"Similarity score: {sim:.4f}")

Acknowledgements

This project was inspired by the tinyCLAP distillation approach for audio. Their work demonstrated how to compress the audio tower of Microsoft CLAP, and provided the conceptual foundation for DCLAP.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
dataset_license		dataset_license
student_clap		student_clap
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AudioMuse-AI-DCLAP

Distillation

Quick Start

Acknowledgements

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

AudioMuse-AI-DCLAP

Distillation

Quick Start

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Languages

Packages