metadata
base_model:
  - mistralai/Mistral-7B-v0.1
  - m-a-p/MERT-v1-95M
library_name: peft
license: apache-2.0
datasets:
  - amaai-lab/MusicBench
language:
  - en
metrics:
  - bertscore
  - bleu
pipeline_tag: audio-text-to-text

SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

SonicVerse is a multi-task music captioning model that integrates caption generation with auxiliary music feature detection tasks such as key detection and vocals detection. Its novel projection-based architecture captures both low-level acoustic details and high-level musical attributes: it transforms audio input into natural language captions while simultaneously detecting music features through dedicated auxiliary heads. SonicVerse also enables temporally informed long captions for extended music pieces. Outputs from short segments are chained using large language models, yielding detailed, time-informed descriptions that capture the evolving musical narrative.
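The chaining step described above can be illustrated with a minimal sketch. It is not the SonicVerse implementation: the `SegmentCaption` type, the `build_chaining_prompt` helper, the prompt wording, and the 10-second segment length in the usage example are all assumptions made for illustration. The sketch only shows how per-segment captions might be ordered and tagged with timestamps before being passed to a large language model of your choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentCaption:
    """Hypothetical container for one short-segment caption."""
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # caption generated for this segment


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as M:SS."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


def build_chaining_prompt(segments: List[SegmentCaption]) -> str:
    """Assemble a time-annotated prompt from per-segment captions.

    The instruction text is an assumed example, not the prompt used
    by the SonicVerse authors.
    """
    # Sort by start time so the prompt follows the order of the music.
    ordered = sorted(segments, key=lambda s: s.start_s)
    lines = [
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.text.strip()}"
        for s in ordered
    ]
    header = (
        "The following are captions of consecutive segments of one music piece. "
        "Write a single description of how the music evolves over time."
    )
    return header + "\n" + "\n".join(lines)


if __name__ == "__main__":
    # Example captions are invented placeholders, not model output.
    example = [
        SegmentCaption(0, 10, "A soft piano introduction in a minor key."),
        SegmentCaption(10, 20, "Drums and bass enter and the tempo feels faster."),
    ]
    print(build_chaining_prompt(example))
```

The resulting string would be sent to whichever language model performs the chaining; the choice of model and the exact prompt are left open here.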

View the demo on our Hugging Face Space

Read the paper: SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

GitHub: https://github.com/AMAAI-Lab/SonicVerse

How to Get Started

To run inference locally, follow the instructions in the GitHub repository. Alternatively, try the model in the Spaces demo.

Citation

If you use SonicVerse, please cite our paper:

@article{chopra2025sonicverse,
  title={SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning},
  author={Chopra, Anuradha and Roy, Abhinaba and Herremans, Dorien},
  journal={Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025)},
  year={2025},
  address={Brussels, Belgium},
  month={September},
  url={https://arxiv.org/abs/2506.15154},
}

DOI: 10.48550/arXiv.2506.15154