Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo uses a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Owing to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored to open-ended music inquiries. Empirical evaluations demonstrate competitive performance in generating music captions and composing music-related Q&A pairs, and our introduced dataset enables notable advancements beyond previous ones.
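The core adaptation described above, a single trainable projection layer mapping frozen MERT audio representations into the frozen LLM's embedding space, can be sketched as follows. This is a minimal NumPy illustration: the dimensions (768 for MERT, 4096 for the LLM) and the function name are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

MERT_DIM = 768   # assumed MERT hidden size (illustrative)
LLM_DIM = 4096   # assumed LLM embedding size (illustrative)

rng = np.random.default_rng(0)
# The only trainable parameters in this setup: one linear projection.
# MERT and the LLM themselves stay frozen during training.
W = rng.standard_normal((MERT_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def project_music_features(mert_features: np.ndarray) -> np.ndarray:
    """Map frozen MERT frame embeddings (T, MERT_DIM) into the frozen
    LLM's embedding space (T, LLM_DIM). The resulting pseudo-token
    embeddings can then be prepended to the text prompt's embeddings."""
    return mert_features @ W + b

# Example: 100 frames of MERT features for one audio clip.
frames = rng.standard_normal((100, MERT_DIM))
llm_tokens = project_music_features(frames)
print(llm_tokens.shape)  # (100, 4096)
```

Because only `W` and `b` are updated, the caption-pretraining and instruction-tuning stages described above only need to learn this one mapping, which keeps training lightweight.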
Ge Zhang, Zihao Deng, Yi Ma, Yudong Liu, Rongchen Guo