Overview
The Academic Text-to-Music Generation (ATTM) Grand Challenge is a research-oriented competition designed to advance text-to-music generation under fair, transparent, and reproducible conditions.
Unlike many recent text-to-music systems that rely on proprietary datasets and industrial-scale compute, ATTM focuses on training generative models from scratch using a standardized, CC-licensed dataset. The goal is to shift attention away from data scale and pre-trained black-box models, and toward algorithmic efficiency, model design, and musical intelligence.
This challenge is hosted as part of ICME 2026 and welcomes participation from students, academic labs, and researchers worldwide.
Grand Challenge Description
State-of-the-art Text-to-Music (TTM) models such as MusicLM and MusicGen have relied on proprietary datasets and massive industrial compute resources. Even recent open efforts require enormous computational power, creating a significant barrier for most academic researchers. As a result, academic labs are often limited to fine-tuning pre-trained models rather than exploring fundamental architectural innovations from scratch.
The ATTM Grand Challenge establishes a fair-play benchmark to address this issue. All participants must train the core generative model strictly from scratch using a standardized, CC-licensed dataset of 457 hours derived from MTG-Jamendo. The focus is on algorithmic efficiency and musical intelligence rather than data scale or compute volume.
Key Principles of the Challenge
Core generative models must be trained from scratch
No pre-trained weights are allowed for the main music generation model.
Auxiliary components may use public checkpoints, including:
- Audio tokenizers / autoencoders
- Audio Language Models (ALMs) for captioning
- Vocoders or audio enhancement models
Proprietary or non-reproducible models are strictly prohibited.
Fully automatic generation only
No human-in-the-loop annotation, manual editing, or cherry-picking of samples is allowed.
Instrumental music only
All training data is processed to remove vocals, and generated outputs must be purely instrumental.
Standardized text prompts
Organizers will provide official caption sets (generated by Music Flamingo or Qwen2-Audio) to ensure consistent evaluation across teams, though teams may create their own captions using public ALMs.
All components must be declared, and organizers reserve the right to verify training logs and configurations.
Tracks
To encourage broad participation, ATTM is divided into two tracks:
Efficiency Track
- Maximum of 500M parameters for the core generative model
- Designed to encourage innovation in efficient architectures
- Suitable for student teams and resource-constrained labs
Performance Track
- No parameter limit
- Focuses on pushing the upper bound of performance under academic data constraints
- Suitable for teams exploring large or complex architectures
The core generative model generally refers to the main architecture responsible for text-to-music generation, excluding auxiliary components such as audio encoders, text encoders, or vocoders. The organizers reserve the right to make final determinations on what constitutes a core generative model. If you believe your architecture may be difficult to classify, please contact the organizers proactively to discuss and resolve any ambiguities before submission.
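For illustration only (not an official verification script), a team could self-check the size of its core generative model in PyTorch along the following lines; CoreTTMModel is a hypothetical stand-in for your own architecture:

```python
import torch.nn as nn

# Placeholder for a team's own core text-to-music architecture
# (hypothetical; substitute your actual model class).
class CoreTTMModel(nn.Module):
    def __init__(self, dim: int = 1024, layers: int = 24):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=16, batch_first=True),
            num_layers=layers,
        )

def count_parameters(model: nn.Module) -> int:
    # Count only the core generative model; auxiliary components such as
    # audio tokenizers, text encoders, and vocoders are outside the cap.
    return sum(p.numel() for p in model.parameters())

model = CoreTTMModel()
print(f"Core model parameters: {count_parameters(model) / 1e6:.1f}M (Efficiency Track cap: 500M)")
```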
Teams may choose to participate in one or both tracks. Both tracks are evaluated with the same criteria (as specified in the Evaluation Criteria section below), but ranked independently.
Awards
We are proud to partner with Moises to offer cash awards to the best-performing teams in this challenge.
Efficiency Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Performance Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Award Distribution Policy
- Full prizes will be awarded in each track when there are strong qualifying entries.
- The organizers reserve the right to combine or adjust awards if participation is lower than expected in one track or if submissions do not meet minimum quality standards.
Evaluation Criteria
All teams will generate 100 audio samples based on a hidden set of test prompts provided by the organizers.
Evaluation is performed on these submitted samples in two stages, balancing automated metrics with human judgment.
Phase 1: Objective Evaluation (Scorecard)
All submissions are first ranked using a composite score based on the following metrics:
- Audio Quality (Fréchet Audio Distance, FAD): measures distributional similarity between generated audio and a hidden reference set.
- Semantic Alignment (CLAP Score): evaluates how well the generated audio matches the input text prompt.
- Concept Coverage Score (CCS / K–M Metric): each prompt contains M musical concepts (e.g., tempo, instrumentation, style); Audio Language Models act as blind judges to detect whether each concept is present; a score of K/M is assigned per prompt and averaged across the evaluation set (an illustrative sketch follows this list).
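To make the K/M aggregation concrete, here is a minimal, unofficial sketch of how such a score could be computed; the alm_judge callable is purely hypothetical and stands in for whatever Audio Language Model judge the organizers use:

```python
from typing import Callable, Dict, List

def concept_coverage_score(
    prompt_concepts: Dict[str, List[str]],   # prompt_id -> the prompt's M musical concepts
    generated_audio: Dict[str, str],         # prompt_id -> path of the generated clip
    alm_judge: Callable[[str, str], bool],   # hypothetical blind ALM judge: (audio_path, concept) -> detected?
) -> float:
    """Average K/M over all prompts, where K is the number of a prompt's
    M concepts that the blind ALM judge detects in the generated audio."""
    per_prompt = []
    for prompt_id, concepts in prompt_concepts.items():
        k = sum(alm_judge(generated_audio[prompt_id], concept) for concept in concepts)
        per_prompt.append(k / len(concepts))  # K / M for this prompt
    return sum(per_prompt) / len(per_prompt)  # averaged across the evaluation set
```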
Submission constraints:
- Audio must be at least 10 seconds long
- Evaluation is performed only on the first 10 seconds for all teams
Phase 2: Human Evaluation (MOS)
Based on Phase 1 rankings, the Top N teams per track advance to a formal Mean Opinion Score (MOS) study conducted by expert listeners.
N will be determined after the registration deadline based on the number of participants in each track.
Evaluation dimensions include:
- Audio Quality
- Musicality (rhythmic stability, harmonic progressions, phrasing)
- Prompt Adherence
How to Participate
Registration
Teams must register before March 20, 2026 to indicate their intent to participate. Registration helps organizers prepare evaluation resources and does not require a completed system.
Registration instructions and links will be released at the official launch.
Submission
Teams must submit their final entries before April 23, 2026.
Final submissions must include:
- Generated audio for the 100 hidden test prompts
  - Format: WAV or MP3
  - Sample rate: 44.1 kHz
  - Duration: at least 10 seconds; only the first 10 seconds are used for evaluation (see the validation sketch below)
- Model code for parameter verification and reproducibility
- A short Grand Challenge paper (up to 4 pages)
Grand Challenge papers are required only from finalist teams (announced April 30, 2026) and must be submitted by May 15, 2026. Papers from non-finalist teams, though not required, are still encouraged and welcome.
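As an informal aid (not an official checker), the sketch below shows one way a team might verify its generated clips against the audio requirements above; it assumes the soundfile library and one WAV file per prompt in a single directory:

```python
import glob
import soundfile as sf

TARGET_SR = 44_100       # required sample rate
MIN_DURATION_S = 10.0    # only the first 10 seconds are scored

def validate_submission(audio_dir: str) -> None:
    problems = []
    files = sorted(glob.glob(f"{audio_dir}/*.wav"))
    if len(files) != 100:
        problems.append(f"expected 100 clips, found {len(files)}")
    for path in files:
        info = sf.info(path)
        if info.samplerate != TARGET_SR:
            problems.append(f"{path}: sample rate {info.samplerate}, expected {TARGET_SR}")
        if info.duration < MIN_DURATION_S:
            problems.append(f"{path}: {info.duration:.2f} s, shorter than {MIN_DURATION_S} s")
    print("\n".join(problems) if problems else "All clips pass the basic format checks.")

validate_submission("submission_audio")  # hypothetical directory of generated WAV files
```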
All submissions:
- Must be generated fully automatically
- Must use one forward generation per prompt
- Will be anonymized during evaluation
Detailed submission instructions will be released at the official launch.
Top teams will be invited to present their work at the ICME 2026 Grand Challenge session.
Dataset & Code
Training Dataset
457 hours of CC-licensed instrumental music derived from MTG-Jamendo with official captions provided
Preprocessing Code
Provided for vocal separation to ensure instrumental-only data
Baseline Code
Provided to lower the entry barrier and help teams get started quickly
Dataset and code access links will be released at the official launch.
Important Dates
- Feb 10, 2026: Official launch
- Mar 20, 2026: Registration deadline
- Mar 30, 2026: Dry-run submission deadline (pipeline verification)
- Apr 20, 2026: Final test prompts released
- Apr 23, 2026: Final audio submission deadline (72-hour window after prompt release)
- Apr 30, 2026: Finalists announcement
- May 15, 2026: Grand Challenge paper submission deadline
- May 22, 2026: Final MOS results, winner announcement, and paper acceptance notification
- May 30, 2026: Camera-ready and author registration deadline
Contact Information
Dr. Yi-Hsuan (Eric) Yang
National Taiwan University, AI Center of Research Excellence
Dr. Hung-Yi Lee
National Taiwan University, AI Center of Research Excellence