Latest Updates
- NEW! Important Challenge Rules Updates! (Mar 1, 2026)
We've updated several important rules for the challenge:
- Team Participation: Each member can now be in up to three teams (regardless of tracks), but can only be the first author for one team. See the Tracks section for details.
- Final Submissions: Teams may submit up to two final results per team. Only the higher-scoring one will count as your team's final score. See the How to Participate section for details.
- Dry-run Submissions: At the dry-run stage, we will release a dry-run set of prompts. Teams may optionally submit results to receive Phase 1 evaluation scores as mid-contest feedback. See the Important Dates section for details.
- Reference Caption Sets & Captioning Pipeline Released! (Feb 25, 2026)
Reference caption sets generated by Music Flamingo and Qwen2-audio are now available! We've also released the captioning pipeline to encourage data augmentation exploration. Visit the Dataset & Code section to access both resources.
- Test Prompt Format Details Released! (Feb 12, 2026)
We've added detailed information about the test prompt format, including generation process, specifications, and example prompts. Check out the Evaluation Criteria section to learn more and refine your training strategy!
- Baseline Code Released! (Feb 6, 2026)
The baseline code is now available! Check out the Dataset & Code section to access the MeanAudio baseline implementation and get started.
- Registration Now Open! (Feb 5, 2026)
Team registration is now available! Register by March 20, 2026 via the How to Participate section. Don't miss this opportunity to join the ATTM Grand Challenge!
- Dataset Download & Preprocessing Code Released! (Feb 5, 2026)
The training dataset download instructions and preprocessing code are now available in the Dataset & Code section. Please note that downloading and processing the dataset takes considerable time. We strongly recommend starting as soon as you decide to participate.
Overview
The Academic Text-to-Music Generation (ATTM) Grand Challenge is a research-oriented competition designed to advance text-to-music generation under fair, transparent, and reproducible conditions.
Unlike many recent text-to-music systems that rely on proprietary datasets and industrial-scale compute, ATTM focuses on training generative models from scratch using a standardized, CC-licensed dataset. The goal is to shift attention away from data scale and pre-trained black-box models, and toward algorithmic efficiency, model design, and musical intelligence.
This challenge is hosted as part of ICME 2026 and welcomes participation from students, academic labs, and researchers worldwide.
Grand Challenge Description
State-of-the-art Text-to-Music (TTM) models such as MusicLM and MusicGen have relied on proprietary datasets and massive industrial compute resources. Even recent open efforts require enormous computational power, creating a significant barrier for most academic researchers. As a result, academic labs are often limited to fine-tuning pre-trained models rather than exploring fundamental architectural innovations from scratch.
The ATTM Grand Challenge establishes a fair-play benchmark to address this issue. All participants must train the core generative model strictly from scratch using a standardized, CC-licensed dataset of 457 hours derived from MTG-Jamendo. The focus is on algorithmic efficiency and musical intelligence rather than data scale or compute volume.
Key Principles of the Challenge
Core generative models must be trained from scratch
No pre-trained weights are allowed for the main music generation model.
Exclusive use of Jamendo Dataset
All training, fine-tuning, and data augmentation must exclusively use the provided Jamendo dataset.
- No External Data: Using any music datasets other than the Jamendo dataset is prohibited.
- No Synthetic Data from External Models: Using synthetic audio generated by external models (e.g., Suno, Udio) to augment the training set is considered data laundering and will lead to disqualification.
Auxiliary components may use public checkpoints, including:
- Audio tokenizers / autoencoders
- Audio Language Models (ALMs) for captioning
- Vocoders or audio enhancement models
Proprietary or non-reproducible models are strictly prohibited.
Auxiliary models (like ALMs or Vocoders) are for processing and enhancement only, and must not be used to bypass the data source restrictions.
Fully automatic generation only
No human-in-the-loop annotation, manual editing, or cherry-picking of samples is allowed.
Instrumental music only
All training data is processed to remove vocals, and generated outputs must be purely instrumental.
Standardized text prompts
Organizers will provide reference caption sets (generated by Music Flamingo and Qwen2-audio) to ensure consistent evaluation across teams, though teams may create their own captions using public ALMs.
All components must be declared, and organizers reserve the right to verify training logs and configurations.
Tracks
To encourage broad participation, ATTM is divided into two tracks:
Efficiency Track
- Maximum of 500M parameters for the core generative model
- Designed to encourage innovation in efficient architectures
- Suitable for student teams and resource-constrained labs
Performance Track
- No parameter limit
- Focuses on pushing the upper bound of performance under academic data constraints
- Suitable for teams exploring large or complex architectures
The core generative model generally refers to the main architecture responsible for text-to-music generation, excluding auxiliary components such as audio encoders, text encoders, or vocoders. The organizers reserve the right to make final determinations on what constitutes a core generative model. If you believe your architecture may be difficult to classify, please contact the organizers proactively to discuss and resolve any ambiguities before submission.
Teams may participate in one or both tracks. Each member can be in up to three teams (regardless of tracks), but can only be the first author for one team. Both tracks are evaluated with the same criteria (see the Evaluation Criteria section) but ranked independently.
Awards
We are proud to partner with Moises to offer cash awards to the best performing teams in this challenge.
Efficiency Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Performance Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Award Distribution Policy
- Full prizes will be awarded in each track when there are strong qualifying entries.
- The organizers reserve the right to combine or adjust awards if participation is lower than expected in one track or if submissions do not meet minimum quality standards.
Evaluation Criteria
All teams will generate 100 audio samples based on a hidden set of test prompts provided by the organizers.
The evaluation process is performed on these submitted samples.
Evaluation is conducted in two stages, balancing automated metrics with human judgment.
Phase 1: Objective Evaluation (Scorecard)
All submissions are first ranked using a composite score based on the following metrics:
- Audio Quality: Fréchet Audio Distance (FAD)
  Measures distributional similarity between generated audio and a hidden reference set.
- Semantic Alignment: CLAP Score
  Evaluates how well the generated audio matches the input text prompt.
- Concept Coverage Score (CCS / K/M Metric)
  - Each prompt contains M musical concepts (e.g., tempo, instrumentation, style).
  - Audio Language Models act as blind judges to detect whether each concept is present.
  - A score of K / M is assigned per prompt and averaged across the evaluation set.
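To make the CCS aggregation concrete, here is a minimal sketch of the K/M averaging described above. The data layout (one list of booleans per prompt, one boolean per concept as returned by an ALM judge) is an assumption for illustration; the official scoring code is defined by the organizers.

```python
def concept_coverage_score(judgments):
    """Average K/M over all prompts.

    judgments: list of lists of booleans, one inner list per prompt,
    one boolean per musical concept (True = judge detected the concept).
    """
    per_prompt = []
    for concepts in judgments:
        m = len(concepts)   # M: concepts mentioned in the prompt
        k = sum(concepts)   # K: concepts the ALM judge detected in the audio
        per_prompt.append(k / m)
    return sum(per_prompt) / len(per_prompt)

# Example: two prompts with three concepts each.
scores = concept_coverage_score([
    [True, True, False],  # K/M = 2/3
    [True, True, True],   # K/M = 3/3
])
```

A prompt with all concepts detected contributes 1.0; missing concepts lower that prompt's share of the final average.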
Submission constraints:
- Audio must be at least 10 seconds long
- Evaluation is performed only on the first 10 seconds for all teams
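Teams may want to sanity-check their files against these constraints before submitting. The sketch below (Python standard library only; the file name and helper function are illustrative, not part of any official tooling) validates a WAV file's sample rate and duration. MP3 files would need a third-party library and are not covered here.

```python
import wave

MIN_SECONDS = 10.0
EXPECTED_RATE = 44100  # 44.1 kHz, per the submission spec

def check_wav(path):
    """Return (ok, duration_seconds) for a submission WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return (rate == EXPECTED_RATE and duration >= MIN_SECONDS), duration

# Write a 10-second silent 16-bit mono WAV and validate it.
with wave.open("sample.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(EXPECTED_RATE)
    wf.writeframes(b"\x00\x00" * EXPECTED_RATE * 10)

ok, duration = check_wav("sample.wav")
```

Since only the first 10 seconds are evaluated, trimming longer files client-side is optional but keeps uploads small.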
Phase 2: Human Evaluation (MOS)
Based on Phase 1 rankings, the Top N teams per track advance to a formal Mean Opinion Score (MOS) study conducted by expert listeners.
N will be determined after the registration deadline based on the number of participants in each track.
Evaluation dimensions include:
- Audio Quality
- Musicality (rhythmic stability, harmonic progressions, phrasing)
- Prompt Adherence
Test Prompt Format
Understanding the test prompt format can help teams refine their training approaches and caption augmentation strategies to achieve better alignment during evaluation.
Prompt generation process
Test prompts will be generated using the Qwen2-Audio-7B-Instruct model (Qwen/Qwen2-Audio-7B-Instruct on Hugging Face). The model will be prompted with several musical concept tags, including genre, instrumentation, mood, and other musical attributes, then asked to generate a natural music caption that incorporates these tags.
Format Specifications
- Test Set Size: 100 prompts
- Prompt Length: Under 100 words
- Prompt Style: Caption-like descriptions (e.g., "An upbeat EDM track with pulsing synths and driving bass") rather than instruction-like commands (e.g., "Generate an EDM song with synths and bass")
Example Test Prompts
Below are representative examples of the test prompt style:
- Tags: rock, electric guitar, energetic
  Caption: "An energetic rock track driven by a bold electric guitar, pulsing with intensity and raw power."
- Tags: jazz, piano, mellow
  Caption: "A mellow jazz piece featuring soothing piano melodies, calm and relaxed in tone."
- Tags: electronic, synthesizer, atmospheric
  Caption: "A smooth, atmospheric electronic track driven by rich synthesizer layers, creating a serene and immersive soundscape."
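The exact instruction given to the captioning model has not been published; the template below is purely hypothetical, sketching how concept tags might be combined into a caption-style request for teams experimenting with their own caption augmentation.

```python
def build_caption_request(tags):
    """Hypothetical instruction template: turn concept tags into a request
    for a caption-like (not command-like) music description."""
    tag_list = ", ".join(tags)
    return (
        f"Musical concepts: {tag_list}. "
        "Write one natural, caption-like description of a music track "
        "that incorporates all of these concepts, under 100 words."
    )

request = build_caption_request(["rock", "electric guitar", "energetic"])
```

Teams can feed such requests to a public ALM of their choice, keeping in mind that only the provided Jamendo audio may be described.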
How to Participate
Registration
Teams must register before March 20, 2026 to indicate their intent to participate. Registration helps organizers prepare evaluation resources and does not require a completed system.
Submission
Teams must submit their final entries before April 23, 2026.
Final submissions must include:
- Generated audio for the 100 hidden test prompts
  - Format: WAV or MP3
  - Sample rate: 44.1 kHz
  - Duration: at least 10 seconds; only the first 10 seconds are used for evaluation
  - Teams may submit up to two final results per team; only the higher-scoring submission counts as the team's final score
- Model code for parameter verification and reproducibility
- A short Grand Challenge paper (up to 4 pages)
  Grand Challenge papers are required only from finalist teams (announced April 30, 2026) and must be submitted by May 15, 2026. Papers from non-finalist teams are not required but are still encouraged and welcome.
All submissions:
- Must be generated fully automatically
- Must use one forward generation per prompt
- Will be anonymized during evaluation
Detailed submission instructions will be released at the official launch.
Top teams will be invited to present their work at the ICME 2026 Grand Challenge session.
Dataset & Code
Training Dataset
457 hours of CC-licensed instrumental music derived from MTG-Jamendo.
Download Instructions
- Visit github.com/MTG/mtg-jamendo-dataset
- Follow the README instructions
- Download the raw_30s subset
Reference Caption Sets
Provided to help you get started. Teams may extend these using MTG-Jamendo tags, other annotations, and
LLM-based augmentations to create richer training data.
Access Resources
- Reference captions: github.com/ntu-musicailab/ICME26-ATTM-GC-MeanAudio/tree/main/data/captions
- Captioning pipeline: github.com/ntu-musicailab/ICME26-ATTM-GC-ALM-captioning-pipeline
Preprocessing Code
Provided for vocal separation to ensure instrumental-only data
Access Code
Find the preprocessing code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-Preprocessing
Baseline Code
Provided to lower the entry barrier and help teams get started quickly
Access Code
Find the baseline code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-MeanAudio
Important Dates
- Feb 10, 2026 Official launch
- Mar 20, 2026 Registration deadline
- Mar 30, 2026 Dry-run submission deadline (pipeline verification)
  At this stage, we will release a dry-run set of prompts. Teams may optionally run inference and submit results to receive Phase 1 evaluation scores using the same criteria as the final submission. This provides mid-contest feedback to reflect on your strategy and ensures you can run inference under limited time and submit results in the correct format.
- Apr 20, 2026 Final test prompts released
- Apr 23, 2026 Final audio submission deadline (72-hour window)
- Apr 30, 2026 Finalists announcement
- May 15, 2026 Grand Challenge paper submission deadline
- May 22, 2026 Final MOS results & announcement of winners & paper acceptance notification
- May 30, 2026 Camera-ready and author registration deadline
Contact Information
Dr. Yi-Hsuan (Eric) Yang
National Taiwan University, AI Center of Research Excellence
Dr. Hung-Yi Lee
National Taiwan University, AI Center of Research Excellence