Latest Updates
- NEW! Important Challenge Rules Updates! (Mar 1, 2026)
We've updated several important rules for the challenge:
- Team Participation: Each member can now be in up to three teams (regardless of tracks), but can only be the first author for one team. See the Tracks section for details.
- Final Submissions: Teams may submit up to two final results per team. Only the higher-scoring one will count as your team's final score. See the How to Participate section for details.
- Dry-run Submissions: At the dry-run stage, we will release a dry-run set of prompts. Teams may optionally submit results to receive Phase 1 evaluation scores as mid-contest feedback. See the Important Dates section for details.
- Reference Caption Sets & Captioning Pipeline Released! (Feb 25, 2026)
Reference caption sets generated by Music Flamingo and Qwen2-audio are now available! We've also released the captioning pipeline to encourage data augmentation exploration. Visit the Dataset & Code section to access both resources.
- Test Prompt Format Details Released! (Feb 12, 2026)
We've added detailed information about the test prompt format, including generation process, specifications, and example prompts. Check out the Evaluation Criteria section to learn more and refine your training strategy!
- Baseline Code Released! (Feb 6, 2026)
The baseline code is now available! Check out the Dataset & Code section to access the MeanAudio baseline implementation and get started.
- Registration Now Open! (Feb 5, 2026)
Team registration is now available! Register by March 20, 2026 via the How to Participate section. Don't miss this opportunity to join the ATTM Grand Challenge!
- Dataset Download & Preprocessing Code Released! (Feb 5, 2026)
The training dataset download instructions and preprocessing code are now available in the Dataset & Code section. Please note that downloading and processing the dataset takes considerable time. We strongly recommend starting as soon as you decide to participate.
Overview
The Academic Text-to-Music Generation (ATTM) Grand Challenge is a research-oriented competition designed to advance text-to-music generation under fair, transparent, and reproducible conditions.
Unlike many recent text-to-music systems that rely on proprietary datasets and industrial-scale compute, ATTM focuses on training generative models from scratch using a standardized, CC-licensed dataset. The goal is to shift attention away from data scale and pre-trained black-box models, and toward algorithmic efficiency, model design, and musical intelligence.
This challenge is hosted as part of ICME 2026 and welcomes participation from students, academic labs, and researchers worldwide.
Grand Challenge Description
State-of-the-art Text-to-Music (TTM) models such as MusicLM and MusicGen have relied on proprietary datasets and massive industrial compute resources. Even recent open efforts require enormous computational power, creating a significant barrier for most academic researchers. As a result, academic labs are often limited to fine-tuning pre-trained models rather than exploring fundamental architectural innovations from scratch.
The ATTM Grand Challenge establishes a fair-play benchmark to address this issue. All participants must train the core generative model strictly from scratch using a standardized, CC-licensed dataset of 457 hours derived from MTG-Jamendo. The focus is on algorithmic efficiency and musical intelligence rather than data scale or compute volume.
Key Principles of the Challenge
Core generative models must be trained from scratch
No pre-trained weights are allowed for the main music generation model.
Exclusive use of Jamendo Dataset
All training, fine-tuning, and data augmentation must exclusively use the provided Jamendo dataset.
- No External Data: Using any music datasets other than the Jamendo dataset is prohibited.
- No Synthetic Data from External Models: Using synthetic audio generated by external models (e.g., Suno, Udio) to augment the training set is considered data laundering and will lead to disqualification.
Auxiliary components may use public checkpoints, including:
- Audio tokenizers / autoencoders
- Audio Language Models (ALMs) for captioning
- Vocoders or audio enhancement models
Proprietary or non-reproducible models are strictly prohibited.
Auxiliary models (like ALMs or Vocoders) are for processing and enhancement only, and must not be used to bypass the data source restrictions.
Fully automatic generation only
No human-in-the-loop annotation, manual editing, or cherry-picking of samples is allowed.
Instrumental music only
All training data is processed to remove vocals, and generated outputs must be purely instrumental.
Standardized text prompts
Organizers will provide reference caption sets (generated by Music Flamingo and Qwen2-audio) to ensure consistent evaluation across teams, though teams may create their own captions using public ALMs.
All components must be declared, and organizers reserve the right to verify training logs and configurations.
Tracks
To encourage broad participation, ATTM is divided into two tracks:
Efficiency Track
- Maximum of 500M parameters for the core generative model
- Designed to encourage innovation in efficient architectures
- Suitable for student teams and resource-constrained labs
Performance Track
- No parameter limit
- Focuses on pushing the upper bound of performance under academic data constraints
- Suitable for teams exploring large or complex architectures
The core generative model generally refers to the main architecture responsible for text-to-music generation, excluding auxiliary components such as audio encoders, text encoders, or vocoders. The organizers reserve the right to make final determinations on what constitutes a core generative model. If you believe your architecture may be difficult to classify, please contact the organizers proactively to discuss and resolve any ambiguities before submission.
Teams may participate in one or both tracks. Each member can be in up to three teams (regardless of tracks), but can only be the first author for one team. Both tracks are evaluated with the same criteria (see the Evaluation Criteria section) but ranked independently.
Awards
We are proud to partner with Moises to offer cash awards to the best performing teams in this challenge.
Efficiency Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Performance Track
- First Prize: $1,000 USD
- Second Prize: $500 USD
Award Distribution Policy
- Full prizes will be awarded in each track when there are strong qualifying entries.
- The organizers reserve the right to combine or adjust awards if participation is lower than expected in one track or if submissions do not meet minimum quality standards.
Evaluation Criteria
All teams will generate 100 audio samples based on a hidden set of test prompts provided by the organizers.
The evaluation process is performed on these submitted samples.
Evaluation is conducted in two stages, balancing automated metrics with human judgment.
Phase 1: Objective Evaluation (Scorecard)
All submissions are first ranked using a composite score based on the following metrics:
- Audio Quality: Fréchet Audio Distance (FAD)
  Measures distributional similarity between generated audio and a hidden reference set.
- Semantic Alignment: CLAP Score
  Evaluates how well the generated audio matches the input text prompt.
- Concept Coverage Score (CCS / K/M Metric)
  - Each prompt contains M musical concepts (e.g., tempo, instrumentation, style).
  - Audio Language Models act as blind judges to detect whether each concept is present.
  - A score of K / M is assigned per prompt and averaged across the evaluation set.
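To make the CCS aggregation concrete, here is a minimal sketch of the K/M averaging described above. The data layout (one list of booleans per prompt, one boolean per concept as returned by an ALM judge) is an assumption for illustration; the official scoring code is defined by the organizers.

```python
def concept_coverage_score(judgments):
    """Average K/M over all prompts.

    judgments: list of lists of booleans, one inner list per prompt,
    one boolean per musical concept (True = judge detected the concept).
    """
    per_prompt = []
    for concepts in judgments:
        m = len(concepts)   # M: concepts mentioned in the prompt
        k = sum(concepts)   # K: concepts the ALM judge detected in the audio
        per_prompt.append(k / m)
    return sum(per_prompt) / len(per_prompt)

# Example: two prompts with three concepts each.
scores = concept_coverage_score([
    [True, True, False],  # K/M = 2/3
    [True, True, True],   # K/M = 3/3
])
```

A prompt with all concepts detected contributes 1.0; missing concepts lower that prompt's share of the final average.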
Submission constraints:
- Audio must be at least 10 seconds long
- Evaluation is performed only on the first 10 seconds for all teams
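Teams may want to sanity-check their files against these constraints before submitting. The sketch below (Python standard library only; the file name and helper function are illustrative, not part of any official tooling) validates a WAV file's sample rate and duration. MP3 files would need a third-party library and are not covered here.

```python
import wave

MIN_SECONDS = 10.0
EXPECTED_RATE = 44100  # 44.1 kHz, per the submission spec

def check_wav(path):
    """Return (ok, duration_seconds) for a submission WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return (rate == EXPECTED_RATE and duration >= MIN_SECONDS), duration

# Write a 10-second silent 16-bit mono WAV and validate it.
with wave.open("sample.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(EXPECTED_RATE)
    wf.writeframes(b"\x00\x00" * EXPECTED_RATE * 10)

ok, duration = check_wav("sample.wav")
```

Since only the first 10 seconds are evaluated, trimming longer files client-side is optional but keeps uploads small.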
Phase 2: Human Evaluation (MOS)
Based on Phase 1 rankings, the Top N teams per track advance to a formal Mean Opinion Score (MOS) study conducted by expert listeners.
N will be determined after the registration deadline based on the number of participants in each track.
Evaluation dimensions include:
- Audio Quality
- Musicality (rhythmic stability, harmonic progressions, phrasing)
- Prompt Adherence
Test Prompt Format
Understanding the test prompt format can help teams refine their training approaches and caption augmentation strategies to achieve better alignment during evaluation.
Prompt generation process
Test prompts will be generated using the Qwen2-Audio-7B-Instruct model (Qwen/Qwen2-Audio-7B-Instruct on Hugging Face). The model will be prompted with several musical concept tags, including genre, instrumentation, mood, and other musical attributes, then asked to generate a natural music caption that incorporates these tags.
Format Specifications
- Test Set Size: 100 prompts
- Prompt Length: Under 100 words
- Prompt Style: Caption-like descriptions (e.g., "An upbeat EDM track with pulsing synths and driving bass") rather than instruction-like commands (e.g., "Generate an EDM song with synths and bass")
Example Test Prompts
Below are representative examples of the test prompt style:
- Tags: rock, electric guitar, energetic
  Caption: "An energetic rock track driven by a bold electric guitar, pulsing with intensity and raw power."
- Tags: jazz, piano, mellow
  Caption: "A mellow jazz piece featuring soothing piano melodies, calm and relaxed in tone."
- Tags: electronic, synthesizer, atmospheric
  Caption: "A smooth, atmospheric electronic track driven by rich synthesizer layers, creating a serene and immersive soundscape."
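The exact instruction given to the captioning model has not been published; the template below is purely hypothetical, sketching how concept tags might be combined into a caption-style request for teams experimenting with their own caption augmentation.

```python
def build_caption_request(tags):
    """Hypothetical instruction template: turn concept tags into a request
    for a caption-like (not command-like) music description."""
    tag_list = ", ".join(tags)
    return (
        f"Musical concepts: {tag_list}. "
        "Write one natural, caption-like description of a music track "
        "that incorporates all of these concepts, under 100 words."
    )

request = build_caption_request(["rock", "electric guitar", "energetic"])
```

Teams can feed such requests to a public ALM of their choice, keeping in mind that only the provided Jamendo audio may be described.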
How to Participate
Registration
Teams must register before March 20, 2026 to indicate their intent to participate. Registration helps organizers prepare evaluation resources and does not require a completed system.
Submission
Teams must submit their final entries before April 23, 2026.
Final submissions must include:
- Generated audio for the 100 hidden test prompts
  - Format: WAV or MP3
  - Sample rate: 44.1 kHz
  - Duration: at least 10 seconds; only the first 10 seconds are used for evaluation
  - Teams may submit up to two final results per team; only the higher-scoring submission counts as the team's final score
- Model code for parameter verification and reproducibility
- A short Grand Challenge paper (up to 4 pages)
  Grand Challenge papers are required only from finalist teams (announced April 30, 2026) and must be submitted by May 15, 2026. Papers from non-finalist teams are not required but are still encouraged and welcome.
All submissions:
- Must be generated fully automatically
- Must use one forward generation per prompt
- Will be anonymized during evaluation
Detailed submission instructions will be released at the official launch.
Top teams will be invited to present their work at the ICME 2026 Grand Challenge session.
Dataset & Code
Training Dataset
457 hours of CC-licensed instrumental music derived from MTG-Jamendo.
Download Instructions
- Visit github.com/MTG/mtg-jamendo-dataset
- Follow the README instructions
- Download the raw_30s subset
Reference Caption Sets
Provided to help you get started. Teams may extend these using MTG-Jamendo tags, other annotations, and
LLM-based augmentations to create richer training data.
Access Resources
- Reference captions: github.com/ntu-musicailab/ICME26-ATTM-GC-MeanAudio/tree/main/data/captions
- Captioning pipeline: github.com/ntu-musicailab/ICME26-ATTM-GC-ALM-captioning-pipeline
Preprocessing Code
Provided for vocal separation to ensure instrumental-only data
Access Code
Find the preprocessing code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-Preprocessing
Baseline Code
Provided to lower the entry barrier and help teams get started quickly
Access Code
Find the baseline code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-MeanAudio
Important Dates
- Feb 10, 2026 Official launch
- Mar 20, 2026 Registration deadline
- Mar 30, 2026 Dry-run submission deadline (pipeline verification)
  At this stage, we will release a dry-run set of prompts. Teams may optionally run inference and submit results to receive Phase 1 evaluation scores using the same criteria as the final submission. This provides mid-contest feedback to reflect on your strategy and ensures you can run inference under limited time and submit results in the correct format.
- Apr 20, 2026 Final test prompts released
- Apr 23, 2026 Final audio submission deadline (72-hour window)
- Apr 30, 2026 Finalists announcement
- May 15, 2026 Grand Challenge paper submission deadline
- May 22, 2026 Final MOS results & announcement of winners & paper acceptance notification
- May 30, 2026 Camera-ready and author registration deadline
Contact Information
Dr. Yi-Hsuan (Eric) Yang
National Taiwan University, AI Center of Research Excellence
Dr. Hung-Yi Lee
National Taiwan University, AI Center of Research Excellence