ICME 2026 Grand Challenge on Academic Text-to-Music Generation

National Taiwan University · University of Michigan · ICME 2026 · Moises

📢 Latest Updates

  • NEW! Final Evaluation Results Released! (Apr 29, 2026)
    The final evaluation results are now available, including FAD, CLAP, CCS, and Final Ranking for all submissions. See the Final Evaluation Results section for details.
  • Paper Submission Guideline Added! (Apr 24, 2026)
    A dedicated Grand Challenge paper submission guideline is now available, including report format, eligibility, and CMT3 submission information. See the How to Participate section for details.
  • Important Dates Timeline Updated! (Apr 24, 2026)
    The Important Dates timeline has been updated with revised deadlines for technical paper submission, paper acceptance notification, camera-ready submission, final MOS results announcement, and author registration. See the Important Dates section for full details.
  • Dry-run Results Table Released! (Apr 15, 2026)
    De-identified dry-run results are now available for reference, including Submission ID, FAD, and CLAP scores. See the Dry-run Results section for details.
  • More Detailed Dry-run Timeline Released! (Mar 25, 2026)
    We have provided a more detailed timeline for the dry-run phase:
    • Mar 30, 2026: Dry-run prompts released
    • Apr 10, 2026: Dry-run submission deadline (pipeline verification)
    See the Important Dates section for full details.
  • Important Challenge Rules Updates! (Mar 1, 2026)
    We've updated several important rules for the challenge:
    • Team Participation: Each member can now be in up to three teams (regardless of tracks), but can only be the first author for one team. See the Tracks section for details.
    • Final Submissions: Teams may submit up to two final results per team. Only the higher-scoring one will count as your team's final score. See the How to Participate section for details.
    • Dry-run Submissions: At the dry-run stage, we will release a dry-run set of prompts. Teams may optionally submit results to receive Phase 1 evaluation scores as mid-contest feedback. See the Important Dates section for details.
  • Reference Caption Sets & Captioning Pipeline Released! (Feb 25, 2026)
    Reference caption sets generated by Music Flamingo and Qwen2-Audio are now available! We've also released the captioning pipeline to encourage data augmentation exploration. Visit the Dataset & Code section to access both resources.
  • Test Prompt Format Details Released! (Feb 12, 2026)
    We've added detailed information about the test prompt format, including generation process, specifications, and example prompts. Check out the Evaluation Criteria section to learn more and refine your training strategy!
  • Baseline Code Released! (Feb 6, 2026)
    The baseline code is now available! Check out the Dataset & Code section to access the MeanAudio baseline implementation and get started.
  • Registration Now Open! (Feb 5, 2026)
    Team registration is now available! Register by March 20, 2026 via the How to Participate section. Don't miss this opportunity to join the ATTM Grand Challenge!
  • Dataset Download & Preprocessing Code Released! (Feb 5, 2026)
    The training dataset download instructions and preprocessing code are now available in the Dataset & Code section. Please note that downloading and processing the dataset takes considerable time. We strongly recommend starting as soon as possible once you have decided to participate.

Final Evaluation Results

De-identified results of the final evaluation phase.

  • Both FAD and CLAP are computed with the CLAP-Laion-Music checkpoint music_audioset_epoch_15_esc_90.14.pt. The reference set used for FAD is a hidden instrumental subset from Jamendo.
  • CCS (Concept Coverage Score) measures how well generated audio covers the musical concepts specified in each prompt, evaluated using a Large Audio Language Model (LALM).
  • Final Ranking is determined by a Borda count aggregation across FAD, CLAP, and CCS scores.
  • Submission IDs starting with e are from the Efficiency Track; those starting with p are from the Performance Track.
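
As a rough illustration of the Borda count aggregation mentioned above, the sketch below (our own Python illustration; the organizers' exact aggregation code is not published) ranks submissions on each metric, with FAD sorted ascending and CLAP/CCS descending, and sums the resulting Borda points:

```python
# Illustrative Borda count aggregation over FAD, CLAP, and CCS.
# Hypothetical helper; tie handling and the official implementation may differ.

def borda_ranking(scores):
    """scores: dict mapping submission_id -> (fad, clap, ccs)."""
    ids = list(scores)
    n = len(ids)
    points = {sid: 0 for sid in ids}

    # FAD: lower is better; CLAP and CCS: higher is better.
    metrics = [
        (lambda s: s[0], False),   # FAD, ascending
        (lambda s: s[1], True),    # CLAP, descending
        (lambda s: s[2], True),    # CCS, descending
    ]
    for key, descending in metrics:
        ranked = sorted(ids, key=lambda sid: key(scores[sid]), reverse=descending)
        for rank, sid in enumerate(ranked):
            points[sid] += n - 1 - rank   # best submission on a metric gets n-1 points

    # Final ranking: most Borda points first.
    return sorted(ids, key=lambda sid: points[sid], reverse=True)

example = {"e07": (0.417, 0.261, 0.867), "e05": (0.487, 0.305, 0.800)}
print(borda_ranking(example))
```
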
Submission ID  FAD↓  CLAP↑  CCS↑  Ranking (within track)  Ranking (across track)
e00 0.556 0.310 0.796 5 6
e01 0.577 0.338 0.863 2 2
e02 0.498 0.270 0.763 6 8
e03 0.518 0.251 0.763 10 12
e04 0.574 0.195 0.833 7 9
e05 0.487 0.305 0.800 2 2
e06 0.667 0.268 0.808 7 9
e07 0.417 0.261 0.867 1 1
e08 0.495 0.295 0.804 4 2
e09 0.646 0.263 0.767 10 12
e10 0.482 0.163 0.738 9 11
e11 0.892 0.097 0.675 12 16
p00 0.557 0.311 0.796 2 6
p05 0.514 0.306 0.800 1 5
p09 0.646 0.260 0.767 4 15
p10 0.500 0.171 0.721 3 14

Dry-run Results

De-identified results of the dry-run phase.

  • Both FAD and CLAP are computed with the CLAP-Laion-Music checkpoint music_audioset_epoch_15_esc_90.14.pt.
  • The reference set used for FAD in the dry-run phase is the Song Describer Dataset. Final evaluation will use a hidden reference set.
  • Concept Coverage Score (CCS) is not presented during the dry-run phase.
Submission ID FAD↓ CLAP↑
0 0.4040 0.3231
1 0.5455 0.3244
2 0.5181 0.3100
3 0.5547 0.3176
4 0.5127 0.3092
5 0.6583 0.2742
6 0.7164 0.2119
7 0.6461 0.2642
8 0.6364 0.2598
9 0.6890 0.2200
10 0.5003 0.1950
11 0.4998 0.1866
12 0.6516 0.2807
13 0.6711 0.2827

Overview

The Academic Text-to-Music Generation (ATTM) Grand Challenge is a research-oriented competition designed to advance text-to-music generation under fair, transparent, and reproducible conditions.

Unlike many recent text-to-music systems that rely on proprietary datasets and industrial-scale compute, ATTM focuses on training generative models from scratch using a standardized, CC-licensed dataset. The goal is to shift attention away from data scale and pre-trained black-box models, and toward algorithmic efficiency, model design, and musical intelligence.

This challenge is hosted as part of ICME 2026 and welcomes participation from students, academic labs, and researchers worldwide.

Grand Challenge Description

State-of-the-art Text-to-Music (TTM) models such as MusicLM and MusicGen have relied on proprietary datasets and massive industrial compute resources. Even recent open efforts require enormous computational power, creating a significant barrier for most academic researchers. As a result, academic labs are often limited to fine-tuning pre-trained models rather than exploring fundamental architectural innovations from scratch.

The ATTM Grand Challenge establishes a fair-play benchmark to address this issue. All participants must train the core generative model strictly from scratch using a standardized, CC-licensed dataset of 457 hours derived from MTG-Jamendo. The focus is on algorithmic efficiency and musical intelligence rather than data scale or compute volume.

Key Principles of the Challenge

Core generative models must be trained from scratch

No pre-trained weights are allowed for the main music generation model.

Exclusive use of Jamendo Dataset

All training, fine-tuning, and data augmentation must exclusively use the provided Jamendo dataset.

  • No External Data: Using any music datasets other than the Jamendo dataset is prohibited.
  • No Synthetic Data from External Models: Using synthetic audio generated by external models (e.g., Suno, Udio) to augment the training set is considered data laundering and will lead to disqualification.

Auxiliary components may use public checkpoints, including:

  • Audio tokenizers / autoencoders
  • Audio Language Models (ALMs) for captioning
  • Vocoders or audio enhancement models

Proprietary or non-reproducible models are strictly prohibited.

Auxiliary models (like ALMs or Vocoders) are for processing and enhancement only, and must not be used to bypass the data source restrictions.

Fully automatic generation only

No human-in-the-loop annotation, manual editing, or cherry-picking of samples is allowed.

Instrumental music only

All training data is processed to remove vocals, and generated outputs must be purely instrumental.

Standardized text prompts

Organizers will provide reference caption sets (generated by Music Flamingo and Qwen2-Audio) to ensure consistent evaluation across teams, though teams may create their own captions using public ALMs.

All components must be declared, and the organizers reserve the right to verify training logs and configurations.

Tracks

To encourage broad participation, ATTM is divided into two tracks:

Efficiency Track

  • Maximum of 500M parameters for the core generative model
  • Designed to encourage innovation in efficient architectures
  • Suitable for student teams and resource-constrained labs

Performance Track

  • No parameter limit
  • Focuses on pushing the upper bound of performance under academic data constraints
  • Suitable for teams exploring large or complex architectures

The core generative model generally refers to the main architecture responsible for text-to-music generation, excluding auxiliary components such as audio encoders, text encoders, or vocoders. The organizers reserve the right to make final determinations on what constitutes a core generative model. If you believe your architecture may be difficult to classify, please contact the organizers proactively to discuss and resolve any ambiguities before submission.
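
As a quick self-check against the Efficiency Track's 500M-parameter cap, teams can count the parameters of the core generative module alone, excluding auxiliary encoders, tokenizers, and vocoders. A minimal PyTorch sketch (the module below is a stand-in for your own architecture, not a provided reference):

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Total number of parameters in a module (trainable and frozen)."""
    return sum(p.numel() for p in module.parameters())

# Hypothetical core generative model; only this part counts toward the cap.
core_generator = nn.Transformer(d_model=512, num_encoder_layers=6)
print(f"core generator: {count_parameters(core_generator) / 1e6:.1f}M parameters")
assert count_parameters(core_generator) <= 500_000_000, "exceeds Efficiency Track cap"
```
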

Teams may participate in one or both tracks. Each member can be in up to three teams (regardless of track), but can only be the first author for one team. Both tracks are evaluated with the same criteria (as specified in the Evaluation Criteria section), but ranked independently.

Awards

Moises

We are proud to partner with Moises to offer cash awards to the best performing teams in this challenge.

Efficiency Track

  • First Prize: $1,000 USD
  • Second Prize: $500 USD

Performance Track

  • First Prize: $1,000 USD
  • Second Prize: $500 USD

Award Distribution Policy

  • Full prizes will be awarded in each track when there are strong qualifying entries.
  • The organizers reserve the right to combine or adjust awards if participation is lower than expected in one track or if submissions do not meet minimum quality standards.

Evaluation Criteria

All teams will generate 100 audio samples based on a hidden set of test prompts provided by the organizers. The evaluation process is performed on these submitted samples.
Evaluation is conducted in two stages, balancing automated metrics with human judgment.

Phase 1: Objective Evaluation (Scorecard)

All submissions are first ranked using a composite score based on the following metrics:

  • Audio Quality — Fréchet Audio Distance (FAD)

    Measures distributional similarity between generated audio and a hidden reference set.
    Computed using the CLAP-Laion-Music checkpoint music_audioset_epoch_15_esc_90.14.pt as the embedder.

  • Semantic Alignment — CLAP Score

    Evaluates how well the generated audio matches the input text prompt.
    Computed using the CLAP-Laion-Music checkpoint music_audioset_epoch_15_esc_90.14.pt.

  • Concept Coverage Score (CCS / K–M Metric)
    • Each prompt contains M musical concepts (e.g., tempo, instrumentation, style, genre, mood/theme).
    • Audio Language Models act as blind judges to detect whether each concept is present.
    • A score of K / M is assigned per prompt and averaged across the evaluation set.
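
For reference, a rough sketch of how the FAD and CLAP scores could be reproduced locally with the laion_clap package and the checkpoint named above (our own illustration with assumed file lists; since the final reference set is hidden, local FAD can only use a proxy reference such as the dry-run set, and the official evaluation scripts may differ in details such as resampling and batching):

```python
import numpy as np
import laion_clap
from scipy import linalg

# Load the CLAP-Laion-Music checkpoint used for both FAD and the CLAP score.
model = laion_clap.CLAP_Module(enable_fusion=False, amodel="HTSAT-base")
model.load_ckpt("music_audioset_epoch_15_esc_90.14.pt")

def clap_score(audio_files, prompts):
    """Mean cosine similarity between paired audio and prompt embeddings."""
    a = model.get_audio_embedding_from_filelist(x=audio_files, use_tensor=False)
    t = model.get_text_embedding(prompts, use_tensor=False)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * t, axis=1)))

def fad(generated_files, reference_files):
    """Fréchet distance between Gaussians fit to CLAP embeddings of two sets."""
    e_g = model.get_audio_embedding_from_filelist(x=generated_files, use_tensor=False)
    e_r = model.get_audio_embedding_from_filelist(x=reference_files, use_tensor=False)
    mu_g, mu_r = e_g.mean(axis=0), e_r.mean(axis=0)
    cov_g, cov_r = np.cov(e_g, rowvar=False), np.cov(e_r, rowvar=False)
    covmean = linalg.sqrtm(cov_g @ cov_r).real
    return float(np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2 * covmean))
```
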

Submission constraints:

  • Audio must be at least 10 seconds long
  • Evaluation is performed only on the first 10 seconds for all teams

Phase 2: Human Evaluation (MOS)

Based on Phase 1 rankings, the Top N teams per track advance to a formal Mean Opinion Score (MOS) study conducted by expert listeners.

N will be determined after the registration deadline based on the number of participants in each track.

Evaluation dimensions include:

  1. Audio Quality
  2. Musicality (rhythmic stability, harmonic progressions, phrasing)
  3. Prompt Adherence

Test Prompt Format

Understanding the test prompt format can help teams refine their training approaches and caption augmentation strategies to achieve better alignment during evaluation.

Prompt generation process

Test prompts will be generated using the Qwen2-Audio-7B-Instruct model (Qwen/Qwen2-Audio-7B-Instruct on Hugging Face). The model will be prompted with several musical concept tags including genre, instrumentation, mood, and other musical attributes, then asked to generate a natural music caption that incorporates these tags.
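
To illustrate how such tag-conditioned captions might be produced, below is a minimal text-only sketch using the transformers integration of Qwen2-Audio-7B-Instruct. The exact instruction wording used by the organizers is not published, so the prompt here is a placeholder of our own:

```python
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Placeholder instruction; the organizers' actual prompt is not published.
tags = ["rock", "electric guitar", "energetic"]
conversation = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "Write one natural music caption (under 100 words) that "
                f"incorporates all of these concepts: {', '.join(tags)}.",
    }],
}]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)  # text-only, no audio input
output_ids = model.generate(**inputs, max_new_tokens=100)
caption = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```
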

Format Specifications

  • Test Set Size: 100 prompts
  • Prompt Length: Under 100 words
  • Prompt Style: Caption-like descriptions (e.g., "An upbeat EDM track with pulsing synths and driving bass") rather than instruction-like commands (e.g., "Generate an EDM song with synths and bass")

Example Test Prompts

Below are representative examples of the test prompt style:

  • Tags: rock, electric guitar, energetic
    Caption: "An energetic rock track driven by a bold electric guitar, pulsing with intensity and raw power."
  • Tags: jazz, piano, mellow
    Caption: "A mellow jazz piece featuring soothing piano melodies, calm and relaxed in tone."
  • Tags: electronic, synthesizer, atmospheric
    Caption: "A smooth, atmospheric electronic track driven by rich synthesizer layers, creating a serene and immersive soundscape."

How to Participate

Registration

Teams must register before March 20, 2026 (AoE, UTC-12) to indicate their intent to participate. Registration helps organizers prepare evaluation resources and does not require a completed system.

Submission

Teams must submit their final entries before April 23, 2026 (AoE, UTC-12).
Final submissions must include:

  • Generated audio for 100 hidden test prompts
    • Format: WAV or MP3
    • Sample rate: 44.1 kHz
    • Duration: at least 10 seconds; only the first 10 seconds are used for evaluation
    • Teams may submit up to two final results. Only the higher-scoring one will be used as your team's final score.
  • Model code for parameter verification and reproducibility
  • A Grand Challenge technical report (up to 6 pages, ICME format)

    See the paper submission guideline below for eligibility and ICME submission requirements.

All submissions:

  • Must be generated fully automatically
  • Must use one forward generation per prompt
  • Will be anonymized during evaluation
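
Before uploading, it may be worth sanity-checking generated files against the format requirements above. A minimal sketch using the soundfile package (our own helper, not an official checker; reading MP3 requires a libsndfile build with MP3 support):

```python
from pathlib import Path
import soundfile as sf

def check_submission(folder):
    """Flag files that violate the stated sample-rate and duration requirements."""
    problems = []
    for path in sorted(Path(folder).glob("*")):
        if path.suffix.lower() not in {".wav", ".mp3"}:
            continue
        info = sf.info(str(path))
        if info.samplerate != 44100:
            problems.append(f"{path.name}: sample rate {info.samplerate}, expected 44100 Hz")
        if info.frames / info.samplerate < 10.0:
            problems.append(f"{path.name}: shorter than 10 seconds")
    return problems

for issue in check_submission("submission/"):
    print(issue)
```
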

Detailed submission instructions will be released at the official launch.
Top teams will be invited to present their work at the ICME 2026 Grand Challenge session.

Grand Challenge Paper Submission Guidelines

The ICME 2026 Grand Challenge paper submission system is active on CMT3: https://cmt3.research.microsoft.com/IEEEICMEW2026.

Formatting and Submission

  • Technical reports must follow the same format as regular ICME conference papers.
  • Maximum length is 6 pages, including text, figures, and references.
  • Author instructions and paper templates: ICME 2026 Author Information.

Proceedings and Indexing Eligibility

  • Reports are eligible for IEEE Xplore inclusion and potential EI Compendex indexing if uploaded in CMT3 before the submission deadline.
  • Eligibility also requires full registration and payment by the camera-ready deadline, per ICME 2026 Conference Policies.
  • Participants who do not submit a report may still receive a participation certificate from the grand challenge (no fee required), but will not be eligible for proceedings/indexing.

GC Paper Dates

For the latest submission, acceptance, and camera-ready deadlines, see Important Dates.

Dataset & Code

Training Dataset

457 hours of CC-licensed instrumental music derived from MTG-Jamendo.

Download Instructions

  1. Visit github.com/MTG/mtg-jamendo-dataset
  2. Follow the README instructions
  3. Download the raw_30s subset

Reference Caption Sets

Provided to help you get started. Teams may extend these using MTG-Jamendo tags, other annotations, and LLM-based augmentations to create richer training data.
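
As one simple form of augmentation, caption-like prompts can be templated directly from MTG-Jamendo tag annotations before (or instead of) LLM rewriting. A toy sketch, where the tag fields are hypothetical examples rather than the official schema:

```python
import random

# Toy template-based caption augmentation from MTG-Jamendo-style tags.
TEMPLATES = [
    "A {mood} {genre} track featuring {instrument}.",
    "An instrumental {genre} piece driven by {instrument}, {mood} in tone.",
    "{mood} {genre} music built around {instrument}.",
]

def caption_from_tags(genre, instrument, mood):
    caption = random.choice(TEMPLATES).format(genre=genre, instrument=instrument, mood=mood)
    return caption[0].upper() + caption[1:]

print(caption_from_tags("jazz", "piano", "mellow"))
```
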

Preprocessing Code

Provided for vocal separation to ensure instrumental-only data

Access Code

Find the preprocessing code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-Preprocessing
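
The linked repository is the authoritative preprocessing pipeline. As a rough idea of what vocal separation involves, one common approach is a two-stem split with a source-separation tool such as Demucs, invoked below via its command line; this is our own illustration, and it is an assumption that a Demucs-style split matches the official pipeline:

```python
import subprocess
from pathlib import Path

def strip_vocals(audio_path, out_dir="separated"):
    """Separate vocals with the Demucs CLI and return the instrumental stem path."""
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, str(audio_path)],
        check=True,
    )
    # "htdemucs" is the default Demucs v4 model name used in the output folder.
    stem_dir = Path(out_dir) / "htdemucs" / Path(audio_path).stem
    return stem_dir / "no_vocals.wav"

print(strip_vocals("example_track.mp3"))
```
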

Baseline Code

Provided to lower the entry barrier and help teams get started quickly

Access Code

Find the baseline code and instructions at github.com/ntu-musicailab/ICME26-ATTM-GC-MeanAudio

Important Dates

All dates below are in AoE (Anywhere on Earth, UTC-12).

  • Feb 10, 2026 Official launch
  • Mar 20, 2026 Registration deadline
  • Mar 30, 2026 Dry-run prompts released
  • Apr 10, 2026
    Dry-run submission deadline (pipeline verification)
  • Apr 20, 2026 Final test prompts released
  • Apr 23, 2026 Final audio submission deadline (72-hour window)
  • Apr 30, 2026 Finalists announcement
  • May 15, 2026 Grand Challenge paper submission deadline
  • May 22, 2026 Final MOS results & announcement of winners & paper acceptance notification
  • May 30, 2026 Camera-ready and author registration deadline

Contact Information

Dr. Yi-Hsuan (Eric) Yang

National Taiwan University, AI Center of Research Excellence

Dr. Hao-Wen (Herman) Dong

University of Michigan

Dr. Hung-Yi Lee

National Taiwan University, AI Center of Research Excellence

Fang-Chih (Andrew) Hsieh

National Taiwan University, Music and AI Lab

Wei-Jaw (Lonian) Lee

National Taiwan University, Music and AI Lab

Acknowledgment

This challenge is sponsored by Moises.ai and supported by grants from Google Asia Pacific, the National Science and Technology Council of Taiwan (NSTC 114-2628-E-002-013-MY3), and the Ministry of Education (MOE) of Taiwan (for Taiwan Centers of Excellence).