Zachary Novack¹, Julian McAuley¹, Taylor Berg-Kirkpatrick¹, Nicholas J. Bryan²
¹University of California, San Diego
²Adobe Research
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by trade-offs in speed, quality, and control design. Diffusion inference-time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method that speeds up inference-time optimization-based control and unlocks faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find that our method not only speeds up generation by over 10-20x, but also improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show that we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control.
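Steps (2) and (3) of the method can be illustrated with a deliberately tiny, pure-Python toy. This is a sketch under strong assumptions, not the DITTO-2 implementation: `one_step` and `multi_step` are hypothetical stand-ins for the distilled sampler, the "latent" is a short list of floats, the control loss is a mean squared error against a target curve, and the gradient is written out analytically instead of being obtained by backpropagation through a network.

```python
import math

def one_step(z):
    """Toy stand-in for one-step sampling with a distilled model (assumption)."""
    return [math.tanh(v) for v in z]

def multi_step(z, steps=4):
    """Toy stand-in for multi-step decoding (assumption): starts from the
    one-step output and refines it toward a slightly richer map of z."""
    x = one_step(z)
    for _ in range(steps):
        x = [xi + 0.5 * (math.tanh(v) + 0.05 * math.sin(3 * v) - xi)
             for xi, v in zip(x, z)]
    return x

def control_loss(x, target):
    """Mean squared error between a generated signal and the control target."""
    return sum((a - b) ** 2 for a, b in zip(x, target)) / len(x)

def optimize_latent(z, target, iters=200, lr=0.3):
    """Gradient descent on the latent, using one-step sampling as the surrogate."""
    z = list(z)
    for _ in range(iters):
        x = one_step(z)
        # Analytic gradient of the summed squared error through tanh;
        # using the sum instead of the mean only rescales the step size.
        grad = [2 * (xi - ti) * (1 - xi * xi) for xi, ti in zip(x, target)]
        z = [v - lr * g for v, g in zip(z, grad)]
    return z

def run_demo(n=16):
    # Hypothetical control target: a linear ramp from -0.8 to 0.8.
    target = [-0.8 + 1.6 * i / (n - 1) for i in range(n)]
    # Deterministic stand-in for the initial noise latent.
    z0 = [math.sin(1.7 * i) for i in range(n)]
    z_opt = optimize_latent(z0, target)
    return {
        "initial": control_loss(multi_step(z0), target),
        "surrogate": control_loss(one_step(z_opt), target),
        "final": control_loss(multi_step(z_opt), target),
    }
```

Calling `run_demo()` returns the control loss before optimization, the surrogate (one-step) loss after optimization, and the loss of the multi-step decode from the optimized latent. In this toy, the latent is tuned only through the cheap one-step sampler, yet the multi-step output from the same latent also lands close to the target, which is the intuition behind using one-step sampling as a surrogate.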
@inproceedings{Novack2024Ditto2,
title={{DITTO-2}: Distilled Diffusion Inference-Time T-Optimization for Music Generation},
author={Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
year={2024},
eprint={2405.20289},
booktitle={International Society for Music Information Retrieval Conference (ISMIR)}
}
Outpainting
Inpainting
Text (CLAP) Control
Caption: A male vocalist sings this energetic Punjabi folk song. The tempo is medium fast with an infectious tabla and Dhol percussive beat, ektara rhythm and a funky keyboard accompaniment. The song is lively, spirited, cheerful, simple, happy, playful, enthusiastic, vivacious with a festive, celebratory vibe and dance groove. This song is a Festive Punjabi Folk song.
Caption: This instrumental song features a distortion guitar playing a guitar solo. The guitar starts playing an ascending pattern followed by a sweep picking lick. This is followed by an alternate picking pattern. The ending of this solo consists of a sweep picking lick using string skipping technique. There are no other instruments in this song. There is no voice in this song.
Caption: Solo harmonica playing blues licks, followed by a male voiceover.
Caption: This is a hard rock music piece. There is a male vocalist singing in the lead. The electric guitar is playing the main melody while a bass guitar plays in the background. The acoustic drums are playing a groovy rock beat. There is a raw, psychedelic feel to this piece. It could be playing in the background of a rock bar.
Caption: This is a guitar cover video. There are no vocals in this piece. The acoustic guitar is playing a mellow tune with the heavy use of arpeggios. The atmosphere is gentle and relaxing. This piece could be used as the opening theme of a teenage drama TV series. It could also be playing in the background at a coffee shop.
Intensity Control
Control: Low to High Intensity, Linear Ramp.
Control: High to Low to High Intensity, Hard Jump.
Control: Low to High to Low Intensity, Hard Jump.
Control: Low to High Intensity, Hard Jump.
Control: High to Low Intensity, Linear Ramp.
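The intensity controls listed above can be pictured as per-frame target curves. The sketch below is illustrative only: the function names, the 0-to-1 intensity scale, and the assumption that a hard jump splits the clip into equal-length segments are all hypothetical, since the page does not state how the targets are constructed.

```python
def linear_ramp(n_frames, start, end):
    """Target curve that moves linearly from `start` to `end` intensity."""
    if n_frames == 1:
        return [start]
    return [start + (end - start) * i / (n_frames - 1) for i in range(n_frames)]

def hard_jump(n_frames, levels):
    """Piecewise-constant target curve: one equal-length segment per level
    (equal lengths are an assumption made for this sketch)."""
    return [levels[i * len(levels) // n_frames] for i in range(n_frames)]

# Hypothetical encodings of two of the controls, with low = 0.0 and high = 1.0.
low_to_high_ramp = linear_ramp(5, 0.0, 1.0)
low_high_low_jump = hard_jump(6, [0.0, 1.0, 0.0])
```

A curve like this would serve as the target that a control loss compares against an intensity feature extracted from the generated audio.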
Musical Structure Control
Control: A (4 seconds), B (2 seconds).
Control: A (4.5 seconds), B (1.5 seconds).
Control: A (1.5 seconds), B (4.5 seconds).
Control: A (2 seconds), B (4 seconds).
Control: A (1.5 seconds), B (3 seconds), A (1.5 seconds).
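One plausible way to turn a structure control such as "A (1.5 seconds), B (3 seconds), A (1.5 seconds)" into a numeric target is a block-wise similarity matrix over time frames. This is an assumption made for illustration: the page does not state the representation used, and the function names and the frame rate of 2 frames per second are hypothetical.

```python
def frame_labels(segments, frames_per_second=2):
    """Expand (label, seconds) segments into one section label per frame."""
    labels = []
    for name, seconds in segments:
        labels.extend([name] * round(seconds * frames_per_second))
    return labels

def target_similarity(labels):
    """Binary matrix: entry [i][j] is 1 when frames i and j share a section label."""
    return [[1 if a == b else 0 for b in labels] for a in labels]

# Hypothetical encoding of the A-B-A control listed above.
aba_labels = frame_labels([("A", 1.5), ("B", 3.0), ("A", 1.5)])
aba_target = target_similarity(aba_labels)
```

In the resulting matrix, frames from the first and last A sections are marked as similar to each other and dissimilar to the B section, which captures the intended repetition.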
Melody Control
Reference
Generated