DITTO-2:
Distilled Diffusion Inference-Time T-Optimization
for Music Generation


Zachary Novack1   Julian McAuley1   Taylor Berg-Kirkpatrick1   Nicholas J. Bryan2  

1University of California, San Diego
2Adobe Research


Abstract


Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion inference-time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show that we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control.
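
To make the three-step procedure above concrete, here is a minimal PyTorch sketch of steps (2) and (3), assuming a consistency-distilled model exposing a sample(noise, num_steps) method and a differentiable control loss (e.g., a melody, intensity, or CLAP-matching objective). All names and signatures are illustrative placeholders, not the authors' actual API.

import torch

def ditto2(distilled_model, control_loss, latent_shape,
           num_opt_steps=100, lr=1e-2, decode_steps=2, device="cuda"):
    # (2) Inference-time optimization: the initial noise latent is the free
    # variable, and cheap one-step sampling serves as a surrogate objective.
    noise = torch.randn(latent_shape, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)
    for _ in range(num_opt_steps):
        x0_hat = distilled_model.sample(noise, num_steps=1)  # one-step surrogate generation
        loss = control_loss(x0_hat)                          # differentiable control target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # (3) Final decoding: sample again from the optimized noise latent with a
    # few more steps for higher quality, with no further optimization.
    with torch.no_grad():
        return distilled_model.sample(noise.detach(), num_steps=decode_steps)

Step (1), the consistency (trajectory) distillation of the pre-trained diffusion model, is assumed to have been done offline before this procedure runs.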


Bibtex

          
          @inproceedings{Novack2024Ditto2,
              title={{DITTO-2}: Distilled Diffusion Inference-Time T-Optimization for Music Generation}, 
              author={Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
              year={2024},
              eprint={2405.20289},
              booktitle={International Society for Music Information Retrieval (ISMIR)}
          }
                    

Examples

Below, we display editing and control results for a wide range of music creation tasks, including outpainting, inpainting, intensity, melody, musical structure, and freeform text (CLAP) control. Note that this webpage is best viewed with Chrome. The examples here are randomly generated, using 1 sampling step during the optimization process and 2 sampling steps in the final decoding process. For CLAP control, samples are generated with a fully unconditional (rather than tag-conditioned) model.
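
Under the hypothetical sketch given after the abstract, the settings used for these examples (one sampling step inside the optimization loop, two steps for final decoding) would correspond to a call along these lines; the model, loss, and latent shape are placeholders.

# Illustrative only: 1-step surrogate sampling is built into the sketch's
# optimization loop; decode_steps=2 matches the 2-step final decoding above.
audio_latents = ditto2(
    distilled_model=my_distilled_model,   # consistency-distilled diffusion model (placeholder)
    control_loss=my_clap_loss,            # e.g., negative CLAP similarity to a text prompt (placeholder)
    latent_shape=(1, 64, 256),            # placeholder latent dimensions
    num_opt_steps=100,                    # number of optimizer iterations (illustrative)
    decode_steps=2,                       # 2-step final decoding
)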

For the outpainting and inpainting examples, we show the generated audio with a 1-second overlap region. For intensity and musical structure control, we show the generated audio with the control described in the caption (e.g., "high to low intensity" for intensity control or "A (3 seconds), B (3 seconds)" for structure control). For melody control, we show the original melody and the generated audio. For text (CLAP) control, we show the input text caption.
Outpainting
Inpainting
Text (CLAP) Control
Caption: A male vocalist sings this energetic Punjabi folk song. The tempo is medium fast with an infectious tabla and Dhol percussive beat, ektara rhythm and a funky keyboard accompaniment. The song is lively, spirited, cheerful, simple, happy, playful, enthusiastic, vivacious with a festive, celebratory vibe and dance groove. This song is a Festive Punjabi Folk song.
Caption: This instrumental song features a distortion guitar playing a guitar solo. The guitar starts playing an ascending pattern followed by a sweep picking lick. This is followed by an alternate picking pattern. The ending of this solo consists of a sweep picking lick using string skipping technique. There are no other instruments in this song. There is no voice in this song.
Caption: Solo harmonica playing blues licks, followed by a male voiceover.
Caption: This is a hard rock music piece. There is a male vocalist singing in the lead. The electric guitar is playing the main melody while a bass guitar plays in the background. The acoustic drums are playing a groovy rock beat. There is a raw, psychedelic feel to this piece. It could be playing in the background of a rock bar.
Caption: This is a guitar cover video. There are no vocals in this piece. The acoustic guitar is playing a mellow tune with the heavy use of arpeggios. The atmosphere is gentle and relaxing. This piece could be used as the opening theme of a teenage drama TV series. It could also be playing in the background at a coffee shop.
Intensity Control
Control: Low to High Intensity, Linear Ramp.
Control: High to Low to High Intensity, Hard Jump.
Control: Low to High to Low Intensity, Hard Jump.
Control: Low to High Intensity, Hard Jump.
Control: High to Low Intensity, Linear Ramp.
Musical Structure Control
Control: A (4 seconds), B (2 seconds).
Control: A (4.5 seconds), B (1.5 seconds).
Control: A (1.5 seconds), B (4.5 seconds).
Control: A (2 seconds), B (4 seconds).
Control: A (1.5 seconds), B (3 seconds), A (1.5 seconds).
Melody Control
Reference
Generated