Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion inference-time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but also improves control adherence and generation quality at the same time. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control.
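
As a rough illustration of steps (2) and (3), the toy sketch below optimizes a set of initial noise latents against a control loss using a one-step sampler, then decodes the optimized latents with two steps. Everything in it is an assumption made for illustration: `toy_sampler` is a hypothetical stand-in for a distilled model (not a real diffusion or consistency model), `control_loss` is a made-up "intensity" target, and gradients are estimated by finite differences to avoid dependencies, whereas a real system would backpropagate through the sampler.

```python
def toy_sampler(noise, num_steps, rate=0.5):
    """Hypothetical stand-in for a distilled sampler (NOT a real model).

    Deterministically maps initial noise latents to an output using
    num_steps small shrinkage steps."""
    x = list(noise)
    for _ in range(num_steps):
        x = [v - (rate / num_steps) * v for v in x]
    return x


def control_loss(output, target):
    """Toy 'intensity' control: squared error between mean energy and a target."""
    energy = sum(v * v for v in output) / len(output)
    return (energy - target) ** 2


def optimize_noise(noise, target, iters=200, lr=0.5, eps=1e-4):
    """Gradient descent on the initial noise latents.

    The 1-step sampler serves as the cheap surrogate objective. Gradients are
    forward finite differences (an assumption made to keep this dependency-free)."""
    x_T = list(noise)
    for _ in range(iters):
        base = control_loss(toy_sampler(x_T, 1), target)
        grad = []
        for i in range(len(x_T)):
            bumped = list(x_T)
            bumped[i] += eps
            grad.append((control_loss(toy_sampler(bumped, 1), target) - base) / eps)
        x_T = [v - lr * g for v, g in zip(x_T, grad)]
    return x_T


# Fixed values stand in for sampled Gaussian noise so the run is repeatable.
noise = [0.3, -1.2, 0.8, 0.1, -0.5, 1.5, -0.9, 0.4]
target = 2.0

# Step (2): optimize the noise latents against the 1-step surrogate.
x_T_opt = optimize_noise(noise, target)

# Step (3): decode with more sampling steps, reusing the optimized latents.
loss_before = control_loss(toy_sampler(noise, 2), target)
loss_after = control_loss(toy_sampler(x_T_opt, 2), target)
print("2-step control loss before:", loss_before)
print("2-step control loss after: ", loss_after)
```

In this toy the one-step and two-step samplers disagree slightly, so the decoded loss is reduced but not driven to zero. The sketch only shows the control flow of optimizing with a cheap surrogate and decoding with more steps; it says nothing about the audio quality or speed of the actual method.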

Bibtex

          
@inproceedings{Novack2024Ditto2,
    title={{DITTO-2}: Distilled Diffusion Inference-Time T-Optimization for Music Generation},
    author={Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
    year={2024},
    eprint={2405.20289},
    booktitle={International Society of Music Information Retrieval (ISMIR)}
}

Examples

Below, we display editing and control results for a wide range of music creation tasks, including outpainting, inpainting, intensity, melody, musical structure, and freeform text (CLAP) control. Note that this webpage is best viewed with Chrome. The examples here are randomly generated, using 1 sampling step during the optimization process and 2 sampling steps in the final decoding process. For CLAP control, samples are generated with a fully unconditional (rather than tag-conditioned) model.

For the outpainting and inpainting examples, we show the generated audio with an overlap region of 1 second. For intensity and musical structure control, we show the generated audio, with the control described in the caption (e.g., "high to low intensity" for intensity control or "A (3 seconds), B (3 seconds)" for structure control). For melody control, we show the original melody and the generated audio. For text (CLAP) control, we show the input text caption.