DITTO:
Diffusion Inference-Time T-Optimization
for Music Generation


Zachary Novack1,2*   Julian McAuley1   Taylor Berg-Kirkpatrick1   Nicholas J. Bryan2  

1University of California, San Diego
2Adobe Research
*Work done during an internship at Adobe Research


Abstract


We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained diffusion models at inference time by optimizing the initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output, and it leverages gradient checkpointing for efficient memory use. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. When we compare our approach against related training, guidance, and optimization-based methods, we find that DITTO achieves state-of-the-art performance on nearly all tasks and is over 2x faster than related optimization-based methods while using less memory, opening the door for high-quality, flexible, training-free control of diffusion models.
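As a rough illustration of the idea in the abstract, the toy sketch below optimizes an initial latent by backpropagating a feature-matching loss through a short sampling chain. Everything in it is an assumption made for illustration and is not the paper's implementation: `denoise_step` uses `tanh` as a stand-in for a trained diffusion model, `feature` is a mean-squared-amplitude proxy for an intensity-like control, the optimizer is plain gradient descent, and all function names are hypothetical. The backward pass keeps only the input of each sampler step and recomputes the inner activation from it, which is the memory-saving pattern that gradient checkpointing follows.

```python
import math
import random

def denoise_step(x, t, step=0.1):
    """One toy sampler step; tanh is a stand-in for a trained denoiser."""
    scale = 1.0 + 0.1 * t
    return [xi - step * scale * math.tanh(xi) for xi in x]

def denoise_step_vjp(x, t, grad_out, step=0.1):
    """Backward pass of denoise_step; recomputes tanh from the checkpointed input x."""
    scale = 1.0 + 0.1 * t
    return [g * (1.0 - step * scale * (1.0 - math.tanh(xi) ** 2))
            for xi, g in zip(x, grad_out)]

def feature(x0):
    """Toy 'intensity' feature (assumption): mean squared amplitude."""
    return sum(v * v for v in x0) / len(x0)

def loss_and_grad(x_T, target, num_steps=5):
    """Feature-matching loss and its gradient with respect to the initial latent."""
    # Forward pass: store only the input of each sampler step (the checkpoints).
    checkpoints = []
    x = x_T
    for t in range(num_steps, 0, -1):
        checkpoints.append((x, t))
        x = denoise_step(x, t)
    f = feature(x)
    loss = (f - target) ** 2
    # Backward pass: start from d(loss)/d(x_0), then walk the steps in reverse.
    grad = [2.0 * (f - target) * 2.0 * v / len(x) for v in x]
    for x_in, t in reversed(checkpoints):
        grad = denoise_step_vjp(x_in, t, grad)
    return loss, grad

def optimize_initial_noise(target, dim=16, iters=200, lr=0.5, seed=0):
    """Gradient descent on the initial latent; the toy model itself is never updated."""
    rng = random.Random(seed)
    x_T = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(iters):
        loss, grad = loss_and_grad(x_T, target)
        history.append(loss)
        x_T = [xi - lr * g for xi, g in zip(x_T, grad)]
    return x_T, history

x_T, history = optimize_initial_noise(target=2.0)
print(f"loss: {history[0]:.4f} -> {history[-1]:.2e}")
```

Only the latent `x_T` changes across iterations, mirroring the training-free setting described above; swapping `feature` for another differentiable function (with its matching gradient) changes the control target without touching the sampler.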


Bibtex

          
          @inproceedings{Novack2024Ditto,
              title={{DITTO}: Diffusion Inference-Time T-Optimization for Music Generation}, 
              author={Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
              year={2024},
              booktitle={International Conference on Machine Learning (ICML)}
          }
                    

Examples (Cherrypicked)

Below, we display editing and control results for a wide range of music creation tasks, including outpainting, inpainting, looping, intensity, melody, and musical structure control. Note that this webpage is best viewed with Chrome. The examples here are mildly cherry-picked to show the best results. For random (non-cherry-picked) examples, please see the section below.

For the outpainting, inpainting, and looping examples, we show the reference audio and the generated audio with an overlap region of 1 second. For intensity and musical structure control, we show the generated audio, with the target and generated control feature displayed on the right. For melody control, we show the original melody, generated audio, and the target and generated melody features.
Outpainting
[Audio examples with spectrograms. Columns: Reference, Generated]

Inpainting
[Audio examples with spectrograms. Columns: Reference, Generated]

Looping
[Audio examples with spectrograms. Columns: Reference, Generated]

Intensity Control
[Audio examples with spectrograms. Columns: Generated, Feature Plots]

Musical Structure Control
[Audio examples with spectrograms. Columns: Generated, Feature Plots]

Melody Control
[Audio examples with spectrograms. Columns: Reference, Generated, Feature Plots]

Examples (Random)

Below, we display editing and control results for a wide range of music creation tasks, including outpainting, inpainting, looping, intensity, melody, and musical structure control. The examples here are randomly generated, with no cherry-picking.

For the outpainting, inpainting, and looping examples, we show the reference audio and the generated audio with an overlap region of 1 second. For intensity and musical structure control, we show the generated audio, with the target and generated control feature displayed on the right. For melody control, we show the original melody, generated audio, and the target and generated melody features.
Outpainting
[Audio examples with spectrograms. Columns: Reference, Generated]

Inpainting
[Audio examples with spectrograms. Columns: Reference, Generated]

Looping
[Audio examples with spectrograms. Columns: Reference, Generated]

Intensity Control
[Audio examples with spectrograms. Columns: Generated, Feature Plots]

Musical Structure Control
[Audio examples with spectrograms. Columns: Generated, Feature Plots]

Melody Control
[Audio examples with spectrograms. Columns: Reference, Generated, Feature Plots]

Acknowledgements


Thank you to Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, and Nicholas J. Bryan for sharing their MusicHiFi cascaded vocoder:

          
            @article{zhu2024musichifi,
              title={MusicHiFi: Fast High-Fidelity Stereo Vocoding}, 
              author={Zhu, Ge and Caceres, Juan-Pablo and Duan, Zhiyao and Bryan, Nicholas J.},
              year={2024},
              archivePrefix={arXiv},
              primaryClass={cs.SD},
          }