Taming Rectified Flow for Inversion and Editing

Jiangshan Wang1,2 Junfu Pu2 Zhongang Qi2 Jiayi Guo1 Yue Ma3
Nisha Huang1 Yuxin Chen2 Xiu Li1 Ying Shan2

1 Tsinghua University 2 Tencent ARC lab 3 HKUST


We propose RF-Solver to solve the rectified flow ODE with less error, thus enhancing both sampling quality and inversion reconstruction accuracy for rectified-flow-based generative models. Furthermore, we propose RF-Edit to leverage the RF-Solver for image and video editing tasks. Our methods achieve impressive performance on various tasks, including text-to-image generation, image & video inversion, and image & video editing.

Abstract

Rectified-flow-based diffusion transformers, such as FLUX and OpenSora, have demonstrated exceptional performance in the field of image and video generation. Despite their robust generative capabilities, these models often suffer from inaccurate inversion, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that enhances inversion precision by reducing errors in the process of solving rectified flow ODEs. Specifically, we derive the exact formulation of the rectified flow ODE and perform a high-order Taylor expansion to estimate its nonlinear components, significantly decreasing the approximation error at each timestep. Building upon RF-Solver, we further design RF-Edit, which comprises specialized sub-modules for image and video editing. By sharing self-attention layer features during the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments on text-to-image generation, image & video inversion, and image & video editing demonstrate the robust performance and adaptability of our methods.

Contributions

RF-Solver

The vanilla rectified flow (RF) sampler demonstrates strong performance in image and video generation. However, when applied to inversion and reconstruction tasks, we observe significant error accumulation at each timestep. This results in reconstructions that diverge notably from the original image, further limiting the performance of RF-based models in various downstream tasks, such as image and video editing.


Delving into this problem, we notice that the inversion and reconstruction processes in rectified flow rely on estimating an approximate solution of the rectified flow ODE at each timestep. Obtaining more precise solutions of the ODE would effectively mitigate these errors, leading to improved reconstruction quality. Based on this analysis, we propose the RF-Solver algorithm.
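To make the role of solver accuracy concrete, the following is a minimal sketch on a toy one-dimensional ODE, not the RF-Solver algorithm itself. The velocity field `velocity` is a hypothetical stand-in for the learned network, chosen only because it has a closed-form solution. The second-order step assumes the time derivative of the velocity is estimated by a finite difference at a half step, which is one common way to realize a Taylor-expansion update. The sketch compares a first-order (Euler) step with the second-order step, both for solving the ODE and for an inversion-then-reconstruction round trip.

```python
import math

def velocity(z, t):
    # Hypothetical toy velocity field standing in for the learned model.
    return z * math.cos(t)

def exact(z0, t):
    # Closed-form solution of dz/dt = z * cos(t) with z(0) = z0.
    return z0 * math.exp(math.sin(t))

def euler_step(z, t, h):
    # First-order update: z + h * v.
    return z + h * velocity(z, t)

def taylor2_step(z, t, h):
    # Second-order Taylor update: z + h * v + (h^2 / 2) * dv/dt,
    # with dv/dt estimated by a finite difference over half a step
    # (an assumption of this sketch).
    v = velocity(z, t)
    z_mid = z + 0.5 * h * v
    v_mid = velocity(z_mid, t + 0.5 * h)
    dv_dt = (v_mid - v) / (0.5 * h)
    return z + h * v + 0.5 * h * h * dv_dt

def integrate(step, z0, t0, t1, n):
    # Apply `step` n times from t0 to t1; h is negative when t1 < t0.
    h = (t1 - t0) / n
    z, t = z0, t0
    for _ in range(n):
        z = step(z, t, h)
        t += h
    return z

def round_trip_error(step, z0, n):
    # Integrate 0 -> 1 ("inversion" in this toy setup), then 1 -> 0
    # ("reconstruction"), and measure the distance to the start.
    z1 = integrate(step, z0, 0.0, 1.0, n)
    z_back = integrate(step, z1, 1.0, 0.0, n)
    return abs(z_back - z0)

z0, n = 1.0, 10
err_euler = abs(integrate(euler_step, z0, 0.0, 1.0, n) - exact(z0, 1.0))
err_taylor = abs(integrate(taylor2_step, z0, 0.0, 1.0, n) - exact(z0, 1.0))
rt_euler = round_trip_error(euler_step, z0, n)
rt_taylor = round_trip_error(taylor2_step, z0, n)

print("solution error   euler:", err_euler, " second-order:", err_taylor)
print("round-trip error euler:", rt_euler, " second-order:", rt_taylor)
```

With the same number of steps, the second-order update costs one extra velocity evaluation per step, and in this toy setting both its solution error and its round-trip error are smaller than with the Euler step.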


RF-Edit

The proposed RF-Edit framework enables high-quality editing while preserving structural information. It comprises two sub-modules, tailored to image editing and video editing respectively. For image editing, we use FLUX as the backbone; for video editing, we employ OpenSora.
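The idea of sharing self-attention features between the source and the edited sample can be sketched as follows. This is an illustrative toy, not the RF-Edit implementation: the names `FeatureCache` and `self_attention`, the `mode` flag, and the choice to share the value (V) features specifically are all assumptions of this sketch. During the pass over the source, the features of a given layer and timestep are stored; during editing, the stored features replace those computed from the edited sample.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention over lists of token vectors.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

class FeatureCache:
    # Hypothetical store keyed by (timestep, layer index).
    def __init__(self):
        self.store = {}

    def save(self, step, layer, feats):
        self.store[(step, layer)] = feats

    def load(self, step, layer):
        return self.store[(step, layer)]

def self_attention(q, k, v, cache, step, layer, mode):
    # mode == "invert": record the source V features for later reuse.
    # mode == "edit":   replace V with the features recorded from the source.
    if mode == "invert":
        cache.save(step, layer, v)
    elif mode == "edit":
        v = cache.load(step, layer)
    return attention(q, k, v)

cache = FeatureCache()

# Toy source features (two tokens, two channels).
q_src = [[0.3, 0.1], [0.2, 0.5]]
k_src = [[0.4, 0.2], [0.1, 0.6]]
v_src = [[1.0, 0.0], [0.0, 1.0]]
self_attention(q_src, k_src, v_src, cache, step=0, layer=0, mode="invert")

# Toy features of the sample being edited at the same step and layer.
q_edit = [[0.2, 0.1], [0.4, 0.3]]
k_edit = [[0.1, 0.3], [0.5, 0.2]]
v_edit = [[5.0, 5.0], [-5.0, -5.0]]
edited = self_attention(q_edit, k_edit, v_edit, cache, step=0, layer=0, mode="edit")
print(edited)
```

In this sketch the attention weights still come from the edited sample's queries and keys, while the content being mixed comes from the source, which is one way such sharing can carry source structure into the edit.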


Text-to-Image Generation Results

We compare the performance of our method with the vanilla rectified flow on the text-to-image generation task. Both the quantitative and qualitative results demonstrate the superior performance of RF-Solver in fundamental T2I generation tasks, producing higher-quality images that align more closely with human cognition.

Inversion and Reconstruction Results

RF-Solver effectively reduces the error in the solution of the RF ODE, thereby increasing the accuracy of the reconstruction. The image reconstruction results using vanilla rectified flow exhibit noticeable drift from the source image, with significant alterations to the appearance of subjects in the image. For video reconstruction, the baseline results suffer from distortion. In contrast, RF-Solver significantly alleviates these issues, achieving more satisfactory results.

Image Editing Results

We compare the performance of our methods with several baselines across different types of editing. The baseline methods often suffer from background changes or fail to perform the desired edits. In contrast, our methods demonstrate satisfying performance, effectively achieving a balanced trade-off between fidelity to the target prompt and preservation of the source image. Notably, although RF-inversion also uses the rectified flow model for image editing (third row), it visibly alters parts of the source image that are unrelated to the editing prompt, such as the background and human appearance.

Video Editing Results

For video editing, we primarily evaluate the performance of our methods on long videos (200 frames) and high-resolution videos. Furthermore, we assess the performance on complicated videos and prompts where there are multiple objects in the video, and the user has different editing requirements for each object. Our method successfully handles complicated editing cases (e.g., modifying the leftmost lion among three lions into a white polar bear and changing the other two small lions into orange tiger cubs), whereas all other baseline methods fail in this scenario. Our method also demonstrates strong performance in global editing tasks, such as transforming scenes into autumn.

BibTeX

@misc{wang2024tamingrectifiedflowinversion,
  title={Taming Rectified Flow for Inversion and Editing},
  author={Jiangshan Wang and Junfu Pu and Zhongang Qi and Jiayi Guo and Yue Ma and Nisha Huang and Yuxin Chen and Xiu Li and Ying Shan},
  year={2024},
  eprint={2411.04746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.04746},
}