Taming Rectified Flow for Inversion and Editing
Jiangshan Wang1,2
Junfu Pu2
Zhongang Qi2
Jiayi Guo1
Yue Ma3
Nisha Huang1
Yuxin Chen2
Xiu Li1
Ying Shan2
1 Tsinghua University 2 Tencent ARC Lab 3 HKUST
[Paper]
[arXiv]
[Code]
We propose RF-Solver, which solves the rectified flow ODE with reduced error, thereby improving both sampling quality and inversion-reconstruction
accuracy for rectified-flow-based generative models. Building on RF-Solver, we propose RF-Edit for image
and video editing. Our methods achieve impressive performance on various tasks, including text-to-image generation, image & video
inversion, and image & video editing.
Abstract
Rectified-flow-based diffusion transformers, such as FLUX and OpenSora, have demonstrated exceptional performance in the field of image and video generation. Despite their robust generative capabilities, these models often suffer from inaccurate inversion, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that enhances inversion precision by reducing errors in the process of solving rectified flow ODEs. Specifically, we derive the exact formulation of the rectified flow ODE and perform a high-order Taylor expansion to estimate its nonlinear components, significantly decreasing the approximation error at each timestep. Building upon RF-Solver, we further design RF-Edit, which comprises specialized sub-modules for image and video editing. By sharing self-attention layer features during the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments on text-to-image generation, image & video inversion, and image & video editing demonstrate the robust performance and adaptability of our methods.
Contributions
- We propose RF-Solver, a training-free sampler that significantly reduces errors in the inversion and reconstruction processes of the rectified-flow model.
- We present RF-Edit, which leverages RF-Solver for image and video editing, effectively preserving the structural integrity of the source image/video while achieving high-quality results.
- Extensive experiments on images and videos demonstrate the efficacy of our methods, showcasing superior performance in both inversion and high-quality editing compared to various existing baselines.
RF-Solver
The vanilla rectified flow (RF) sampler demonstrates strong performance in image and video generation. However, when applied to inversion and reconstruction tasks, we observe significant error accumulation at each timestep.
This results in reconstructions that diverge notably from the original image, further limiting the performance of RF-based models in various downstream tasks, such as image and video editing.
Delving into this problem, we observe that the inversion and reconstruction processes in rectified flow rely on estimating an approximate solution of the rectified flow ODE at each timestep.
Obtaining a more precise solution of the ODE effectively mitigates these errors, leading to improved reconstruction quality. Based on this analysis, we propose the RF-Solver algorithm.
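As a concrete illustration of the idea, one RF-Solver-style update can be sketched as a second-order step: evaluate the velocity at the current timestep, take a half-step, re-evaluate, and use a finite difference to approximate the velocity's time derivative in the Taylor expansion. The sketch below uses a toy `velocity` function standing in for the pretrained rectified-flow transformer; the function names and exact discretization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def velocity(z, t):
    """Stand-in for the pretrained rectified-flow model v_theta(z, t).
    A toy linear field here, so the sketch runs end to end."""
    return -z * (1.0 - t)

def rf_solver_step(z, t, t_next, v_fn):
    """One second-order solver step for the rectified flow ODE dz/dt = v(z, t).

    A first-order (Euler) step keeps only the linear Taylor term; adding an
    estimate of dv/dt reduces the per-step truncation error by one order,
    which is the core idea behind RF-Solver.
    """
    dt = t_next - t
    v1 = v_fn(z, t)
    # Half-step Euler prediction to the midpoint.
    z_mid = z + 0.5 * dt * v1
    v2 = v_fn(z_mid, t + 0.5 * dt)
    # Finite-difference estimate of dv/dt along the trajectory.
    dv = (v2 - v1) / (0.5 * dt)
    # Second-order Taylor update: z + dt*v + (dt^2 / 2)*dv.
    return z + dt * v1 + 0.5 * dt**2 * dv

# Integrate from t=1 (noise) to t=0 (data) on the toy field.
z = np.ones(4)
ts = np.linspace(1.0, 0.0, 11)
for t, t_next in zip(ts[:-1], ts[1:]):
    z = rf_solver_step(z, t, t_next, velocity)
```

Each step costs two velocity evaluations instead of one, the usual trade-off for a higher-order solver; in the paper's setting the payoff is a markedly more accurate per-timestep solution.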
RF-Edit
The proposed RF-Edit framework enables high-quality editing while preserving structural information.
Building on this concept, we design two sub-modules for RF-Edit, specifically tailored for image editing and video editing.
For image editing, we use FLUX as the backbone.
For video editing, we employ OpenSora as the backbone.
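The structure preservation in RF-Edit comes from sharing self-attention features between the source and target branches: features recorded while inverting the source are injected while denoising the edit. The minimal single-head sketch below illustrates that record/inject mechanic; the class, the `mode` flag, and the choice to share the value (V) features are simplifying assumptions for exposition, not the internals of FLUX or OpenSora.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttention:
    """Single-head self-attention with an optional feature cache, mimicking
    (in spirit) RF-Edit's sharing: the source branch records features during
    inversion; the target branch reuses them during editing, so structure
    from the source is carried into the edited result."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.cache = {}  # timestep -> cached source features

    def __call__(self, x, t, mode="normal"):
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        if mode == "record":      # source-branch inversion step
            self.cache[t] = v
        elif mode == "inject":    # target-branch editing step
            v = self.cache[t]     # reuse the recorded source features
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return attn @ v

# Usage: record while inverting the source, inject while editing the target.
attn = SelfAttention(dim=8)
src = np.random.default_rng(1).standard_normal((16, 8))  # source tokens
tgt = src + 0.1                                          # edited-branch tokens
_ = attn(src, t=0, mode="record")
out = attn(tgt, t=0, mode="inject")
```

In practice such sharing is applied only at selected layers and timesteps, which is what balances structure preservation against the freedom needed to realize the edit.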
Text-to-Image Generation Results
We compare the performance of our method with the vanilla rectified flow on the text-to-image generation task.
Both the quantitative and qualitative results demonstrate the superior performance of RF-Solver in fundamental T2I generation tasks, producing higher-quality images that align more closely with human cognition.
Inversion and Reconstruction Results
RF-Solver effectively reduces the error in the solution of RF ODE, thereby increasing the accuracy of the reconstruction.
The image reconstruction results using vanilla rectified flow exhibit noticeable drift from the source image, with significant alterations to the appearance of subjects in the image.
For video reconstruction, the baseline reconstruction results suffer from distortion.
In contrast, RF-Solver significantly alleviates these issues, achieving more satisfactory results.
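The effect can be demonstrated on a toy problem: invert by integrating the ODE forward, reconstruct by integrating back, and compare the round-trip error of a first-order (Euler) step against a second-order midpoint-style step. The velocity field and step counts below are illustrative assumptions, not the trained model.

```python
import numpy as np

def velocity(z, t):
    # Toy nonlinear velocity field standing in for the trained model.
    return np.sin(z) + t

def euler_step(z, t, t_next, v_fn):
    # Vanilla first-order update, analogous to the standard RF sampler.
    return z + (t_next - t) * v_fn(z, t)

def midpoint_step(z, t, t_next, v_fn):
    # Second-order update in the spirit of RF-Solver's Taylor expansion.
    dt = t_next - t
    z_mid = z + 0.5 * dt * v_fn(z, t)
    return z + dt * v_fn(z_mid, t + 0.5 * dt)

def round_trip(step, z0, n=20):
    """Invert (t: 0 -> 1) then reconstruct (t: 1 -> 0); return max drift."""
    ts = np.linspace(0.0, 1.0, n + 1)
    z = z0.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):              # inversion
        z = step(z, t, t_next, velocity)
    for t, t_next in zip(ts[::-1][:-1], ts[::-1][1:]):  # reconstruction
        z = step(z, t, t_next, velocity)
    return float(np.abs(z - z0).max())

z0 = np.array([0.3, -1.2, 2.0])
err_euler = round_trip(euler_step, z0)
err_mid = round_trip(midpoint_step, z0)
```

With the same number of steps, the second-order round trip drifts far less from the starting point, mirroring the reconstruction gap between the vanilla RF sampler and RF-Solver.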
Image Editing Results
We compare the performance of our methods with several baselines across different types of editing.
The baseline methods often suffer from background changes or fail to perform the desired edits.
In contrast, our methods demonstrate satisfying performance, effectively achieving a balanced trade-off between fidelity to the target prompt and preservation of the source image.
Notably, although RF-inversion also uses the rectified flow model for image editing (third row), it noticeably modifies structure of the source image that is unrelated to the editing prompt, such as the background and human appearance.
Video Editing Results
For video editing, we primarily evaluate the performance of our methods on long videos (200 frames) and high-resolution videos.
Furthermore, we assess the performance on complicated videos and prompts where there are multiple objects in the video, and the user has different editing requirements for each object.
Our method successfully handles complicated editing cases (e.g., modifying the leftmost lion among three lions into a white polar bear and changing the other two small lions into orange tiger cubs), whereas all other baseline methods fail in this scenario.
Our method also demonstrates strong performance in global editing tasks, such as transforming scenes into autumn.
BibTex
@misc{wang2024tamingrectifiedflowinversion,
  title={Taming Rectified Flow for Inversion and Editing},
  author={Jiangshan Wang and Junfu Pu and Zhongang Qi and Jiayi Guo and Yue Ma and Nisha Huang and Yuxin Chen and Xiu Li and Ying Shan},
  year={2024},
  eprint={2411.04746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.04746},
}