DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes

Yiyuan Liang1,2,*, Zhiying Yan1,2,*, Liqun Chen1,2,*, Jiahuan Zhou3,
Luxin Yan1,2, Sheng Zhong1,2, Xu Zou1,2,†
1Huazhong University of Science and Technology
2National Key Laboratory of Multispectral Information Intelligent Processing Technology
3Wangxuan Institute of Computer Technology, Peking University

*Indicates Equal Contribution, †Indicates Corresponding Author

AAAI 2025

DriveEditor enables user-friendly repositioning, insertion, replacement, and deletion within a unified framework.

Abstract

Vision-centric autonomous driving systems require diverse data for robust training and evaluation, which can be augmented by manipulating object positions and appearances within existing scene captures. While recent advancements in diffusion models have shown promise in video editing, their application to object manipulation in driving scenarios remains challenging due to imprecise positional control and difficulties in preserving high-fidelity object appearances. To address these challenges in position and appearance control, we introduce DriveEditor, a diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box while preserving depth information and hierarchically injects it into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module preserves consistent attributes with a single reference image by employing a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor's exceptional fidelity and controllability in generating diverse driving scene edits, as well as its remarkable ability to facilitate downstream tasks.

Structure

(a) High-level overview of DriveEditor. (b) Diagram of the training pipeline of DriveEditor. Three levels of appearance control are established from the single reference image Ir: low-level detail preservation through a cut-and-paste approach, high-level semantic maintenance through cross-attention (omitted in the pipeline for brevity), and incorporation of 3D priors derived from the frozen SV3D U-Net. For position control, we perform a projection that preserves depth information, followed by the Pose Controller, which extracts multi-scale features. Control signals are injected through three distinct paths in each block of the video model: position features into ResBlocks, semantic features via cross-attention, and 3D features added to block outputs.
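To make the depth-preserving projection concrete, the following is a minimal NumPy sketch of projecting a 3D bounding box into the image plane while retaining depth as an extra channel. The function name, the two-channel (mask, depth) rasterization, and corner-only rendering are illustrative assumptions, not the paper's exact implementation.

# Sketch of a depth-preserving 3D box projection (illustrative, not DriveEditor's code).
import numpy as np

def project_box_with_depth(corners_3d, K, image_size):
    """Project the 8 corners of a 3D box (camera coordinates) onto the image plane,
    keeping each corner's depth in a separate channel of the condition map."""
    H, W = image_size
    proj = (K @ corners_3d.T).T                   # (8, 3): perspective projection
    depth = proj[:, 2:3]                          # keep z before normalization
    uv = proj[:, :2] / np.clip(depth, 1e-6, None)

    cond = np.zeros((2, H, W), dtype=np.float32)  # channel 0: mask, channel 1: depth
    for (u, v), d in zip(uv, depth[:, 0]):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < W and 0 <= vi < H:
            cond[0, vi, ui] = 1.0
            cond[1, vi, ui] = d
    return cond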

Unified Editing


DriveEditor is trained to reconstruct occluded objects using inputs from the constructed dataset. At inference time, it performs the different editing tasks depending on the specific inputs provided.
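As a rough illustration of how a single model covers all four operations simply by varying its inputs, here is a hedged sketch; the field names, conditioning format, and the crop_object helper are assumptions for exposition, not DriveEditor's actual interface.

# Hedged sketch: the editing operation is selected by which inputs are supplied.
def build_inputs(op, scene_frames, box_sequence=None, reference_image=None):
    if op == "reposition":   # keep the object's appearance, move its 3D box
        return dict(video=scene_frames, boxes=box_sequence,
                    reference=crop_object(scene_frames, box_sequence))  # hypothetical helper
    if op == "insertion":    # new object: reference image + user-placed box
        return dict(video=scene_frames, boxes=box_sequence, reference=reference_image)
    if op == "replacement":  # existing box trajectory, appearance from a new reference
        return dict(video=scene_frames, boxes=box_sequence, reference=reference_image)
    if op == "deletion":     # no box and no reference: the region is inpainted as background
        return dict(video=scene_frames, boxes=None, reference=None)
    raise ValueError(f"unknown operation: {op}")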

Long Video Editing

DriveEditor supports iterative editing by conditioning each clip on the last frame of the previously edited clip, enabling long-video editing (demo: 39 frames).
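A minimal sketch of this iterative scheme, assuming a clip-level inference call edit_clip (a hypothetical stand-in for DriveEditor's inference function):

# Chain fixed-length clips, conditioning each on the last frame of the previous one.
def edit_long_video(frames, boxes, clip_len, edit_clip):
    edited, cond_frame = [], None
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip_boxes = boxes[start:start + clip_len]
        out = edit_clip(clip, clip_boxes, first_frame=cond_frame)
        edited.extend(out)
        cond_frame = out[-1]   # the next clip starts from the last edited frame
    return edited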

More Editing Results

Repositioning Results

Insertion Results

Replacement Results

Deletion Results

BibTeX


@misc{liang2024driveeditor,
  title={DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes}, 
  author={Yiyuan Liang and Zhiying Yan and Liqun Chen and Jiahuan Zhou and Luxin Yan and Sheng Zhong and Xu Zou},
  year={2024},
  eprint={2412.19458},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.19458}, 
}