ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

Hongbin Xu1,2, Weitao Chen3, Zhipeng Zhou4, Feng Xiao3, Baigui Sun3, Mike Zheng Shou2, Wenxiong Kang1

1 South China University of Technology, 2 National University of Singapore, 3 Alibaba Group, 4 University of Chinese Academy of Sciences

Abstract

Despite recent advances in 3D generation methods, achieving controllability remains challenging. Current approaches based on score-distillation sampling are hindered by laborious, time-consuming optimization procedures. Furthermore, first generating 2D representations and then mapping them to 3D provides no internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate a joint training framework. In the condition training branch, we lock the triplane decoder and reuse the deep and robust encoding layers of LRM, which were pretrained on millions of 3D samples. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picked, manually generated examples. Comprehensive quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.

Performance Overview

Performance and efficiency comparison among different conditional 3D generation methods. Fig. (a) shows the average inference time of each method on a single V100-32G GPU. Our ControLRM-T and ControLRM-D achieve 60 and 18 times faster inference, respectively, than the fastest baseline, MVControl [14]. Fig. (b) shows the results of 15 evaluation metrics on the G-Objaverse test set, including 3D controllability metrics (introduced in Sec. 4.2.1 of the paper) and controllable 3D generation metrics (introduced in Sec. 4.3.1 of the paper).

Method Overview

The overall framework of ControLRM, a feed-forward controllable 3D generation model. ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate a joint training framework. In the condition training branch, we lock the triplane decoder and reuse the deep and robust encoding layers of LRM, which were pretrained on millions of 3D samples. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations.
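The lock/unlock schedule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the class names `Component` and `ControLRMSketch`, the function `configure_branch`, and the branch labels `"condition"` and `"image"` are hypothetical. The text only states how the triplane decoder is treated in each branch, so keeping the condition generator and condition encoder trainable in both branches is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Stand-in for a network sub-module; it only tracks whether it is trainable."""
    name: str
    trainable: bool = True


@dataclass
class ControLRMSketch:
    """Hypothetical container mirroring the three components named in the text."""
    condition_generator: Component = field(
        default_factory=lambda: Component("condition_generator"))
    condition_encoder: Component = field(
        default_factory=lambda: Component("condition_encoder"))
    triplane_decoder: Component = field(
        default_factory=lambda: Component("triplane_decoder"))

    def components(self):
        return [self.condition_generator, self.condition_encoder, self.triplane_decoder]


def configure_branch(model, branch):
    """Set trainable flags for one training branch and return the trainable names."""
    if branch == "condition":
        # Condition training branch: the triplane decoder is locked (frozen).
        model.triplane_decoder.trainable = False
    elif branch == "image":
        # Image training branch: the triplane decoder is unlocked.
        model.triplane_decoder.trainable = True
    else:
        raise ValueError(f"unknown branch: {branch!r}")
    # Assumption (not stated in the text): the other two components
    # stay trainable in both branches.
    model.condition_generator.trainable = True
    model.condition_encoder.trainable = True
    return [c.name for c in model.components() if c.trainable]


model = ControLRMSketch()
print(configure_branch(model, "condition"))
# ['condition_generator', 'condition_encoder']
print(configure_branch(model, "image"))
# ['condition_generator', 'condition_encoder', 'triplane_decoder']
```

In a real deep-learning framework the `trainable` flag would correspond to enabling or disabling gradient updates for the decoder's parameters; the sketch only shows which components each branch would update under the stated assumptions.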

Generated Results with Edge (Canny) Condition

Generated results given the Edge (Canny) condition. The first row shows the input condition, the second row shows the results of our ControLRM-D, and the third row shows the results of MVControl. You can switch the visualization between rendered RGB images and depth maps by dragging the sliders.

A purple teapot with a small horned hat.
 

A red and white plastic cup with a straw.
 

Thor Hammer.
 

A cartoon rocket ship with a person riding in it, flying in the air.
 

Purple sniper rifle.
 

A wooden crate with a cross design on it.
 

Generated Results with Sketch Condition

Generated results given the Sketch condition. The first row shows the input condition, the second row shows the results of our ControLRM-D, and the third row shows the results of MVControl. You can switch the visualization between rendered RGB images and depth maps by dragging the sliders.

Wooden Barrel Model.
 

Low poly a hamburger.
 

An axe.
 

A wooden console side table with drawers and shelves, featuring holes for added detail.
 

A sword with a blue handle.
 

A small wooden house with a tin roof.
 

Generated Results with Depth Condition

Generated results given the Depth condition. The first row shows the input condition, the second row shows the results of our ControLRM-D, and the third row shows the results of MVControl. You can switch the visualization between rendered RGB images and depth maps by dragging the sliders.

A red apple.
 

A fur-covered, camouflage-patterned, wooden chair with brown and tan accents.
 

Minecraft hammer model with a white pole.
 

A purple chair with holes and a back, featuring a plastic seat and legs.
 

A small garage or bus stop with a red roof.
 

A steampunk gun model with a scope, featuring a wooden handle, gold accents, and a metal barrel.
 

Generated Results with Normal Condition

Generated results given the Normal condition. The first row shows the input condition, the second row shows the results of our ControLRM-D, and the third row shows the results of MVControl. You can switch the visualization between rendered RGB images and depth maps by dragging the sliders.

An axe.
 

A green military crate with a metal handle.
 

White toilet model.
 

A vintage wooden radio with an attached cord.
 

A white woven bag with wooden handles.
 

A LEGO gun with a purple light, resembling a small robotic device.