φ-Scene: Physically Grounded Image-to-3D Scene Reconstruction

Abstract

Reconstructing compositional 3D scenes from a single image is a fundamental challenge in 3D world modeling. Recent methods can recover high-fidelity, complete 3D objects and predict plausible scene arrangements, but most still treat scene reconstruction primarily as a visual and geometric prediction problem. Their outputs may therefore contain floating objects, interpenetrations, or unstable-contact artifacts, limiting their physical validity and downstream usability in simulation, robotics, and interactive environments.

We present φ-Scene, a physically grounded approach to open-vocabulary and compositional image-to-3D scene reconstruction. The key premise is that a reconstructed scene should not be treated merely as a set of objects with predicted poses, but as a stable physical system. Accordingly, φ-Scene formulates reconstruction as topology-driven physical assembly: it infers how objects support one another, orders them accordingly, and progressively settles each object against its already stabilized support context. To instantiate this, φ-Scene uses compositional 3D foundation models to recover complete object geometries and initial object poses, optionally transfers global arrangement cues from a holistic image-to-3D prior, and then performs support-aware physical assembly. For each object in topological order, SDF-based optimization first resolves penetrations against the pre-settled support context, and rigid-body simulation then settles the object into a stable contact configuration under real-world physical constraints. This process produces compositional 3D reconstructions that preserve object-level geometry while improving placement coherence, contact validity, and dynamic physical stability.

Experiments on 3D-Front show that φ-Scene achieves the strongest overall performance among out-of-domain methods and remains highly competitive with in-domain baselines on standard 3D reconstruction metrics. Human and VLM evaluations further show strong preference for φ-Scene in visual quality, reference alignment, and physical plausibility. Finally, dedicated physical plausibility metrics covering static contact quality and dynamic stability demonstrate that φ-Scene substantially reduces penetration artifacts while producing much lower post-simulation drift, indicating more stable and physically grounded 3D scenes.

Interactive Comparison: Static

Recent methods can recover complete 3D objects and plausible scene arrangements, but often produce unstable contacts because physical validity is not explicitly enforced. φ-Scene instead places physical grounding at the core of 3D scene reconstruction by formulating it as a physical assembly process, producing much more coherent and physically stable 3D scenes.
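The topology-driven assembly described in the abstract can be illustrated with a minimal sketch. It assumes the support relations have already been inferred and are given as a mapping from each object to the objects that directly support it. The names `assemble`, `resolve_penetration`, and `settle` are hypothetical placeholders for illustration, not φ-Scene's actual interface; the two callbacks stand in for the SDF-based penetration resolution and the rigid-body settling step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assembly_order(supports):
    """Return objects ordered so that every supporter precedes what it supports.

    `supports` maps each object name to the set of objects directly
    supporting it (an assumed input format for this sketch).
    """
    return list(TopologicalSorter(supports).static_order())


def assemble(supports, resolve_penetration, settle):
    """Place objects one at a time against the already settled context.

    `resolve_penetration` and `settle` are hypothetical callbacks taking
    (object_name, settled_names); they are stand-ins for the SDF-based
    optimization and the rigid-body simulation step.
    """
    settled = []
    for name in assembly_order(supports):
        resolve_penetration(name, settled)  # push the object out of its supports
        settle(name, settled)               # let it come to rest on them
        settled.append(name)
    return settled


# Toy support graph: a table and a chair on the floor, a lamp and a book on the table.
supports = {
    "floor": set(),
    "table": {"floor"},
    "chair": {"floor"},
    "lamp": {"table"},
    "book": {"table"},
}
order = assembly_order(supports)
```

In this toy graph the floor is always placed first and the table always precedes the lamp and the book; the relative order of unrelated objects (for example the chair and the table) is not constrained by the support relation.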

Interactive Comparison: Dynamic

When the reconstructed 3D scenes are released in a physics simulator, scenes from prior methods often explode or fall apart because of unstable contacts such as penetrations, yielding severe post-simulation drift. In contrast, φ-Scene's topology-driven physical assembly keeps the 3D scene close to a global rigid-body equilibrium, allowing it to stay physically stable with minimal post-simulation drift.
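One simple way to quantify post-simulation drift is sketched below. This is an assumed definition for illustration only: it averages the Euclidean displacement of each object's center between the reconstructed scene and the scene after simulation. The exact metric used in the φ-Scene evaluation is not specified here and may differ, for example by also accounting for rotation. The function name `mean_drift` and the dictionary input format are hypothetical.

```python
import math


def mean_drift(before, after):
    """Average distance each object center moved during simulation.

    `before` and `after` map object names to (x, y, z) centers.
    This is an illustrative definition, not the paper's metric.
    """
    displacements = [math.dist(before[name], after[name]) for name in before]
    return sum(displacements) / len(displacements)


# Toy example: the table stays put, the lamp falls one unit along z.
before = {"table": (0.0, 0.0, 0.0), "lamp": (1.0, 0.0, 1.0)}
after = {"table": (0.0, 0.0, 0.0), "lamp": (1.0, 0.0, 0.0)}
drift = mean_drift(before, after)
```

A scene that is already in equilibrium when released would give a value near zero under this definition, while a scene with interpenetrating objects that are pushed apart by the simulator would give a large one.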
