University of California San Diego
TL;DR: φ-Scene reconstructs physically grounded, simulation-ready 3D scenes from single images via topology-driven physical assembly.
Reconstructing compositional 3D scenes from a single image is a fundamental challenge in 3D world modeling. Recent methods can recover high-fidelity, complete 3D objects and predict plausible scene arrangements, but most still treat scene reconstruction primarily as a visual and geometric prediction problem. Their outputs may therefore contain floating objects, interpenetrations, or unstable-contact artifacts, limiting their physical validity and downstream usability in simulation, robotics, and interactive environments.
We present φ-Scene, a physically grounded approach to open-vocabulary and compositional image-to-3D scene reconstruction. The key premise is that a reconstructed scene should not be treated merely as a set of objects with predicted poses, but as a stable physical system. Accordingly, φ-Scene formulates reconstruction as topology-driven physical assembly: it infers how objects support one another, orders them accordingly, and progressively settles each object against its already stabilized support context. To instantiate this, φ-Scene uses compositional 3D foundation models to recover complete object geometries and initial object poses, optionally transfers global arrangement cues from a holistic image-to-3D prior, and then performs support-aware physical assembly. For each object in topological order, SDF-based optimization first resolves penetrations against the pre-settled support context, and rigid-body simulation then settles the object into a stable contact configuration under real-world physical constraints. This process produces compositional 3D reconstructions that preserve object-level geometry while improving placement coherence, contact validity, and dynamic physical stability.
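The ordering step described above can be illustrated with a minimal sketch. It assumes the inferred support relations are available as a mapping from each object to the set of objects that support it. The names `supported_by`, `assembly_order`, `assemble`, and `settle` are hypothetical and do not come from the φ-Scene code; the `settle` callback stands in for the SDF-based penetration resolution and rigid-body settling, which are not implemented here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assembly_order(supported_by):
    """Order objects so that every supporter comes before what rests on it.

    `supported_by` maps an object name to the set of objects supporting it
    (a hypothetical representation of the inferred support graph).
    Raises graphlib.CycleError if the support relations contain a cycle.
    """
    return list(TopologicalSorter(supported_by).static_order())


def assemble(supported_by, settle):
    """Settle objects one by one against the already settled context.

    `settle(obj, context)` is a placeholder for the per-object step
    (penetration resolution followed by physical settling).
    """
    settled = []
    for obj in assembly_order(supported_by):
        settle(obj, tuple(settled))  # context = everything settled so far
        settled.append(obj)
    return settled


# Toy scene: a lamp and a book rest on a table; table and chair rest on the floor.
supported_by = {
    "floor": set(),
    "table": {"floor"},
    "chair": {"floor"},
    "lamp": {"table"},
    "book": {"table"},
}

order = assembly_order(supported_by)

# Record the context each object sees when it is settled.
contexts = {}
assemble(supported_by, lambda obj, ctx: contexts.update({obj: ctx}))
```

The relative order of objects that do not support each other (for example, the lamp and the book) is not fixed by the support graph, so only the supporter-before-supported constraint should be relied on.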
Experiments on 3D-Front show that φ-Scene achieves the strongest overall performance among out-of-domain methods and remains highly competitive with in-domain baselines on standard 3D reconstruction metrics. Human and VLM evaluations further show strong preference for φ-Scene in visual quality, reference alignment, and physical plausibility. Finally, dedicated physical plausibility metrics covering static contact quality and dynamic stability demonstrate that φ-Scene substantially reduces penetration artifacts while producing much lower post-simulation drift, indicating more stable and physically grounded 3D scenes.
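As a rough illustration of what a post-simulation drift measurement could look like, the sketch below averages how far each object's center moves between the reconstructed scene and its state after simulation. This is an assumed definition for illustration only: the exact metric used in the paper is not specified here, and the function name `mean_drift` and the dictionary layout are hypothetical.

```python
import math


def mean_drift(before, after):
    """Mean Euclidean displacement of object centers (assumed drift definition).

    `before` and `after` map object names to (x, y, z) center positions
    before and after running the physics simulation.
    """
    if not before:
        raise ValueError("scene contains no objects")
    distances = [math.dist(before[name], after[name]) for name in before]
    return sum(distances) / len(distances)


# Toy example: the table stays put, the lamp sinks 0.5 units along z.
before = {"table": (0.0, 0.0, 0.0), "lamp": (1.0, 0.0, 0.0)}
after = {"table": (0.0, 0.0, 0.0), "lamp": (1.0, 0.0, -0.5)}

drift = mean_drift(before, after)
```

Under this assumed definition, a scene that is already in equilibrium yields a drift near zero, while a scene with penetrations or floating objects yields a larger value once the simulator separates or drops the objects.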
Recent methods can recover complete 3D objects and plausible scene arrangements, but often produce unstable contacts because physical validity is not explicitly enforced. φ-Scene instead places physical grounding at the core of 3D scene reconstruction by formulating it as a physical assembly process, producing much more coherent and physically stable 3D scenes.
When the reconstructed 3D scenes are released in a physics simulator, scenes from prior methods often explode or fall apart because of unstable contacts such as penetrations, yielding severe post-simulation drift. In contrast, φ-Scene's topology-driven physical assembly strategy keeps the 3D scene close to a global rigid-body equilibrium, allowing it to stay physically stable with minimal post-simulation drift.
Please consider citing our paper if you find it useful in your research. 🌹
Coming soon...