Conditioned Denoising Diffusion with Spatial Attention for Controllable 3D Scene Layout Generation and Editing

Kaiwen Zhu, Houmin Wu, Bin Xiao

Abstract


Efficient and controllable 3D scene layout generation and editing are of great significance to virtual reality, architectural visualization, and intelligent interaction systems: they not only raise the efficiency of spatial design but also improve the user experience. This paper proposes a generation framework that combines a diffusion model with a spatial attention mechanism. The diffusion model approximates the true layout distribution through a step-by-step denoising process, ensuring the stability and diversity of the global layout, while the spatial attention mechanism dynamically focuses on key regions when modeling object relationships, improving the accuracy and consistency of local edits. The model was systematically evaluated on public datasets and a self-built scene library, using layout accuracy (89.3%), intersection over union (IoU, 0.76), Fréchet Inception Distance (FID, 31.2), and an editing consistency score (0.84) as performance metrics. The results show that the method maintains high precision with good inference efficiency: the average generation time per scene is 1.3 s on a GPU platform and about 5.9 s on embedded devices, outperforming baseline methods. The framework thus offers clear advantages in cross-platform deployment and multi-scenario adaptability, providing a new technical path for the intelligent generation and industrial application of 3D content. Evaluation was conducted on the 3D-FRONT and SUNCG datasets together with a 300-scene supplementary dataset. Layout accuracy was defined as correct placement within 0.20 m translation error and 15° rotation error; IoU was computed on 128³ voxel grids; FID was calculated from five rendered views per scene using Inception-v3 features; and the editing consistency score was defined as the ratio of satisfied spatial constraints while preserving overall structural similarity.
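As a concrete illustration of the evaluation protocol described above, the sketch below implements the three geometry-based metrics as they are defined in the abstract: layout accuracy with the 0.20 m / 15° tolerances, IoU on voxel occupancy grids, and the editing consistency score as a ratio of satisfied constraints. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, objects are assumed to be pre-matched by index, and rotation is simplified to a single yaw angle.

```python
import numpy as np

# Tolerances stated in the abstract: a predicted object is correctly
# placed if it is within 0.20 m of translation error and 15 degrees
# of rotation error.
TRANS_THRESH_M = 0.20
ROT_THRESH_DEG = 15.0

def layout_accuracy(pred, gt):
    """pred, gt: sequences of (xyz_position, yaw_degrees) pairs,
    assumed matched by index. Returns the fraction of objects whose
    placement satisfies both tolerances."""
    correct = 0
    for (p_pos, p_yaw), (g_pos, g_yaw) in zip(pred, gt):
        t_err = np.linalg.norm(np.asarray(p_pos) - np.asarray(g_pos))
        # Wrap the angular difference into [0, 180] before thresholding.
        r_err = abs((p_yaw - g_yaw + 180.0) % 360.0 - 180.0)
        if t_err <= TRANS_THRESH_M and r_err <= ROT_THRESH_DEG:
            correct += 1
    return correct / max(len(gt), 1)

def voxel_iou(pred_occ, gt_occ):
    """IoU on boolean occupancy grids; the paper uses 128^3 voxels."""
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return float(inter) / float(union) if union else 1.0

def editing_consistency(n_satisfied, n_total):
    """Editing consistency score as defined above: the ratio of
    spatial constraints still satisfied after an edit."""
    return n_satisfied / max(n_total, 1)
```

FID is omitted here because it requires a pretrained Inception-v3 feature extractor; standard implementations (e.g., the widely used pytorch-fid package) would compute it from the five per-scene rendered views mentioned above.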





DOI: https://doi.org/10.31449/inf.v49i14.11284

This work is licensed under a Creative Commons Attribution 3.0 License.