Conditioned Denoising Diffusion with Spatial Attention for Controllable 3D Scene Layout Generation and Editing
Abstract
Efficient and controllable 3D scene layout generation and editing are of great significance to virtual reality, architectural visualization, and intelligent interaction systems: they not only raise the efficiency of spatial design but also improve the user experience. This paper proposes a generation framework that combines a diffusion model with a spatial attention mechanism. The diffusion model approximates the true layout distribution through step-by-step denoising, ensuring the stability and diversity of the global layout, while the spatial attention mechanism dynamically focuses on key regions when modeling object relationships, improving the accuracy and consistency of local edits. In the experiments, the model was systematically evaluated on public datasets and a self-built scene library, using layout accuracy (89.3%), intersection over union (IoU, 0.76), Fréchet Inception Distance (FID, 31.2), and an editing consistency score (0.84) as performance metrics. The results show that the method maintains high precision with good inference efficiency: the average generation time per scene is 1.3 s on a GPU platform and about 5.9 s on embedded devices, outperforming the baseline methods. The framework shows clear advantages in cross-platform deployment and multi-scenario adaptability, providing a new technical path for the intelligent generation and industrial application of 3D content.

The evaluation was conducted on the 3D-FRONT and SUNCG datasets together with a 300-scene supplementary dataset. Layout accuracy was defined as correct placement within 0.20 m translation error and 15° rotation error; IoU was computed on 128³ voxel grids; FID was calculated from five rendered views per scene using Inception-v3 features; and the editing consistency score was defined as the ratio of satisfied spatial constraints while preserving overall structural similarity.
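The metric definitions above are concrete enough to sketch in code. The following is a minimal, illustrative implementation of layout accuracy (0.20 m / 15° thresholds), voxel IoU, and the editing consistency ratio; the pose representation, array shapes, and function names are assumptions for illustration, not the paper's actual evaluation code.

```python
import numpy as np

def layout_accuracy(pred_poses, gt_poses, t_thresh=0.20, r_thresh=15.0):
    """Fraction of objects placed within t_thresh meters of translation error
    and r_thresh degrees of rotation error. Each pose is (position, yaw_deg)."""
    correct = 0
    for (p_pos, p_rot), (g_pos, g_rot) in zip(pred_poses, gt_poses):
        t_err = np.linalg.norm(np.asarray(p_pos) - np.asarray(g_pos))
        # Wrap the angular difference into [-180, 180] before thresholding.
        r_err = abs((p_rot - g_rot + 180.0) % 360.0 - 180.0)
        if t_err <= t_thresh and r_err <= r_thresh:
            correct += 1
    return correct / len(pred_poses)

def voxel_iou(pred_vox, gt_vox):
    """IoU between two boolean occupancy grids (e.g. 128x128x128 voxels)."""
    inter = np.logical_and(pred_vox, gt_vox).sum()
    union = np.logical_or(pred_vox, gt_vox).sum()
    return inter / union if union > 0 else 1.0

def editing_consistency(satisfied_constraints, total_constraints):
    """Ratio of spatial constraints still satisfied after an edit."""
    return satisfied_constraints / total_constraints
```

With this formulation, an editing consistency score of 0.84 corresponds to, for example, 21 of 25 spatial constraints remaining satisfied after an edit.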
DOI: https://doi.org/10.31449/inf.v49i14.11284
Copyright © Slovenian Society Informatika