Multi-Modal Modified U-Net for Text-Image Restoration: A Diffusion-Based Multimodal Information Fusion Approach
Abstract
Realistic image restoration is a crucial task in computer vision, and diffusion-based models have been widely explored for their generative capabilities. However, restored image quality remains a challenge due to the uncontrolled nature of the diffusion process and severe image degradation. To address this, we propose a MultiModal Modified U-Net (M3UNET) that integrates textual and visual modalities for enhanced restoration. We leverage a pre-trained multimodal large language model to extract semantic information from low-quality images and employ an image encoder with a custom-built Refine Layer to improve feature acquisition. At the visual level, pixel-level spatial structure is modeled to support fine-grained restoration. By injecting this control information through multi-level attention mechanisms, the model enables precise and controlled restoration. Experimental results on synthetic and real-world datasets demonstrate that our approach surpasses state-of-the-art techniques in both qualitative and quantitative evaluations, confirming the value of multimodal cues for improving image restoration quality.
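To illustrate the kind of fusion the abstract describes ("incorporating control information through multi-level attention mechanisms"), the sketch below shows one way text embeddings from a pre-trained multimodal model could be injected into U-Net feature maps via cross-attention. This is a minimal, hypothetical example, not the authors' implementation; the module name, dimensions, and residual design are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Injects text conditioning into a spatial feature map (hypothetical sketch)."""
    def __init__(self, feat_dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=feat_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) U-Net features; text_emb: (B, L, text_dim) from the MLLM
        b, c, h, w = feats.shape
        q = feats.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        attended, _ = self.attn(self.norm(q), text_emb, text_emb)
        out = q + attended                                # residual fusion
        return out.transpose(1, 2).reshape(b, c, h, w)

# Usage: one fusion block per U-Net resolution level would give "multi-level" attention.
fuse = CrossAttentionFusion(feat_dim=256, text_dim=768)
feats = torch.randn(2, 256, 32, 32)      # intermediate U-Net features (assumed shape)
text = torch.randn(2, 77, 768)           # semantic embeddings from the text branch
cond = fuse(feats, text)                 # (2, 256, 32, 32)
```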
DOI: https://doi.org/10.31449/inf.v49i2.8245

This work is licensed under a Creative Commons Attribution 3.0 License.