Zhao Haiyang
School of Data Science
City University of Hong Kong
haiyazhao7-c@my.cityu.edu.hk
Abstract
In this study, we propose an enhanced image restoration model based on SUPIR, which integrates two low-rank adaptation (LoRA) modules into the Stable Diffusion XL (SDXL) framework. Our method leverages LoRA to fine-tune SDXL models, significantly improving image restoration quality and efficiency. We collect 2600 high-quality real-world images, each with detailed descriptive text, for training the model. The proposed method is evaluated on standard benchmarks and achieves excellent performance, demonstrated by a higher peak signal-to-noise ratio (PSNR), lower learned perceptual image patch similarity (LPIPS), and a higher structural similarity index measure (SSIM). These results underscore the effectiveness of combining LoRA with SDXL for advanced image restoration tasks and highlight the potential of our approach for generating high-fidelity restored images.
1 Introduction
With the advancement of image restoration (IR) technology, it has become feasible to build models that generate ultra-high-quality images while retaining as much of the original semantic information as possible. Several strategies have proven highly effective, such as generative priors and increasing the model scale. Among these, scaling up the model has been shown to be a particularly significant and efficient technique. For instance, notable advances have been achieved by models such as Vision Transformers (ViT) [7] and DALL-E [25] through the expansion of model scale. This encourages us to further pursue large-scale intelligent IR models capable of generating ultra-high-quality images.
The SUPIR [42] model has demonstrated extraordinary performance in image restoration, using a novel method that improves restoration ability through text prompts. Its authors collected 20 million high-quality, high-resolution images with descriptive text annotations for training. SUPIR adopts Stable Diffusion XL (SDXL) [24], which contains 2.6 billion parameters, as a powerful generative prior. SDXL utilizes an expanded UNet [28] backbone and introduces an image-to-image refinement model for post-processing, allowing it to produce images of superior quality and resolution [7]. After exploring SUPIR's performance, we set out to improve its rendering of image details and textures. In this work, we introduce two trained LoRA modules [12] applied to SDXL to fine-tune model parameters and improve the model's face and landscape restoration performance. To verify the effectiveness of this method, we conducted comprehensive experiments on real-world images and achieved better results on the PSNR, SSIM, and LPIPS [49] metrics. The main contributions of this work are shortening the time required for image generation and improving the quality of the generated images.
2 Related Work
2.1 Image Restoration
The purpose of image restoration is to convert degraded images into clean, high-quality images [8, 14, 45, 44]. Typical image restoration problems include super-resolution [23, 6, 17], deblurring [26, 46, 4], and denoising [15, 50, 38]. However, conventional methods for these tasks generally have limited generalization ability, making it difficult to handle real-world degraded images.
Deep learning has introduced new architectures and training paradigms that improve image restoration performance. For example, transformer-based models enhance the fidelity and realism of restored images [13]. In addition, attention mechanisms [33] and multi-scale feature extraction techniques have been integrated into restoration frameworks to better capture fine image details.
Beyond these methods, diffusion models have also received attention in image restoration. Models such as the Denoising Diffusion Implicit Model (DDIM) [32] and the Denoising Diffusion Null-Space Model (DDNM) [36] iteratively refine images through a series of denoising steps, effectively handling various types of image degradation and showing promising results. These models exploit the diffusion process to gradually improve image quality, making them highly effective for denoising, deblurring, and related tasks. However, achieving robust performance under diverse unseen degradations remains challenging. Moreover, to better optimize restoration models, researchers have proposed new loss functions, such as perceptual loss, which better capture image details and improve restoration quality. Ongoing research aims to develop more adaptable and scalable models that generalize effectively to the varied degradations encountered in real-world applications. Over time, many models capable of handling multiple degradation scenarios have emerged, among which two-stage methods such as DiffBIR [19] and SUPIR [42] show good results.
2.2 Low Rank Adaptation
Low-Rank Adaptation (LoRA) [12] is a parameter-efficient fine-tuning technique originally proposed for large-scale language models. It represents the update to a pre-trained weight matrix as a low-rank decomposition, which allows efficient adaptation of pre-trained models. By exploiting this low-rank structure, LoRA significantly reduces the number of trainable parameters, leading to decreased memory usage and computational overhead.
The core idea behind LoRA is to insert low-rank adaptation matrices into the model architecture, enabling fast adaptation and efficient fine-tuning without altering the original model's weights. This approach is particularly advantageous when computational resources are limited or rapid model updates are necessary. LoRA leverages the inherent low-rank structure within the weight updates of large-scale models, which often contain redundant information; by decomposing these updates into lower-dimensional components, it reduces the complexity of the adaptation process. Furthermore, because LoRA leaves the original weights intact, the foundational knowledge embedded in the pre-trained model is preserved. This makes LoRA especially effective in transfer learning, where a pre-trained model is adapted to new tasks or domains. The low-rank adaptation matrices can be trained with far fewer resources than retraining the entire model, making the process more efficient and cost-effective.
In practical terms, LoRA can be applied to various aspects of model adaptation, including fine-tuning for specific tasks, domain adaptation, and even continual learning. Its flexibility and efficiency make it a valuable tool in the toolkit of machine learning practitioners, particularly when dealing with large and complex models. The reduced computational burden also facilitates experimentation and iteration, allowing researchers and engineers to explore a wider range of configurations and settings. LoRA can offer a blend of efficiency, effectiveness, and practicality. Its application can lead to more responsive and adaptable AI systems, capable of quickly incorporating new information and tasks without the need for extensive computational resources.
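To make the idea concrete, the following is a minimal PyTorch sketch of a LoRA-adapted linear layer, assuming the standard initialization described above; the class name, rank, and scaling values are illustrative choices, not SUPIR's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a pre-trained linear layer (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        d, k = base.out_features, base.in_features
        # Standard LoRA initialization: A is Gaussian, B is zero, so the
        # adapter initially leaves the base model's behavior unchanged.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + scale * B @ A; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Only `A` and `B` are passed to the optimizer, so adapting a 1024×1024 layer with rank r = 8 trains roughly 16k parameters instead of about one million.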
2.3 Stable Diffusion XL
Diffusion models [11, 27, 40] have garnered significant attention in generative artificial intelligence, delivering state-of-the-art results across applications including text-to-image [16, 29, 48] and text-to-video [3, 43, 37] generation. These models operate by gradually transforming a simple, structured noise distribution into a complex data distribution through a series of iterative refinement steps. This process enables the generation of high-fidelity images and videos from random noise, making diffusion models a powerful tool for a wide range of generative tasks.
Stable Diffusion [27, 5, 22] is particularly influential in text-to-image synthesis, leveraging the Latent Diffusion Model (LDM) to perform diffusion in a semantically compressed space. This design improves computational efficiency by reducing the dimensionality of the data on which diffusion operates. The core architecture centers on U-Net, a convolutional network well suited to image restoration tasks: it iteratively denoises random latent codes, supported by text encoders and image decoders, to harmonize text and image generation. The text encoders allow the model to understand and incorporate textual descriptions into the generated images, yielding highly detailed and contextually relevant outputs.

However, the computational demands of the multi-step inference process [39] become a significant burden, particularly when generating high-resolution images or long video sequences. Each diffusion step involves complex computation, leading to substantial time and resource consumption. This overhead is a challenge for real-time applications and large-scale deployments, where efficiency is crucial. To address it, researchers have introduced distillation techniques such as progressive distillation and adversarial distillation [30, 31]. Progressive distillation incrementally transfers knowledge from a complex model to a simpler one, maintaining performance while reducing the number of computation steps. Adversarial distillation, on the other hand, leverages adversarial training to ensure that the simplified model retains the generative capabilities of the original.
SDXL-Lightning [18] is an enhanced version of the SDXL model that employs progressive adversarial distillation to significantly boost the quality and efficiency of image generation. It combines an advanced architecture with an adversarial training mechanism to generate high-resolution, detailed images while minimizing computational cost. To reduce the cost of training diffusion models for high-resolution synthesis, it has been observed that although diffusion models can ignore perceptually unimportant details by down-weighting the corresponding loss terms, they still require expensive function evaluations in pixel space, consuming substantial computation time and energy. A method was therefore introduced to explicitly separate the compression learning phase: an autoencoder learns a latent space that is perceptually equivalent to the image space but has significantly reduced computational complexity.
3 Method
3.1 Background of Stable Diffusion
The key steps of the stable diffusion model consist of the forward diffusion process and the backward denoising process [11]. The forward diffusion process progressively adds noise to the data, whereas the backward process utilizes the learned model to remove noise and restore the original data. Specifically, the forward diffusion process can be described as:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\big) \tag{1}$$

where $\beta_t$ is the noise schedule at step $t$.
The reverse process is approximated by a parameterized denoising model $p_\theta$, which is trained by maximizing the log-likelihood:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \tag{2}$$
Here, $\mu_\theta$ and $\Sigma_\theta$ are the learned mean and variance functions, respectively. The network's goal is to predict the noise $\epsilon$ that was added to the input image, given the noisy image $x_t$ at time step $t$. The objective function of the diffusion model [27] is

$$L_{DM} = \mathbb{E}_{x,\,\epsilon \sim \mathcal{N}(0,1),\,t}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|_2^2\Big] \tag{3}$$
In LDM, learning takes place in the latent space; that is, the model predicts the noise added to the latent code $z_t$, and the corresponding loss function is

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,\epsilon \sim \mathcal{N}(0,1),\,t}\Big[\big\|\epsilon - \epsilon_\theta(z_t, t)\big\|_2^2\Big] \tag{4}$$

where $\mathcal{E}$ is the encoder that maps images into the latent space.
Compared with traditional Generative Adversarial Networks (GANs) [9], stable diffusion models exhibit enhanced stability and a reduced risk of mode collapse. Moreover, by operating within the latent space, they substantially enhance the computational efficiency of the generation process.
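As an illustration of Eqs. (1)–(4), the following PyTorch sketch implements the closed-form forward noising and the ε-prediction training loss; the linear noise schedule and the `eps_model` placeholder (any noise-prediction network, e.g. a U-Net) are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def diffusion_loss(eps_model, z0: torch.Tensor) -> torch.Tensor:
    """z0 is a clean image (Eq. 3) or latent code (Eq. 4)."""
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_bar.to(z0.device)[t].view(-1, *([1] * (z0.dim() - 1)))
    # Iterating Eq. (1) gives the closed form:
    #   z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return F.mse_loss(eps_model(zt, t), eps)     # the MSE objective of Eqs. (3)-(4)
```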
3.2 Scaling-UP Image Restoration
SUPIR combines large-scale pre-trained generative models to significantly improve the effectiveness of image restoration. It adopts a two-stage architecture, with each stage optimized for a different task. In the first stage, a pre-trained restoration module removes degraded components from the image, such as blur and noise. In the second stage, SUPIR leverages SDXL to reconstruct image details and textures.
First, a low-quality (LQ) input image is encoded by the fine-tuned encoder and mapped into the latent space; this encoder is specially trained to handle degraded images. The authors designed an adapter based on ControlNet [47] that recognizes the content of the LQ image and guides the restoration according to the provided low-quality input. The adapter adopts a partially trimmed Vision Transformer (ViT) [1] module and introduces a ZeroSFT [35] module to strengthen the guidance provided by the LQ image.
Let the input image be $x_{LQ}$. The output of the encoder $\mathcal{E}$ is the latent code $z$:

$$z = \mathcal{E}(x_{LQ}) \tag{5}$$
The encoded feature is processed by the trimmed ControlNet:

$$y_c = \mathcal{F}(z;\,\Theta) + \mathcal{Z}\big(\mathcal{F}(z + \mathcal{Z}(c);\,\Theta_c)\big) \tag{6}$$

where $\mathcal{F}(\cdot;\,\Theta)$ denotes the frozen network block with pre-trained weights $\Theta$, $\Theta_c$ are the trainable ControlNet weights, and $c$ is the conditioning input.
Here, $\mathcal{Z}(\cdot)$ denotes a zero convolution: a 1×1 convolution whose weights and bias are initialized to zero. It preserves the spatial dimensions of the input and integrates additional information (such as prompts) without altering the feature map size. The output of the trimmed ControlNet is then processed by the decoder $\mathcal{D}$ to generate the final output $\hat{x}$:

$$\hat{x} = \mathcal{D}(y_c) \tag{7}$$
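For reference, a zero convolution of this kind can be sketched in a few lines of PyTorch; this is an illustrative snippet following the ControlNet design, not code from the SUPIR codebase.

```python
import torch.nn as nn

def zero_conv(in_channels: int, out_channels: int) -> nn.Conv2d:
    """1x1 convolution with weights and bias initialized to zero, so the
    control branch contributes nothing at the start of training while
    preserving the spatial dimensions of the feature map."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```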
The authors also introduced the LLaVA [20] large language model to identify the content of low-quality images subjected to severe degradation and to output it as text descriptions, which are then used as prompts to guide the restoration. Additionally, negative prompts are employed to manage the output quality of the image generation model through Classifier-Free Guidance (CFG) [10]. Specifically, at each diffusion step, we make two predictions using the positive prompt $c_{pos}$ and the negative prompt $c_{neg}$, and take the fusion of these two results as the final output $z_{t-1}$:
$$z_{t-1}^{pos} = H(z_t, z_{LQ}, \sigma_t, c_{pos}) \tag{8}$$

$$z_{t-1}^{neg} = H(z_t, z_{LQ}, \sigma_t, c_{neg}) \tag{9}$$

$$z_{t-1} = z_{t-1}^{neg} + \lambda_{cfg}\big(z_{t-1}^{pos} - z_{t-1}^{neg}\big) \tag{10}$$
where $H$ is our diffusion model with the adapter, $\sigma_t$ is the variance of the noise at time step $t$, and $\lambda_{cfg}$ is a hyperparameter. In our framework, $c_{pos}$ can be the image description together with positive quality words, and $c_{neg}$ consists of negative quality words. For instance, negative prompts can direct the model to avoid generating blurry, distorted, or low-quality images. Consequently, the SUPIR model is capable of generating high-quality images within the latent space, which are then transformed back into image space by a fixed decoder. Furthermore, by training on the dataset and leveraging the properties of diffusion models, restoration is performed selectively based on LLaVA's prompts, effectively addressing a range of restoration requirements.
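A minimal sketch of the fusion step in Eqs. (8)–(10) is given below; `H` stands in for the diffusion model with the adapter, and its call signature and the default value of $\lambda_{cfg}$ are assumptions for illustration.

```python
import torch

@torch.no_grad()
def cfg_step(H, z_t, z_lq, sigma_t, c_pos, c_neg, lam_cfg: float = 4.0):
    """One sampling step fusing positive- and negative-prompt predictions."""
    z_pos = H(z_t, z_lq, sigma_t, c_pos)        # Eq. (8)
    z_neg = H(z_t, z_lq, sigma_t, c_neg)        # Eq. (9)
    return z_neg + lam_cfg * (z_pos - z_neg)    # Eq. (10)
```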
In this study, we explore not only SDXL but also several other variants of the Stable Diffusion model. Among these, the SDXL model demonstrates superior performance, while the SDXL-Lightning variant also exhibits commendable capabilities. SDXL-Lightning is particularly noteworthy for reducing the number of inference steps to 15 without compromising image quality, enabling the generation of high-quality images within a significantly shorter time frame.
3.3 Training for Low Rank Adaptation
In this study, we adopt LoRA to adapt the SDXL model and enhance its performance in facial image generation. LoRA efficiently fine-tunes the pre-trained model by introducing low-rank factorization matrices without a large increase in memory or computational overhead. Specifically, we incorporate LoRA adaptation layers into the SDXL model and train two separate LoRA models: one on 1300 landscape images and a second specifically on 300 facial images. All images are preprocessed to a resolution of 512×512 to ensure data consistency and quality, and each image is annotated with a detailed caption to support text-conditioned training. This approach has been shown to be efficient and adaptable under resource-constrained conditions in multiple studies [2]. During training, the LoRA up-projection matrix is initialized to zero and the down-projection matrix is initialized with a random Gaussian distribution, ensuring stability in the early stages of training. We tune hyperparameters such as the learning rate based on observed performance. Through this low-rank adaptation, we improve the SUPIR model's performance in facial image generation and validate the effectiveness of LoRA for model fine-tuning.
As mentioned earlier, LoRA represents parameter updates through low-rank matrix decomposition. Given a pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA expresses its update as the product of two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, such that:

$$W' = W + \Delta W = W + BA \tag{11}$$

where the rank $r$ is much smaller than the original dimensions of $W$, i.e., $r \ll \min(d, k)$.
During training, LoRA updates only the matrices $A$ and $B$, keeping $W$ unchanged. This drastically reduces the number of trainable parameters: for a weight matrix $W \in \mathbb{R}^{d \times k}$ with chosen rank $r$, the number of parameters to be adjusted decreases from $dk$ to $r(d + k)$. For example, with $d = k = 1024$ and $r = 8$, this is a reduction from 1,048,576 to 16,384 parameters. The training process can be described as follows:
Initialize the low-rank matrices $A$ and $B$. At each iteration, update $A$ and $B$ based on the gradient of the loss function:

$$A \leftarrow A - \eta\,\nabla_A \mathcal{L} \tag{12}$$

$$B \leftarrow B - \eta\,\nabla_B \mathcal{L} \tag{13}$$
where $\eta$ is the learning rate and $\mathcal{L}$ is the loss function.
The effective weight matrix during training is given by:

$$W_{\mathrm{eff}} = W + BA \tag{14}$$
The optimization objective can be expressed as minimizing the loss function with respect to the effective weights:

$$\min_{A,\,B}\ \mathcal{L}\big(W + BA\big) \tag{15}$$
To ensure convergence and stability, regularization terms can be added to the loss function to penalize large updates in $A$ and $B$:

$$\mathcal{L}_{\mathrm{reg}} = \mathcal{L}\big(W + BA\big) + \lambda\big(\|A\|_F^2 + \|B\|_F^2\big) \tag{16}$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda$ is a regularization parameter. By this means, LoRA achieves efficient parameter updates and demonstrates strong performance across tasks such as image and text generation, while significantly reducing computational and storage costs.
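As a sketch of how Eqs. (12)–(16) translate into code, the regularized objective and the restriction of updates to $A$ and $B$ can be written as follows; the $\lambda$ value and the use of plain SGD are illustrative assumptions.

```python
import torch

def lora_reg_loss(task_loss: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                  lam: float = 1e-4) -> torch.Tensor:
    """Eq. (16): task loss plus Frobenius-norm penalties on the LoRA factors."""
    return task_loss + lam * (A.norm(p="fro") ** 2 + B.norm(p="fro") ** 2)

# Because W is frozen, Eqs. (12)-(13) reduce to ordinary gradient steps on
# A and B only, e.g.:
#   optimizer = torch.optim.SGD([A, B], lr=eta)
#   loss = lora_reg_loss(task_loss, A, B)
#   loss.backward(); optimizer.step()
```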
4 Experiments
4.1 Model Training and Quantitative Comparisons
Table 1: Quantitative comparison under four degradation settings. ↑: higher is better; ↓: lower is better.

| Degradation | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| Blur + Noise | Ours | 32.19 | 0.7434 | 0.0932 |
| | Lightning-LoRA | 29.37 | 0.5834 | 0.1232 |
| | HWXL-LoRA | 28.87 | 0.6025 | 0.1183 |
| | SUPIR | 29.46 | 0.4203 | 0.1402 |
| | Lightning | 29.63 | 0.5523 | 0.2085 |
| | HWXL | 29.13 | 0.5856 | 0.1490 |
| SR | Ours | 29.64 | 0.6382 | 0.0916 |
| | Lightning-LoRA | 29.03 | 0.5795 | 0.1328 |
| | HWXL-LoRA | 28.78 | 0.6004 | 0.1265 |
| | SUPIR | 29.81 | 0.5934 | 0.1432 |
| | Lightning | 28.57 | 0.5357 | 0.2250 |
| | HWXL | 28.64 | 0.5774 | 0.1688 |
| Blur + SR | Ours | 29.38 | 0.5651 | 0.1250 |
| | Lightning-LoRA | 28.11 | 0.5436 | 0.1414 |
| | HWXL-LoRA | 28.65 | 0.5609 | 0.1293 |
| | SUPIR | 27.75 | 0.4702 | 0.1306 |
| | Lightning | 27.33 | 0.4694 | 0.2880 |
| | HWXL | 28.78 | 0.5495 | 0.1803 |
| Blur + SR + Noise | Ours | 18.48 | 0.2881 | 0.3505 |
| | Lightning-LoRA | 17.44 | 0.2590 | 0.4075 |
| | HWXL-LoRA | 19.70 | 0.2705 | 0.2907 |
| | SUPIR | 18.57 | 0.2808 | 0.3225 |
| | Lightning | 17.39 | 0.1681 | 0.6379 |
| | HWXL | 20.94 | 0.2823 | 0.4472 |
For training, we use the AdamW optimizer [21] with a learning rate of 0.0001. Training spans 2 days with a batch size of 256. In our experiments, integrating two LoRA modules with the SDXL model in the SUPIR framework yields remarkable improvements in image restoration. We evaluated three SDXL variants: SDXL, SDXL-Lightning, and HelloWorld-XL. HelloWorld-XL was trained on 20,821 images covering a variety of people and actions as well as many lifelike animals, and its close-up portrait outputs are of higher quality than SDXL's. HelloWorld-XL also intentionally includes some low-quality images in training to strengthen the model's response to negative prompts, which explains its strong handling of blur and noise. The proposed method achieves excellent performance across multiple metrics: PSNR values significantly higher than the baselines, indicating clearer and more accurate restorations; notably lower LPIPS scores, reflecting better perceptual similarity to the ground truth; and substantially improved SSIM scores, showing enhanced structural fidelity and visual quality. These results affirm the effectiveness of our approach in producing high-quality restorations, making it a promising solution for advanced image restoration applications.
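A hedged sketch of this optimization setup is shown below; it assumes the LoRA factors can be identified by a `"lora"` substring in their parameter names, which is a common convention but not necessarily how our training code is organized.

```python
import torch
import torch.nn as nn

def build_lora_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.AdamW:
    """Freeze all base weights and optimize only the LoRA factors with AdamW."""
    lora_params = []
    for name, param in model.named_parameters():
        if "lora" in name.lower():
            lora_params.append(param)        # trainable low-rank factors
        else:
            param.requires_grad = False      # frozen pre-trained SDXL weights
    return torch.optim.AdamW(lora_params, lr=lr)
```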
To generate low-quality test images, we introduced various degradations ranging from simple to complex. For quantitative comparison, we selected the full-reference metrics PSNR, SSIM, and LPIPS [49]. Compared with the original SUPIR method, our method improves on all of these metrics. Likewise, Fig. 2 shows that our model achieves good results in face restoration, with visible progress over SUPIR in small details and colors.
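For reference, the three full-reference metrics can be computed with the scikit-image and lpips packages as in the sketch below; the preprocessing shown is an illustrative assumption and may differ from our exact evaluation pipeline.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")            # LPIPS [49] with AlexNet features

def evaluate(restored: np.ndarray, reference: np.ndarray) -> dict:
    """Compute PSNR, SSIM, and LPIPS for HxWx3 uint8 images."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored,
                                 channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_net(to_t(restored), to_t(reference)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```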
Regarding fine details, our method exhibits some advantages over the original SUPIR model. For example, in Fig. 1 the texture of the goat's wool produced by the LoRA-tuned model is more faithful to the original image. In the image of the little girl, the earrings are barely visible in the low-quality input; the SUPIR model restores them as hair, whereas the LoRA-tuned model recovers the earrings. This demonstrates that our method generates high-fidelity textures.
Table 2: Quantitative comparison with other methods.

| Metric | PASD | Stable-SR | DiffBIR | SUPIR | Ours |
|---|---|---|---|---|---|
| PSNR ↑ | 26.87 | 19.76 | 28.72 | 27.74 | 29.38 |
| SSIM ↑ | 0.4513 | 0.4051 | 0.4663 | 0.4702 | 0.5651 |
| LPIPS ↓ | 0.1828 | 0.2418 | 0.1289 | 0.1306 | 0.1250 |
Table 3: Computational time comparison (in seconds).

| Metric | PASD | Stable-SR | DiffBIR | SUPIR | Ours |
|---|---|---|---|---|---|
| Computational time (s) | 28.65 | 1013 | 45.20 | 18.44 | 11.28 |
4.2 Comparison with Other Methods
We also conducted tests on low-quality images and compared our method with other models, including DiffBIR [19], Stable-SR [34], and PASD [41], using the full-reference metrics PSNR, SSIM, and LPIPS. Our method achieves the best scores on all three, indicating that our restored images have higher fidelity and perceptual similarity to the reference images than those of the other methods.
LoRA reduces the complexity of the model through parameter decomposition, thereby reducing inference time. As shown in Tab. 3, our method is nearly 7 seconds faster than the original SUPIR. Compared with the other models, our method still requires the least time; Stable-SR in particular needs 200 steps to generate a satisfactory image and therefore consumes far more time. This efficiency gain demonstrates the effectiveness of our approach for large-scale models, and the reduction in computational time does not compromise image quality, as evidenced by the consistent performance metrics. Moreover, in Fig. 3 we can clearly see the differences between Stable-SR and the other models; its performance is not strong. The PASD model restores details well, as in case 1, but its ability to restore heavily noisy and blurred images is limited: in case 2 it fails to recover the windows of distant high-rise buildings and leaves noise artifacts in the restored image, and in case 3 its restoration of the clock changes the original color.
5 Conclusion
In this study, we propose an enhanced image restoration model based on LoRA modules and the SDXL framework. Our method exploits LoRA to fine-tune the SDXL model, improving restoration quality while reducing computation time. Experiments show that the proposed model outperforms the original SUPIR model and several other methods in most image degradation scenarios, indicating better performance and structural fidelity. However, some challenges remain. When the added blur and noise are too severe, the LoRA-enhanced SUPIR model performs only on par with the original SUPIR model, indicating that the effectiveness of the LoRA modules decreases under extreme degradation. This observation suggests a direction for future research: developing more robust adaptation mechanisms that handle heavy blur and noise more effectively. In addition, expanding the image dataset to cover a wider variety of real-world images could further enhance the model's ability to recover fine details.
References
- [1] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [2] Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.
- [3] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [4] Zheng Chen, Yulun Zhang, Ding Liu, Jinjin Gu, Linghe Kong, Xin Yuan, et al. Hierarchical integration diffusion model for realistic image deblurring. Advances in Neural Information Processing Systems, 36, 2024.
- [5] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10850–10869, 2023.
- [6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
- [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [8] Yuchen Fan, Jiahui Yu, Yiqun Mei, Yulun Zhang, Yun Fu, Ding Liu, and Thomas S. Huang. Neural sparse representation for image restoration. Advances in Neural Information Processing Systems, 33:15394–15404, 2020.
- [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
- [10] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [13] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
- [14] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S. Ren, and Dong Chao. PIPAL: A large-scale image quality assessment dataset for perceptual image restoration. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 633–651. Springer, 2020.
- [15] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606, 2022.
- [16] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. Large-scale text-to-image generation models for visual artists' creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 919–933, 2023.
- [17] Dawa Chyophel Lepcha, Bhawna Goyal, Ayush Dogra, and Vishal Goyal. Image super-resolution: A comprehensive review, recent trends, challenges and applications. Information Fusion, 91:230–260, 2023.
- [18] Shanchuan Lin, Anran Wang, and Xiao Yang. SDXL-Lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024.
- [19] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, and Chao Dong. DiffBIR: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
- [20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [22] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
- [23] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3):21–36, 2003.
- [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [25] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- [26] Mengwei Ren, Mauricio Delbracio, Hossein Talebi, Guido Gerig, and Peyman Milanfar. Multiscale structure guided diffusion for image deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10721–10733, 2023.
- [27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [28] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.
- [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [30] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- [31] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
- [32] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [34] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, pages 1–21, 2024.
- [35] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 606–615, 2018.
- [36] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490, 2022.
- [37] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
- [38] Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. DiffIR: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13095–13105, 2023.
- [39] Xuefeng Xiao, Lianwen Jin, Yafeng Yang, Weixin Yang, Jun Sun, and Tianhai Chang. Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition. Pattern Recognition, 72:72–81, 2017.
- [40] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
- [41] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023.
- [42] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. arXiv preprint arXiv:2401.13627, 2024.
- [43] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
- [44] Jiale Zhang, Yulun Zhang, Jinjin Gu, Yongbing Zhang, Linghe Kong, and Xin Yuan. Accurate image restoration with attention retractable transformer. arXiv preprint arXiv:2210.01427, 2022.
- [45] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
- [46] Kaihao Zhang, Wenqi Ren, Wenhan Luo, Wei-Sheng Lai, Björn Stenger, Ming-Hsuan Yang, and Hongdong Li. Deep image deblurring: A survey. International Journal of Computer Vision, 130(9):2103–2130, 2022.
- [47] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [48] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [49] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- [50] Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1219–1229, 2023.