Title

基于深度生成模型的高逼真图像修复 (High-Realistic Image Inpainting Based on Deep Generative Models)

Other Title (English)
HIGH REALISTIC IMAGE INPAINTING BASED ON DEEP GENERATIVE MODELS
Name
周翔
Name (Pinyin)
ZHOU Xiang
Student ID
12032872
Degree Type
Master's degree
Degree Discipline
0809 Electronic Science and Technology
Subject Category / Professional Degree Category
08 Engineering
Supervisor
曾媛 (ZENG Yuan)
Supervisor's Affiliation
Research Institute of Trustworthy Autonomous Systems (斯发基斯可信自主系统研究院)
Thesis Defense Date
2023-11-06
Thesis Submission Date
2024-01-11
Degree-Granting Institution
Southern University of Science and Technology
Place of Degree Conferral
Shenzhen
Abstract

Image inpainting is an important research direction in computer vision. It aims to fill the missing regions of an image with content that is visually realistic and semantically consistent with the original image. Image inpainting not only supports basic applications such as de-occlusion, object removal, and image editing, but also helps solve more complex high-level vision tasks such as 3D scene completion, and therefore has great research value. With the development of deep learning, image inpainting based on deep generative models has made remarkable progress, and inpainting quality keeps improving. Nevertheless, the results of deep image inpainting still exhibit flaws that humans notice easily, such as color inconsistencies, artifacts, and a lack of contextual understanding, so achieving high-realistic deep image inpainting remains a challenging task. Against this background, this thesis studies high-realistic image inpainting methods based on deep generative models.
First, to address the problem that deep inpainting models must generate both low-level texture representations and high-level semantic representations, this thesis proposes a deep image inpainting model based on self-attention with adaptive temperatures. Existing self-attention-based inpainting methods fix the temperature parameter and therefore attend to only a limited set of spatial locations in the feature space. The proposed method introduces Adaptive multi-Temperature Mask-guided Attention (ATMA), which adaptively adjusts the softness of the attention distribution through multiple learnable temperature parameters, thereby improving the network's feature representation and the inpainting quality. Experimental results on three image inpainting benchmark datasets, CelebA-HQ, Paris StreetView, and Places2, demonstrate that the proposed model surpasses current state-of-the-art models in inpainting realism.
Second, this thesis analyzes the artifact generation and training instability observed with ATMA and identifies their sources. Building on this analysis, an image inpainting framework based on Multi-Head Temperature Masked Self-Attention (MHTMA) is proposed. The framework learns multiple temperatures in a parallel, stable, and efficient manner, and exploits multiple distant contexts within self-attention to improve inpainting quality. Experiments on CelebA-HQ, Paris StreetView, and Places2 show that MHTMA improves the computational efficiency and training stability of image inpainting, enhances model interpretability, and increases the realism of deep image inpainting. An extension of this method further enables users to generate diverse, stroke-guided inpainting results.
Finally, to address the difficulty of evaluating model performance in image inpainting research, this thesis builds an interactive image inpainting platform that is used to evaluate and validate the proposed models and to provide users with convenient image inpainting and editing services. The platform is developed mainly with Python and the Flask framework.

Other Abstract (English)

Image inpainting, an important research area in computer vision, aims to fill missing regions with visually realistic and semantically coherent content. It serves not only as a general method to support fundamental visual applications such as de-occlusion, object removal, and image editing, but also as a means to tackle complex visual tasks such as 3D scene completion. With the advancement of deep learning, deep generative model-based image inpainting has made remarkable progress, with continually improving quality. However, its results still exhibit flaws that are easily perceptible to humans, such as color inconsistencies, artifacts, and a lack of contextual understanding, so achieving high-realistic image inpainting through deep generative models remains a challenging task. In this context, this thesis investigates high-realistic image inpainting methods based on deep generative models.

Firstly, a novel image inpainting model based on adaptive multi-temperature attention is proposed to address the need to generate both low-level texture and high-level semantic representations in deep inpainting models. This approach diverges from existing attention-based inpainting techniques, which typically attend to a limited set of spatial locations by fixing the temperature to a constant. An attention module called Adaptive multi-Temperature Mask-guided Attention (ATMA) is introduced; ATMA dynamically adjusts the softness of attention through multiple learnable temperatures, enhancing the feature representation and improving the inpainting quality. Experiments on three benchmark datasets, CelebA-HQ, Paris StreetView, and Places2, show that the proposed model achieves higher-quality image inpainting than current state-of-the-art models.
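For readers unfamiliar with temperature-scaled attention, the sketch below illustrates the core mechanism in PyTorch: a masked attention layer whose softmax sharpness is controlled by a learnable temperature. It is a minimal illustration under stated assumptions (the class name, the log-space parameterization, and the use of a single temperature are choices made here for brevity), not the ATMA module itself, which as described above learns multiple temperatures.

import torch
import torch.nn.functional as F


class TemperatureMaskedAttention(torch.nn.Module):
    """Illustrative sketch only, not the ATMA module from the thesis.

    A masked attention layer whose softmax sharpness is set by a
    learnable temperature; ATMA as described uses several such temperatures.
    """

    def __init__(self, dim: int, init_temperature: float = 1.0):
        super().__init__()
        self.query = torch.nn.Linear(dim, dim)
        self.key = torch.nn.Linear(dim, dim)
        self.value = torch.nn.Linear(dim, dim)
        # Log-space parameterization keeps the learned temperature positive.
        self.log_tau = torch.nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, x: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened feature map; valid_mask: (B, N), 1 = known pixel.
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # (B, N, N)
        # Hole positions may not serve as keys: mask them out before the softmax.
        scores = scores.masked_fill(valid_mask[:, None, :] == 0, float("-inf"))
        tau = self.log_tau.exp()
        # Small tau -> sharply copy the best-matching known region;
        # large tau -> softly blend many known regions.
        attn = F.softmax(scores / tau, dim=-1)
        return attn @ v

The intuition is that a small temperature sharpens the softmax toward copying the single best-matching known region, while a large temperature blends many regions; making the temperature learnable lets the network choose the appropriate softness instead of fixing it by hand.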

Next, we provide an in-depth analysis and identify the causes of several problems in ATMA, such as generated artifacts and training instability. A new image inpainting framework based on Multi-Head Temperature Masked Self-Attention (MHTMA) is therefore introduced to address these issues. This approach learns multiple temperatures in a parallel, stable, and efficient manner, and exploits multiple sources of distant contextual information within self-attention to enhance inpainting quality. Experiments on CelebA-HQ, Paris StreetView, and Places2 show that MHTMA improves inpainting efficiency, training stability, and interpretability, and produces more realistic deep image inpainting results. Additionally, an extension of this method enables users to produce diverse stroke-guided inpainting results.
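As a rough illustration of the per-head temperature idea described above (not the authors' MHTMA implementation; all names and tensor shapes below are assumptions made for the sketch), each attention head can be given its own learnable temperature so that several softness levels are learned jointly and in parallel:

import torch
import torch.nn.functional as F


class MultiHeadTemperatureAttention(torch.nn.Module):
    """Illustrative sketch of a per-head learnable temperature, not the thesis code."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.proj = torch.nn.Linear(dim, dim)
        # One learnable log-temperature per head, trained jointly and in parallel.
        self.log_tau = torch.nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, N, C) -> (B, heads, N, head_dim).
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)  # (B, H, N, N)
        scores = scores.masked_fill(valid_mask[:, None, None, :] == 0, float("-inf"))
        tau = self.log_tau.exp().view(1, self.num_heads, 1, 1)  # per-head temperature
        attn = F.softmax(scores / tau, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Keeping one scalar temperature per head adds a negligible number of parameters while still letting different heads specialize in sharp copying versus soft blending of distant context.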

Finally, an interactive image inpainting platform is developed to address the difficulty of evaluating model effectiveness in inpainting research. Built with Python and the lightweight Flask framework, the platform serves to assess the performance of the proposed inpainting models and offers users convenient image inpainting and editing.
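A minimal sketch of how such a Flask service might be wired up is shown below; the /inpaint route, the expected form fields, and the run_inpainting placeholder are assumptions made for illustration, not the platform's actual API.

import io

from flask import Flask, request, send_file
from PIL import Image

app = Flask(__name__)


def run_inpainting(image: Image.Image, mask: Image.Image) -> Image.Image:
    # Hypothetical stand-in for the trained generative model; it simply
    # returns the input so the sketch stays self-contained and runnable.
    return image


@app.route("/inpaint", methods=["POST"])
def inpaint():
    # Expects two uploaded files: the damaged image and a binary hole mask.
    image = Image.open(request.files["image"]).convert("RGB")
    mask = Image.open(request.files["mask"]).convert("L")
    result = run_inpainting(image, mask)
    buf = io.BytesIO()
    result.save(buf, format="PNG")
    buf.seek(0)
    return send_file(buf, mimetype="image/png")


if __name__ == "__main__":
    app.run(debug=True)

A client could then exercise the endpoint with, for example, curl -F image=@damaged.png -F mask=@mask.png http://localhost:5000/inpaint -o result.png.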

Keywords
Language
Chinese
Training Category
Independent training
Year of Enrollment
2020
Year of Degree Conferral
2023-12

Degree Evaluation Subcommittee
Electronic Science and Technology
Chinese Library Classification (CLC) Number
TP391
Source Repository
Manually submitted
Item Type
Degree thesis
Identifier
http://sustech.caswiz.com/handle/2SGJ60CL/673933
Collection
Southern University of Science and Technology, College of Engineering / Department of Electronic and Electrical Engineering
Recommended Citation
GB/T 7714
周翔. 基于深度生成模型的高逼真图像修复[D]. 深圳: 南方科技大学, 2023.
Files in This Item
12032872-周翔-电子与电气工程系 (18782 KB): restricted access; full text available on request.