School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
NIE Kaiqin (1998- ), female, is a master's student at Zhejiang Gongshang University. Her main research interest is deep learning.
NI Zhengwei (1989- ), male, Ph.D., is an associate researcher at Zhejiang Gongshang University. His main research interests include machine learning, the Internet of things, and wireless communication.
Received: 2024-01-09
Revised: 2024-05-06
Published in print: 2024-05-20
NIE Kaiqin, NI Zhengwei. Image synthesis method based on multiple text description[J]. Telecommunications Science, 2024, 40(5): 73-85. DOI: 10.11959/j.issn.1000-0801.2024142.
To address the low quality and structural errors of images generated from a single text description, a multi-stage generative adversarial network model was adopted, and an interpolation operation over different text sequences was proposed: features were extracted from multiple text descriptions to enrich the given description and impart greater detail to the generated images. To strengthen the correlation between the generated images and the corresponding text, a multi-caption deep attentional multi-modal similarity model was introduced to obtain attention features, which were combined with the visual features of the preceding layer and served as input to the subsequent layer, improving both the realism of the generated images and their semantic consistency with the text descriptions. In addition, a self-attention mechanism was incorporated so that the model could learn to coordinate the details at every position, enabling the generator to produce images better aligned with real-world scenes. The optimized model was validated on the CUB and MS-COCO datasets, and the generated images showed intact structures, stronger semantic consistency, and richer, more diverse visual effects.
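The two core operations summarized above, interpolating several caption embeddings of one image into an enriched conditioning vector and applying self-attention over spatial positions, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the embedding dimension, the Dirichlet mixing weights, and the flattened feature-map shape are all assumptions for the sake of a runnable example.

```python
import numpy as np

def interpolate_captions(caption_embeddings, rng=None):
    """Mix the K caption embeddings of one image into a single enriched
    conditioning vector via a random convex combination (interpolation)."""
    rng = np.random.default_rng() if rng is None else rng
    E = np.asarray(caption_embeddings, dtype=np.float64)  # shape (K, D)
    w = rng.dirichlet(np.ones(E.shape[0]))                # weights >= 0, sum to 1
    return w @ E                                          # shape (D,)

def self_attention(x):
    """Scaled dot-product self-attention over N flattened spatial positions.
    x: array of shape (N, C); each output row attends to all positions,
    letting every location coordinate its details with the whole image."""
    scores = x @ x.T / np.sqrt(x.shape[1])                # (N, N) similarities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # row-wise softmax
    return attn @ x                                       # (N, C) attended features

# Example: three 4-dimensional caption embeddings of the same image
caps = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
mixed = interpolate_captions(caps, rng=np.random.default_rng(0))
```

In practice the mixed vector would condition the first-stage generator, while the self-attention block would sit inside the generator so that distant image regions can influence each other; both details are design assumptions here rather than a description of the paper's exact architecture.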