A method of synthetic spoofing speech detection using self-supervised contrastive learning

YANG Man; JIAN Zhihua; LIANG Chenghan

doi:10.11959/j.issn.1000-0801.2024236

Research and Development | Updated: 2025-03-17

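- Journal: Telecommunications Science, 2024, Vol. 40, No. 11, pp. 40-49
- Author affiliation: School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, Zhejiang, China
- Author biographies:
  - YANG Man (born 2000), female, master's student at the School of Communication Engineering, Hangzhou Dianzi University. Her main research interest is spoofed speech detection.
  - JIAN Zhihua (born 1978), male, Ph.D., associate professor and master's supervisor at the School of Communication Engineering, Hangzhou Dianzi University. His main research interests include spoofed speech detection, privacy protection in speech, and voice conversion and generation.
  - LIANG Chenghan (born 2001), male, master's student at the School of Communication Engineering, Hangzhou Dianzi University. His main research interests are spoofed speech detection and voiceprint anti-spoofing.
- Funding: National Natural Science Foundation of China (61201301; 61772166)
- CLC number: TN912
- Received: 2024-07-30; revised: 2024-10-07; print publication date: 2024-11-20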

YANG Man, JIAN Zhihua, LIANG Chenghan. A method of synthetic spoofing speech detection using self-supervised contrastive learning[J]. Telecommunications Science, 2024, 40(11): 40-49. DOI: 10.11959/j.issn.1000-0801.2024236.

Abstract

To eliminate the effect of the imbalance between the numbers of bona fide and spoofed speech samples in the training dataset on the performance of synthetic spoofing speech detection systems, and to further improve detection accuracy, a synthetic speech detection method based on self-supervised contrastive learning is proposed. The method treats pitch-shifted samples as negative samples and trains a neural network so that the features of the anchor sample differ from those of the negative samples, which drives the network to extract features that are sensitive to pitch transformation. A deep residual network then serves as the back-end classifier that decides whether the speech is genuine or spoofed. Experimental results show that, compared with methods based on traditional hand-crafted acoustic features, deep learning-based spoofing speech detection systems, and end-to-end spoofing speech detection systems, the proposed method significantly reduces the equal error rate. Because the method trains the network to extract pitch-sensitive features and is not affected by the imbalance between bona fide and spoofed speech in the dataset, it significantly improves the accuracy of synthetic spoofing speech detection.
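The core idea of the abstract, treating pitch-shifted copies of an utterance as negatives for the anchor, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the loss function, the pitch-shifting tool, or the encoder. The InfoNCE-style loss, the positive view, the temperature value, and the helper names `pitch_shift_by_resampling` and `info_nce_loss` are all assumptions made for illustration.

```python
import numpy as np


def pitch_shift_by_resampling(wave, semitones):
    """Crude pitch shift by linear-interpolation resampling (assumed stand-in).

    Resampling by 2**(semitones/12) raises or lowers the pitch but also
    changes the duration; a real system would more likely use a
    duration-preserving pitch shifter.
    """
    factor = 2.0 ** (semitones / 12.0)
    n_out = max(2, int(round(len(wave) / factor)))
    src_positions = np.linspace(0.0, len(wave) - 1, n_out)
    return np.interp(src_positions, np.arange(len(wave)), wave)


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (assumed loss form).

    anchor, positive: arrays of shape (d,)
    negatives: array of shape (k, d), e.g. embeddings of pitch-shifted copies
    The loss is small when the anchor is close to the positive and far
    from every negative in cosine similarity.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)
    # Index 0 holds the positive similarity, the rest hold the negatives.
    logits = np.concatenate(([a @ p], n @ a)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)
```

In a training loop, the embeddings would come from the network being trained, and minimizing this loss pushes the anchor embedding away from the embeddings of its pitch-shifted copies, which is the behavior the abstract describes.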


