Bimodal Cross-Corpus Speech Emotion Recognition
Abstract: In speech emotion recognition (SER), bimodal cross-corpus recognition has received little attention: a heterogeneity gap exists between different modalities, most cross-corpus SER uses only the audio modality, and methods that focus solely on reducing the distribution gap between corpora tend to discard emotion-discriminative features. These issues were addressed simultaneously. The YouTube dataset was selected as source data and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) as target data. The openSMILE toolkit was used to extract speech features from both source and target data, and the extracted features were fed into a convolutional neural network (CNN) and a bidirectional long short-term memory network (BLSTM) to obtain higher-level speech features; the text modality consisted of the transcripts of the speech signals. First, the text was vectorized by Bidirectional Encoder Representations from Transformers (BERT) and text features were extracted by a BLSTM; a modality-invariance loss was then designed to form a common representation space for the two modalities. To solve the cross-corpus SER problem, a common subspace of source and target data was learned by jointly optimizing linear discriminant analysis (LDA), maximum mean discrepancy (MMD), graph embedding (GE), and label smoothing regularization (LSR). To preserve emotion-discriminative features, an emotion-aware center loss was combined with MMD + GE + LDA + LSR. An SVM classifier performed the final emotion classification on the transferred common subspace. Experimental results on IEMOCAP showed that the proposed method outperformed other state-of-the-art cross-corpus and bimodal SER methods.
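The cross-corpus part of the approach hinges on minimizing the maximum mean discrepancy (MMD) between source-corpus and target-corpus feature distributions. The sketch below is a minimal, self-contained illustration of a squared-MMD term with an RBF kernel; the function names, the median bandwidth heuristic, and the 1582-dimensional toy features are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an RBF-kernel MMD^2 term for cross-corpus alignment.
# Assumes utterance-level feature matrices xs (source) and xt (target),
# one row per sample; all names and sizes are illustrative.
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(xs, xt, gamma=None):
    """Squared maximum mean discrepancy between source and target features."""
    if gamma is None:
        # Median heuristic for the kernel bandwidth (an assumption, not from the paper).
        all_x = np.vstack([xs, xt])
        d = np.sqrt(((all_x[:, None, :] - all_x[None, :, :]) ** 2).sum(-1))
        gamma = 1.0 / (2.0 * np.median(d[d > 0]) ** 2)
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Toy usage with random 1582-dimensional openSMILE-style feature vectors.
rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 1582))   # source-corpus features (e.g., YouTube data)
xt = rng.normal(size=(32, 1582))   # target-corpus features (e.g., IEMOCAP)
print(f"MMD^2 between source and target: {mmd2(xs, xt):.4f}")
```

In the paper's joint objective this term is combined with LDA, GE, LSR, and the emotion-aware center loss; the snippet only shows how the distribution-gap term itself could be evaluated.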
Figure 1. Structure of the BMTDAFSL method
Table 1. Experiment parameters

Parameter                                      Value
Dropout                                        0.5
Learning rate                                  0.001
Batch size                                     64
Number of training epochs                      10
BLSTM hidden-layer nodes                       128
Heads in the multi-head attention mechanism    8
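For illustration only, the sketch below shows how the hyperparameters in Table 1 (BLSTM hidden size 128, 8 attention heads, dropout 0.5, learning rate 0.001, batch size 64) might instantiate a single BLSTM branch with multi-head self-attention in PyTorch; the class name, the 1582-dimensional input, the mean pooling, and the four-class output are assumptions rather than the paper's actual architecture.

```python
# Hypothetical single-branch configuration using the Table 1 hyperparameters.
import torch
import torch.nn as nn

class BLSTMAttentionBranch(nn.Module):
    def __init__(self, input_dim=1582, hidden=128, heads=8, n_classes=4, p_drop=0.5):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        # Bidirectional output is 2 * hidden = 256 dimensional.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, input_dim)
        h, _ = self.blstm(x)                   # (batch, time, 256)
        a, _ = self.attn(h, h, h)              # self-attention over time steps
        pooled = self.dropout(a.mean(dim=1))   # average pooling over time
        return self.classifier(pooled)

model = BLSTMAttentionBranch()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from Table 1
batch = torch.randn(64, 100, 1582)             # batch size 64, 100 frames per utterance
logits = model(batch)
print(logits.shape)                            # torch.Size([64, 4])
```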
Table 2. Comparison with previous work on bimodal SER and with our method using only the audio or text modality
Method                   Accuracy/%
Ref. [3]                 75.23
Ref. [4]                 71.86
Ref. [26]                78.25
Ours (audio only)        76.74
Ours (text only)         80.69
Ours (audio + text)      87.03
Table 3. Comparison with previous work on cross-corpus SER
Method       Accuracy/%
Ref. [13]    69.23
Ref. [16]    72.56
Ref. [14]    72.84
Ours         87.03

References
[1]KOROMILAS P,GIANNAKOPOULOS T. Deep multimodal emotion recognition on human speech: a review[J]. Applied Sciences,2021,11(17): 7962.
[2]WEN H,YOU S,FU Y. Cross-modal dynamic convolution for multi-modal emotion recognition[J]. Journal of Visual Communication and Image Representation,2021(78): 103178.
[3]SINGH P,SRIVASTAVA R,RANA K P S,et al. A multimodal hierarchical approach to speech emotion recognition from audio and text[J]. Knowledge-Based Systems,2021(229): 107316.
[4]CAI L,HU Y,DONG J,et al. Audio-textual emotion recognition based on improved neural networks[J]. Mathematical Problems in Engineering,2019(6): 1-9.
[5]WANG X S,CHEN X,CAO C. Human emotion recognition by optimally fusing facial expression and speech feature[J]. Signal Processing: Image Communication,2020(84): 115831.
[6]WANG M,HUANG Z,LI Y,et al. Maximum weight multi-modal information fusion algorithm of electroencephalographs and face images for emotion recognition[J]. Computers and Electrical Engineering,2021(94): 107319.
[7]ZHANG H,HUANG H,HAN H. A Novel heterogeneous parallel convolution Bi-LSTM for speech emotion recognition[J]. Applied Sciences,2021,11(21): 9897.
[8]CHEN G H, ZENG X P. Multi-modal emotion recognition by fusing correlation features of speech-visual[J]. IEEE Signal Processing Letters,2021(28): 533-537.
[9]ZHANG S,CHEN M,CHEN J,et al. Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition[J]. Knowledge-Based Systems,2021(229): 107340.
[10]DONG G N,PUN C M,ZHANG Z. Temporal relation inference network for multi-modal speech emotion recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology,2022,32(9): 6472-6485.
[11]LI C,BAO Z,LI L,et al. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition[J]. Information Processing and Management,2020,57(3): 102185.
[12]LU C,ZONG Y,TANG C,et al. Implicitly aligning joint distributions for cross-corpus speech emotion recognition[J]. Electronics,2022,11(17): 2745.
[13]LIU N,ZHANG B,LIU B,et al. Transfer subspace learning for unsupervised cross-corpus speech emotion recognition[J]. IEEE Access,2021(9): 95925-95937.
[14]LI S,SONG P,ZHANG W. Transferable discriminant linear regression for cross-corpus speech emotion recognition[J]. Applied Acoustics,2022(197): 108919.
[15]SONG P,OU S,DU Z,et al. Learning corpus-invariant discriminant feature representations for speech emotion recognition[J]. IEICE Transactions on Information and Systems,2017,100(5): 1136-1139.
[16]ZHANG W,SONG P. Transfer sparse discriminant subspace learning for cross-corpus speech emotion recognition[J]. IEEE/ACM Transactions on Audio Speech and Language Processing,2019(28): 307-318.
[17]OCQUAYE E N N,MAO Q,SONG H,et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition[J]. IEEE Access,2019(7): 93847-93857.
[18]SONG P,ZHENG W,OU S,et al. Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization[J]. Speech Communication,2016(83): 34-41.
[19]FU C,LIU C,ISHI C T,et al. Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention[J]. Sensors,2020,20(17): 4894. DOI: 10.3390/s20174894
[20]LIU D,CHEN L,WANG Z,et al. Speech expression multimodal emotion recognition based on deep belief network[J]. Journal of Grid Computing,2021,19(2): 22. DOI: 10.1007/s10723-021-09564-0
[21]CHENG D L,ZHANG D W,CHEN Y X. A survey of multimodal emotion recognition[J]. Journal of Southwest Minzu University (Natural Science Edition),2022(4): 048.
[22]GUO L,WANG L,DANG J,et al. Emotion recognition with multimodal transformer fusion framework based on acoustic and lexical information[J]. IEEE Multimedia,2022,29(2): 94-103. DOI: 10.1109/MMUL.2022.3161411
[23]ZOU S H,HUANG X,SHEN X D,et al. Improving multimodal fusion with main modal transformer for emotion recognition in conversation[J]. Knowledge-Based Systems,2022(258): 109978.
[24]CAO X,JIA M S,RU J W,et al. Cross-corpus speech emotion recognition using subspace learning and domain adaption[J]. EURASIP Journal on Audio, Speech, and Music Processing,2022(32): 00264.
[25]PHAN D A,MATSUMOTO Y,SHINDO H. Autoencoder for semi-supervised multiple emotion detection of conversation transcripts[J]. IEEE Transactions on Affective Computing,2018,12(3): 682-691.
[26]DENG J,ZHANG Z,EYBEN F,et al. Autoencoder-based unsupervised domain adaptation for speech emotion recognition[J]. IEEE Signal Processing Letters,2014,21(9): 1068-1072. DOI: 10.1109/LSP.2014.2324759
[27]SONG P,OU S,DU Z,et al. Learning corpus-invariant discriminant feature representations for speech emotion recognition[J]. Speech Communication,2018(99): 1136-1139.
[28]CHEN X,ZHOU X,LU C,et al. Target-adapted subspace learning for cross-corpus speech emotion recognition[J]. IEICE Transactions on Information and Systems,2019,102(12): 80-89.
[29]ZHEN L,HU P,PENG X,et al. Deep multimodal transfer learning for cross-modal retrieval[J]. IEEE Transactions on Neural Networks and Learning Systems,2020,33(2): 798-810.
[30]MAO Q,XU G,XUE W,et al. Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition[J]. Speech Communication,2017(93): 1-10.
[31]LU C,TANG C,ZHANG J,et al. Progressively discriminative transfer network for cross-corpus speech emotion recognition[J]. Entropy,2022,24(8): 1046.
[32]HO N H,YANG H J,KIM S H,et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access,2020(8): 61672-61686.