Bimodal Cross-Corpus Speech Emotion Recognition
Abstract: In speech emotion recognition (SER), bimodal cross-corpus recognition has received little attention: a heterogeneity gap exists between different modalities, most cross-corpus SER uses only the audio modality, and methods that focus solely on reducing the distribution gap between corpora tend to discard emotion-discriminative features. These issues were addressed simultaneously. The YouTube dataset was selected as source data and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) as target data. The openSMILE toolkit was used to extract speech features from both source and target data, and the extracted features were fed into a convolutional neural network (CNN) and a bidirectional long short-term memory network (BLSTM) to obtain higher-level speech features; the text modality consisted of the transcripts of the speech signals. First, the text was vectorized by Bidirectional Encoder Representations from Transformers (BERT) and text features were extracted by a BLSTM; a modality-invariance loss was then designed to form a common representation space for the two modalities. To solve the cross-corpus SER problem, a common subspace of source and target data was learned by jointly optimizing linear discriminant analysis (LDA), maximum mean discrepancy (MMD), graph embedding (GE), and label smoothing regularization (LSR). To preserve emotion-discriminative features, an emotion-aware center loss was combined with MMD + GE + LDA + LSR. An SVM classifier performed the final emotion classification on the transferred common subspace. Experimental results on IEMOCAP showed that the proposed method outperformed other state-of-the-art cross-corpus and bimodal SER methods.
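The cross-corpus part of the approach hinges on minimizing the maximum mean discrepancy (MMD) between source-corpus and target-corpus feature distributions. The sketch below is a minimal, self-contained illustration of a squared-MMD term with an RBF kernel; the function names, the median bandwidth heuristic, and the 1582-dimensional toy features are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an RBF-kernel MMD^2 term for cross-corpus alignment.
# Assumes utterance-level feature matrices xs (source) and xt (target),
# one row per sample; all names and sizes are illustrative.
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(xs, xt, gamma=None):
    """Squared maximum mean discrepancy between source and target features."""
    if gamma is None:
        # Median heuristic for the kernel bandwidth (an assumption, not from the paper).
        all_x = np.vstack([xs, xt])
        d = np.sqrt(((all_x[:, None, :] - all_x[None, :, :]) ** 2).sum(-1))
        gamma = 1.0 / (2.0 * np.median(d[d > 0]) ** 2)
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# Toy usage with random 1582-dimensional openSMILE-style feature vectors.
rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 1582))   # source-corpus features (e.g., YouTube data)
xt = rng.normal(size=(32, 1582))   # target-corpus features (e.g., IEMOCAP)
print(f"MMD^2 between source and target: {mmd2(xs, xt):.4f}")
```

In the paper's joint objective this term is combined with LDA, GE, LSR, and the emotion-aware center loss; the snippet only shows how the distribution-gap term itself could be evaluated.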
Figure 1. Structure of the BMTDAFSL method
Table 1. Experiment parameters

Parameter                                      Value
Dropout                                        0.5
Learning rate                                  0.001
Batch size                                     64
Number of training epochs                      10
BLSTM hidden-layer nodes                       128
Heads in the multi-head attention mechanism    8
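For illustration only, the sketch below shows how the hyperparameters in Table 1 (BLSTM hidden size 128, 8 attention heads, dropout 0.5, learning rate 0.001, batch size 64) might instantiate a single BLSTM branch with multi-head self-attention in PyTorch; the class name, the 1582-dimensional input, the mean pooling, and the four-class output are assumptions rather than the paper's actual architecture.

```python
# Hypothetical single-branch configuration using the Table 1 hyperparameters.
import torch
import torch.nn as nn

class BLSTMAttentionBranch(nn.Module):
    def __init__(self, input_dim=1582, hidden=128, heads=8, n_classes=4, p_drop=0.5):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        # Bidirectional output is 2 * hidden = 256 dimensional.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, input_dim)
        h, _ = self.blstm(x)                   # (batch, time, 256)
        a, _ = self.attn(h, h, h)              # self-attention over time steps
        pooled = self.dropout(a.mean(dim=1))   # average pooling over time
        return self.classifier(pooled)

model = BLSTMAttentionBranch()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from Table 1
batch = torch.randn(64, 100, 1582)             # batch size 64, 100 frames per utterance
logits = model(batch)
print(logits.shape)                            # torch.Size([64, 4])
```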
Table 2. Comparison with previous work on bimodal SER and with our method using only the audio or text modality
Method                   Accuracy/%
Ref. [3]                 75.23
Ref. [4]                 71.86
Ref. [26]                78.25
Ours (audio only)        76.74
Ours (text only)         80.69
Ours (audio + text)      87.03
Table 3. Comparison with previous work on cross-corpus SER
Method       Accuracy/%
Ref. [13]    69.23
Ref. [16]    72.56
Ref. [14]    72.84
Ours         87.03

References
[1]KOROMILAS P,GIANNAKOPOULOS T. Deep multimodal emotion recognition on human speech: a review[J]. Applied Sciences,2021,11(17): 7962.
[2]WEN H,YOU S,FU Y. Cross-modal dynamic convolution for multi-modal emotion recognition[J]. Journal of Visual Communication and Image Representation,2021(78): 103178.
[3]SINGH P,SRIVASTAVA R,RANA K P S,et al. A multimodal hierarchical approach to speech emotion recognition from audio and text[J]. Knowledge-Based Systems,2021(229): 107316.
[4]CAI L,HU Y,DONG J,et al. Audio-textual emotion recognition based on improved neural networks[J]. Mathematical Problems in Engineering,2019(6): 1-9.
[5]WANG X S,CHEN X,CAO C. Human emotion recognition by optimally fusing facial expression and speech feature[J]. Signal Processing: Image Communication,2020(84): 115831.
[6]WANG M,HUANG Z,LI Y,et al. Maximum weight multi-modal information fusion algorithm of electroencephalographs and face images for emotion recognition[J]. Computers and Electrical Engineering,2021(94): 107319.
[7]ZHANG H,HUANG H,HAN H. A Novel heterogeneous parallel convolution Bi-LSTM for speech emotion recognition[J]. Applied Sciences,2021,11(21): 9897.
[8]CHEN G H, ZENG X P. Multi-modal emotion recognition by fusing correlation features of speech-visual[J]. IEEE Signal Processing Letters,2021(28): 533-537.
[9]ZHANG S,CHEN M,CHEN J,et al. Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition[J]. Knowledge-Based Systems,2021(229): 107340.
[10]DONG G N,PUN C M,ZHANG Z. Temporal relation inference network for multi-modal speech emotion recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology,2022,32(9): 6472-6485.
[11]LI C,BAO Z,LI L,et al. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition[J]. Information Processing and Management,2020,57(3): 102185.
[12]LU C,ZONG Y,TANG C,et al. Implicitly aligning joint distributions for cross-corpus speech emotion recognition[J]. Electronics,2022,11(17): 2745.
[13]LIU N,ZHANG B,LIU B,et al. Transfer subspace learning for unsupervised cross-corpus speech emotion recognition[J]. IEEE Access,2021(9): 95925-95937.
[14]LI S,SONG P,ZHANG W. Transferable discriminant linear regression for cross-corpus speech emotion recognition[J]. Applied Acoustics,2022(197): 108919.
[15]SONG P,OU S,DU Z,et al. Learning corpus-invariant discriminant feature representations for speech emotion recognition[J]. IEICE Transactions on Information and Systems,2017,100(5): 1136-1139.
[16]ZHANG W,SONG P. Transfer sparse discriminant subspace learning for cross-corpus speech emotion recognition[J]. IEEE/ACM Transactions on Audio Speech and Language Processing,2019(28): 307-318.
[17]OCQUAYE E N N,MAO Q,SONG H,et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition[J]. IEEE Access,2019(7): 93847-93857.
[18]SONG P,ZHENG W,OU S,et al. Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization[J]. Speech Communication,2016(83): 34-41.
[19]FU C,LIU C,ISHI C T,et al. Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention[J]. Sensors,2020,20(17): 4894. DOI: 10.3390/s20174894
[20]LIU D,CHEN L,WANG Z,et al. Speech expression multimodal emotion recognition based on deep belief network[J]. Journal of Grid Computing,2021,19(2): 22. DOI: 10.1007/s10723-021-09564-0
[21]CHENG D L,ZHANG D W,CHEN Y X. A survey of multimodal emotion recognition[J]. Journal of Southwest Minzu University (Natural Science Edition),2022(4): 048.
[22]GUO L,WANG L,DANG J,et al. Emotion recognition with multimodal transformer fusion framework based on acoustic and lexical information[J]. IEEE Multimedia,2022,29(2): 94-103. DOI: 10.1109/MMUL.2022.3161411
[23]ZOU S H,HUANG X,SHEN X D,et al. Improving multimodal fusion with main modal transformer for emotion recognition in conversation[J]. Knowledge-Based Systems,2022(258): 109978.
[24]CAO X,JIA M S,RU J W,et al. Cross-corpus speech emotion recognition using subspace learning and domain adaption[J]. EURASIP Journal on Audio, Speech, and Music Processing,2022(32): 00264.
[25]PHAN D A,MATSUMOTO Y,SHINDO H. Autoencoder for semi-supervised multiple emotion detection of conversation transcripts[J]. IEEE Transactions on Affective Computing,2018,12(3): 682-691.
[26]DENG J,ZHANG Z,EYBEN F,et al. Autoencoder-based unsupervised domain adaptation for speech emotion recognition[J]. IEEE Signal Processing Letters,2014,21(9): 1068-1072. DOI: 10.1109/LSP.2014.2324759
[27]SONG P,OU S,DU Z,et al. Learning corpus-invariant discriminant feature representations for speech emotion recognition[J]. Speech Communication,2018(99): 1136-1139.
[28]CHEN X,ZHOU X,LU C,et al. Target-adapted subspace learning for cross-corpus speech emotion recognition[J]. IEICE Transactions on Information and Systems,2019,102(12): 80-89.
[29]ZHEN L,HU P,PENG X,et al. Deep multimodal transfer learning for cross-modal retrieval[J]. IEEE Transactions on Neural Networks and Learning Systems,2020,33(2): 798-810.
[30]MAO Q,XU G,XUE W,et al. Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition[J]. Speech Communication,2017(93): 1-10.
[31]LU C,TANG C,ZHANG J,et al. Progressively discriminative transfer network for cross-corpus speech emotion recognition[J]. Entropy,2022,24(8): 1046.
[32]HO N H,YANG H J,KIM S H,et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access,2020(8): 61672-61686.