With the advent of the big-data era, intelligent technologies such as artificial intelligence, the Internet of Things, and cloud computing have been built into many devices, enabling them to interact intelligently and bringing greater convenience to people. Speech recognition is one of the intelligent interaction technologies changing the way people work and live. Among Chinese dialects, Cantonese is second only to Mandarin, the official language of mainland China, in the number of its speakers, and it is the only Chinese dialect besides Mandarin with a complete writing system. According to statistics, Cantonese is spoken by more than 120 million people worldwide. Interacting in Cantonese can enhance the interaction experience, which is an important direction in the development of today's smart technology. Research on Cantonese speech recognition is therefore of great significance both to the development of intelligent technology and to Cantonese users. This paper analyzes the two current mainstream neural network modules and carries out comparative experiments. The main work of this paper is as follows:
(1) Two CNN-based acoustic models for speech recognition are optimized. The optimization borrows from the computer-vision (CV) field: two-dimensional convolution kernels replace the traditional acoustic feature-extraction tools and extract features directly from the speech spectrogram, which reduces the information loss introduced by traditional filter banks that rely on experience-based, hand-crafted design.
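The idea can be illustrated with a minimal sketch (illustrative only, not the thesis code): a small learned 2D kernel slides over the time-frequency plane of a spectrogram, so local patterns are extracted directly from the data rather than through a fixed filter bank.

```python
# Minimal valid-mode 2D cross-correlation over a spectrogram,
# showing how a 2D convolution kernel reads local time-frequency
# patches directly instead of applying a hand-designed filter bank.

def conv2d_valid(spectrogram, kernel):
    """spectrogram: rows = frequency bins, columns = time frames;
    kernel: small 2D weight matrix, e.g. 3x3."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(spectrogram), len(spectrogram[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += spectrogram[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A tiny 4x4 "spectrogram" and a Laplacian-like 3x3 kernel.
spec = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [4, 5, 6, 7]]
kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]
features = conv2d_valid(spec, kernel)
```

In a real model the kernel weights are learned by backpropagation; here they are fixed only to make the sliding-window computation concrete.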
(2) The convolutional-layer module is adjusted. For the last three layers of the five-layer convolutional module, additional convolutional layers are added to improve feature extraction. By placing a bidirectional gated recurrent unit (BiGRU) network and a bidirectional long short-term memory (BiLSTM) network after the multi-layer convolution and pooling module, acoustic models based on the DCNN-BiGRU-CTC and DCNN-BiLSTM-CTC frameworks are constructed, and they generate prediction sequences on versions 5.1 and 6.1 of the Common Voice zh-HK dataset.
(3) Considering the characteristics of Cantonese, Cantonese pinyin with tones is chosen as the modeling unit. Because the Cantonese training corpus is very limited, a prior-probability model based on the Hidden Markov Model serves as the sequence-decoding module of the language model. Experiments show that Cantonese pronunciation performs better as the acoustic modeling unit. In addition, Cantonese word-frequency dictionaries are used as decoding aids to handle the problem of homophone substitution in sequence decoding.
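A toy sketch of this decoding step follows; the syllables, candidate characters, and probabilities are invented for illustration and are not taken from the thesis. Given toned-pinyin syllables, an HMM-style Viterbi search combines frequency (emission) priors with transition priors to choose among homophonous characters.

```python
# Hypothetical homophone table: toned syllable -> {character: frequency prior}.
EMIT = {
    "si6": {"是": 0.6, "事": 0.3, "士": 0.1},
    "gon2": {"趕": 0.7, "稈": 0.3},
}
# Hypothetical bigram transition priors; unseen pairs get a small floor.
TRANS = {("是", "趕"): 0.05, ("事", "趕"): 0.4}
FLOOR = 0.01

def viterbi_decode(syllables):
    """Return the most probable character sequence for the syllables."""
    # paths: list of (probability, character sequence so far)
    paths = [(p, [ch]) for ch, p in EMIT[syllables[0]].items()]
    for syl in syllables[1:]:
        new_paths = []
        for ch, emit_p in EMIT[syl].items():
            # Best predecessor path for this candidate character.
            best = max(
                (p * TRANS.get((seq[-1], ch), FLOOR) * emit_p, seq + [ch])
                for p, seq in paths
            )
            new_paths.append(best)
        paths = new_paths
    return max(paths)[1]

result = viterbi_decode(["si6", "gon2"])
```

Here the transition prior outweighs the raw frequency prior, so the decoder picks "事" over the more frequent "是" before "趕", which is exactly how homophone substitution gets resolved.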
(4) By cascading the acoustic model with the language model, cascaded models based on the CTC algorithm are obtained. Experiments on the two cascaded models and analysis of the results yield insights for optimizing future speech recognition models and provide an experimental basis for research on applying Cantonese speech recognition to conference scenarios.
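The hand-off point in such a cascade is the standard CTC collapse rule: per-frame best labels from the acoustic model are merged (repeats collapsed, blanks removed) into a syllable sequence, which is then passed to the language-model decoder. A minimal sketch, with illustrative toned-pinyin labels rather than the thesis's real symbol inventory:

```python
# CTC greedy (best-path) collapse: drop blanks and merge consecutive
# repeats; a blank between two identical labels keeps them distinct.

BLANK = "_"

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path label sequence CTC-style."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

frames = ["_", "nei5", "nei5", "_", "hou2", "hou2", "hou2", "_"]
prediction = ctc_greedy_collapse(frames)  # ["nei5", "hou2"]
```

Greedy collapse is the simplest option; beam-search variants score multiple collapse hypotheses against the language model instead of committing to the single best path.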
1. Wang, H., Pan, J., & Liu, C. (2018). Progress and prospects of speech recognition technology. Telecommunications Science.
2. Lü, K. (2020). Research on end-to-end speech recognition algorithms incorporating language models. Unpublished master's thesis, Jilin University, Changchun.
3. Zhou, Z. (2016). Machine learning. Beijing: Tsinghua University Press.
4. Hong Kong Language Society (2003). 粵語審音配字庫 (Cantonese pronunciation and character database). http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/
5. Zhang, A., Li, M., Lipton, Z. C., & Smola, A. J. (2019). Dive into deep learning. Beijing: China Industry and Information Publishing Group.
6. Deng, L., & Yu, D. (2016). Analyzing deep learning: Speech recognition in practice. Beijing: China Industry and Information Publishing Group.
7. Zheng, D. (1997). A dictionary of Hong Kong Cantonese. Nanjing: Jiangsu Education Press.
8. Zheng, Z., Liang, B., & Gu, S. (2018). TensorFlow: Google deep learning framework in practice (2nd ed.). Beijing: Publishing House of Electronics Industry.
9. Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). Common Voice: A massively-multilingual speech corpus. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 4211-4215.
10. Atal, B. S., & Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction of the speech wave. Journal of the Acoustical Society of America, 50, 637-655.
11. Chan, Y. C. (2005). Using duration information in HMM-based automatic speech recognition. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
12. Chan, Y. C., Cao, H., Ching, P. C., & Lee, T. (2009). Automatic speech recognition of Cantonese-English code-mixing utterances. Computational Linguistics and Chinese Language Processing, 14(3), 281-304.
13. Cheng, Y. H. (1991). An efficient tone classifier for speech recognition of Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
14.Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
15. Chow, K. F. (1998). Connected speech recognition system for Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
16.Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
17.Devijver, P. A. (1985). Baum's forward-backward algorithm revisited. Pattern Recognition Letters, 3(6), 369-373.
18. Yu, D., & Deng, L. (2011). Deep learning and its applications to signal and information processing. IEEE Signal Processing Magazine, 28(1), 145-154.
19.Feng, S., Lee, T. (2018) Improving Cross-Lingual Knowledge Transferability Using Multilingual TDNN-BLSTM with Language-Dependent Pre-Final Layer. Proc. Interspeech 2018, 2439-2443, DOI: 10.21437/Interspeech.2018-1182.
20. Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41-62.
21.Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414-2423).
22.Gokul Krishnan, Rakesh Joshi, Timothy O’Connor, Filiberto Pla, and Bahram Javidi. Human gesture recognition under degraded environments using 3D-integral imaging and deep learning. Opt. Express 28, 19711-19725 (2020)
23. Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 369-376.
24.Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
25. Hu, S., & Liu, S. (2019). The CUHK dysarthric speech recognition systems for English and Cantonese. INTERSPEECH 2019 Show & Tell Contribution, September 15-19, 2019, Graz, Austria.
26.Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
27. Lai, W. M. (1987). Efficient algorithms for speech recognition of Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
28. Lee, T. (1996). Automatic recognition of isolated Cantonese syllables using neural networks. Ph.D. dissertation, The Chinese University of Hong Kong, Hong Kong.
29.Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
30. Mohamed, A., & Dahl, G. (2009). Deep belief networks for phone recognition.
31. Ng, Y. P. (1997). Automatic recognition of continuous Cantonese speech. Ph.D. dissertation, The Chinese University of Hong Kong, Hong Kong.
32.O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
33. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4-16.
34. Zhang, S., Liu, C., et al. (2015). Feedforward sequential memory networks: A new structure to learn long-term dependency.
35. Zhang, S., & Lei, M. (2018). Deep-FSMN for large vocabulary continuous speech recognition. ICASSP 2018.
36.Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
37.Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
38. Sung, Y. H., Jansche, M., & Moreno, P. J. (2011). Deploying Google Search by Voice in Cantonese. INTERSPEECH 2011, Florence, Italy.
39. Hori, T., et al. (2017). Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. INTERSPEECH 2017.
40. Tsik, C. W. (1994). Large vocabulary Cantonese speech recognition using neural networks. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
41. Yuan, W., & Black, A. W. (2018). Generating Mandarin and Cantonese F0 contours with decision trees and BLSTMs. arXiv preprint, submitted 4 Jul 2018, from https://arxiv.org/pdf/1807.01682.pdf
42. Chan, W., Jaitly, N., et al. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP 2016.
43. Zaremba, W., Sutskever, I., & Vinyals, O. (2015). Recurrent neural network regularization. ICLR.
44.Wong, T. C. T., Li, W. Y. C., Chiu, W. H., Lu, Q., Li, M., Xiong, D., Yu, S., & Ng, V. T. Y. (2016). Syllable based DNN-HMM Cantonese speech to text system. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 3856-3862). European Language Resources Association (ELRA).
45. Zhou, X., Li, J., & Zhou, X. (2019). Cascaded CNN-resBiLSTM-CTC: An end-to-end acoustic model for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
46. Yuan, Z., Liu, Z., Li, J., & Zhou, X. (2019). An improved hybrid CTC-attention model for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.