With the advent of the big-data era, intelligent technologies such as artificial intelligence, the Internet of Things, and cloud computing have been built into many devices, enabling them to interact intelligently and bringing greater convenience to people. Speech recognition is one of the intelligent interaction technologies changing the way people work and live. Among Chinese dialects, Cantonese is second only to Mandarin, the official language of mainland China, in the number of its speakers, and it is the only Chinese dialect besides Mandarin with a complete writing system. According to statistics, Cantonese is spoken by more than 120 million people worldwide. Interacting in Cantonese can enhance the interaction experience, which is an important direction in the development of today's smart technology. Research on Cantonese speech recognition is therefore of great significance both to the development of intelligent technology and to Cantonese users. This paper analyzes the two current mainstream neural network modules and carries out comparative experiments. The main work of this paper is as follows:
(1) Two CNN-based acoustic models for speech recognition are optimized. The optimization borrows from the computer-vision (CV) field: two-dimensional convolution kernels replace the traditional acoustic feature-extraction tools and extract features directly from the speech spectrogram, which reduces the information loss introduced by traditional filter banks that rely on experience-based, hand-crafted design.
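The idea can be illustrated with a minimal sketch (illustrative only, not the thesis code): a small learned 2D kernel slides over the time-frequency plane of a spectrogram, so local patterns are extracted directly from the data rather than through a fixed filter bank.

```python
# Minimal valid-mode 2D cross-correlation over a spectrogram,
# showing how a 2D convolution kernel reads local time-frequency
# patches directly instead of applying a hand-designed filter bank.

def conv2d_valid(spectrogram, kernel):
    """spectrogram: rows = frequency bins, columns = time frames;
    kernel: small 2D weight matrix, e.g. 3x3."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(spectrogram), len(spectrogram[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += spectrogram[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A tiny 4x4 "spectrogram" and a Laplacian-like 3x3 kernel.
spec = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [4, 5, 6, 7]]
kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]
features = conv2d_valid(spec, kernel)
```

In a real model the kernel weights are learned by backpropagation; here they are fixed only to make the sliding-window computation concrete.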
(2) The convolutional-layer module is adjusted. For the last three layers of the five-layer convolutional module, additional convolutional layers are added to improve feature extraction. By placing a bidirectional gated recurrent unit (BiGRU) network and a bidirectional long short-term memory (BiLSTM) network after the multi-layer convolution and pooling module, acoustic models based on the DCNN-BiGRU-CTC and DCNN-BiLSTM-CTC frameworks are constructed, and they generate prediction sequences on versions 5.1 and 6.1 of the Common Voice zh-HK dataset.
(3) Considering the characteristics of Cantonese, Cantonese pinyin with tones is chosen as the modeling unit. Because the Cantonese training corpus is very limited, a prior-probability model based on the Hidden Markov Model serves as the sequence-decoding module of the language model. Experiments show that Cantonese pronunciation performs better as the acoustic modeling unit. In addition, Cantonese word-frequency dictionaries are used as decoding aids to handle the problem of homophone substitution in sequence decoding.
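A toy sketch of this decoding step follows; the syllables, candidate characters, and probabilities are invented for illustration and are not taken from the thesis. Given toned-pinyin syllables, an HMM-style Viterbi search combines frequency (emission) priors with transition priors to choose among homophonous characters.

```python
# Hypothetical homophone table: toned syllable -> {character: frequency prior}.
EMIT = {
    "si6": {"是": 0.6, "事": 0.3, "士": 0.1},
    "gon2": {"趕": 0.7, "稈": 0.3},
}
# Hypothetical bigram transition priors; unseen pairs get a small floor.
TRANS = {("是", "趕"): 0.05, ("事", "趕"): 0.4}
FLOOR = 0.01

def viterbi_decode(syllables):
    """Return the most probable character sequence for the syllables."""
    # paths: list of (probability, character sequence so far)
    paths = [(p, [ch]) for ch, p in EMIT[syllables[0]].items()]
    for syl in syllables[1:]:
        new_paths = []
        for ch, emit_p in EMIT[syl].items():
            # Best predecessor path for this candidate character.
            best = max(
                (p * TRANS.get((seq[-1], ch), FLOOR) * emit_p, seq + [ch])
                for p, seq in paths
            )
            new_paths.append(best)
        paths = new_paths
    return max(paths)[1]

result = viterbi_decode(["si6", "gon2"])
```

Here the transition prior outweighs the raw frequency prior, so the decoder picks "事" over the more frequent "是" before "趕", which is exactly how homophone substitution gets resolved.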
(4) By cascading the acoustic model with the language model, cascaded models based on the CTC algorithm are obtained. Experiments on the two cascaded models and analysis of the results yield insights for optimizing future speech recognition models and provide an experimental basis for research on applying Cantonese speech recognition to conference scenarios.
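The hand-off point in such a cascade is the standard CTC collapse rule: per-frame best labels from the acoustic model are merged (repeats collapsed, blanks removed) into a syllable sequence, which is then passed to the language-model decoder. A minimal sketch, with illustrative toned-pinyin labels rather than the thesis's real symbol inventory:

```python
# CTC greedy (best-path) collapse: drop blanks and merge consecutive
# repeats; a blank between two identical labels keeps them distinct.

BLANK = "_"

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path label sequence CTC-style."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

frames = ["_", "nei5", "nei5", "_", "hou2", "hou2", "hou2", "_"]
prediction = ctc_greedy_collapse(frames)  # ["nei5", "hou2"]
```

Greedy collapse is the simplest option; beam-search variants score multiple collapse hypotheses against the language model instead of committing to the single best path.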
1. Wang, H., Pan, J., & Liu, C. (2018). Progress and prospects of speech recognition technology. Telecommunications Science.
2. Lü, K. (2020). Research on end-to-end speech recognition algorithms incorporating language models. Unpublished master's thesis, Jilin University, Changchun.
3. Zhou, Z. (2016). Machine learning. Beijing: Tsinghua University Press.
4. Hong Kong Language Society (2003). 粵語審音配字庫 (Cantonese pronunciation and character database). http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/
5. Zhang, A., Li, M., Lipton, Z. C., & Smola, A. J. (2019). Dive into deep learning. Beijing: China Industry and Information Publishing Group.
6. Deng, L., & Yu, D. (2016). Analyzing deep learning: Speech recognition in practice. Beijing: China Industry and Information Publishing Group.
7. Zheng, D. (1997). A dictionary of Hong Kong Cantonese. Nanjing: Jiangsu Education Press.
8. Zheng, Z., Liang, B., & Gu, S. (2018). TensorFlow: Google deep learning framework in practice (2nd ed.). Beijing: Publishing House of Electronics Industry.
9. Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). Common Voice: A massively-multilingual speech corpus. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 4211-4215.
10. Atal, B. S., & Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction of the speech wave. Journal of the Acoustical Society of America, 50, 637-655.
11. Chan, Y. C. (2005). Using duration information in HMM-based automatic speech recognition. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
12. Chan, Y. C., Cao, H., Ching, P. C., & Lee, T. (2009). Automatic speech recognition of Cantonese-English code-mixing utterances. Computational Linguistics and Chinese Language Processing, 14(3), 281-304.
13. Cheng, Y. H. (1991). An efficient tone classifier for speech recognition of Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
14.Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
15. Chow, K. F. (1998). Connected speech recognition system for Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
16.Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
17.Devijver, P. A. (1985). Baum's forward-backward algorithm revisited. Pattern Recognition Letters, 3(6), 369-373.
18. Yu, D., & Deng, L. (2011). Deep learning and its applications to signal and information processing. IEEE Signal Processing Magazine, 28(1), 145-154.
19.Feng, S., Lee, T. (2018) Improving Cross-Lingual Knowledge Transferability Using Multilingual TDNN-BLSTM with Language-Dependent Pre-Final Layer. Proc. Interspeech 2018, 2439-2443, DOI: 10.21437/Interspeech.2018-1182.
20. Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41-62.
21.Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414-2423).
22.Gokul Krishnan, Rakesh Joshi, Timothy O’Connor, Filiberto Pla, and Bahram Javidi. Human gesture recognition under degraded environments using 3D-integral imaging and deep learning. Opt. Express 28, 19711-19725 (2020)
23. Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 369-376.
24.Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
25. Hu, S., & Liu, S. (2019). The CUHK dysarthric speech recognition systems for English and Cantonese. INTERSPEECH 2019 Show & Tell Contribution, September 15-19, 2019, Graz, Austria.
26.Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
27. Lai, W. M. (1987). Efficient algorithms for speech recognition of Cantonese. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
28. Lee, T. (1996). Automatic recognition of isolated Cantonese syllables using neural networks. Ph.D. dissertation, The Chinese University of Hong Kong, Hong Kong.
29.Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
30. Mohamed, A., & Dahl, G. (2009). Deep belief networks for phone recognition.
31. Ng, Y. P. (1997). Automatic recognition of continuous Cantonese speech. Ph.D. dissertation, The Chinese University of Hong Kong, Hong Kong.
32.O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
33. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4-16.
34. Zhang, S., Liu, C., et al. (2015). Feedforward sequential memory networks: A new structure to learn long-term dependency.
35. Zhang, S., & Lei, M. (2018). Deep-FSMN for large vocabulary continuous speech recognition. ICASSP 2018.
36.Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
37.Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
38. Sung, Y. H., Jansche, M., & Moreno, P. J. (2011). Deploying Google Search by Voice in Cantonese. INTERSPEECH 2011, Florence, Italy.
39. Hori, T., et al. (2017). Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. INTERSPEECH 2017.
40. Tsik, C. W. (1994). Large vocabulary Cantonese speech recognition using neural networks. M.Phil. thesis, The Chinese University of Hong Kong, Hong Kong.
41. Yuan, W., & Black, A. W. (2018). Generating Mandarin and Cantonese F0 contours with decision trees and BLSTMs. arXiv preprint, submitted 4 Jul 2018, from https://arxiv.org/pdf/1807.01682.pdf
42. Chan, W., Jaitly, N., et al. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP 2016.
43. Zaremba, W., Sutskever, I., & Vinyals, O. (2015). Recurrent neural network regularization. ICLR.
44.Wong, T. C. T., Li, W. Y. C., Chiu, W. H., Lu, Q., Li, M., Xiong, D., Yu, S., & Ng, V. T. Y. (2016). Syllable based DNN-HMM Cantonese speech to text system. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 3856-3862). European Language Resources Association (ELRA).
45. Zhou, X., Li, J., & Zhou, X. (2019). Cascaded CNN-resBiLSTM-CTC: An end-to-end acoustic model for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
46. Yuan, Z., Liu, Z., Li, J., & Zhou, X. (2019). An improved hybrid CTC-attention model for speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.