本校學位論文庫
CITYU Theses & Dissertations
論文詳情
胡文瀚
王淳
數據科學學院
數據科學碩士學位課程(中文學制)
碩士
2024
基於Bert 和 BiLSTM 的特定域多標籤文本分類模型研究
Research on domain-specific multi-label text classification model based on Bert and BiLSTM
自然語言處理;Bert;焦點損失;多標籤文本分類;校園牆
Natural Language Processing; Bert; Focal Loss; Multi-Label Text Classification; Campus Wall (Bulletin Board System)
本研究致力於探索和優化普通高校校園牆文本帖子的多標籤分類問題,旨在提升校園牆內容管理和用戶體驗的效率與質量。面對校園牆運營中的挑戰,例如運營負擔大、帖子數據雜亂、Bert模型魯棒性不足以及多標籤分類的複雜性,我們構建了一個包含23.4萬條文本的大規模校園牆文本數據集。為了增強模型的魯棒性並解決多標籤分類的難題,我們選擇了Bert預訓練模型,並對其進行了進一步的預訓練以提升性能。我們設計了深度學習模型來實現自動化標籤分類任務,採用Bert和BiLSTM作為特徵提取器。在損失函數的選擇上,我們對比了傳統的二元交叉熵和焦點損失兩種損失函數在不平衡數據集上的表現。
實驗結果表明,使用領域特定預訓練的Bert模型(bbc4s)在多標籤分類任務中表現更為優秀,這證實了領域預訓練在提升模型對特定類型文本理解能力方面的重要性;同時,焦點損失函數在處理類別不平衡問題時更為有效。本研究不僅為校園牆的自動化標籤分類任務提供了一套有效的解決方案,也為校園牆內容管理和信息檢索的優化提供了新的思路。
This study explores and optimizes multi-label classification of text posts on the campus walls of ordinary universities, aiming to improve the efficiency and quality of campus wall content management and the user experience. Facing challenges in campus wall operation, such as a heavy operational burden, messy post data, insufficient robustness of the Bert model, and the complexity of multi-label classification, we construct a large-scale campus wall text dataset containing 234,237 texts. To enhance the robustness of the model and address the difficulty of multi-label classification, we choose the Bert pre-trained model and further pre-train it to improve performance. We design a deep learning model for the automated label classification task, employing Bert and BiLSTM as feature extractors. For the loss function, we compare the performance of traditional binary cross-entropy and focal loss on the imbalanced dataset.
The experimental results show that the Bert model with domain-specific pre-training (bbc4s) performs better in the multi-label classification task, which confirms the importance of domain-specific pre-training for improving the model's ability to understand specific types of text, and that the focal loss function is more effective in dealing with class imbalance. This study not only provides an effective solution for the automated label classification task on campus walls, but also offers new ideas for optimizing campus wall content management and information retrieval.
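The record itself contains no code, but the "further pre-training" step described in the abstract corresponds to standard domain-adaptive masked-language-model (MLM) training. Below is a minimal sketch of how such continued pretraining could be reproduced with the Hugging Face transformers and datasets libraries; the checkpoint name bert-base-chinese, the file campus_wall.txt, and all hyperparameters are illustrative assumptions rather than values reported in the thesis (only the output name bbc4s is taken from the abstract).

```python
# Minimal sketch: continued masked-language-model (MLM) pretraining of a Chinese
# Bert checkpoint on domain text, matching the "further pre-training" step
# described in the abstract in spirit. Model name, data file, and hyperparameters
# are illustrative assumptions, not values reported in the thesis.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# One campus-wall post per line (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "campus_wall.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bbc4s", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=collator)
trainer.train()
model.save_pretrained("bbc4s")      # domain-adapted checkpoint (name from the abstract)
tokenizer.save_pretrained("bbc4s")
```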
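Likewise, the comparison described in the abstract (Bert plus BiLSTM feature extraction, binary cross-entropy versus focal loss on imbalanced labels) can be sketched in PyTorch as follows. This is not the thesis's actual implementation: the pooling strategy, hidden sizes, focal-loss parameters, and the number of labels are assumptions made for illustration.

```python
# Minimal PyTorch sketch of the model family compared in the thesis:
# Bert as encoder, a BiLSTM over the token features, and a sigmoid multi-label
# head trained with either binary cross-entropy or focal loss.
# All sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMClassifier(nn.Module):
    def __init__(self, num_labels, bert_name="bert-base-chinese", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Token-level Bert features -> BiLSTM -> pooled representation -> logits.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        pooled = lstm_out.mean(dim=1)      # simple mean pooling over tokens (assumed)
        return self.classifier(pooled)     # raw logits, one per label

class BinaryFocalLoss(nn.Module):
    """Focal loss applied independently to each label (multi-label setting)."""
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits, targets):
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")
        p_t = torch.exp(-bce)              # probability assigned to the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

# Usage sketch: 12 labels is an assumed number, not taken from the thesis.
model = BertBiLSTMClassifier(num_labels=12)
loss_fn = BinaryFocalLoss()                # or nn.BCEWithLogitsLoss() for the baseline
```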
2024
中文
63
致 謝 I
摘 要 II
Abstract III
圖目錄 VI
表目錄 VII
第一章 緒 論 8
1.1 研究背景及意義 8
1.2 國內外研究現狀 10
1.2.1 校園牆運營存在的問題 10
1.2.2 特定領域內的多標籤文本分類 11
1.3 研究問題 12
1.4 研究內容 12
1.5 創新性 13
1.6 論文組織結構 14
1.7 本章小結 14
第二章 相關工作理論與技術研究 15
2.1 多標籤文本分類 15
2.2 BiLSTM 網絡 16
2.3 Bert 模型 18
2.4 繼續預訓練 19
2.5 Focal Loss 損失函數 20
2.6 本章小結 21
第三章 數據集構建與實驗過程 22
3.1 實驗設計 22
3.2 數據集構建 22
3.2.1 數據來源細節 23
3.2.2 數據收集方法 25
3.2.3 數據集標注 26
3.2.3.1 初步數據預處理 26
3.2.3.2 文本數據類別劃分 29
3.2.3.3 數據標注方式 30
3.2.4 數據集的不足和可能面對的問題 32
3.3 實驗過程 33
3.3.1 實驗環境 33
3.3.2 最終數據預處理 33
3.3.3 探索性數據分析 35
3.3.3.1 數據集分佈特徵 35
3.3.3.2 觀察與分析 35
3.3.4 繼續預訓練過程與參數設置 36
3.3.4.1 任務選擇 36
3.3.4.2 MLM任務執行步驟 37
3.3.4.3 主要參數設置及其影響 37
3.3.5 對比實驗過程與模型構建 38
3.3.5.1 特徵提取器部分 40
3.3.5.2 預訓練模型部分 43
3.3.5.3 損失函數部分 45
3.3.5.4 主要參數設置及其影響 47
3.3.6 評價指標 48
3.4 本章小結 48
第四章 實驗結果比較與分析 50
4.1 預訓練函數的結果分析 50
4.2 特徵提取器的結果分析 51
4.3 不同損失函數的結果分析 53
4.4 本章小結 54
第五章 總結與展望 55
5.1 結論 55
5.1.1 研究的實用性 56
5.1.2 可改進方向與展望 56
參考文獻 58
作者簡歷 62
1.ZHANG Wenfeng, XI Xuefeng, CUI Zhiming, ZOU Yichen, LUAN Jinquan. Review and Prospect of Multi-Label Text Classification Research[J]. Computer Engineering and Applications, 2023, 59(18): 28-48.
2.REN Junfei, ZHU Tong, CHEN Wenliang. Self-training with partial labeling for multi-label text classification[J]. Journal of Tsinghua University (Science and Technology), 2024, 64(4): 679-687.
3.LIU Jie, TANG Hong, YANG Haolan, GAN Chenmin, PENG Jinzhi. Multi-label text classification method fused with attention mechanism[J]. Microelectronics & Computer, 2023, 40(12): 26-34. DOI: 10.19304/J.ISSN1000-7180.2022.0416.
4.Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
5.MA Mingyan, CHEN Wei, WU Lifa. CNN_BiLSTM Network Based Intrusion Detection Method[J]. Computer Engineering and Applications, 2022, 58(10): 116-124.
6.Sun Z, Anbarasan M, Praveen Kumar D. Design of online intelligent English teaching platform based on artificial intelligence techniques. Computational Intelligence. 2021;37:1166-80.
7.Chai Y, Li Z, Liu J, Chen L, Li F, Ji D, et al. Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach. 2024. p. 17727-35.
8.Gin BC, ten Cate O, O'Sullivan PS, Hauer KE, Boscardin C. Exploring how feedback reflects entrustment decisions using artificial intelligence. Med Educ. 2022;56:303-11.
9.Wang H, Xue L, Du W, Wang F, Li P, Chen L, et al. The effect of online investor sentiment on stock movements: an LSTM approach. Springer; 2021. p. 1-14.
10.Hajek P, Henriques R. Predicting M&A targets using news sentiment and topic detection. Technological Forecasting and Social Change. 2024;201:123270.
11.Wang MH, Chong KK-l, Lin Z, Yu X, Pan Y. An Explainable Artificial Intelligence-Based Robustness Optimization Approach for Age-Related Macular Degeneration Detection Based on Medical IOT Systems. Electronics. 2023;12:2697.
12.Wang MH, Xing L, Pan Y, Gu F, Fang J, Yu X, et al. AI-Based Advanced Approaches and Dry Eye Disease Detection Based on Multi-Source Evidence: Cases, Applications, Issues, and Future Directions. Big Data Mining and Analytics. 2024;7:445-84.
13.Chen Y, Li C, Wang H. Big Data and predictive analytics for business intelligence: a bibliographic study (2000–2021). Forecasting. 2022;4:767-86.
14.Wang M, Zhao Y, Wu Q, Chen G. A YOLO-based Method for Improper Behavior Predictions. 2023 IEEE International Conference on Contemporary Computing and Communications (InC4): IEEE; 2023. p. 1-4.
15.Kjell ONE, Kjell K, Garcia D, Sikström S. Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Psychological Methods. 2019;24:92.
16.Y. Zhang, X. Wang, X. Wang, S. Fan and D. Zhang, "Using question classification to model user intentions of different levels," 2009 IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, 2009, pp. 1153-1158, doi: 10.1109/ICSMC.2009.5345957.
17.羅晶. 校園輿情分析中的意見挖掘技術研究[D].東南大學,2015.
18.唐果,陳宏剛.基於BBS熱點主題發現的文本聚類方法[J].計算機工程,2010,36(07):79-81.
19.翁捷. 高校BBS熱點話題的挖掘與分析[D].安徽農業大學,2015.
20.蘇金樹,張博鋒,徐昕.基於機器學習的文本分類技術研究進展.軟體學報,2006,17(9):1848-1859.
21.Gururangan S, Marasović A, Swayamdipta S, et al. Don't stop pretraining: Adapt language models to domains and tasks[J]. arXiv preprint arXiv:2004.10964, 2020.
22.Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2980-2988.
23.Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
24.王進, 徐巍, 丁一, 等. 基於圖嵌入和區域注意力的多標籤文本分類[J]. Journal of Jiangsu University (Natural Science Edition), 2022, 43(3).
25.Zhang S, Zheng D, Hu X, et al. Bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 29th Pacific Asia conference on language, information and computation. 2015: 73-78.
26.Lipton Z C, Berkowitz J, Elkan C. A critical review of recurrent neural networks for sequence learning[J]. arXiv preprint arXiv:1506.00019, 2015.
27.Graves A. Long short-term memory[J]. Supervised sequence labelling with recurrent neural networks, 2012: 37-45.
28.Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2980-2988.
29.陳潔. BBS與中國早期互聯網技術文化研究(1991-2000)[D].山東大學,2023.DOI:10.27272/d.cnki.gshdu.2023.000671.
30.張春生.高校校園BBS:從公告板系統到校園門戶網站[J].青年研究,2005(10):16-22.
31.陳淑娟. 民主治校視野下大學生網絡話語權研究[D].浙江師範大學,2013.
32.沈菲飛. 高校BBS的網絡健康傳播研究[D].中國科學技術大學,2009.
33.張軍芳.BBS使用目的對大學生“自我表露”的影響——以“日月光華”為例[J].中國地質大學學報(社會科學版),2009,9(03):22-26.DOI:10.16493/j.cnki.42-1627/c.2009.03.013.
34.羅昌勤.高校BBS從危機管理向主動服務育人轉型的思考[J].學校黨建與思想教育(上半月),2007(10):46-48.
35.李政,何國強,孫浩哲等.網絡思政視閾下高校校園墻現象探析[J].辦公室業務,2023(19):82-84.
36.李萬平,方愛東.微時代高校主流話語權威的消解與重塑[J].北京化工大學學報(社會科學版),2019(02):101-107.
37.翁偉. 結合類屬特徵和標記關係的多標記分類研究[D].廈門大學,2022.DOI:10.27424/d.cnki.gxmdu.2019.001334.
38.肖琳, 陳博理, 黃鑫, 等. 基於標籤語義注意力的多標籤文本分類[J]. 軟體學報, 2020, 31(4): 1079-1089.
39.Toney A, Dunham J. Multi-label classification of scientific research documents across domains and languages[C]//Proceedings of the Third Workshop on Scholarly Document Processing. 2022: 105-114.
40.鄔鑫珂,孫俊,李志華.採用標籤組合與融合注意力的多標籤文本分類[J].電腦工程與應用,2023,59(06):125-133.
41.Turrisi R. Beyond original Research Articles Categorization via NLP[J]. arXiv preprint arXiv:2309.07020, 2023.
42.Zhu K, Wu J. Residual attention: A simple but effective method for multi-label recognition[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 184-193.
43.司志博文,李少博,單麗莉,等.基於增量預訓練和對抗訓練的文本匹配模型[J].電腦系統應用,2022,31(11):349-357.DOI:10.15888/j.cnki.csa.008778.
44.孫毅, 裘杭萍, 鄭雨, 等. 自然語言預訓練模型知識增強方法綜述[J]. 中文信息學報, 2021, 35(7): 10-29.
45.李永聰. 基於預訓練語言模型的毒性評論分類[D].華中科技大學,2022.DOI:10.27157/d.cnki.ghzku.2022.001968.
46.Yuan W, Liu P. reStructured Pre-training[J]. arXiv preprint arXiv:2206.11147, 2022.
47.Gupta K, Thérien B, Ibrahim A, et al. Continual Pre-Training of Large Language Models: How to (re) warm your model?[J]. arXiv preprint arXiv:2308.04014, 2023.
48.Smith L N. Cyclical focal loss[J]. arXiv preprint arXiv:2202.08978, 2022.
49.Charoenphakdee N, Vongkulbhisal J, Chairatanakul N, et al. On focal loss for class-posterior probability estimation: A theoretical perspective[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 5202-5211.
50.肖振久,孔祥旭,宗佳旭,等.自我調整聚焦損失的圖像目標檢測演算法[J].電腦工程與應用,2021,57(23):185-192.
51.朱翌民,郭茹燕,巨家驥,等.一種結合Focal Loss的不平衡數據集提升樹分類演算法[J].軟體導刊,2021,20(11):65-69.
52.董博,褚淑美,朱欣藝.高校虛擬社群的功能研究——以“表白牆”為例[J].河北軟體職業技術學院學報,2020,22(04):37-41.DOI:10.13314/j.cnki.jhbsi.2020.04.010.
53.Liu Y, Ott M, Goyal N, et al. Roberta: A robustly optimized bert pretraining approach[J]. arXiv preprint arXiv:1907.11692, 2019.
54.Peng S, Yuan K, Gao L, et al. Mathbert: A pre-trained model for mathematical formula understanding[J]. arXiv preprint arXiv:2105.00377, 2021.
55.Yu W, Jiang Z, Chen F, et al. LV-BERT: Exploiting layer variety for BERT[J]. arXiv preprint arXiv:2106.11740, 2021.
56.Yang D, Wang X, Celebi R. Expanding the Vocabulary of BERT for Knowledge Base Construction[J]. arXiv preprint arXiv:2310.08291, 2023.
57.Tänzer M, Ruder S, Rei M. BERT memorisation and pitfalls in low-resource scenarios[J]. CoRR, abs/2105.00828, 2021.
58.Brownlee J. A gentle introduction to backpropagation through time[J]. Machine Learning Mastery, 2017.