This study explores and optimizes multi-label classification of text posts on the campus wall of a typical university, aiming to improve the efficiency and quality of campus wall content management and the user experience. Campus wall operation faces several challenges: a heavy manual moderation burden, noisy post data, insufficient robustness of off-the-shelf BERT models, and the inherent complexity of multi-label classification. To address these, we construct a large-scale campus wall text dataset containing 234,237 texts. To enhance model robustness and tackle the difficulty of multi-label classification, we select the BERT pre-trained model and perform further domain-specific pre-training to improve its performance. We then design a deep learning model for the automated label classification task, employing BERT and BiLSTM as feature extractors. For the choice of loss function, we compare the performance of two loss functions on the imbalanced dataset: traditional binary cross-entropy and focal loss.
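The loss comparison above can be sketched as follows. This is a minimal per-label illustration of the α-balanced focal loss of Lin et al. against standard binary cross-entropy; the hyperparameter values (α = 0.25, γ = 2) are the common defaults from the focal loss paper, not necessarily those used in our experiments:

```python
import math

def binary_ce(p, y):
    # Standard binary cross-entropy for a single label with
    # predicted probability p and ground truth y in {0, 1}.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Alpha-balanced focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    # p_t is the probability assigned to the true class; the (1 - p_t)^gamma
    # factor shrinks the loss of well-classified examples, so training focuses
    # on the hard, rare labels that dominate error on imbalanced data.
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified positive (p = 0.9) is down-weighted far more
# than a hard one (p = 0.1); binary cross-entropy applies no such reweighting.
easy, hard = focal_loss(0.9, 1), focal_loss(0.1, 1)
print(easy, hard, binary_ce(0.9, 1))
```

In a multi-label setting, either loss is applied independently to each label's sigmoid output and summed, which is what makes the per-label imbalance correction of focal loss directly applicable.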
The experimental results show that the BERT model with domain-specific pre-training (bbc4s) performs better on the multi-label classification task, confirming the importance of domain-specific pre-training for improving a model's understanding of a specific type of text, and that the focal loss function is more effective at handling the class imbalance problem. This study not only provides an effective solution for the automated label classification task on campus walls, but also offers new ideas for optimizing campus wall content management and information retrieval.