CITYU Theses & Dissertations
Thesis Details
杨昊
蘇清朗
Faculty of Data Science
Master of Data Science Programme (Chinese-medium)
Master's
2019
基於搜索引擎的用戶畫像模型搭建研究 : 集成學習 Stacking 算法的改進
The Research on User Profile Model Based on Search Engine: Improvement of Stacking Algorithm for Ensemble Learning
Search text; user profile; feature model; Stacking model
As industries everywhere embrace big data, user profiling has gained attention across major Internet applications. Search engines are among the most commercially valuable of these applications, yet they make user profiling comparatively difficult: people can use them without logging in, and they interact with their users less deeply than other Internet applications do. This study therefore applies data science techniques to collected search data from Internet users, mines users' basic attributes from a natural language processing perspective, and builds a search-engine-based user profile framework.
Against this background, the thesis starts from the research literature on natural language processing and user profiling and reviews the key techniques and theory. The data are preprocessed: the search text is segmented with the Jieba library run in parallel, and hand-crafted features are proposed based on the characteristics of the original dataset. To address problems that arise when the TF-IDF algorithm is applied to this data, a TF-IDF algorithm based on the Boolean model is proposed and its effectiveness is verified. The thesis then analyzes the factors that affect the relationship between users and search advertisements, selects the main dimensions with the greatest influence on ad click-through rate, and proposes a process for constructing the dimensions of a user profile. Next, taking the features produced by the improved TF-IDF algorithm as input, the highly flexible Stacking framework is selected through experimental comparison. It combines several machine learning algorithms, and word vectors trained with Word2vec are added at the second layer as new features for predicting the user profile dimensions. Experiments and comparisons establish the stability and accuracy of the framework.
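The abstract does not give the formula of the Boolean-model TF-IDF, so the following is only a minimal sketch of one common reading: the term frequency is replaced by a 0/1 presence flag, which suits short search queries, and is multiplied by a smoothed inverse document frequency. The function name `boolean_tfidf`, the smoothing constants, and the toy queries are assumptions made for illustration, and the between-class distribution handling mentioned in the conclusions is not modeled here.

```python
import math
from collections import Counter


def boolean_tfidf(docs):
    """Weight terms with Boolean TF x smoothed IDF.

    docs: list of token lists (already segmented search queries).
    Returns one {term: weight} dict per document.
    Hypothetical helper; not the thesis's exact formula.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so a term present in every document keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    # Boolean TF: a term counts once per document, however often it repeats.
    return [{t: idf[t] for t in set(doc)} for doc in docs]


# Toy, made-up queries standing in for segmented search text.
queries = [
    ["phone", "price", "phone"],
    ["phone", "game"],
    ["skincare", "price"],
]
weights = boolean_tfidf(queries)
```

In this sketch the repeated "phone" in the first query receives the same weight as the single "phone" in the second, while a term that appears in fewer queries, such as "game", is weighted higher.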
The conclusions of this thesis are as follows: (1) a TF-IDF algorithm based on the Boolean model, extended with between-class distribution handling, can improve the handling of data with imbalanced features; (2) introducing word vectors as supplementary semantic information improves classification accuracy; (3) the Stacking model predicts the attributes of search engine users well. The experiments show that the model retains good classification accuracy even with little data, and that accuracy keeps rising as the amount of training data grows, which indicates that the model is stable.
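The two-layer idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the thesis's implementation: the base learners here are toy nearest-centroid classifiers standing in for models such as LR and XGBoost, the `extra` column stands in for averaged Word2vec vectors, and all names (`NearestCentroid`, `OnColumns`, `out_of_fold`, `fit_stack`, `predict_stack`) and data are hypothetical. What it does show is the general stacking mechanism: out-of-fold predictions from the first layer become features of the second layer, where they are concatenated with the additional semantic features.

```python
from statistics import mean


class NearestCentroid:
    """Toy learner: predicts the class whose feature centroid is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: dist(x, self.centroids[c]))
                for x in X]


class OnColumns:
    """Base learner that only sees a subset of the feature columns."""

    def __init__(self, cols):
        self.cols = cols
        self.model = NearestCentroid()

    def _select(self, X):
        return [[x[c] for c in self.cols] for x in X]

    def fit(self, X, y):
        self.model.fit(self._select(X), y)
        return self

    def predict(self, X):
        return self.model.predict(self._select(X))


def out_of_fold(make_model, X, y, k=3):
    """One prediction per row, each from a model that never saw that row."""
    oof = [None] * len(X)
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        for i, p in zip(test_idx, model.predict([X[i] for i in test_idx])):
            oof[i] = p
    return oof


def fit_stack(base_factories, X, y, extra, k=3):
    # Layer 1: out-of-fold predictions of every base learner.
    oof_cols = [out_of_fold(f, X, y, k) for f in base_factories]
    # Layer 2 input: base predictions plus the extra (word-vector-like) features.
    meta_X = [list(p) + list(e) for p, e in zip(zip(*oof_cols), extra)]
    meta = NearestCentroid().fit(meta_X, y)
    # Refit the base learners on all data for use at prediction time.
    bases = [f().fit(X, y) for f in base_factories]
    return bases, meta


def predict_stack(bases, meta, X, extra):
    cols = [b.predict(X) for b in bases]
    meta_X = [list(p) + list(e) for p, e in zip(zip(*cols), extra)]
    return meta.predict(meta_X)


# Made-up data: two TF-IDF-like features, one extra feature, binary label.
X = [[0.1, 0.9], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1], [0.0, 1.0],
     [1.0, 0.0], [0.2, 0.7], [0.7, 0.3], [0.1, 0.8]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]
extra = [[0.0], [1.0], [0.1], [0.9], [0.0], [1.0], [0.2], [0.8], [0.1]]

base_factories = [lambda: OnColumns([0]), lambda: OnColumns([1])]
bases, meta = fit_stack(base_factories, X, y, extra)
print(predict_stack(bases, meta, [[0.15, 0.85], [0.85, 0.15]], [[0.05], [0.95]]))
```

Using out-of-fold predictions, rather than predictions on the rows a base learner was trained on, keeps the second layer from learning on leaked labels. That is the usual reason stacking is built around K-fold splits.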
2020
Chinese
83
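Acknowledgements
Abstract (Chinese)
Abstract
List of Figures
List of Tables
List of Formulas
Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Research Background
1.1.2 Research Significance
1.1.3 Research Objectives
1.2 Research Subject
1.3 Research Content, Methods, and Technical Framework
1.3.1 Research Content
1.3.2 Technical Roadmap
Chapter 2 Literature Review
2.1 User Profiles
2.1.1 Overview of User Profile Research
2.1.2 Literature on User Profiles
2.2 Search Text Extraction
2.2.1 Overview of Natural Language Processing Research
2.2.2 Literature on Natural Language Processing
2.3 Short Text Analysis
2.3.1 Overview of Text Analysis Research
2.3.2 Literature on Text Segmentation
2.4 Feature Extraction from Chinese Text
2.5 Theoretical Model of User Profiles
2.5.1 Dimensions of User Profiles
2.5.2 Process for Building a User Profile Framework from Search Text
Chapter 3 Research Methods and Design
3.1 Data Description
3.2 Data Preprocessing
3.2.1 Stop-Word Handling
3.2.2 Word Segmentation
3.2.3 Missing and Imbalanced Data
3.3 Feature Extraction
3.3.1 Improvement and Application of TF-IDF for Short Text
3.3.2 Application of Word2vec
3.4 Research Timeline
3.5 Chapter Summary
Chapter 4 Data Analysis and Results
4.1 Building the User Profile
4.2 Machine Learning Algorithms for Short Text
4.2.1 LR Algorithm
4.2.2 XGBoost Algorithm
4.2.3 Stacking Framework
4.3 Stacking Framework for Short Text
4.3.1 Design of the First Layer of the Stacking Framework
4.3.2 Design of the Second Layer of the Stacking Framework
Chapter 5 Conclusions and Recommendations
5.1 Conclusions and Applications of the User Profile Model
5.2 Applications of the Search-Text-Based User Profile Model
5.2.1 User Profile Classification Based on Search Text
5.2.2 User Recommendation Based on User Profiles
5.3 Summary
5.4 Limitations and Outlook
5.4.1 Limitations of the Study
5.4.2 Outlook for Future Research
References
Author's Biography
Appendix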