Today, nearly every industry is closely tied to big data, and user profiles play an important role in Internet applications. Search engines are among the most commercially valuable Internet applications, but building user profiles for them is comparatively difficult: users can search without logging in, and interaction with users is shallower than in other Internet applications. For this reason, this research uses data science techniques to analyze search data collected from Internet users, mines users' basic attributes from a natural language processing perspective, and constructs a user profile framework for search engines.
Against this background, this thesis reviews the research literature on natural language processing and user profiling and summarizes the key technologies and theory. In data processing, the "Jieba" library is combined with parallel processing to segment the search text into words. Based on the characteristics of the original dataset and the problems encountered in use, the TF-IDF algorithm is improved: a TF-IDF algorithm based on the Boolean model is proposed and its efficiency is verified. The factors linking users and search advertising are then analyzed, and the main dimensions that influence ad clicks are selected. A process for constructing user profiles is proposed, and the dimensions of the user profile are defined. With the features produced by the improved TF-IDF algorithm as input, experimental comparison leads to the choice of the highly flexible Stacking framework, which combines several machine learning algorithms. In addition, word vectors trained with Word2vec are added as a new feature in the second layer of the Stacking framework. The framework is then used to predict the dimensions of the user profile, and its stability and accuracy are assessed through experiments and comparison.
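The abstract does not give the formula for the Boolean-model TF-IDF. As a minimal sketch of one common reading, the term frequency can be replaced by a binary presence indicator, so that repeating a word within one search record does not raise its weight. The function name `boolean_tfidf`, the smoothed IDF form, and the toy tokens are assumptions for illustration, not the thesis's actual definition; the input is assumed to be already segmented (for example by Jieba).

```python
import math
from typing import Dict, List


def boolean_tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by (binary TF) x (smoothed IDF).

    Assumption: "Boolean model" means TF is 1 if the term occurs in the
    document and 0 otherwise. The smoothing constants are illustrative.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    doc_freq: Dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weights: List[Dict[str, float]] = []
    for tokens in docs:
        # Binary TF is 1 for every present term, so the weight is the IDF.
        weights.append({
            term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            for term in set(tokens)
        })
    return weights


# Hypothetical, already segmented search records (one list per user query).
docs = [
    ["phone", "price", "phone"],
    ["phone", "game"],
    ["school", "exam"],
]
weights = boolean_tfidf(docs)
```

In this sketch, "phone" receives the same weight in the first and second records even though it appears twice in the first, and the rarer term "price" outweighs it. The inter-class distribution processing mentioned in the conclusions is not covered here.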
The conclusions of this thesis are as follows: (1) a TF-IDF algorithm based on the Boolean model, together with inter-class distribution processing, can be introduced to correct for data with unbalanced features; (2) introducing word vectors as supplementary semantic information improves classification accuracy; (3) the Stacking model predicts the attributes of search engine users well. The experiments show that the model retains good classification accuracy even with little data and that accuracy rises as the amount of training data grows, so the model is stable.
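Conclusions (2) and (3) describe a two-layer arrangement: base learners are trained on the TF-IDF features, and their predictions are passed to a second layer together with word-vector features. The following is a rough sketch of that data flow only. The two toy classifiers, the fold scheme, and all names (`NearestCentroid`, `OneNearestNeighbour`, `fit_stacking`, `X_vec`) are hypothetical stand-ins, not the algorithms or data used in the thesis.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Toy classifier: predicts the label whose class mean is closest."""

    def fit(self, X: List[Vector], y: List[int]) -> "NearestCentroid":
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict(self, X: List[Vector]) -> List[int]:
        return [min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))
                for x in X]


class OneNearestNeighbour:
    """Toy classifier: copies the label of the closest training row."""

    def fit(self, X: List[Vector], y: List[int]) -> "OneNearestNeighbour":
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X: List[Vector]) -> List[int]:
        return [self.y[min(range(len(self.X)),
                           key=lambda i: sq_dist(x, self.X[i]))] for x in X]


def out_of_fold(make_model: Callable, X: List[Vector], y: List[int], k: int) -> List[float]:
    """Predict each row with a model that never saw that row during training."""
    preds = [0.0] * len(X)
    for fold in range(k):
        test = [i for i in range(len(X)) if i % k == fold]
        train = [i for i in range(len(X)) if i % k != fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, model.predict([X[i] for i in test])):
            preds[i] = float(p)
    return preds


def fit_stacking(X_tfidf, X_vec, y, k=2):
    factories = [NearestCentroid, OneNearestNeighbour]
    oof = [out_of_fold(f, X_tfidf, y, k) for f in factories]
    # Layer 2 input: base-learner predictions plus word-vector features.
    meta_X = [[col[i] for col in oof] + list(X_vec[i]) for i in range(len(y))]
    base_models = [f().fit(X_tfidf, y) for f in factories]
    return base_models, NearestCentroid().fit(meta_X, y)


def predict_stacking(base_models, meta_model, X_tfidf, X_vec):
    base = [m.predict(X_tfidf) for m in base_models]
    meta_X = [[float(col[i]) for col in base] + list(X_vec[i])
              for i in range(len(X_tfidf))]
    return meta_model.predict(meta_X)


# Hypothetical toy data: 2-d "TF-IDF" rows, 1-d "word vector" rows, binary labels.
X_tfidf = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.0], [1.0, 0.2],
           [0.0, 1.0], [0.1, 0.9], [0.0, 0.8], [0.2, 1.0]]
X_vec = [[0.5], [0.4], [0.6], [0.5], [-0.5], [-0.4], [-0.6], [-0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

base_models, meta_model = fit_stacking(X_tfidf, X_vec, y)
predictions = predict_stacking(base_models, meta_model,
                               [[0.9, 0.0], [0.1, 1.0]], [[0.45], [-0.55]])
```

Using out-of-fold predictions for the second layer keeps the meta-learner from training on outputs that the base learners produced for rows they had already seen, which is the usual motivation for this design in stacked generalization.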