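Xiaohuai KOU (1989-), male, is a master's student at the School of Information Science and Engineering, East China University of Science and Technology. His main research interests are information analysis and processing, intelligent signal processing, and network and information security.
Hua CHENG (1975-), male, Ph.D., is an associate professor at the School of Information Science and Engineering, East China University of Science and Technology. His main research interests are information security, signal processing, network behavior, and traffic engineering.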
Online publication date: 2017-11
Print publication date: 2017-11-20
Xiaohuai KOU, Hua CHENG. Design and implementation of spam filtering system based on topic model[J]. Telecommunications Science, 2017, 33(11): 73-82. DOI: 10.11959/j.issn.1000-0801.2017313.
Spam filtering technology plays an important role in ensuring information security, improving resource utilization, and sorting information. Spam itself degrades the user experience and causes unnecessary losses of money and time. To address the shortcomings of existing spam filtering techniques, a naive Bayes spam classification method built on multiple topic words is proposed. To obtain the topics of an email, the LDA topic model is used to extract the relevant topics and their topic words, and Word2Vec is then used to find synonyms and related words of those topic words, which extends the topic-word set. For classification, the prior probabilities of words are obtained by statistical learning on the training dataset. Based on the extended topic-word set and its probabilities, the joint probability of a topic and an email is derived with Bayes' formula and serves as the basis for the spam decision. The resulting topic-model-based spam filtering system is also simple and easy to apply. Comparative experiments against other typical spam filtering methods show that both the topic-model-based classification method and its Word2Vec-based improvement effectively improve the accuracy of spam filtering.
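The classification step outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the topic words and their expansions are hard-coded stand-ins for what LDA and Word2Vec would produce, the two labels `spam` and `ham` play the role of topics, add-one smoothing is an assumed choice, and all names (`train`, `log_joint`, `classify`, the toy emails) are hypothetical.

```python
import math
from collections import Counter

def train(docs, labels, vocab):
    """Estimate label priors and add-one-smoothed word likelihoods over vocab."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(docs, labels):
        # Only words in the (expanded) topic-word set are counted.
        word_counts[label].update(t for t in tokens if t in vocab)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing (assumed)
        likelihoods[c] = {w: (counts[w] + 1) / total for w in vocab}
    return priors, likelihoods

def log_joint(tokens, c, priors, likelihoods):
    """log P(c) + sum of log P(w | c) over the email's words found in the vocabulary."""
    score = math.log(priors[c])
    for t in tokens:
        if t in likelihoods[c]:
            score += math.log(likelihoods[c][t])
    return score

def classify(tokens, priors, likelihoods):
    """Return the label with the highest joint log-probability."""
    return max(priors, key=lambda c: log_joint(tokens, c, priors, likelihoods))

# Hypothetical stand-ins: topic words (as if from LDA) and their
# synonyms or related words (as if from Word2Vec nearest neighbours).
topic_words = {"free", "prize", "meeting"}
expansion = {"free": {"bonus"}, "prize": {"winner"}, "meeting": {"agenda"}}
vocab = set(topic_words).union(*expansion.values())

# Hypothetical toy training set of tokenized emails.
train_docs = [
    ["free", "prize", "winner", "click"],
    ["bonus", "free", "offer"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

priors, likelihoods = train(train_docs, train_labels, vocab)
```

Calling `classify(["free", "bonus", "today"], priors, likelihoods)` scores the email under each label using only words from the expanded set. Working in log space avoids underflow when an email contains many vocabulary words.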