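SHI Enming (1991-), male, works at Guangzhou Youyi Information Technology Co., Ltd. His main research interests include data mining, artificial intelligence, and geographic information systems.
XIAO Xiaojun (1970-), male, Ph.D., is a senior engineer at Guangzhou Youyi Information Technology Co., Ltd. His main research interests include big data, data mining, and telecommunications industry applications.
LU Yu (1983-), male, works at Guangzhou Youyi Information Technology Co., Ltd. His main research interests include big data, machine learning, and artificial intelligence.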
Online publication date: 2017-08
Print publication date: 2017-08-15
Enming SHI, Xiaojun XIAO, Yu LU. Research and design of distributed high-performance network reptiles based on cloud platform[J]. Telecommunications Science, 2017, 33(8): 180-186. DOI: 10.11959/j.issn.1000-0801.2017234.
With the arrival of the big data era, data has become the most valuable resource, and web crawler technology, as an important means of collecting external data, has become a standard tool for data analysis. This paper introduces the design and implementation of a high-performance, flexible, and convenient crawler architecture based on a cloud platform. The crawler is described in detail in terms of its overall architecture, its distributed design, and the design of each module. Each crawler module is packaged with Docker, and Kubernetes handles resource scheduling and management for the cluster. Performance optimization combines several strategies, including an MD5 deduplication tree algorithm, DNS optimization, and asynchronous I/O. Experiments show that the crawler has a clear performance advantage over the unoptimized scheme.
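The abstract names an MD5 deduplication tree but does not describe its structure. As a minimal sketch only, assuming the tree is a trie keyed on the hexadecimal digits of each URL's MD5 digest, URL deduplication could look like the following. The class name `MD5DedupTree` and its methods are hypothetical and are not taken from the paper.

```python
import hashlib


class MD5DedupTree:
    """Illustrative trie keyed on the hex digits of each URL's MD5 digest.

    Assumption: this is one plausible reading of an "MD5 deduplication
    tree"; the paper's actual data structure may differ.
    """

    def __init__(self):
        self.root = {}   # nested dicts, one level per hex digit
        self.size = 0    # number of distinct URLs stored

    @staticmethod
    def _digest(url: str) -> str:
        # MD5 is used here only as a fixed-length fingerprint, not for security.
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def add(self, url: str) -> bool:
        """Insert url; return True if it was new, False if already seen."""
        node = self.root
        is_new = False
        for ch in self._digest(url):
            if ch not in node:
                node[ch] = {}
                is_new = True  # a new branch means this digest was unseen
            node = node[ch]
        if is_new:
            self.size += 1
        return is_new

    def __contains__(self, url: str) -> bool:
        node = self.root
        for ch in self._digest(url):
            node = node.get(ch)
            if node is None:
                return False
        return True


if __name__ == "__main__":
    seen = MD5DedupTree()
    for link in ["http://example.com/a", "http://example.com/a", "http://example.com/b"]:
        if seen.add(link):
            print("crawl", link)
        else:
            print("skip duplicate", link)
```

Because every digest has the same length, a URL is new exactly when at least one node on its path had to be created, so a single pass both checks and inserts.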