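Wu Bin, male, is a professor and doctoral supervisor at Beijing University of Posts and Telecommunications. His main research interests are data mining and cloud computing.
Liu Xinguang, male, is a master's student at Beijing University of Posts and Telecommunications. His main research interests are databases, data mining, and complex networks and complex systems.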
Print publication date: 2013-12-20
Online publication date: 2013-12
Wu Bin, Liu Xinguang. A parallel ETL tool based on an improved chain-MapReduce framework [J]. Telecommunications Science, 2013, 29(12): 1-8. DOI: 10.3969/j.issn.1000-0801.2013.12.001.
This paper reviews related work on parallel ETL and common methods for handling workflows made up of multiple MapReduce jobs. It then presents an improved chain-MapReduce framework and applies it to a parallel ETL tool. Several flow-level optimization rules for ETL processing are also proposed so that an ETL flow generates fewer MapReduce jobs, which reduces I/O and network transfer costs. The tool was compared with Hive in large-scale experiments on real queries over mobile Internet access data from one province. The results show that the tool reduces running time by 10% to 20% on average compared with Hive.
Keywords: ETL; optimization rule; improved chain-MapReduce
Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating Systems Design & Implementation, San Francisco, USA, 2004
Apache. Hive. http://hive.apache.org/, 2013
Chen S T. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proceedings of the VLDB Endowment, 2010, 3(1): 1459-1468
Melnik S, Gubarev A, Long J J, et al. Dremel: interactive analysis of web-scale datasets. Communications of the ACM, 2011, 54(6): 114-123
Cloudera. Impala. https://ccp.cloudera.com/display/IMPALA10 BETADOC
Olston C, Reed B, Srivastava U, et al. Pig Latin: a not-so-foreign language for data processing. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008: 1099-1110
IBM. JAQL. http://www-01.ibm.com/software/data/infosphere/hadoop/jaql/
Chambers C, Raniwala A, Perry F, et al. FlumeJava: easy, efficient data-parallel pipelines. ACM SIGPLAN Notices, 2010, 45(6): 363-375
Cascading. http://www.cascading.org/, 2012
Hortonworks. Tez. http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/, 2013
Bu Y Y, Howe B, Balazinska M, et al. HaLoop: efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 2010, 3(1/2): 285-296
Ekanayake J, Li H, Zhang B J, et al. Twister: a runtime for iterative MapReduce. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, 2010: 810-818