A Parallel ETL Tool Based on an Improved Chain-MapReduce Framework

Wu Bin; Liu Xinguang

doi:10.3969/j.issn.1000-0801.2013.12.001

Research and Development | Updated: 2024-06-05

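- A Parallel ETL Tool Based on an Improved Chain-MapReduce Framework
- Telecommunications Science, 2013, Vol. 29, No. 12, pp. 1-8
- About the authors:
  
  Wu Bin, male, is a professor and doctoral supervisor at Beijing University of Posts and Telecommunications. His main research interests are data mining and cloud computing.
  Liu Xinguang, male, is a master's student at Beijing University of Posts and Telecommunications. His main research interests are databases, data mining, and complex networks and complex systems.
- Funding:
  
  Supported by the National Natural Science Foundation of China (61074128)
- DOI: 10.3969/j.issn.1000-0801.2013.12.001
- Print publication date: 2013-12-20
  
  Online publication date: 2013-12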
Wu Bin, Liu Xinguang. A Parallel ETL Tool Based on an Improved Chain-MapReduce Framework[J]. Telecommunications Science, 2013, 29(12): 1-8. DOI: 10.3969/j.issn.1000-0801.2013.12.001.

Abstract

This paper reviews related work on parallel ETL and common approaches to handling workflows made up of multiple MapReduce jobs. It then proposes an improved chain-MapReduce framework and applies it to a parallel ETL tool. It also proposes several workflow-level optimization rules for ETL processing that make an ETL workflow generate fewer MapReduce jobs, which reduces I/O and network transfer costs. Large-scale comparison experiments against Hive, run on mobile Internet access data from one province, show that the tool reduces running time by 10% to 20% on average.

Keywords

ETL; optimization rule; improved chain-MapReduce
