1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027
2. Zhejiang University City College, Hangzhou 310015
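Canghong Jin (金苍宏), male, PhD candidate at Zhejiang University. His main research interests are information retrieval, data mining, and big data frameworks.
Zemin Liu (刘泽民), male, master's student at Zhejiang University. His main research interests are big data platforms, stream data processing, and data mining.
Minghui Wu (吴明晖), male, PhD, professor at Zhejiang University City College. His main research interests are software engineering and artificial intelligence.
Jing Ying (应晶), male, PhD, professor and doctoral supervisor at Zhejiang University. His main research interests are software development methods, software architecture, and software engineering standards.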
Online publication date: 2014-09
Print publication date: 2014-09-20
Canghong Jin, Zemin Liu, Minghui Wu, et al. A Cube Analytical Mining Framework for Stream Data[J]. Telecommunications Science, 2014, 30(9): 61-71. DOI: 10.3969/j.issn.1000-0801.2014.09.009.
Stream data has become an important form of data, and OLAM (online analytical mining) over streams gives analysts multi-level views of the data. OLAM, however, requires aggregating data at different granularities, while stream data is temporal and arrives continuously, so it cannot be scanned repeatedly. Traditional OLAP (online analytical processing) methods cannot build partially materialized views directly from a stream, and the volume of stream data is so large that limited storage cannot hold every data cell. To address these problems, a sketch-based OLAM framework for stream data, named sketch cube, is proposed. The framework maps any combination of dimensions to a unique natural number, prunes dimension combinations using the monotonicity of upper and lower bounds, stores the valid data cells in a sub-linear data structure, and builds a time-series index (an inverted index over tilted time windows) to speed up retrieval. Theoretical analysis gives the preconditions for using sketch cube, and experiments on a large real-world mobile stream data set indicate that sketch cube meets the requirements of real-time mining in effectiveness, storage efficiency, and accuracy.
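The abstract does not say which mapping or which sketch the framework uses. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the Cantor pairing function for mapping a dimension combination to a unique natural number, a count-min sketch as the sub-linear store, and fixed-size time windows in place of tilted ones. All names (`cantor_pair`, `encode_cell`, `CountMinSketch`, `WindowedSketchStore`) are hypothetical, and the convention that code 0 stands for a rolled-up dimension ("*") is also an assumption.

```python
import hashlib
from functools import reduce
from typing import Dict, Sequence


def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing: a bijection from pairs of naturals to naturals."""
    return (a + b) * (a + b + 1) // 2 + b


def encode_cell(dim_codes: Sequence[int]) -> int:
    """Fold a fixed-length tuple of non-negative dimension codes into one
    natural number. Assumption: code 0 means the dimension is rolled up."""
    return reduce(cantor_pair, dim_codes)


class CountMinSketch:
    """Approximate counter; estimates never fall below the true count."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: int, row: int) -> int:
        # One hash per row, derived by salting the key with the row number.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key: int, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key: int) -> int:
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


class WindowedSketchStore:
    """Hypothetical helper: one count-min sketch per fixed-size time window."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        self.windows: Dict[int, CountMinSketch] = {}

    def add(self, timestamp: int, dim_codes: Sequence[int]) -> None:
        window_id = timestamp // self.window_seconds
        if window_id not in self.windows:
            self.windows[window_id] = CountMinSketch()
        self.windows[window_id].add(encode_cell(dim_codes))

    def estimate(self, window_id: int, dim_codes: Sequence[int]) -> int:
        sketch = self.windows.get(window_id)
        return 0 if sketch is None else sketch.estimate(encode_cell(dim_codes))


if __name__ == "__main__":
    store = WindowedSketchStore(window_seconds=60)
    # Three dimensions; the third is rolled up (code 0).
    for ts in (5, 20, 42):
        store.add(ts, [1, 2, 0])
    print(store.estimate(0, [1, 2, 0]))  # an upper-bound estimate of the count
```

Folding the pairing function left to right keeps the mapping injective only for tuples of the same length, which is why the rolled-up dimensions are encoded explicitly as 0 rather than omitted. The count-min estimate can overcount when keys collide but cannot undercount, so it is suited to pruning rules that rely on upper bounds.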