An AkNN Algorithm for High-Dimensional Big Data

Zhongwei Wang; Yefang Chen; Siyou Xiao; Jiangbo Qian

doi:10.11959/j.issn.1000-0801.2015171

Research and Development | Updated: 2024-06-05

- Telecommunications Science, 2015, Vol. 31, No. 7, pp. 52-62
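- About the authors:
  
  - Zhongwei Wang, male, is a master's student at Ningbo University. His main research interest is data mining.
  - Yefang Chen, female, is a lecturer at Ningbo University. Her main research interests are data processing and mining.
  - Siyou Xiao, male, Ph.D., is an associate professor at Ningbo University. His main research interests are data processing and mining.
  - Jiangbo Qian, male, Ph.D., is a professor at Ningbo University. His main research interests are data processing and mining, and logic circuit design.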
- Funding:
  
  The National Natural Science Foundation of China (61472194); The Natural Science Foundation of Zhejiang Province (LY13F020040); The Natural Science Foundation of Ningbo City (2014A610023, 2015A610119); Zhejiang Key Discipline Fund of Information and Communication Engineering (xkxl1423)
- Online publication date: 2015-07; print publication date: 2015-07-20

Zhongwei Wang, Yefang Chen, Siyou Xiao, et al. An AkNN Algorithm for High-Dimensional Big Data[J]. Telecommunications Science, 2015, 31(7): 52-62. DOI: 10.11959/j.issn.1000-0801.2015171.

Abstract

The all k-nearest neighbor (AkNN) query is a variant of the k-nearest neighbor query that finds the k nearest neighbors of every object in a given data set within a single query process. This paper proposes an AkNN query algorithm for high-dimensional big data on the Hadoop distributed platform. The algorithm first reduces the dimensionality of the data objects by combining the banding technique with the p-stable LSH algorithm. It then exploits the favorable properties of the Z-order space-filling curve to embed the reduced data in a one-dimensional space, where range queries are performed. The whole process runs in a distributed, parallel manner on the MapReduce framework. Experimental results show that the proposed algorithm handles AkNN queries on high-dimensional big data efficiently.
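The pipeline outlined in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: it omits the banding step and the MapReduce distribution, and the "range query" is approximated by scanning a fixed window of neighbors in Z-order. The function names and the parameters `m`, `w`, `window`, and `seed` are hypothetical choices made only for this illustration.

```python
import math
import random


def make_pstable_hashes(dim, m, w, seed=0):
    """Draw m hash functions h(v) = floor((a . v + b) / w).

    a has i.i.d. N(0, 1) entries (a 2-stable distribution) and b ~ U[0, w).
    """
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
            for _ in range(m)]


def reduce_point(v, hashes, w):
    """Map a high-dimensional point to m integer coordinates."""
    return [math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in hashes]


def z_value(coords, bits):
    """Interleave the low `bits` bits of non-negative ints into a Z-order key."""
    z = 0
    m = len(coords)
    for bit in range(bits):
        for i, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * m + i)
    return z


def aknn_zorder(points, k, m=4, w=4.0, window=8, seed=0):
    """Approximate all-kNN: for each point, return indices of k near neighbors.

    Assumption: candidates are the `window` points on either side in Z-order;
    they are then ranked by exact Euclidean distance in the original space.
    """
    dim = len(points[0])
    hashes = make_pstable_hashes(dim, m, w, seed)
    reduced = [reduce_point(p, hashes, w) for p in points]
    # Shift each reduced coordinate to be non-negative before interleaving.
    mins = [min(r[i] for r in reduced) for i in range(m)]
    shifted = [[r[i] - mins[i] for i in range(m)] for r in reduced]
    bits = max(1, max(max(s) for s in shifted).bit_length())
    # Sorting by Z-value embeds the reduced points in one dimension.
    order = sorted(range(len(points)), key=lambda i: z_value(shifted[i], bits))
    result = {}
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
        cands = [j for j in order[lo:hi] if j != i]
        cands.sort(key=lambda j: math.dist(points[i], points[j]))
        result[i] = cands[:k]
    return result


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.random() for _ in range(32)] for _ in range(200)]
    neighbours = aknn_zorder(data, k=5)
    print(neighbours[0])
```

A larger `window` trades running time for accuracy: when the window covers the whole data set, the sketch degenerates to an exact brute-force search, while a small window relies on the Z-order curve keeping nearby points close together in the sorted order.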
