Abstract:
Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
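The partitioning and sampling schemes named in the abstract can be illustrated with a minimal single-machine sketch. It is not taken from the paper: the function names, the CRC32 hash, the round-robin dealing in `random_partition`, and the fixed seeds are all assumptions chosen for illustration, and `random_partition` is a plain shuffle-and-split, not an implementation of the RSP model.

```python
import random
import zlib
from bisect import bisect_right


def range_partition(records, boundaries, key=lambda r: r):
    """Range partitioning: a record goes to the block whose key range holds it.

    `boundaries` is a sorted list of split points; k boundaries give k + 1 blocks.
    """
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        blocks[bisect_right(boundaries, key(r))].append(r)
    return blocks


def hash_partition(records, n_blocks, key=lambda r: r):
    """Hash partitioning: block index is hash(key) mod n_blocks.

    CRC32 is an illustrative choice that is stable across runs.
    """
    blocks = [[] for _ in range(n_blocks)]
    for r in records:
        h = zlib.crc32(repr(key(r)).encode("utf-8"))
        blocks[h % n_blocks].append(r)
    return blocks


def random_partition(records, n_blocks, seed=0):
    """Random partitioning: shuffle, then deal records round-robin into blocks."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def reservoir_sample(stream, k, seed=0):
    """Reservoir sampling (Algorithm R): uniform k-sample from a stream
    whose length is not known in advance."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir


def block_level_sample(blocks, n_selected, seed=0):
    """Block-level sampling: pick whole blocks at random, not single records."""
    return random.Random(seed).sample(blocks, n_selected)


data = list(range(100))
by_range = range_partition(data, boundaries=[25, 50, 75])
by_random = random_partition(data, n_blocks=4)
record_sample = reservoir_sample(iter(data), k=10)
block_sample = block_level_sample(by_range, n_selected=1)
```

In this toy setup, a block drawn from `by_range` covers only one quarter of the key range, which is one way to see why block-level sampling over classically partitioned blocks may not be representative, whereas each block of `by_random` mixes keys from the whole range.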
Document information
Title: A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis
Source journal: Big Data Mining and Analytics (大数据挖掘与分析(英文))
Discipline: Engineering
Keywords: big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing
Year, volume (issue): 2020, (2)
Page range: 85-101
Page count: 17
Classification number: TP311.13
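Journal impact
Journal: Big Data Mining and Analytics (大数据挖掘与分析(英文))
Frequency: quarterly
ISSN: 2096-0654
CN: 10-1514/G2
Articles published: 91
Total downloads: 3
Total citations: 0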