A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Joshua Zhexue Huang; Kuanishbay Sadatdiynov; Mohammad Sultan Mahmud; Salman Salloum; Tamer Z.Emara



Keywords: big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing

Abstract:

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
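
The abstract's contrast between partitioning schemes and block-level sampling can be illustrated with a small sketch. This is a toy, single-machine illustration and not the paper's implementation: the function names, the block count, and the integer data set are all assumptions, and the random partition shown here is only a simplified stand-in for the idea behind the RSP model, not the RSP algorithm itself. It shows why one block taken from a range partition can be a poor sample of the whole data set, while one block taken from a random partition tends to resemble it.

```python
import bisect
import random
import zlib


def range_partition(records, key, boundaries):
    """Range partitioning: block i holds keys in [boundaries[i-1], boundaries[i])."""
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        blocks[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return blocks


def hash_partition(records, key, num_blocks):
    """Hash partitioning: block index is a stable hash of the key modulo num_blocks."""
    blocks = [[] for _ in range(num_blocks)]
    for rec in records:
        h = zlib.crc32(repr(key(rec)).encode("utf-8"))
        blocks[h % num_blocks].append(rec)
    return blocks


def random_partition(records, num_blocks, seed=0):
    """Random partitioning: shuffle once, then deal records round-robin into blocks."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def reservoir_sample(stream, k, seed=0):
    """Reservoir sampling (Algorithm R): uniform sample of k items in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    data = list(range(1000))  # hypothetical toy data set
    by_range = range_partition(data, key=lambda x: x, boundaries=[250, 500, 750])
    by_random = random_partition(data, num_blocks=4, seed=42)

    # Block-level sampling: take one whole block as the sample.
    print("overall mean:        ", mean(data))
    print("range block 0 mean:  ", mean(by_range[0]))
    print("random block 0 mean: ", mean(by_random[0]))

    # Record-level sampling: one pass over every record.
    print("reservoir sample:    ", reservoir_sample(iter(data), k=5, seed=1))
```

In this sketch, record-level sampling has to scan every record, whereas block-level sampling reads only the chosen block. That is the efficiency argument in the abstract, and the comparison of block means is the representativeness argument.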


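Title	A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis
Source journal	Big Data Mining and Analytics (大数据挖掘与分析(英文))	Discipline	Engineering
Keywords	big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing
Year, volume (issue)	2020, (2)	Journal section
Research area		Page range	85-101
Page count	17	Classification number	TP311.13
Word count		Language
DOI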