Abstract:
The fundamental problem of similarity studies, in the framework of data mining, is to examine and detect similar items in very large articles, papers, and books. In this paper, we are interested in the probabilistic, statistical, and algorithmic aspects of the study of texts. We use the approach of k-shinglings, a k-shingling being defined as a sequence of k consecutive characters extracted from a text (k ≥ 1). The main challenge in this field is to find algorithms that compute the similarity both accurately and quickly. This is achieved by using approximation methods. The first approximation method is statistical and is based on the Glivenko-Cantelli theorem. The second is the banding technique. The third is a modification of the algorithm proposed by Rajaraman et al. ([1]), denoted here as (RUM). The Jaccard index is the similarity measure used throughout this paper. Finally, we illustrate the results of the paper on the four Gospels. The results are very conclusive.
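The notions named in the abstract (k-shinglings, the Jaccard index, minhash-style approximation, and banding) can be illustrated with a minimal Python sketch. This is not the paper's RUM algorithm or its Glivenko-Cantelli estimator. The function names, the salted BLAKE2b hash, and the default of 128 hash functions are assumptions made for illustration only. The banding formula is the standard one for signatures split into `bands` bands of `rows` rows each.

```python
import hashlib


def shingles(text, k):
    """Return the set of k-shingles (runs of k consecutive characters) of text."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index |A n B| / |A u B|; taken as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def _salted_hash(item, seed):
    # Illustrative choice: a 64-bit BLAKE2b digest salted with the seed,
    # so that each seed acts as a different hash function.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=128):
    """Minhash signature: for each hash function, the minimum hash over the set.

    Assumes shingle_set is non-empty.
    """
    return [min(_salted_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates the Jaccard index."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def candidate_probability(similarity, rows, bands):
    """Probability that two sets with the given Jaccard similarity agree on all
    rows of at least one band: 1 - (1 - s**rows)**bands."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


if __name__ == "__main__":
    a = shingles("in the beginning was the word", 3)
    b = shingles("in the beginning was the light", 3)
    print("exact Jaccard:   ", jaccard(a, b))
    print("minhash estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
    print("banding P(candidate) at s=0.8, r=5, b=20:", candidate_probability(0.8, 5, 20))
```

The exact index needs the full shingle sets of both texts, whereas the estimate only compares two fixed-length signatures. That trade of a little accuracy for speed is the motivation the abstract gives for approximation methods.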
Document information
Title: Probabilistic, Statistical and Algorithmic Aspects of the Similarity of Texts and Application to Gospels Comparison
Source journal: Journal of Data Analysis and Information Processing (English)    Discipline: Medicine
Keywords: Similarity; Web Mining; Jaccard Similarity; RU Algorithm; Minhashing; Data Mining; Shingling; Bible's Gospels; Glivenko-Cantelli; Expected Similarity; Statistical Estimation
Year, volume (issue): 2015, (4)
Page range: 112-127
Page count: 16    Classification number: R73
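Journal information
Journal of Data Analysis and Information Processing (English)
Quarterly
ISSN 2327-7211
Optics Valley Headquarters Space, 38 Tangxunhu North Road, Jiangxia District, Wuhan
Articles published: 106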