基本信息来源于合作网站,原文需代理用户跳转至来源网站获取       
摘要:
The documents contain a large amount of valuable knowledge on various subjects and, more recently, documents on the Internet are available from various sources. Therefore, automatic, rapid and accurate classification of these documents with less human interaction has become necessary. In this paper, we introduce a new algorithm called the highest repetition of words in a text document (HRWiTD) to classify the automatic Arabic text. The corpus is divided into a train set and a test set to be applied to proposed classification technique. The train set is analyzed for learning and the learning data is stored in the Learning Dataset file. The category that contains the highest repetition for each word is assigned as a category for the word in Learning Dataset file. This file includes non-duplicate words with the value of higher repetition and categories and they get from all texts in the train set. For each text in the test set, the category of words is assigned to a specific category by using Learning Dataset file. The category that contains the largest number of words is assigned as the predicted category of the text. To evaluate the classification accuracy of the HRWiTD algorithm, the confusion matrix method is used. The HRWiTD algorithm has been applied to convergent samples from six categories of Arabic news at SPA (Saudi Press Agency). As a result, the accuracy of the HRWiTD algorithm is 86.84%. In addition, we used the same corpus with the most popular machine learning algorithms which are C5.0, KNN, SVM, NB and C4.5, and their results of classification accuracy are 52.86%, 52.38%, 51.90%, 51.90% and 30%, respectively. Thus, the HRWiTD algorithm gives better classification accuracy compared to the most popular machine learning algorithms on the selected domain.
内容分析
关键词云
关键词热度
相关文献总数  
(/次)
(/年)
文献信息
篇名 Automatic Arabic Document Classification Based on the HRWiTD Algorithm
来源期刊 软件工程与应用(英文) 学科 医学
关键词 AUTOMATIC TEXT Classification CONFUSION Matrix SPA Machine Learning ALGORITHMS
年,卷(期) rjgcyyyyw_2018,(4) 所属期刊栏目
研究方向 页码范围 167-179
页数 13页 分类号 R73
字数 语种
DOI
五维指标
传播情况
(/次)
(/年)
引文网络
引文网络
二级参考文献  (0)
共引文献  (0)
参考文献  (0)
节点文献
引证文献  (0)
同被引文献  (0)
二级引证文献  (0)
2018(0)
  • 参考文献(0)
  • 二级参考文献(0)
  • 引证文献(0)
  • 二级引证文献(0)
研究主题发展历程
节点文献
AUTOMATIC
TEXT
Classification
CONFUSION
Matrix
SPA
Machine
Learning
ALGORITHMS
研究起点
研究来源
研究分支
研究去脉
引文网络交叉学科
相关学者/机构
期刊影响力
软件工程与应用(英文)
月刊
1945-3116
武汉市江夏区汤逊湖北路38号光谷总部空间
出版文献量(篇)
885
总下载数(次)
0
论文1v1指导