基本信息来源于合作网站,原文需代理用户跳转至来源网站获取       
摘要:
Purpose:With more and more digital collections of various information resources becoming available,also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems.While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification(DDC)classes for Swedish digital collections,the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.Design/methodology/approach:State-of-the-art machine learning algorithms require at least 1,000 training examples per class.The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data(totaling 802 classes in the training and testing sample,out of 14,413 classes at all levels).Findings:Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average;the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task.Word embeddings combined with different types of neural networks(simple linear network,standard neural network,1 D convolutional neural network,and recurrent neural network)produced worse results than Support Vector Machine,but reach close results,with the benefit of a smaller representation size.Impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input.Stemming only marginally improves the results.Removed stop-words reduced accuracy in most cases,while removing less frequent words increased it marginally.The greatest impact is produced by the number of training examples:81.90%accuracy on the training set is achieved when at least 1,000 records per class are available in the training set,and 66.13%when too few recor
推荐文章
内容分析
关键词云
关键词热度
相关文献总数  
(/次)
(/年)
文献信息
篇名 Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches
来源期刊 数据与情报科学学报:英文版 学科 工学
关键词 LIBRIS Dewey Decimal Classification Automatic classification Machine learning Support Vector Machine Multinomial Na?ve Bayes Simple linear network Standard neural network 1D convolutional neural network Recurrent neural network Word embeddings String matching
年,卷(期) 2020,(1) 所属期刊栏目
研究方向 页码范围 18-38
页数 21页 分类号 TP181
字数 语种
DOI
五维指标
传播情况
(/次)
(/年)
引文网络
引文网络
二级参考文献  (0)
共引文献  (0)
参考文献  (0)
节点文献
引证文献  (0)
同被引文献  (0)
二级引证文献  (0)
2020(0)
  • 参考文献(0)
  • 二级参考文献(0)
  • 引证文献(0)
  • 二级引证文献(0)
研究主题发展历程
节点文献
LIBRIS
Dewey
Decimal
Classification
Automatic
classification
Machine
learning
Support
Vector
Machine
Multinomial
Na?ve
Bayes
Simple
linear
network
Standard
neural
network
1D
convolutional
neural
network
Recurrent
neural
network
Word
embeddings
String
matching
研究起点
研究来源
研究分支
研究去脉
引文网络交叉学科
相关学者/机构
期刊影响力
数据与情报科学学报:英文版
季刊
2096-157X
10-1394/G2
北京市中关村北四环西路33号
82-563
出版文献量(篇)
445
总下载数(次)
1
总被引数(次)
0
论文1v1指导