Abstract:
Purpose: The objectives of this study are to explore an effective technique for extracting information from weblogs and to develop an experimental system that uses this technique to extract as much structured information as possible. The system will lay a foundation for the evaluation, analysis, retrieval, and utilization of the extracted information.

Design/methodology/approach: An improved template extraction technique was proposed. Separate templates were established for extracting blog entry titles, posts, and their comments, and structured information was extracted online step by step. A dozen data items, such as entry titles, posts, their commenters and comments, the numbers of views, and the numbers of citations, were extracted from eight major Chinese blog websites, including Sina, Sohu and Bokee.

Findings: Results showed that the average accuracy of the experimental extraction system reached 94.6%. Because an online, multi-threaded extraction technique was adopted, the extraction speed improved to an average of 15 pages per second, not counting network delay. In addition, entries posted using Ajax technology can be extracted successfully.

Research limitations: Because the templates need to be established in advance, this extraction technique can be applied effectively only to a limited range of blog websites. In addition, the stability of the extraction templates depends on the source code of the blog pages.

Practical implications: This paper has studied and established a blog page extraction system, which can be used to extract structured data, preserve and update the data, and facilitate the collection, study, and utilization of blog resources, especially academic blog resources.

Originality/value: This modified template extraction technique outperforms general Web page downloaders and specialized blog page downloaders in that it extracts structured and comprehensive data.
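The template approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the site name `example-blog`, the HTML class names, the regular expressions, and the helper names `extract_entry` and `extract_many` are all hypothetical, chosen only to show one template per data item (title, post, views, comments) applied to pages by a pool of threads. Fetching pages over the network is left out.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical templates for one imaginary site. Each data item has its own
# pattern; real blog pages would need patterns written against their markup.
TEMPLATES = {
    "example-blog": {
        "title": re.compile(r'<h2 class="entry-title">(.*?)</h2>', re.S),
        "post": re.compile(r'<div class="entry-body">(.*?)</div>', re.S),
        "views": re.compile(r'<span class="views">(\d+)</span>'),
        "comment": re.compile(
            r'<li class="comment"><b>(.*?)</b>:\s*(.*?)</li>', re.S
        ),
    }
}


def extract_entry(site, html):
    """Apply the site's templates to one page and return structured fields."""
    templates = TEMPLATES[site]

    def first(field):
        # Return the first captured group for a field, or None if no match.
        match = templates[field].search(html)
        return match.group(1).strip() if match else None

    views = first("views")
    return {
        "title": first("title"),
        "post": first("post"),
        "views": int(views) if views is not None else None,
        "comments": [
            {"commenter": who.strip(), "text": text.strip()}
            for who, text in templates["comment"].findall(html)
        ],
    }


def extract_many(site, pages, workers=4):
    """Extract several already-downloaded pages using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input pages.
        return list(pool.map(lambda html: extract_entry(site, html), pages))


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """
<h2 class="entry-title"> Hello blog </h2>
<div class="entry-body">First post.</div>
<span class="views">42</span>
<ul>
<li class="comment"><b>alice</b>: Nice post</li>
<li class="comment"><b>bob</b>: Thanks</li>
</ul>
"""

if __name__ == "__main__":
    for record in extract_many("example-blog", [SAMPLE_PAGE, SAMPLE_PAGE]):
        print(record)
```

Because every pattern is tied to one site's markup, a change in the page source breaks the template, which matches the stability limitation the abstract notes.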
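Article information
Title: Implementation of a weblog extraction system with an improved template extraction technique
Source journal: 中国文献情报:英文版 (Chinese Journal of Library and Information Science)
Subject: Engineering
Keywords: Weblog (Blog); Web information extraction
Year, volume (issue): 2013, (1)
Page range: 52-63
Page count: 12
Classification number: TP393.092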
Citation network
References: 45
Second-level references: 0
Co-citing documents: 0
Citing documents: 0
Co-cited documents: 0
Second-level citing documents: 0
References by year: 2008 (1), 2009 (3), 2010 (2), 2011 (1), 2012 (3), 2013 (0)
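Journal impact
Journal: 数据与情报科学学报:英文版 (Journal of Data and Information Science)
Frequency: Quarterly
ISSN: 2096-157X
CN: 10-1394/G2
Address: 北京市中关村北四环西路33号 (33 Beisihuan Xilu, Zhongguancun, Beijing)
Postal distribution code: 82-563
Articles published: 445
Total downloads: 1
Total citations: 0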