Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts - Collections

＜conference paper＞
Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Creator	Ikeda, Daisuke (池田, 大輔 / イケダ, ダイスケ); Author PID K000021; Computing and Communications Center, Kyushu University
	Yamada, Yasuhiro (山田, 泰寛 / ヤマダ, ヤスヒロ); Author PID L002646; Graduate School of Information Science and Electrical Engineering, Kyushu University
	Hirokawa, Sachio (廣川, 佐千男 / ヒロカワ, サチオ); Author PID K000008; Computing and Communications Center, Kyushu University
Language	English
Publisher	Springer
Date	2001-11
Source Title	Lecture Notes in Computer Science
Vol	2226
First Page	113
Last Page	127
Publication Type	Accepted Manuscript
Access Rights	open access
Rights	© 2001 Springer
Related URI	http://www.i.kyushu-u.ac.jp/index.html
Related URI	isIdenticalTo http://springerlink.com/content/x76jcuew8np2q09b/?p=f4146f31080b4fd1a8d2afc89fd45b25&pi=0
Relation	Lecture Notes in Computer Science, vol. 2226, pp. 113-127
Abstract	We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of n-gram length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
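
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character n-grams, treats an n-gram as frequent when its total count across the collection reaches a hypothetical threshold `min_freq`, and computes the alternation count for one fixed pair of length and frequency. The procedure the paper uses to choose that pair is not reproduced here, and all function names are illustrative.

```python
from collections import Counter
from itertools import groupby


def ngram_counts(docs, n):
    """Count how often each character n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def useless_mask(doc, counts, n, min_freq):
    """Mark every character covered by an n-gram seen at least min_freq times."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= min_freq:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless positions."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a != b)


def non_useless_segments(doc, mask):
    """Return the contiguous runs of characters that are not marked useless."""
    segments = []
    pos = 0
    for useless, run in groupby(mask):
        length = len(list(run))
        if not useless:
            segments.append(doc[pos:pos + length])
        pos += length
    return segments


# Two toy documents sharing the same markup (assumed example data).
docs = [
    "<h1>Rain expected</h1><p>Storms arrive Monday.</p>",
    "<h1>Markets climb</h1><p>Tech shares rose.</p>",
]
n, min_freq = 4, 2  # one fixed (length, frequency) pair, chosen by hand
counts = ngram_counts(docs, n)
masks = [useless_mask(d, counts, n, min_freq) for d in docs]
alternations = [alternation_count(m) for m in masks]
segments = [non_useless_segments(d, m) for d, m in zip(docs, masks)]
```

With these toy inputs the shared markup is marked useless, each document has an alternation count of 4, and the remaining segments are the headline and the body text. The trailing period is dropped because `.</p` is itself a frequent 4-gram, which shows how the choice of length and frequency affects what survives.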

File	FileType	Size	Views	Description
DS01_alternation	pdf	152 KB	630

Details

Record ID	6078
Peer-Reviewed	Unrefereed
Related URI	http://springerlink.com/content/x76jcuew8np2q09b/?p=f4146f31080b4fd1a8d2afc89fd45b25&pi=0
Subject Terms	Computer Science
	Artificial Intelligence
	Web mining
ISSN	0302-9743
NCID	BA54461995
Notes	Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001. Proceedings
Type	Conference paper
Created Date	2009.04.22
Modified Date	2020.11.02
