Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts - Kyushu University Collection (九大コレクション)

<Conference Paper>
Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Creator	Author ID: 100021285; Name: Ikeda, Daisuke (池田, 大輔 / イケダ, ダイスケ); Affiliation: Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
	Author ID: L002646; Name: Yamada, Yasuhiro (山田, 泰寛 / ヤマダ, ヤスヒロ); Affiliation: Graduate School of Information Science and Electrical Engineering, Kyushu University (九州大学システム情報科学府)
	Author ID: K000008; Name: Hirokawa, Sachio (廣川, 佐千男 / ヒロカワ, サチオ); Affiliation: Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
Language	English
Publisher	Springer
Date of issue	2001-11
Source title	Lecture Notes in Computer Science
Volume	2226
Start page	113
End page	127
Publication type	Accepted Manuscript
Access rights	open access
Rights	© 2001 Springer
Related information	Lecture Notes in Computer Science || 2226 || p113-127
Related URI	http://www.i.kyushu-u.ac.jp/index.html
Related URI	Same as http://springerlink.com/content/x76jcuew8np2q09b/?p=f4146f31080b4fd1a8d2afc89fd45b25&pi=0
Abstract	We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of n-gram length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, together with some non-articles, the algorithm eliminates frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
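As an illustration of the idea in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes character-level n-grams, counts occurrences across all input documents, and treats an n-gram as useless once its count reaches a threshold. The function names (`mark_useless`, `alternation_count`, `extract`) and the parameter `min_freq` are hypothetical, and the sketch does not reproduce how the paper selects the length and frequency pair from the alternation counts.

```python
from collections import Counter


def mark_useless(docs, n, min_freq):
    """Return one boolean mask per document.

    A position is True (useless) if it is covered by an n-gram that
    occurs at least min_freq times across all documents.
    Assumption: character-level n-grams and raw occurrence counts.
    """
    counts = Counter(
        doc[i:i + n] for doc in docs for i in range(len(doc) - n + 1)
    )
    masks = []
    for doc in docs:
        mask = [False] * len(doc)
        for i in range(len(doc) - n + 1):
            if counts[doc[i:i + n]] >= min_freq:
                for j in range(i, i + n):
                    mask[j] = True
        masks.append(mask)
    return masks


def alternation_count(mask):
    """Count switches between useless and non-useless positions."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a != b)


def extract(doc, mask):
    """Keep only the characters not marked as useless."""
    return "".join(ch for ch, useless in zip(doc, mask) if not useless)


docs = ["<h1>Rain today</h1>", "<h1>Snow later</h1>"]
masks = mark_useless(docs, n=4, min_freq=2)
print(alternation_count(masks[0]))  # 2
print(extract(docs[0], masks[0]))   # Rain today
```

In this toy input the shared markup is marked useless and the text between the tags survives, giving two alternations per document. One plausible use, again an assumption, is to compute the alternation count for several candidate pairs of `n` and `min_freq` and compare them.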

Full-text file

File	File type	Size	View count	Description
DS01_alternation	pdf	152 KB	789

Details

Record ID	6078
Peer review	Not peer reviewed
Related URI	http://springerlink.com/content/x76jcuew8np2q09b/?p=f4146f31080b4fd1a8d2afc89fd45b25&pi=0
Subject	Computer Science
	Artificial Intelligence
	Web mining
ISSN	0302-9743
NCID	BA54461995
Note	Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001. Proceedings
Type	Conference Paper
Date registered	2009.04.22
Date updated	2020.11.02
