<Conference Paper>
Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Abstract We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of n-gram length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
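
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes character n-grams, assumes "frequent" means "occurs in at least `min_docs` documents", and uses hypothetical function names (`ngram_doc_frequency`, `useless_mask`, `alternation_count`, `split_documents`). The abstract does not say how the alternation count is used to pick the length and frequency pair, so the sketch only computes the count for one given pair.

```python
from collections import Counter


def ngram_doc_frequency(docs, n):
    """Count, for each character n-gram, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        # A set, so each n-gram is counted once per document (an assumption).
        counts.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return counts


def useless_mask(doc, n, frequent):
    """Mark every character covered by at least one frequent n-gram."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if doc[i:i + n] in frequent:
            mask[i:i + n] = [True] * n
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless runs."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a != b)


def split_documents(docs, n, min_docs):
    """Return (non-useless text, alternation count) for each document."""
    counts = ngram_doc_frequency(docs, n)
    frequent = {gram for gram, c in counts.items() if c >= min_docs}
    results = []
    for doc in docs:
        mask = useless_mask(doc, n, frequent)
        kept = "".join(ch for ch, useless in zip(doc, mask) if not useless)
        results.append((kept, alternation_count(mask)))
    return results


if __name__ == "__main__":
    pages = ["<h1>Rain today</h1>", "<h1>Stocks fall</h1>"]
    for kept, alternations in split_documents(pages, n=4, min_docs=2):
        print(repr(kept), alternations)
```

In this toy input the shared markup is the only material repeated across documents, so it is masked out and each page has two alternations (markup, content, markup). Comparing the counts obtained for different length and frequency pairs is one plausible way such a statistic could guide parameter choice, though the selection rule itself is not given in the record above.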

Full-text file

pdf DS01_alternation pdf 152 KB 678  

Details

Registered 2009.04.22
Updated 2020.11.02
