<Conference Paper>
Gathering Text Files Generated from Templates

Abstract: Information integration comprises three steps: data discovery, information extraction, and information integration. In this paper, we focus on the data discovery step, which is crucial for the following steps. We first define data discovery from the viewpoint of information extraction. The problem is, given a large number of files, to find sets of files such that the files in each set share a template. Each set corresponds to one template, and multiple templates may be hidden in the given files. We exploit a linear-time algorithm that the authors originally developed for the common parts detection problem. The algorithm found different templates in collected Web pages that included many noise files. The files can then be clustered according to the templates found, and the files of each cluster are used as input data for an information extraction algorithm.
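To make the problem statement concrete, the following is a minimal sketch of template-based grouping. It is not the authors' linear-time algorithm, which this record does not describe. It assumes that files generated from the same template share a number of identical lines, compares files pairwise (quadratic in the number of files), and treats files that match nothing as noise. The function name `cluster_by_shared_lines`, the `min_shared` threshold, and the sample file names are all hypothetical.

```python
from itertools import combinations


def cluster_by_shared_lines(files, min_shared=3):
    """Group files whose texts share at least `min_shared` identical lines.

    `files` maps a file name to its text. This is a naive pairwise
    comparison, NOT the linear-time algorithm mentioned in the abstract.
    """
    # Represent each file as the set of its non-empty, stripped lines.
    line_sets = {
        name: {line.strip() for line in text.splitlines() if line.strip()}
        for name, text in files.items()
    }

    # Union-find over file names, used to merge files that look related.
    parent = {name: name for name in files}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(files, 2):
        if len(line_sets[a] & line_sets[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in files:
        clusters.setdefault(find(name), []).append(name)

    # A file left alone shares no assumed template with any other: noise.
    return sorted(sorted(group) for group in clusters.values() if len(group) > 1)


# Hypothetical sample input: two templates plus one noise file.
pages = {
    "a1.html": "<html>\n<h1>Shop</h1>\n<p>Price:</p>\nApple\n</html>",
    "a2.html": "<html>\n<h1>Shop</h1>\n<p>Price:</p>\nPear\n</html>",
    "b1.html": "BEGIN\nName:\nAge:\nAlice\nEND",
    "b2.html": "BEGIN\nName:\nAge:\nBob\nEND",
    "noise.txt": "just some unrelated text",
}
print(cluster_by_shared_lines(pages))
```

Each returned group could then serve as the input of an information extraction step, as the abstract describes. Exact line matching is a deliberately crude stand-in for common parts detection and would miss templates whose shared parts do not align with line boundaries.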

Full-Text File

pdf 14 pdf 957 KB 429  

Details

Date registered: 2009.04.22
Date updated: 2020.10.13
