Creator |
|
|
Language |
|
Date issued |
|
Source title |
|
Publication type |
|
Access rights |
|
Related DOI |
|
Related DOI |
|
|
Related URI |
|
|
Related URI |
|
Related HDL |
|
Related information |
|
|
Abstract |
Information integration comprises three steps: data discovery, information extraction, and information integration. In this paper, we focus on the data discovery step, which is crucial for the following steps. We first define what data discovery is from the viewpoint of information extraction. The problem is, given a large number of files, to find sets of files such that the files in each set share a template. Each set corresponds to a template, and multiple templates may be hidden in the given files. We exploit a linear-time algorithm that the authors originally developed for the common parts detection problem. The algorithm found different templates in collected Web pages that included many noise files. We can cluster files according to the found templates. The files of a cluster are used as input data for an information extraction algorithm.
|
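The problem stated in the abstract (group files so that each group shares a template, while ignoring noise files) can be illustrated with a minimal sketch. This is not the authors' linear-time common parts detection algorithm, which the record does not describe; it swaps in a simple greedy heuristic that compares sets of lines using Jaccard similarity. The function names, the `threshold` and `min_size` parameters, and the line-based notion of a "template" are all assumptions made for illustration.

```python
def line_set(text):
    """Return the set of non-empty, stripped lines in a file's text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_template(files, threshold=0.5, min_size=2):
    """Greedily group files whose lines overlap a cluster's shared lines.

    files: dict mapping a file name to its text (hypothetical input shape).
    Returns a list of (shared_lines, file_names) pairs. Clusters with fewer
    than min_size members are treated as noise and dropped.
    """
    clusters = []  # each entry: [shared line set, list of member names]
    for name, text in files.items():
        lines = line_set(text)
        for cluster in clusters:
            if jaccard(lines, cluster[0]) >= threshold:
                cluster[0] &= lines  # keep only lines common to all members
                cluster[1].append(name)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([set(lines), [name]])
    return [(shared, names) for shared, names in clusters
            if len(names) >= min_size]
```

Under these assumptions, the shared line set of each returned cluster plays the role of the found template, and the member files of a cluster are what would be passed on to an information extraction step. Because the shared set shrinks as members join, the result depends on input order, which is one reason this heuristic should be read only as an illustration of the problem statement.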