<conference paper>
Gathering Text Files Generated from Templates

Creator
Language
Date
Source Title
Publication Type
Access Rights
Related DOI
Related DOI
Related URI
Related URI
Related HDL
Relation
Abstract Information integration comprises three steps: data discovery, information extraction, and information integration. In this paper, we focus on the data discovery step, which is crucial for the following steps. We first define what data discovery is from the viewpoint of information extraction. The problem is, given a large number of files, to find sets of files such that the files in each set share a template. Each set corresponds to one template, and multiple templates may be hidden in the given files. We exploit a linear-time algorithm that was originally developed by the authors for the common parts detection problem. The algorithm found different templates in collected Web pages that included many noise files. We can cluster files according to the found templates. The files of a cluster are used as input data for an information extraction algorithm.
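
The problem statement in the abstract can be illustrated with a small sketch. It is not the authors' linear-time common parts detection algorithm, which this record does not describe. It is a naive greedy grouping that treats identical lines as a stand-in for a shared template. The function name `cluster_by_shared_lines`, the `min_shared` threshold, and the sample files are all hypothetical, chosen only for illustration.

```python
def cluster_by_shared_lines(files, min_shared=2):
    """Naive illustration (assumption, not the paper's method).

    files: dict mapping a file name to its text.
    A file joins the first cluster with which it still shares at least
    `min_shared` identical non-empty lines; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"common": set of lines, "members": [names]}
    for name, text in files.items():
        lines = {ln.strip() for ln in text.splitlines() if ln.strip()}
        for cluster in clusters:
            shared = cluster["common"] & lines
            if len(shared) >= min_shared:
                # Shrink the candidate template to what all members share.
                cluster["common"] = shared
                cluster["members"].append(name)
                break
        else:
            clusters.append({"common": lines, "members": [name]})
    # Files that share nothing with any other file are treated as noise.
    return [c for c in clusters if len(c["members"]) > 1]


# Hypothetical input: two templates plus one noise file.
files = {
    "a.html": "<html>\n<h1>Product</h1>\nPrice: 10\n</html>",
    "b.html": "<html>\n<h1>Product</h1>\nPrice: 25\n</html>",
    "n.txt": "random notes\nnothing shared",
    "c.txt": "Name:\nAlice\nEmail:\nalice@example.org",
    "d.txt": "Name:\nBob\nEmail:\nbob@example.org",
}
clusters = cluster_by_shared_lines(files)
for cluster in clusters:
    print(cluster["members"], sorted(cluster["common"]))
```

Each returned cluster plays the role of one set of files in the problem statement, and its `common` lines approximate the hidden template. This greedy pass depends on input order and compares whole lines only, so it should be read as a picture of the task rather than a substitute for the algorithm used in the paper.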

Details

Record ID
Peer-Reviewed
Related URI
Subject Terms
Notes
Type
Created Date 2009.04.22
Modified Date 2020.10.13