<conference paper>
Automatic Wrapper Generation for Multilingual Web Resources

Creator
Language
Publisher
Date
Source Title
Vol
First Page
Last Page
Publication Type
Access Rights
Rights
Related DOI
Related DOI
Related URI
Related URI
Related HDL
Relation
Abstract We present a wrapper generation system to extract contents of semi-structured documents which contain instances of a record. The generation is done automatically using general assumptions on the struc...ture of instances. It outputs a set of pairs of left and right delimiters surrounding instances of a field. In addition to input documents, our system also receives a set of symbols with which a delimiter must begin or end. Our system treats semi-structured documents just as strings so that it does not depend on markup and natural languages. It does not require any training examples which show where instances are. We show experimental results on both static and dynamic pages which are gathered from 13 Web sites, markuped in HTML or XML, and written in four natural languages. In addition to usual contents, generated wrappers extract useful information hidden in comments or tags which are ignored by other wrapper generation algorithms. Some generated delimiters contain whitespaces or multibyte characters.show more

Hide fulltext details.

pdf DS02 pdf 133 KB 584  

Details

Record ID
Peer-Reviewed
Related URI
Subject Terms
ISSN
NCID
Notes
Type
Created Date 2009.04.22
Modified Date 2020.11.02

People who viewed this item also viewed