<Conference Paper>
Expressive Power of Tree and String Based Wrappers

Author
Language
Publisher
Date of issue
Source title
Start page
End page
Publication type
Access rights
Related DOI
Related URI
Related information
Abstract There exist two types of wrappers: the string based wrapper, such as the LR wrapper, and the tree based wrapper. A tree based wrapper designates extraction regions by nodes on the trees of semistructured documents. The tree based wrapper seems to be more powerful than the string based one. There exist, however, many HTML documents on the Web from which a standard tree based wrapper fails to extract contents because they are structured by presentational tags, punctuation symbols, and white spaces. Moreover, some of these documents use multibyte characters for structuring. To handle some of these documents, we propose automatic wrapper generation that is based on common substring detection and uses input documents without any modification. In this framework, parts of text elements, including white spaces and multibyte characters, can be part of a wrapper. We show the superiority of such wrappers over usual wrappers created after documents are parsed and modified. However, there still exist HTML documents from which wrappers with text elements fail to extract contents. Thus, we propose another class of wrappers, called the regional tree wrapper, which utilizes the tree structures of input documents as well as addressing functions on strings.

Full-text file

pdf 2003_a_2 pdf 63.3 KB 280  

Details

Record ID
Peer reviewed
Subject
Notes
Type
Date registered 2009.04.22
Date updated 2018.08.31
