Expressive Power of Tree and String Based Wrappers - Kyushu University Collections | Kyushu University Library

＜Conference Paper＞
Expressive Power of Tree and String Based Wrappers

Creator	Author ID K000021 Name Ikeda, Daisuke (池田, 大輔) Affiliation Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
	Author ID L002646 Name Yamada, Yasuhiro (山田, 泰寛) Affiliation Graduate School of Information Science and Electrical Engineering, Kyushu University (九州大学システム情報科学府)
	Author ID K000008 Name Hirokawa, Sachio (廣川, 佐千男) Affiliation Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
Language	English
Publisher	International Joint Conference on Artificial Intelligence
Date issued	2003-08
Source title	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)
Start page	21
End page	26
Publication type	Version of Record
Access rights	open access
Related information	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 21-26
Related URI	http://matu.cc.kyushu-u.ac.jp/
Abstract	There exist two types of wrappers: string based wrappers, such as the LR wrapper, and tree based wrappers. A tree based wrapper designates extraction regions by nodes on the trees of semistructured documents. The tree based wrapper seems to be more powerful than the string based one. There exist, however, many HTML documents on the Web for which a standard tree based wrapper fails to extract contents because they are structured by presentational tags, punctuation symbols, and white spaces. Moreover, some such documents use multi-byte characters for structuring. To treat some of these documents, we propose automatic wrapper generation that is based on common substring detection and uses input documents without any modification. In this framework, some text elements, including white spaces and multi-byte characters, can be part of a wrapper. We show the superiority of such wrappers over usual wrappers created after documents are parsed and modified. However, there still exist HTML documents for which wrappers with text elements fail to extract contents. Thus, we propose another class of wrappers, called the regional tree wrapper, which utilizes the tree structures of input documents as well as addressing functions on strings.
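
As a rough illustration of the string based wrapper mentioned in the abstract, the following minimal sketch extracts every region that lies between a left and a right delimiter string, which is the general idea behind an LR wrapper. It is not the authors' implementation: the function name `lr_extract`, the delimiters, and the sample page fragment are all hypothetical and chosen only for illustration.

```python
from typing import List


def lr_extract(document: str, left: str, right: str) -> List[str]:
    """Return each substring found between `left` and the next `right`.

    Hypothetical helper for illustration only; it works on the raw
    string and never parses the document into a tree.
    """
    if not left or not right:
        raise ValueError("delimiters must be non-empty")
    results = []
    pos = 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = document.find(right, start)
        if end == -1:
            break
        results.append(document[start:end])
        pos = end + len(right)
    return results


# Hypothetical fragment structured only by a presentational tag,
# punctuation, and white space, with no tag enclosing each record.
page = "<br>Title: Alpha; <br>Title: Beta; <br>Title: Gamma; "
titles = lr_extract(page, left="Title: ", right=";")
```

Because the delimiters here are ordinary text and punctuation rather than element boundaries, a wrapper that addresses only tree nodes would have no node corresponding to each title in this fragment.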

Full-text file

File	File type	Size	Views	Description
2003_a_2	pdf	63.3 KB	280

Details

Record ID	2960
Peer review	Peer-reviewed
Subject	Wrapping and Extracting
Subject	Wrapper generation (ラッパー生成)
Note	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03) August 9-10, 2003 Acapulco, Mexico
Type	Conference paper
Date registered	2009.04.22
Date updated	2018.08.31