＜conference paper＞
Expressive Power of Tree and String Based Wrappers

Creator	Ikeda, Daisuke (池田, 大輔), Author PID K000021, Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
	Yamada, Yasuhiro (山田, 泰寛), Author PID L002646, Graduate School of Information Science and Electrical Engineering, Kyushu University (九州大学システム情報科学府)
	Hirokawa, Sachio (廣川, 佐千男), Author PID K000008, Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
Language	English
Publisher	International Joint Conference on Artificial Intelligence
Date	2003-08
Source Title	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)
First Page	21
Last Page	26
Publication Type	Version of Record
Access Rights	open access
Relation	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 21-26
Related URI	http://matu.cc.kyushu-u.ac.jp/
Abstract	There exist two types of wrappers: the string based wrapper, such as the LR wrapper, and the tree based wrapper. A tree based wrapper designates extraction regions by nodes on the trees of semistructured documents. The tree based wrapper seems to be more powerful than the string based one. There exist, however, many HTML documents on the Web for which a standard tree based wrapper fails to extract contents because they are structured by presentational tags, punctuation symbols, and white spaces. Moreover, some of these documents use multibyte characters for structuring. To handle such documents, we propose automatic wrapper generation based on common substring detection, using input documents without any modification. In this framework, text elements, including white spaces and multibyte characters, can be part of a wrapper. We show the superiority of such wrappers over usual wrappers, which are created after documents are parsed and modified. However, there still exist HTML documents for which wrappers with text elements fail to extract contents. Thus, we propose another class of wrappers, called the regional tree wrapper, which utilizes the tree structures of input documents as well as addressing functions on strings.

File	FileType	Size	Views	Description
2003_a_2	pdf	63.3 KB	291

Details

Record ID	2960
Peer-Reviewed	Refereed
Subject Terms	Wrapping and Extracting
Subject Terms	Wrapper generation
Notes	Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico
Type	Conference paper
Created Date	2009.04.22
Modified Date	2018.08.31