Automatic Wrapper Generation for Multilingual Web Resources - Collections | Kyushu University Library

＜departmental bulletin paper＞
Automatic Wrapper Generation for Multilingual Web Resources

Creator	Creator Name: 池田, 大輔 / Ikeda, Daisuke / イケダ, ダイスケ; Author PID: K000021; Affiliation: 九州大学情報基盤センター / Computing and Communications Center, Kyushu University
	Creator Name: 山田, 泰寛 / Yamada, Yasuhiro / ヤマダ, ヤスヒロ; Affiliation: 九州大学システム情報科学府 / Graduate School of Information Science and Electrical Engineering, Kyushu University
	Creator Name: 廣川, 佐千男 / Hirokawa, Sachio / ヒロカワ, サチオ; Author PID: 40126785; Affiliation: 九州大学情報基盤センター / Computing and Communications Center, Kyushu University
Language	Japanese
Publisher	九州大学情報基盤センター
Publisher	Computing and Communications Center, Kyushu University
Date	2003-03
Source Title
Vol	3
First Page	7
Last Page	14
Publication Type	Version of Record
Access Rights	open access
JaLC DOI	https://doi.org/10.15017/4784011
Abstract	本稿では、半構造化テキストデータからコンテンツ部分を抽出するラッパーを自動生成するシステムを提案する。入力として、テキストデータ以外にコンテンツを囲む区切り文字の最初と最後に現われ得る文字の集合を与えるものとする。入力テキストに対して同種のコンテンツ（レコード）が複数回現われるものと仮定するほかは、特に背景知識等は不要であり、入力に対し全自動でラッパーの生成を行なう。システムは、コンテンツの種類（...フィールド）ごとに左と右区切り文字のペアを出力する。半構造化テキストデータを単なる文字列として扱うので、入力は任意のマークアップ言語や自然言語で書かれていて構わない。様々な言語で書かれたWebページを対象とした実験によりその有効性を示す。マークアップ言語はXMLとHTMLで、4つの自然言語で書かれており、検索機能により動的に生成されたものもあれば、静的なページもある。本システムでは、XMLやHTMLのコメントやタグの属性なども通常の文字として扱うが、ある実験では、通常のコンテンツ部分だけでなく、コメントやタグの内部から有用な情報を抽出することもできた。また、タグにマルチバイト文字が含まれているようなデータでも問題なく扱える。
	We present a wrapper generation system that extracts the contents of semi-structured documents. In addition to the input documents, the system receives a set of symbols with which a delimiter string must begin or end. We assume only that the input documents contain multiple instances of a record; no other background knowledge is required, and wrapper generation is fully automatic. The system outputs a set of pairs of left and right delimiters, each of which surrounds instances of a field. Because the system treats semi-structured documents simply as strings, it does not depend on any particular markup language or natural language. We show experimental results on text files marked up in HTML or XML, whose contents are written in four natural languages. Some of them are dynamic pages, that is, pages produced by a search facility, and the others are static pages. In addition to ordinary content, some generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
Table of Contents	1 はじめに (Introduction) 2 アーキテクチャ (Architecture) 3 実験結果 (Experimental Results) 4 おわりに (Conclusion)
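
The abstract describes the system's output as pairs of left and right delimiters, one pair per field. The following is a minimal sketch of how such a wrapper might be applied to a document treated as a plain string. It does not reproduce the paper's generation algorithm, which the abstract does not detail; the function name `apply_wrapper`, the field names, the delimiter strings, and the sample page are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical wrapper representation: field name -> (left delimiter, right delimiter).
Wrapper = Dict[str, Tuple[str, str]]


def apply_wrapper(text: str, wrapper: Wrapper) -> Dict[str, List[str]]:
    """Return, for each field, every substring enclosed by its delimiter pair.

    The text is scanned as a raw string, so tags, comments, and multibyte
    characters are handled like any other characters.
    """
    results: Dict[str, List[str]] = {}
    for field, (left, right) in wrapper.items():
        values: List[str] = []
        pos = 0
        while True:
            start = text.find(left, pos)
            if start == -1:
                break
            start += len(left)
            end = text.find(right, start)
            if end == -1:
                break
            values.append(text[start:end])
            pos = end + len(right)
        results[field] = values
    return results


# Illustrative input with two record instances (not taken from the paper).
page = (
    '<li><a href="/a">Alpha</a><!-- id:1 --></li>'
    '<li><a href="/b">ベータ</a><!-- id:2 --></li>'
)

# Assumed delimiter pairs; a real wrapper would be produced by the generator.
wrapper = {
    "url": ('<a href="', '">'),
    "title": ('">', '</a>'),
    "hidden_id": ('<!-- id:', ' -->'),
}

extracted = apply_wrapper(page, wrapper)
```

Because the delimiters are ordinary strings, the same routine reaches content inside an attribute value or an HTML comment, which is the kind of hidden information the abstract mentions.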

File	FileType	Size	Views	Description
3_p007	pdf	6.69 MB	83

Details

PISSN	1346-9010
NCID	AA1158221X
Record ID	4784011
Created Date	2022.05.16
Modified Date	2023.08.18