＜journal article＞
Detecting Common Parts and Wrapper Generation for Multilingual Web Documents Using Alternation Counts (Data Mining/Data Warehousing)

Creator	Creator Name: YAMADA YASUHIRO 山田泰寛; Affiliation: Department of Informatics, Kyushu University 九州大学大学院システム情報科学府
	Author PID: 00294992; Creator Name: IKEDA DAISUKE 池田大輔; Affiliation: Kyushu University Library 九州大学附属図書館
	Author PID: 40126785; Creator Name: HIROKAWA SACHIO 廣川佐千男; Affiliation: Computing and Communications Center, Kyushu University 九州大学情報基盤センター
Language	Japanese
Publisher	Information Processing Society of Japan (IPSJ)
Publisher	一般社団法人情報処理学会
Date	2004-09-15
Source Title	IPSJ Journal
Source Title	情報処理学会論文誌
Vol	45
Issue	9
First Page	2138
Last Page	2145
Publication Type	Version of Record
Access Rights	open access
Rights	(C) 2004 by the Information Processing Society of Japan
Related URI	isIdenticalTo http://ci.nii.ac.jp/naid/110002712261
Abstract	We propose an algorithm that generates a wrapper for extracting the contents of Web pages written with the same template. First, the algorithm separates each page into template parts and content parts using an alternation count. The alternation count with respect to (n, a%) is the total number of boundaries between frequent regions and non-frequent regions, where n is the substring length and a frequent region is a run of consecutive occurrences of length-n substrings whose frequency ranks within the top a%. The algorithm searches for an (n, a) at which the alternation count is a local minimum, then takes the regions where frequent substrings appear as the template parts. Next, it determines the pairs of strings that enclose each content item, assuming that the first or last character of a template part is one of ">", "<", newline, tab, or space, and generates the wrapper from these pairs. The algorithm uses neither preprocessing that depends on the markup language or natural language nor site-specific knowledge. Experiments on page sets written in four natural languages and marked up in two markup languages (HTML and XML) show that the algorithm achieves high recall.
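The alternation count described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how ties in frequency are broken, how the top a% cutoff is rounded, or how several pages are combined, so the sketch assumes a single input string, keeps at least one substring, and marks a character as frequent when any frequent length-n substring covers it. The function names frequent_substrings, frequent_mask, and alternation_count are hypothetical.

```python
from collections import Counter


def frequent_substrings(text, n, a):
    """Return the length-n substrings whose frequency ranks in the top a percent.

    Assumption: the cutoff is rounded down and at least one substring is kept.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [s for s, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * a / 100)) if ranked else 0
    return set(ranked[:keep])


def frequent_mask(text, n, a):
    """Mark every character covered by at least one frequent substring."""
    frequent = frequent_substrings(text, n, a)
    mask = [False] * len(text)
    for i in range(len(text) - n + 1):
        if text[i:i + n] in frequent:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(text, n, a):
    """Count boundaries between frequent and non-frequent regions."""
    mask = frequent_mask(text, n, a)
    return sum(1 for left, right in zip(mask, mask[1:]) if left != right)


if __name__ == "__main__":
    sample = "ababXYabab"
    # Only "ab" is frequent, so "XY" forms one non-frequent region
    # with a boundary on each side.
    print(alternation_count(sample, 2, 20))
```

Under these assumptions, a search for a local minimum would evaluate alternation_count over a grid of (n, a) values and pick a pair whose count is no larger than that of its neighbors; the regions where the mask is True would then be treated as template parts.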

File	FileType	Size	Views	Description
hirokawa_221	pdf	240 KB	266

Details

Record ID	1284598
Peer-Reviewed	Refereed
Related URI	http://ci.nii.ac.jp/naid/110002712261
ISSN	03875806
NCID	AN00116647
Notes	Use is limited to the scope permitted by copyright.
Created Date	2013.12.09
Modified Date	2023.07.28
