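A Method for Identifying Common Parts in Multilingual Web Text and Generating Wrappers Using Alternation Counts (Data Mining) - 九大コレクション (Kyushu University Collection)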

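<Journal Article>
A Method for Identifying Common Parts in Multilingual Web Text and Generating Wrappers Using Alternation Counts (Data Mining) (original title: 交代数を用いた多言語Webテキストからの共通部分特定とラッパーの生成法(データマイニング))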

Creator	Name: YAMADA YASUHIRO (山田泰寛); Affiliation: Department of Informatics, Kyushu University (九州大学大学院システム情報科学府)
	Author identifier: 00294992; Name: IKEDA DAISUKE (池田大輔); Affiliation: Kyushu University Library (九州大学附属図書館)
	Author identifier: 40126785; Name: HIROKAWA SACHIO (廣川佐千男); Affiliation: Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
Language of text	Japanese
Publisher	Information Processing Society of Japan (IPSJ; 一般社団法人情報処理学会)
Date of issue	2004-09-15
Source title	IPSJ Journal (情報処理学会論文誌)
Volume	45
Issue	9
Start page	2138
End page	2145
Publication type	Version of Record
Access rights	open access
Rights	(C) 2004 by the Information Processing Society of Japan
Related URI	Same as: http://ci.nii.ac.jp/naid/110002712261
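Abstract	We propose an algorithm that generates a wrapper for extracting the contents of Web pages written with the same template. First, the algorithm separates each page into template parts and content parts using an alternation count. The alternation count with respect to (n, a%) is the total number of boundaries between frequent and non-frequent parts, where n is the length of a substring and a is a frequency percentage: a substring of length n counts as frequent when its frequency is within the top a%. The algorithm searches for a local minimum of the alternation count over (n, a) and then identifies the template parts as the regions in which frequent substrings appear. Next, the algorithm determines the pairs of strings that enclose the contents, assuming that the first or last character of a template part is one of the symbols ">", "<", newline, tab, and space. The algorithm uses neither preprocessing that depends on markup or natural languages nor site-specific knowledge. Experiments show that the algorithm works well for inputs written in four natural languages and marked up in HTML and XML.
	(Translated from the Japanese abstract) This paper proposes an algorithm that automatically generates a wrapper for extracting each item from Web pages that contain many items of the same kind. The method first looks at the occurrence frequencies of substrings and uses an index called the alternation count to distinguish template parts from content parts. For a substring length n and a frequency percentage a%, the alternation count is the total number of boundaries between regions in which substrings of length n whose frequency falls within the top a% appear consecutively and regions in which they do not. The method finds the (n, a) at which the alternation count is a local minimum and takes the regions where high-frequency substrings appear as the template parts. Next, using the fact that the first or last character of a template is a characteristic character such as ">", "<", newline, tab, or space, it identifies the pairs of strings that enclose each item and generates the wrapper from these pairs. The method uses neither preprocessing that depends on natural or markup languages nor special knowledge about individual sites. In the experiments, sets of pages in four natural languages and two markup languages were evaluated, and the method was confirmed to achieve high recall.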
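The alternation count described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`frequent_ngrams`, `frequent_mask`, `alternation_count`, `template_regions`) are hypothetical, and two details are assumptions made here because the abstract does not fix them: "top a%" is read as the top a% of distinct length-n substrings ranked by occurrence count, and a character position is treated as frequent when any occurrence of a frequent substring covers it. The search for a local minimum over (n, a) and the wrapper generation step are not shown.

```python
from __future__ import annotations

from collections import Counter
from math import ceil


def frequent_ngrams(text: str, n: int, a: float) -> set[str]:
    """Distinct length-n substrings ranked in the top a% by count (assumed reading)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if not counts:
        return set()
    # Keep at least one substring so that a small a still selects something.
    keep = max(1, ceil(len(counts) * a / 100))
    # most_common() sorts by count; ties keep first-seen order.
    return {gram for gram, _ in counts.most_common(keep)}


def frequent_mask(text: str, n: int, a: float) -> list[bool]:
    """Mark every character covered by an occurrence of a frequent substring."""
    frequent = frequent_ngrams(text, n, a)
    mask = [False] * len(text)
    for i in range(len(text) - n + 1):
        if text[i:i + n] in frequent:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(text: str, n: int, a: float) -> int:
    """Number of boundaries between frequent and non-frequent regions."""
    mask = frequent_mask(text, n, a)
    return sum(1 for left, right in zip(mask, mask[1:]) if left != right)


def template_regions(text: str, n: int, a: float) -> list[str]:
    """Maximal runs of frequent characters, taken here as candidate template parts."""
    mask = frequent_mask(text, n, a)
    regions, start = [], None
    # The trailing False closes a run that reaches the end of the text.
    for i, flag in enumerate(mask + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append(text[start:i])
            start = None
    return regions


if __name__ == "__main__":
    sample = "abXYabZWab"
    print(alternation_count(sample, 2, 10))   # 4
    print(template_regions(sample, 2, 10))    # ['ab', 'ab', 'ab']
```

In the toy input, "ab" is the only frequent substring of length 2, so the text alternates between frequent and non-frequent regions four times. On real pages the same computation would be repeated over a grid of (n, a) values to look for a local minimum, as the abstract describes.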

Full-text file

File	File type	Size	Views	Description
hirokawa_221	pdf	240 KB	321

Details

Record ID	1284598
Peer review	Peer reviewed
Related URI	http://ci.nii.ac.jp/naid/110002712261
ISSN	03875806
NCID	AN00116647
Note	Use is limited to the scope permitted by copyright
Registered	2013.12.09
Updated	2023.07.28
