Template Extraction from Collections of Web Documents Using Frequency Information - Kyudai Collection (九大コレクション) | Kyushu University Library

＜Conference Paper＞
Template Extraction from Collections of Web Documents Using Frequency Information (頻度情報を用いたWeb文書群からのテンプレート抽出)

Creator	Name: Noguchi, Ryutaro (野口, 龍太郎); Affiliation: Department of Physics, Kyushu University (九州大学理学部物理学科)
	Author identifier: L002646; Name: Yamada, Yasuhiro (山田, 泰寛); Affiliation: Department of Informatics, Kyushu University (九州大学大学院システム情報科学府)
	Author identifier: 100021285; Name: Ikeda, Daisuke (池田, 大輔); Affiliation: Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
	Author identifier: K000008; Name: Hirokawa, Sachio (廣川, 佐千男); Affiliation: Computing and Communications Center, Kyushu University (九州大学情報基盤センター)
Language	Japanese
Date issued	2004-03-05
Publication type	Accepted Manuscript
Access rights	open access
Related DOI	http://matu.cc.kyushu-u.ac.jp/
Related URI	http://matu.cc.kyushu-u.ac.jp/
Related information	http://matu.cc.kyushu-u.ac.jp/
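Abstract	The Web contains many groups of pages written with the same template, such as university syllabi and recipes. If the template for each group of pages is known, the individual data on each page can be extracted and is expected to be usable as a database. This paper proposes an algorithm that efficiently discovers templates using only the occurrence-frequency information of the n-grams contained in Web pages. It also reports the results of experiments, run with a system implementing the algorithm, on university syllabi found on the Web, ... and search engine result pages.
	There are a lot of Web documents written with the same template. Recipes, staff pages, and university syllabus pages are typical examples. If we have the templates of these documents, we can extract their contents and store them in a database. This paper proposes a template detection algorithm that uses the frequency of frequencies of n-grams in the documents. Experimental results are shown for series of Web documents.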
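The abstract only names the idea (the frequency of frequencies of n-grams) and does not spell out the algorithm, so the following is a minimal illustrative sketch and not the authors' method. It assumes character-level n-grams, and it assumes that template n-grams occur at least once per page, so that many distinct n-grams end up sharing the same total frequency. The function names `ngram_counts`, `frequency_of_frequencies`, and `template_candidates`, the default `n=4`, and the threshold `f >= len(docs)` are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across all documents."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def frequency_of_frequencies(counts):
    """Map each frequency f to the number of distinct n-grams occurring f times."""
    return Counter(counts.values())


def template_candidates(docs, n=4):
    """Return (peak frequency, n-grams occurring exactly that often).

    Assumption (not from the paper): only frequencies >= len(docs) are
    eligible, because a template n-gram should occur at least once per page.
    Among eligible frequencies, the one shared by the most distinct n-grams
    is taken as the peak; ties go to the higher frequency.
    """
    counts = ngram_counts(docs, n)
    fof = frequency_of_frequencies(counts)
    eligible = {f: c for f, c in fof.items() if f >= len(docs)}
    if not eligible:
        return 0, set()
    peak = max(eligible, key=lambda f: (eligible[f], f))
    return peak, {gram for gram, f in counts.items() if f == peak}
```

Calling `template_candidates` on a few pages that share markup but differ in content should return n-grams drawn from the shared markup, while n-grams from the page-specific text fall below the threshold. Real page collections would need more care, for example handling template fragments that repeat within one page.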

Full-text file

File	File type	Size	Views	Description
6-b-01	pdf	402 KB	866

Details

Record ID	2986
Peer review	Peer reviewed
Subject	Web mining
	a series of Web documents
	template detection
	substring amplification
	Zipf's Law
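	Web mining (Web マイニング)
	series-type HTML documents (シリーズ型HTML文書)
	template discovery (テンプレート発見)
	substring amplification method (部分文字列増幅法)
	Zipf's law (ジップの法則)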
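Note	February 28 - March 2, 2005 (平成17年), Data Engineering Workshop (DEWS), The Institute of Electronics, Information and Communication Engineers (IEICE)
Type	Conference Paper
Date registered	2009.04.22
Date updated	2017.01.19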