A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases - Kyushu University Collection | Kyushu University Library

<Technical Report>
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Creator	Name: Arimura, Hiroki (有村, 博紀); Affiliation: Department of Informatics, Kyushu University
	Name: Wataki, Atsushi (渡木, 厚); Affiliation: Department of Informatics, Kyushu University
	Name: Fujino, Ryoichi (藤野, 亮一); Affiliation: Department of Informatics, Kyushu University
	Name: Arikawa, Setsuo (有川, 節夫); Affiliation: Department of Informatics, Kyushu University
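Language	English
Publisher	Department of Informatics, Kyushu University
Publisher	Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University
Date of issue	1998-03-19
Source title	DOI Technical Report
Volume	148
Publication type	Accepted Manuscript
Access rights	open access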
Related DOI	DOI Technical Report || 148
Related DOI	http://www.i.kyushu-u.ac.jp/research/report.html
Related URI	DOI Technical Report || 148
Related URI	http://www.i.kyushu-u.ac.jp/research/report.html
Related information	DOI Technical Report || 148
Related information	http://www.i.kyushu-u.ac.jp/research/report.html
Abstract	We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-word association pattern is an expression such as $(\text{TATA}, 30, \text{AG...GAGGT}) \Rightarrow C$, which expresses the rule that if a text contains the subword TATA followed by another subword AGGAGGT at a distance of no more than 30 letters, then a property $C$ will hold with some probability. We present an efficient algorithm for computing frequent patterns $(\alpha, k, \beta)$ that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $O(mn^2)$ and space $O(kn)$, where $m$ and $n$ are the number and the total length of classification examples, respectively, and $k$ is a small constant around 30 to 50. Furthermore, for most random and nearly random texts such as DNA sequences, the algorithm runs very efficiently in time $O(kn \log^2 n)$. Thus, this algorithm is much faster than a straightforward algorithm that enumerates all the possible patterns in time $O(n^5)$. We also discuss some heuristics, such as sampling and pruning, for practical improvement. Finally, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences.
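
To make the pattern semantics in the abstract concrete, the following is a minimal Python sketch of how a single pattern $(\alpha, k, \beta)$ could be checked against labeled texts and scored. It is a naive illustration, not the algorithm of the report. The function names (`occurrences`, `matches`, `support_and_confidence`), the sample sequences, and the exact reading of "distance" (the number of letters between the end of $\alpha$ and the start of $\beta$, with no overlap) are all assumptions made for this example; the report may define them differently.

```python
def occurrences(text, word):
    """Return every start index at which `word` occurs in `text` (overlaps allowed)."""
    starts = []
    i = text.find(word)
    while i != -1:
        starts.append(i)
        i = text.find(word, i + 1)
    return starts


def matches(text, alpha, k, beta):
    """Check whether `alpha` is followed by `beta` within at most `k` letters.

    Assumption: the distance is the gap between the end of `alpha`
    and the start of `beta`, so 0 <= gap <= k and the words do not overlap.
    """
    beta_starts = occurrences(text, beta)
    for a in occurrences(text, alpha):
        alpha_end = a + len(alpha)
        if any(0 <= b - alpha_end <= k for b in beta_starts):
            return True
    return False


def support_and_confidence(examples, alpha, k, beta):
    """Score one pattern on `examples`, a list of (text, label) pairs.

    `label` is True when property C holds for the text.
    Support is the fraction of texts the pattern matches; confidence is
    the fraction of matched texts for which C holds.
    """
    matched_labels = [label for text, label in examples
                      if matches(text, alpha, k, beta)]
    support = len(matched_labels) / len(examples) if examples else 0.0
    confidence = (sum(matched_labels) / len(matched_labels)
                  if matched_labels else 0.0)
    return support, confidence


# Hypothetical labeled sequences, invented for illustration only.
examples = [
    ("CCTATACGAGGAGGTAA", True),              # gap of 2 letters: match
    ("TATA" + "C" * 40 + "AGGAGGT", False),   # gap of 40 letters: no match
    ("GGGGAGGAGGT", True),                    # no TATA: no match
    ("TATAAGGAGGT", True),                    # gap of 0 letters: match
    ("TATAGAGGAGGT", False),                  # gap of 1 letter: match
]
support, confidence = support_and_confidence(examples, "TATA", 30, "AGGAGGT")
print(support, confidence)
```

Scoring every candidate triple this way is the kind of brute-force enumeration the abstract contrasts with its faster algorithm; the sketch is only meant to pin down what a single pattern test and its confidence look like under the stated assumptions.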

Full-text files

File	File type	Size	Views	Description
trcs148	pdf	276 KB	511
trcs148.ps	gz	152 KB	42

Details

Record ID	3016
Peer review	Not peer-reviewed
Type	Technical Report
Date registered	2009.04.22
Date updated	2017.01.20