<テクニカルレポート>
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

作成者
本文言語
出版者
発行日
収録物名
出版タイプ
アクセス権
関連DOI
関連URI
関連情報
概要 We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-word association pattern is an expression such as $ (TATA, 30, AG...GAGGT) Rightarrow C $ that expresses a rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with a probability. We present an efficient algorithm for computing frequent patterns ($ alpha $, $k$ , $\beta $) that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $ O(mn^2) $ and space $ O(kn) $, where $ m $ and $ n $ are the number and the total length of classification examples, respectively, and $ k $ is a small constant around 30 ~ 50. Furthermore for most random and nearly random texts like DNA sequences, the algorithm runs very efficiently in time $ O(kn log^2 n) $. Thus, this algorithm is much faster than a straightforward algorithm that enumerates all the possible patterns in time $ O(n^5) $. We also discuss some heuristics such as sampling and pruning for practical improvement. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences.続きを見る

本文ファイル

pdf trcs148 pdf 276 KB 481  
gz trcs148.ps gz 152 KB 41  

詳細

レコードID
査読有無
タイプ
登録日 2009.04.22
更新日 2017.01.20

この資料を見た人はこんな資料も見ています