## ＜テクニカルレポート＞A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

作成者 作成者姓名 所属機関 所属機関名 Department of Informatics, Kyushu University 九州大学大学院システム情報科学研究院情報理学部門 作成者姓名 所属機関 所属機関名 Department of Informatics, Kyushu University 九州大学大学院システム情報科学研究院情報理学部門 作成者姓名 所属機関 所属機関名 Department of Informatics, Kyushu University 九州大学大学院システム情報科学研究院情報理学部門 作成者姓名 所属機関 所属機関名 Department of Informatics, Kyushu University 九州大学大学院システム情報科学研究院情報理学部門 英語 Department of Informatics, Kyushu University 九州大学大学院システム情報科学研究院情報理学部門 1998-03-19 148 accepted open access We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-word association pattern is an expression such as $(TATA, 30, AG...GAGGT) Rightarrow C$ that expresses a rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with a probability. We present an efficient algorithm for computing frequent patterns ($alpha$, $k$ , $\beta$) that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $O(mn^2)$ and space $O(kn)$, where $m$ and $n$ are the number and the total length of classification examples, respectively, and $k$ is a small constant around 30 ~ 50. Furthermore for most random and nearly random texts like DNA sequences, the algorithm runs very efficiently in time $O(kn log^2 n)$. Thus, this algorithm is much faster than a straightforward algorithm that enumerates all the possible patterns in time $O(n^5)$. We also discuss some heuristics such as sampling and pruning for practical improvement. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences.続きを見る

### 本文情報を非表示

trcs148 pdf 276 KB 78
trcs148.ps gz 152 KB 6

### 詳細

レコードID 3016 査読無 DOI Technical Report || 148 http://www.i.kyushu-u.ac.jp/research/report.html テクニカルレポート 2009.04.22 2017.01.20