作成者 |
|
|
|
|
本文言語 |
|
出版者 |
|
|
発行日 |
|
収録物名 |
|
巻 |
|
出版タイプ |
|
アクセス権 |
|
関連DOI |
|
|
関連URI |
|
|
関連情報 |
|
|
概要 |
We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-word association pattern is an expression such as $ (TATA, 30, AG...GAGGT) Rightarrow C $ that expresses a rule that if a text contains a subword TATA followed by another subword AGGAGGT with distance no more than 30 letters then a property C will hold with a probability. We present an efficient algorithm for computing frequent patterns ($ alpha $, $k$ , $\beta $) that optimize the confidence with respect to a given collection of texts. The algorithm runs in time $ O(mn^2) $ and space $ O(kn) $, where $ m $ and $ n $ are the number and the total length of classification examples, respectively, and $ k $ is a small constant around 30 ~ 50. Furthermore for most random and nearly random texts like DNA sequences, the algorithm runs very efficiently in time $ O(kn log^2 n) $. Thus, this algorithm is much faster than a straightforward algorithm that enumerates all the possible patterns in time $ O(n^5) $. We also discuss some heuristics such as sampling and pruning for practical improvement. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences.続きを見る
|