作成者 |
|
|
|
|
本文言語 |
|
出版者 |
|
|
発行日 |
|
収録物名 |
|
巻 |
|
号 |
|
開始ページ |
|
終了ページ |
|
出版タイプ |
|
アクセス権 |
|
JaLC DOI |
|
関連DOI |
|
関連URI |
|
関連情報 |
|
概要 |
We propose using long and low-frequency part of speech (POS) strings for document separation between native English documents and non-native English documents. The long POS strings were ignored in pre...vious works because their frequencies in training data are too small to estimate their probabilities. Meanwhile, a research of language identification showed that the long and low-frequency byte strings were useful for language identification among similar languages. There are some similarity between language identification and document separation between native English documents and non-native English documents, for example long POS strings are more peculiar to one class than short ones, though there is a difference between POS and byte. Therefore, we can expect higher accuracy by using long and low-frequency POS strings. Some experiments are described in this paper. These experiments show that the proposed method has higher accuracy than previous ones.続きを見る
|