<departmental bulletin paper>
Document Separation between Native English and Nonnative English Using Long POS Strings

Creator
Language
Publisher
Date
Source Title
Vol
Issue
First Page
Last Page
Publication Type
Access Rights
JaLC DOI
Related DOI
Related URI
Relation
Abstract We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities. Meanwhile, a study on language identification showed that long, low-frequency byte strings are useful for distinguishing similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags differ from bytes. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous ones.
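The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify string lengths, a scoring rule, or a tagger, so the length range (4 to 6 tags), the "occurs in only one class" filter, the simple vote count, the Penn Treebank-style tags, and the function names `pos_ngrams`, `train`, and `classify` are all assumptions made here for illustration. Documents are assumed to be already POS-tagged.

```python
from collections import Counter

def pos_ngrams(tags, min_n, max_n):
    """Yield every contiguous POS-tag string of length min_n..max_n as a tuple."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])

def train(docs_by_class, min_n=4, max_n=6):
    """Assumed rule: keep POS strings seen in exactly one class's training data."""
    seen = {label: set() for label in docs_by_class}
    for label, docs in docs_by_class.items():
        for tags in docs:
            seen[label].update(pos_ngrams(tags, min_n, max_n))
    # Count in how many classes each POS string appears.
    owners = Counter(g for grams in seen.values() for g in grams)
    return {label: {g for g in grams if owners[g] == 1}
            for label, grams in seen.items()}

def classify(tags, model, min_n=4, max_n=6):
    """Assumed rule: vote for the class whose peculiar strings occur most often."""
    votes = Counter({label: 0 for label in model})
    for g in pos_ngrams(tags, min_n, max_n):
        for label, grams in model.items():
            if g in grams:
                votes[label] += 1
    return votes.most_common(1)[0][0], dict(votes)

# Toy, hand-made tag sequences (not data from the paper).
model = train({
    "native": [["DT", "JJ", "NN", "VBZ", "DT", "NN", "IN", "DT", "NN"]],
    "nonnative": [["DT", "NN", "VBZ", "VBG", "DT", "JJ", "NN", "IN", "NN"]],
})
label, votes = classify(["DT", "JJ", "NN", "VBZ", "DT", "NN"], model)
print(label, votes)
```

Even a single occurrence in training data makes a long string usable under this filter, which is the property that frequency-based probability estimates cannot exploit.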


pdf p115 pdf 693 KB 173  

Details

PISSN
EISSN
NCID
Record ID
Peer-Reviewed
Subject Terms
Created Date 2015.06.19
Modified Date 2020.11.17
