Document Separation between Native English and Nonnative English Using Long POS Strings

＜departmental bulletin paper＞

Creator	Yukino, Kensei (行野, 顕正), Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University (Doctoral Program)
	Aoki, Sayaka (青木, さやか), Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University (Master's Program)
	Tanigawa, Ryuji (谷川, 龍司), Department of Intelligent Systems, Graduate School of Information Science and Electrical Engineering, Kyushu University (Master's Program)
	Tomiura, Yoichi (冨浦, 洋一), Department of Intelligent Systems, Faculty of Information Science and Electrical Engineering, Kyushu University
Language	Japanese
Publisher	Faculty of Information Science and Electrical Engineering, Kyushu University
Date	2006-09-26
Source Title	Research reports on information science and electrical engineering of Kyushu University
Vol	11
Issue	2
First Page	115
Last Page	119
Publication Type	Version of Record
Access Rights	open access
JaLC DOI	https://doi.org/10.15017/1516865
Related URI	https://portal.isee.kyushu-u.ac.jp/
Abstract	We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too low to estimate their probabilities reliably. Meanwhile, research on language identification has shown that long, low-frequency byte strings are useful for distinguishing between similar languages. Language identification and native/non-native document separation are similar in some respects: for example, long POS strings are more peculiar to one class than short ones, although POS tags differ from bytes. We can therefore expect higher accuracy from using long, low-frequency POS strings. Experiments described in this paper show that the proposed method achieves higher accuracy than previous methods.

File	FileType	Size	Views	Description
p115	pdf	693 KB	195

Details

PISSN	1342-3819
EISSN	2188-0891
NCID	AN10569524
Record ID	1516865
Peer-Reviewed	Refereed
Subject Terms	Document separation
	Corpus construction
	Native English corpus
	Non-native English corpus
	Low-frequent features
Created Date	2015.06.19
Modified Date	2020.11.17
