<departmental bulletin paper>
Application and Improvement of the Purity Measure to Texts

Abstract The purity measure is an unusualness measure for substrings of a given string. Although our previous work showed its usefulness for characterizing specific regions of genome sequences, it has not been examined in depth how well the measure applies to text data, where many more symbols are used than in genome sequences. In this paper, we investigate its usefulness on texts and show that the purity measure cannot differentiate the unusualness of substrings when the input string uses many symbols. We therefore propose an improved measure, called the atomicity measure, and show that it differentiates the unusualness of substrings better. Our experiment on alphabet sequences in texts shows that both measures distinguish word-like sequences from non-word sequences. Another experiment on word sequences (phrases), a case in which there are many symbols, shows that the atomicity measure gives high values to phrases such as proper nouns and low values to idiomatic phrases that might reflect the genres of texts, whereas the purity measure is less informative on phrases. We conclude that the atomicity measure in particular can characterize texts well and is potentially useful in text mining.



Details

Created Date 2014.09.26
Modified Date 2018.12.21
