Abstract |
We propose an unsupervised method for detecting spam documents in Web page data, based on equivalence relations on strings. We introduce three measures for quantifying the alienness (i.e., how different it is from the others) of substring equivalence classes within a given set of strings. A document is then classified as spam if it contains a characteristic equivalence class as a substring. The proposed method is unsupervised, language-independent, and very efficient. Computational experiments on data collected from Japanese web forums show fairly good results.