Tom Brøndsted: Document Classifier

Version 2.0 April 2008 (NEW agglomerative clustering)

Example 2: Spam: "I AM THE SON OF ALHAJI ISMAILA GWARZO THE NATIONAL SECURITY ADVISER TO GENERAL SANI ABACHA WHO WAS THE FORMER MILITARY HEAD OF STATE OF NIGERIA" ... The documents are texts found under the category "Nigerian" or similar on sites fighting spam. This kind of spam email is intuitively so stereotyped that it could be interesting to look for "fingerprints". Hence, we use word triplets as terms, switch off the inverse document frequency weighting, and apply no stemming. Conclusion: docs 3, 5, and 6 definitely have the "fingerprints" of one and the same author. Docs 1, 2, and 4 are further away and have been written by different authors. Doc 4 still ends up in the group with 3, 5, and 6, but only at a high cost (0.969311); a clustering threshold would have prevented that merge. Press "Measure similarity" at the bottom of the page, or select "Clear URIs" and input your own settings.
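The settings used in this example (word triplets as terms, no inverse document frequency weighting, no stemming) can be illustrated with a small sketch. It is not the classifier's actual code: the tokenization, the lowercasing, the use of cosine distance, and the toy texts are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_triplets(text):
    """Count consecutive three-word sequences; no stemming is applied."""
    # Assumption: lowercase the text and split on non-alphanumeric characters.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(zip(words, words[1:], words[2:]))


def cosine_distance(a, b):
    """1 - cosine similarity of two raw count vectors (no IDF weighting)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document shares nothing with any other
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy texts, not the documents from the example.
docs = [
    "i am the son of the former head of state",
    "i am the son of the late minister of finance",
    "cheap watches for sale today only",
]
vectors = [word_triplets(d) for d in docs]
d01 = cosine_distance(vectors[0], vectors[1])  # half of the triplets are shared
d02 = cosine_distance(vectors[0], vectors[2])  # no triplets are shared
```

Because a triplet only matches when three consecutive words are identical, stock phrases reused by one author pull documents together, while texts on the same topic in different wording stay far apart.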

[Clear URIs]

Options: inverse doc. freq. weighting; English stemming (Porter)
Term unit: word, word pair, or word triplet


An experimental document classifier based on the vector space model and agglomerative clustering. The input is a set of links to the documents to be analyzed. The output is a distance matrix showing how similar the documents are to each other and how they cluster.
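How a distance matrix turns into clusters, and how a threshold can stop an expensive merge, can be sketched as follows. This is a minimal illustration rather than the tool's implementation: average linkage is an assumption (the page does not say which linkage is used), and the function name `agglomerate`, the `threshold` parameter, and the distance values are hypothetical.

```python
def agglomerate(dist, threshold=1.0):
    """Agglomerative clustering over a symmetric distance matrix.

    Assumption: average linkage. Returns (clusters, merges), where clusters
    are sorted lists of document indices and merges records each step as
    (cluster_a, cluster_b, cost).
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the cheapest pair of clusters to merge next.
        cost, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if cost > threshold:
            break  # every remaining merge is too expensive
        merges.append((clusters[x], clusters[y], cost))
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical distance matrix for four documents.
dist = [
    [0.00, 0.20, 0.90, 0.95],
    [0.20, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.30],
    [0.95, 0.90, 0.30, 0.00],
]
clusters, merges = agglomerate(dist, threshold=0.5)
all_merged, _ = agglomerate(dist)  # default threshold permits every merge
```

With the threshold at 0.5 the two tight pairs stay separate; with the default threshold they are joined in a final, high-cost step.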