Tom Brøndsted: Document Classifier


Example 2: "Fingerprinting" Spam. The documents are texts found under the category "Nigerian" or similar on sites that fight spam. To find "fingerprints", we use word triplets as terms, switch off inverse document frequency weighting, and apply no stemming. Clustering into four clusters, C1(1) C2(2) C3(3,5,6) C4(4), gives evidence that docs 3, 5, and 6 carry the "fingerprints" of one and the same author. Clustering further into three or two clusters is possible only at a high cost (over 0.9). Press "Measure similarity" at the bottom of the page, or select "Clear URIs" and input your own settings.
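As a rough illustration of the settings in this example, the sketch below builds raw term-frequency vectors over word triplets (no inverse document frequency weighting, no stemming) and compares two texts. The function names, the simple lowercase tokenizer, and the choice of cosine distance are assumptions made for illustration; the page does not state how the classifier tokenizes text or which distance measure it uses.

```python
import math
import re
from collections import Counter


def word_triplets(text):
    """Return all consecutive word triplets in text (lowercased, no stemming)."""
    # Assumed tokenizer: runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def term_vector(text):
    """Raw term-frequency vector over word triplets (IDF switched off)."""
    return Counter(word_triplets(text))


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (assumed distance measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a text with no triplets shares nothing with any other
    return max(0.0, 1.0 - dot / (norm_a * norm_b))
```

Because a triplet matches only when three consecutive words agree exactly, shared triplets across documents tend to reflect reused phrasing rather than shared topic, which is presumably why this setting suits "fingerprinting".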


Options: inverse doc. freq. | apply English stemming (Porter)
Terms: word | wordpair | wordtriplet

(Patience! Calculation can take 10-40 sec.)


An experimental document classifier based on the vector space model and agglomerative clustering. Input is a number of links to documents to be analyzed. Output is a distance matrix depicting the similarities of the documents and how they cluster.
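A minimal sketch of the clustering step, assuming a precomputed symmetric distance matrix: starting from one cluster per document, the two closest clusters are merged repeatedly, and the cost of each merge is recorded. Average linkage and the function name `agglomerate` are assumptions for illustration; the page does not say which linkage criterion the classifier applies. The matrix in the usage lines is made-up toy data, not output of the tool.

```python
def agglomerate(dist):
    """Merge clusters bottom-up and return [(cost, clusters), ...] per merge.

    dist is a symmetric list-of-lists matrix of pairwise document distances.
    Average linkage is assumed here, not taken from the tool.
    """
    clusters = [[i] for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance between all cross-cluster document pairs.
                pairs = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                cost = sum(pairs) / len(pairs)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((cost, sorted(clusters)))
    return history


# Toy data (hypothetical): documents 0 and 1 are close, the rest are far apart.
toy = [
    [0.00, 0.10, 0.95, 0.95],
    [0.10, 0.00, 0.95, 0.95],
    [0.95, 0.95, 0.00, 0.95],
    [0.95, 0.95, 0.95, 0.00],
]
for cost, clusters in agglomerate(toy):
    print(f"{cost:.2f} {clusters}")
```

A sharp jump in merge cost from one step to the next suggests where to stop merging: the clusters formed before the jump are the ones the data supports.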