Tom Brøndsted: Document Classifier


Version 2.0, April 2008 (new: agglomerative clustering)

[Ex.1] [Ex.2] [Ex.3]

Example 3: Spam vs. NOT Spam: Same as example 2, except that we have added a 7th document that also describes investment opportunities in Nigeria. The settings are returned to "normal", employing idf and words as terms. The clustering isolates the 7th document from the real spam emails. Press "Measure similarity" at the bottom of the page, or select "Clear URIs" and input your own settings.


Options: inverse doc. freq. (idf) | apply English stemming (Porter)
Terms: term=word | term=wordpair | term=wordtriplet

(Patience! Calculation can take 10-40 sec.)


An experimental document classifier based on the vector space model and agglomerative clustering. Input is a number of links to documents to be analyzed. Output is a distance matrix depicting the similarities of the documents and how they cluster.
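
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration and not the classifier's actual code: the idf formula (log(N/df)), the use of cosine distance, and average linkage for the clustering are all assumptions, since the page does not state which variants are used. Porter stemming is left out, and all function names are hypothetical.

```python
import math
from collections import Counter

def terms(text, n=1):
    """Lowercase word n-grams: n=1 word, n=2 word pair, n=3 word triplet."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf_vectors(docs, n=1, use_idf=True):
    """One sparse term-weight vector (dict) per document."""
    counts = [Counter(terms(d, n)) for d in docs]
    df = Counter(t for c in counts for t in c)  # document frequency per term
    total = len(docs)
    return [
        # Assumed weighting: tf * log(N / df); plain tf when idf is switched off.
        {t: tf * (math.log(total / df[t]) if use_idf else 1.0) for t, tf in c.items()}
        for c in counts
    ]

def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(vectors):
    return [[cosine_distance(a, b) for b in vectors] for a in vectors]

def agglomerative(dist):
    """Average-linkage clustering (assumed). Returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = len(clusters[i]) * len(clusters[j])
                d = sum(dist[p][q] for p in clusters[i] for q in clusters[j]) / pairs
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy input standing in for fetched documents (made-up strings, not the example URIs).
docs = [
    "cheap pills buy now",
    "buy cheap pills online",
    "meeting agenda for monday",
]
dist = distance_matrix(tfidf_vectors(docs, n=1, use_idf=True))
merges = agglomerative(dist)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

In this toy run the two spam-like documents share terms and are merged first, while the third document shares no terms with either and joins only in the final merge. Passing n=2 or n=3 to tfidf_vectors corresponds to the wordpair and wordtriplet settings.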