Tom Brøndsted: Document Classifier


Example 3: Spam vs. NOT Spam: Same as example 2, except that a 7th document describing "serious" investment opportunities in Nigeria has been added. With n=1, the clustering C1(1,2,4,3,5,6) C2(7) separates the 7th document from the Nigerian spam emails. Press "Measure similarity" at the bottom of the page, or select "Clear URIs" and enter your own settings.


inverse doc. freq. | apply English stemming (Porter)
term=word | term=wordpair | term=wordtriplet

(Patience! Calculation can take 10-40 sec.)


An experimental document classifier based on the vector space model and agglomerative clustering. The input is a set of links to the documents to be analyzed. The output is a distance matrix showing how similar the documents are to one another and how they cluster.
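
The pipeline described above (documents to term vectors, term vectors to a distance matrix, distance matrix to clusters) can be illustrated with a minimal Python sketch. This is not the classifier's actual code. Several choices here are assumptions made for illustration: cosine distance as the similarity measure, average linkage as the merge criterion, the smoothed IDF formula, and all function names. The sample texts are invented and are passed as plain strings rather than fetched from links, and Porter stemming is left out to keep the sketch dependency-free.

```python
import math
import re
from collections import Counter


def terms(text, n=1):
    """Split text into lowercase word n-grams (n=1 word, 2 word pair, 3 word triplet)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def vectorize(docs, n=1, use_idf=False):
    """Turn each document into a sparse term-weight vector (a dict)."""
    counts = [Counter(terms(d, n)) for d in docs]
    if not use_idf:
        return [dict(c) for c in counts]
    df = Counter(t for c in counts for t in c)  # document frequency per term
    total = len(docs)
    # Assumed smoothed IDF, so terms found in every document keep a nonzero weight.
    return [
        {t: tf * (math.log((1 + total) / (1 + df[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either vector is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(vectors):
    """Pairwise distances between all document vectors."""
    return [[cosine_distance(a, b) for b in vectors] for a in vectors]


def agglomerate(dist, k):
    """Repeatedly merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(x, y):
        return sum(dist[i][j] for i in x for j in y) / (len(x) * len(y))

    while len(clusters) > k:
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # b > a, so index a stays valid
        del clusters[b]
    return clusters


# Invented sample texts: three similar messages and one unrelated document.
docs = [
    "urgent transfer of funds from a bank account in lagos",
    "urgent request to transfer funds to your bank account",
    "please help transfer the funds to a bank account urgent",
    "quarterly report on agriculture and energy investment markets",
]
matrix = distance_matrix(vectorize(docs, n=1))
clusters = agglomerate(matrix, k=2)
print(clusters)
```

Calling `vectorize(docs, n=2)` or `vectorize(docs, n=3)` switches the term unit to word pairs or word triplets, and `use_idf=True` applies the inverse document frequency weighting. Document indices in the sketch start at 0, whereas the example above numbers documents from 1.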