Statistical content filtering

Statistical filtering was first proposed in 1998 by Mehran Sahami et al., at the AAAI-98 Workshop on Learning for Text Categorization. A statistical filter is a kind of document classification system, and a number of machine learning researchers have turned their attention to the problem. Statistical filtering was popularized by Paul Graham's influential 2002 article "A Plan for Spam", which proposed the use of naive Bayes classifiers to predict whether messages are spam or not, based on collections of spam and nonspam ("ham") email submitted by users.

Statistical filtering, once set up, requires no maintenance per se: instead, users mark messages as spam or nonspam and the filtering software learns from these judgements. Thus, a statistical filter does not reflect the software author's or administrator's biases as to content, but it does reflect the user's biases as to content; a biochemist who is researching Viagra won't have messages containing the word "Viagra" flagged as spam, because "Viagra" will show up often in their legitimate messages. Spam emails containing the word "Viagra", however, do get filtered, because the rest of their content differs from that of the user's legitimate messages.

A statistical filter can also respond quickly to changes in spam content, without administrative intervention. Statistical filters can also examine message headers, thereby considering not just the content but also peculiarities of the transport mechanism of the email.

Spammers have attempted to fight statistical filtering by inserting many random but valid "noise" words or sentences into their messages while attempting to hide them from view, making it more likely that the filter will classify the message as neutral. (See Word salad (computer science).) Attempts to hide the noise words include setting them in a tiny font or in the same colour as the background. However, these noise countermeasures seem to have been largely ineffective.

Software programs that implement statistical filtering include Bogofilter, DSPAM, SpamBayes, the e-mail programs Mozilla and Mozilla Thunderbird, Mailwasher, and later revisions of SpamAssassin. Another notable project is CRM114, which hashes phrases and performs Bayesian classification on them. There is also the free mail filter POPFile, which uses Bayesian filtering to sort mail into as many categories as the user wants (such as family, friends, co-workers, and spam).
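To make the learning loop described in this section more concrete, the following is a minimal sketch of a word-based naive Bayes filter that is trained only from a user's spam and ham judgements. It is an illustration under stated assumptions, not the method of Graham's article or of any program named above: the class and method names (NaiveBayesFilter, train, spam_probability) are hypothetical, the tokenizer is deliberately crude and ignores headers, and add-one smoothing is an assumed choice for handling unseen words.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class (spam/ham) naive Bayes filter; hypothetical illustration."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Crude lowercase word tokens; body text only, no header handling.
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, text, label):
        # Called each time the user marks a message as "spam" or "ham".
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        tokens = self.tokenize(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior (assumed add-one smoothing).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for words never seen in training.
            denominator = total_words + len(vocab) + 1
            for word in tokens:
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            log_score[label] = score
        # Turn the two log scores into P(spam | message); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


# The biochemist from the text: "Viagra" appears in their legitimate mail too.
mail_filter = NaiveBayesFilter()
mail_filter.train("Buy cheap Viagra now, limited offer, click here", "spam")
mail_filter.train("Cheap pills, click here to buy now", "spam")
mail_filter.train("Viagra assay results attached for the sildenafil binding study", "ham")
mail_filter.train("Meeting notes: sildenafil dosage data and Viagra trial protocol", "ham")

spam_p = mail_filter.spam_probability("Click here to buy cheap pills now")
ham_p = mail_filter.spam_probability("Viagra binding study results and trial data")
print(f"sales pitch: {spam_p:.3f}  lab message: {ham_p:.3f}")
```

In this toy corpus the word "Viagra" occurs in both classes, so it carries little evidence on its own; the surrounding words decide the outcome, which is the behaviour the biochemist example describes. Hidden noise words work against a filter of this kind only to the extent that they happen to be words the user's own ham contains more often than their spam does.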