Bogofilter Anti-Spam "how to" (Bayesian spam filter)





Bogofilter is a Bayesian spam filter that indexes two mailboxes, one of known good mail and one of known spam, and uses them to classify incoming mail.

Bogofilter is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections.

The statistical technique is known as the Bayesian technique and its use for spam was described by Paul Graham in his article A Plan For Spam in August 2002. Gary Robinson, in his web log Rants (September 2002), suggested some refinements for improved discrimination between spam and ham. Bogofilter's primary algorithm uses the f(w) parameter and the Fisher inverse chi-square technique that he describes. Paul Graham's new article Better Bayesian Filtering (January 2003) suggests some useful parsing improvements.

Bogofilter is run by an MDA script to classify an incoming message as spam or ham (using word lists stored by BerkeleyDB). Bogofilter provides processing for plain text and HTML. It supports multi-part MIME messages with decoding of base64, quoted-printable, and uuencoded text, and ignores attachments such as images. (Source: Bogofilter.Sourceforge)



Getting Started

The first thing to check is whether bogofilter is installed and visible in your path. Run "which bogofilter"; if it prints nothing, bogofilter is either not installed or not in your path. If it is not installed, get a package or build it from source. Once it is installed, make sure bogofilter is in a path that the users on the system can see.
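
The check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; "have_command" and "check_bogofilter" are hypothetical helper names made up for this example, not part of bogofilter.

```shell
#!/bin/sh
# Return 0 if the named command can be found in PATH, non-zero otherwise.
# have_command is a hypothetical helper name used only in this example.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether bogofilter is visible to the current user.
check_bogofilter() {
    if have_command bogofilter; then
        echo "bogofilter found at: $(command -v bogofilter)"
    else
        echo "bogofilter not found in PATH" >&2
        return 1
    fi
}
```

Run check_bogofilter as the user who receives the mail, since that user's path is the one that matters.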

Secondly, take some time to put all of the mail that you know is spam into a separate mailbox. This mailbox will be named "SPAM" in our example. Then put all of the mail you know is good, the mail you want to keep receiving in the future, into another mailbox. We will use "archive" for good mail. If you have other good mail stored somewhere else, we will use the mailbox "saved" for it in our example.



Running the job from cron

The easiest way to use bogofilter is to run a cron job every 15 minutes or so with the following lines. You can run these commands on one line or, as we do below for easier reading, split them across lines with backslash continuations.

rm -f /home/username/.bogofilter/wordlist.db; \
  bogofilter -s < ~username/Mail/SPAM;        \
  bogofilter -n < ~username/Mail/archive;     \
  bogofilter -n < ~username/Mail/saved

These lines remove the wordlist.db database that bogofilter makes and then rebuild it. The argument "-s" is for mailboxes that contain known samples of spam you have received. The argument "-n" is for non-spam, the good mail you have. In our example, all mail in the SPAM mailbox is registered as spam, and all mail in "archive" and "saved" is registered as non-spam, mail we want to receive.



How does it work?

The idea is that you can edit your mail folders, moving mail from one box to another, and not have to worry about the database containing false positives or false negatives, because the database is always remade when your cron job runs. You may find this method a lot easier to explain to the users you support than trying to get them to update the bogofilter database to remove misclassified mail.



Setting up the MDA

The last task is to set up your mail delivery agent (MDA) to filter out the mail bogofilter marked as spam. All mail filtered through bogofilter will have an X-header called "X-Bogosity" with a spam rating. Procmail is our MDA of choice, and you can find detailed procmail (.procmailrc) examples on the main page.
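
To see the header being added before you wire anything into your MDA, you can run a single message through bogofilter by hand. The sketch below assumes bogofilter's passthrough option (-p) and its -e option as we understand them from the bogofilter manual; "tag_message" and the BOGOFILTER override variable are hypothetical names for this example.

```shell
#!/bin/sh
# Sketch: run one message through bogofilter in passthrough mode.
#   -p  print the message back out with an X-Bogosity header added
#   -e  exit with code 0 for both spam and ham, so a calling MDA does not
#       mistake a classification result for a delivery error
# BOGOFILTER is a hypothetical override so a stand-in can be used for testing.
tag_message() {
    "${BOGOFILTER:-bogofilter}" -p -e
}
```

For example, "tag_message < message.txt > tagged.txt" reads a saved message and writes a copy with the X-Bogosity header. The file names here are placeholders.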



Questions?

What is "Ham" and "Spam"?

Simply put, ham is mail that you consider good and want to receive. Spam, on the other hand, is unsolicited commercial email (UCE): mail you did not ask to receive or do not wish to receive in the future.

How do I tell how mail was classified?

Look at the headers of the email in question for the "X-Bogosity" header. This header will only be present after the mail has run through bogofilter. You should see something like "X-Bogosity: Ham" for good mail and "X-Bogosity: Spam" for spam. Some old versions of bogofilter use Yes for spam and No for ham. You can also look in the logs of your MDA (mail delivery agent), such as procmail; the log shows where each mail was saved.
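
If you have a message saved to a file, a small helper can pull the header out for you. This is a sketch only: "bogosity_header" is a hypothetical name, it expects exactly one message on standard input, and it does not handle headers folded across several lines.

```shell
#!/bin/sh
# Print the X-Bogosity header of a single message read from stdin.
# Only the header block is searched: awk stops at the first blank line,
# so a matching line quoted in the body is never reported.
bogosity_header() {
    awk '
        /^$/ { exit }                                  # blank line ends the headers
        tolower($0) ~ /^x-bogosity:/ { print; exit }   # header names are case-insensitive
    '
}
```

For example, "bogosity_header < message.txt" prints the header line, or nothing at all if the message never went through bogofilter.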

How much mail do I need to make bogofilter work properly?

It is suggested that you have hundreds to thousands of pieces of mail for bogofilter to work well, but fewer than a hundred will do in the beginning. As you collect more spam and ham, bogofilter will get more accurate.

How accurate is bogofilter?

The accuracy of bogofilter improves with the amount of spam and ham you have trained it on. In the beginning you may see an accuracy rate of around 60%. Once you have hundreds of emails in both the spam and ham mailboxes, do not be surprised to see accuracy upwards of 90%.

All of my mail is being classified as spam!

This happens when you have a mailbox full of spam but no good mail (ham). As far as bogofilter can tell, the occurrence of ham words in the database is effectively zero and every word it knows is a spam word, so all mail goes into the SPAM mailbox. To fix this, make sure bogofilter has both good mail and spam to look at.
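
A quick way to catch this before training is to count the messages in each mailbox. The sketch below assumes mbox-format files, where every message starts with a "From " separator line; the count is approximate if a mailbox contains unescaped body lines that begin with "From ". Both function names are hypothetical.

```shell
#!/bin/sh
# Count messages in an mbox file by counting its "From " separator lines.
# Prints 0 for an empty or unreadable file.
mbox_count() {
    if [ -r "$1" ]; then
        # grep -c still prints 0 when nothing matches, but exits non-zero
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}

# Warn when either the spam ($1) or the ham ($2) mailbox has no messages.
check_training_set() {
    spam_n=$(mbox_count "$1")
    ham_n=$(mbox_count "$2")
    echo "spam messages: $spam_n, ham messages: $ham_n"
    if [ "$spam_n" -eq 0 ] || [ "$ham_n" -eq 0 ]; then
        echo "warning: one mailbox is empty, classification will be lopsided" >&2
        return 1
    fi
}
```

With our example mailboxes this would be called as "check_training_set ~/Mail/SPAM ~/Mail/archive".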

At what point does bogofilter consider mail spam?

When bogofilter is 90% sure that a piece of mail is spam. Look at the X-Bogosity header for the "spamicity" value. If you see spamicity=1.00 then bogofilter is 100% sure this is spam, and a value of spamicity=0.50 means bogofilter is only 50% sure. Anything less than spamicity=0.90, or 90%, will not be considered spam.
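
The comparison described above can be sketched in a few lines of shell. The 0.90 cutoff is taken from the text; the cutoff in your own bogofilter configuration may differ. "spamicity_of" and "is_spam_score" are hypothetical helper names, and the header in the usage note is an illustrative sample, not output from a real run.

```shell
#!/bin/sh
# Extract the spamicity value from an X-Bogosity header line on stdin.
# Prints nothing if the line has no spamicity= field.
spamicity_of() {
    sed -n 's/.*spamicity=\([0-9.]*\).*/\1/p'
}

# Return 0 when the score in $1 is at or above the cutoff in $2.
# The cutoff defaults to 0.90, the value used in the text (an assumption
# about your setup, not something read from bogofilter's configuration).
is_spam_score() {
    awk -v s="$1" -v c="${2:-0.90}" 'BEGIN { exit !(s + 0 >= c + 0) }'
}
```

For example, feeding the line "X-Bogosity: Spam, tests=bogofilter, spamicity=0.950000" to spamicity_of yields 0.950000, which is_spam_score then accepts as spam at the 0.90 cutoff.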

Where can I find more detailed information about Bayesian filtering?

Paul Graham gave a talk at the 2003 Spam Conference. The following link describes the work he has done to improve the performance of the algorithm described in "A Plan for Spam," and what he is planning to do in the future. White paper: "Better Bayesian Filtering"





Questions, comments, or suggestions? Contact Calomel.org