Anti-spam Part 2, Bayesian Spam Filtering
Well, Andrew and I kinda stepped on each other’s toes last month, but I’ll go into a little more depth on some of the things he touched on. Last month I talked about the front end of our anti-spam filtering: greylisting.
At the opposite end of our anti-spam system is content filtering. We use a third-party vendor for this, MailFoundry, in the form of two appliances. An appliance is a machine that you plug in and that is supposed to work with minimal configuration.
Now the MailFoundry appliances are “black box” systems. We don’t know how they work exactly, but we’re pretty sure that one of the techniques they use is Bayesian spam filtering.
Bayesian spam filtering uses probability: it evaluates each token in a message, assigns a weight to each one, gives the overall message a rating based on those weights, and judges the message against a preset threshold.
OK, unless you’re up on your statistics or logic-based calculus, or you’re a computer nerd with Wikipedia handy, I know your eyes just glazed over. Rest assured, you are not alone.
Basically, what it boils down to is this. A “token” is a series of characters separated by whitespace; for this discussion, most tokens are words. Certain tokens are negative: they tend to appear in spam messages. Others are positive: they tend to appear in good (or “ham”) messages. Each token has a value (or “weight”). A Bayesian spam filtering system reads the message and adds up the negative and positive weights of its tokens, which produces an overall probability rating. If the rating is too negative, the filter considers the message spam. Otherwise, it lets the message through.
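For the computer nerds in the audience, here is roughly what that looks like in Python. This is only a sketch of the idea described above, not how the MailFoundry appliances actually work (remember, they are black boxes to us). The token weights, the threshold, and the function names are all made up for illustration. A real filter derives its weights from word probabilities rather than hand-picked numbers.

```python
# Hypothetical weights: negative leans spam, positive leans ham.
# These numbers are invented for the example, not taken from any real filter.
WEIGHTS = {
    "free": -1.5,
    "winner": -2.0,
    "pills": -3.0,
    "meeting": 1.5,
    "thanks": 1.0,
}

# Assumed cutoff: a rating at or below this counts as "too negative".
SPAM_THRESHOLD = -2.0


def tokenize(message):
    """Split a message into lowercase tokens separated by whitespace."""
    return message.lower().split()


def score(message, weights=WEIGHTS):
    """Add up the weights of all tokens; unknown tokens count as 0."""
    return sum(weights.get(token, 0.0) for token in tokenize(message))


def is_spam(message, weights=WEIGHTS, threshold=SPAM_THRESHOLD):
    """A message is spam if its overall rating is too negative."""
    return score(message, weights) <= threshold


print(score("FREE winner"))             # -3.5
print(is_spam("FREE winner"))           # True
print(score("Thanks for the meeting"))  # 2.5
print(is_spam("Thanks for the meeting"))  # False
```

Note that splitting on whitespace alone means “meeting.” (with the period) is a different token from “meeting”. That matches the simple definition of a token used here, though a real filter would likely be smarter about punctuation.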
Now, how does the Bayesian filter know which tokens are bad or good? Well, you have to give it examples of each. If a message is spam, and it gets through the filter, you have to tell the filter that it’s spam. When a message is marked as spam, all of the tokens in the message have their ratings lowered in the filter’s database. Ideally, you’d also mark good messages as good, but most people don’t. Most Bayesian filtering schemes are configured to mark all delivered messages as good unless they are marked as bad. Over time, the good tokens get “gooder” and the bad tokens get “badder” and the system can determine what is spam and what is not.
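Here is a small Python sketch of that training loop, again for the curious. It is a toy under stated assumptions, not our vendor’s implementation: the `TinyBayes` class and its method names are hypothetical, the weight is the log of how often a token shows up in good mail versus spam, and the `+1` / `+2` smoothing is one common way to keep never-seen tokens from breaking the math. It also ignores how much spam there is overall, which a real filter would account for.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy trainable filter (hypothetical name, for illustration only)."""

    def __init__(self):
        self.spam_counts = Counter()  # spam messages containing each token
        self.ham_counts = Counter()   # good messages containing each token
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, message, is_spam):
        """Record every distinct token in the message as spam or ham."""
        tokens = set(message.lower().split())
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_msgs += 1

    def weight(self, token):
        """Negative means the token leans spam, positive means it leans ham."""
        # Add-one smoothing (an assumption) so unseen tokens get weight 0.
        p_spam = (self.spam_counts[token] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.ham_msgs + 2)
        return math.log(p_ham / p_spam)

    def score(self, message):
        """Overall rating: the sum of the weights of the distinct tokens."""
        return sum(self.weight(t) for t in set(message.lower().split()))


bayes = TinyBayes()
bayes.train("cheap pills now", is_spam=True)
bayes.train("cheap watches now", is_spam=True)
bayes.train("lunch meeting tomorrow", is_spam=False)
bayes.train("meeting notes attached", is_spam=False)

before = bayes.weight("cheap")
bayes.train("cheap cheap cheap", is_spam=True)  # report one more spam
after = bayes.weight("cheap")
print(after < before)  # True: the bad token just got "badder"
```

Every time you report a spam message, the tokens in it drift further negative, and every good message nudges its tokens the other way.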
Bayesian spam filtering works amazingly well on individual email accounts, because it learns an individual’s taste in what is spam and what is not. Unfortunately, it’s not as effective across hundreds or thousands of users, but it still helps. Your personal spam filter will usually outperform anything on our end, because you may have a different definition of spam than other users out there. On a large system like ours, tokens that would be marked as negative for you are nullified by others marking them as positive. For example, mail from a list that you signed up for but no longer want may be spam to you, but not to other people out there. That’s why it’s best to unsubscribe from those lists rather than try to get our system to recognize them as spam.
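To put some numbers on that “nullified” effect, here is a quick Python sketch. Everything in it is invented for illustration: the `token_weight` function is hypothetical, the weight is a smoothed log ratio of good-mail frequency to spam frequency, and the message counts are not real figures from our system.

```python
import math


def token_weight(spam_hits, spam_total, ham_hits, ham_total):
    """Negative leans spam, positive leans ham (smoothed log ratio)."""
    p_spam = (spam_hits + 1) / (spam_total + 2)
    p_ham = (ham_hits + 1) / (ham_total + 2)
    return math.log(p_ham / p_spam)


# Your own mailbox (made-up counts): the list's name appears in 8 of the
# 10 messages you marked as spam, and in none of your 10 good messages.
personal = token_weight(spam_hits=8, spam_total=10, ham_hits=0, ham_total=10)

# The shared system (made-up counts): your 8 reports are a drop in a pool
# of 1000 spam messages, while 40 of 1000 good messages contain the same
# token because other subscribers still want that list.
shared = token_weight(spam_hits=8, spam_total=1000, ham_hits=40, ham_total=1000)

print(personal < 0)  # True: clearly spam, as far as you are concerned
print(shared > 0)    # True: everyone else's mail outvotes you
```

With these numbers, the same token is strongly negative in your personal filter and actually positive in the shared one, which is why reporting an unwanted mailing list to us rarely changes anything.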
If you want to help feed our filters, feel free to send examples of any spam you receive via our system, as attachments, to spam at iphouse.com.
I hope that helps!