
Filtering your spam

If you’re already getting a lot of spam in your mailbox, then the tricks I describe elsewhere for avoiding spam won’t do you much good; your address is already in circulation, and you can’t put the milk back in the bottle.

What you want in this case is a way to isolate the spam from the mail you want, so that you can delete it or report it. In other words, you want a spam filter. There are several ways to filter spam, and at least a couple of places along the mail chain to do the filtering; we’ll look at these here. The good news is that a good filtering setup (provided by your ISP, or assembled by you independently of your ISP) can reduce your spam deliveries to a very small number.

What does a spam filter do?

First, indulge me in an engineering metaphor and imagine a spam filter as a “black box” (or, for variety’s sake, a “red” one as shown in the picture below). We’re not yet certain exactly what is inside of it (because it is a “black” box), but we do know what goes into it and what then comes out of it. 

For the input, you shovel all your incoming e-mail in one end, and out of the other end pop two streams: your “real” mail, and the spam. Now, it’s a matter of figuring out where each stream will go: your wanted mail can of course go straight to your computer’s inbox list (as normal), while the spam could go into a “bit bucket” (a nice way of saying “sh*t can”).

An ideal filter, if you had one, would be able to distinguish accurately and consistently between spam and real mail, and would never make an error. You could then just let it trash all of the spam without worrying. Needless to say, however, there’s no such thing as an ideal filter. You’ll encounter two types of errors:

False positives: honest messages that the filter wrongly marks as spam.

False negatives: spam messages that the filter wrongly passes along as honest mail.

For most of us, false positives are more worrisome than false negatives, since you would probably rather receive an occasional “stealth” spam than lose a potentially important “honest” message. In any case, this suggests that allowing a filter to discard your spam automatically without your supervision is asking for trouble. You’re going to have to check the trap periodically to make sure no innocent mail has been snared. It’s unfortunate that you have to do this checking, but that’s why everybody hates spam. Besides, an accurate filter will probably save you a great deal of time over the long haul, compared to checking and deleting spam by hand.
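To make the two error rates concrete, here is a minimal sketch of how you might compute them from a hand count of one batch of mail. The function name and the sample counts are made up for illustration; the only idea taken from the text is that honest mail caught in the trap counts as a false positive, and spam that slips into the inbox counts as a false negative.

```python
def error_rates(spam_caught, spam_missed, honest_passed, honest_caught):
    """Return (false_positive_rate, false_negative_rate) as fractions."""
    total_spam = spam_caught + spam_missed
    total_honest = honest_passed + honest_caught
    # False negatives: spam that reached the inbox, as a share of all spam.
    fn_rate = spam_missed / total_spam if total_spam else 0.0
    # False positives: honest mail trapped as spam, as a share of all honest mail.
    fp_rate = honest_caught / total_honest if total_honest else 0.0
    return fp_rate, fn_rate


# Hypothetical tally: 100 spams (2 slipped through), 200 honest messages (1 trapped).
fp, fn = error_rates(spam_caught=98, spam_missed=2, honest_passed=199, honest_caught=1)
print(f"false-positive rate: {fp:.1%}, false-negative rate: {fn:.1%}")
```

Note that you can only fill in the "honest_caught" count by actually looking through the trap, which is one more reason to check it periodically.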

How does a filter work?

Now, to continue with our black-box metaphor, we will open up the black box and see what might be inside it.

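There are basically two approaches to filtering spam:

Content-based filtering: examining what a message says (its subject line and body) for the hallmarks of spam.

Route-based filtering: examining how a message arrived (the hosts that sent or relayed it, as recorded in its routing information) to see whether it came from a known source of spam.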

Which is better? I think there is probably plenty of room for both, and comprehensive spam filtering systems usually take advantage of the strengths of both techniques.

Most of the reporting I’ve seen in the popular and trade presses, and most of the buzz from software developers and high-profile anti-spam consultants, continue to focus exclusively on content-based filtering, perhaps because this seems like the most logical approach to those who aren’t fully briefed on the mechanics of e-mail transmission. In fact, however, the most effective single class of spam filter operating these days relies not upon content, but upon routing information. Such systems use external block lists to determine whether the messages are sent by hosts known to have been responsible for spamming in the recent past. The best block lists are kept scrupulously up-to-date, adding threats as they appear, and deleting them when they no longer appear to be spewing spam.

As we will see below, routing filters can be used to reject spam deliveries outright, so that they are never seen by users and do not have to be further dealt with by providers. Even if a given ISP does not wish to reject spam (and many do not), the route filter can provide an efficient means to tag incoming mail as spam, and segregate it from non-spam mail. Furthermore, in addition to stopping spam, route-based filters can also help you develop the information you will need in order to file reports of spam activity with the services whose resources were used to send or support it.
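For the curious, here is a rough sketch of how a mail host might consult a DNS-based block list. The usual convention is to reverse the octets of the sending host's IPv4 address, append the list's zone name, and do an ordinary DNS lookup; an answer means "listed", and a "no such name" reply means "not listed". The zone "dnsbl.example.org" is a placeholder of my own, not a real list, and a real deployment would need to follow the chosen list's usage terms and return codes.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the block list answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on this list (or the lookup failed).
        return False


# 192.0.2.99 is a documentation-only address; the zone is hypothetical.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

The lookup is cheap and happens before the message body is ever examined, which is part of why this approach scales so well for providers.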

Content filtering uses complex semantic analyses that require continual update to keep them abreast of the latest spammer tricks, and they’re inherently prone to ambiguity (and therefore higher error rates). At best, content filtering can only tell you whether or not a message is spam (and that’s something a seasoned internet user can usually do just by glancing at a subject line). It gives you no tools to go after spammers or their (witting or unwitting) helpers. In short, content filtering is like putting buckets under an increasing profusion of roof leaks; route filtering, by contrast, gives you the opportunity to patch the roof to stop the leaks where they start.

Where can the filtering be done?

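Generally, you can filter your mail in one or both of two places:

On your own computer: using your mail program (or add-on software) after the mail has been downloaded.

On the network: using filtering run by your ISP or by a third-party service, before the mail ever reaches your computer.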

Filtering on your computer

Frankly, I’m not sold on the notion of using your mail program (or other software on your own computer) to filter spam.

Of late, many publishers of mail programs have worked hard to improve the spam-filtering capabilities of their products. Many of these (such as Thunderbird and Apple Mail) include built-in Bayesian filtering that can do a very accurate job of identifying the spam, and this is very good news. However, simply identifying the spam isn’t enough for many of us. Filtering your messages after you’ve had to download them all is like putting up a “No Trespassing” sign after the fence has already been cut; the spammer has already made you pay completely for the downloading, and also for the CPU cycles your computer must expend to run the mail filter. Plus, if he’s put in any little surprises like beacon URLs (“web bugs”), these will almost certainly be triggered at some point. Finally, even if your incoming spam has been tagged as such, you must still expend time and effort to deal with it, even if only to delete it.

In short, then, trying to use your mail program to sort your spam is a time-consuming and ultimately ineffective affair. At best, and with a lot of work on your part, you’ll stay only slightly behind the tide of spam.

So, I’d really recommend skipping ahead to read about on-network filtering. Nevertheless, if you are still interested, I’ll spend a few minutes describing some things you might (or might not) be able to get your mail program to do with your incoming spam.

Automatic segregation: If your mail program does a good job of identifying spam messages, you should be able to set it up to move these messages to a “junk” folder; you can periodically inspect this folder (to make sure that no honest messages have been caught), and then delete its contents. If you can, you should set your mail client not to provide “previews” of these messages when this folder is opened, so that tricks like web-bugs and scripts will be disarmed.

Filtering by whitelist: You could set up your mail program to consult a “whitelist” of trusted correspondents, and move any message whose from-address doesn’t match the whitelist into a junk folder. This, however, is not a project to be undertaken lightly. Even if you go to the trouble to compile a comprehensive whitelist of addresses up front, you’ll probably miss one or two that will then be blocked. You’ll also be taking on the burden of maintaining and updating this list (e.g., if your Aunt Edna decides to leave AOL and go to Comcast). You’ll be spending so much time checking after your filter and modifying it that you might just as well have not used it at all.
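A whitelist rule is simple enough to sketch in a few lines using Python's standard e-mail parser. The addresses and folder names below are invented for illustration, and this is not how any particular mail program implements its rules.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical list of trusted correspondents (stored in lower case).
WHITELIST = {"edna@example.com", "bert@example.net"}


def folder_for(raw_message, whitelist=WHITELIST):
    """Return 'Inbox' if the From: address is whitelisted, otherwise 'Junk'."""
    msg = message_from_string(raw_message)
    _name, address = parseaddr(msg.get("From", ""))
    return "Inbox" if address.lower() in whitelist else "Junk"


old = "From: Aunt Edna <Edna@Example.com>\nSubject: Hello\n\nHow are you?\n"
new = "From: Aunt Edna <edna@newprovider.example>\nSubject: Hello\n\nI moved!\n"
print(folder_for(old), folder_for(new))
```

The second message shows the maintenance burden: the moment Aunt Edna changes providers, her mail lands in the junk folder until you notice and update the list. Since from-addresses are easily forged, the rule also trusts any spam that happens to borrow a whitelisted address.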

Filtering by the to-address: Because some spammers do not bother to put your address in the to-address field (read here to find out why they are able to do this), you can actually create a fairly good filter (but not a comprehensive one) by checking the to-address against a list of your own e-mail addresses. That is, you can put into the junk folder any message you receive that does not have any of your own addresses in the To: or Cc: fields. Unfortunately, however, at least one of the more popular mail programs (Microsoft Outlook running behind a Microsoft Exchange mail agent, as used in many corporate setups) won’t let you do this without writing a bunch of cryptic macro code. Also, fewer and fewer spammers are this lazy anymore.
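The to-address check can be sketched the same way. This is only an illustration with made-up addresses; it gathers every address in the To: and Cc: fields and looks for any overlap with your own.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical set of addresses that belong to you (lower case).
MY_ADDRESSES = {"me@example.com", "me@example.org"}


def addressed_to_me(raw_message, mine=MY_ADDRESSES):
    """Return True if any of my addresses appears in the To: or Cc: fields."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {address.lower() for _name, address in getaddresses(fields)}
    return bool(recipients & mine)


direct = "From: a@example.net\nTo: Me <me@example.com>\nSubject: Lunch?\n\nNoon?\n"
blind = "From: a@example.net\nTo: someone-else@example.net\nSubject: Deal!\n\nBuy now\n"
print(addressed_to_me(direct), addressed_to_me(blind))
```

Bear in mind that honest mail sent to you by Bcc, or through a mailing list, also fails this test, so anything it catches belongs in a junk folder for review rather than in the trash.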

Filtering on (simple) content: We know that spam mailings often feature particular words or phrases that set them apart from other kinds of e-mail. Early on, many basic spam filters consisted simply of lists of such “forbidden” terms; any message containing one or more of these terms would be treated as spam. This sort of thing, however, turned out to be even more problematic than filtering on addresses. For example, suppose you decide to trash-can any incoming porn spam by looking for the word “sex” in the subject line or body. If you’re not careful, you could end up deleting a message from your car-collecting Uncle Bert who wants to brag about his 1927 Essex, or you might miss a cute dirty joke from an old college friend, or a hot love letter from your current date. You could refine the filter, but this would take more and more work, and the filter would get bigger and more unwieldy as time went by.
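Uncle Bert's problem is easy to reproduce. The sketch below (the function names are my own) compares a naive substring test with a slightly refined whole-word test. The refinement rescues the Essex message, but a trivial misspelling then slips past it, which is exactly the treadmill described above.

```python
import re


def naive_match(text, term="sex"):
    """Flag the message if the term appears anywhere, even inside another word."""
    return term in text.lower()


def whole_word_match(text, term="sex"):
    """Flag the message only if the term appears as a word of its own."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


for subject in ["Come see my 1927 Essex!", "Hot sex tonight", "Hot s3x tonight"]:
    print(subject, naive_match(subject), whole_word_match(subject))
```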

Filtering on routing information: As we noted above, many of the most accurate and effective spam filters use a message’s routing information and header analysis to identify open relays, open proxies, and forged routing information; this sort of thing, unfortunately, is completely beyond the capability of any standard consumer-grade mail program of which I am aware. Although e-mail headers can be parsed mechanically, this is not a job for the novice programmer, and in any case probably can’t be done with the tools available to you in the typical mail client software.
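To give a feel for what “parsing headers mechanically” involves, here is a small sketch that pulls the bracketed IPv4 addresses out of a message's Received: lines. The sample message and its hosts are invented, and Received: lines in the wild come in many more formats than this one pattern handles. Extracting the addresses is the easy part; deciding which lines are trustworthy (anything below the line added by your own provider can be forged) and what each address means is where the real work lies.

```python
import re
from email import message_from_string

# Matches an IPv4 address written in square brackets, e.g. [198.51.100.7].
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

# Invented example; real headers vary widely from one mail server to another.
SAMPLE = (
    "Received: from mail.example.net (mail.example.net [198.51.100.7])\n"
    "\tby mx.example.com with ESMTP; Sun, 22 Jun 2008 10:00:00 -0400\n"
    "Received: from friend (unknown [203.0.113.45])\n"
    "\tby mail.example.net with SMTP; Sun, 22 Jun 2008 09:59:58 -0400\n"
    "From: someone@example.org\n"
    "Subject: hello\n"
    "\n"
    "Body text.\n"
)


def relay_ips(raw_message):
    """List bracketed IPv4 addresses from Received: lines, most recent hop first."""
    msg = message_from_string(raw_message)
    ips = []
    for header in msg.get_all("Received", []):
        ips.extend(IPV4_IN_BRACKETS.findall(header))
    return ips


print(relay_ips(SAMPLE))
```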

Add-on anti-spam software: Some people offer third-party software that can work with your mail program to detect spam and shunt it out of your inbox. Most of this software is only available for Windows systems (I use Macs), so I don’t know very much about these products.

Using your ISP’s spam filtering

Rather than try to roll your own spam filter, you would be very well advised to use filtering provided by the experts, perhaps those at your own internet provider. Most ISPs these days provide in-house spam filtering services to customers who want them. Indeed, ISPs often tout their spam protection and mail security tools as big selling points for their services. More than likely, your own ISP offers spam filtering that you can use for your incoming mail. You may not actually be using these services right now; if not, you may enjoy an immediate and steep drop in spam deliveries if you turn them on.

Technically speaking, there are several recognized means for ISPs to run spam filters for their customers’ incoming mail. They can set their incoming mail hosts to consult various spam blocking lists (such as those offered by SpamCop or The Spamhaus Project); or, they can install their own MDA-based filters (like SpamAssassin) that can evaluate entire messages (including their contents). Specialized mail-security firms can offer such services on a “turn-key” basis to even the smallest ISPs for use by their subscribers. Providers can even install dedicated spam-filtering hardware in their “back office” server farms (in the form of specialized computers such as those offered by IronPort or Barracuda) to provide simple and effective filtering.

The advantages of letting your ISP filter your spam are compelling:

Using third-party filtering services

You say your ISP doesn’t offer spam filtering, or what they do offer is ineffective or hard to use? If so, you may be able to use a third-party service to filter your spam after it is received by your ISP, but before you pick it up with your computer.

Perhaps the best-known such service is SpamCop. By becoming a SpamCop member (for a small charge), you can have all of your incoming mail forwarded to your SpamCop e-mail address; SpamCop will then filter out the spam (and get rid of virus attachments) and re-forward the cleaned mail to you at another e-mail address you set up. You can then periodically log on to the SpamCop website to view and report the accumulated spam. (Note: You can use SpamCop for free if you just want to report your spam and don’t need the filtering services.)

I have used the SpamCop service for several years now (and over 90,000 spam messages by my estimate), and find it to be accurate and reliable. There may well be other services of equal merit that I simply don’t know about. In choosing such a service, here are some things you might look for:

Performance of spam filters

As we noted above, you cannot have an ideal or perfect spam filter; however, you can have a very, very good one if your provider uses proven filtering methods.

Filters that work well

Presently, the best practice for spam filtering among retail ISPs includes some form of reputable DNS block list checking (of the IP address of the hosts offering mail), plus possibly some supplemental content checking (via SpamAssassin or a Bayesian filter, typically), which picks up the few spams that somehow evade a DNSBL check. If your ISP uses such a setup, you can expect that the large majority of the spam you are sent will be kept out of your computer's inbox (i.e., the false negative rate will be low). Also, such setups are usually very good at recognizing non-spam mail and passing it through, so the corresponding false-positive rate will also be very, very low. The reason for this good performance is that the vast majority of hardcore spam tends to follow a consistent profile (i.e., direct-to-MX or zombie transmission, high Bayesian scores), one that differs markedly from honest (non-spam) mail.

There’s a reason why this standard approach has become a standard: it works. It can be very difficult to get exact numbers, but I estimate that my own ISP (a major U.S. retail internet provider) routinely turns in a false-negative rate well below 20%. That is, for every 100 spams received, fewer than 20 will be put into my normal mail spool. Using a dedicated spam-filtering service (SpamCop) as a second line of defense reduces this rate lower still, to something on the order of 1-2%. So, although I am routinely offered well over 100 spams per day, I generally see only one or two of these showing up in my inbox. There are many days on which I see no spam at all in my inbox.
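The arithmetic behind these figures is worth spelling out. The sketch below uses the round numbers from this section (100 spams offered per day, a 20% ceiling for the ISP filter, 1-2% overall). The second function rests on an assumption of my own, namely that the two filters act one after the other so that their miss rates multiply; it estimates how good the second filter must be to reach the overall figure.

```python
def inbox_spam_per_day(offered_per_day, false_negative_rate):
    """Expected number of spams per day that slip into the inbox."""
    return offered_per_day * false_negative_rate


def second_stage_rate(overall_rate, first_stage_rate):
    """Miss rate needed from a second filter, assuming the two miss rates multiply."""
    return overall_rate / first_stage_rate


offered = 100
print(inbox_spam_per_day(offered, 0.20))  # ISP filter alone, at its 20% ceiling
print(inbox_spam_per_day(offered, 0.02))  # both filters, at the 2% overall figure
print(second_stage_rate(0.02, 0.20))      # share the second filter may let through
```

Under that assumption, the second filter needs to stop roughly nine out of every ten spams that get past the first one.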

Filters that don’t work so well

On the other hand, many small providers, tinkerers, and entrepreneurs concoct their own spam filtering approaches, whether to save money and time, or to implement some whim of the ISP admins or operators. These approaches are often not very effective (and can, in fact, be counterproductive due to high false-positive rates). For example, one can still encounter ISPs that filter spam based mainly on the e-mail address in the From: field, even though these addresses are typically forged and are frequently changed. Others may use proven filtering software (such as SpamAssassin) but fail to configure it properly, allowing too many spams through or blocking too many honest messages. Still others may use DNS blocking lists, but choose lists that are driven by subjective or personal considerations rather than verifiable facts; such lists tend to be inaccurate and too aggressive, resulting in high false-positive rates.

Challenge-response filters (see below) can pretty much be guaranteed to have a false-negative rate of nearly zero; that is, they will generally stop all spam they see. Unfortunately, this comes at the expense of a very high false-positive rate (since all untrusted incoming mail is assumed to be unwanted until a response to the challenge is received). Also, these filters lead to misdirected blowback mail to innocent parties, so they tend to cause more problems for other people than they may solve for their own users.

Evaluating spam filter performance

The two best measures of the effectiveness of a spam filter are, as we noted, the false-positive and false-negative rates. Unfortunately, we can't always see how many mailings the filter has blocked (and we usually don’t want to know about all this spam in the first place), so we can’t compute the false-negative rate. The only figure we see, then, is the number of spam messages that land in our inboxes. Because spam volumes vary widely over time, it can be very deceptive to rely upon this number as a measure of filter performance.

Many people are surprised to find that their incoming spam actually increases after they have started using a filter. Since these folks are already a bit paranoid about spam (hence their interest in filtering), they may conclude that the filter itself has led to the increase, or even that the filter operators have betrayed or exposed their addresses to spammers. In many cases, however, this apparent increase in spam can be attributed to random variations in spam traffic. You will find a more detailed discussion of this question on the FAQ page of this site.
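A small simulation shows how misleading the inbox count can be. Here the filter's miss rate is held constant at a hypothetical 2%, while the daily spam volume (also invented) swings widely. Each spam is let through at random with that probability, so the number reaching the inbox rises and falls even though the filter itself has not changed at all.

```python
import random


def simulate_inbox_spam(daily_volumes, miss_rate, seed=0):
    """For each day's spam volume, count how many spams a fixed-rate filter misses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() < miss_rate for _ in range(volume))
        for volume in daily_volumes
    ]


volumes = [90, 110, 250, 400, 120]  # hypothetical spams offered on five days
leaked = simulate_inbox_spam(volumes, miss_rate=0.02)
for volume, count in zip(volumes, leaked):
    print(f"offered {volume:3d} -> reached inbox {count}")
```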

What about challenge-response filters?

There’s an altogether different type of spam filter in use by some (mercifully not many) folks; it is called the challenge-response filter. I understand how challenge-response (C/R) can be very effective for its users, but it is a nuisance to almost everyone else (except the spammers, who have little to fear from it other than the loss of a few deliveries). Frankly, I do not recommend the use of C/R filtering by normal e-mail users.

In operation, the C/R filter is pretty simple: it examines each incoming e-mail against a set of rules or perhaps a “whitelist” of trusted from-addresses. If the message fails the test, the filter will hold it from delivery to the recipient and will send a “challenge” message to the from-address. The challenge message requires the original sender to take some simple but positive action, such as clicking on a URL, in order for his message to be delivered. If the sender accepts the challenge and makes a response, his message (and any future messages from him) will be immediately released for delivery. In this way, legitimate correspondents can easily get their messages through, while spammers will not (since they won’t respond to the challenge). Sounds cool, right?
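The mechanism just described can be sketched in a few lines. This is a toy model with names of my own choosing, not the design of any real C/R product: it keeps a whitelist, holds mail from unknown senders, records the challenge it would send, and releases the held message when the challenge is answered.

```python
import secrets


class ChallengeResponseFilter:
    """Toy challenge-response filter keyed on the (unverified) from-address."""

    def __init__(self, whitelist=()):
        self.whitelist = {address.lower() for address in whitelist}
        self.held = {}        # token -> (sender, message) awaiting a response
        self.challenges = []  # (address, token) pairs that would be mailed out

    def receive(self, sender, message):
        """Deliver mail from trusted senders; hold and challenge everyone else."""
        sender = sender.lower()
        if sender in self.whitelist:
            return "delivered"
        token = secrets.token_hex(8)
        self.held[token] = (sender, message)
        # The challenge goes to whatever address the message claims to be from.
        self.challenges.append((sender, token))
        return "held"

    def respond(self, token):
        """Handle an answered challenge: release the message and trust the sender."""
        if token not in self.held:
            return None
        sender, message = self.held.pop(token)
        self.whitelist.add(sender)
        return message


crf = ChallengeResponseFilter(whitelist=["edna@example.com"])
print(crf.receive("edna@example.com", "Hello dear"))
print(crf.receive("stranger@example.net", "Long time no see"))
```

Notice that nothing in the receive step checks whether the from-address is genuine; the filter simply mails its challenge to whoever the message claims to be from.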

Well, no.

The problem with C/R filters in general is that they are based on a couple of faulty assumptions:

  1. The from-address in every e-mail will be genuine; it will also point back to the actual sender of the message, and not to an uninvolved third party.
  2. Any legitimate sender will be happy to answer the challenge in order to have his mail delivered.

As we well know, practically no high-volume spammers use valid e-mail addresses that belong to them. They will either make up a from-address, or worse, steal one from their mailing list, one that presumably works and points to an innocent party. Since the C/R filter uses the from-address to send its challenge, it will thus inevitably send out challenge messages to folks who never sent you any mail and have no earthly idea what’s going on. These poor folks may even misinterpret the challenge as some sort of spam or attack, and may (mis)direct complaints accordingly.

“What do I care,” you might ask, “I didn’t get the spam!” Well, it’s true, you did not get the spam, but you have in fact contributed to the more general problem of unwanted “blowback” e-mail. Suppose that an ISP has, say, 10,000 subscribers, and that each of them gets 100 spams per day on average. Now, suppose that the ISP institutes C/R filtering for all of these subscribers. This means that this ISP will be launching at least one million challenge messages per day, of which nearly all will go to innocent strangers. Not a big number compared to the wholesale spammers, but it still adds another brick to the hod.
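The blowback figure is simple multiplication, sketched below with the numbers from the example. The optional "already_trusted_fraction" parameter is my own addition, to show how little the total changes even if a small share of the spam happened to carry a whitelisted from-address and so drew no challenge.

```python
def challenges_per_day(subscribers, spams_each, already_trusted_fraction=0.0):
    """Estimate challenge messages sent per day in response to spam alone."""
    challenged_share = 1.0 - already_trusted_fraction
    return round(subscribers * spams_each * challenged_share)


print(challenges_per_day(10_000, 100))
print(challenges_per_day(10_000, 100, already_trusted_fraction=0.05))
```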

Another problem with C/R filtering is that many people are rather suspicious of challenge messages, and don’t want to play the game. For example, I myself generally do not respond to challenge messages in any but the most exceptional circumstances. If you choose to use C/R filtering, then, you must live with the possibility that many of your correspondents will not want to accept the challenges and will therefore be unable to reach you via e-mail. At the very least, you should make sure that the addresses to which you yourself send mail are added to the C/R filter’s whitelist so that these folks won’t be challenged just for replying to you.

In addition to the above, C/R filters can also be a major source of annoyance to folks with whom you are in indirect mail communications (e.g., on a mailing list). I myself belong to a list on which one former member used C/R filtering on the address he used for the list traffic. Occasionally, when I posted a message, his C/R filter somehow decided that it was spam and duly sent me a challenge message. I tried early on to ask this gent to tame his filters, but the messages never got through (of course not; I hadn’t taken the challenge and had been put in some kind of “penalty box”).

Joining a mailing list with an address protected by a C/R filter is, in effect, telling your list-mates that you do not want to hear from them unless they meet your terms. Not good etiquette, in my book. It may be possible to tune a C/R filter to recognize and pass list mail, but why bother? If you need the protection of the C/R filter for your personal mail, then simply open another e-mail account (e.g., on Hotmail or Yahoo) for your list mail, and do not use C/R with it.




(c) 2003-2008, Richard C. Conner


Updated: Sun, 22 Jun 2008
