Anatomy of an e-mail message

In order to understand how spam works and what can be done to prevent it or report it, you need to know something about how e-mail works. Here, we'll look under the hood of a typical "normal" e-mail message to find out what information it contains and how it can be traced. Once we understand how normal e-mail works, we are in a better position to learn how it can be perverted into spam.

How an e-mail message is sent

You don't normally think about it (unless you happen to work for the Post Office perhaps), but sending a simple postal letter (I won't say "snail mail") can be a fairly involved process. Once you stuff your message in an envelope, stamp it, and put it in your box, the postman will pick it up and carry it to the local post office, whence it goes to a distribution center, then perhaps to other distribution centers, over road or by air, ever closer to its destination. Finally, it reaches another local post office, whereupon a postman at the other end carries it by hand to the recipient and drops it in his box. The e-mail system we currently use works in a perhaps surprisingly similar way.

When you hit the "send" button for an outgoing e-mail, your computer contacts a machine called a "mail host," usually one belonging to your ISP (or your employer, your school, etc.) and uses a protocol known as SMTP (simple mail transfer protocol) to transfer the message. Then, this mail host will contact a mail host in the recipient's domain, again using SMTP to transfer the message. These two mail hosts correspond to the local post offices that we use for postal mail.

Larger ISPs may be set up to shuffle mail around internally to other mail hosts (like the postal mail "distribution centers"), such that your mail may actually get "relayed" within your own ISP's service until it is transferred to a mail host at the recipient's domain. Similar relaying may take place within the recipient's network. Generally, the transfers over the network (and usually within the domains) use SMTP.

Anyway, your message eventually reaches a special mail host (called a mail delivery agent or MDA) at your recipient's domain; there the message will sit until the recipient logs on and picks it up using yet another protocol, either post office protocol (POP3) or internet message access protocol (IMAP). We're not so interested in the picking-up part of the process, since there's little potential for spam to be injected there. Therefore, we're mainly interested in the SMTP "handoffs" from one machine to another as the message makes its way from the sender to the recipient.

The message

I sent the following message from my work account to myself at home. It represents a typical simple text e-mail from one person to another. I have removed certain routing information and e-mail addresses to protect myself from spambot harvesting, but the message is otherwise complete, and would be deliverable. Note that this is the full e-mail packet, not just the parts you would normally see in your mail program (you might as well get used to this cryptic stuff; there will be plenty of it on this site).

From <<my-work-address>> Sat Aug 17 16:00:24 2002
Return-Path: <<my-work-address>>
Received: from exanpcn4.arinc.com ([144.243.4.70]) by mta009.verizon.net
     (InterMail vM.5.01.05.09 201-253-122-126-109-20020611) with ESMTP
     id <20020817200009.CWZT20372.mta009.verizon.net@exanpcn4.arinc.com>
     for <<my-home-address>>; Sat, 17 Aug 2002 15:00:09 -0500
Received: from exanpcn2.arinc.com (unverified) by exanpcn4.arinc.com
     (Content Technologies SMTPRS 4.1.5) with ESMTP
     id <T90f3203cca5cc55c0da9@exanpcn4.arinc.com> for <<my-home-address>>;
     Sat, 17 Aug 2002 16:02:15 -0400
Received: by exanpcn2.arinc.com with Internet Mail Service (5.5.2653.19)
     id <QRZ549XW>; Sat, 17 Aug 2002 16:00:27 -0400
Message-ID: <09328AED5429D311A3000008C7911B100778B52C@exanpmb1.arinc.com>
From: "Conner, Richard C. \\(RCONNER\\)" <<my-work-address>>
To: "my-home-address" <<my-home-address>>
Subject: Hello
Date: Sat, 17 Aug 2002 16:00:26 -0400
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain

How are you? I am fine. Now, get back to work.

-- Rick

This is a standard internet e-mail message following the format of RFC-2822 (the internet standard for e-mail message composition). It consists of two parts: a header (in teal), and the body (in red). The header is separated from the body by one blank line, per RFC-2822.
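As a rough illustration of that header/body split, here is a minimal Python sketch using the standard `email` module. The raw text is a shortened stand-in for the message above, and the two addresses are made-up placeholders (the real ones are hidden on this page).

```python
from email import message_from_string

# Shortened stand-in for the sample message; addresses are placeholders.
RAW = (
    "Return-Path: <work@example.com>\n"
    "Received: by exanpcn2.arinc.com with Internet Mail Service (5.5.2653.19)\n"
    "     id <QRZ549XW>; Sat, 17 Aug 2002 16:00:27 -0400\n"
    "From: work@example.com\n"
    "To: home@example.net\n"
    "Subject: Hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "How are you? I am fine. Now, get back to work.\n"
)

msg = message_from_string(RAW)

# Everything before the first blank line is parsed as header fields ...
header_fields = msg.items()
# ... and everything after it is the body.
body = msg.get_payload()

for name, value in header_fields:
    print(f"{name}: {value}")
print("---")
print(body)
```

The parser finds the boundary the same way a mail program does: the first empty line ends the header.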

Let's look at each of these two parts.

The header

Every e-mail must have at least a rudimentary header. The header contains a variety of information about the message that is typically of no interest to the recipient (at least not for a routine non-spam message); this information is generally written and read by mail-transfer software during the mail transfer process, and therefore provides a sort of audit trail for the message. Most user mail programs hide the header by default (except for the familiar "From," "To," "Date," and "Subject" fields, and the "Cc" ("carbon copy") field, if any).

In order to analyze and report a spam e-mail, you will have to figure out how to display all headers of the message (including, particularly, the routing lines that begin with "Received") so that you can include them in your report. No one can act on your report without this information. Unfortunately, the instructions for revealing the header vary from one program to the next, and some Microsoft mail applications (the MS Exchange-based business version of Outlook, in particular) make it almost criminally difficult to reveal headers. If you need instructions for how to show headers, try this page from the SpamCop FAQ.

Now, let's go line-by-line through the header of the message above and see what's going on:

Preliminaries

From <<my-work-address>> Sat Aug 17 16:00:24 2002
Return-Path: <<my-work-address>>

The first line is missing a colon, so it probably won't parse as an actual header line, and may not be trustworthy. The second line is the "Return-Path:" field; this optional field, if it is present in an unforged header, is a fairly trustworthy identification of the address of the account from which the message originated (I hid the address, so that web-crawling "spambots" cannot see it if they check this page).

I say that the Return-Path is "trustworthy" because it is explicitly collected by the mail agent that first picked up the mail for sending, and represents the address given to the outgoing mail host during authentication. It isn't completely trustworthy, since it is technically possible to "game" it (even in legitimate, non-spam mail). The "From:" line (with colon) that we'll see a few lines down is actually not a trustworthy source for the original address. Note that if you find the header to be forged in any way, you should not trust even the Return-Path address, since it may be forged as well. Now, on to the next three lines.

Routing info

Received: from exanpcn4.arinc.com ([144.243.4.70]) by mta009.verizon.net
     (InterMail vM.5.01.05.09 201-253-122-126-109-20020611) with ESMTP
     id <20020817200009.CWZT20372.mta009.verizon.net@exanpcn4.arinc.com>
     for <<my-home-address>>; Sat, 17 Aug 2002 15:00:09 -0500

Received: from exanpcn2.arinc.com (unverified) by exanpcn4.arinc.com
     (Content Technologies SMTPRS 4.1.5) with ESMTP
     id <T90f3203cca5cc55c0da9@exanpcn4.arinc.com> for <<my-home-address>>;
     Sat, 17 Aug 2002 16:02:15 -0400

Received: by exanpcn2.arinc.com with Internet Mail Service (5.5.2653.19)
     id <QRZ549XW>; Sat, 17 Aug 2002 16:00:27 -0400

These three lines describe the routing of the message; that is, the path that the message took from one mail host to another on its way from my office to my home account. Each "Received" line represents one handoff (using SMTP) between machines, and the closer to the top of the message a "Received" line is, the later in the sequence it falls. As each new host receives the message during its travels, the host will add its own routing information to the top of this stack. Here's how we can read this information:

Together, then, the received lines should form an unbroken chain. To make this more clear, we can simplify these lines as follows:

You can see, then, that by following the lines from bottom to top, you can trace the progress of the message all the way from host-B to host-D (there is a "host-A" implied, namely the computer on my desk at work, but this machine is not reported in the header—probably for security reasons).
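The bottom-to-top reading can be sketched in a few lines of Python. This is only an illustration: `parse_hop` and `trace` are hypothetical helper names, the regular expressions are deliberately loose, and the three lines are the sample's Received lines trimmed to their "from" and "by" parts.

```python
import re

# The three Received lines from the sample, top line first,
# unfolded and trimmed to the parts used here.
RECEIVED = [
    "from exanpcn4.arinc.com ([144.243.4.70]) by mta009.verizon.net with ESMTP",
    "from exanpcn2.arinc.com (unverified) by exanpcn4.arinc.com with ESMTP",
    "by exanpcn2.arinc.com with Internet Mail Service (5.5.2653.19)",
]

def parse_hop(line):
    """Return (from_host, by_host); from_host is None when not recorded."""
    from_match = re.search(r"\bfrom\s+(\S+)", line)
    by_match = re.search(r"\bby\s+(\S+)", line)
    return (
        from_match.group(1) if from_match else None,
        by_match.group(1) if by_match else None,
    )

def trace(received_lines):
    """List the hops in the order the message travelled (bottom to top)."""
    return [parse_hop(line) for line in reversed(received_lines)]

for sender, receiver in trace(RECEIVED):
    print(f"{sender or '(not recorded)'} -> {receiver}")
```

Read this way, the first hop has no recorded sender (the desktop machine that is missing from the header), and each later hop starts at the host where the previous one ended.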

Now, let's take a closer look at one of these lines (the second) to decode some of the excess gobbledegook:

Received: from exanpcn2.arinc.com (unverified)
     by exanpcn4.arinc.com
     (Content Technologies SMTPRS 4.1.5)
     with ESMTP id <T90f3203cca5cc55c0da9@exanpcn4.arinc.com>
     for <<my-home-address>>; Sat, 17 Aug 2002 16:02:15 -0400

To review, this particular record represents an internal relay handoff within my company's mail system. Here's how to translate this line into something approximating English:

Now, let's break these entries down one at a time:

from [sending-host's-name] [sending-host's-address]

This field identifies the host (computer) that provided (sent) the message. Normally, at least the name should be given. This is the name that the sending host reported when it signed on; it is also known as the "HELO" (short for "hello"), after the SMTP statement in which it is sent. The HELO by itself is not at all trustworthy:

It's good practice (not always followed) for the receiving host to record the IP address of this sending host, along with the HELO. In the other Received lines, you can see where this has been done.

In the line we're looking at, host exanpcn4.arinc.com did not bother to record the IP address. This isn't a big deal here, since this is an "internal" relay within the arinc.com domain, and exanpcn4 can therefore presumably trust exanpcn2. It's arguably a waste of bandwidth and a mild security risk to reveal this host's IP address in this case.

As an experiment, you might like to try an nslookup on the host names and IP addresses that appear in the headers of your own e-mail messages. What you should find is that the host name resolves to the given IP address, i.e. "host-name = host-address." If this doesn't turn out to be the case, you may be looking at a forged record.
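If you would rather script this check than run nslookup by hand, a minimal Python sketch might look like the following. `name_matches_address` is a hypothetical helper; it uses the standard library's `socket.gethostbyname_ex` by default, and accepts a stand-in resolver so the example can run without a network connection. The host names and addresses in the demonstration are reserved documentation values, not real mail hosts.

```python
import socket

def name_matches_address(host_name, recorded_ip, resolve=socket.gethostbyname_ex):
    """Return True or False if the name resolves, None if it cannot be resolved."""
    try:
        _name, _aliases, addresses = resolve(host_name)
    except OSError:
        return None  # no answer, so we can't say either way
    return recorded_ip in addresses

# Stand-in resolver for the demonstration (an assumption, not real DNS data).
def fake_resolver(name):
    table = {"mail.example.com": ["192.0.2.10"]}
    if name not in table:
        raise OSError("unknown host")
    return (name, [], table[name])

print(name_matches_address("mail.example.com", "192.0.2.10", fake_resolver))     # True
print(name_matches_address("mail.example.com", "198.51.100.7", fake_resolver))   # False
print(name_matches_address("nowhere.example.org", "192.0.2.10", fake_resolver))  # None
```

Keep in mind that DNS records change over time, so a mismatch on an old message is a reason to look closer rather than proof of forgery.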

by [receiving-host's-name]

This field identifies the host to which the message is being handed off. There's no need to supply the address here (since this host should presumably trust itself), but this host MUST use this same name when it transfers the message to the next host (in the next line up) so that the "chain" is preserved. Not to do so is clear evidence of header tampering.
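The chain rule described here (each host's "by" name should reappear as the "from" name one line up) can be checked mechanically. The sketch below assumes the hops have already been pulled out of the Received lines as (from-host, by-host) pairs; `chain_breaks` is a hypothetical helper, and the names in the second example are invented for illustration.

```python
def chain_breaks(hops):
    """hops: (from_host, by_host) pairs in header order, top line first.

    Returns (lower_by, upper_from) pairs where the two names disagree.
    """
    breaks = []
    for upper, lower in zip(hops, hops[1:]):
        upper_from = upper[0]
        lower_by = lower[1]
        if upper_from is None:
            continue  # nothing recorded to compare against
        if upper_from.lower() != lower_by.lower():
            breaks.append((lower_by, upper_from))
    return breaks

# Hops from the sample message, top line first.
sample = [
    ("exanpcn4.arinc.com", "mta009.verizon.net"),
    ("exanpcn2.arinc.com", "exanpcn4.arinc.com"),
    (None, "exanpcn2.arinc.com"),
]
print(chain_breaks(sample))  # []

# An invented header where the lower line does not connect to the upper one.
suspect = [
    ("relay.example.com", "mta009.verizon.net"),
    ("mail.example.org", "somewhere.example.net"),
]
print(chain_breaks(suspect))  # [('somewhere.example.net', 'relay.example.com')]
```

An empty result only says the names line up; it does not prove the lines are genuine, since a careful forger can make bogus lines agree with one another.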

[software-used]

This field identifies the mail-host software used to transfer the message. Most of these are programs you will never have heard of, and there isn't much info here for the spam-hunter to work with, although if you are an expert you can occasionally spot buggy or misconfigured versions of particular software packages being used as relays.

with [message-ID]

This field provides the message ID that the receiving host has assigned to this transaction. This ID is used mainly to tell possibly-identical messages apart. There may be two message IDs in the header, by the way: one assigned at the sending end (by the outgoing mail host), and one at the receiving end (by the incoming mail exchanger host or MX).

[protocol]

This line gives the protocol used to transfer the mail. For nearly all transfers across the internet, this protocol will be either SMTP or ESMTP ("extended SMTP"); you may see different protocols used in internal relays.

for [recipient's-address];

This field seems to be optional, and usually appears in the earliest Received line (i.e., the lowest on the list). It gives the e-mail address that the sender intends the message to be delivered to. Do not confuse the for-address with the To: field, which we will get to shortly. If this information appears, it is usually trustworthy, but many mail hosts don't bother to include it. Again, if the header is found to be forged, you should not trust this field. This field, by the way, has a semicolon at the end, which marks the end of the "Received" clause (but not the end of the header line).

[date] [time] [time-zone-offset]

This field simply gives the time and date (plus the time zone difference from GMT or UTC) at which the by-host received the message. Note that this is taken from the by-host's own onboard clock, which may or may not be properly set (it's fairly common to receive grossly misdated messages from computers with bad time). Sometimes, when forging a header, spammers may get some portions of this wrong (such as by giving a time-zone offset that places them in the middle of some ocean, or by just generally mangling this data due to poor understanding of time/date computations).
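Comparing date stamps is easier once they are all converted to UTC. Here is a minimal Python sketch using the standard library's `email.utils.parsedate_to_datetime`, applied to the three stamps from the sample message; `to_utc` is a hypothetical helper name.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Date stamps from the sample's three Received lines, bottom line first.
STAMPS = [
    ("exanpcn2.arinc.com", "Sat, 17 Aug 2002 16:00:27 -0400"),
    ("exanpcn4.arinc.com", "Sat, 17 Aug 2002 16:02:15 -0400"),
    ("mta009.verizon.net", "Sat, 17 Aug 2002 15:00:09 -0500"),
]

def to_utc(stamp):
    """Parse an RFC-2822 date string and convert it to UTC."""
    return parsedate_to_datetime(stamp).astimezone(timezone.utc)

previous = None
for host, stamp in STAMPS:
    when = to_utc(stamp)
    note = ""
    if previous is not None and when < previous:
        note = "  <- earlier than the previous hop"
    print(f"{host:22} {when:%H:%M:%S} UTC{note}")
    previous = when
```

Even in this perfectly legitimate message, the last host's stamp works out to 20:00:09 UTC, slightly earlier than the 20:00:27 UTC recorded by the first host. Small disagreements like this are ordinary clock drift; it is the gross or nonsensical ones that deserve suspicion.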

Message ID

Now, we move out of the routing section and on to the next header line:

Message-ID: <09328AED5429D311A3000008C7911B100778B52C@exanpmb1.arinc.com>

This line gives the unique message ID that the by-host (here, the mail transfer agent for my work address) has assigned to the outgoing message. This ID has no particular meaning; it is simply provided as an easy means to tell possibly-identical messages apart. Note that this particular message actually has three different message-IDs: one corresponding to each by-host that received the mail along its path from work to home.

Visible header info

The next four header lines are the familiar ones you see in your mail reading program at the top of each message.

From: "Conner, Richard C. \\(RCONNER\\)" <<my-work-address>>
To: "my-home-address" <<my-home-address>>
Subject: Hello
Date: Sat, 17 Aug 2002 16:00:26 -0400

Surprisingly, this is probably the least trustworthy portion of the header, because neither the "To:" address nor the "From:" address is actually used in the mail transfer process. In fact, none of these fields are required by RFC-821 (SMTP) for the transfer of mail, although they are required in RFC-2822 (which tells us only what a mail message should look like). I frequently get vestigial mail messages that have none of these fields at all (I suspect they are list-laundering probes or test runs for spam).
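To see why the "From:" and "To:" lines play no part in delivery, it helps to look at what a client actually says during a bare-bones SMTP session. The Python sketch below only builds the client's side of the conversation as a list of text lines; it does not connect to anything. `smtp_client_lines` is a hypothetical helper, every address is a placeholder, and server replies and the dot-stuffing of body lines are left out for brevity.

```python
def smtp_client_lines(helo_name, envelope_from, envelope_to, message_text):
    """Return the lines a client sends in a bare-bones SMTP session."""
    lines = [
        f"HELO {helo_name}",             # the name the sender claims for itself
        f"MAIL FROM:<{envelope_from}>",  # envelope sender
        f"RCPT TO:<{envelope_to}>",      # envelope recipient: where it really goes
        "DATA",
    ]
    lines.extend(message_text.splitlines())  # header and body travel as plain data
    lines.append(".")                        # a lone dot ends the message
    lines.append("QUIT")
    return lines

message = (
    "From: someone.else@example.org\n"
    "To: undisclosed@example.org\n"
    "Subject: Hello\n"
    "\n"
    "How are you?\n"
)

session = smtp_client_lines(
    "host.example.com", "sender@example.com", "victim@example.net", message
)
for line in session:
    print(line)
```

The delivery address comes from the RCPT TO command. The "From:" and "To:" lines inside the DATA section are just text that the receiving host stores, which is why they can say anything at all.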

Until fairly recently, most consumer-grade mail programs relied upon checking of the "From:" address as their principal (or only) means of spam filtering. Spammers, however, can insert any address they want into the "From:" field, including addresses they've made up or (worse) the addresses of innocent third parties. Therefore, doing anything at all with the from-address of a suspect message is pretty much a waste of time. Because these addresses are bogus, you should avoid replying to spam, at least by hitting the "reply" button of your mail program. See my page on mail filtering for more information on using your e-mail program to detect spam.

MIME labels and X-records

Finally, the last lines of the header:

MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain

The final lines in a mail header may offer useful data in the spam hunt. In this case, these records were placed by the original mail host (the one that received the message from my work computer), and they identify the MIME-type of the data in the body, and the software used to transfer the message. Note that many mail agents and mail handling services will add "experimental" records beginning with "X-" (e.g., the X-Mailer: line here); these may contain information of value (for example, the SpamAssassin filter will add them to report on its analysis of the message), but they are not required portions of the header.

What's a "forged header?"

I've referred several times to the notion of a forged header. What does this mean? Simply put, it means that the spammer has tampered with the header information so as to throw you off his track. Forging usually takes either of two forms:

  1. Giving a false host name (or "HELO") when the message is transferred, and
  2. Adding bogus routing lines to give a false history to the message.

#1 is easy to do with the right software; mail hosts generally don't care about the HELO names they're given by hosts who want to leave mail for them, although they may also (and should) capture the IP address of the machine from the underlying socket connection.

#2 is a bit harder, and requires a cooperative computer to send the mail—either an "open relay" or an "open proxy."

An open relay mail host is one that will accept mail from outsiders for delivery to outsiders. Such a host doesn't care who you are when you send mail, and likewise doesn't care where you send the mail. This is kinda like the old spy-movie routine where a smuggler plants items on an innocent traveler for delivery to a third party somewhere else in the world.

Indiscriminate relaying was quite widespread in the earlier days of e-mail, before spam developed into the threat that it poses today. More recently, however, system admins have wised up and realized the dangers of accepting relay traffic; they've configured their mail hosts NOT to relay mail—you must send your mail from a machine within the same domain as the mail host. This is why, for example, you usually cannot use your LAN-connected office computer to send mail from your modem-connected AOL account (unless you use telnet, web mail, a virtual private network, or some other ruse). See my page on spammer tricks for more info about open relays.

As the supply of open relays has dried up in recent years, spammers have found a more effective alternative: the open proxy or "zombie." Zombie computers are usually Windows machines belonging to innocent and unwitting home users who have allowed special software to be implanted on them (through viruses, trojans, etc.). These machines, when connected via broadband DSL or cable modem, can send spam all day long, most likely without the knowledge of the owner. Such messages are pretty much impossible for the recipient to trace any further back than the zombie machine itself. Since the zombie can be programmed to send its spam using "direct-to-MX," as the term goes, there will be no traces of the outgoing mail in the mail logs of the zombie's ISP. See my page on spammer tricks for more info about open proxies.

Once a spammer has found a cooperative open relay or open proxy, he can use the right software to fabricate a bogus routing history for his messages. Naive spam-hunters may simply go all the way back to the earliest of these bogus records and file their complaints with innocent (or possibly non-existent) ISPs.

The key to identifying forged headers, then, is to verify the chain of mail transfer from host to host by using the host-name and address information in the Received lines. At each point, you should be able to verify that the given host name resolves to the given IP address. If it does not, you almost certainly have a forgery. If the IP address of any host isn't provided, you may have a forgery, but you can't be sure since you lack the necessary information.

As we saw above, it is also possible to "forge" a header by supplying a bogus from-address or to-address, but these are considered less serious infractions than tampering with the routing chain. Indeed, many mail applications do fiddle with these addresses for benign reasons.

The body

After wading all the way through the header of this message, we find the body to be rather anticlimactic:

How are you? I am fine. Now, get back to work.

-- Rick

This is, of course, a plain text message (MIME-type "text/plain" as shown in the raw message at the top of this page). Had I wanted, I could have used Microsoft Outlook (my company's standard mail program) to format this as an HTML message, which would have allowed me to use various fonts, colors, icons, images, backgrounds, sounds, video, and other niceties to decorate the message (as well as to swell it up to megabyte size). Unfortunately, HTML e-mails can also contain jolly tidbits like the following (see my page on spammer tricks for more info):

For these and other reasons, I am very skeptical about the use of HTML in e-mails, and I avoid it in my own outgoing mail wherever possible. This, however, is an uphill battle and I won't belabor it here.

In order to analyze an HTML message body properly, you'll have to view its "markup" or source code. If the message has already been downloaded to your computer, you can do this by bringing the message up for view and then right-clicking on it (or, if you have a Mac, control-clicking) and selecting a "view source" or similar command from the popup menu. If this doesn't seem to work, search your program's menus for a view-source command, or consult your program's online help or documentation. Once you have this source code in front of you, you can save it to a disk file for later use if you need to.

Occasionally, a spammer will try to prevent you from viewing the source by using an annoying little JavaScript trick to disable the right-click function. The view-source command elsewhere in your program should still work; or, if it does not, try saving the message in "raw" or HTML form (NOT "text") to a text file, then opening it with a text editor.

If you use SpamCop or a similar service that quarantines your spam (i.e., does not allow it to be downloaded to your computer), you will often see message bodies encoded using the base64 algorithm (look at sample spam #1 for an example of what this looks like). If you need to decode such data, find an appropriate shareware or freeware utility (or try the "mimencode -d" command if you have a Unix-like system).
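If you have Python on hand, its standard `base64` module can do the same decoding. The sketch below is only an illustration: `decode_base64_body` is a hypothetical helper, the sample text is encoded on the spot rather than taken from a real spam, and the UTF-8 character set is an assumption (a real message names its character set in the Content-Type line).

```python
import base64

def decode_base64_body(encoded_text, charset="utf-8"):
    """Decode a base64 message body; line breaks in the input are ignored."""
    raw_bytes = base64.b64decode(encoded_text)
    # Replace undecodable bytes instead of failing, since spam is often sloppy.
    return raw_bytes.decode(charset, errors="replace")

# Build a sample encoded body the way a mail program would (wrapped in lines).
original = "How are you? I am fine. Now, get back to work.\n" * 3
encoded = base64.encodebytes(original.encode("utf-8")).decode("ascii")

print(encoded)
print(decode_base64_body(encoded))
```

Decoding this way does not render or execute anything, so you can read the result in a plain text editor.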

Craftier spammers will use obscure JavaScript code to mangle the message body, making it unreadable to you. Unless you are willing to load and view the message, or else have the secret decoder ring, you should treat these bodies very gingerly and carefully scrutinize any information you get from them.

In addition to the message text itself, a message body may include one or more attachments. These attachments may contain more text or HTML, or may be composed of binary data (like a computer program 'executable' or an image file). PC viruses are commonly transmitted in this way. Refer to the MIME RFCs (beginning with RFC-2045) for more information on how attachments work.
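To get a feel for how a message with attachments is put together, the Python sketch below builds a small two-part message with the standard `email` module and then lists its parts without opening any of them. The file name and the three bytes of "binary data" are invented for the example, and `list_parts` is a hypothetical helper.

```python
from email.message import EmailMessage

def list_parts(message):
    """Return (content_type, filename) for every non-container part."""
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart wrapper itself
        parts.append((part.get_content_type(), part.get_filename()))
    return parts

msg = EmailMessage()
msg["Subject"] = "Hello"
msg.set_content("How are you? I am fine.")
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="example.bin",
)

for content_type, filename in list_parts(msg):
    print(content_type, filename)
```

Listing the declared types and file names is a safe first look, but remember that these labels are supplied by the sender and can be misleading.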

Summary

Here's what I've tried to put across on this page:

Now that you've had the cheap tour of header analysis, you can see some real-world examples elsewhere on the site.


(c) 2003-2006, Richard C. Conner

Updated: Fri, 02 Feb 2007