Exposing the bodies of spam e-mails

What we’re doing on this page: We are exposing the contents of the body of a spam message in a form that we can read and analyze. This may involve the extraction and decoding of MIME parts within the message.

Earlier, we saw how to expose and analyze the header of a spam message, and to find the parties responsible for the mail hosts that sent us the spam. In many cases, you will be finished at this point (after you file a report, that is).

If, however, you also want to report the websites (and other resources) that are linked in spam messages, or if you just want to see what kinds of tricks the spammer is up to, you’ll have to fully expose the body of the message, including any plain-text or HTML portion that carries the spam message, and any images (or other binary files) that support it.

On this page, we’ll take a fairly comprehensive look at deconstructing the raw message body into the important parts. Most of this involves standard e-mail formatting and encoding procedures that are used every day in non-spam mail, although spammers have some scope to abuse these techniques for purposes of evasion and deception.

Viewing the body

Obviously, if you bring a spam message into view with your mail program, you are “viewing the body” of the message. It may not surprise you to learn, however, that when you deal with spam, what you see ain’t always exactly what you got. You can never really be sure what’s going on until you expose the raw source of the message body.

The steps you take for this depend upon where the spam currently resides: either on a mail server somewhere (i.e., you haven’t downloaded it yet), or else on your computer.

If the mail is still in “cyberspace”

If your spam hasn’t yet been downloaded to your computer, and you’re viewing it from a spam filter service (like SpamCop) or a webmail service (like Mail2Web (http://mail2web.com/)), you may be able to view the raw source using a feature of this service (like the “Preview” link associated with each SpamCop held mail). You can usually save this raw source by saving the web page where it appears, or (simpler) by copying and pasting the data into a text editor.

If the mail is already on your computer

If, on the other hand, you’ve already downloaded the spam into your mail program’s inbox, you can see the raw body by bringing the message into view and then finding your mail program’s command for “view source,” “view raw text,” etc. In many programs, you can simply right-click the mouse somewhere over the message in order to get a pop-up menu that has a “view source” command on it.

Often, spammers will try to stop you from viewing the source of their messages using a JavaScript that “intercepts” your right-click and stops the pop-up menu with its view-source command. This trick only works on certain mail programs; if it works on yours, you’ll have to find an alternative means of exposing the body.

Body types

Once you’ve exposed the raw body and saved it to a file, the first thing to look for is its type. By “type,” I mean the structure and format of the message body. Some spammers still use simple plain-text messages, but most have moved on to more advanced (i.e., MIME) body types.

Plain text (non-MIME) message

According to RFC2822, the standard that dictates the format of e-mail message packets on the internet, an e-mail message is required to contain only plain text in the “plain old” ASCII character set.

If a message contains only data of this type, then it can be sent without MIME or encoding information; this results in messages that have the smallest total size (often less than 1 kByte per message, including header). Here’s a sample plain-text chickenboner spam:

From address hidden Fri Nov 4 18:18:20 2005
Received: from DM (70.52.108.55)
   by sv12pub.verizon.net (MailPass SMTP server v1.2.0 - 080905135255JY+PrW)
   with SMTP id <1-8069-117-8069-20041-3-1131122073> for address hidden;
   Fri, 04 Nov 2005 10:34:35 -0600
Received: from doom.com.ar (HELO doom.com.ar)
   by mta115.248.115.156.mail.scd.doom.com.ar with SMTP; Fri,
   04 Nov 2005 17:25:41 +0100
Date: Fri, 04 Nov 2005 14:27:41 -0200
From: "Rolando McCarthy" address hidden
Subject: Dirt on your b-oss
Sender: address hidden
X-Sender: address hidden
To: address hidden
Message-id: <51.27029@mail.doom.com.ar>
        << one blank line after end of headers  



Good day sir,

Wou.ld you like 15OO to 35OO per day just for ret-urnin-g phon.e call.s?

Give Us A Ca.l-l - 1-800-839.9032

If you have a te.leph.one and can re.turn calls you are fully quali-fie-d.

Bye,
Rolando McCarthy
address hidden

Note that this is the full mail packet, including the headers. As required by RFC2822, a single blank line is added after the end of the header to separate it from the body (in this case, there are three more blank lines immediately after the first one, but these become a literal part of the message).
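The header/body split described above is easy to reproduce by hand. Here is a minimal Python sketch, assuming you have the raw message as a text string; the function name split_message and the two-line sample header are my own illustrations, not part of any standard tool.

```python
def split_message(raw):
    """Split a raw mail packet into (header, body) at the first blank line."""
    # Mail on the wire uses CRLF line endings; saved files often use LF.
    text = raw.replace("\r\n", "\n")
    # The first empty line ends the header; everything after it is body.
    header, _, body = text.partition("\n\n")
    return header, body


# Hypothetical sample: one blank separator line plus three extra blank lines,
# as in the chickenboner spam above.
sample = "Subject: Dirt on your b-oss\nTo: someone\n\n\n\n\nGood day sir,\n"
header, body = split_message(sample)
print(header.splitlines())
print(repr(body))
```

Note that the three extra blank lines stay at the front of the body, just as they do in the real message.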

Originally, most spam was of this type (as was most e-mail in general). It has the advantage of being very compact in size (so the spammers can make the most efficient use of their stolen bandwidth), but it has the disadvantage (for spammers, at any rate) that you can’t really hide anything in, or do many tricks with, a plain-text message. Any website URLs and e-mail addresses in the body are thus easily spotted and dealt with. More advanced ruses, like markup encryption and unnecessary encoding, are simply not possible. Most spammers, like most other e-mailers, have now moved on to MIME-typed messages.

MIME-typed messages

In the early days of internet e-mail, people were so deliriously happy just to be able to exchange plain text messages (which amount to little more than telegrams) that these simple messages sufficed for a time. However, people soon found the need to be able to do things like:

To meet these needs, a standard for multipurpose internet mail extensions (MIME) was developed and published in RFC2045 and related documents. Since then, MIME has been adopted by nearly all mail programs and mail systems; your own mail program probably composes your mail in MIME by default (even for “plain text” messages).

Simply put, MIME defines what a raw message should look like if it contains anything other than ASCII text. MIME requires that the originating machine (1) put certain information in the mail packet (both in the header and in the body) and (2) follow certain procedures in inserting the content into the body (for “encoding” the non-ASCII data, and marking the boundaries between parts of the body).

A MIME message can be spotted by the declaration (“MIME-version: 1.0”) in the header. Further MIME instructions will be found both in the header and in the body, as we’ll see below.

The first thing to look for in a MIME message is its overall MIME content type; this will be found in the message header (i.e., not in the body) in the Content-Type: line. For most spam, this will be one of the following:

The text/plain and text/html types will usually specify a character set (in the charset= clause). By default, us-ascii is assumed, but other character sets may be specified instead, and by looking at these we can sometimes pick up some evidence about the origins of a spam message. Here are some of the more frequently-seen character encodings:

(here’s a Wikipedia article that gives a good overview of these and other common character sets)

Moving on from character sets, we now look at the encoding of non-ASCII data. If the MIME part contains something other than ASCII text, the part may have to be encoded (as I’ll describe shortly); if so, then MIME calls for a Content-Transfer-Encoding: line to be included in order to specify what type of encoding has been used (so that the recipient’s mail program can know how to undo the encoding).

Here’s an excerpt from a lightly-MIME’d “phishing” spam with the relevant MIME stuff highlighted:

Date: Thu, 27 Oct 2005 17:57:10 -0700
From: "PayPal" address hidden
Subject: Important Notification
X-Originating-IP: [216.154.195.49]
X-Originating-IP: [65.118.246.25]
Cc: address hidden
Message-id: <200510271957177.SM03704@User>
MIME-version: 1.0
X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
Content-type: text/plain; charset=Windows-1251
Content-transfer-encoding: 7bit
X-Priority: 3
X-MSMail-priority: Normal

Dear PayPal Customer,

You have received this email because we have strong reason to believe that your PayPal account had been recently compromised. [...remainder snipped...]

Here, we see that the sender has composed the message in Windows-1251, which gives pretty strong evidence that the author is Russian (or from some other Cyrillic-writing nation), even though the message text contains only Latin characters from this set. The body has been “encoded” as 7bit (which isn’t really encoding at all; this just tells the recipient’s mail program that only 7-bit characters — that is, those found in ASCII — are used in the message).
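If you would rather not pick through the header by eye, the same three MIME facts (content type, character set, and transfer encoding) can be pulled out with Python’s standard email package. This is only a sketch: the function name describe is my own, and the sample is a trimmed copy of the header lines shown above.

```python
from email import message_from_string

# Trimmed copy of the phishing sample above (only the MIME-related lines).
RAW = """\
MIME-version: 1.0
Content-type: text/plain; charset=Windows-1251
Content-transfer-encoding: 7bit

Dear PayPal Customer,
"""


def describe(raw):
    """Report the overall MIME properties declared in a message header."""
    msg = message_from_string(raw)
    return {
        # Header names are matched case-insensitively.
        "is_mime": msg["MIME-Version"] is not None,
        "content_type": msg.get_content_type(),
        "charset": msg.get_content_charset(),
        # With no Content-Transfer-Encoding line, 7bit is assumed.
        "encoding": (msg["Content-Transfer-Encoding"] or "7bit").lower(),
    }


print(describe(RAW))
```

The email package reports the type and charset in lowercase, so “Windows-1251” comes back as “windows-1251”.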

Multipart MIME messages

MIME allows you to include more than one “chunk” of data (or “object”) in a single e-mail message. You use this feature each time you attach a picture (or a spreadsheet, mp3 file, etc.) to your outgoing messages (i.e., each picture or other attachment will live in its own MIME part, and will be given suitable encoding). The “multipart” mechanism is also used to create alternate versions of a message (e.g., one version as text/plain and another, with HTML decoration, as text/html). Many of the more elaborate spams exploit multipart MIME to deliver images with their messages, or to camouflage the messages using neutral text that won’t be seen by the reader.

Multi-part MIME messages are identified by the “Content-type: multipart/...;” header line, and by the specification of “boundary” strings (gibberishy-looking strings that are used as markers to divide the parts of the message). Multipart blocks can be nested; that is, you can create a multipart block that contains one or more other multipart blocks.

Let’s look at an actual sample of a common type of multipart-MIME spam, one that contains (1) a plain-text message part, (2) an HTML alternative message part, and (3) an image file. The MIME-related “administrative” stuff (i.e., specs and part boundaries) consists of the Content-... lines and the lines that begin with the boundary strings. Even though this body may appear at first to be rather complicated, it is actually pretty easy to understand, and it is worth understanding because this pattern accounts for a large proportion of spam sent these days.

Date: Sat, 29 Oct 2005 23:51:55 -0800
From: "Lula Sinclair" address hidden
Subject: On bring of textile
To: address hidden
Message-id: <VWGRCUX-7474641321186@rcsjsh.cheneybrothers.com>
MIME-version: 1.0
Content-type: multipart/related; boundary=--MvlPLbFwdlLBSS01TddA


This is a multi-part message in MIME format.

----MvlPLbFwdlLBSS01TddA
Content-Type: multipart/alternative;
boundary="--y5eehnvlLeUNdfolD"

----y5eehnvlLeUNdfolD
Content-Type: text/plain;
charset=iso-8859-1
Content-Transfer-Encoding: 7bit


hart Danl, my good man, said she, you must eat and drink, and keep

----y5eehnvlLeUNdfolD
Content-Type: text/html;
charset=iso-8859-1
Content-Transfer-Encoding: 7Bit

<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
</HEAD>
<BODY>

[...snipped...]

</BODY>
</HTML>


----y5eehnvlLeUNdfolD--

----MvlPLbFwdlLBSS01TddA
Content-Type: image/gif;
name="Jtiay.GIF"
Content-Transfer-Encoding: base64
Content-ID: <0a4e01c5252c$43bb7fd0$21eba6aa@cheneybrothers.com>

R0lGODlhSAEuAfcAAAAAAIAAAACAAICAAAAAgIAAgACAgMDAwMDcwKbK8AAAAP
LGCQVTDSEYZDREIGFPHBXTGXUXVQJOSOEDTJGCWERPZGSRCQQBEURYOWHBYCCF
UTUIPKNRSUPOZRYXOOWPLDPSHOPCROZWCZMVRVHSNEAELOAFWMRPPBLFHVHJUQ
DYUYYCCLATZSXPHFXONJDNPCPHNATEZYHYYCWDREZGOBUJJVILWRACRGRWIIKA
CHFVQBISIFHSJEJECAFUGABFTOJIGQJCZVHZVVLXVODELZCVVSCMQGKUCQPPDZ
[...snipped...]

----MvlPLbFwdlLBSS01TddA--

It may not appear so, but this message has a definite structure indicated by the boundary strings. The picture below will (I hope) make this structure more clear:

As noted above, this is a pretty common structure for spam messages these days. It has a couple of nice features for the spammer:

In this case, then, we are interested in looking for any websites linked from the HTML part. We might also want to take a look at the image part in case it contains more info about the spam sources and websites. The plain-text part may contain info of use, but more than likely not.

I’ve actually told you a great deal more about multipart MIME than you really need to know; it is mainly important to be able to identify the message part containing the spam pitch (and any URLs to which it refers), plus possibly any image parts (since spammers frequently hide website addresses or other info in these images in order to evade spam filters).
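To find those parts without tracing boundary strings by hand, you can let a MIME parser list the structure for you. Below is a rough Python sketch using the standard email package. The message is a cut-down stand-in for the sample above: the boundary names “outer” and “inner” and the shortened part contents are my own assumptions, chosen only to keep the example small.

```python
from email import message_from_string

# Cut-down stand-in for the multipart sample; boundaries are hypothetical.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="outer"

This is a multi-part message in MIME format.

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

neutral filler text

--inner
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

<HTML><BODY>the spam pitch</BODY></HTML>

--inner--

--outer
Content-Type: image/gif; name="Jtiay.GIF"
Content-Transfer-Encoding: base64

R0lGODlhSAEuAfcA

--outer--
"""


def outline(part, depth=0):
    """Return one indented line per MIME part, nested parts indented deeper."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        # For a multipart container, the payload is a list of sub-parts.
        for child in part.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(message_from_string(RAW))))
```

The indentation in the printed list mirrors the nesting: the text/plain and text/html alternatives sit inside the multipart/alternative block, which in turn sits beside the image inside the outer multipart/related block.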

MIME encoding of message parts

In the message immediately above, we saw that the image Jtiay.GIF was encoded using something called “base64.” What does this mean?

Let’s back up for a moment and recall that an e-mail body, by the “law” of RFC2822, is supposed to contain only seven-bit ASCII characters. MIME tells you how you can divide this body up into multiple parts of various kinds, but these parts all still have to appear as safe 7-bit text (otherwise, mail transfer programs might get horribly confused when trying to handle the message). As it turns out, MIME also gives you the encoding techniques you need in order to ensure that any kind of data can be attached to a message and sent safely.

Two such techniques are commonly used in MIME message bodies: base64 encoding, and quoted-printable encoding. A third type, encoded-word encoding, is used in header lines (typically for subject lines or address “nicknames”). We’ll take a look at these below.

MIME base64 encoding

The base64 technique is used for binary data (like image files), or for text in character sets that are very different from ASCII or Latin; it uses an arithmetical transformation (called, not surprisingly, base64) to turn the block of possibly-binary data into a larger block of 7-bit text (actually, base64 uses only the letters, numerals, and a handful of symbol characters, and not the entire ASCII set). This transformation is easily reversed, so that the recipient’s mail program can render the part or attachment correctly once it has been received.

Although base64 is usually used to encode binary data, MIME also allows it to be applied to pure 7-bit text. There’s no technical reason to do so, but spammers do it anyway since it provides an easy and relatively innocuous way to disguise the text of their messages. Spam filters that don’t know how to decode base64 won’t be able to peek at these messages.

Raw base64 data can be recognized as long, equal-length strings of letters (uppercase or lowercase) and digits with no spaces and very little punctuation. Here’s a sample of what you might see, quoted from the multipart-MIME example above:

R0lGODlhSAEuAfcAAAAAAIAAAACAAICAAAAAgIAAgACAgMDAwMDcwKbK8AAAAP
LGCQVTDSEYZDREIGFPHBXTGXUXVQJOSOEDTJGCWERPZGSRCQQBEURYOWHBYCCF
UTUIPKNRSUPOZRYXOOWPLDPSHOPCROZWCZMVRVHSNEAELOAFWMRPPBLFHVHJUQ
DYUYYCCLATZSXPHFXONJDNPCPHNATEZYHYYCWDREZGOBUJJVILWRACRGRWIIKA
CHFVQBISIFHSJEJECAFUGABFTOJIGQJCZVHZVVLXVODELZCVVSCMQGKUCQPPDZ
[...continues for many more lines...]
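To see how easily the transformation is reversed, here is a small Python sketch using the standard base64 module. It decodes only the first 16 characters of the sample above (base64 works in groups of four characters, so any multiple of four can be decoded on its own), and then shows the spammer’s trick of base64-encoding plain text. The variable names are mine.

```python
import base64

# First 16 characters of the image part quoted above.
SAMPLE = "R0lGODlhSAEuAfcA"

raw = base64.b64decode(SAMPLE)
# A GIF file starts with a six-byte signature, which confirms that the
# part really is what its Content-Type line claims.
signature = raw[:6]
print(signature)

# The same transformation applied to ordinary 7-bit text, which is how
# a spammer can hide a keyword from a filter that cannot decode base64.
disguised = base64.b64encode(b"viagra").decode("ascii")
print(disguised)
print(base64.b64decode(disguised))
```

Every four base64 characters carry three bytes of original data, which is why the encoded block is about a third larger than the data it represents.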

MIME quoted-printable encoding

The quoted-printable (QP) technique is also used on the entire contents of a message part, but its alterations to the data are far less drastic. You can usually read most or all of a minimally-QP-encoded message part without having to decode it, something you can’t do with base64. Spammers can jigger QP encoding to hide portions of a message from snoops and filters, even though these portions don’t really require encoding.

Normally, MIME-QP replaces or “escapes” the occasional troublesome (non-ASCII) character with an alternate representation that consists of the equals sign followed by the two-digit hex value of the character code. For example, the printable character “Ç” (“Ccedil,” character code 199 in ISO-8859-1) is “troublesome” to e-mail because its code is greater than ASCII’s 7-bit limit of 127 (i.e., it is an “eight-bit” character); in QP, this character would be replaced with the string “=C7” where “C7” is the number “199” in hex (the MIME standard calls for uppercase hex digits here). The string “=C7” contains only 7-bit characters, so it is safe to put in an e-mail message.

As with base64, QP encoding can be used for nefarious purposes; characters that don’t require QP encoding (i.e., printable ASCII characters) can nevertheless be encoded in QP. For example, the word “viagra” could be rendered as “=76=69=61=67=72=61”; any spam filters that don’t know how to decode from QP will be unable to scan this spammy keyword.
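Undoing QP is just as mechanical as undoing base64. Here is a minimal Python sketch using the standard quopri module; the helper name qp_decode and its default character set are my own choices for illustration.

```python
import quopri


def qp_decode(text, charset="iso-8859-1"):
    """Decode quoted-printable text, then interpret the bytes in `charset`."""
    # QP data is pure ASCII, so it can be turned into bytes safely.
    raw = quopri.decodestring(text.encode("ascii"))
    return raw.decode(charset)


# Needlessly encoded ASCII, as a spammer might send it.
print(qp_decode("=76=69=61=67=72=61"))
# A genuinely "troublesome" eight-bit character.
print(qp_decode("Fran=C7ois"))
```

Note that decoding takes two steps: QP only gets you back to raw bytes, and you still need the charset from the Content-Type line to turn those bytes into readable characters.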

MIME encoded-word encoding

The third type of MIME encoding commonly used in e-mail (and abused by spammers) is “encoded-word” encoding. This sort of thing is generally used only in visible mail headers (To:, From:, Subject:, etc.) and is not found in the bodies of messages. It allows these fields to contain non-ASCII text, which is a good thing: it allows people to type their nicknames or subject lines in their own language when that language cannot be represented by ASCII text.

The name comes about because EW encoding encodes a single word (or a short phrase) into a bite-size package of ASCII data that can be inserted in its place.

You can recognize encoded words by their starting and ending marks: =? and ?= respectively. Between these marks, the EW has a specific structure:

=? [character-set] ? [encoding-technique] ? [encoded-data] ?=

[character-set]

ID of the character set of the original data (e.g., ISO-8859-1).

[encoding-technique]

“q” for quoted-printable (strictly speaking, a close variant of it that RFC2047 calls “Q” encoding), or “b” for base64.

[encoded-data]

The data to be shown, from the character set described above, using the encoding technique described above.

So, for example, if a spammer wanted to use the word “viagra” in a subject line, he could disguise it from EW-ignorant spam filters using the following: =?iso-8859-1?q?=76=69=61=67=72=61?=.
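If you do need to decode an encoded word, Python’s standard email.header module will do it. The sketch below is one way to go about it; the helper name decode_header_value is mine, and the base64 variant of the subject is my own example rather than something taken from a real spam.

```python
from email.header import decode_header, make_header


def decode_header_value(value):
    """Return a header value with any encoded words turned into plain text."""
    # decode_header splits the value into (data, charset) chunks;
    # make_header joins them back into one readable string.
    return str(make_header(decode_header(value)))


# The "q" (quoted-printable style) example from the text above.
print(decode_header_value("=?iso-8859-1?q?=76=69=61=67=72=61?="))
# The same word using the "b" (base64) technique.
print(decode_header_value("=?iso-8859-1?b?dmlhZ3Jh?="))
```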

Because the use of EWs is confined to the visible header lines, and because these seldom contain trustworthy information for spam-tracing purposes, you should seldom have to muck directly with encoded words. Still, this description should help you spot them and understand how they work (and how you might decode them if needed).

Decoding from MIME (outside of your mail program)

As I said above, your mail program will usually fully decode the MIME encoding in any messages it receives, so that you can read the message text or use the binary attachment. Thus, if you’re looking at an e-mail message in your mail program, you don’t have to worry about MIME decoding.

If you’re dealing with a “raw” message that you saved as raw text from your mail program, or that you got as raw source from an online mail service like SpamCop or webmail, then you may need a way to decode the MIME content outside your mail program.

The quickest way to do this for most of us is to use one of the two all-purpose archiving tools WinZip (found mostly on Windows) and Stuffit Expander (found mostly on Mac OS); you can start these tools up and then run their “decode” commands to open and decode the file containing the raw MIME-encoded e-mail. If the message is multi-part, then each part will be saved in a separate file within a single folder or subdirectory on your disk.

If you have a Unix-like system (Linux, BSD, etc.), you might be able to use a multipart-capable decoding command like metamail or munpack (the older tool mimencode is apparently not multipart-savvy). Here are the command lines to try for either of these tools (assuming you’re trying to decode the contents of a file named encoded.mime in your current working directory).

munpack -t encoded.mime
metamail -w < encoded.mime

These programs should also decode each part into a separate file.
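If none of these tools is at hand, a few lines of Python can do roughly the same job. The following is a rough sketch, not a replacement for munpack: the function name unpack, the generated file names (part02.txt and so on), and the tiny SAMPLE message are all my own assumptions. It deliberately ignores any file name supplied in the message, since a name chosen by a spammer is not something you want to trust.

```python
import email
import mimetypes
from pathlib import Path

# Hypothetical two-part message used only to demonstrate the function.
SAMPLE = b"""\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="demo"

--demo
Content-Type: text/plain; charset=us-ascii

hello

--demo
Content-Type: image/gif
Content-Transfer-Encoding: base64

R0lGODlhSAEuAfcA

--demo--
"""


def unpack(raw_bytes, out_dir):
    """Decode each non-multipart part into its own file; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    msg = email.message_from_bytes(raw_bytes)
    written = []
    for n, part in enumerate(msg.walk(), start=1):
        if part.is_multipart():
            continue  # containers only hold other parts
        # Build our own file name from the declared content type.
        ext = mimetypes.guess_extension(part.get_content_type()) or ".bin"
        target = out / f"part{n:02d}{ext}"
        # decode=True undoes base64 or quoted-printable for us.
        target.write_bytes(part.get_payload(decode=True) or b"")
        written.append(target)
    return written
```

To use it on a saved message, you might call unpack(Path("encoded.mime").read_bytes(), "decoded") and then look in the “decoded” folder. Open the results with care: view HTML parts in a text editor rather than a browser, so that nothing in them gets a chance to run or to phone home.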

OK, now that we’ve completely deconstructed the mail body, it’s time to fish around inside it to see what we can find. To read about this, go on to the next page.




(c) 2003-2008, Richard C. Conner


Updated: Sat, 14 Jun 2008