
What it does

curl is a very clever command-line utility that can fetch files from web servers just like a web browser — the only difference being that curl will not attempt to interpret or render the files. curl just prints them out in your terminal or DOS window.

For spam-tracing, curl provides a safe and easy way to get files from a spam website; you can then examine the files (e.g., HTML files) for URL redirections, JavaScripts, encrypted markup, and other tricks of the spam trade. curl isn't particularly useful just for looking at spam mail (since you presumably already have it on your computer's disk and therefore don't need to fetch it).

How to get it

curl has become very popular as a convenient file-fetching utility (e.g., for software installation or patching scripts), so it is now routinely included in modern Unix-like operating systems (such as some Linux distributions and Mac OS X / Darwin). You can use the commands curl -V (which prints version info) or whereis curl (which gives the location of curl and its manual page on your file system) to see whether you have curl on your system already.
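For example, either of these (run in a terminal on a Unix-like system) should tell you quickly whether curl is installed:

curl -V
whereis curl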

If you don't have curl, you can probably find and download a version for your particular system. curl is open source free software available for download at http://curl.haxx.se/. You can get pre-built 'binary' packages ready for use, or you can download the source code and build it (and play with it) yourself if you've a mind to.

How to use it

The basic curl command is pretty easy to use:
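curl [url]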

Note that the "[url]" should be a complete URL, with the protocol identifier (e.g., the "http://") up front. I find that curl automatically guesses the protocol (or defaults to HTTP) if you don't specify it, but you might as well be thorough.

curl has many command-line options, a few of which I'll mention below. See the curl manual page for more info on options.

Fetching web files with curl

Here's a simple curl command to fetch the index page of this website (http://www.rickconner.net/spamweb/index.html).

[G4733:~] rconner% curl http://www.rickconner.net/spamweb/index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
        <head>
                <meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
                <meta name="no-email-collection" content="http://www.rickconner.net/spamweb/policy.html#bots">
                <title>rickconner.net :: Rick's Spam Digest</title>

 Remainder of printout not shown...

Note here that this URL actually has a file name at the end: we want the file "index.html", which resides in the subdirectory "spamweb/" just below the root directory "/" of the site. Of course, not all URLs have filenames. If you happen to get a possible spam URL without a specific filename at the end (e.g., http://www.desperate-milfs.foo/), you can still feed it to curl. By convention, the remote web server will interpret this as a request for the index page of the directory you asked for (typically "index.html") and will try to retrieve that page for you.
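For example, feeding that bare (and made-up) directory URL to curl would look like this, and the server would respond with the directory's index page if it has one:

curl http://www.desperate-milfs.foo/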

Here's what happens if you ask for a page from a website that isn't online:

[G4733:~] rconner% curl http://fakewebsite.zzz/index.html
curl: (6) Couldn't resolve host 'fakewebsite.zzz'

...and here's what happens when you request a page that doesn't exist from a website that does:

[G4733:~] rconner% curl http://www.rickconner.net/free-prizes.html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not Found</TITLE>
</HEAD><BODY>
<H1>Not Found</H1>
The requested URL /free-prizes.html was not found on this server.<P>
<HR>
<ADDRESS>Apache/1.3.33 Server at www.rickconner.net Port 80</ADDRESS>
</BODY></HTML>

Note that this is the Apache-generated HTML markup for the usual message you'll see in your browser for a "404" (file-not-found) error.

Let's try that last query again, this time with curl -i (the -i option tells curl to print the HTTP header info for the transaction).

[G4733:~] rconner% curl -i http://www.rickconner.net/free-prizes.html
HTTP/1.1 404 Not Found
Date: Sun, 16 Oct 2005 23:05:02 GMT
Server: Apache/1.3.33 (Unix) mod_ssl/2.8.22 OpenSSL/0.9.7d mod_perl/1.29 PHP/4.3.10
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1


<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not Found</TITLE>
</HEAD><BODY>
<H1>Not Found</H1>
The requested URL /free-prizes.html was not found on this server.<P>
<HR>
<ADDRESS>Apache/1.3.33 Server at www.rickconner.net Port 80</ADDRESS>
</BODY></HTML>

The lines above the markup are the HTTP header info provided by the web server (browsers normally don't display it); here we can see the actual 404 code in the first line of the HTTP header.

You can fetch any kind of file from a website with curl, so long as it is "world-readable" (i.e., you, as an anonymous user, have permission to view the file). Here's a doctored example of a curl fetch that tries to grab a file that I'm not allowed to have (in this case, the webmaster probably protected the file by setting its Unix file mode bits to deny read permission to the public).

[G4733:~] rconner% curl -i http://www.russian-warez.foo/secret-stuff.txt
HTTP/1.1 403 Forbidden
Date: Mon, 17 Oct 2005 01:23:57 GMT
Server: Apache/1.3.33 (Unix)
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1


<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>403 Forbidden</TITLE>
</HEAD><BODY>
<H1>Forbidden</H1>
You don't have permission to access /secret-stuff.txt
on this server.<P>
<HR>
<ADDRESS>Apache/1.3.33 Server at www.russian-warez.foo Port 80</ADDRESS>
</BODY></HTML>

Here, I got a 403 code (in the first line of the HTTP header) and an appropriate block of HTML markup for the browser to display.

If you're trying to fetch a binary file, such as an image, an mp3 file, or a machine-language executable, you should probably not let curl print it to the screen (it might mess up your display, requiring a reset of the terminal program). Instead, use the curl -o option, or simply redirect the standard output to a file (named "my-file" in these examples):
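(In these sketches the URL is just a made-up stand-in for the binary file you want; "my-file" is whatever name you choose for the saved copy.)

curl -o my-file http://www.nasty-website.foo/nasty-picture.gif

curl http://www.nasty-website.foo/nasty-picture.gif > my-file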

You can also fetch and store text files (HTML, scripts, etc.) the same way. This allows you to save a copy for later analysis and review.

Speaking of saving text for later review, you can safely use curl to fetch script files or other executables, since curl won't (can't, really) execute them. You can also fetch HTML pages that contain such scripts (spammers frequently use outlandish encryption schemes to disguise the content of these pages, and you'll find it easier to trace what's going on if you have your own local copy of the page).

The things you'll see with curl

Now that we've gotten our feet wet with curl, let's see how we can use it to spot some kinds of subterranean website exploits.

Setting cookies

Here's an example of a website trying to set some cookies via a Set-Cookie line in the HTTP header. This cookie-setting (and cookie-getting) is normally a benign activity, and is very popular among large web merchants like Amazon. I don't think your average warez-drugs-porn spammer uses cookies very much, but it is possible that "mainsleaze" spammers might. Of course, by default curl does not accept or return cookies, so this attempt rolls off like water off a duck's back.

[G4733:~] rconner% curl -i -v http://www.amazon.com:80/exec/obidos/subst/home/home.html
* About to connect() to www.amazon.com:80
* Connected to www.amazon.com (207.171.163.90) port 80
> GET /exec/obidos/subst/home/home.html HTTP/1.1
User-Agent: curl/7.10.2 (powerpc-apple-darwin7.0) libcurl/7.10.2 OpenSSL/0.9.7g zlib/1.1.4
Host: www.amazon.com
Pragma: no-cache
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

HTTP/1.1 200 OK
Date: Mon, 17 Oct 2005 03:03:51 GMT
Server: Suppressed
Set-Cookie: obidos_path_continue-shopping=continue-shopping-url=
/subst/home/home.html&continue-shopping-post-data=
&continue-shopping-description=generic.gateway.default;
path=/; domain=.amazon.com
Vary: Accept-Encoding,User-Agent
nnCoection: close
Transfer-Encoding: chunked
Content-Type: text/html

Redirection to another website

Now comes an example of using curl to spot HTTP-level redirection (the 300-series HTTP return codes): I asked for a page not on the site, but the server was instructed to redirect me to another page instead (I "phonied up" this example for purposes of illustration, and used curl -i to show the HTTP header).

webster2:~$ curl -i http://www.fake-viagra-portal.foo/sales.html
HTTP/1.1 302 Found
Date: Mon, 17 Oct 2005 00:49:31 GMT
Server: Apache/1.3.33
Location: http://www.secret-viagra-website.bar/sales.php?affiliate=
fake-portal-guy
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved <A HREF="http://www.secret-viagra-website.bar/sales.php?affiliate=fake-portal-guy">here</A>.<P>
<HR>
<ADDRESS>Apache/1.3.33 Server at www.fake-viagra-portal.foo Port 80</ADDRESS>
</BODY></HTML>

Here, the 302 return code in the first line of the HTTP header tells the browser that it should move on to the new location given in the Location: line. The server also provides some HTML for the browser to display if, for some reason, it doesn't follow the redirection. Of course, curl doesn't interpret any of this stuff; it just prints it (if you wish, you can use the -L command-line option to tell curl to follow redirections).
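For instance, re-running the same made-up request with -L added would make curl go on and fetch the page at the new location automatically:

curl -i -L http://www.fake-viagra-portal.foo/sales.html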

Another way to redirect a browser is at the "HTML level," using a <meta HTTP-EQUIV="Refresh"> tag in the head of the HTML document:

webster2:~$ curl -i http://www.fake-viagra-portal.foo/sales.html
HTTP/1.1 200 OK
Date: Mon, 17 Oct 2005 01:30:57 GMT
Server: Apache/1.3.33
Transfer-Encoding: chunked
Content-Type: text/html

<html>
<head>
<title>Public portal page</title>
<meta HTTP-EQUIV="Refresh" CONTENT="0;URL=http://www.secret-viagra-
website.bar/sales.php?affiliate=fake-portal-guy
"></head>
<body>
<center>
<p>Why are you still here, jerk?</p>
</center>
</body>
</html>

As we see from the HTTP header, the file sales.html was found (return code 200) and delivered to my machine. The meta tag in the head of this file tells my browser to "refresh" the page after zero seconds (i.e., immediately) by loading the URL given in its CONTENT attribute. This sort of "meta-redirect" is slower, less reliable, and more obvious to the visitor than the HTTP-level redirect, but it requires less work on the part of the person setting up the portal website (and can be used even if this person doesn't have access to the web server configuration to set up an HTTP redirect). Here again, curl doesn't interpret this info; it just prints it.

These kinds of redirection are the sort of thing you might see in spam if the URL you requested were just a portal or affiliate website for some other site where the sales pitch is actually made. Note that in both cases the new location includes a CGI call with what looks like an affiliate ID, so the fake portal guy can get his commission for driving traffic to the main site.

Fetching "auxiliary" files

A website operator can set up his HTML pages to pull in content from other files. In some cases, the server will actually insert these files into the HTML markup itself (e.g., via PHP scripts and so-called server-side includes or SSIs); you'll see this content when you fetch the page with curl.

In other cases, the files (e.g., CSS style sheets, external JavaScript files) are downloaded separately and then interpreted by the user's browser (or, as we say, on the "client side"). Usually, these files are downloaded without your being aware of it, and are stored in your browser's cache where you'll never find them (at least not easily). When you use curl to fetch an HTML page, you won't get any of these "separate" files automatically, but if you can find where they're referenced, you can download them as well with curl.

For example, let's grab the index page of my spamweb site using curl:

[G4733:~/desktop] rconner% curl http://www.rickconner.net/spamweb/
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

 <head>
  <meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
  <meta name="no-email-collection" content="http://www.rickconner.net/spamweb/policy.html#bots">
  <title>rickconner.net :: Rick's Spam Digest</title>
  <link href="spamweb.css" rel="stylesheet" media="screen">
</head>

 [rest of output snipped] 

In the head element, you can find a <link> tag pointing to an external CSS style sheet that I made for the site. If you're curious, you can download this file using curl. Note that:

  • Stylesheets are invariably text files, so they should be safe for you to print out on your display.
  • The file name (spamweb.css) has no directory path info in front of it, so we can assume that it sits in the same directory as the page itself. So, the URL we want to fetch is http://www.rickconner.net/spamweb/spamweb.css.
  • If the filename had directory path info in front of it, you'd have to construct your URL accordingly. Here are a couple of examples (it may help to review Unix file path notation, on which URL paths are based).
    • For "style/spamweb.css", you would use http://www.rickconner.net/spamweb/style/spamweb.css
    • For "../style/spamweb.css", you would use http://www.rickconner.net/style/spamweb.css

Let's go ahead and grab the style sheet at its proper location:

[G4733:~] rconner% curl http://www.rickconner.net/spamweb/spamweb.css
body { color: #444; font-size: 15px; font-family: "Utopia Regular", Georgia, "Times New Roman", serif; line-height: 20px; background-color: #ffc0cb; text-align: justify }
.center-example { text-align: center }
p { }
.glosTerm { color: #c71585; font-weight: bold; font-size: 16px; line-height: 18px; font-family: "Courier New", Courier, Monaco; padding-bottom: 0px }
.glosDef { font-size: 14px; line-height: 18px; padding-left: 20px }
.glosSidebar { color: #000; font-size: 14px; line-height: 18px; background-color: #ffc0cb; margin-left: 40px; padding: 10px; border: solid 1px #90ee90 }
.glosAlsosee { background-color: yellow }
.glosLetter { color: #c71585; font-weight: bold; font-size: 20px; line-height: 24px; font-family: "Courier New", Courier, Monaco; background-color: #ffc0cb; padding-left: 10px }

 [remainder snipped...]

Fetching external script files

Another kind of file you can fetch in this way, one that is perhaps of more interest to spam hunters, is a script file. This would be a file containing JavaScripts (or some other kind of scripting) that are called from within the HTML page. Just by looking at the page, you'd never know what the scripts did, since they're off in another file. Here's a phonied-up example of something you might see on a spam website:

[G4733:~] rconner% curl http://www.mortgage-fiends.foo/
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
 <head>
  <meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
  <title>Great Re-fi Deais!!!!</title>
  <script src="js/evil_scripts.js">
  <base href="http://badguys.xxx/">
 </head>

 [...skipping...] 

<a href="http://somewhere-else.zzz" onClick="return do_evil('x')">
Use our friendly form</a>

 [remainder snipped...] 

In the hyperlink near the bottom, the spammer calls for the function do_evil('x') to be run when you click the link (via the onClick attribute). You might well like to know what this script will do, but the script itself isn't in the HTML file; it's in the separate file "js/evil_scripts.js", named in the <script> tag in the head. So, let's go get it!

Wait a minute, though: the spammer has thrown up a bit of a roadblock in the form of the <base href> tag, which sets a 'base' for any relative URLs used in the page. The <script> tag calls out a relative URL (i.e., js/evil_scripts.js), so the browser would resolve it against "http://badguys.xxx/" rather than against the site that served the page. The proper URL to grab with curl, then, would probably be:
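curl http://badguys.xxx/js/evil_scripts.js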

If, in fact, you're able to grab this file with curl, you have a fighting chance to figure out exactly what the spammer is up to.

No curling allowed?

Curl, like any good HTTP client, clearly identifies itself to remote web hosts (using a distinct "user-agent" string in the HTTP request). On rare occasions, this information may be used by a web server to deny access to the requested file; if a request doesn't appear to come from a more-or-less standard web browser, the web server admins may regard it (with some justification) as the precursor to link spamming, to a cracking attempt, or to a spambot scraping attack. It's possible (via the -A option) to force curl to use an alternate user-agent string that may be more palatable, but you might still find your access blocked to such sites due to other anomalies in the request. The server admins are not about to go publishing the details of their security precautions in order for you to evade them, so it's probably best just to lay off.
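For what it's worth, such a request might look something like this sketch (the URL is made up, and the quoted string is just one plausible browser-style user-agent):

curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" http://www.suspicious-site.foo/page.html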

On the other hand, I've never found any spam websites (nor very many honest sites) that use this sort of defensive "browser-sniffing."






