Frontispiece of "Practical Knowledge for All", volume 4

How to find out `What the Search Engine Found'


For the curious, a brief explanation of how this process works.

The HTTP standard specifies that when going from one page to another, a browser may transmit to the second server the URL of the first page, as described in this extract from RFC2068:

14.37 Referer

   The Referer[sic] request-header field allows the client to specify,
   for the server's benefit, the address (URI) of the resource from
   which the Request-URI was obtained (the "referrer", although the
   header field is misspelled.) The Referer request-header allows a
   server to generate lists of back-links to resources for interest,
   logging, optimized caching, etc. It also allows obsolete or mistyped
   links to be traced for maintenance. The Referer field MUST NOT be
   sent if the Request-URI was obtained from a source that does not have
   its own URI, such as input from the user keyboard.

        Referer        = "Referer" ":" ( absoluteURI | relativeURI )

   Example:

        Referer: http://www.w3.org/hypertext/DataSources/Overview.html

   If the field value is a partial URI, it SHOULD be interpreted
   relative to the Request-URI. The URI MUST NOT include a fragment.

     Note: Because the source of a link may be private information or
     may reveal an otherwise private information source, it is strongly
     recommended that the user be able to select whether or not the
     Referer field is sent. For example, a browser client could have a
     toggle switch for browsing openly/anonymously, which would
     respectively enable/disable the sending of Referer and From
     information.

(Note that the mis-spelling `Referer' is now enshrined for ever more in the Standard, which is the sort of risk you run if you let physicists design network protocols.)

Now, this means that if you click through from a search engine to one of my pages, the Referer: header will give the address of the search engine-generated page which got you there. For instance, if I search for `vmail-sql' with Google, my browser will end up at the page with URL http://www.google.com/search?q=vmail-sql; the bit after the `search?q=' is the argument to the search engine. If I then click on the link to the vmail-sql home page, my browser will send a request something like:

GET /~chris/vmail-sql/ HTTP/1.1
Host: www.ex-parrot.com
User-Agent: Nutscrape 3.141 (CPM; 8-bit)
Referer: http://www.google.com/search?q=vmail-sql
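The search terms can be recovered from a Referer: value like this one by parsing its query string. Here is a minimal sketch in Python, assuming the engine passes the terms in a parameter called q, as Google does in the example above. The function name search_terms is made up for illustration, and other engines may use a different parameter name.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referer, param="q"):
    """Return the search terms carried in a Referer URL, or None.

    Assumes the terms live in the query parameter `param`
    ("q" for the Google URL in the example above).
    """
    query = urlsplit(referer).query       # e.g. "q=vmail-sql"
    values = parse_qs(query).get(param)   # decodes %XX escapes and "+"
    return values[0] if values else None

print(search_terms("http://www.google.com/search?q=vmail-sql"))
```

A Referer with no such parameter (an ordinary link from another page, say) simply yields None.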

This gets logged on my server, in a line (folded for readability) like:

12.34.56.78 - - [20/Aug/2000:22:25:24 +0100]
 "GET /%7Echris/vmail-sql/ HTTP/1.1" 200 3595
 "http://www.google.com/search?q=vmail-sql" "Nutscrape 3.141 (CPM; 8-bit)"
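For the curious, a line in this format can be pulled apart with a regular expression. What follows is a rough sketch in Python, not a complete parser: the pattern and the name parse_combined are my own illustration, and it assumes that none of the quoted fields contains an embedded double quote.

```python
import re

# Apache "Combined" layout:
#   host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# The example line from above, unfolded onto a single line.
line = ('12.34.56.78 - - [20/Aug/2000:22:25:24 +0100] '
        '"GET /%7Echris/vmail-sql/ HTTP/1.1" 200 3595 '
        '"http://www.google.com/search?q=vmail-sql" "Nutscrape 3.141 (CPM; 8-bit)"')
fields = parse_combined(line)
print(fields["referer"])
```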

By writing a simple program to read the Referer: headers, I can figure out what people are searching for to get to my pages. The program is search-report; feel free to try it on your own site's logs. (Note that it expects the logs to be in the format which Apache calls `Combined'. Your server may log the Referer: as a different field, or not at all.)
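To give a flavour of the idea, here is a small Python sketch that tallies search terms from Combined-format log lines. It is not the search-report program itself, only an illustration under some assumptions: the name count_searches is hypothetical, the second and third sample lines are invented, and any Referer carrying a q parameter is treated as a search (a real tool would probably check the host name too).

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# In Combined format the last two quoted fields are the referer and the user agent.
TAIL = re.compile(r'"(?P<referer>[^"]*)" "[^"]*"\s*$')

def count_searches(lines, param="q"):
    """Count search terms found in the Referer field of each log line."""
    counts = Counter()
    for line in lines:
        m = TAIL.search(line)
        if not m:
            continue  # not a Combined-format line
        terms = parse_qs(urlsplit(m.group("referer")).query).get(param)
        if terms:
            counts[terms[0].lower()] += 1  # fold case so variants group together
    return counts

# Sample data: the first line is the example above; the other two are made up.
sample = [
    '12.34.56.78 - - [20/Aug/2000:22:25:24 +0100] "GET /%7Echris/vmail-sql/ HTTP/1.1" 200 3595 '
    '"http://www.google.com/search?q=vmail-sql" "Nutscrape 3.141 (CPM; 8-bit)"',
    '12.34.56.79 - - [20/Aug/2000:22:26:01 +0100] "GET /%7Echris/vmail-sql/ HTTP/1.1" 200 3595 '
    '"http://www.google.com/search?q=Vmail-SQL" "Nutscrape 3.141 (CPM; 8-bit)"',
    '12.34.56.80 - - [20/Aug/2000:22:27:13 +0100] "GET /%7Echris/ HTTP/1.1" 200 1234 '
    '"-" "Nutscrape 3.141 (CPM; 8-bit)"',
]
for terms, n in count_searches(sample).most_common():
    print(n, terms)
```

Requests with no Referer are logged with "-" in that field and are skipped, since they carry no query string.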

If the page is a collection of keywords people search for, won't it eventually attract all the search-engine traffic on the web?

No. The search engines are not that stupid (I hope).... Also, each report is compiled from only a few weeks' worth of logs.

Update

The search engines do index the report, and it gets direct traffic. How depressing. I've added a noindex tag to the page to stop this; otherwise, one of the search engine operators is bound to become angered.

Update: `Yahoo Chat boot codes'

After wondering what `Yahoo Chat boot code' is -- lots of people seem to search for this term, and some of them find their way to my pages -- I asked if anyone reading my page could enlighten me. Eventually, one reader did. The answer is, apparently:

If you've got Yahoo messenger (a chat program you can download at
Yahoo.com), you can also go to a lot of different Yahoo Chatrooms (works
more or less like mIRC and ICQ) -- in these rooms, you will also find people
who's got scripts/codes that will disconnect you ( "boot" you) off the
chatrooms, and sometimes also disconnect your Yahoo messenger all together.
In order to "boot" someone off, you need either a script, a program or codes
(I'm not sure which) -- and in order to secure yourself from BEING booted,
you also need the same thing.

-- thanks to Gudveig Rian for that.


Copyright (c) 2000 Chris Lightfoot. All rights reserved.