Code Snippet: Keep Track of Hits by Search Engine Spiders
As you can tell from our home page, I am an avid reader of Will Bontrager's 'Why and What and How-to-Do' blog. In October 2005, he posted an entry entitled 'A Spider Detector'.
Will presents a spider-detection method which uses a "hidden" link on your web page to call a Perl script. The script then sends you an email to notify you that the link has been clicked. The link is hidden in an image, such as a thin horizontal bar, so most humans would not think of clicking it. However, spiders might follow it, since they want to learn where all of the links on your site lead.
Recently, I decided to experiment with this technique, but I did not want to receive an email every time the link was clicked. Consequently, I revised Will's script to append each hit to a text file, which I could then peruse at my leisure. I also made it a bit easier on myself by placing the image/link only on my home page, so I didn't have to worry about having the script remember where to return after doing its job.
Here's the method: First revise the <img> tag for a small image that's located on the web page you want to track. Surround the image tag with anchor tags that link to your Perl script.
<a href="cgi-bin/spider.cgi">
<img src="images/bar.gif" width="100%" height="6" border="0" /></a>
Now for the script:
1 #!/usr/bin/perl
3 ### spider.cgi
5 ### The URL of the webpage where your image/link is located
6 my $URL = "http://www.yourwebsite.com/yourwebpage.html";
8 ### Get environmental variables
9 my $ip = $ENV{REMOTE_ADDR};
10 my $ref = $ENV{HTTP_REFERER};
11 my $ua = $ENV{HTTP_USER_AGENT};
13 ### Get the date and time of the hit
14 my($sec,$min,$hour,$mday,$mon,$year) = gmtime();
15 my $dt = sprintf("%04d-%02d-%02d %02d:%02d:%02d GMT",
16     $year+1900, $mon+1, $mday, $hour, $min, $sec);
17 ### Write a line to your spiderlog text file
18 open(my $fh, '>>', '../files/spiderlog.txt') or die "Error opening logfile for write: $!";
19 print $fh "$dt | $ip | $ref | $ua |\r\n";
20 close $fh;
22 ### Go back to your webpage
23 print "Status: 307 Temporary Redirect\n";
24 print "Location: $URL\n\n";
Explanation: You may need to change line #1 if Perl is installed at a different path on your webserver, since the shebang must point to it.
Use the URL of your own web page in line #6.
In line #18, notice that the logfile is located in a different folder, NOT in your 'cgi-bin' folder. This will enable you to view it by the technique described below.
How to view the log: All you need to do is type the URL of your logfile into the address bar of your browser, like:
http://www.yourwebsite.com/files/spiderlog.txt
The entries will look something like the following:
2006-01-14 03:43:55 GMT | 66.249.65.178 | http://www.yourwebsite.com/yourwebpage.html
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
(This will all be on one line. It is separated here to make it fit on this page.)
The first item on the line is the date and GMT time, the second item is the IP address of the spider, and the third is the URL of 'yourwebpage'. If you get a blank in the third spot, the hit was probably from a curious surfer who wanted to see what the hidden link was. The last item is the user agent, where you can easily spot that this hit came from the 'Googlebot' spider. Most other spiders identify themselves just as plainly in this field.
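If the log grows long, the four fields described above can be tallied with a short script instead of read by eye. The following is only a sketch under stated assumptions, not part of the original method: the subroutine names parse_log_line, spider_name and count_hits are hypothetical, and the list of spider names is a small illustrative sample rather than a complete one. It assumes every line has the "date | IP | referrer | user agent |" layout that spider.cgi writes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one spiderlog line into its four fields.
# Assumes the "date | ip | referrer | user agent |" layout written by spider.cgi.
sub parse_log_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;        # drop the trailing CRLF (or LF)
    $line =~ s/\s*\|\s*\z//;     # drop the closing " |"
    my ($dt, $ip, $ref, $ua) = split /\s*\|\s*/, $line, 4;
    return {
        date     => $dt,
        ip       => $ip,
        referrer => $ref // '',  # blank when no referrer was sent
        agent    => $ua  // '',
    };
}

# Guess a spider name from the user-agent string.
# The names below are an illustrative sample, not an exhaustive list.
sub spider_name {
    my ($agent) = @_;
    for my $name (qw(Googlebot Slurp msnbot)) {
        return $name if $agent =~ /\Q$name\E/i;
    }
    return 'unknown';
}

# Count hits per spider name across a list of log lines.
sub count_hits {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        next unless $line =~ /\S/;   # skip empty lines
        my $hit = parse_log_line($line);
        $count{ spider_name($hit->{agent}) }++;
    }
    return \%count;
}

# When run directly with a log path as the only argument, print the tally.
if (!caller() && @ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $count = count_hits(<$fh>);
    close $fh;
    printf "%-10s %d\n", $_, $count->{$_} for sort keys %$count;
}
```

Save it under any name you like and run it with the path to your logfile as the only argument. Hits that land under 'unknown' are the ones worth inspecting by hand, since they are either spiders missing from the sample list or human visitors.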
Now you'll have some real evidence that your website is being spidered!
See it in action: You can see this script in use on the Home Page. The green horizontal bar near the bottom of the page is the image that provides a link to the spider script. If you move your mouse over the bar, you will see the URL of the link in the status bar at the bottom of your browser window.