elxsy where humanity wins the fight against machines

8 Jun 09 (4 comments)

How to identify and ban bots-spiders-crawlers

This is a quick tutorial. I will describe how to identify and get rid of web spiders/crawlers. What a bot is, what it does, and how it functions can all be found here. So you are in trouble with robots; good or bad does not matter. They all leech away your bandwidth and resources and just maybe do something for you in return, even when they are not harvesting or spammer bots. The problem goes beyond bandwidth when you have something like 100,000 dynamic pages on one server. So how do we separate them into good and bad?

First, by intention:

  1. Good ones
    1. Intentionally good and the result is efficient: like Google and Yahoo. They scan your website and send visitors back via search queries in exchange. They leech away your resources but give something in return.
    2. Intentionally good but the result is inefficient: like Cuil or Yandex, or other companies that want to look good but really sell their index. They leech your resources and give nothing in return. This is where you have to decide: if a bot leeches away 5% of your bandwidth and returns 5 visitors a month or none, you should list that one as bad too.
  2. Bad ones: ones that scan your website and links in order to harvest emails, content, links and weak security measures, and sell them to other people, businesses and other sources. Leeching off your back, in other words. They are all bad and should not be allowed to view your content.

Second, by identification and obeying your rules:

  1. Who identifies and obeys: bots that identify themselves as a robot or spider usually obey the robots.txt rules. Sometimes they do not.
  2. Who doesn't: harmful bots usually identify themselves as normal web users, and they do not care about robots.txt.

Method

1st Good and Obeying ones

Tag and ban bad ones

This is the old method. There used to be maybe 10 bots around the internet that could leech and scan significant amounts, because resources were scarce, but now home-grown spiders with large resources show up every day and every minute. So listing who is bad in a world where 95% are bad is not smart.

Tag and Allow only good ones

This is the appropriate solution to the problem now. Only allow the good ones via robots.txt. In my case those are only the big search engines that are useful to me. Here is a sample robots.txt that allows the ones with the specified identification and disallows the rest.

User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Mediapartners-Google*
User-agent: Googlebot-Image
User-agent: Yahoo-MMCrawler
Disallow: 

User-agent: *
Disallow: /
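
One way to sanity-check an allow-list like this before publishing it is to feed it to Python's standard urllib.robotparser and ask what each crawler may fetch. This is only a sketch: the bot name SomeRandomBot and the path /gallery/page1 are made up for illustration, and real crawlers may read edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The allow-list from above, trimmed to two named bots for brevity.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeRandomBot" is a hypothetical crawler that is not on the allow-list.
    for bot in ("Googlebot", "Slurp", "SomeRandomBot"):
        print(bot, allowed(bot, "/gallery/page1"))
```

Keep in mind this only tells you what an obedient bot would do with the file; it says nothing about bots that ignore robots.txt.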

So we got rid of the rule-obeying ones, whether well-intentioned but useless or plain bad. How do we get rid of the disobedient ones?

2nd Bad and disobedient ones

If you are using Apache (and you should use Apache!) you can ban them by user agent or by IP address. So why ban by user agent? Bots can use any IP address. Many have IP ranges like xx.xx.xx.10-100, but some get a new IP address whenever they want or crawl from a backend, so you can miss them. The user agent lets you catch them whatever the IP is.

Ban via user agent

Create your .htaccess file or modify the existing one, and add the bots' user agents with the corresponding patterns. Sample:

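# mod_rewrite has to be switched on before the conditions below take effect
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} .*Rapidmorebot.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Yanga.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^AISearchBot.*
# [F] answers every matching request with 403 Forbidden
RewriteRule ^.* - [F]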

What we did is tell the Apache web server that if any request's user agent matches the given patterns, it should answer with a 403 Forbidden (access denied) response instead of the page.

.*etc.* means: includes the string etc in any position

^etc.* means: begins with etc, followed by anything

[OR] means OR. Note: if you leave out the [OR] flag, the conditions are combined with AND and will not block any bots, because no user agent matches every pattern at once. The last RewriteCond before the RewriteRule takes no [OR].

You can use regular expressions in RewriteCond and RewriteRule. Search the internet for more information. You can add or remove as many lines/bots as you like.
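
To see what the anchored and unanchored patterns actually catch, you can try the same regular expressions outside Apache. The sketch below uses Python's re module as a stand-in; mod_rewrite uses PCRE, which should agree with Python on simple patterns like these, but treat it as an approximation. The sample user agent strings are invented.

```python
import re

# Same patterns as the RewriteCond lines. A request is blocked if ANY of them
# matches, which is what chaining the conditions with [OR] expresses.
BLOCKED_PATTERNS = [
    r".*Rapidmorebot.*",  # contains "Rapidmorebot" anywhere
    r"^Gigabot.*",        # begins with "Gigabot"
    r"^Yanga.*",
    r"^Baiduspider.*",
    r"^AISearchBot.*",
]


def is_blocked(user_agent):
    """Return True if user_agent matches at least one blocked pattern."""
    return any(re.search(pattern, user_agent) for pattern in BLOCKED_PATTERNS)


if __name__ == "__main__":
    samples = [
        "Gigabot/3.0",
        "Mozilla/5.0 (compatible; Rapidmorebot/1.0)",
        "Mozilla/5.0 (compatible; Baiduspider/2.0)",  # name not at the start
        "Mozilla/5.0 (Windows; U) Firefox/3.0",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

The third sample slips through because ^ only matches at the very beginning of the string. If a bot puts its name in the middle of the user agent, use the unanchored .*name.* form for it.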

Ban via IP address

If they do not identify themselves or fake it, we are going to ban them by IP address. You can specify a single IP or an IP range. Again, create or modify your .htaccess. Insert your bot's IP into the deny lines and repeat them until you are done.

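# Apache 2.2 syntax; on Apache 2.4 these directives need mod_access_compat
order allow,deny
# ban a single IP (comments must sit on their own line, not after a directive)
deny from 127.0.0.1
# ban a CIDR block: /17 is a prefix length, covering 127.0.0.0 to 127.0.127.255
deny from 127.0.0.0/17
# and allow the rest
allow from all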
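
The number after the slash in a deny line is a CIDR prefix length, not the last address of a range, so it helps to check which addresses a block really covers before banning it. A small sketch with Python's standard ipaddress module; the addresses are just the placeholder values from the sample.

```python
import ipaddress


def block_range(cidr):
    """Return (first address, last address, address count) of a CIDR block."""
    # strict=False tolerates host bits being set, e.g. "127.0.0.1/17".
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net[0]), str(net[-1]), net.num_addresses


def is_banned(ip, cidr):
    """Return True if ip falls inside the banned block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)


if __name__ == "__main__":
    print(block_range("127.0.0.1/17"))
    print(is_banned("127.0.100.5", "127.0.0.1/17"))
    print(is_banned("127.0.128.1", "127.0.0.1/17"))
```

A smaller prefix length bans a much bigger block, so double-check the range before you lock out real visitors along with the bot.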

You can also edit these settings from your hosting cPanel.

Identifying Bots

We are not bulletproof, although we only let in selected good ones and ban the ones we caught. There will always be new bots, and updates to existing ones, that can make your settings and rules invalid. We need to stay up to date as well. So how do we identify them?

1st General Knowledge

Of course you are not the only one facing this problem. People started to make lists and publish them because these bots annoyed them too much and they want to help other people like you. So here are some databases of known bots/spiders. Note: these databases are not up to date and they do not include all spiders, just the generally known ones. You should try to catch yours manually and then consult these databases to check your result.

http://www.robotstxt.org/db.html

http://www.iplists.com/

2nd Via hosting logs or a stats analyzer

You can manually analyze your hosting logs, bandwidth usage and pages to find the most hits from single IPs, or you can use an application like AWStats, or a commercial solution like WebLog Expert, to analyze your logs and create reports for you.

Awstats spider/bot list

AWStats identifies them by hits on robots.txt and by user agent string. That gives you the user agent, but you are unprotected against bots with unknown user agents and against ones you could only spot by IP.

You can detect whether any bots are leeching on your website with some easy analysis.

  1. Did "time spent on webpage" decrease suddenly? Bots get a page and leave, so they stay less than 2-3 seconds on your page. Humans will barely have loaded the page in that time. So if the share of such short visits goes up, it means you have a spider inside. You can track it down by finding request IPs with short "spent time" values.
  2. Did hits on a page increase suddenly? As a general rule you attract visitors slowly; your hits won't be 10 one day and 1000 the next if everything is working OK. You have a spider inside.
  3. Did your bandwidth usage increase suddenly? You have a spider leeching on you.
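
If you would rather do the manual analysis with a script, the checks above boil down to tallying hits and bandwidth per IP. A rough sketch, assuming your host writes Apache's combined log format; the sample lines and the addresses in them are invented.

```python
import re
from collections import defaultdict

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def tally(lines):
    """Return {ip: {"hits": n, "bytes": n, "robots": bool}} for parseable lines."""
    stats = defaultdict(lambda: {"hits": 0, "bytes": 0, "robots": False})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        entry = stats[match.group("ip")]
        entry["hits"] += 1
        if match.group("bytes") != "-":
            entry["bytes"] += int(match.group("bytes"))
        # Asking for robots.txt is a strong hint that the client is a crawler.
        if "/robots.txt" in match.group("request"):
            entry["robots"] = True
    return dict(stats)


def busiest(stats, top=10):
    """Return the IPs with the most hits, busiest first."""
    return sorted(stats, key=lambda ip: stats[ip]["hits"], reverse=True)[:top]


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [08/Jun/2009:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
        '203.0.113.5 - - [08/Jun/2009:10:00:01 +0000] "GET /a HTTP/1.1" 200 7000 "-" "ExampleBot/1.0"',
        '198.51.100.7 - - [08/Jun/2009:10:00:02 +0000] "GET /a HTTP/1.1" 200 400000 "-" "Mozilla/5.0"',
    ]
    stats = tally(sample)
    for ip in busiest(stats):
        print(ip, stats[ip])
```

Run it over one day's log at a time and compare the days; a sudden jump for one IP is exactly what the three questions above are looking for.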

All these rules apply to the AWStats Hosts section; there you get the IPs, and you can look up their agents in the spider databases I gave previously. So let's analyze one of my Hosts sections and find bots manually via their IP. Let's analyze the IPs and the area marked in red.

Awstats Host section

  1. The first line is a bot or a service (DNS, ping, etc.); no way a human being can hit that many pages in one day and use only 20 MB of bandwidth. A human downloads a page with all its attachments (images, scripts, styles), around 400 KB in my case (the pages are huge). But bots only download the text content, so it is small compared to the full size (7-8 KB). (Caching is not a factor because I know it is a dynamic image page.)
  2. The second line is definitely a bot, and a pretty bad one (zoozle.net, you should ban it!). It wasted 1.5 GB in one day! I hope you now see how important eliminating bots is for your system's health!
  3. And so on. They are all spiders or services crawling or pinging the website.
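
The reasoning in the first line can be turned into a quick check: divide a host's bandwidth by its page count and compare the result with the size of a full page load. A sketch only; the 400 KB full-page size is the figure quoted above, while the 10% cut-off and the sample hosts are assumptions you should tune for your own site.

```python
def average_kb_per_page(bandwidth_kb, pages):
    """Average transfer per page view, in KB."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return bandwidth_kb / pages


def looks_like_bot(bandwidth_kb, pages, full_page_kb=400, ratio=0.10):
    """Flag hosts whose average page transfer is far below a full page load.

    full_page_kb is the size of a page with images, scripts and styles;
    ratio is an assumed cut-off, not a measured one.
    """
    return average_kb_per_page(bandwidth_kb, pages) < full_page_kb * ratio


if __name__ == "__main__":
    # (host, pages, bandwidth in KB), invented numbers for illustration
    hosts = [("203.0.113.5", 2500, 20000), ("198.51.100.7", 12, 4800)]
    for host, pages, kb in hosts:
        print(host, average_kb_per_page(kb, pages), looks_like_bot(kb, pages))
```

A bot that downloads everything, like the one in the second line, will not trip this check, so keep an eye on raw bandwidth per host as well.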

3rd Set up a Bot Trap

Even your analysis can overlook some bots that are small now but create big trouble over time. What we can do is set up a bot trap for them to fall into. We create a trap link that records the visitor's details, publish that link on our website so that it is invisible to humans, and then tell the good and obedient bots not to enter the forbidden zone. Whoever does not listen to what we say gets tagged. The method depends on whether you want the records in a database, in a log file or via email.

1st Create a weird link in your website, like www.example.com/this-is-trap-dont-click-it/index.php, and add an index.php file there.

2nd The contents of the .php file are:

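<?php
// Collect details about whoever requested the trap URL.
$ip      = $_SERVER['REMOTE_ADDR'];
$host    = $_SERVER['HTTP_HOST'];
// User agent and referer are optional request headers, so guard against missing keys.
$agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$time    = date("d.m.Y H:i");
// mail, write in log file or insert into db depending on your choice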

3rd Put a link to that weird URL in your footer and make the link the same color as the background, or put a 1x1 transparent GIF inside the link, e.g.:

<!-- Trap for bots. Human visitors please do not visit this address -->
<a href="/this-is-trap-dont-click-it/index.php"><img src="1_by_1pixel.gif" border="0"
alt="Please do not visit, it's a trap for bots" width="1" height="1"/></a>

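4th Add these lines to your robots.txt. They tell good and obedient bots not to follow the link and get into trouble. If your robots.txt has separate groups for named bots, like the allow-list above, repeat the Disallow line in those groups too, because a bot that matches a named group ignores the * group.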

User-agent: *
Disallow: /this-is-trap-dont-click-it

So our bot trap is finally set. Analyze your trap logs and ban the visitors according to your criteria (leeching, new visitors) and their behaviour.
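
Once the trap has collected some entries, turning them into ban rules can be scripted. The sketch below assumes the trap script appends one tab-separated line per hit (time, IP, user agent). That log format is an assumption, since the PHP above leaves the storage choice open, and the min_hits cut-off is likewise arbitrary.

```python
import ipaddress
from collections import Counter


def trap_hits(lines):
    """Count trap hits per IP from lines of 'time<TAB>ip<TAB>agent'."""
    hits = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line
        try:
            ip = ipaddress.ip_address(fields[1].strip())
        except ValueError:
            continue  # second field is not a valid IP address
        hits[str(ip)] += 1
    return hits


def deny_rules(hits, min_hits=1):
    """Return .htaccess 'deny from' lines for IPs seen at least min_hits times."""
    return ["deny from " + ip for ip, count in sorted(hits.items()) if count >= min_hits]


if __name__ == "__main__":
    # Invented sample entries in the assumed log format.
    log = [
        "08.06.2009 10:00\t203.0.113.5\tBadBot/1.0\n",
        "08.06.2009 10:01\t203.0.113.5\tBadBot/1.0\n",
        "08.06.2009 10:02\t198.51.100.7\tMozilla/5.0\n",
    ]
    for rule in deny_rules(trap_hits(log), min_hits=2):
        print(rule)
```

Look over the list before pasting it into .htaccess. A curious human can end up in the trap too, so it is worth checking the user agents and hit counts first.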

That's it, hope you enjoyed the post.

Comments (4) Trackbacks (0)
  1. Seems to be nice article , gave a good start in bot identification , Thanks a lot bro!.

    • I just want to say that this is very informative and I wonder if I may send you an attachment of a snapshot of my awstats report and you could tell me if I have any spiders in there?

  2. Thanks for this.. I was looking for a better way to ID the bad-bots in AWstats, but it looks like the ‘manual’ way is the only way. But now i’m going to try your bot-trap idea!

  3. Great site, loved the amount of detail given been looking for ideas on getting rid of annoying bots crawling my site for a while now and this is by far the most comprehensive thing I found quickly.

