Offshore Outsourcing Best Practices outsourcing@dataart.com
Bot Identification
The most reliable way to identify crawlers is by IP address. This method works for the most popular search engines, though their addresses may change occasionally. Unfortunately, it does not apply to distributed robots (site-downloading utilities, personal search systems, pilot bots), which can reach the site from virtually any IP address.
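As a rough illustration of IP-based identification, the following Python sketch checks a client address against a list of crawler networks. The crawler names and address ranges are hypothetical placeholders (reserved documentation ranges), not real search-engine addresses; a real list would have to be maintained and refreshed as addresses change.

```python
import ipaddress

# Hypothetical crawler networks: these are reserved documentation ranges,
# NOT real search-engine addresses. Replace with a list you maintain.
KNOWN_CRAWLER_NETWORKS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "otherbot": [ipaddress.ip_network("198.51.100.0/25")],
}


def crawler_for_ip(ip):
    """Return the name of the crawler whose network contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, networks in KNOWN_CRAWLER_NETWORKS.items():
        if any(addr in net for net in networks):
            return name
    return None
```

A call such as `crawler_for_ip("192.0.2.17")` matches the first placeholder network, while an address outside every listed range yields `None` and has to be judged by other means.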
Some crawlers can be detected from the User-Agent header if you maintain a list of known crawler signatures. Others disguise themselves as popular browsers (IE, Mozilla) or really are those browsers (for example, when IE downloads a site to make it available offline). Such cases can be handled with adaptive methods that analyze the behavior of the remote client: if a single IP address requests many pages in a very short period of time (for example, 50 pages in one minute), it is most probably a robot. Beyond that, you can weigh several indirect signs, such as an empty referrer field or a request for the robots.txt file. In practice, this multi-level crawler identification scheme has proved highly effective.
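The user agent, request rate, and indirect signs described here could be combined along the following lines. This is a minimal Python sketch, not a definitive implementation: the class name `CrawlerDetector`, the user agent tokens, and the reason labels are all assumptions made for illustration, and only the threshold of 50 pages per minute comes from the text.

```python
from collections import defaultdict, deque

# Hypothetical user-agent substrings; a real list must be maintained by hand.
KNOWN_BOT_AGENT_TOKENS = ("examplebot", "samplecrawler")
MAX_PAGES = 50        # threshold taken from the example in the text
WINDOW_SECONDS = 60   # "in one minute"


class CrawlerDetector:
    """Collects the signs that a request comes from a robot."""

    def __init__(self):
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._fetched_robots = set()      # ips that have requested robots.txt

    def classify(self, ip, path, user_agent, referrer, now):
        """Return a list of reasons the client looks like a robot (may be empty)."""
        reasons = []

        # Level 1: user agent matches a known crawler signature.
        agent = user_agent.lower()
        if any(token in agent for token in KNOWN_BOT_AGENT_TOKENS):
            reasons.append("user-agent")

        # Level 2: too many pages from one IP inside the sliding window.
        hits = self._hits[ip]
        hits.append(now)
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= MAX_PAGES:
            reasons.append("rate")

        # Level 3: indirect signs, which are weak when taken on their own.
        if path == "/robots.txt":
            self._fetched_robots.add(ip)
        if ip in self._fetched_robots:
            reasons.append("robots.txt")
        if not referrer:
            reasons.append("empty-referrer")

        return reasons
```

The method returns reasons instead of a yes/no verdict because the indirect signs are weak individually; how many of them should count as a positive identification is left to the caller.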