Posts

iplookup

IP Lookup If you think you've spotted a robot in your web server logs, you can use these tools to look up the IP address, which gives you a clue as to who may be operating the robot. You can use these external lookup services: Inspect an IP at Project Honey Pot, which knows about many email address harvesting web robots; Cybernet Quest IP Address Lookup; Network Tools Whois Lookup.
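Before pasting an address into one of these services, it helps to pull it out of the log reliably. The following is a minimal Python sketch, assuming the server writes Common or Combined Log Format lines (client address first, User-Agent last). The names `parse_log_line`, `reverse_dns`, and the sample line are hypothetical, and `192.0.2.10` is a documentation-only address.

```python
import re
import socket

# Assumption: Common/Combined Log Format, with the client address as the first field
# and (in the combined variant) the User-Agent as the last quoted field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)

# Hypothetical sample entry, not taken from a real log.
SAMPLE = ('192.0.2.10 - - [10/Oct/2008:13:55:36 -0700] '
          '"GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"')


def parse_log_line(line):
    """Return (ip, user_agent) for a log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return match.group("ip"), match.group("agent")


def reverse_dns(ip):
    """Return the host name registered for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None
```

A reverse DNS name is only a hint, since the operator of the address block controls it. Treat the result as a starting point for the Whois services listed above, not as proof of who runs the robot.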

db

Robots Database The Robots Database lists robot software implementations and operators. Robots listed here have been submitted by their owners, or by web site owners who have been visited by the robots. A listing here does not mean that a robot is endorsed in any way. For a list of User-Agents (including bots) in the wild, see www.botsvsbrowsers.com. This robots database is currently undergoing re-engineering. Due to popular demand we have restored the existing data, but additions and modifications are disabled. ABCdatos BotLink Acme.Spider Ahoy! The Homepage Finder Alkaline Anthill Walhello appie Arachnophilia Arale Araneo AraybOt ArchitextSpider Aretha ARIADNE arks AskJeeves ASpider (Associative Spider) ATN Worldwide Atomz.com Search Robot AURESYS BackRub Bay Spider Big Brother Bjaaland BlackWidow Die Blinde Kuh Bloodhound Borg-Bot BoxSeaBot bright.net caching robot BSpider CACTVS Chemistry Spider Calif Cassandra Digimarc Marcspider/CGI Checkbot...

checker

/robots.txt checker We currently don't have our own /robots.txt checker, but there are some third-party tools: Google's robots.txt analysis tool (requires a Google Account)
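For a quick local check, Python's standard library includes `urllib.robotparser`. The sketch below is one possible approach, not a tool provided by this site: it parses rules from a string so no network access is needed. The `RULES` text, the `is_allowed` helper, and the `ExampleBot` user agent are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for the contents of a site's /robots.txt.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so the rules can come from a string or file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `is_allowed(RULES, "ExampleBot", "http://www.example.com/tmp/file.html")` is rejected by the `Disallow: /tmp/` rule, while `http://www.example.com/index.html` is allowed. This only reports how the standard-library parser reads the rules; individual robots may interpret edge cases differently.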

other

Other Sites Google Many people end up on this site because they have questions about specific search engine robots and search engines. For such questions the best place is the relevant site's own help pages: Google Web Search Help Center, Google Webmaster Help Center, Yahoo!'s Web Crawler Help Pages. Extensions to the Robots Exclusion Protocol: Recently three major search engines have collaborated to support extensions to the /robots.txt directives and related mechanisms. See the joint announcements on: Yahoo! Search Blog, Google Webmaster Central Blog, Microsoft Live Search Webmaster Team Blog. Sites about Search Engines: Other sites that are useful to webmasters: Search Engine Land, Search Engine Watch, Search Engine Roundtable.

mailinglist

The robots mailing list The robots mailing list provided a technical forum for people interested in web robots, in the early days of the web. It was hosted at Nexor, Webcrawler, and later at mccmedia, but is no longer active. Some archived messages are available: Jun 1994 - Aug 1995 (0.5Mb, text) Oct 1995 - Mar 1997 (4.2Mb, text)

faq

Frequently Asked Questions This is a list of frequently asked questions about web robots. Select the question to go to the answer page, or select the eye icon after the question to show the answer on this page. About WWW robots: What is a WWW robot? What is an agent? What is a search engine? What kinds of robots are there? So what are Robots, Spiders, Web Crawlers, Worms, Ants? Aren't robots bad for the web? Are there any robot books? Where do I find out more about robots? Indexing Robots: How does a robot decide where to visit? How does an indexing robot decide what to index? How do I register my page with a robot? How do I get the best listing in search engines? For Server Administrators: How do I know if I've been visited by a robot? I've been visited by a robot! Now what? A robot is traversing my whole site too fast! How do I keep a robot off my server? Robots exclusion standard: Why do I find entries for /robots.txt in my l...

meta

About the Robots <META> tag In a nutshell: You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not to scan it for links to follow. For example: <html> <head> <title>...</title> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head> There are two important considerations when using the robots <META> tag: First, robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention. Second, the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page. Don't confuse this NOFOLLOW with the rel="nofollow" link attribute. The details: Like the /robots.txt , ...
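To see which directives a page actually declares, the robots <META> tag can be read with Python's standard `html.parser` module. This is a rough sketch under the assumption that directives are comma-separated and case-insensitive, as in the NOINDEX, NOFOLLOW example in the excerpt; `RobotsMetaParser`, `robots_directives`, and `PAGE` are hypothetical names.

```python
from html.parser import HTMLParser

# The example page from the excerpt above, reused as test input.
PAGE = ('<html> <head> <title>...</title> '
        '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>')


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <META NAME="ROBOTS"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of lowercased robots directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A well-behaved robot would skip indexing when `"noindex"` is in the returned set and skip link extraction when `"nofollow"` is. As the excerpt notes, nothing forces a robot to perform this check at all.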