Chapter 9: Search
Robots.txt

To deal with limiting robot access, the Robot Exclusion Protocol was adopted. The basic idea is to use a special file called robots.txt, which should be found in the root directory of a Web site. For example, if a spider were indexing http://www.democompany.com, it would first look for a file at http://www.democompany.com/robots.txt. If it finds the file, it analyzes it before proceeding to index the site.
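The lookup a spider performs can be sketched in a few lines of Python. This is an illustrative helper (the function name robots_txt_url is ours, not part of any standard): whatever page URL the spider starts from, the robots.txt location depends only on the scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Given any page URL, return the URL where a well-behaved spider
    would look for robots.txt: always at the root of the host,
    regardless of the page's own path."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.democompany.com/products/index.html"))
# http://www.democompany.com/robots.txt
```

Note that this is exactly why the hosting-vendor situation in the note below arises: the path portion of the customer's URL is discarded, so the spider only ever consults the vendor's root-level file.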
Note: You will find that many spiders will ignore a robots.txt file with a URL like http://www.bigfakehostingvendor.com/~customer/robots.txt, where the robots.txt file is not located in the root directory. Unfortunately, you will have to ask the vendor to place an entry for you in their robots.txt file.
The basic format of the robots.txt file is a listing of the particular spider or user agent you want to limit, followed by Disallow statements naming the directory paths that agent should not index. For example,
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /archive/

In this case, we have denied access for all robots to the cgi-bin directory, the temp directory, and an archive directory, perhaps one where we would move files that are very old but still need to be online. You should be very careful with what you put in your robots.txt file. Consider this file:
User-agent: *
Disallow: /subscribers-only/
Disallow: /resellers/

In this file, special subscribers-only and resellers areas have been disallowed for indexing. However, you have just let people know this content is sensitive. If you have content that is hidden unless someone pays to receive a URL via e-mail, you certainly will not want to list it in the robots.txt file. Just letting people know the file or directory exists is a problem. Consider that malicious visitors will look carefully at a robots.txt file to see exactly what it is you don't want them to see. That's very easy to do: just type in a URL like this: http://www.companytolookat.com/robots.txt.
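To see how a well-behaved spider applies such rules, here is a minimal sketch using Python's standard-library robots.txt parser. The rules and host name are taken from the examples in this section; a polite crawler would run a check like this before fetching each URL.

```python
from urllib.robotparser import RobotFileParser

# The example rules, as a spider would receive them after
# downloading the site's robots.txt file.
rules = [
    "User-agent: *",
    "Disallow: /subscribers-only/",
    "Disallow: /resellers/",
]

parser = RobotFileParser()
parser.parse(rules)

# Check candidate URLs against the rules before crawling them.
print(parser.can_fetch("*", "http://www.democompany.com/index.html"))
# True -- not covered by any Disallow line
print(parser.can_fetch("*", "http://www.democompany.com/subscribers-only/"))
# False -- explicitly disallowed
```

Of course, nothing forces a crawler to perform this check: the protocol is purely voluntary, which is why listing sensitive paths in robots.txt advertises rather than protects them.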