Brown University

Excluding search engine robots from your site


When you use a search engine to find web pages, you are accessing a database that the search engine has built up by reading web pages and recursively following the links on them.  This recursive reading and following of links is also called spidering.  The program that does the spidering is called a robot or a spider.

The best way to keep your pages from being spidered by a search engine robot is not to link to them anywhere.  That's not always possible, since other people may create links to your pages.  There are two mechanisms that let you tell any robots that find your pages not to index them:  robots.txt files and robots META tags.  robots.txt files are useful on a website-wide basis.  META tags are useful on a page-by-page basis.

Most, but unfortunately not all, search engine robots honor these directives.  Well-behaved robots follow the directives described below, but the directives cannot guarantee that your page will stay out of an index if a robot finds it.
 

robots META tags

To exclude single pages, or hierarchies of pages, use robots META tags.  These tags are included in the <head> section of web pages:
 
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="robots" content="noindex,nofollow">
   <meta name="description" content="instructions for excluding search engine robots">
   <title>Search Engine exclusion</title>
</head>
<body>
....
</body>
</html>

These META tag directives allow you to control whether your page is indexed or not and whether the links on your page are followed or not. Again, not all search engine robots honor these META tags.

There are four different combinations for the robots META tag:
 
 

Do not index, do not follow links. <meta name="robots" content="noindex,nofollow">
Do not index, follow links. <meta name="robots" content="noindex,follow">
Index this page, but do not follow links. <meta name="robots" content="index,nofollow">
Index this page and follow links (the default behavior). <meta name="robots" content="index,follow">
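
As a rough illustration of how a robot might read these tags, the following Python sketch extracts the robots META directives from a page and turns them into two yes/no decisions. The class and function names (RobotsMetaParser, robots_policy) are made up for this example, and real robots may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from a page's <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    """Return (may_index, may_follow); both are True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
print(robots_policy(page))  # (False, True)
```

A page without any robots META tag yields (True, True), which matches the "index,follow" combination.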

If you do not want your pages spidered, it's safest to include the robots META tag on each page, and not rely on one META tag at the top of a hierarchy. Consider the following scenario.
 

You maintain three pages that are connected by links in the sequence pageA -> pageB -> pageC.  If you put a robots META tag on your pageA requesting "noindex,nofollow", pageA will not be indexed, and a robot arriving at pageA will not follow its links on to pageB and pageC.  However, if your pageB is linked to from another site, your "noindex,nofollow" robots tag on pageA will not exclude robots from pageB or pageC.
Including the robots META tag requesting "noindex,nofollow" on all your pages removes this possibility.
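
The scenario can be simulated with a small Python sketch. The SITE dictionary and the crawl function are hypothetical stand-ins for real pages and a real, well-behaved robot; the point is only to show how a second entry point bypasses the tag on pageA.

```python
from collections import deque

# Hypothetical three-page site: each entry lists the page's robots
# directives and the pages it links to.
SITE = {
    "pageA": {"robots": {"noindex", "nofollow"}, "links": ["pageB"]},
    "pageB": {"robots": set(), "links": ["pageC"]},
    "pageC": {"robots": set(), "links": []},
}


def crawl(site, entry_points):
    """Return the set of pages a well-behaved robot would index."""
    indexed = set()
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        name = queue.popleft()
        page = site[name]
        if "noindex" not in page["robots"]:
            indexed.add(name)
        if "nofollow" in page["robots"]:
            continue  # do not follow any link on this page
        for target in page["links"]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# The robot only ever arrives at pageA: nothing is indexed.
print(sorted(crawl(SITE, ["pageA"])))  # []

# Another site links to pageB, so the robot also arrives there directly.
print(sorted(crawl(SITE, ["pageA", "pageB"])))  # ['pageB', 'pageC']
```

Adding "noindex" and "nofollow" to the robots sets of pageB and pageC makes the second call return an empty result as well.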



robots.txt files

These files allow web site administrators to exclude visiting search engine robots from sections of their site.  They are well suited to excluding large portions of a web server from indexing.  For example, you might not want a search engine requesting, and thereby running, the programs in your cgi directories. This method is not particularly suited to excluding single pages (see robots META tags instead).

When a robot visits the site http://website.brown.edu, it first checks for the existence of the file
 

http://website.brown.edu/robots.txt


This specially formatted file tells the robot which parts of the site to exclude.  There can be only one robots.txt file per website, and it must be at the top level of the site; the file

http://website.brown.edu/myportion/robots.txt
will not be honored.
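
Because the file always lives at the top level of the site, its address can be derived from the address of any page on that site. The Python sketch below shows one way to do this; the function name robots_txt_url and the page path used in the example are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page inside a subdirectory of the site.
print(robots_txt_url("http://website.brown.edu/myportion/index.html"))
# http://website.brown.edu/robots.txt
```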

If you want to allow all robots to spider your site, put the following in your robots.txt file:

User-agent: *
Disallow:

If you want to exclude all robots from your site, put the following in your robots.txt file:

User-agent: *
Disallow: /

The following robots.txt file excludes all robots from the URLs http://website.brown.edu/Logs/, http://website.brown.edu/Documents/Local/, http://website.brown.edu/cgi-local/ and http://website.brown.edu/cgi-bin/.
 
User-agent: *
Disallow: /Logs/
Disallow: /Documents/Local/
Disallow: /cgi-local/
Disallow: /cgi-bin/


Note that you need a separate Disallow line for each area that you want to exclude from search engine robots. Also note the trailing slash, which indicates that the entire directory should be ignored by the search engine robots.
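
One way to check such a file before publishing it is Python's standard urllib.robotparser module, which applies the rules much as a well-behaved robot would. In the sketch below, the robot name "AnySpider" and the file names in PATHS are hypothetical and chosen only to exercise the rules.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /Logs/
Disallow: /Documents/Local/
Disallow: /cgi-local/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://website.brown.edu"
# Hypothetical file names, used only to test the rules above.
PATHS = [
    "/index.html",
    "/Logs/access.log",
    "/Documents/Local/notes.html",
    "/Documents/public.html",
]

results = {path: parser.can_fetch("AnySpider", BASE + path) for path in PATHS}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "excluded")
```

With these rules, the two paths under /Logs/ and /Documents/Local/ are reported as excluded and the other two as allowed.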

It is also possible to allow one specific robot, but to exclude all others:
 

User-agent: FriendlySpider
Disallow:

User-agent: *
Disallow: /


You can find out which User-agent name a robot reports by checking your web server's logs.  (Unfortunately, not all search engine robots or search engine administrators honor robots.txt files.)
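
The per-robot rules above can be checked the same way with Python's standard urllib.robotparser module. This is a sketch only: "OtherSpider" is a made-up name standing in for any robot other than FriendlySpider, and the page URL is an assumed example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: FriendlySpider
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page on the site.
url = "http://website.brown.edu/index.html"

friendly_allowed = parser.can_fetch("FriendlySpider", url)
other_allowed = parser.can_fetch("OtherSpider", url)

print(friendly_allowed)  # True
print(other_allowed)     # False
```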
 



This document was distilled from the documentation on The Web Robots Pages.
