Excluding search engine robots from your site
When you use a search engine to find web pages, you are accessing
a database that the search engine has built up by reading web pages and
recursively following the links on them. This recursive reading and
following of links is also called spidering. The program that does
the spidering is called a robot or a spider.
The best way to keep your pages from being spidered by a search engine robot is not to link to them anywhere. That's not always possible, since other people may create links to your pages. There are two directives that allow you to tell any robots that find your pages not to index them: robots.txt files and robots META tags. robots.txt files are useful on a website-wide basis. META tags are useful on a page-by-page basis.
Most well-behaved search engine robots honor these directives, but unfortunately not all do.
The directives described below therefore do not guarantee that your page
will not be indexed if a robot finds it.
robots META tags
To exclude single pages, or hierarchies of pages, use robots META tags. These tags are included in the <head> section of web pages:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="instructions for excluding search engine robots">
<title>Search Engine exclusion</title>
</head>
<body>
....
</body>
</html>
These META tag directives allow you to control whether your page is indexed or not and whether the links on your page are followed or not. Again, not all search engine robots honor these META tags.
There are four different combinations for the robots META tag:
| Do not index, do not follow links. | <meta name="robots" content="noindex,nofollow"> |
| Do not index, follow links. | <meta name="robots" content="noindex,follow"> |
| Index this page, but do not follow links. | <meta name="robots" content="index,nofollow"> |
| Index this page, follow links. | <meta name="robots" content="index,follow"> |
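As a rough illustration of how a robot might read these tags, the following Python sketch pulls the robots directives out of a page with the standard library's html.parser module. The names RobotsMetaParser and robots_directives are made up for this example, and treating a page without a robots META tag as "index,follow" is an assumption about the default; individual robots may behave differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the index/follow settings from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Assumption: with no robots META tag, a page may be indexed and followed.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html):
    """Return (index, follow) as a pair of booleans for the given page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(robots_directives(page))  # (False, False)
```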
If you do not want your pages spidered, it's safest to include the robots
META tag on each page, and not rely on one META tag at the top of a hierarchy.
Consider the following scenario.
You maintain three pages that are connected by links in the sequence pageA -> pageB -> pageC. If you put a robots META tag on your pageA requesting "noindex,nofollow", pageA will not be indexed, and the links to pageB and pageC will not be followed. However, if your pageB is linked to from another site, your "noindex,nofollow" robots tag on pageA will not exclude robots from pageB or pageC. Including the robots META tag requesting "noindex,nofollow" on all your pages removes this possibility.
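The scenario above can be sketched as a small simulation in Python. The page graph, the outside page called "external", and the crawl function are all hypothetical; the sketch assumes a robot that honors the robots META tag on every page it reads.

```python
from collections import deque


def crawl(pages, start):
    """Return the set of pages that a directive-honoring robot would index."""
    indexed = set()
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        page = pages[name]
        if page["index"]:
            indexed.add(name)
        if page["follow"]:
            # Only links on pages that permit following are queued.
            for link in page["links"]:
                if link in pages and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return indexed


# Hypothetical site: only pageA carries "noindex,nofollow".
pages = {
    "pageA": {"links": ["pageB"], "index": False, "follow": False},
    "pageB": {"links": ["pageC"], "index": True, "follow": True},
    "pageC": {"links": [], "index": True, "follow": True},
    "external": {"links": ["pageB"], "index": True, "follow": True},
}

print(sorted(crawl(pages, "pageA")))     # []
print(sorted(crawl(pages, "external")))  # ['external', 'pageB', 'pageC']
```

A robot that starts at pageA indexes nothing, but one that arrives through the outside link reaches pageB and pageC, which is why the tag belongs on every page.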
robots.txt files
When a robot visits the site http://website.brown.edu,
it first checks for the existence of the file
http://website.brown.edu/robots.txt
This specially formatted file tells the robot which parts of the
site to exclude. There can be only one robots.txt
file per website; the file
http://website.brown.edu/myportion/robots.txt will not be honored.
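To make the "one file per website" rule concrete, here is a minimal Python sketch of how a robot could work out which robots.txt applies to any page: it keeps the scheme and host and discards the rest of the URL. The helper name robots_txt_url and the sample page path are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the site-wide robots.txt URL that governs the given page."""
    parts = urlsplit(page_url)
    # Only the scheme and host survive; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://website.brown.edu/myportion/index.html"))
# http://website.brown.edu/robots.txt
```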
If you want to allow all robots to spider your site, put the following in your robots.txt file:
User-agent: *
Disallow:
If you want to exclude all robots from your site, put the following in your robots.txt file:
User-agent: *
Disallow: /
The following robots.txt file excludes all robots from the URLs http://website.brown.edu/Logs/, http://website.brown.edu/Documents/Local/, http://website.brown.edu/cgi-local/ and http://website.brown.edu/cgi-bin/.
User-agent: *
Disallow: /Logs/
Disallow: /Documents/Local/
Disallow: /cgi-local/
Disallow: /cgi-bin/
Note that you need a separate Disallow line for each area that you want to exclude
from search engine robots. Also note the trailing slash, which indicates that the entire directory
should be ignored by the search engine robots.
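One way to check a robots.txt file before you publish it is Python's standard urllib.robotparser module, which applies the rules much as a well-behaved robot would. This is only a sketch: the robot name AnyBot, the helper allowed, and the sample page paths are invented for the example, and other robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The multi-directory robots.txt file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Logs/
Disallow: /Documents/Local/
Disallow: /cgi-local/
Disallow: /cgi-bin/
"""


def allowed(path, agent="AnyBot"):
    """Return True if the (hypothetical) robot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://website.brown.edu" + path)


print(allowed("/Logs/access.html"))      # False
print(allowed("/Documents/index.html"))  # True
print(allowed("/Logs.html"))             # True
```

The last check shows what the trailing slash does in this parser: /Logs/ matches only paths inside that directory, so a page named /Logs.html is still allowed.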
It is also possible to allow one specific robot, but to exclude all others:
User-agent: FriendlySpider
Disallow:

User-agent: *
Disallow: /
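The per-robot rules can be tried out the same way. The sketch below feeds the file above to Python's standard urllib.robotparser module and asks whether two robots may fetch a page. The helper may_fetch, the robot name OtherRobot, and the page URL are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# One record for FriendlySpider, one for everybody else,
# separated by a blank line.
ROBOTS_TXT = """\
User-agent: FriendlySpider
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url="http://website.brown.edu/index.html"):
    """Return True if the named robot is allowed to fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("FriendlySpider"))  # True
print(may_fetch("OtherRobot"))      # False
```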
You can find out which User-agent name a robot reports
from your web server's logs. (Unfortunately, not all search engine
robots or search engine administrators honor robots.txt files.)
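If your server writes its access log in the common "combined" format, where the User-agent is the last quoted field on each line, a short Python script can list the robots that asked for robots.txt. This is a sketch under that assumption: the log format, the function name robots_txt_agents, and the two sample lines are all invented for illustration, so adjust the pattern to match your own logs.

```python
import re
from collections import Counter

# Assumed layout: "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count the User-agent strings of clients that requested /robots.txt."""
    agents = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not in the assumed format
        parts = match.group("request").split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            agents[match.group("agent")] += 1
    return agents


# Made-up sample lines, not taken from a real server.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2000:00:00:01 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "FriendlySpider/1.0"',
    '192.0.2.2 - - [01/Jan/2000:00:00:02 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

for agent, count in robots_txt_agents(SAMPLE_LINES).items():
    print(agent, count)  # FriendlySpider/1.0 1
```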