SEO-News: March 23, 2006 Feature Article

  Article printed from SEO-News: http://www.seo-news.com
  HTML version available at: http://www.seo-news.com/archives.html
Working With the Robots.txt File 
By RedAlkemi Syndicate (c) 2006

Topics:

What is the robots.txt file?
Working with the robots.txt file
Advantages of the robots.txt file
Disadvantages of the robots.txt file
Optimization of the robots.txt file
Using the robots.txt file

What is the robots.txt file?

The robots.txt file is an ASCII text file that gives search
engine robots specific instructions about content they are
not allowed to index. These instructions are the deciding
factor in how a search engine indexes your website's pages.
The universal address of the robots.txt file is:
www.example.com/robots.txt . This is the first file that a
robot visits. It picks up instructions for indexing the site
content and follows them. Each record in this file uses two
text fields. Let's study this example:

User-agent: *
Disallow:

The User-agent field names the robot to which the access
policy in the Disallow field applies. The Disallow field
specifies the URLs that the named robots may not access. An
empty Disallow value, as above, means nothing is blocked. An
example:

User-agent: *
Disallow: /

Here "*" means all robots and "/" means all URLs. This is
read as "No access for any search engine to any URL". Since
every URL path begins with "/", a Disallow value of "/" with
nothing after it bans access to all URLs. If partial access
has to be given, only the banned URL is specified in the
Disallow field. Let's consider this example:

# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/

Here we see that both fields have been repeated. Separate
records can be given for different user agents, each on its
own lines. The above commands mean that all user agents are
denied access to /concepts/new/ except Googlebot, which has
full access. Characters following # are ignored up to the
end of the line, as they are considered to be comments.
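
The reading of the example above can be checked with a short
script. This is a minimal sketch using Python's standard
urllib.robotparser module; the helper name build_parser, the
robot name SomeOtherBot and the page paths are assumptions
made up for illustration, not part of the article's example.

```python
from urllib.robotparser import RobotFileParser

# The two-record example from the article, held in a string.
ROBOTS_TXT = """\
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has its own record with an empty Disallow: full access.
print(parser.can_fetch("Googlebot", "/concepts/new/page.html"))

# Any other robot falls back to the "*" record.
print(parser.can_fetch("SomeOtherBot", "/concepts/new/page.html"))
print(parser.can_fetch("SomeOtherBot", "/index.html"))
```

Other robots and parsers may interpret edge cases differently,
so treat this as one parser's reading of the file rather than a
guarantee of how every robot behaves.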

Working with the robots.txt file

1. The robots.txt file name is always all lowercase
(e.g. Robots.txt or robots.Txt is incorrect).

2. Wildcards are not supported in either field. Only * can
be used in the User-agent field because it is a special
character denoting "all". At the time of writing, Googlebot
is the only robot that supports wildcards for some patterns,
such as file extensions.

Ref: http://www.google.com/webmasters/remove.html

3. The robots.txt file is an exclusion file meant for search
engine robot reference and not obligatory for a website to
function. An empty or absent file simply means that all
robots are welcome to index any part of the website.

4. Only one file can be maintained per domain.

5. Website owners who do not have administrative rights
sometimes cannot create a robots.txt file. In such
situations, the Robots Meta Tag
(http://www.redalkemi.com/search-engine-optimization-seo/meta-tags-article.php)
can be configured to serve the same purpose. Keep in mind
that questions have lately been raised about robot behavior
regarding the Robots Meta Tag. Some robots might skip it
altogether. Protocol makes it obligatory for all robots to
start with the robots.txt file, making it the default
starting point for all robots.

6. Separate lines are required for each user agent, and a
Disallow field should not carry more than one command per
line. There is no limit to the number of lines, though: both
the User-agent and Disallow fields can be repeated with
different commands any number of times. Blank lines separate
one record from the next, so do not put a blank line between
the User-agent and Disallow lines of a single record.

7. Use lower case for robots.txt file content wherever the
actual names allow it. Please also note that filenames on
Unix systems are case sensitive. Be careful to match the
exact case when naming directories or files for Unix-hosted
domains.
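
Points 1 and 4 above imply that every page on a domain shares
a single robots.txt address with a fixed, lowercase name. The
sketch below derives that address from any page URL using
Python's standard urllib.parse module. The function name
robots_txt_url is hypothetical, chosen only for this
illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the host serving page_url.

    The path is always the lowercase /robots.txt at the root; any
    path, query string or fragment on the page URL is discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/stats/index.htm"))
```

The sketch expects a full URL including the scheme (http:// or
https://); a bare host name such as www.example.com would need
the scheme added first.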

Advantages of the robots.txt file

1. Protocol demands that all search engine robots start with
the robots.txt file. This is the default entry point for
robots if the file is present. Specific instructions can be
placed in this file to help index your site on the web.
Major search engines honor the Standard for Robots
Exclusion.

2. The robots.txt file can be used to keep out unwanted
robots like email retrievers, image strippers etc.

3. The robots.txt file can be used to specify the
directories on your server that you don't want robots to
access and/or index e.g. temporary, cgi, and
private/back-end directories.

4. An absent robots.txt file could generate a 404 error and
redirect the robot to your default 404 error page. Careful
research showed that sites that have no robots.txt file but
do have a customized 404-error page serve that page to
robots instead. The robot may then treat it as the
robots.txt file, which can confuse its indexing.

5. The robots.txt file is used to direct select robots to
relevant pages to be indexed. This especially comes in handy
where the site has multilingual content or where the robot
is searching for only specific content.

6. The robots.txt file was also needed to stop robots from
deluging servers with rapid-fire requests or re-indexing the
same files repeatedly. If you have duplicate content on your
site for any reason, it can be kept from getting indexed.
This will help you avoid any duplicate content penalties.
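
Point 4 above describes a customized 404 page being served in
place of a missing robots.txt file. A site owner can spot this
by fetching www.example.com/robots.txt and checking what comes
back. The following is only a rough heuristic sketch, not how
any particular robot decides; the function name
looks_like_robots_txt is hypothetical.

```python
def looks_like_robots_txt(body: str) -> bool:
    """Heuristic check that a response body reads like robots.txt.

    Every line that is not blank or a comment must have the form
    "field: value" and must not contain an HTML tag opener. An
    empty body passes, since an empty file means "allow all".
    """
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue
        if "<" in line or ":" not in line:
            return False
    return True


print(looks_like_robots_txt("User-agent: *\nDisallow: /stats/\n"))
print(looks_like_robots_txt("<html><body>404 Not Found</body></html>"))
```

If the second kind of content is what your server returns for
/robots.txt, placing a real file in the root directory removes
the ambiguity.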

Disadvantages of the robots.txt file

Careless handling of directory names and filenames can lead
hackers to snoop around your site by studying the robots.txt
file, as you may sometimes list filenames and directories
that have classified content. This is not a serious issue,
as deploying effective security checks on the content in
question can take care of it. For example, if you have your
traffic log on your site at a URL such as
www.example.com/stats/index.htm which you do not want robots
to index, then you would have to add a command to your
robots.txt file. As an example:

User-agent: *
Disallow: /stats/

However, it is easy for a snooper to guess what you are
trying to hide: simply typing the URL www.example.com/stats
in a browser would give access to it. This calls for one of
the following remedies:

1. Change file names:

 * Change the stats filename from index.htm to something
   different, such as stats-new.htm so that your stats URL
   now becomes www.example.com/stats/stats-new.htm

 * Place a simple text file containing the text, "Sorry you
   are not authorized to view this page", and save it as
   index.htm in your /stats/ directory.

This way the snooper cannot guess your actual filename and
get to your banned content.

2. Use login passwords:

 * Password-protect the sensitive content listed in your
   robots.txt file.

Optimization of the robots.txt file

1. The right commands: Use correct commands. The most common
errors include putting the command meant for the
"User-agent" field in the "Disallow" field and vice versa.

 * Please note that there is no "Allow" command in the
   standard robots.txt protocol. Content not blocked in the
   "Disallow" field is considered allowed. Currently, only two
   fields are recognized: the "User-agent" field and the
   "Disallow" field. Experts are considering the addition of
   more robot-recognizable commands to make the robots.txt file
   more Webmaster and robot friendly.

 * Please also note that Google is the only search engine
   that is experimenting with certain new robots.txt commands.
   There are indications that Google now recognizes the "Allow"
   command. Please refer to:
   http://www.google.com/webmasters/remove.html

2. Bad syntax: Do not put multiple file URLs in one Disallow
line in the robots.txt file. Use a new Disallow line for
every directory that you want to block access to. Incorrect
example:

User-agent: *
Disallow: /concepts/ /links/ /images/

Correct example:

User-agent: *
Disallow: /concepts/
Disallow: /links/
Disallow: /images/

3. Files and directories: If a specific file has to be
disallowed, end the path with the file extension and no
trailing forward slash. Study the following examples:

For file:

User-agent: *
Disallow: /hilltop.html

For Directory:

User-agent: *
Disallow: /concepts/

Remember, if you have to block access to all files in a
directory, you don't have to specify each and every file in
robots.txt. You can simply block the directory as shown
above. Another common error is leaving out the slashes
altogether, which sends a very different message than
intended. For instance, "Disallow: /concepts" without the
trailing slash also blocks /concepts.html, because the value
is matched as a prefix of the URL path.

4. The right location: No robot will access a badly placed
robots.txt file. Make sure that the location is
www.example.com/robots.txt.

5. Capitalization: Never capitalize your syntax commands.
Directory and filenames are case sensitive on Unix
platforms. The only capitals used per the standard are in
"User-agent" and "Disallow".

6. Correct order: If you want to block access to all robots
except one or more specific ones, then the specific ones
should be mentioned first. Let's study this example:

User-agent: *
Disallow: /

User-agent: MSNBot
Disallow:

In the above case, a robot that acts on the first record
that applies to it would simply leave the site without
indexing, so MSNBot might never reach its own record.
Correct syntax is:

User-agent: MSNBot
Disallow:

User-agent: *
Disallow: /

7. Presence: Not having a robots.txt file at all could
generate a 404 error for search engine robots, which could
redirect the robot to the default 404-error page or your
customized 404-error page. If this happens seamlessly, it
is up to the robot to decide whether the target file is a
robots.txt file or an HTML file. Typically this would not
cause many problems, but you may not want to risk it. It is
always better to put the standard robots.txt file in the
root directory than to have none at all.

The standard robots.txt file for allowing all robots to
index all pages is:

User-agent: *
Disallow:

8. Using # carefully in the robots.txt file: Adding comments
with "#" on the same line as a syntax command is not a good
idea. Some robots might misinterpret the line, although it
is acceptable as per the robots exclusion standard. Separate
lines are always preferred for comments.
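
Several of the checks in the list above are mechanical and can
be automated. The sketch below is a small checker, written
under the assumption that only the two fields described in
this article are valid, so Google's experimental "Allow"
command would be reported as unrecognized. The names
KNOWN_FIELDS and lint_robots_txt are hypothetical.

```python
# Field names accepted by this sketch, compared in lowercase.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable problems found in text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        code, hash_mark, _comment = raw.partition("#")
        line = code.strip()
        if not line:
            continue  # blank line or comment-only line
        if hash_mark:
            problems.append(f"line {number}: comment shares a line with a command")
        field, colon, value = line.partition(":")
        if not colon:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field = field.strip().lower()
        value = value.strip()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unrecognized field '{field}'")
        elif field == "disallow":
            if len(value.split()) > 1:
                problems.append(f"line {number}: more than one path in a Disallow line")
            elif value and not value.startswith("/"):
                problems.append(f"line {number}: path does not start with '/'")
    return problems


sample = """\
User-agent: *
Disallow: /concepts/ /links/ /images/
Disallow: stats
Disallow: /tmp/  # temporary files
"""
for problem in lint_robots_txt(sample):
    print(problem)
```

The checker does not verify record order or whether the listed
paths exist on the server; those still need a manual review.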


Using the robots.txt file

1. Robots are configured to read text. Too much graphic
content could render your pages invisible to the search
engine. Use the robots.txt file to block irrelevant and
graphic-only content.

2. Indiscriminate access to all files, it is believed, can
dilute the relevance of your site content once it has been
indexed by robots. This could seriously affect your site's
ranking with search engines. Use the robots.txt file to
direct robots to content relevant to your site's theme by
blocking the irrelevant files or directories.

3. The file can be used for multilingual websites to direct
robots to relevant content for relevant topics for different
languages. It ultimately helps the search engines to present
relevant results for specific languages. It also helps the
search engine in its advanced search options where language
is a variable.

4. Some robots could cause severe server load problems by
rapid-firing too many requests at peak hours. This could
affect your business. Excluding robots that are irrelevant
to your site in the robots.txt file can take care of this
problem. It is really not a good idea to let malevolent
robots use up precious bandwidth to harvest your emails,
images etc.

5. Use the robots.txt file to block out folders with
sensitive information, text content, demo areas or content
yet to be approved by your editors before it goes live.
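
When several robots and directories are involved, generating
the file from a small table of rules helps avoid the syntax
and ordering mistakes described earlier. The following is a
minimal sketch under the conventions of this article: one path
per Disallow line, a blank line between records, and the "*"
record placed last. The function name render_robots_txt and the
rule table are hypothetical.

```python
def render_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render records with specific robots first and "*" last.

    rules maps a robot name to the list of paths it may not
    access; an empty list produces an empty Disallow line,
    which grants that robot full access.
    """
    # False sorts before True, so "*" moves to the end while the
    # other robots keep the order in which they were given.
    agents = sorted(rules, key=lambda agent: agent == "*")
    records = []
    for agent in agents:
        lines = [f"User-agent: {agent}"]
        paths = rules[agent] or [""]
        lines.extend(f"Disallow: {path}".rstrip() for path in paths)
        records.append("\n".join(lines))
    return "\n\n".join(records) + "\n"


print(render_robots_txt({"*": ["/"], "MSNBot": []}))
```

The output can be saved as robots.txt in the root directory of
the site.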

The robots.txt file is an effective tool to address certain
issues regarding website ranking. Used in conjunction with
other SEO strategies, it can significantly enhance a
website's presence on the net.


Related Reading:

A Standard for Robots Exclusion.
http://www.robotstxt.org/wc/norobots.html

Guide to The Robots Exclusion Protocol
http://www.robotstxt.org/wc/exclusion-admin.html

W3C Recommendations
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4.1.1

Meta Tags Optimization for Search Engines:
http://www.redalkemi.com/search-engine-optimization-seo/meta-tags-article.php
================================================================
RedAlkemi Syndicate. RedAlkemi (http://www.redalkemi.com/) is a
leading Internet Marketing, eCommerce, Graphic Design, Web &
Software Development services company. Experts at Redalkemi have
about 20 years of experience in the field of Graphic Design,
Visual Communication & Web Development. If you have comments;
or would like to have this article republished free on your site,
please contact syndicate@redalkemi.com All due credits must be
carried and text, hyperlinks and headers unaltered.
© Copyright 2005, RedAlkemi.com
================================================================


Copyright © 2006 Jayde Online, Inc.  All Rights Reserved.

SEO-News is a registered service mark of Jayde Online, Inc.