| For those new to the robots.txt file, it is merely a text file implementing what is known as the Standard for Robot Exclusion. The file is placed in the main directory of a website that advises spiders and other robots which directories or files they should not access. The file is purely advisory - not all spiders bother to read it let alone heed it. However, most, if not all, the spiders sent by the major search engines to index your site will read it and take cognizance of the rules contained within the file. Why is a Robots.txt File Important?1. It Can Avoid Wasting Server ResourcesAt the date of this writing, as far as I know, many of the search engine spiders do not bother to index the scripts on your site (such as your CGI or PHP scripts). However, there are those that do, including one of the major players, Google. For robots or spiders that actually index scripts, they will actually call your scripts just as a browser would, complete with all the special characters. If your site is like mine, where the scripts are solely meant for the use of humans and serve no practical use for a search engine (why should a search engine need to invoke my site-navigation script? - it can just crawl the direct links), you may want to block spiders from the directories that contain your scripts. For example, I block spiders from my CGI-BIN directory. Hopefully, this will reduce the load on the web server that occurs when scripts are executed by removing unnecessary executions. Of course there are the occasional ill-behaved robots that hit your server at high speed. Such spiders can actually bring down your server or at the very least slow it down for the real users who are trying to access it. If you know of any such spiders, you might want to exclude them too. You can do this with a robots.txt file. Unfortunately though, ill-behaved spiders often ignore robots.txt files as well. 2. It Can Save Your BandwidthIf you look at your website's web logs, you will undoubtedly find many requests for the robots.txt file by various search engine spiders. If, like me, you have a customized 404 document (which loads each time a visitor tries to retrieve a page that does not exist on your site), you will find that the robot will wind up requesting for that document instead, if you don't have an existing robots.txt file. My site has a fairly large 404 document, with the result that the spiders wind up loading it repeatedly throughout the day, adding to my already large bandwidth problems. In such a case, having a small robots.txt file may save you some bandwidth (yeah, I know, it's not that much). Some spiders may also request for files which you feel they should not. For example, one search engine requests for graphic files (".gif" files") on my sites. Since I see little reason why I should let it index the graphics on my site, waste my bandwidth, and possibly infringe my copyright, I ban it (and in fact all spiders) from my graphic files directory in my robots.txt file. 3. It Removes Clutter from your Web StatisticsI don't know about you, but one of the things I check from my web statistics is the list of URLs that visitors tried to access, but met with a 404 File Not Found Error. Often this tells me if I made a spelling error in one of the internal links on one of my sites (yes, I know - I should have checked all links in the first place, but mistakes do happen). If you don't have a robots.txt file, you can be sure that /robots.txt is going to feature in your web statistics 404 report, adding clutter and perhaps unnecessarily distracting your attention from the real bad URLs that need your attention. How to Set Up a Robots.txt FileWriting a robots.txt file could not be easier. It's just an ASCII text file that you place at the root of your domain. For example, if your domain is www.yourdomain.com, you will place the file at www.yourdomain.com/robots.txt. The file basically lists the names of spiders on one line, followed by the list of directories or files it is not allowed to access on subsequent lines, with each directory or file on a separate line. It is possible to use the wildcard character "*" instead of naming specific spiders. When you do so, all spiders are assumed to be named. Note that the robots.txt file is a robots exclusion file (with emphasis on the "exclusion") - there is no way to tell spiders to include any file or directory. Take the following robots.txt file for example:
The above two lines, when inserted into a robots.txt file, inform all robots (since the wildcard asterisk "*" character was used) that they are not allowed to access anything in the cgi-bin directory and its descendents. That is, they are not allowed to access cgi-bin/whatever.cgi or even a file or script in a subdirectory of cgi-bin, such as /cgi-bin/anything/whichever.cgi. Where Do You Get the Name of the Robots?If you have a particular spider in mind which you want to block, you have to find out its name. To do this, the best way is to check out the website of the search engine. Respectable engines will usually have a page somewhere that gives you details on how you can prevent their spiders from accessing certain files or directories. Common Mistakes in Robots.txt1. It's Not Guaranteed to WorkAs mentioned earlier, although the robots.txt format is listed in a document called "A Standard for Robots Exclusion", not all spiders and robots actually bother to heed it. Listing something in your robots.txt is no guarantee that it will be excluded. If you really need to protect something, you should use a .htaccess file to password-protect the directory (if you are running your site on an Apache server). 2. Don't List Your Secret DirectoriesAnyone can access your robots file, not just robots. For example, typing http://www.google.com/robots.txt will get you Google's own robots.txt file. I notice that some new webmasters seem to think that they can list their secret directories in their robots.txt file to prevent that directory from being accessed. Far from it. Listing a directory in a robots.txt file often attracts attention to the directory! In fact, some spiders (like certain spammers' email harvesting robots) make it a point to check the robots.txt for excluded directories to spider. 3. Only One Directory/File per Disallow lineDon't try to be smart and put multiple directories on your Disallow line. This will probably not work the way you think, since the Robots Exclusion Standard only provides for one directory per Disallow statement. 4. For Zen Cart User-agent: * And a minimal index.html even if blank If this is a php site and you need to load index.php and not the index.html file add the following to your .htaccess file
|
Powered by WHMCompleteSolution
All Inclusive Web Hosting, Online Marketing, Website Development and Anti-Hacker Hosting from Greenix Hosting