Robots.txt ~tutorial

Discussion in 'Tutorials' started by lair360, Jan 16, 2009.

  1. lair360

    lair360 New Member

    Messages:
    200
    Likes Received:
    0
    Trophy Points:
    0

    Robots.txt ~tutorial

    Introduction:
    This tutorial will help you block a specific robot, or multiple robots, from indexing your files, folders, documents and other private file types. It will also reduce the risk of private data being seen or collected by "search engines".

    "Robots.txt" is a regular text file. It also has special meanings to the majority of "honourable" robots on the web. By defining a few rules in the text file, you can instruct or command robots to stop crawling and indexing certain files, directories within your site, or none at all. For example, you may not want "Google" to crawl the "/images" directory of your site, as it's both meaningless to you and a waste of your site's bandwidth. "Robots.txt" lets you tell Google just that...

    Notes: before you create a regular "text file" called "robots.txt", make sure it's named exactly as written (all lowercase)! This file must also be uploaded to the root (publicly accessible) directory of your site, not a subdirectory...

    Example: http://www.mysite.com/robots.txt but NOT http://www.mysite.com/sub_folder/robots.txt
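
    A quick sketch before the rules below: a complete "robots.txt" can be as small as the two lines shown here (placed at http://www.mysite.com/robots.txt in this example), which allow every robot to crawl everything, since an empty Disallow line blocks nothing.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow:
    
    ---End Source Code---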

    Syntax
    ----------------------------------------


    User-agent - the robot(s) the rules that follow apply to...
    Disallow - the URL path you want to block...

    ----------------------------------------
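
    As a rough sketch of how these two directives combine: each record starts with a User-agent line followed by one or more Disallow lines, and lines beginning with "#" are treated as comments (the directory name here is just a made-up example).
    ---Copy Source Code---
    Code:
    # Keep every robot out of a (hypothetical) private directory
    User-agent: *
    Disallow: /example-private-folder/
    
    ---End Source Code---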

    1.] To block all robots from crawling anything on your website, use the following code.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /
    
    ---End Source Code---

    2.] To block a directory and everything in it, use the following code.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /random-directory-one/
    Disallow: /random-directory-one/random-directory-two/
    
    ---End Source Code---

    3.] To block individual pages or files, just list each one you want to block.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /private_file.html
    Disallow: /random-directory-one/style.css
    
    ---End Source Code---

    4.] To remove specific images from Google Image Search, add the following code.
    ---Copy Source Code---
    Code:
    User-agent: Googlebot-Image
    Disallow: /image1.gif
    Disallow: /random-directory-one/image2.png
    
    ---End Source Code---

    5.] To remove all of the images on your site (assuming they are all kept in one folder), block that folder for every robot.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /image_folder/
    
    ---End Source Code---

    6.] To block files with a specific extension, just use this example.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /*.gif$
    Disallow: /*.jpeg$
    Disallow: /image_folder/*.png$
    Disallow: /image_folder/*.jpeg$
    
    ---End Source Code---

    7.] To block a folder for all robots while still allowing Google to crawl it, you'll need to use this example (the Allow directive is an extension supported by Google)...
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /folder1/
    
    User-agent: Googlebot
    Allow: /folder1/
    
    ---End Source Code---

    8.] To match a sequence of characters, use an asterisk (*). For example, to block access to all subdirectories whose names begin with "file_directories":
    ---Copy Source Code---
    Code:
    User-agent: Googlebot
    Disallow: /file_directories*/
    
    ---End Source Code---

    9.] To match the end of a URL, use the $ symbol. For instance, to block any URLs that end with .zip...
    ---Copy Source Code---
    Code:
    User-agent: Googlebot 
    Disallow: /*.zip$
    
    ---End Source Code---

    10.] You can target multiple robots in "robots.txt". For instance, you may want to block all search engines while allowing only Google to crawl your website, except for the "cgi-bin" and "privatedir" directories.
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /
    
    User-agent: Googlebot
    Disallow: /cgi-bin/
    Disallow: /privatedir/
    
    ---End Source Code---

    11.] To block multiple extensions, you can use this example...
    ---Copy Source Code---
    Code:
    User-agent: *
    Disallow: /*.xls$
    Disallow: /*.gif$
    Disallow: /*.jpg$
    Disallow: /*.jpeg$
    Disallow: /*.pdf$
    Disallow: /*.rar$
    Disallow: /*.zip$
    
    ---End Source Code---
    Copyright 2008 ~Lair360
     
  2. dbojan

    dbojan New Member

    Messages:
    99
    Likes Received:
    1
    Trophy Points:
    0
    Thank you lair360. This is a great tutorial.
     
  3. lair360

    lair360 New Member

    Messages:
    200
    Likes Received:
    0
    Trophy Points:
    0
    Thank you very much for your generous support!

    Best regards,
    Lair360
     
  4. DarkDragonLord

    DarkDragonLord New Member

    Messages:
    782
    Likes Received:
    0
    Trophy Points:
    0
    Great tutorial!

    +rep
     
    Last edited: Jan 21, 2009
  5. lair360

    lair360 New Member

    Messages:
    200
    Likes Received:
    0
    Trophy Points:
    0
    Thank you very much for your support!
    You have given me a key to break the boundaries between knowledge and support!
     
  6. zer0ne1337

    zer0ne1337 New Member

    Messages:
    84
    Likes Received:
    0
    Trophy Points:
    0
    Thanks lair360, it is a very helpful tutorial! :)
     
  7. RRJJMM

    RRJJMM New Member

    Messages:
    41
    Likes Received:
    0
    Trophy Points:
    0
    Thanks for the information. This is good stuff for us control freaks that like to "pull the shades" every now and then.

    Cheers,
     
  8. lair360

    lair360 New Member

    Messages:
    200
    Likes Received:
    0
    Trophy Points:
    0
    Thank you very much for your feedback! However, "robots.txt" is very powerful, so you'll have to be very careful with the rules you give the robots...
     
