Robots.txt is a text file that tells bots which pages on your website to crawl and which pages not to crawl.
Crawling is the process in which bots visit numerous websites, finding new and updated information to report back to search engines. The bots find what to crawl by following links.
Creating a robots.txt file is extremely easy. It’s just a plain text (ASCII) file that you place at the root of your domain. For example, if your domain is www.harshithdas.com, place the file at www.harshithdas.com/robots.txt. If you use Windows, you already have an ASCII text editor on your system, called Notepad.
You don’t need any technical knowledge to leverage the power of robots.txt. If you can access the source files of your website, you can easily use it.
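Since the file always sits at the root of the domain, its address can be worked out from any page on the site. Here is a minimal Python sketch of that idea; the function name robots_txt_url, the https scheme and the sample page path are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used in this post.
print(robots_txt_url("https://www.harshithdas.com/blog/some-post?ref=home"))
# https://www.harshithdas.com/robots.txt
```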
What does the robots.txt file contain?
The file names one or more spiders/bots on a User-agent line, followed by the directories/files that you don’t want those spiders to access, with each directory/file on its own Disallow line.
1. If you want to address all spiders/bots, type the asterisk symbol * beside User-agent. All spiders are then assumed to be named. List the directories/files that you don’t want spiders to access beside Disallow. Take the following robots.txt file, for example:
User-agent: *
Disallow: /cgi-bin/
The above two lines inform all robots (because of the asterisk symbol *) that they are not allowed to access anything in the “cgi-bin” directory and its descendants. That is, they are not allowed to access cgi-bin/whatever.cgi or even a file or script in a subdirectory of cgi-bin.
2. If there is a specific spider/bot that you don’t want crawling your entire website, such as the Baidu spider, you can include lines like these:
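If you would like to check this behaviour yourself, Python ships a small robots.txt parser in its standard library. The sketch below feeds it the two lines from the example and asks whether a bot may fetch a few addresses. The bot name ExampleBot and the page about.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt file from the example.
rules = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "https://www.harshithdas.com"
# "ExampleBot" is a hypothetical crawler name; the * line covers every bot.
blocked = parser.can_fetch("ExampleBot", site + "/cgi-bin/whatever.cgi")
nested = parser.can_fetch("ExampleBot", site + "/cgi-bin/sub/script.cgi")
allowed = parser.can_fetch("ExampleBot", site + "/about.html")

print(blocked, nested, allowed)  # False False True
```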
User-agent: Baiduspider
Disallow: /
This means that the search bot, “Baiduspider”, should not try to access any file in the root directory “/” and all its sub-directories. This effectively means that it is banned from getting any file from your entire website.
3. You can have multiple Disallow lines for each user agent (i.e., for each spider). Here is an example of a longer robots.txt file:
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Baiduspider
Disallow: /
The first block keeps spiders out of the images directory and the cgi-bin directory. The second block keeps the Baidu spider out of every directory.
4. It is possible to keep a spider away from a particular file. For example, if you don’t want the Baidu search robot to fetch a particular picture, say “mybike.jpg”, you can add the following:
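The same standard-library parser can show how the two blocks apply to different bots. This is only a sketch: ExampleBot is a hypothetical name standing in for any spider other than Baiduspider, and the file names are invented.

```python
from urllib.robotparser import RobotFileParser

# The longer robots.txt file from the example, one directive per line.
rules = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "https://www.harshithdas.com"
# Baiduspider matches its own block, which bans the whole site.
baidu_home = parser.can_fetch("Baiduspider", site + "/index.html")
# Any other bot falls back to the * block.
other_home = parser.can_fetch("ExampleBot", site + "/index.html")
other_image = parser.can_fetch("ExampleBot", site + "/images/logo.png")

print(baidu_home, other_home, other_image)  # False True False
```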
User-agent: Baiduspider
Disallow: /images/mybike.jpg
5. If you want to unblock a file or subdirectory inside a blocked parent directory, you can add an extra Allow line, as in this example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Here the rules apply to all bots, because the asterisk symbol * is used beside User-agent. None of these bots can access files in the “wp-admin” directory, except for “wp-admin/admin-ajax.php”.
Once you know which of the above examples suits your website, save the file under the name robots.txt (remember to use Notepad or another ASCII text editor). You must place the file in the highest-level directory of your site, that is, the root of your domain!
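Why does the Allow line win? Under the Robots Exclusion Protocol (RFC 9309), when several rules match a path the most specific one, meaning the longest, is used, and Allow wins a tie. The sketch below is a simplified illustration of that rule and not a full parser: it ignores wildcards, and the helper name is_allowed is made up. As far as I can tell, Python’s built-in urllib.robotparser checks rules in file order instead, so it is not used here.

```python
def is_allowed(path, rules):
    """Apply the longest-match rule to a list of (directive, prefix) pairs."""
    best_length = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # this rule does not apply to the path
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow is preferred.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# The rules from the example above, in file order.
wp_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/admin-ajax.php", wp_rules))  # True
print(is_allowed("/wp-admin/options.php", wp_rules))     # False
print(is_allowed("/blog/", wp_rules))                    # True
```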
Now it’s time to test your newly created robots.txt file, so open it in a robots.txt tester. Once you have fixed any warnings and errors it reports, submit the updated robots.txt file to ask Google to crawl and index the new file for your site more quickly.
To submit the file, click the Submit button in the bottom-right corner of the robots.txt editor page. This opens a dialog box with the following steps:
- Download your updated robots.txt code from the robots.txt tester page by clicking the Download option.
- Upload the updated robots.txt file to the root of your domain as a text file named robots.txt, then view the uploaded version to make sure it is the same.
- Click Submit to notify Google that your robots.txt file has been updated and to request that Google crawl it.
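For the “make sure it’s the same” step, a quick comparison can save you from squinting at two windows. The sketch below assumes you already have both versions as text (for instance by copying the uploaded one from your browser). The helper names normalize and same_robots are made up, and the comparison deliberately ignores harmless differences such as Windows line endings or a trailing blank line.

```python
def normalize(text):
    """Split into lines, drop trailing spaces and blank lines at either end."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and not lines[-1]:
        lines.pop()
    while lines and not lines[0]:
        lines.pop(0)
    return lines


def same_robots(local_text, uploaded_text):
    """Return True when both versions contain the same rules line for line."""
    return normalize(local_text) == normalize(uploaded_text)


# A file saved in Notepad (CRLF line endings) against the served copy (LF).
local = "User-agent: *\r\nDisallow: /cgi-bin/\r\n"
uploaded = "User-agent: *\nDisallow: /cgi-bin/"
print(same_robots(local, uploaded))  # True
```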
It doesn’t take much effort to set up your robots.txt file, and it takes less than a minute. It’s mostly a one-time setup, and you can make small changes as needed.
All the best folks 🙂
25/07/2017 – Harshith Das