Controlling Search Robots

What It Is

There may be pages on your website that you'd rather Google's search robots not access. For example, you may not want Google ranking the page people see after completing your contact form. You may also want to prevent Google and Bing from indexing files that people need a password to access.

In these situations, you need a way to keep Google and Bing away from these pages or files. There are actually two behaviors you'll want to control:

  1. Crawling: Do you want Google's and Bing's robots to access (crawl) the page or file and see its contents?
  2. Indexing: Do you want the robots to include the page in their index, allowing it to appear in search results?

Methods Of Blocking Robots: robots.txt

The "robots.txt" file is a plain text file placed in the root directory of your website. It provides information to robots telling them which directories they are not allowed to crawl.

The robots.txt file controls access to the entire website, allowing you to prevent access to a particular page or an entire directory. Within the robots.txt file, you can control the behavior of specific robots (for example, allow Google to access something but block Bing).

The problem with blocking pages using the robots.txt file is that it only prevents Google and Bing robots from crawling the page, not from indexing the page. That is, even if you prevented the robots from crawling the page via the robots.txt file, the page could still be indexed and appear in a search result.

For example, Google and Bing may find a link to the blocked page on your website or elsewhere on the web. They couldn't crawl the page to see its contents, but they would know the page exists and could still include it in their indexes and show it in search results.

Given that the robots.txt file only controls crawling and not indexing, this is not a preferred means of keeping content out of Google's search index. However, if you do want to block Google from crawling (but not indexing) a part of your website, you could add the directory or file you wish to block to your robots.txt file.

A common problem when using the robots.txt file is inadvertently blocking your JavaScript or CSS files. These files control how the page looks and behaves, and Google relies on some of these design factors to decide where a page ranks. If Google is unable to crawl the CSS or JavaScript files on your website, this may affect how your pages rank.
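As a sketch only, assuming your CSS and JavaScript files live in a hypothetical "/assets/" directory, both Google and Bing support an Allow directive that lets you block a directory while still letting their robots crawl the design files inside it:

User-agent: *
Disallow: /assets/
# The more specific Allow rules keep these files crawlable
Allow: /assets/site.css
Allow: /assets/site.js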

If you do wish to use the robots.txt file, here is a simple example that instructs Google to avoid the "my-content-admin-area" directory:

User-agent: Googlebot
Disallow: /my-content-admin-area/
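Expanding on the earlier point about controlling specific robots, here is a sketch (using a hypothetical "/reports/" directory) that blocks Bing's robot from that directory while leaving Google unrestricted:

# Bing's robot may not crawl the reports directory
User-agent: Bingbot
Disallow: /reports/

# An empty Disallow means nothing is blocked for Google's robot
User-agent: Googlebot
Disallow: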

Methods Of Blocking Robots: Meta NoIndex, Nofollow

Another method of controlling how Google treats a page on your website is the robots meta tag. This is preferable to the robots.txt file because the tag is localized to a particular page of the website and can control two robot behaviors. First, it can allow or prevent that particular page from being included in Google's index. Second, it can allow or prevent a robot from crawling the links on that particular page.

The <meta> robots tag is located in the head area of any given page, similar to the title tag and the <meta> description tag. In this example, the meta robots tag tells every robot to not index this particular page but to follow links found on this page:

<html>
<head>
...
<meta name="robots" content="noindex,follow" />
</head>
<body>
...
</body>
</html>

You could instead prevent a search robot from indexing the page and from crawling the links contained on it:

<html>
<head>
...
<meta name="robots" content="noindex,nofollow" />
</head>
<body>
...
</body>
</html>
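The meta robots tag can also be aimed at a single search engine by using that robot's name in place of "robots". As a sketch, this variant asks only Google's robot to keep the page out of its index, leaving other search engines unaffected:

<html>
<head>
...
<meta name="googlebot" content="noindex" />
</head>
<body>
...
</body>
</html>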

Methods Of Blocking Robots: Nofollow Links

The other, related, robots control is the rel="nofollow" attribute that can be added to a link. This attribute is specific to a particular link, not the entire page or the entire website. Adding rel="nofollow" to an <a> tag (the link tag) instructs Google's robot not to follow (crawl) that particular link and not to count the link toward the ranking of the page being linked to.

In general, this is useful when you want to prevent a robot from crawling a specific link on your website or giving any weight to that link. It is especially helpful for links added by your visitors, like comments in a blog or forum.

The rel="nofollow" attribute can be added to any link. For example, this link would not be followed by a robot:

<a href="/link-to-my-page.html" rel="nofollow">My page</a>
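As a further sketch tied to the blog comment scenario above (the URL is just a placeholder), a link left by a visitor could be marked the same way so robots give it no weight:

<a href="http://www.example.com/" rel="nofollow">A commenter's website</a>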

Important To Remember

The robots.txt file, any page with the robots meta tag, and any link with a rel="nofollow" attribute are all publicly accessible, and robots are free to ignore all of these directives. If you are trying to keep pages or files that are already publicly viewable from being crawled or indexed, because those pages would add no value if indexed, then using the robots controls makes sense. If, however, you are trying to hide a directory or prevent access to a portion of your website, you need something other than robots controls (like password protecting the directory).
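As a hedged sketch only, assuming an Apache web server: password protecting a directory can be done with an .htaccess file placed inside it, where the .htpasswd path below is a placeholder you would replace with the real location of your password file:

# .htaccess in the directory you want to protect (Apache example)
AuthType Basic
AuthName "Restricted Area"
# Placeholder path - point this at your actual .htpasswd file
AuthUserFile /full/server/path/.htpasswd
Require valid-user

Unlike the robots controls, visitors and robots alike must supply a username and password before the server will return anything from that directory.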

Testing robots.txt

In Google Search Console, you can check your current robots.txt file to see which pages, if any, are currently listed as ones you do not want Google to access. The tool is under the Crawl menu and is called "robots.txt Tester".

Google Search Console Robots.txt Tester

You can enter URLs and test them to see whether Google would be prevented from crawling those pages due to the robots.txt file.

Google Search Console - Test robots.txt file

Resources