There may be pages on your website that you'd rather Google's search robots not access. For example, you may not want Google ranking the page people see after completing your contact form. You may also want to prevent Google and Bing from indexing files that require a password to access.
In these situations, you need a way to keep Google and Bing away from these pages. There are two behaviors you'll want to control: crawling (a robot visiting and reading a page) and indexing (a page being included in search results).
The "robots.txt" file is a plain text file placed in the root directory of your website. It provides directives to robots telling them which directories you would prefer they not crawl. It is important to note that following these directives is voluntary. Robots from Google, Bing, and other reputable companies honor them even though nothing requires them to.
The robots.txt file directives provide information about how you want bots to access your entire website. You can allow or prevent access to a particular page or an entire directory of pages. Within the robots.txt file, you can provide directions for specific robots (for example, allow Google to access something but block Bing).
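As a sketch of per-robot directives, the file below would block Bing's robot (bingbot) from one directory while leaving Google's robot (Googlebot) unrestricted; the directory name here is hypothetical:

```
# Block Bing's robot from a hypothetical directory
User-agent: bingbot
Disallow: /example-directory/

# Allow Google's robot to crawl everything
User-agent: Googlebot
Disallow:
```

An empty Disallow line means nothing is blocked for that robot.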
The problem with blocking pages using the robots.txt file is that it only prevents robots, like those from Google and Bing, from crawling the page, not from indexing the page. That is, even if you prevented the robots from crawling the page via the robots.txt file, the page could still be indexed and appear in a search result.
For example, Google and Bing may find a link to this page on your website or elsewhere on the web. They couldn't crawl the page to see its contents, but they would know the page exists and could still include it in their indexes.
Given that the robots.txt file only controls crawling and not indexing, this is not a preferred means of keeping content out of Google's search index. However, if you do want to block Google from crawling (but not indexing) a part of your website, you could add the directory or file you wish to block to your robots.txt file.
If you do wish to use the robots.txt file, here is an example. This robots.txt file instructs Google to avoid the "my-content-admin-area" directory.
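Assuming the directory sits at the root of the site, the file would look like this:

```
User-agent: Googlebot
Disallow: /my-content-admin-area/
```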
Another method of preventing Google from accessing a page on your website is the robots meta tag. This is preferable to the robots.txt file because the tag is localized to a particular page of the website and can control two robot behaviors. First, it can allow or disallow that particular page from being included in Google's index. Second, it can allow or disallow a robot from crawling the links on that particular page.
The <meta> robots tag is located in the head section of any given page, alongside the title tag and the <meta> description tag. In this example, the meta robots tag tells every robot not to index this particular page but to follow the links found on it:
<meta name="robots" content="noindex,follow" />
You could instead prevent a search robot from indexing the page and from crawling the links it contains:
<meta name="robots" content="noindex,nofollow" />
Although not as well known, there is also a way to indicate within the robots.txt file that you don't want a page included in Google's index. Within the robots.txt file, you can specify a "Noindex" directive, which works similarly to the meta noindex tag. Here is an example of a robots.txt file that instructs Google not to index the "confirmation-page.html" page.
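Assuming the page sits at the root of the site, such a file might look like this (keep in mind that Noindex was never an official part of the robots.txt standard, so support is not guaranteed):

```
User-agent: Googlebot
Noindex: /confirmation-page.html
```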
The other, related, robots control is the rel="nofollow" attribute that can be added to a link. This attribute is specific to a particular link, not the entire page or the entire website. Adding rel="nofollow" to an <a> tag (the link tag) instructs Google's robot not to follow (crawl) that particular link and also not to count that link toward the rankings of the page being linked to.
In general, this is useful when you want to prevent a robot from crawling a specific link on your website and from giving any weight to that link. It is especially helpful for links added by your visitors, such as comments on a blog or forum, as those links can sometimes contain spam.
The rel="nofollow" attribute can be added to any link. For example, this link would not be followed by a robot:
<a href="/link-to-my-page.html" rel="nofollow">My page</a>
The robots.txt file, any page with the robots meta tag, and any link with a rel="nofollow" attribute are all publicly accessible, and all of these controls can be ignored by a robot. If you are trying to keep pages or files that are already publicly viewable from being crawled or indexed, because those pages would add no value if indexed, then using the robots controls makes sense. If, however, you are trying to hide a directory or you want to prevent access to a portion of your website, you need to use something other than robots controls (like password protecting a directory).
In Google Search Console, you can check your current robots.txt file to see which pages, if any, are currently listed as pages you do not want Google to access. This tool is under the Crawl menu and is called "robots.txt Tester".
You can enter URLs and test whether Google would be prevented from crawling them because of the robots.txt file.
Want help improving your website’s technical SEO factors? Contact us today to discuss how we can help review and improve your current technical structure.