Robots.txt

What is a robots.txt file?

Robots.txt is a small text file placed at the root of a website that webmasters create to instruct web robots (typically search engine crawlers) which pages on the site they may or may not crawl. It can also be submitted manually to search engines to ensure the latest version is picked up regardless of the “crawl cycle.” The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).

In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
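
For example, a minimal robots.txt file might look like the following sketch (the directory and page names are placeholders, and the “#” character starts a comment):

  User-agent: *            # the rules in this group apply to all crawlers
  Disallow: /private/      # do not crawl anything under /private/

  User-agent: Googlebot    # a more specific group that applies only to Googlebot
  Disallow: /private/
  Allow: /private/public-page.html   # exception: this one page may still be crawled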

How does robots.txt work?

Search engines have two main jobs:

  1. Crawling the web to discover content;
  2. Indexing that content so that it can be served up to searchers who are looking for information.

To crawl sites, search engines follow links to get from one site to another — ultimately, crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.”

After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler reads that file before crawling anything else on the site. Because the robots.txt file contains instructions about how the search engine should crawl, the information found there directs the crawler’s further actions on that site. If the robots.txt file does not contain any directives that disallow a user agent’s activity (or if the site doesn’t have a robots.txt file at all), the crawler will proceed to crawl the rest of the site.
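
As a concrete illustration of this check, the short Python sketch below uses the standard-library urllib.robotparser module to fetch a site’s robots.txt and ask whether a given user agent may crawl a URL (the domain and crawler name are placeholders):

  import urllib.robotparser

  # Fetch and parse the site's robots.txt (the URL is a placeholder).
  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://www.example.com/robots.txt")
  rp.read()

  # Ask whether a hypothetical crawler may fetch a specific page.
  url = "https://www.example.com/private/report.html"
  if rp.can_fetch("ExampleBot", url):
      print("robots.txt allows crawling", url)
  else:
      print("robots.txt disallows crawling", url)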

Why do you need robots.txt?

Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.
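
To illustrate how easy that mistake is to make, the difference between blocking the whole site and allowing all of it comes down to a single character: “Disallow: /” blocks every compliant crawler from everything, while an empty “Disallow:” value permits everything.

  # Blocks all compliant crawlers from the entire site
  User-agent: *
  Disallow: /

  # Allows all compliant crawlers to crawl the entire site
  User-agent: *
  Disallow: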

Some common use cases include:

  • Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
  • Keeping entire sections of a website private (for instance, your engineering team’s staging site)
  • Keeping internal search results pages from showing up on a public SERP
  • Specifying the location of sitemap(s)
  • Preventing search engines from crawling certain files on your website (images, PDFs, etc.)
  • Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once

If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.
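
Putting several of the use cases above together, a robots.txt file might look like the sketch below. The directory names and sitemap URL are placeholders; note that wildcard patterns and the Crawl-delay directive are honored by some crawlers but not all (Googlebot, for example, ignores Crawl-delay):

  User-agent: *
  Disallow: /staging/     # keep the staging section private
  Disallow: /search/      # keep internal search results pages out of crawls
  Disallow: /*.pdf$       # keep PDF files from being crawled (wildcard syntax)
  Crawl-delay: 10         # ask crawlers to wait 10 seconds between requests

  Sitemap: https://www.example.com/sitemap.xml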
