Quick Answer
A robots.txt file is a text file at a website's root URL that instructs search engine crawlers which URLs they are allowed or disallowed to crawl. It controls crawler access at the site level but does not control indexing — pages blocked by robots.txt can still appear in search results if linked from other sites.
How robots.txt Works
A robots.txt file is placed at the root of a domain (e.g., yourdomain.com/robots.txt) and uses the Robots Exclusion Standard to tell crawlers which parts of the site they are and aren't allowed to crawl. Google's Googlebot, Bing's Bingbot, and other major crawlers check robots.txt before crawling any URL from a domain.
Why robots.txt Matters for B2B Marketing
The basic syntax uses `User-agent:` (specifying which crawler the following rules apply to, with `*` for all crawlers), `Disallow:` (paths to block), and `Allow:` (paths to permit, used to override a broader Disallow). You can also reference your XML sitemap URL with a `Sitemap:` line, which is a common best practice.
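Put together, a minimal robots.txt might look like the sketch below; the paths and sitemap URL are illustrative placeholders, not recommendations for any specific site:

```
# Rules for all crawlers
User-agent: *
# Block a private directory from crawling
Disallow: /private/
# Override the broader Disallow for one file that should stay crawlable
Allow: /private/whitepaper.pdf

# Point crawlers at the XML sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```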
robots.txt: Best Practices & Strategic Application
Robots.txt is widely misunderstood in a critical way: it controls crawling, not indexing. Blocking a URL in robots.txt prevents Google from crawling it, but if other sites link to that URL, Google can still index it (without seeing its content — just knowing the URL exists). To keep a page out of the index, add a noindex meta tag to the page itself and leave it crawlable, because Google must crawl the page to see the tag; blocking it in robots.txt at the same time means the noindex directive is never read. Once the page has dropped out of the index, you can add a robots.txt block if you also want to stop crawling it.
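For reference, the noindex directive itself lives in the page's HTML head (or, for non-HTML resources such as PDFs, in an `X-Robots-Tag: noindex` HTTP response header):

```html
<!-- In the <head> of the page you want removed from the index -->
<meta name="robots" content="noindex">
```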
Agency Perspective: robots.txt in Practice
The most catastrophic robots.txt error is accidentally blocking the entire site — typically by leaving a blanket `Disallow: /` in place under `User-agent: *`, so the block applies to every crawler. This mistake has caused major sites to disappear from Google within days. Other common errors include: blocking CSS and JavaScript files needed for rendering (which prevents Google from seeing your content as users do), blocking image directories, and forgetting to update robots.txt after a site migration that changes URL structures.
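The gap between a scoped block and a site-wide block is a single character; the two alternative files sketched below illustrate it (the directory name is a placeholder):

```
# Intended: block only the staging area
User-agent: *
Disallow: /staging/

# Shipped by mistake: a bare slash blocks every URL for all crawlers
User-agent: *
Disallow: /
```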
Frequently Asked Questions: robots.txt
What is a robots.txt file?
A robots.txt file is a text file at a website's root URL that instructs search engine crawlers which URLs they are allowed or disallowed to crawl. It controls crawler access at the site level but does not control indexing — pages blocked by robots.txt can still appear in search results if linked from other sites.
Can I use robots.txt to keep low-quality pages out of Google's index?
No — use noindex meta tags for that purpose. Blocking pages via robots.txt prevents Google from reading the noindex tag, which can lead to confusing situations where pages are blocked from crawling but still indexed. More importantly, blocking low-quality pages via robots.txt doesn't prevent them from being indexed if they have external links. The correct approach is: noindex for pages you want de-indexed, robots.txt for resources you want to save crawl budget by not processing (images, internal search results, admin areas).
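Under that division of labor, the robots.txt side of the split might look like this sketch (paths are illustrative), while the noindex tags stay on the low-quality pages themselves:

```
User-agent: *
# Crawl-budget savers: resources that don't need crawling at all
Disallow: /search-results/
Disallow: /admin/
Disallow: /images/thumbnails/
```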
How do I check whether robots.txt is blocking a page?
Google Search Console includes a robots.txt Tester in the Legacy Tools section, which lets you test any URL against your live robots.txt to see whether Googlebot would be allowed to crawl it. You can also manually inspect the file at yourdomain.com/robots.txt and trace any Disallow patterns against your URL structure. After making changes, resubmit your sitemap and monitor Search Console's Coverage report for unexpected blocked pages.
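As a quick local sanity check alongside Search Console, Python's standard-library urllib.robotparser can evaluate URLs against a live robots.txt. It implements the classic Robots Exclusion Standard, so its handling of wildcard patterns may not match Googlebot's extensions exactly; treat it as a rough check rather than a replacement for Google's own tester. The domain and paths below are placeholders:

```python
import urllib.robotparser

# Fetch and parse the live robots.txt file
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()

# Ask whether a given crawler may fetch a given URL
for url in ["https://yourdomain.com/", "https://yourdomain.com/admin/settings"]:
    allowed = rp.can_fetch("Googlebot", url)
    print(f"{url} -> {'allowed' if allowed else 'blocked'} for Googlebot")
```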
MV3 Marketing helps B2B companies apply these strategies to drive measurable pipeline growth. Our team conducts technical SEO audits for technology, SaaS, and professional services companies.