How to Block ChatGPT from Crawling and Using Your Content

OpenAI, the organization behind ChatGPT, recently published details about its web crawler, GPTBot.  It did so quietly on its platform documentation site, with no formal announcement.

According to the documentation, the bot “may potentially be used to improve future models,” meaning that a future version of ChatGPT or DALL-E may be trained on newly collected data.  The current ChatGPT iteration, GPT-4, uses training data that was collected up to September 2021, so its data is a few years out of date.

Why Block Content?

There are many web crawlers out there already, such as Google’s and Bing’s, which index the content of a site so it can be returned in user searches on their respective platforms.  In turn, those search results link back to the indexed content directly (hence the whole concept of SEO).  AI tools like ChatGPT instead provide their own generated content, which is based on content the AI model has consumed and processed.  This means that the resulting content may closely resemble your copyrighted material while offering no direct attribution or link back to the original source.  Even when it does cite sources, it may simply make them up.

One scenario I can envision is having a number of AEM Guides published on your site containing technical information about your brand’s products.  The site is successfully engaging past customers with the technical information they are seeking, and converting a good percentage of them to new products they end up purchasing.  GPTBot comes through and crawls your site, consumes your brand’s content, and then a future iteration of ChatGPT provides users with your technical information directly, with no attribution back to your documentation.  Additionally, the AI-generated content may be incorrect, leading not only to a lack of engagement with your customer base, but also to mistrust of your products among past customers.

Part of this scenario is playing out right now: the FTC has already launched an investigation into OpenAI over ChatGPT providing false information.

Blocking Content Strategies

Thankfully, there are a few different ways to block GPTBot from crawling your site (or at least, that is what the documentation currently says).

According to the documentation, crawled pages are automatically filtered to “remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”  That is nice to know, but a little difficult to fully trust because the bot can really do whatever it wants.  A better solution is to explicitly tell GPTBot to skip your content, which is easy to do in your robots.txt file once you know the user-agent GPTBot uses.

Conveniently, that user-agent is “GPTBot.”  The full user-agent string is:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
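
If you want to see whether GPTBot is already visiting your site, one option is to look for that token in the User-Agent header of incoming requests or in your access logs.  Here is a minimal sketch in Python; the function name is_gptbot is hypothetical, and because a User-Agent header can be spoofed, this only identifies requests that claim to be GPTBot.

```python
def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be GPTBot."""
    # Case-insensitive substring match on the crawler token.
    return "gptbot" in user_agent.lower()


# The full user-agent string quoted above.
gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

print(is_gptbot(gptbot_ua))                                  # True
print(is_gptbot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```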

To block your site as a whole, simply update your robots.txt file to include the following:

User-agent: GPTBot
Disallow: /
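
Before deploying the rule, you may want to sanity-check it locally.  The sketch below uses Python's standard-library urllib.robotparser to parse the two lines above and ask whether a given crawler may fetch a URL.  The example.com URL and the helper name may_crawl are placeholders, and the result only shows how a standards-following parser reads the file, not how GPTBot itself will behave.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` is permitted to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Pass the short token ("GPTBot"), not the full user-agent string.
print(may_crawl(ROBOTS_TXT, "GPTBot", "https://example.com/any/page"))     # False
print(may_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/any/page"))  # True
```

Note that other crawlers are unaffected, since the rule names GPTBot only.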

If, by chance, you wanted to allow some content, but block other content, you could do something like this:

User-agent: GPTBot
Allow: /my-allowed-dir/
Disallow: /my-blocked-dir/
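
One detail worth knowing about this partial setup is that any path matched by neither rule is crawlable by default.  The following sketch, again using Python's urllib.robotparser with a placeholder example.com host and made-up page paths, illustrates how a standards-following parser would read these three lines.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Allow: /my-allowed-dir/
Disallow: /my-blocked-dir/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths: one per rule, plus one that no rule mentions.
paths = ("/my-allowed-dir/page.html", "/my-blocked-dir/page.html", "/somewhere-else/")
results = {p: parser.can_fetch("GPTBot", "https://example.com" + p) for p in paths}

for path, allowed in results.items():
    print(path, "->", "allowed" if allowed else "blocked")
```

If the intent is to allow only one directory and block everything else, the Disallow line would need to cover the rest of the site (for example, Disallow: /).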

Finally, if you wanted to be extremely aggressive, you could block the IP address ranges that GPTBot uses.  However, IP addresses can be changed and reassigned, so this might not be a perfect solution.
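
In practice this kind of block usually lives in a firewall or web server configuration, but the underlying check is simple enough to sketch in Python with the standard-library ipaddress module.  The ranges below are reserved documentation ranges used purely as placeholders, not GPTBot's real addresses, and the name is_blocked is hypothetical; you would substitute whatever ranges OpenAI currently publishes.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real GPTBot ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if `client_ip` falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to check this way.
    return any(address in network for network in BLOCKED_RANGES)


print(is_blocked("192.0.2.44"))    # True
print(is_blocked("198.51.100.7"))  # False
```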

In conclusion, if you want to ensure that your content remains solely on your site, then blocking OpenAI’s GPTBot from crawling it is a must.  Here’s hoping that the bot follows the robots.txt directives.
