Without fanfare or an official announcement, ChatGPT maker OpenAI launched a new web crawler this week to scan website content for training its large language models (LLMs). But after news of the bot broke, a revolt ensued, as website owners and creators quickly traded tips on how to block GPTBot from scraping their sites' data.
When OpenAI added the GPTBot support page, it also introduced a way to block the bot from scraping a website: a small addition to the site's robots.txt file tells the crawler to stay out. However, given how extensively the web is scraped by other parties, it's unclear whether simply blocking GPTBot will keep content out of LLM training data entirely.
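According to OpenAI's support page, the blanket version of that rule comes down to two lines; site owners who want to keep some pages open to the crawler can scope the Disallow directive to narrower paths instead.

# Block OpenAI's GPTBot crawler from the entire site
User-agent: GPTBot
Disallow: /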
“We periodically collect public data from the internet which may be used to improve the capabilities, accuracy, and safety of future models,” an OpenAI spokesperson said in an email. “On our website, we provide instructions on how to disallow our collection bot from accessing a site. Web pages are filtered to remove sources that have paywalls, are known to gather personally identifiable information (PII), or have text that violates our policies.”
Websites raise their defenses
Web outlets like The Verge have already added the robots.txt flag to stop the OpenAI crawler from grabbing content for its LLMs. Casey Newton has asked readers of his Substack newsletter, Platformer, whether he should stop OpenAI from collecting his content. Neil Clarke, editor of the sci-fi magazine Clarkesworld, announced on X (formerly known as Twitter) that the magazine would block GPTBot.
Shortly after GPTBot’s launch became public, OpenAI announced a $395,000 grant and partnership with New York University’s Arthur L. Carter Journalism Institute. Led by former Reuters editor-in-chief Stephen Adler, NYU’s Ethics and Journalism Initiative aims to aid students in developing responsible ways to leverage AI in the news business.
“We are excited about the potential of the new Ethics and Journalism Initiative and very pleased to support its goal of addressing a broad array of challenges journalists face when striving to practice their profession ethically and responsibly, especially those related to the implementation of AI,” said Tom Rubin, OpenAI’s chief of intellectual property and content, in a release on Tuesday.
Rubin did not mention public web scraping—nor the controversy surrounding it—in the release.
What’s ‘known’ can’t really be forgotten
While a little more control over who gets to use content on the open web is handy, it's still unclear how effective blocking GPTBot would be at stopping LLMs from gobbling up content that isn't locked behind a paywall. LLMs and other generative AI platforms have already drawn on massive collections of public data to train the models they currently deploy.
Google's Colossal Clean Crawled Corpus (C4) dataset and the nonprofit Common Crawl are well-known collections of training data. If your data or content was captured in those scraping efforts, experts say it's likely a permanent part of the training data behind OpenAI's ChatGPT, Google's Bard and Meta's LLaMA. Services like Common Crawl do honor similar robots.txt blocks, but website owners would have needed to put those rules in place before any data was collected.
VentureBeat was no exception: its content appears in the C4 training data and is available through the Common Crawl datasets as well.
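Common Crawl says its crawler, CCBot, respects an analogous robots.txt rule, sketched below, but adding it only prevents future crawls; it does not remove pages already sitting in the archive.

# Block Common Crawl's CCBot crawler going forward
User-agent: CCBot
Disallow: /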
Questions of web scraping fairness remain before courts
Last year, the U.S. Ninth Circuit Court of Appeals reaffirmed that scraping publicly accessible data is legal and does not contravene the Computer Fraud and Abuse Act (CFAA).
Despite this, data scraping practices in the name of training AI have come under attack on several fronts over the past year. In July, OpenAI was hit with two lawsuits. One, filed in federal court in San Francisco, alleges that OpenAI unlawfully copied book text without obtaining consent from copyright holders or offering them credit or compensation. The other claims ChatGPT and DALL-E collect people's personal data from across the internet in violation of privacy laws.
Further lawsuits have been filed by comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey, alleging that OpenAI and Meta trained their LLMs on the authors' published works without consent. X and Reddit have also made news around data scraping, with both moving to protect their datasets by limiting access to them. In an effort to curb AI data scraping, X temporarily prevented users who were not logged in from viewing tweets and imposed rate limits on how many tweets could be viewed. Reddit, meanwhile, found itself in a public fight with moderators and third-party app developers who were caught in the crossfire when it began charging steep prices for API access in a bid to fend off web scraping of its content.