News Publishers Block Internet Archive to Curb AI Training on Archived Content

Some 241 news organisations spanning nine countries are actively blocking the Internet Archive's automated crawlers, the software bots that capture and preserve web pages for the Wayback Machine. This move, driven by concerns over unauthorised use of archived content by artificial intelligence companies, marks a significant escalation in the ongoing tension between publishers and tech firms.

The Internet Archive, which stores over one trillion web pages dating back to 1996, is one of the world's largest public repositories of digital information. Its collection includes past articles from major outlets such as CNN, The New York Times, The Guardian, and USA Today. These archives serve as primary sources for historians, journalists, and researchers, and are used to track changes made to articles after publication.

Why Publishers Are Blocking Access

Publishers argue that AI companies are exploiting the Archive's content to train large language models (LLMs) without fair compensation or permission. According to an analysis by Originality AI, more than 20 major news organisations already block ia_archiverbot, the Archive's primary crawler. More broadly, 241 news sites worldwide block at least one of the Archive's four crawling bots, and a significant portion of those sites are owned by USA Today Co, the largest newspaper publisher in the United States. This has effectively removed hundreds of local publications from the historical record.

Archival news content offers AI companies vast quantities of high-quality, structured, and dated text and images, making it ideal for training models to mimic human writing. The Archive's URL and API interfaces further simplify access, allowing AI firms to extract data with ease. Much of this data has already been incorporated into key AI-training datasets, raising the stakes for publishers already embroiled in lawsuits against companies like OpenAI and Perplexity for alleged copyright violations.

“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” said Graham James, a spokesperson for The New York Times, as cited by The Next Web. “The Times invests an enormous amount of resources in producing original journalism, and that work should not be used without our permission.”

Other organisations, such as The Guardian, have adopted a more measured approach, limiting rather than completely blocking the Archive's access. This reflects a broader struggle to balance preservation with intellectual property rights, a challenge that resonates across Europe as well.

The Internet Archive's Position

Mark Graham, director of the Wayback Machine, maintains that the Archive is merely “collateral damage” in a conflict driven by AI companies. He argues that the real culprits are those accessing past content through the Archive's interfaces. In response, the Archive has taken steps to limit large downloads and automated extraction for certain materials.

Graham emphasises the Archive's role in preservation: without it, articles can be edited without any public record or accountability, whether to change quotes, correct mistakes, or reframe claims. The Wayback Machine currently tracks such changes, providing a vital check on digital integrity. This function is particularly relevant in Europe, where debates over data sovereignty and digital rights are intensifying.

Some news organisations are now seeking compromises with the Internet Archive, exploring workarounds that limit access rather than impose hard blocks. These negotiations underscore the complexity of the issue: how to protect publishers' investments in original journalism while preserving the public's access to historical records. As AI continues to reshape the media landscape, the outcome of these disputes will have lasting implications for both the news industry and the broader digital ecosystem.
