The AI Scraping Fight That Could Change the Future of the Web -- WSJ

Dow Jones
2025/07/09

By Isabella Simonetti and Robert McMillan

Publishers are stepping up efforts to protect their websites from tech companies that hoover up content for new AI tools.

The media companies have sued, forged licensing deals to be compensated for the use of their material, or both. Many asked nicely for artificial-intelligence bots to stop scraping. Now, they are working to block crawlers from their sites altogether.

"You want humans reading your site, not bots, particularly bots that aren't returning any value to you," said Nicholas Thompson, the chief executive of the Atlantic.

Scraping is nearly as old as the web itself. But the web has changed significantly since the 1990s, when Google was a scrappy startup. Back then, there were benefits to letting Google crawl freely: sites that were scraped would pop up in search results, driving traffic and ad revenue.

A new crop of AI-fueled chatbots, from ChatGPT to Google's Gemini, now deliver succinct answers using troves of data taken from the open web, eliminating the need for many users to visit websites at all. Search traffic has dropped precipitously for many publishers, who are bracing for further hits after Google began rolling out AI Mode, which responds to user queries with far fewer links than a traditional search.

Scraping activity has jumped 18% in the past year, according to Cloudflare, an internet services company.

The outcome of the copyright fights and technical efforts to curb free scraping could have a seismic impact on the future of the media industry -- and the internet at large. Publishers are essentially trying to fence off swaths of the web while AI companies argue that the material they are scraping is fair game.

The Atlantic has a licensing deal with OpenAI.

It plans to turn off the data spigot for many other AI companies with the help of Cloudflare, which said earlier this month it introduced a new feature that would act as a toll booth for AI scrapers. Customers can decide whether they want AI crawlers to access their content and how the material can be used.

"People who create intellectual property need to be protected or no one will make intellectual property anymore," said Neil Vogel, the CEO of Dotdash Meredith, whose brands include People and Southern Living.

The media company has a content licensing deal with OpenAI and is working with Cloudflare to choke off what Vogel called "bad actors" who don't want to compensate publishers.

It isn't clear yet how well Cloudflare's efforts will work to curb scraping. Some other companies, including Fastly and DataDome, also try to help publishers manage unwanted bots. Technology companies have few incentives to work with intermediaries, but publishers say they are keen to at least try to tamp down the use of their work.

Until recently, USA Today owner Gannett tried to keep bots out mainly by relying on robots.txt, a file based on a decades-old protocol that tells crawlers whether they can scrape or not. Renn Turiano, Gannett's chief consumer and product officer, likened the effort to "putting up a 'Do Not Trespass' sign."
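
For readers unfamiliar with the file, the sketch below shows roughly how a well-behaved crawler might consult it before fetching a page. It uses Python's standard urllib.robotparser module; the bot names, paths, rules and the is_allowed helper are hypothetical and are not taken from any publisher's actual file. The check is voluntary, so a crawler that never runs it is not stopped by anything shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/news/story"),
        ("SomeSearchBot", "https://example.com/news/story"),
        ("SomeSearchBot", "https://example.com/private/draft"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

In this sketch the hypothetical AI bot is refused everywhere, while other crawlers are refused only under /private/. Nothing enforces either rule: the file states a preference, and honoring it is left to the crawler.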

AI companies ignored those types of signs and added bots that override robots.txt instructions, according to data from TollBit, which works with publishers including Time and the Associated Press to monitor and monetize scraping activity.

Reddit sued AI startup Anthropic last month, claiming that it was scraping the online-discussion site without permission -- and had hit the site more than 100,000 times even after Anthropic said it would stop. Anthropic said it disagrees with Reddit's claims and will defend itself vigorously in court.

The do-it-yourself tech repair site iFixit said it blocked Anthropic's scraper after it hit the company's servers one million times in a 24-hour period last year. "You're not only taking our content without paying, you're tying up our... resources. Not cool," iFixit CEO Kyle Wiens wrote in an X message.

Wikimedia, the publisher of Wikipedia, said earlier this year that it is planning to change its site access policies "to help us identify who is reusing content at scale." The company said scrapers are overloading its infrastructure.

Some worry that academic research, security scans and other types of benign web crawling will get elbowed out of websites as barriers are built around more sites.

"The web is being partitioned to the highest bidder. That's really bad for market concentration and openness," said Shayne Longpre, who leads the Data Provenance Initiative, an organization that audits the data used by AI systems.

Legal battles between publishers and tech companies are winding their way through courts. The New York Times, which has a licensing agreement with Amazon.com, has an ongoing suit against Microsoft and OpenAI. Meanwhile, The Wall Street Journal's parent company, News Corp, has a content deal with OpenAI, and two of News Corp's subsidiaries have sued Perplexity.

Meta Platforms and Anthropic won partial victories in June in two separate cases. The judge in the Anthropic suit said pulling copyrighted material to train AI models is fair use in certain scenarios.

For the Internet Archive, a site that both archives the internet and is scraped by others, the uncertainty over what actions are fair has become paralyzing.

Brewster Kahle, the website's founder and digital librarian, said lawsuits and unclear lines around scraping could set back artificial-intelligence companies in the U.S. "This is not a way to run a major industry," he said.

Write to Isabella Simonetti at isabella.simonetti@wsj.com and Robert McMillan at robert.mcmillan@wsj.com

 

(END) Dow Jones Newswires

July 09, 2025 09:00 ET (13:00 GMT)

Copyright (c) 2025 Dow Jones & Company, Inc.
