AI Companies Need to Be Regulated: An Open Letter to the U.S. Congress and European Parliament

Federico: Historically, technology has usually advanced in lockstep with opening up new creative opportunities for people. From word processors allowing writers to craft their next novel to digital cameras letting photographers express themselves in new ways or capture more moments, technological progress over the past few decades has sustained creators and, perhaps more importantly, spawned industries that couldn’t exist before.

Technology has enabled millions of people like myself to realize their life’s dreams and make a living out of “creating content” in a digital age.

This is all changing with the advent of Artificial Intelligence products based on large language models. If left unchecked by regulation, we believe the change may be for the worse.

Over the past two years, we’ve witnessed the arrival of AI tools and services that often use human work without consent in pursuit of faster and cheaper results. A fixation on maximizing profits above all else isn’t a surprise in a capitalist industry, but it’s highly concerning nonetheless – especially since, this time around, the majority of these AI tools have been built on a foundation of non-consensual appropriation, also known as – quite simply – digital theft.

As we’ve documented on MacStories and as other (and larger) publications have also investigated, it’s become clear that the foundation models behind different LLMs have been trained on content sourced from the open web without requesting publishers’ permission upfront. These models can then power AI interfaces that regurgitate similar content or provide answers with hidden citations that seldom prioritize driving traffic to publishers. As far as MacStories is concerned, this is limited to text scraped from our website, but we’re seeing this play out in other industries too, from design assets to photos, music, and more. And to top it all off, publishers and creators whose content was appropriated for training or crawled for generative responses (or both) can’t even ask AI companies to be transparent about which parts of their content were used. It’s a black box where original content goes in and derivative slop comes out.

We think this is all wrong.

The practices followed by the majority of AI companies are ethically unfair to publishers and brazenly walk a perilous line of copyright infringement that must be regulated. Most worryingly, if ignored, we fear that these tools may lead to a gradual erosion of the open web as we know it, diminishing individuals’ creativity and consolidating “knowledge” in the hands of a few tech companies that built their AI services on the back of web publishers and creators without their explicit consent.

In other words, we’re concerned that, this time, technology won’t open up new opportunities for creative people on the web. We fear that it’ll destroy them.

We want to do something about this. And we’re starting with an open letter, embedded below, that we’re sending on behalf of MacStories, Inc. to U.S. Senators who have sponsored AI legislation as well as Italian members of the E.U. Special Committee on Artificial Intelligence in a Digital Age.

In the letter, which we encourage other publishers to copy if they so choose, we outline our stance on AI companies taking advantage of the open web for training purposes, not compensating publishers for the content they appropriated and used, and not being transparent regarding the composition of their models’ data sets. We’re sending this letter in English today, with an Italian translation to follow in the near future.

I know that MacStories is merely a drop in the bucket of the open web. We can’t afford to sue anybody. But I’d rather hold my opinion strongly and defend my intellectual property than sit silently and accept something that I believe is fundamentally unfair for creators and dangerous for the open web. And I’m grateful to have a business partner who shares these ideals and principles with me.

With that being said, here’s a copy of the letter we’re sending to U.S. and E.U. representatives.


Hello,

We are writing to you on behalf of MacStories, Inc. in support of legislation regulating:

  • the non-consensual training of large language models by artificial intelligence companies using the intellectual property of third parties for commercial gain; and
  • the generation of AI-based content designed to replace or diminish the source material from which it was created.

MacStories is a small U.S. media company that was founded in Italy by Federico Viticci in 2009. Today, MacStories operates MacStories.net and produces several podcasts covering apps, technology, videogames, and the world of media, which draw a worldwide audience centered in the EU and US.

As business owners with a long history of operating on the web, we wanted to share our perspective on Artificial Intelligence (“AI”) large language model (“LLM”) training and some of the products created using them. What’s come into sharp focus for us in the past several weeks is that, as an industry, companies training AI models don’t respect the intellectual property rights of web-based content creators. Moreover, the cavalier attitude of these companies toward decades-old norms on the Internet makes it clear that AI model training and some of the products built with them threaten the very foundations of the web as an outlet for human creativity and communication.

The danger to the Internet as a cultural institution is real and evolving as rapidly as AI technology itself. However, while the threat to the web is novel, what these AI companies are doing is not. Quite simply, it’s theft, which is as old as AI is new. The thieves may be well-funded, and their misdeeds wrapped in a cloak of clever technology, but it’s still theft and must be stopped.

The source of the Internet’s strength is hyperlinks, which connect people and ideas in ways that are more valuable than the sum of their parts. But as the web grew, discovery became a problem. Google and other companies built search engines that use web crawlers to index the web. Search engines like Google’s are imperfect, but by and large, they offer a fair trade. In exchange for crawling and indexing a publisher’s website, links to that content appear in search results, sending traffic to the publisher. And if a publisher doesn’t want their site crawled, they can opt out thanks to the Robots Exclusion Protocol by adding a simple robots.txt file to their website. It’s a social contract among the participants of the web that worked for decades before the advent of AI.
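To illustrate how the Robots Exclusion Protocol works in practice, here is a minimal sketch of a robots.txt file a publisher might serve to welcome a traditional search crawler while opting out of AI training crawlers (the crawler names shown are ones these companies have publicly documented; whether they are honored is entirely up to the crawler, since compliance is voluntary):

```
# robots.txt — served at the site root, e.g. https://example.com/robots.txt

# Allow a traditional search engine crawler to index the site
User-agent: Googlebot
Allow: /

# Opt out of AI training crawlers (honored only if the crawler complies)
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Default rule for all other crawlers
User-agent: *
Allow: /
```

Note that this mechanism is purely advisory: nothing technically prevents a crawler from ignoring these rules, which is precisely why the social contract described above depends on good faith.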

However, it turns out that feeding more raw material into an LLM produces models that perform better. As a result, the companies making these models have an insatiable appetite for text, images, and video, which led them straight to the web, where they strip-mine its landscape for fuel to feed their voracious models.

The trouble with the companies developing LLMs is that instead of offering a fair trade to publishers and other creators and respecting their wishes about whether their content is crawled, they just took it, and in some cases, brazenly lied to everyone along the way. The breadth of offenders is staggering. This isn’t just a startup problem. In fact, a wide swath of the tech industry, including behemoths like Apple, Google, Microsoft, and Meta, have joined OpenAI, Anthropic, and Perplexity in ingesting the intellectual property of publishers without their consent, and then used that property to build their own commercial products. None of them paid any regard to the Robots Exclusion Protocol. Instead, some offered a way for publishers to opt out of their crawling activities, but only after they’d already taken the entire corpus of the Internet – like a thief offering a shopkeeper a lock after emptying their storefront.

Some companies have gone even further, devising products aimed at replacing the web as we know it by substituting AI-generated web pages for source material, which in many situations amounts to plagiarism. Perplexity Pages, The Browser Company’s Arc Search app, and the incorporation of AI answers in Google Search results are all designed to step between people and the creators of web content. All profess to drive traffic to the source material with obfuscated citations, but as Wired recently reported, which we’ve also seen, the traffic these products drive is negligible.

As technology writers and podcasters, we’ve built careers on our enthusiasm and excitement for new technology and how it can help people. AI is no different. There are roles it can play in fighting disease, climate change, and other challenges, big and small, that are faced by humanity. However, in the race to advance AI and satisfy investors, the tech industry has lost sight of the value of the Internet itself. Left unchecked, this devaluation of Internet culture will undermine the ability of today’s creators to earn a fair wage from their work and prevent the next generation of creators from ever hoping to do the same.

Consequently, on behalf of ourselves and similarly situated Internet publishers and creators, we request that you support legislation regulating the artificial intelligence industry to prevent further damage from being done and to compensate creators whose work has already been misappropriated without their consent. The existing tools for protecting what is published on the web are too limited and imperfect. What’s needed is a comprehensive regulatory regime that puts content creators on an even footing with the companies that want to use what they publish to feed their models. That starts by putting publishers in control of their content, requiring their consent before it can be used to train LLMs, and mandating transparency regarding the source material used to train those models.

Federico Viticci, Editor-in-Chief and Co-Owner of MacStories
John Voorhees, Managing Editor and Co-Owner of MacStories
