Is ChatGPT's Use Of Web Content Fair?


Large Language Models (LLMs) such as ChatGPT are trained on a variety of data, including web content. That data becomes the basis for summaries and generated articles, which are published without attribution or credit to the authors of the original content used to train ChatGPT.

Search engines download web content (a process called “crawling” and indexing) in order to answer queries with links back to the original websites.

Website owners can opt out of having their site crawled and indexed by using the Robots Exclusion Protocol, commonly called robots.txt.

The Robots Exclusion Protocol is not an official Internet standard, but it is one that legitimate web crawlers obey.

Should web publishers be able to use the robots.txt protocol to stop large language models from using their website content?

Large Language Models Use Web Content Without Attribution

Some people who work in search marketing are unhappy that website data is used to train machines without offering anything in return, such as acknowledgment or traffic.

Hans Petter Blindheim (LinkedIn profile), Director at Curamando, shared his thoughts with me.

Hans Petter commented:

  • “When an author writes something after learning from an article published on your website, they will typically link to your original article as a sign of respect and a professional courtesy.
  • It’s called a citation.
  • But the scale at which ChatGPT absorbs content without giving anything back is what distinguishes it from Google and from people.
  • A website is usually created with a specific business goal in mind.
  • Google helps people find the information they are looking for, generating traffic, which benefits both sides.
  • But large language models did not ask permission to use your content; they simply use it in a broader context than was anticipated when you published it.
  • And if the AI language models do not offer value in return, why should publishers allow them to crawl and use the content?
  • Does their use of your content meet the standard of fair use?
  • When ChatGPT and Google’s own ML/AI models train on your content without permission, spin what they learn there, and use it while keeping people away from your websites – shouldn’t the industry and lawmakers also try to take back control over the Internet by forcing them to transition to an “opt-in” model?”

The concerns Hans Petter expresses are reasonable.

Given the speed at which technology is changing, should fair use laws be reviewed and updated?

I asked John Rizvi, a Registered Patent Attorney (LinkedIn profile) who is board certified in Intellectual Property Law, whether Internet copyright law is outdated.

John answered:

  • “Yes, absolutely.
  • One of the major points of contention in such cases is the fact that the law inevitably evolves more slowly than technology does.
  • In the 1800s this maybe didn’t matter so much, because advances were relatively slow and the legal machinery was more or less tooled to match.
  • Today, however, runaway technological advances have far outstripped the law’s ability to keep up.
  • There are simply too many advances and too many moving parts for the law to handle.
  • As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped to keep pace with technology… and we should recognize that this isn’t an entirely bad thing.
  • So in one sense, yes, intellectual property law does need to evolve if it even intends, let alone hopes, to keep pace with technological advances.
  • The primary problem is striking a balance between keeping up with the ways various forms of technology can be used while holding back from blatant overreach or suppression of speech for political gain, cloaked in good intentions.
  • The law also has to take care not to legislate against possible uses of technology so broadly that it strangles any benefit that might derive from them.
  • You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.
  • And trying to envision every possible use of a technology years or decades before the framework exists to make it viable, or even possible, would be an exceedingly dangerous fool’s errand.
  • In situations like this, the law really cannot help but be reactive to how technology is used… not necessarily how it was intended to be used.
  • That’s unlikely to change anytime soon, unless we hit a massive and unanticipated technology plateau that allows the law to catch up with current developments.”

It appears that copyright law involves many factors when it comes to how AI is trained, and there is no one-size-fits-all answer.

OpenAI and Microsoft Sued

An interesting case that was recently reported is one in which OpenAI and Microsoft used open source code to create their Copilot product.

The problem with using the open source code is that the Creative Commons license requires attribution.

According to an article published in an academic journal:

  • “Plaintiffs allege that OpenAI and GitHub assembled and released Copilot, a commercial product that generates code using publicly accessible software originally distributed under a variety of ‘open source’-style licenses, many of which include an attribution requirement.
  • As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’
  • The resulting product allegedly omitted any credit to the original creators.”

The author of the article, a legal expert on copyright, wrote that many people regard Creative Commons licenses as a “free-for-all.”

Some people may also consider “free-for-all” a fair description of the datasets of Internet content that are scraped and used to build AI products such as ChatGPT.

Background information on LLMs and Datasets

Large language models train on multiple datasets. These datasets can include books, email archives, government data, Wikipedia articles, and even websites linked from Reddit posts that have at least three upvotes.

Many of the datasets of Internet content originate from a crawl created by a non-profit organization called Common Crawl.

Their data, known as the Common Crawl dataset, is available to download and use for free.
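As a rough illustration of how open that data is, the sketch below uses only Python’s standard library to query Common Crawl’s public CDX index and list a few captured pages for a domain. The crawl label CC-MAIN-2023-06 and the domain example.com are placeholder assumptions; the labels that actually exist are listed at index.commoncrawl.org.

# Hedged sketch: query the public Common Crawl CDX index for a domain.
# "CC-MAIN-2023-06" is an example crawl label (an assumption); check
# index.commoncrawl.org for the labels that are really available.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CRAWL = "CC-MAIN-2023-06"  # placeholder crawl label
DOMAIN = "example.com/*"   # placeholder domain pattern

query = urlencode({"url": DOMAIN, "output": "json", "limit": "5"})
index_url = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"

with urlopen(index_url) as response:
    # The index returns one JSON record per line for each captured URL.
    for line in response.read().decode("utf-8").splitlines():
        record = json.loads(line)
        print(record.get("timestamp"), record.get("url"))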

The Common Crawl dataset is also the starting point for a variety of other datasets created from it.

For instance, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).

This is how the GPT-3 researchers describe their use of the web data contained in the Common Crawl dataset:

  • “Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.
  • This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.
  • However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.
  • Therefore, we took three steps to improve the average quality of our datasets:
  • (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,
  • (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and
  • (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”
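The paper does not publish its deduplication code, but the idea behind step (2), fuzzy deduplication, can be sketched in plain Python: represent each document by its word n-grams and drop any document whose overlap with an already-kept document is too high. The three-word shingles and the 0.8 similarity threshold below are arbitrary choices for illustration, not values taken from the paper.

# Illustrative sketch of fuzzy deduplication (not the GPT-3 pipeline itself):
# compare documents by the Jaccard overlap of their word shingles and keep
# only documents that are not near-duplicates of something already kept.

def shingles(text, n=3):
    """Return the set of n-word shingles for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def fuzzy_dedupe(documents, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in documents:
        sig = shingles(doc)
        if all(jaccard(sig, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sig)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank now",
    "A completely different article about training language models on web text",
]
print(len(fuzzy_dedupe(docs)))  # prints 2 – the near-duplicate is dropped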

Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to develop the Text-to-Text Transfer Transformer (T5), also has its roots in the Common Crawl dataset.

The research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:

“Before presenting the results of our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe our ‘Colossal Clean Crawled Corpus’ (C4), the Common Crawl-based dataset we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on its AI blog that explains in detail how Common Crawl data (which contains content scraped from the Internet) was used to create C4.

They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Existing pre-training datasets don’t meet all three of these criteria: for example, text from Wikipedia is high quality but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”
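The blog post does not include the cleaning code, but a minimal sketch of what those heuristics can look like is shown below. The specific rules (keep only lines that end like complete sentences, drop lines containing blocklisted terms, drop duplicate lines) are illustrative guesses at the approach the description outlines, not Google’s actual C4 implementation.

# Rough sketch of C4-style page cleaning (illustrative heuristics only).
BLOCKLIST = {"lorem", "ipsum"}  # stand-in for a real list of unwanted terms

def clean_page(text):
    seen = set()
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that end like a complete sentence.
        if not line.endswith((".", "!", "?", '"')):
            continue
        # Drop lines containing blocklisted terms.
        if any(term in line.lower() for term in BLOCKLIST):
            continue
        # Drop exact duplicate lines.
        if line in seen:
            continue
        seen.add(line)
        kept_lines.append(line)
    return "\n".join(kept_lines)

sample = """Welcome to my site
This is a complete sentence about web content.
This is a complete sentence about web content.
Click here
Lorem ipsum dolor sit amet.
Another complete sentence that survives the filter."""
print(clean_page(sample))  # only the unique, sentence-like, non-blocklisted lines remain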

Google, OpenAI, and even Oracle’s Open Data are using Internet content, your content, to create datasets that are then used to build AI products like ChatGPT.

Common Crawl Can Be Blocked

You can block Common Crawl and thereby opt out of all the datasets that are based on Common Crawl data.

However, if your site has already been crawled, then its content is already in those datasets. There is no way to remove your website’s content from the Common Crawl dataset or from the other datasets built on it, such as C4 and Open Data.

Using the robots.txt protocol will only block future crawls by Common Crawl; it will not remove content that is already included in the datasets.

How to Block Common Crawl From Your Website

Blocking Common Crawl is possible through the robots.txt protocol, within the limitations discussed above.

The Common Crawl bot is called CCBot.

The most current CCBot user-agent string is: CCBot/2.0

Blocking CCBot with robots.txt is done the same way as blocking any other bot.

Here is the robots.txt directive for blocking CCBot:

User-agent: CCBot
Disallow: /

CCBot crawls from Amazon AWS IP addresses.

CCBot also obeys the nofollow robots meta tag.
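If you add the directive above, you can sanity-check it with Python’s built-in urllib.robotparser module, which applies robots.txt rules the same way a well-behaved crawler would. This is only a local check of the rule, not an official Common Crawl tool.

# Verify locally that a robots.txt rule blocks CCBot but not other crawlers.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# CCBot is denied everywhere; a crawler with no matching rule is still allowed.
print(parser.can_fetch("CCBot", "https://example.com/any-page"))         # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/any-page"))  # True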

What If You Block Common Crawl?

Web content can be downloaded without consent; that is how browsers work, they download content.

Google, or anyone else, does not need permission to download and use content that is published publicly.

Website Publishers Have Limited Options

The question of whether it is ethical to train AI on web content does not seem to be part of the conversation about the ethics of how AI technology is developed.

It seems to be taken for granted that Internet content can be downloaded, summarized, and transformed into a product called ChatGPT.

Does that seem fair? It’s a complicated question. Hopefully this has given you a better idea of how ChatGPT is built on web content.
