Developing AI models ethically: Ensuring copyright compliance and factual validation

By Victor Botev
Published 2 years ago

When constructing large language models (LLMs), developers require immense amounts of training data, often measured in hundreds of terabytes or even petabytes. The challenge lies in obtaining this data without violating copyright law, relying on inaccurate information, or inviting lawsuits.

Some AI developers have been discovered collecting pirated ebooks, proprietary code, or personal data from online sources without consent. This stems from a competitive push to develop the largest possible models, increasing the likelihood of using copyrighted training data, causing environmental damage, and producing inaccurate results. A more effective approach would be to develop smart language models (SLMs) with a horizontal knowledge base, using ethically-sourced training data and fine-tuning to address specific business challenges.

Avoiding Copyright Infringement and Illegal Datasets

To ensure AI models comply with future regulations, developers must verify the sources of all training data. This is often easier for large corporations like Amazon or Microsoft, which possess vast amounts of user data. Start-ups, however, face difficulties in gathering comparable data whilst avoiding copyrighted material.

Firstly, developers should obtain the necessary permissions or licences to access and use selected datasets, and establish rules governing data collection and storage. Secondly, they should consider using smaller datasets to train models, or fine-tuning existing open-source alternatives. This simplifies data collection and verification whilst offering the opportunity to enhance reliability in specific domains or industries.
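As a rough illustration of what a rule governing data collection might look like in practice, the sketch below splits candidate records into accepted and rejected sets according to a licence allow-list. The `Record` type, the `split_by_licence` function, and the contents of `ALLOWED_LICENCES` are all hypothetical; which licences are actually acceptable is a legal and policy decision, not something the code can settle.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical allow-list, for illustration only.
ALLOWED_LICENCES: Set[str] = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}


@dataclass(frozen=True)
class Record:
    """One candidate training document plus its provenance (hypothetical schema)."""
    source_url: str
    licence: Optional[str]  # None means the licence is unknown
    text: str


def split_by_licence(
    records: Iterable[Record],
    allowed: Set[str] = ALLOWED_LICENCES,
) -> Tuple[List[Record], List[Record]]:
    """Return (accepted, rejected); unknown licences are rejected by default."""
    accepted: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        if record.licence is not None and record.licence in allowed:
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected


if __name__ == "__main__":
    candidates = [
        Record("https://example.org/paper-1", "CC-BY-4.0", "Open-access abstract."),
        Record("https://example.org/ebook-2", "All-Rights-Reserved", "Novel text."),
        Record("https://example.org/post-3", None, "Forum post of unknown origin."),
    ]
    accepted, rejected = split_by_licence(candidates)
    print(len(accepted), "accepted;", len(rejected), "held back for review")
```

Rejecting records with an unknown licence by default keeps the burden of proof on inclusion, and retaining the rejected list rather than discarding it leaves an audit trail of what was excluded.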

Synthetic training data offers another alternative, as it can achieve higher accuracy levels and circumvent copyright issues entirely.

Addressing Specific Problems

Developers should identify the specific problem they aim to solve, such as locating relevant papers within vast scientific research repositories, and train their model on a focused, labelled dataset from authoritative sources, such as publicly available academic research.

Ensuring Accuracy and Preventing Misinformation

By aggregating a high-quality, thoroughly appraised set of training data, developers can help their models provide accurate, informed responses and reduce the spread of misinformation. Factual validation should be a key aspect of model architecture.

Developers should aim to construct models with more selective unsupervised learning, enhanced attention spans, and improved focus, using internal mechanisms to filter data before incorporating it into the training process.
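One simple form such a filter could take is a pre-processing pass that runs before any document reaches the training set. The sketch below is an assumption about what that might look like, not a description of any particular model's pipeline: the function names and the thresholds (`min_words`, `min_alpha_ratio`) are hypothetical, and it only drops very short documents, documents that are mostly non-alphabetic, and exact duplicates after normalisation.

```python
import hashlib
from typing import Iterable, List


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def filter_corpus(
    docs: Iterable[str],
    min_words: int = 5,            # hypothetical threshold
    min_alpha_ratio: float = 0.6,  # hypothetical threshold
) -> List[str]:
    """Keep documents that pass basic length, content, and duplicate checks."""
    seen_hashes = set()
    kept: List[str] = []
    for doc in docs:
        norm = normalise(doc)
        # Drop documents too short to carry useful signal.
        if len(norm.split()) < min_words:
            continue
        # Drop documents that are mostly digits or symbols.
        non_space = sum(not ch.isspace() for ch in norm)
        letters = sum(ch.isalpha() for ch in norm)
        if non_space == 0 or letters / non_space < min_alpha_ratio:
            continue
        # Drop exact duplicates of an earlier document (after normalisation).
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "The model was trained on licensed academic abstracts only.",
        "the model  was trained on licensed academic abstracts only.",
        "too short",
        "1234 5678 !!!! #### $$$$ %%%% 9999",
    ]
    for doc in filter_corpus(sample):
        print(doc)
```

Checks like these address only surface quality; validating the facts a document asserts would need a separate step against authoritative sources.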

Pursuing a Smarter Approach to Language Models

Today's largest LLMs consume vast amounts of electricity and resources. For instance, training BigScience’s model, BLOOM, used $7 million worth of grant-funded compute on one of the world’s biggest supercomputers: Jean Zay, near Paris. This is not only environmentally damaging but also highly inefficient. By refining the training process and focusing on specific use-cases, we can develop sustainable, future-proof models.

In specialised, technical domains, quality is significantly more important than quantity. Developers could also consider using a swarm of smart language model agents to tackle multiple facets of a business problem, rather than relying on a single LLM.

We must focus on 'data-centric AI' if we want to push the industry beyond where it is today. This means engineering the data required to construct specific AI models, and improving data quality and labelling so that they keep pace with cutting-edge algorithms.

To create AI models that comply with copyright laws, developers should prioritise quality over quantity. They must research sources, understand data requirements for specific use-cases, and implement factual validation mechanisms to ensure accuracy. By working together, we can develop smarter language models, not merely larger ones.

Image Credit: Wayne Williams

Victor Botev is CTO and Co-Founder of Iris.ai.
