Expert opinion

Legal Precedent Challenging AI Training Data Practices: Expert Analysis

A recent U.S. court ruling in the case of Thomson Reuters v. Ross Intelligence is a significant moment for generative AI development. It could change how companies gather and use data for training their models. This is the first major case in the U.S. involving AI and copyright, and its effects go beyond just the companies involved.

Particularly significant is that the judge rejected the fair use argument, which AI companies often cite in similar disputes. In the judge's opinion, Ross was creating a direct competitor to Westlaw rather than transforming the content for research or educational purposes. If courts start following this case as a precedent, similar lawsuits will likely be filed against almost all companies involved in model training.

We talked about this with Volodymyr Getmanskyi, Head of Artificial Intelligence Office at ELEKS, to learn more about key complications around data governance and AI model training.

How will the ruling impact the way companies collect data for AI training?

First, the ruling raises the problem of how to classify data, especially in Web 4.0. It starts with questions as basic as how your browser caches data and whether you are allowed to reuse that cached data afterwards.

Data science and AI professionals are typically aware of such issues (whether sources are public, whether datasets carry copyright metadata) and check them before use. However, there are controversial cases, such as when OpenAI's CTO, Mira Murati, could not say in a WSJ interview what data was used to train Sora.

How can training data origins be tracked?

Some of the ideas around this look quite innovative, such as using blockchain blocks to track how data is distributed, or some kind of steganography to hide copyright information inside the data itself. However, the main question of how to verify an already trained model remains open, especially in cases of distillation or transfer learning: it is still unclear how to inspect the parameters and the forward propagation path to determine whether specific samples were present in training.
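As a rough illustration of the block-based tracking idea, the sketch below keeps a hash-chained log of dataset records. It is only a minimal sketch under stated assumptions: it is not a real blockchain (there is no network or consensus), and the function names `append_record` and `verify_chain` and the record fields are hypothetical, chosen for this example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def append_record(chain: list, sample: bytes, source: str, license_label: str) -> dict:
    """Append a provenance record that is linked to the previous record by hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    body = {
        "sample_hash": sha256_hex(sample),  # fingerprint of the data, not the data itself
        "source": source,                   # hypothetical field: where the sample came from
        "license": license_label,           # hypothetical field: IP label of the sample
        "prev_hash": prev_hash,
    }
    record_hash = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
    record = dict(body, record_hash=record_hash)
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Check that no record was altered and that the links between records hold."""
    prev_hash = GENESIS_HASH
    for record in chain:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if body["prev_hash"] != prev_hash:
            return False
        expected = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
        if expected != record["record_hash"]:
            return False
        prev_hash = record["record_hash"]
    return True
```

A log like this can only show which samples entered a dataset and under what label. It says nothing about whether a given trained model actually used them, which is the open question described above.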

How effective are synthetic data generation methods?

There are many cases where synthetic datasets help a lot, but they also raise a side question: if some module, algorithm, or model knows how to generate the data and captures all the dependencies and differences inside it, why not use it as the primary model, without generation and additional training or architecture search?

From a technical viewpoint, can companies create data filtering systems that exclude any copyrighted content?

The data or samples should be labelled first and then pass through the simplest kind of filter, one based on IP labels. Without labels, the only option is to find each sample somewhere on the internet, check for the primary source (origin), and verify the copyright. That looks too complicated and time-consuming.
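The label-based filter described above can be sketched in a few lines. This is a minimal sketch, assuming every sample already carries an IP label; the `Sample` fields, the `ALLOWED_LICENSES` set, and the function name `filter_by_ip_label` are hypothetical, and deciding which labels are acceptable is a legal question rather than a technical one.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical allow-list of IP labels, for illustration only.
ALLOWED_LICENSES: Set[str] = {"cc0", "cc-by", "public-domain"}


@dataclass(frozen=True)
class Sample:
    text: str
    license: Optional[str] = None  # IP label attached when the sample was collected
    source: Optional[str] = None   # origin of the sample, if known


def filter_by_ip_label(
    samples: Iterable[Sample],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (kept, rejected) using only their IP labels.

    Samples with a missing or unknown label are rejected rather than guessed at.
    """
    kept: List[Sample] = []
    rejected: List[Sample] = []
    for sample in samples:
        label = (sample.license or "").strip().lower()
        if label in allowed:
            kept.append(sample)
        else:
            rejected.append(sample)
    return kept, rejected
```

Rejecting unlabelled samples by default keeps the filter cheap. The rejected list is exactly the set that would need the manual origin and copyright check described above.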

FAQs

What is model training in AI?

Model training helps an AI system learn from data to make predictions or decisions. It’s like teaching a student: you provide examples (training data), and the model learns to identify patterns and relationships in that data.
