A recent U.S. court ruling in the case of Thomson Reuters v. Ross Intelligence is a significant moment for generative AI development. It could change how companies gather and use data for training their models. This is the first major case in the U.S. involving AI and copyright, and its effects go beyond just the companies involved.
Particularly significant is that the judge rejected the fair use argument, which AI companies often cite in similar disputes. In the judge's opinion, Ross was creating a direct competitor to Westlaw rather than transforming the content for research or educational purposes. If courts start following this case as a precedent, similar lawsuits will likely be filed against almost all companies involved in model training.
We talked about this with Volodymyr Getmanskyi, Head of the Data Science Office at ELEKS, to learn more about key complications around data governance and AI model training.
First, there is the problem of how to classify data, especially in Web 4.0. It starts with questions as basic as how your browser caches data and whether that cached data can be reused.
Data science and AI professionals are typically aware of such issues (public sources, copyright metadata in datasets) and always check for them before using the data. However, there are controversial cases, such as when OpenAI's CTO, Mira Murati, could not say in a WSJ interview what data was used to train Sora.
Some ideas in this area look quite innovative, such as using blockchain blocks to track distribution or some form of steganography to embed copyright information in the data. However, the main question of how to verify trained models remains open, especially in cases of distillation or transfer learning: it is still unclear how to inspect the parameters and the forward propagation path to determine whether specific samples were used in training.
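The tracking idea can be illustrated with a simple hash chain. This is only a minimal sketch: it is a local, append-only list in which each record commits to the previous one, not a real blockchain (there is no network, consensus, or signing). The function names `append_record` and `verify_chain` and the record fields are hypothetical, chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash used before the first record


def _digest(body: dict) -> str:
    """Stable SHA-256 over a record body (keys sorted for determinism)."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: list, sample_bytes: bytes, owner: str, license_label: str) -> dict:
    """Append a provenance record that commits to the previous record's hash."""
    body = {
        "sample_sha256": hashlib.sha256(sample_bytes).hexdigest(),
        "owner": owner,
        "license": license_label,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    record = dict(body, hash=_digest(body))
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Return True only if no record was altered, removed, or reordered."""
    prev_hash = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


chain = []
append_record(chain, b"first article text", "Publisher A", "proprietary")
append_record(chain, b"second article text", "Publisher B", "cc-by-4.0")
chain_is_valid = verify_chain(chain)
```

A chain like this shows who registered which sample and under what label, but it does not address the open question above: it says nothing about whether a given trained model actually saw the sample.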
Synthetic datasets help a lot in many cases, but they raise a side question: if some module, algorithm, or model knows how to generate the data, including all the dependencies and differences within it, why not use it as the primary model, without generation, additional training, or architecture search?
The data or samples should be labelled first and then passed through the simplest filter, one based on IP labels. Without labels, the only option is to find each sample somewhere on the internet, trace it to its primary source (origin), and verify the copyright, which looks too complicated and time-consuming to be practical.
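A label-based filter of this kind could look roughly like the sketch below. The `Sample` class, the `ALLOWED_LICENSES` allow-list, and the `split_by_license` function are all hypothetical names introduced for illustration; which licenses are acceptable is a legal and policy decision, not something this code settles.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Hypothetical allow-list; a real project would define its own policy.
ALLOWED_LICENSES = frozenset({"cc0-1.0", "cc-by-4.0", "mit"})


@dataclass(frozen=True)
class Sample:
    text: str
    license: Optional[str]  # None means the sample was never labelled


def split_by_license(
    samples: Iterable[Sample],
    allowed: frozenset = ALLOWED_LICENSES,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (usable, needs_review) using only the IP label."""
    usable, needs_review = [], []
    for sample in samples:
        # Normalise the label so "CC0-1.0" and " cc0-1.0 " match the allow-list.
        label = (sample.license or "").strip().lower()
        if label in allowed:
            usable.append(sample)
        else:
            # Unlabelled or disallowed samples need manual source tracing.
            needs_review.append(sample)
    return usable, needs_review


samples = [
    Sample("public domain text", "CC0-1.0"),
    Sample("scraped text with no label", None),
    Sample("licensed database entry", "proprietary"),
    Sample("open-source snippet", " MIT "),
]
usable, needs_review = split_by_license(samples)
```

Anything that lands in `needs_review` falls back to the slow path described above: locating the origin and checking the copyright by hand.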
Model training helps an AI system learn from data to make predictions or decisions. It’s like teaching a student: you provide examples (training data), and the model learns to identify patterns and relationships in that data.
Pre-trained AI models are available from several model hubs, such as Hugging Face, PyTorch Hub, and GitHub repositories.
Training an AI model involves preparing and cleaning data, choosing a model type like linear regression or neural networks, and selecting a learning method: supervised, unsupervised, or semi-supervised. After training the model with the data, you must validate and test it using separate datasets to ensure it works well. If it doesn't meet the desired accuracy, retraining may be necessary.
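The workflow described above can be sketched end to end with a deliberately small example. It assumes a one-variable linear regression on synthetic data and uses only the Python standard library. The helper names, the 70/15/15 split, and the `MAX_MSE` threshold are illustrative assumptions, not recommended values.

```python
import random


def clean(rows):
    """Data preparation: drop rows with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def split(rows, train=0.7, val=0.15, seed=0):
    """Shuffle, then split into training, validation, and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_linear(rows):
    """Supervised training: ordinary least squares for y = a * x + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def mse(model, rows):
    """Mean squared error of the fitted line on a dataset."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in rows) / len(rows)


# Synthetic labelled data: y = 2x + 1 plus small noise, and one broken row.
rng = random.Random(42)
raw = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in range(100)]
raw.append((None, 3.0))

data = clean(raw)
train_set, val_set, test_set = split(data)
model = fit_linear(train_set)

MAX_MSE = 0.05  # hypothetical accuracy target
val_error = mse(model, val_set)    # used to decide whether to retrain
test_error = mse(model, test_set)  # held out for the final check
needs_retraining = val_error > MAX_MSE
```

The validation set drives the retraining decision, while the test set is touched only once at the end, so the final error estimate is not biased by tuning.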