How tech giants are cutting corners on data harvesting for AI

The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain this data, tech companies including OpenAI, Google and Meta have cut corners, ignored company policies and debated skirting the law, according to a New York Times investigation.

At Meta, which owns Facebook and Instagram, executives, lawyers and engineers discussed last year whether to buy the publishing house Simon & Schuster to obtain long-form works, according to recordings of internal meetings obtained by The Times. They also discussed collecting copyrighted data from across the internet, even if that meant facing legal action. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google has transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said. That practice could violate the copyrights on the videos, which belong to their creators.

Last year, Google also expanded its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message seen by The Times, was to allow Google to use publicly available Google documents, restaurant reviews on Google Maps and other online materials for more of its AI products.

The companies’ actions illustrate how online information – news stories, works of fiction, forum posts, Wikipedia articles, computer programs, photos, podcasts and video clips – has increasingly become the lifeblood of the booming AI industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.
