Obviously nobody fully knows where all that training data comes from. They used web scraping tools like there was no tomorrow, and with that amount of information you can’t trace where every piece of training material originated. That doesn’t mean the tool is unreliable, but it does mean we don’t truly know why it’s this good; short of inspecting every layer of the digital brains running these machines, which isn’t doable with a closed-source model, we can only speculate. This is what’s called a black box, and we accept it because we trust the output enough to do so. Knowing the process behind each query in detail would be taxing anyway. In any case, I’m seeing more and more AI-generated content. YouTube is slowly but surely losing significance for me, since I no longer search for information there, and AI is one of the reasons.
They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves
Everyone says this, but the truth is copyright law has been unfit for purpose for well over 30 years now. When the laws were written, no one expected something like the internet to ever come along, and they certainly didn’t expect something like AI. We can’t just keep applying the same old copyright laws to new situations when they already don’t work.
I’m sure they did illegally obtain the work, but is that necessarily a bad thing? For example, they’re not actually making that content available to anyone. If I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?
if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that
There are definitely people out there that think you should be arrested for that.
Even the police are unsure if it’s actually a crime, though. Crimes require someone to lose something, and no one can point to a lost product, so it’s difficult to really quantify.
And it’s not even technically a breach of copyright, since you’re not selling it.
But they ARE selling it … Every answer ChatGPT gives comes from possibly stolen material.
You’re using the word ‘stolen’, which doesn’t fit. It would be accurate to say ‘every answer comes from possibly unlicensed material’.
Allegedly possibly maybe accidentally whoopsie not quite licensed fully material.
Yep, the real term (I think) would be copyright infringement.
Isn’t that true of every opinion you have? All the knowledge you have is based on the works of others who came before you.
Not until I bill you for it.
Also, no, there is such a thing as an original thought or opinion… even if it’s informed by other knowledge.
There is a difference between reinterpreting other knowledge and just Frankensteining multiple works together.
I don’t know enough about LLMs, but neural networks are capable of original thought. I suspect LLMs are too, because of their relationship to neural networks.
if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?
Because it’s more analogous to watching a video being broadcast outdoors in public, or looking at a mural someone painted on a wall, and letting it inform your creative works going forward. Not even recording it, just looking at it.
As far as we know, they never pirated anything. What we do know is that it was trained on data that literally anybody can go out, look at for themselves, and let inform their own work. If they’re out here torrenting a bunch of movies they don’t own or aren’t licensing, then the argument against them has merit. But until then, I think all of this is a bunch of AI hysteria over some shit humans have been doing since the first human created a thing.
An AI (in its current form) isn’t a person drawing inspiration from the world around it; it’s a program made by people, with inputs chosen by those people. If those people didn’t ask permission to use other people’s licensed work for their product, then they are plagiarising that work, and they should be subject to the same penalties that, for example, a game company using stolen art in its game should face. An AI doesn’t become inspired; it copies existing things to predict what it thinks its user wants to see. If we produce a real thinking AI at some point in the future, one with self-determination and whatnot, the story will be different, but for now it isn’t.
What is web scraping if not gathering information from around the world? As long as you’re not distributing copyrighted content (and the models in question here don’t, btw), then fair use is at play. I’m not plagiarizing the news by reading it or by talking about what I learned, but I would be if I just copy/pasted my response from the article.
Reading publicly available data isn’t a copyright violation, and it certainly isn’t a violation of fair use. If it were, then you just plagiarized my comment by reading it before you responded.
That is a bad thing if they want to be exempt from the law because they are doing a big, very important thing. We shouldn’t allow that.
The copyright laws are shit, but applying them selectively is orders of magnitude worse.
Because the actual comparison is that you stole ALL the movies, started your own Netflix with them, and are lining up to literally make billions by taking the jobs of millions of people, including those you stole from.
I would say it is closer to watching all the movies, regardless of how you got them, and then teaching a film class at UCLA.
If I paint a melty clock hanging off of a table, how have I stolen from Salvador Dali? What did I “steal” from Tolkien when I drew this?
you stole ALL movies, started your own Netflix with them
The model in question can’t even try to distribute copyrighted material. You could have easily checked for yourself, but once again I find myself having to do the footwork for you guys.
If you sell your melty clock then yes, it’s not “stealing,” but you are violating copyright. That’s how it works.
The “model in question” is a bit of a prototype. I thought it was clear we are talking about where these models are going… Maybe you’d get it if you came down off your high horse.
Dali doesn’t own the concept of a melting clock. If I include a melting clock in my own work, as long as it’s not his melting clock with all the other elements of his painting, it’s fair use.
GPT hasn’t been a prototype since before 2018, and the copyright restrictions are only getting tighter every time it’s updated so idk what you’re on about.
Ok, but training an AI is not equivalent to watching a movie. It’s more like putting a game on one of those 300-games-in-one DS cartridges back in the day.
I don’t think that is true. You aren’t reselling the movies. It is more like watching the movies then writing a recap or critique of the movies. Do you owe the copyright holder for doing that?
The problem with that being?
Obviously, it’s illegal to sell a product that uses copyrighted material you don’t hold the copyright to. This AI is not open source; it’s a for-profit system.
It doesn’t, though. You could have easily checked yourself, but I guess I’ll do your research for you.
It does though. You could have easily checked for yourself, but I guess I’ll do your research for you.
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
That article doesn’t even claim it’s distributing copyrighted material.
If that qualifies as distributing stolen copyrighted material, then this is stealing and distributing the “you shall not pass” LoTR scene. Which, again, ChatGPT won’t even do
That’s really not how it works, though. It’s a web crawler; they’re not going to download the whole internet.
And a reason they don’t is that it could actually be copyright infringement in some cases, whereas what they do legally isn’t (no matter how much people wish the law were set based on their emotions).
LLMs are just another iteration of search. Search engines do the same thing. Do we outlaw search engines?
I hope this is gonna become a new meme template
She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.
this is incomprehensible to me. can you try it with two or three sentences?
Her date was eating all the fully loaded nachos, so she went up to the waitress and asked her to make up a rule about how one person can’t eat all the nachos with meat and cheese. But her date knew that rule was bullshit and called her out on it. She’s trying to look confused and sad because they’re going to be too early for the movie.
thank you. it must be a reference to something, but i don’t watch tv any more.
I think you should leave…
(is what you would search to find this)
I’m sorry, what does this have to do with Coffin Flops? Does this mean it isn’t getting cancelled?
Gee, seems like something a CTO would know. I’m sure she’s not just lying, right?
It’s a question based on a purposeful misunderstanding of the technology. It’s like expecting a beekeeper to know each bee’s name and bedtime. Really, it’s like asking a bricklayer where each brick in the pile came from: he can tell you the batch, but he’s not going to know that this brick came from the fourth row of the sixth pallet, two from the left. There’s no reason to remember that; it’s not important to anyone.
They don’t log it because it would take huge amounts of resources and gain nothing.
[Citation needed]
What?
Compiling quality datasets is enormously challenging and labour intensive. OpenAI absolutely knows the provenance of the data they train on as it’s part of their secret sauce. And there’s no damn way their CTO won’t have a broad strokes understanding of the origins of those datasets.
And on the other hand, it is a very obvious question to expect. If you have something to hide, how in the world are you not prepared for this question!? 🤡
If I were the reporter my next question would be:
“Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?”
Hilarious, but if the reporter asked this they would find it harder to get invites to events, which is a problem for journalists. Unless you’re very well regarded for your journalism, you can’t push powerful people without risking your career.
That, and the reporter is there to get information, not to mess with and judge people. Asking that sort of question is really just an attack. We can leave judging people to commentators and ourselves.
this is limp dick energy. If asking questions is an attack then you’re probably a piece of shit doing bad things.
boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.
True, but if those same people they’re not supposed to be friends with are the ones inviting them to those events/granting them early access…
In other words: the system is rigged.
Again - boofuckinghooo. Let the fuckers have no friends in the media. The media owners already make journalists spineless advertisement sellers. I have very little respect for the profession at this point.
What a delightful and helpful attitude.
boofuckinghoo.
We’re sick and tired of this shit, it will never change if people make excuses for it.
There is no way in hell it isn’t copyrighted material.
Every video ever created is copyrighted.
The question is — do they need a license? Time will tell. This is obviously going to court.
Don’t downvote this guy. He’s mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether it had protections in the first place.
Maybe some surveillance camera footage is not sufficiently creative to get protections, but that’s hardly going to be good for machine reinforcement learning.
I have a feeling that the training material involves cheese pizza…
This tells you so much about what kind of company OpenAI is.
It also tells us how hypocritical we all are since absolutely every single one of us would make the same decisions they have if we were in their shoes. This shit was one bajillion percent inevitable; we are in a river and have been since we tilled soil with a plough in the Nile valley millennia ago.
most of us would never be in their shoes because most of us are not sociopathic techbros
I guess a lot of us didn’t learn from history, or even go see ‘Oppenheimer’…
Speak for yourself. Were I in their shoes, no, I would not. But then again, my company wouldn’t be as big as theirs for that reason.
Half open or half closed?
An Intelligence piracy company?
Funny she didn’t talk it out with lawyers beforehand. That’s a bad way to answer that.
what’s wrong with her face?
They use awkward stills to generate clicks
It’s annoying and distracting, just like the headline.
she grimaced?
So plagiarism?
I don’t think so. They aren’t reproducing the content.
I think the equivalent is you reading this article, then answering questions about it.
Idk why this is such an unpopular opinion. I don’t need permission from an author to talk about their book, or permission from a singer to parody their song. I’ve never heard any good arguments for why it’s a crime to automate these things.
I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it “stealing” the article.
Because people are afraid of things they don’t understand. AI is a very new and very powerful technology, so people are going to see what they want to see from it. Of course, it doesn’t help that a lot of people see “a shit load of cash” from it, so companies want to shove it into anything and everything.
AI models are rapidly becoming more advanced, and some of the new models are showing sparks of metacognition. Calling that “plagiarism” is being willfully ignorant of its capabilities, and it’s just not productive to the conversation.
True
Of course, it doesn’t help that a lot of people see “a shit load of cash” from it, so companies want to shove it into anything and everything.
And on a similar note to this, I think a lot of it is that OpenAI is profiting off of it and went closed-source. Lemmy being a largely anti-capitalist and pro-open-source group of communities, it’s natural to have a negative gut reaction to what’s going on, but not a single person here, nor any of my friends who accuse them of “stealing,” can tell me what is being stolen, or how it’s different from me looking at art and then making my own.
Like, I get that the technology is gonna be annoying and even dangerous sometimes, but maybe let’s criticize it for that instead of shit that it’s not doing.
I can definitely see why OpenAI is controversial. I don’t think you can argue that they didn’t do an immediate heel turn on their mission statement once they realized how much money they could make. But they’re not the only player in town. There are many open source models out there that can be run by anyone on varying levels of hardware.
As far as “stealing,” I feel like people imagine GPT sitting on top of this massive collection of data and acting like a glorified search engine, just sifting through that data and handing you stuff it found that sounds like what you want, which isn’t the case. The real process is, intentionally, similar to how humans learn things. So, if you ask it for something that it’s seen before, especially if it’s seen it many times, it’s going to know what you’re talking about, even if it doesn’t have access to the real thing.

That, combined with the fact that the models are trained to be as helpful as they possibly can be, means that if you tell it to plagiarize something, intentionally or not, it probably will. But, if we condemned any tool that’s capable of plagiarism without acknowledging that they’re also helpful in the creation process, we’d still be living in caves drawing stick figures on the walls.
One problem is that people see those whose work may no longer be needed or as profitable, and they rush to defend it, even if those same people claim to be opposed to capitalism.
They need to go “yes, this will replace many artists and writers… and that’s a good thing because it gives everyone access to creating bespoke art for themselves,” but at the same time realize that while this is a good thing, it also means a societal shift to support people outside of capitalism is needed.
it also means a societal shift to support people outside of capitalism is needed.
Exactly. This is why I think arguing about whether AI is stealing content from human artists isn’t productive. There’s no logical argument you can really make that a theft is happening. It’s a foregone conclusion.
Instead, we need to start thinking about what a world looks like where a large portion of commercially viable art doesn’t require a human to make it. Or, for that matter, what does a world look like where most jobs don’t require a human to do them? There are so many more pressing and more interesting conversations we could be having about AI, but instead we keep circling around this fundamental misunderstanding of what the technology is.
What you’re giving as examples are legitimate uses for the data.
If I write and sell a new book that’s just Harry Potter with names and terms switched around, I’ll definitely get in trouble.
The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it’s not even always clear to the user.
AI can basically throw out a Harry Potter clone without you knowing because it’s trained on that data, and that’s a huge problem.
Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don’t think that’s problematic. There’s already an insane amount of fan fiction out there without the names swapped that I can read, and that’s all fair use.
I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I’m willing to bet very few creators would prefer to exist lol. But as long as they don’t overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there’s no copyright violation.
The important part is that it won’t just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that’s not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.
I just realized I misread what you said, so that wasn’t entirely relevant to what you said but I think it still stands so ig I won’t delete it.
But I asked both GPT3.5 and GPT4 to give me Harry Potter with the names and words changed, and they can’t do that either. I can’t speak for all models, but I can at least say the two owned by the people this thread was about won’t do that.
…with the prevalence of clickbaity bottom-feeder news sites out there, i’ve learned to avoid TFAs and await user summaries instead…
(clicks through)
…yep,
~~seven~~ nine ads plus another pop-over, about 15% of window real estate dedicated to the actual story…

The issue is that LLMs do often spit out, verbatim, things they plagiarized from other sources. The deeper issue is that even if/when they stop that from happening, the technology is clearly going to make most people agree our current copyright laws are insufficient for the times.
The model in question, plus all of the others I’ve tried, will not give you copyrighted material
That’s one example. Plus, I’m talking generally about why this is an important question for a CEO to answer, and why people think LLMs may infringe on copyright and be bad for creative people.
I’m talking generally about why this is an important question for a CEO to answer …
Right, which your only evidence for is “LLMs do often just verbatim spit out things they plagiarized from other sources” and that they aren’t trying to prevent this from happening.
Which is demonstrably false, and I’ll demonstrate it with as many screenshots/examples you want. You’re just wrong about that (at least about GPT). You can also demonstrate it yourself, and if you can prove me wrong I’ll eat my shoe.
Yep here you go. It’s currently a very famous lawsuit.
Actually, neural networks do verbatim reproduce this kind of content when you ask the right question, such as “finish this book”, and the creator hasn’t censored it out well.
It uses an encoded version of the source material to create “new” material.
Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.
Actually it’s the opposite, you need to train a network not to reveal its training data.
“Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”
The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.
It’s not designed to do that because they don’t want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.
When given the right prompt (or image-generation query), they will replicate it exactly, because that’s how they were trained in the first place: reproducing their source data with as few neurons as possible, and tweaking the weights when the output isn’t correct.
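To make that concrete, here’s a toy of my own (nothing to do with OpenAI’s actual code): an n-gram “model” whose context window is large relative to its training text is effectively a lookup table, and greedy generation replays the training data verbatim.

```python
def train(text, k):
    """Count, for every k-character context, which characters follow it."""
    model = {}
    for i in range(len(text) - k):
        ctx, nxt = text[i:i + k], text[i + k]
        model.setdefault(ctx, {}).setdefault(nxt, 0)
        model[ctx][nxt] += 1
    return model

def generate(model, prompt, max_len, k):
    """Greedy decoding: always emit the most common continuation."""
    out = prompt
    while len(out) < max_len:
        dist = model.get(out[-k:])
        if not dist:
            break  # context never seen in training: nothing left to say
        out += max(dist, key=dist.get)
    return out

corpus = "Copyright law has been unfit for purpose for well over 30 years now."
model = train(corpus, k=8)
# With k=8 every context in this short corpus is unique, so "generation"
# is just verbatim recall of the training text:
print(generate(model, corpus[:8], 500, k=8))  # prints the corpus exactly
```

Real networks are vastly more compressed than this lookup table, but the failure mode is the same: when capacity is large relative to how often a passage was seen, generation can collapse into recall.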
That is a little like saying every photograph is a copy of the thing it depicts. That is just factually incorrect. I have many three-layer networks that are not the thing they were trained on. As a compression method they can be very lossy, and in fact that is often the point.
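To illustrate the lossy side with a toy of my own (again, not any production model): shrink an n-gram sketch’s context down to one character and it can’t reproduce its training text at all; the merged contexts turn “generation” into a drifting mash.

```python
def train_counts(text, k=1):
    # Map each k-character context to counts of the character that follows it.
    model = {}
    for i in range(len(text) - k):
        model.setdefault(text[i:i + k], {}).setdefault(text[i + k], 0)
        model[text[i:i + k]][text[i + k]] += 1
    return model

def greedy(model, prompt, max_len, k=1):
    out = prompt
    while len(out) < max_len and out[-k:] in model:
        dist = model[out[-k:]]
        out += max(dist, key=dist.get)  # most frequent continuation
    return out

corpus = "Copyright law has been unfit for purpose for well over 30 years now."
tiny = train_counts(corpus, k=1)
print(greedy(tiny, corpus[0], 60, k=1))
# The single-character "memory" merges many contexts together, so the
# output is a lossy mash ("Cor for for ..."), not a copy of the corpus.
```

Same training data, same algorithm; only the capacity changed. Whether a model memorizes or abstracts is a question of degree, not of kind.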
I almost want to believe they legitimately do not know nor care that they’re committing a gigantic data and labour heist, but the truth is they know exactly what they’re doing, and they rub our noses in it.
Look guys! I’m stealing from Tolkien!
I’d say not really, Tolkien was a writer, not an artist.
What you are doing is violating the trademark Middle-Earth Enterprises has on the Gandalf character.
The point was that I absorbed that information to inform my “art”, since we’re equating training with stealing.
I guess this would have been a better example lol. It’s clearly not Gandalf, but I wouldn’t have ever come up with it if I hadn’t seen that scene
I don’t think anyone’s going to pay for your version of ChatGPT
Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?
Yeah, the fact that AI progress just relies on “we will make so much money that no lawsuit will meaningfully alter our growth” is really infuriating. The fact that the general audience apparently doesn’t care is even more infuriating.
Then wipe it out and start again once you have sorted out where your data is coming from. Are we acting like you haven’t built datacenters packed full of NVIDIA processors for exactly this sort of retraining? They are choosing to build AI without proper sourcing; that’s not an AI limitation.