A federal judge on Wednesday sided with Facebook parent Meta Platforms in dismissing a copyright infringement lawsuit from a group of authors who accused the company of stealing their works to train its artificial intelligence technology.
There is nothing intelligent about “AI” as we call it. It parrots based on probability. If you remove the randomness from the model’s sampling, it parrots the same thing every time based on its weights, and if those weights were trained on Harry Potter, it will consistently give you giant chunks of Harry Potter verbatim when prompted.
Most of the LLM services try to avoid this by adding arbitrary randomness to churn the soup. But that randomness is also inherently part of the cause of hallucinations, since the model can no longer preserve a single correct response as always the right way to answer a given query.
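Just as a rough sketch of what “removing the randomness” means in practice (the vocabulary and logits here are made up for illustration, not from any real model): greedy decoding is fully deterministic, temperature sampling is not.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick the next token id from raw model logits.

    temperature=0  -> greedy: always the argmax, same output every run.
    temperature>0  -> sample from the softmax distribution, runs differ.
    """
    if temperature == 0:
        return int(np.argmax(logits))          # deterministic "parrot" mode
    rng = rng or np.random.default_rng()
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# toy example: fake logits over a 5-token vocabulary
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_next_token(logits, temperature=0))    # always 0
print(sample_next_token(logits, temperature=1.2))  # varies run to run
```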
LLMs are insanely “dumb”; they’re just lightspeed parrots. The fact that Meta and these other giant tech companies claim it’s not theft because they sprinkle in some randomness just obscures the reality: their models are derivative of the work of organizations like the BBC and Wikipedia, while also being dependent on the works of tens of thousands of authors to develop their corpus of language.
In short, there was an ethical way to train these models. But that would have been slower. And the court just basically gave them a pass on theft. Facebook would have been entirely in the clear had it not stored the books in a dataset, which in itself is insane.
I wish I knew when I was younger that stealing is wrong, unless you steal at scale. Then it’s just clever business.
Except that breaking copyright is not stealing and never was. Hard to believe you’d ever see copyright advocates on FOSS and decentralized networks like Lemmy - it’s like people had their minds hijacked because “big tech is bad”.
What name do you have for making money off someone else’s work or data, without their consent and without compensation? If the tech were just tech, it wouldn’t need any non-consenting human input to work properly. These are just companies feeding on various types of data. If the justice system doesn’t protect an author, what do you think would happen if these same models started feeding off user data instead? Tech is good, ethics are not.
How do you think you’re making money with your work? Did your knowledge appear out of a vacuum? Ethically speaking, nothing is an “original creation of your own merit only” - everything we make is transformative by nature.
Either way, the debate is moot, as we’ll never agree on what is transformative enough to be harmful to our society unless it’s a direct 1:1 copy made with the explicit goal of displacing the original. But that’s clearly not the case with LLMs.
Accuracy and hallucination are two ends of a spectrum.
If you turn hallucinations down to a minimum, the LLM will faithfully reproduce what’s in the training set, but the result will not fit the query very well.
The other option is to turn the so-called temperature up, which makes replies fit the query better, but the hallucinations go up too.
In the end it’s a balance between getting responses that are closer to the dataset (factual) or closer to the query (creative).
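A small illustration of that trade-off, purely as a sketch with made-up logits: temperature rescales the model’s output distribution, so low values concentrate probability on the single most likely (dataset-like) continuation and high values spread it across many tokens.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = np.array(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# t=0.2 -> nearly all mass on the top token (closer to the dataset, "factual")
# t=2.0 -> mass spread across tokens (more "creative", more room to go off the rails)
```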
“hallucination refers to the generation of plausible-sounding but factually incorrect or nonsensical information”
Is an output a hallucination when the training data involved in the output included factually incorrect data? Suppose my input is “is the world flat” and then an LLM, allegedly, accurately generates a flat-earther’s writings saying it is.
Terrible judgement.
Turn the K value down on the model and it reproduces text near verbatim.
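Assuming “the K value” here means top-k sampling, a minimal sketch of why shrinking k pushes the output toward the single most probable (and potentially memorised) continuation:

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Keep only the k highest-scoring tokens, renormalise, and sample among them.

    k=1 is effectively greedy decoding: the model can only emit its single most
    probable continuation, so any memorised passage comes back near verbatim.
    """
    rng = rng or np.random.default_rng()
    logits = np.array(logits, dtype=float)
    top = np.argsort(logits)[-k:]               # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(top[rng.choice(k, p=probs)])

logits = [3.0, 1.0, 0.2, -0.5]
print(top_k_sample(logits, k=1))  # always token 0
print(top_k_sample(logits, k=4))  # any token, weighted by probability
```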
Ah the Schrödinger’s LLM - always hallucinating and also always accurate
The enemy is at the same time too strong and too weak.