ChatGPT has subverted everyone’s predictions about automation. Just a few years ago, it seemed most likely that the manual, boring, rote jobs would be automated first—but with GPT and the other newest gargantuan deep learning models like DALL-E, it now seems that writers, artists, and programmers are the most vulnerable to displacement. Everyone’s freaking out about it,
including me, except mine is more of a cynical freak-out: I don’t want to live in a world where AI content is ubiquitous and human content is sparse and poorly incentivized—if only because the professions of writer, artist, programmer, etc. are some of the most fulfilling vocations out there. If the technological trend continues, we’re facing a future world where intellectual work no longer exists. This is the worst imaginable end-stage-capitalism dystopia, in which the only ways to make money are grueling physical jobs like nursing and working in commercial kitchens (if you work in a field like that, you have my deepest respect).

I don’t think a language model can replace a programmer—it can only convincingly fool a hiring manager into thinking it can. (Or I don’t know, maybe it will take like two more years of progress before the hiring managers are fooled. It managed to fool this guy.) The same is true of writing and art—ChatGPT can’t actually replace a good human writer (yet), but it can certainly convince someone that it can do the job well enough for less money. It can certainly get literary magazines shut down by flooding their submission pipelines with its polite sludge.

These generative models create a double whammy of badness: the programmer will be out of a job, and the company will find its infrastructure crumbling for every seasoned programmer it let go. Writers and artists won’t be able to make a living from their work, and the content that fills the gap will be horribly banal—where will we go to satisfy our curiosity then?
Whether you think ChatGPT is wonderful or terrible, I hope we can agree on this: people should have the right to control whether the things they create are used to train the massive AIs of the massive for-profit tech corporations. You may think that OpenAI doesn’t need permission to use creations that are publicly available on the internet, but hopefully you agree that a person should be able to disallow OpenAI from using the things they create, even if they want to share those things with the world. And anyway, maybe OpenAI should need permission to use creations that are publicly available on the internet.

I think this is the direction that discussion and policy need to move now that generative models are becoming ubiquitous. It is already at best questionable whether OpenAI et al. should be allowed to use online content without permission from the creator. And it is already at best questionable whether ChatGPT and the like represent a net good for society—even putting aside the potential existential risk to humanity. There needs to be some sort of regulation on this newish industry which feeds off of all of us, and it needs to be enforced. So the question has to be: what sort of regulation is fair and feasible?
There’s been talk of compensating people for their contributions to the training data, which is a nice, utopian idea, but it’s not feasible to implement. How do you track down everyone who contributed? How do you determine the value of each contribution? And do you really think the compensation would amount to anything more than pennies?
On the more pessimistic end of things, there’s the call to halt AI research. Keep in mind that this isn’t a call to make it illegal to use neural networks; it’s only a call to stop creating neural networks at the scale that’s at the limits of our current capabilities—basically, a call to hold off on GPT-5. I think that would be great, and I don’t think we’d really be giving anything up. There are genuinely good applications of “AI,” like deep learning for protein folding (that’s biology research using AI, not AI research; AI research is about the bleeding edge of AI itself), but I don’t think anyone is calling for a pause on that. I don’t think ChatGPT and the like are profoundly useful or good for society. I think they’re mostly harmless, but then what about the next generation of GPT & co.? What sort of consequences will we be dealing with in five years, or even next year, if these developments continue unchecked?
Alas, it’s unlikely there will be a halt on AI research, even if half of the world’s population signed that letter. OpenAI, Google, and Microsoft aren’t just going to say, “Okay, you’re right, everyone, let’s shut it down,” and I’m not optimistic that the government would be able to enforce a full stop even if we were facing a more immediate, obvious, and severe threat.
In the middle, there’s this idea, which I haven’t seen discussed much: we should all simply refuse to be included in training data, or at least we should all have the right to refuse to be included in training data.
If you’re an individual, and you’re worried about all this, all you have to do is either explicitly disallow the things you produce from being included in AI training datasets, enable a platform setting that does this for you, or get off the platform altogether if it doesn’t give you that freedom.
If you’re a governing body, all you have to do is go to OpenAI, Google, and Microsoft (importantly, these and maybe a couple of other organizations are really the only ones capable of the sort of AI training we’re talking about) and require that they obtain explicit permission from the creator of every observation in their dataset. (I know nothing about corporate policy. Maybe this could never be enforced and is, in general, terribly naive. But it has to be easier to enforce than an absolute shutdown of AI research or paying people for their contributions to datasets.)

The way these “innovations” in AI like ChatGPT work is basically that they amass absolutely disgusting quantities of data. How disgusting, you ask? Well, OpenAI no longer tells us how big their datasets are, or where the content comes from. That’s how disgusting.
The actual innovation came in 2017, with what is turning out to be one of the most important papers of all time, titled “Attention Is All You Need.” I remember reading this paper over and over in 2018-2019 while I was implementing and training these models at work. Basically, the authors discovered that you could get a lot more bang for your buck out of neural networks trained on text data by constructing them entirely out of “attention mechanisms” (the specifics of which are relatively interesting if you work in machine learning, but not interesting at all otherwise). This led to the “transformer architecture,” which pretty much revolutionized machine learning for text (GPT stands for Generative Pre-trained Transformer—the first version was created in 2018. Before that, OpenAI was still working on reinforcement learning [note the date on that link—2017!], and their arena of choice was one of my all-time favorite video games, DotA 2).
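For the curious, here is a minimal sketch of the scaled dot-product attention the paper is built around, written in NumPy. It’s illustrative only: the real models stack many layers of this, add multiple heads, learned projection matrices, and masking, and run at enormous scale.

```python
# Minimal sketch of scaled dot-product attention (illustrative, not production code).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, d_model)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mixture of the value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings, self-attention (Q = K = V)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

That weighted-mixture step is the whole trick: every token gets to pull in information from every other token, with the weights learned from data.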
(As an interesting aside: the paper “Attention Is All You Need” was so popular and important and its title was so cute and fun that researchers copied it to an irritating degree, even in fields outside machine learning: see “Diversity Is All You Need,” “A Lip Sync Expert Is All You Need”, “Empathy Is All You Need,” and 29,000 other results on Google Scholar.)
Since 2017, machine learning researchers have pretty much just been throwing more and more data at these transformer models and getting better and better results. It turns out that the performance of this architecture scales remarkably well, and remarkably far, with the size of the dataset (notably, Sam Altman has said that he thinks we’re approaching the limits of this relationship—I’ll let you decide what to make of that).
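To give a feel for what that relationship looks like, here is a tiny illustration of the kind of power law reported in the scaling-law literature (the constants are roughly the ones from Kaplan et al., 2020, quoted from memory; treat the specific numbers as illustrative, not authoritative):

```python
# Illustrative power law: loss keeps falling as the dataset grows, but with diminishing returns.
def loss_from_dataset_size(tokens, d_c=5.4e13, alpha_d=0.095):
    """Hypothetical cross-entropy loss as a function of training tokens."""
    return (d_c / tokens) ** alpha_d

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> loss ~ {loss_from_dataset_size(tokens):.2f}")
```

The curve never flattens completely, which is exactly why the incentive has been to keep hoarding more data.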
Understanding the history here is helpful in two ways. Firstly, it gets you past the overly simple “All it’s doing is predicting the next word!” way of comprehending these models. While that isn’t strictly wrong, it would be just as accurate to say, “It’s a system whose complexity rivals that of a mammalian brain, that somehow encodes a rich representation of the entire English language, and that responds to written language with novel, relevant, and apparently intelligent language, based on an extremely complex network of mathematical relationships which resemble, to some degree, the way humans process written language.”
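To be fair to the simple framing, the mechanical loop really is “predict the next token, append it, repeat.” Here’s a toy version of that loop using a bigram lookup table in place of a transformer; the loop is the same, it’s the model inside it that is incomparably more sophisticated:

```python
# Toy "predict the next word" loop: a bigram model over a tiny corpus.
# GPT runs conceptually the same loop, with a transformer instead of a lookup table.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count which word follows which.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def generate(start, n_tokens=8, seed=0):
    random.seed(seed)
    out = [start]
    for _ in range(n_tokens):
        counts = follows[out[-1]]
        if not counts:
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])  # sample the next word
    return " ".join(out)

print(generate("the"))
```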
But secondly, it exposes the fact that GPT-4 is more impressive than GPT-3 mostly just because the training dataset is bigger. It’s really not a function of clever programmers inventing the smartest AI. There’s some of that, yes, but the project is really more about getting a bigger dataset. Acquiring and maintaining training data is where most of the cost and effort of creating these models comes from. If we introduce even a little bit of friction into an organization’s ability to do this data-hoarding, it will become a lot harder to produce these sorts of models, because they’re operating at the limits of what is possible in terms of big data—it might even become impossible to make further progress. You may think that’s a good thing or a bad thing, but it’s hard not to feel like OpenAI is being shady by not disclosing any details about their training data, and maybe regulators need to step in and do something about that.
In fact, it’s already happening. Take a look at this WSJ coverage of recent legislation from the EU, and the discussion on Hacker News. This is already the way things are headed, and that’s good, even if you think ChatGPT is good for society. What I’m advocating for is basically just an extension of privacy (something we’ve already agreed as a society is important, and which we’re already building policy and infrastructure for): AI privacy.
It seems like people want this. DeviantArt prohibited the use of content for AI datasets in response to user feedback. And look at all the downvotes on this Stack Overflow post, where the platform announces that content will be used in AI training.
Is putting a red cancel emoji in your Instagram bio going to stop OpenAI from downloading your art and using it in training data anyway? No. I mean, the Instagram terms of service could just say that you give up all rights by using the platform, and then they’ve got full legal freedom to sell your content to OpenAI.
Are people going to unite in rebellion against platforms that won’t refuse to share their data with OpenAI? No, probably not.
For instance, I don’t think people are going to stop using Stack Overflow. That doesn’t mean their decision isn’t a bummer. That doesn’t mean we shouldn’t push Stack Overflow to reverse its position like DeviantArt did. And that doesn’t mean someone couldn’t build a very similar programming Q&A site that doesn’t allow AI consumption, and see whether people prefer to use that one. The platforms need the users—if everyone starts leaving because they don’t want AI to replace them, the platforms will stop sharing content with AI—but the users also need the platforms, and platforms are hard to build, especially when you consider the requirement of strong network effects (good luck getting your Stack Overflow alternative off the ground: step 1 is getting everyone to leave Stack Overflow). So the platforms have at least half of the power in this dynamic.

But this is the sort of way in which regulatory pressure actually does work: by shifting the power balance toward the users, toward the creators. We probably can’t create regulations that directly prevent superintelligent or somehow malignant AIs from being created, but we can create regulations that pressure platforms to behave a certain way, and thereby improve the world by improving the way these technologies are allowed to be developed.