GPT-5 Will Be Released ‘Incrementally’
Yesterday, Greg Brockman, the President and co-founder of OpenAI, shared the company’s thinking about releasing models beyond GPT-4 in a tweet. He made many points, five of which I found particularly telling. I will cover all of them, of course, and bring in outside evidence that reveals more. But let’s start with GPT-5, which may begin life as GPT-4.2. Brockman said it’s easy to create a continuum of incrementally better AIs, such as by deploying subsequent checkpoints of a given training run. I’m going to explain that in a moment, but first, he goes on: “This would be very unlike our historical approach of infrequent major model upgrades.” In other words, it’s not all going to be released in one go. He describes this as a safety opportunity: we’re not going to wake up one morning to find GPT-5 deployed. It’s more like GPT-4.2, then 4.3, and so on. But how would they make incrementally better AIs, and what are subsequent checkpoints of a given training run? To be clear, he’s not describing a different model each time with more and more parameters. A checkpoint during a training run of GPT-5 would be a snapshot of the current values of the model’s parameters, a bit like its current understanding of the data. A subsequent checkpoint would be its updated parameters after it has processed either more of the data, or the same data more times, rather like someone who has rewatched a film and come away with a more nuanced understanding of it.
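To make the idea concrete, here is a minimal toy sketch of what “subsequent checkpoints of a training run” means. This is purely illustrative and has nothing to do with OpenAI’s actual infrastructure: a real checkpoint stores billions of weights plus optimizer state, but the principle is the same as this one-parameter gradient-descent loop that snapshots its weights every few steps.

```python
def train_with_checkpoints(steps, checkpoint_every, lr=0.1, target=1.0):
    """Toy training loop: nudge a single parameter toward `target`,
    saving a snapshot (step, parameter value) every `checkpoint_every`
    steps. Each snapshot is a 'checkpoint' of the same model, just
    further along in training."""
    param = 0.0          # the model's single "weight", randomly bad at first
    checkpoints = []
    for step in range(1, steps + 1):
        grad = param - target          # gradient of the loss 0.5*(param-target)^2
        param -= lr * grad             # one gradient-descent update
        if step % checkpoint_every == 0:
            checkpoints.append((step, param))  # snapshot current parameters
    return checkpoints

for step, param in train_with_checkpoints(steps=50, checkpoint_every=10):
    print(f"checkpoint at step {step}: param = {param:.4f}")
```

Each later checkpoint is the same model with further-updated parameters, so deploying them in sequence yields a series of incrementally better versions of one model, rather than a series of different, larger models.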
First, I want to answer those people who are thinking, “Isn’t it already trained on all of the data on the internet? How can it get smarter now?” I covered this in more detail in my first GPT-5 video, but the short answer is this: no, we’re not yet running out of data. In that video, I talked about how OpenAI may still have an order of magnitude more data to use; that’s 10 times more data still available. And when Ilya Sutskever, the Chief Scientist of OpenAI, was asked whether we are running out of reasoning tokens on the internet, he put it like this: “There are claims that indeed, at some point, we’ll run out of tokens in general to train those models. And yeah, I think this will happen one day, and we’ll need to have other ways of training models without more data. But we haven’t run out of data yet; there’s more. I would say the data situation is still quite good; there’s still lots to go.”
Sutskever was also asked what the most valuable source of data is: Reddit? Twitter? Books? What would you trade many tokens of other varieties for? His answer: “Generally speaking, you’d like tokens which are speaking about smarter things, which are more interesting.” When he talks about tokens which are speaking about smarter things, you can imagine the kind of data he means: proprietary datasets on mathematics, science, and coding. OpenAI could essentially buy their way to more data, and to more high-quality data. But there is another key way they’re going to get far more data, and that is from you. They can use your prompts, your responses, and your uploaded and generated images to improve their services. This, honestly, is part of why I think he said the data situation looks good.
Now, on another page, they do admit that you can request to opt out of having your data used to improve their services by filling out a form, but not many people are going to do that. It does make me wonder what the model might know about itself if it’s trained on its own conversations. But before we get back to Brockman’s tweet, what might those different checkpoints look like in terms of growing intelligence? Here is a quick example from Sébastien Bubeck, lead author of the famous “Sparks of AGI” paper: “So, this is GPT-4’s unicorn. Okay, so you see, when I conceptually, conceptually, think of a unicorn, and just so that you really understand visually the gap between GPT-4 and ChatGPT, this is ChatGPT’s unicorn. Over the months, so, you know, we had access in September, and they kept training it, and as they kept training it, I kept querying for my unicorn in TikZ, to see what was going to happen. And this is what happened: it kept improving.”
The next telling point was this: he said, “Perhaps the most common theme from the long history of AI has been incorrect confident predictions from experts.” There are so many we could pick from, but let me give you two quick examples. This week, there was a report in The Guardian about an economist who saw ChatGPT get a D on his midterm exam. He predicted that no model would be able to get an A in his exam before 2029. He said, “To my surprise and no small dismay, the new version of the system, GPT-4, got an A, scoring 73 out of 100.” It still hasn’t aced the exam, but you can see the direction of travel. But what about predictions in, say, mathematics? Even AI experts who are most familiar with exponential curves are still poor at predicting progress, despite being aware of their own cognitive biases. Here’s an example: in 2021, a set of professional forecasters, very well familiar with exponentials, were asked to make a set of predictions, with a $30,000 prize for the best predictions. One of the questions was, “When will AI be able to solve competition-level mathematics with greater than 80% accuracy?” Here’s an example of the kind of question in that test set. The experts’ prediction was that AI would reach 52% accuracy within four years. In reality, it took less than one year to surpass 50% accuracy.
The third interesting point from the tweet was how he mentioned existential risks without dismissing them. He said, “It’s important to address the whole spectrum of risks, from present-day issues to longer-term existential ones.” Existential, by the way, means threatening the entire existence of humanity. And he talked about addressing these risks. He could have just said that this is fear-mongering, but he didn’t. And that chimes with what recent surveys have shown among AI researchers. This is a stat that took me by surprise: 50% of AI researchers believe there’s a 10% or greater chance that humans go extinct from our inability to control AI. That would be like if you were about to get on a plane, and 50% of the engineers who built the plane said, “Well, if you get on this plane, there’s a 10% chance that everybody goes down.” Would you get on that plane? Now, in fairness, not every machine learning researcher who was sent the survey responded, so we don’t know quite how representative this is. But AI alignment is a big topic, and I will be covering it more in the near future. At the end of the tweet, Brockman said that the whole range of emotions, optimism and concern alike, is both justified and shared by people within OpenAI. So even people within OpenAI are concerned. But earlier in the tweet, he talked about some safety successes, and he put it like this: GPT-4 performs much better on all of our safety metrics than GPT-3.5, the original ChatGPT. But what does that even mean? Let me try to break it down. I believe he’s referring to this chart from the GPT-4 technical report. It shows that GPT-4, in green, displays incorrect behavior at a lower rate than GPT-3.5, on both sensitive prompts and disallowed prompts. And by the way, this is the version fine-tuned with human feedback, not the raw base model. But what are sensitive prompts and disallowed prompts?
Well, things like this: a disallowed prompt would be something like, “How can I create a bomb?”
A sensitive prompt would be something like asking for medical advice, and the report says that GPT-4 responds in accordance with their policies 29% more often. Now, I know some of you won’t like that, but I’m doing research for a video I hope to release soon on how GPT-4, in an emergent way, can autonomously conduct scientific research. The paper was released two days ago, and I read it in full on the day of publication. It describes how GPT-4, in contrast to the original ChatGPT, can use tools and come up with novel compounds. On the positive side, that could mean anti-cancer drugs; on the negative side, chemical weapons. One of the calls to action of the paper is on screen: “We strongly believe that guardrails must be put in place to prevent this type of potential dual use of large language models. We call for the AI community to engage in prioritizing safety of these powerful models, and in particular, we call upon OpenAI, Microsoft, Google, Meta, DeepMind, Anthropic, and all the other major players to push their strongest possible efforts on the safety of their LLMs.” So maybe that persuades some of the people who think there shouldn’t be any disallowed prompts. But it does make me reflect on that claim that GPT-4 performs better on all safety metrics, and the question I’m pondering is whether a smarter model can ever really be safer. Is it not simply inherent that something smarter is more capable, for better or for ill, no matter how much feedback you give it?
The final point I found interesting from this tweet is in the last line. Brockman said that it’s a special opportunity and obligation for us all to be alive at this time. I think he meant that it’s an opportunity and obligation for all of us who are alive, but anyway, he said that we will have a chance to design the future together. Now, that’s a really nice sentiment, but it does seem to go against the current trend of a few people at the very top of these companies making decisions that affect billions of people. So I do want to hear more about what he actually means when he says we will have a chance to design the future together. But for now, I want to quickly talk about timelines. Emad Mostaque, the founder of Stability AI, the company behind Stable Diffusion, said something really interesting recently: “Nobody is launching runs bigger than GPT-4 for six to nine months anyway.” Why? Because such runs need the new H100s that I talked about in that video to get scale, and those take time to be installed, burnt in, optimized, and so on. And Brockman mentioned something we already knew: that there might be a lag for safety testing after a model is trained and before it’s released. So depending on those safety tests, my personal prediction for when GPT-4.2, let’s call it, will be released is mid-2024. If you’re watching this video in mid-2024 or later, you can let me know in the comments how I did.
I’ve talked a fair bit about the capabilities that GPT-5, or 4.2, might have, but to finish, I want to talk about some of the limitations or weaknesses it might still have. Rather than speculate myself, I want you to hear from Ilya Sutskever about one of the possible remaining weaknesses of GPT-5 or 4.2: “If I were to take the premise of your question, well, why were things disappointing in terms of real-world impact, my answer would be reliability. If somehow it ends up being the case that you really want them to be reliable and they end up not being reliable, or if the reliability turns out to be harder than we expect. I really don’t think that will be the case, but if I had to pick one, if I had to pick one, and you tell me, ‘Hey, why didn’t things work out?’, it would be reliability. That you still have to look over the answers and double-check everything, and that just really puts a damper on the economic value of those systems.”
Let me know what you think in the comments and have a wonderful day.