With GPT-4 now stepping back from its starring role in ChatGPT, I want to share a few of my favorite memories from its launch.
I originally joined OpenAI as an engineer on the Applied team, but later moved into a hybrid role as OpenAI’s “science communicator.” That shift let me dive deep into technical work while also figuring out how best to explain it to the world. Shifting from one role to a very different one is common inside OpenAI and is one of the things that makes the organization special.
During my time there, I had the privilege of helping launch GPT-4. The comms team assigned me to the project, and I partnered with product manager Joanne Jang—an exceptionally talented colleague who now leads model behavior at OpenAI. Joanne’s ability to keep on top of everything from research to customer experience was incredible to watch.
The ChatGPT Disruption
We knew GPT-4 was going to be big. It was so much more capable than GPT-3 and much closer to what we thought AI should be. But what we hadn’t prepared ourselves for was how popular ChatGPT was going to be. ChatGPT was based on GPT-3.5, which in turn was an improved version of GPT-3: it took all the lessons we had learned and made a better model within roughly the same parameters.
We genuinely didn’t think ChatGPT was going to be the disruptor it became, because the base model, GPT-3.5, had been out for months. ChatGPT wasn’t any smarter (initially) than GPT-3.5; it was just easier to use. The night before the launch of ChatGPT, Ilya used it and determined that it just wasn’t good enough. He got unsatisfactory results with some of his test prompts and felt the model wasn’t there yet. While it didn’t meet Ilya’s requirements, it turns out hundreds of millions of other people felt differently.
The runaway success of ChatGPT left us scrambling to get GPT-4 out the door. We were (and they still are) a very small company. That meant more hats to wear for everyone already wearing multiple hats. ChatGPT was a wonderful gift, but it was a disruption on the way to GPT-4.
Making the launch video
The green and black striped logo for GPT-4 was created by the team I’d hired to help make the launch videos. Kornhaber/Brown (Andrew Kornhaber and Eric Brown) produced the PBS Space Time series on YouTube, and I brought them on board for the launch of DALL-E 2 because I loved the way they blended educational content and visual elements. I think it was Eric Brown (the Brown in Kornhaber/Brown) who created the motion graphic after listening to researchers talk about GPT-4. I loved writing with Eric because he had a really good grasp of how to break things down for people without dumbing them down.

Another little detail about the launch video is that we didn’t use titles for any of the OpenAI employees. Even to this day, OpenAI is an incredibly flat organization. I’d watched a DeepMind video where every talking head had a title, and it seemed like a caste system. While I don’t know if that’s really how it is there, I wanted to show that at OpenAI titles didn’t matter all that much. The one exception was the people from Microsoft who appeared; I was given very specific instructions from them about titles. Microsoft even flew one of their execs down on a private jet so he could be in the video.
Although I had a no-title policy for names in the video, I was a bit shameless about putting my own name in the examples used on the GPT-4 launch page.
Exploring Vision was very interesting
We spent a lot of time trying to find examples that clearly showed GPT-4’s capabilities, in particular its ability to take in images and reason over them. One night in the lead-up, I was working on vision examples, got hungry, and went to the kitchen to make something to eat. Standing in front of my refrigerator, I remembered a demo IBM had done years earlier where Watson would suggest a recipe based on ingredients you listed. I took out my phone, captured a photo, and uploaded it to GPT-4, and it suggested quesadillas based on what it saw. My wife and I had just moved into our home and the refrigerator was a bit bare; it looked like the saddest bachelor refrigerator.
When we were testing GPT-4 with Vision early on, I played with purely visual prompts, using symbols to make requests to the model. It was fascinating to see that you could prompt it with some text and a drawing on a Post-it note and get good results. It became apparent that you could get much better performance from the model if you tried to think about how the model “saw” images. Unlike humans, who move their eyes around a scene to pick up details and connections, vision models get one quick glance and have to reconstruct the details from very limited data. When you ask a model like o3 to process visual data, it will run a Python script to break the image into tiles so it can look at details more closely and look for connections.
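Here’s a rough sketch of that tiling idea, just as an illustration rather than anything OpenAI actually runs internally; the tile size, overlap, and file name below are arbitrary placeholders.

```python
# A minimal sketch of the tiling idea described above -- not OpenAI's internal
# implementation, just an illustration of splitting an image into overlapping
# tiles so each region can be inspected at higher effective resolution.
from PIL import Image


def tile_image(path, tile_size=512, overlap=64):
    """Split an image into overlapping tiles (tile_size and overlap are arbitrary)."""
    img = Image.open(path)
    width, height = img.size
    step = tile_size - overlap
    tiles = []
    for top in range(0, height, step):
        for left in range(0, width, step):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append((box, img.crop(box)))
    return tiles


# Each tile could then be shown to a vision model alongside the full image,
# so small details (handwriting, labels, the wiring between objects) aren't lost.
for box, tile in tile_image("whiteboard_photo.jpg"):  # hypothetical file name
    print(box, tile.size)
```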
When I gave the model a Rube Goldberg-style absurd machine, it was very good at explaining the cause and effect: “The bowling ball will roll down the ramp, hit the toaster, send the toast into the air, and wake up the cat.” Part of that capability came from the model being pretty good at inferring the relationships between objects, but it didn’t always have a strong sense of the physical connections. While I was creating examples for the vision model, someone pointed out that my handwriting was so bad it should be used for adversarial testing.
Long context
In looking for concrete, bullet-point examples of what made GPT-4 better than its precursor models, the ability to handle really long inputs stood out. GPT-3.5 could handle a maximum of 4,096 tokens (~3,000 words), while the long-context version of GPT-4 could handle 32,768 tokens (~25,000 words). As a small organization, we were often dependent on outside partners telling us how useful something was. One of the partners using the long-context model pointed out that it had some accuracy issues (a problem with all long-context models to this day), and based on their observations it was decided not to announce the long-context capability.
I’d been using it to help with research and found it very useful for finding names and situations in my books that I had trouble recalling. So I ran my own tests and found it was quite good at needle-in-the-haystack queries; the limitations grew roughly logarithmically with context length (they increased the longer the context), but the model was still quite usable for many purposes.
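For the curious, a needle-in-the-haystack check can be sketched in a few lines. This isn’t the harness I used at the time; the model name, filler text, and “needle” below are placeholders, and it assumes the current openai Python client with an API key set in the environment.

```python
# A rough sketch of a needle-in-the-haystack check: bury one fact in a long
# distractor text and see whether the model can retrieve it from various depths.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILLER = "The quick brown fox jumps over the lazy dog. " * 2000  # long distractor text
NEEDLE = "The secret password for the vault is 'marmalade-7'."


def needle_test(depth_fraction):
    """Insert the needle at a given depth in the filler and ask the model to find it."""
    split = int(len(FILLER) * depth_fraction)
    haystack = FILLER[:split] + "\n" + NEEDLE + "\n" + FILLER[split:]
    response = client.chat.completions.create(
        model="gpt-4-32k",  # placeholder for a long-context model
        messages=[
            {
                "role": "user",
                "content": haystack + "\n\nWhat is the secret password for the vault?",
            }
        ],
    )
    return "marmalade-7" in response.choices[0].message.content


# Check retrieval at several depths in the context window.
for depth in (0.1, 0.5, 0.9):
    print(depth, needle_test(depth))
```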
In the run-up to launch I ran some more tests to show that it was useful, and we ended up including it in the launch. I was very happy about this because I thought it was a great feature, and it gave me a solid example to showcase the difference between GPT-4 and GPT-3.5.
The example I used to show off long-context comprehension was summarizing the Wikipedia page for Rihanna’s Super Bowl performance. I used this because we knew the model had finished training before the event, so there was no way it could already have a summary of it.
I saw another AI company showcase their model’s long context by giving it The Great Gatsby and asking it to write the next chapter. This seemed like an unconvincing example to me because even if you were sure the text of the book wasn’t in the training data, there are probably thousands of examples of people summarizing it and breaking the book down. A more effective example would be to take an unpublished book and see if it could write a competent next chapter.
Naming things is hard (especially for OpenAI)
Naming things has never been OpenAI’s strong point. Part of the reason is that OpenAI is research driven, which means our research determined our product path and not the other way around. While there’s a logic to each individual name, there’s no cohesive strategy across them all. Part of this is that while one team is working on a model like GPT-4o (the ‘o’ stands for ‘omni’ because it can see, hear, and read), another team is working on a new category of model that changes the paradigm completely (like o1, where the ‘o’ kind of stands for ‘OpenAI’).
When we were getting ready to launch GPT-4, we hired a naming firm to help us find the right name for it. Several of us made the case that it should just be ‘GPT-4’ because there had never been anything in AI with that much anticipation and advance name recognition. Before Apple announced the ‘iPhone,’ people were already talking about the iPhone because of the iPod, iBook, and iMac that came before it. While it was more than a phone, you couldn’t ask for a better name than one people already associated with you.
After much consideration and back and forth the firm came to us with their suggestion: GPT-4…
Iceland
Sometimes things just happen at the right moment, when they’re needed. In getting ready to launch GPT-4, we were thinking a lot about how to make something like this useful for everyone and not just a cultural product of one group of people. An AI that helps you navigate the world of information should have a perspective broader than Northern California. In a planning meeting I suggested that adding vanishing languages to the GPT-4 training data could help preserve them and make the model accessible to future generations. I loved the idea that a child could talk with an AI a thousand years from now in a language that no longer existed.
As it happened, the week after this came up in a meeting, we had a delegation visiting from Iceland. We asked them if they’d be interested in helping us preserve Icelandic, and they were very enthusiastic. As a result, Icelandic was the second language, after English, that GPT-4 was specifically trained on. This ended up being a source of pride for Iceland; a delegation of OpenAI representatives who later visited were treated like visiting heads of state.
GPT-3.5
While we were getting GPT-4 ready for launch, an interesting thing happened: with all the people using GPT-3.5 in ChatGPT and the data we had about its performance, we were able to keep training newer versions of ChatGPT, and it kept improving, closing the gap with GPT-4 in certain meaningful ways. I had to pull some examples of things GPT-4 could do and GPT-3.5 couldn’t, because after an update GPT-3.5 was suddenly capable of those tasks. Conversely, as we post-trained GPT-4 for safety and to be better at certain tasks, I watched some of my favorite capabilities go away. The challenge with LLMs is that making a model really good at some tasks comes at the expense of making it not so good at others, and at the things in between.

While Gemini 2.5 is a really good model, the first several Gemini models felt like they were optimized only for benchmarks. Meanwhile Anthropic, which makes great models, created a fantastic coding model in Sonnet 3.5 that didn’t hold the top spot on some coding benchmarks, yet it was generally regarded as the best coding model even after the benchmarks said it wasn’t number one. Benchmarks have to be taken with a grain of salt.
Just to show you how weird model capabilities are, GPT-3.5 Turbo was better at chess than the released version of GPT-4, until later GPT-4 models came out. Chess has long been considered a symbol of intellectual prowess, and yet here was the older model beating the next generation. A lot of that has to do with post-training: the base model might actually be exceptional at something, but later training can curtail that.
My personal example of this was using GPT-4 to create memory palaces. The base model of GPT-4 created the best memory palace I’d ever seen from an LLM and understood the connections between words and visuals shockingly well. Later versions didn’t seem to have that ability. Take that with a grain of salt as well, because it’s a very subjective opinion.
GPT-4 and Video
While playing with GPT-4’s vision capabilities before the announcement, I found that I could extract frames from a video and send them to the model, and it did a great job of explaining what was happening. It could identify dance moves, magic tricks, golf swings, and so on. I made a little Python app that let me load a video clip, extract frames, and send them to GPT-4 for video understanding. I showed this to the research team and asked if we could add “video understanding” to the list of GPT-4 features.
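For anyone wondering what that looked like in practice, here’s a minimal sketch of the same frames-as-images trick written against today’s openai Python client (which didn’t exist in this form back then); the model name, frame interval, and file path are placeholders, not what my original app used.

```python
# A minimal sketch of "video understanding" via sequential frames:
# sample frames from a clip with OpenCV, then send them as images in one prompt.
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def sample_frames(video_path, every_n_frames=30, max_frames=8):
    """Grab every Nth frame from a clip and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    cap.release()
    return frames


frames = sample_frames("golf_swing.mp4")  # hypothetical clip
content = [{"type": "text", "text": "These are frames from a video, in order. Describe what is happening."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in frames
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```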
At the time there were rumors that Google was working on a true video comprehension model. Based on feedback from research, we decided not to showcase video comprehension with GPT-4 because it would have looked silly if we were just feeding it sequential images while they had true video understanding via something more sophisticated, like video tokens or some clever temporal method.
After GPT-4, Google announced Gemini and proudly declared it was the first LLM to handle video. This grabbed headlines and made Gemini look like a next-generation model. I was excited to see that they had solved the problem in a less janky manner than I had, and I dived into their research paper to see how they accomplished it. It turned out they had a loading script that extracted frames from the video and sent them to Gemini as a series of images, the same method I’d used. No fancy tokens or temporal mojo. As frustrating as that was for me, I was proud that our research team didn’t want to make a claim about something they weren’t sure about. I’m also proud that Sam Altman and the other execs listened. Google and DeepMind have incredible researchers; I think sometimes those researchers were probably a bit annoyed by the way their executives framed things.
Insanely great
Just a few years before working on GPT-4 at OpenAI, I’d been working in television. While I’d been interested in robotics and AI my whole life (I used to create chatbots with “memory” in BASIC when I was a kid, I knew Marvin Minsky, etc.), it seemed like being able to do interesting things in artificial intelligence was very far away, especially for someone starting in their forties. Thankfully, my curiosity was stronger than my sense of what was practical, and I began studying the field more closely: building GAN image generators, taking Jeremy Howard’s wonderful fast.ai course, and making lots and lots of projects. When I got a call out of the blue to experiment with GPT-3, it led to an incredible, life-altering experience. My last job before OpenAI was starring in a magic prank show on A&E (and swimming with great white sharks for a Shark Week special on the Discovery Channel). If you’d told me after GPT-2 came out that I’d have my name in the GPT-4 research report for both my model-capability work and my comms work, I’d have thought that sounded cool but out of reach. Life is crazy.
I’ve had an extremely lucky life so far and have gotten to do a lot of different things, from writing novels (I’m an Edgar and multi-time Thriller finalist, which still amazes me) to having my own television show and getting paid to swim with sharks. But none of that compares with being a tiny part of an incredible team and witnessing something truly magical happen. I got to be a fly on the wall at an amazing time. For someone who loved reading stories about the birth of the Macintosh, Atari, or Bell Labs, I got to see history happen up close and spend time with the people who were changing it. While I can share some experiences about working on GPT-4, it’s hard (even for a writer) to express what it truly meant.

