I’ve been playing around with OpenAI’s new GPT-3 language model. When I got beta access, the first thing I wondered was, how human is GPT-3? How close is it to passing a Turing test?
How It Works
Let me explain how exactly I’m generating these conversations. GPT-3 is a general language model, trained on a large amount of uncategorized text from the internet. It isn’t specific to a conversational format, and it isn’t trained to answer any specific type of question. The only thing it does is, given some text, guess what text comes next.
So if we want GPT-3 to generate answers to questions, we need to seed it with a “prompt”. I’m using this prompt to initialize all of the Q&A sessions:
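It's a short list of question-and-answer pairs in a fixed Q:/A: format, along these lines (a representative sketch based on OpenAI's published Q&A example; the exact wording may differ slightly):

```
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.
```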
This is the default prompt suggested by OpenAI for Q&A, so I’m not cherry-picking it to prove anything. The point of this prompt is just to show GPT-3 that we’re doing questions and answers, not to provide it with information. In the prompt, both the questions and the answers are provided by a human. For everything that follows, the answers are generated by GPT-3.
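Mechanically, each new question gets appended to the prompt with a trailing “A:”, and the model’s completion is read back as the answer. A minimal sketch using the beta-era Python client (assuming the `davinci` engine; parameter names may have changed since):

```python
import openai

openai.api_key = "YOUR_API_KEY"

# The Q&A prompt that seeds every session, ending ready for a new question.
PROMPT = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.
"""

def ask(question):
    # Append the new question and let the model complete the "A:" line.
    text = PROMPT + "\nQ: " + question + "\nA:"
    response = openai.Completion.create(
        engine="davinci",
        prompt=text,
        max_tokens=100,
        temperature=0,
        stop="\n",  # stop at the end of the answer line
    )
    return response["choices"][0]["text"].strip()

print(ask("How many eyes does a giraffe have?"))
```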
Common Sense
Traditionally, artificial intelligence has struggled with “common sense”. But GPT-3 can answer a lot of common-sense questions.
Ten years ago, if I had this conversation, I would have assumed the entity on the other end was a human. You can no longer take it for granted that an AI does not know the answer to “common sense” questions.
How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of “debug output” to answer that question. I don’t know for sure; I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common-sense questions, we need to think of questions about things so mundane that they never appear on the internet.
It’s only 4/5. We’re closer to stumping GPT-3 here. I think a human would be pretty close to 100% on these questions. It makes sense these are trickier - there probably isn’t any web page that compares toasters and pencils by weight. It’s only indirectly that humans gain this knowledge.
This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.
Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”
The lesson here is that if you’re a judge in a Turing test, make sure you ask some nonsense questions, and see if the interviewee responds the way a human would.
Trivia Questions
GPT-3 is quite good at answering questions about obscure things.
Oops, a repeat snuck in with question 4, but a human would make that sort of error too. GPT-3 seems to be above human level on this sort of question. The tricky thing for applications, I think, is figuring out when the answer can be relied on. The OpenAI API does expose more data than just the text here, so perhaps something clever is possible.
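For example, the completions endpoint can return per-token log probabilities, which could serve as a crude confidence signal. A minimal sketch, assuming the beta-era Python client and its `logprobs` parameter:

```python
import math

import openai

openai.api_key = "YOUR_API_KEY"

# A prompt ending in "A:" so the model completes the answer line.
prompt = "Q: Who won the World Series in 1995?\nA:"

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=50,
    temperature=0,
    stop="\n",
    logprobs=1,  # also return the log probability of each generated token
)

choice = response["choices"][0]
token_logprobs = choice["logprobs"]["token_logprobs"]

# One crude heuristic: an answer is suspect if any of its tokens was a long shot.
least_likely = min(token_logprobs)
print("Answer:", choice["text"].strip())
print("Least confident token probability: %.3f" % math.exp(least_likely))
```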
In general, if you are trying to distinguish an AI from a human, you don’t want to ask it obscure trivia questions. GPT-3 is pretty good at a wide variety of topics.
One trend that continues from the common-sense questions is that GPT-3 is reluctant to express that it doesn’t know the answer. So invalid questions get wrong answers.
These wrong answers are actually fascinating! None of these were presidents of the United States, of course, since the US didn’t exist then. But they are all prominent political figures who were in charge of some US-related political entity around that time. In a sense, they are good guesses.
A bleak view of a dystopian future.
Encouraging for a Bengals fan, but perhaps not the most objectively accurate prediction. We’ll have to wait and see.
Logic
People are used to computers being superhuman at logical activities, like playing chess or adding numbers. It might come as a surprise that GPT-3 is not perfect at simple math questions.
This is where the generic nature of GPT-3 comes into play. It isn’t just that the model is generic, though; the architecture of neural networks themselves is a factor. As far as I know, there is no neural network capable of basic arithmetic, like addition and multiplication on large numbers of digits, where that ability comes from training data rather than hardcoding.
It’s funny, because these operations are trivial for a purpose-built program. But recursive logic, which performs some operation and then repeats it several times, often doesn’t map well onto the architecture of a neural net.
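For contrast, the grade-school algorithm is just a few lines in a conventional language: apply one small step per digit, carry, and repeat. A sketch in Python:

```python
def add_digit_strings(a, b):
    """Add two non-negative integers given as decimal strings, digit by digit."""
    a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
    carry, digits = 0, []
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        digits.append(str(total % 10))  # keep the ones digit
        carry = total // 10             # carry the rest to the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_digit_strings("123456789", "987654321"))  # 1111111110
```

It’s exactly this “repeat the same step once per digit” structure, trivial in a loop, that a fixed-depth network has to approximate all at once.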
An interesting corollary is that GPT-3 often finds it easier to write code that solves a programming problem than to solve the problem for one example input.
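To illustrate with a hypothetical example of my own: asked to “write a function that reverses a list”, GPT-3 can produce perfectly good code like the following, while the direct question “what is [1, 2, 3, 4, 5] reversed?” is more likely to trip it up.

```python
def reverse_list(items):
    # Build the reversed list by walking the input from back to front.
    result = []
    for i in range(len(items) - 1, -1, -1):
        result.append(items[i])
    return result

print(reverse_list([1, 2, 3, 4, 5]))  # [5, 4, 3, 2, 1]
```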
This problem shows up in more human-sounding questions as well, if you ask about the result of a sequence of operations.
It’s like GPT-3 has a limited short-term memory, and has trouble reasoning about more than one or two objects in a sentence.
Additional Discussion
It’s important to understand that the GPT-3 model’s behavior can change drastically with different prompts. In particular, all of the examples above are using the same default prompt, which doesn’t give any examples of nonsense questions, or of sequential operations.
It’s possible to improve GPT-3’s performance on these specific tasks by including a prompt that solves similar problems. Here are some examples, with a sketch of the idea after the list:
- Nick Cammarata demonstrating a prompt that handles nonsense questions
- Gwern showing how GPT-3 can express uncertainty
- Gwern showing how GPT-3 can handle sequential operations
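For instance, in the spirit of the first example, a prompt can demonstrate that nonsense questions should be called out as nonsense, and GPT-3 will generally follow suit. A sketch (not the exact prompt from those links):

```
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: How do you sporgle a morgle?
A: That question is nonsense.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A:
```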
Right now, we are mostly seeing what GPT-3 can do “out of the box”. We might get large improvements once people spend some time customizing it to particular tasks. If you’d like to get a better sense of what can be done with prompt customization, Gwern’s exposition is excellent. Do read the whole thing.
Conclusion
We have certainly come a long way from the canned, pattern-matched responses that were the state of the art before modern neural networks.
GPT-3 is quite impressive in some areas, and still clearly subhuman in others. My hope is that with a better understanding of its strengths and weaknesses, we software engineers will be better equipped to use modern language models in real products.
As I write this, the GPT-3 API is still in a closed beta, so you have to join a waitlist to use it. I recommend that you sign up here and check it out when you get the chance.