How ChatGPT, GPT-4 and beyond came to be (an origin story)
Note 1: This blog was not AI-generated
Note 2: Get early access to Genie’s AI lawyer here
For other parts of this series on transformer models, see here:
- Part 1: What are transformer models and LLMs?
- Part 2: How ChatGPT, GPT-4 and beyond came to be (this article)
- Part 3: Everything you need to know about GPT-4
GPT-2 and BERT
In February 2019, OpenAI released GPT-2. Its most popular competitor was BERT, created and open-sourced by Google in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, first released in October 2018.
Both models performed well at the time, though BERT appeared to be the preferred model. During this period, the key breakthrough was that these models came “pre-trained” to work reasonably well across a large variety of tasks. Researchers and practitioners could then “fine-tune” the models on further training data to improve accuracy for their specific needs. Indeed, the BERT paper specifically notes that a pre-trained BERT model can be fine-tuned with just one additional output layer on top of its multi-layered neural net, making it adaptable to a wide variety of tasks “without substantial task-specific architecture modifications”. As noted in my previous article, this was a key innovation.
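The fine-tuning recipe described above can be sketched in miniature. The snippet below is an illustrative toy, not BERT itself: the frozen “encoder” is a made-up bag-of-words featurizer standing in for a pre-trained transformer, and only the single new output layer is trained.

```python
import math

# Stand-in for a frozen pre-trained encoder: maps text to a small fixed
# feature vector. In real BERT this would be the [CLS] token's embedding.
def encode(text):
    feats = [0.0] * 8
    for tok in text.lower().split():
        feats[sum(map(ord, tok)) % 8] += 1.0
    return feats

# The single additional output layer we fine-tune: one weight per feature plus a bias.
w = [0.0] * 8
b = 0.0

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> probability of positive class

# Tiny invented sentiment task: 1 = positive, 0 = negative.
data = [
    ("great product love it", 1),
    ("terrible waste of money", 0),
    ("love this great buy", 1),
    ("terrible awful money pit", 0),
]

# Gradient descent on the output layer only; the "encoder" stays frozen.
for _ in range(200):
    for text, y in data:
        x = encode(text)
        g = predict(x) - y  # gradient of the log-loss w.r.t. the logit
        for i in range(8):
            w[i] -= 0.5 * g * x[i]
        b -= 0.5 * g
```

In practice the encoder’s weights are usually also updated at a small learning rate, but the point from the BERT paper stands: only a small task-specific head needs to be added.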
GPT-3 came out in June 2020, introduced in a paper entitled “Language Models are Few-Shot Learners”. The main “innovation” of GPT-3 was taking advantage of the relatively cheap training costs of transformers and simply building a massive model: 175bn parameters, to be exact (“10x more than any previous non-sparse language model”), and some estimates say it could have cost as much as $10m to train. Indeed, it was trained on large web corpora representing much of the internet at the time!
Despite this, GPT-3 wasn’t all that different from GPT-2, save for some minor architectural improvements to speed up training. The main realization was that this massive model could perform well on a large variety of generic and specific tasks without any further fine-tuning - hence the idea of “few-shot learners”: the model performs well on many tasks given just a few worked examples in its prompt, without further training! Clearly this represented a step change for AI as a whole, because it lessened the importance of, and need for, task-specific training data.
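To make “few-shot” concrete, here is a minimal, hypothetical example of how such a prompt is assembled. The task, examples and wording are all invented, but the pattern - a task description, a few worked demonstrations, and an unfinished final item for the model to complete - is the essence of the technique.

```python
# A few worked demonstrations ("shots") placed directly in the prompt.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
]

def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: task description, demonstrations, then the query."""
    lines = ["Translate English to French."]
    for en, fr in examples:
        lines.append(f"English: {en}\nFrench: {fr}")
    # The final item is left unfinished for the model to complete.
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)

print(few_shot_prompt(examples, "cat"))
```

No weights are updated anywhere here - the “learning” happens entirely inside the model’s forward pass as it conditions on the demonstrations.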
In January 2022 the InstructGPT paper was released, entitled “Training language models to follow instructions with human feedback”. Unlike the progression from GPT-2 to GPT-3, InstructGPT was not a bigger model. In fact, it was a much smaller model of only 1.3bn parameters, i.e. over 100x fewer than GPT-3. The key innovation this time was reinforcement learning from human feedback: a “reward model” was first trained on rankings of model outputs produced by a team of around 40 human labellers, and a reinforcement learning algorithm known as proximal policy optimization (PPO) was then used to further train the smaller transformer model against that reward model - in effect, teaching it what “good” results looked like.
That’s right - OpenAI hired a team of human labellers to create a large training dataset so that the model would produce results closer to what humans expect.
This was a fascinating approach because, once again, there was no major architectural innovation; instead OpenAI relied on the “traditional” approach of manually creating training data, signifying a return of the importance of training data.
It also shows the power of being “problem” or “user” focused, because tuning the model to behave more as a human would expect seemed to produce much “better” results than simply achieving high scores on machine learning benchmarks. Indeed, the model at times regressed and performed worse on such benchmarks, despite its results being more pleasing to humans (though this was also addressed in the training regime).
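The labelling-and-reward step at the heart of this pipeline can be sketched with a toy pairwise preference model. Everything below is an invented miniature: the real reward model is itself a large transformer scoring whole responses, and the learned reward is then maximized with PPO rather than used directly.

```python
import math

# Toy reward model: a linear score over hand-made response "features".
def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# Each pair: features of the response a labeller preferred vs. the rejected one.
preferences = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 0.0]),
]

weights = [0.0, 0.0]
for _ in range(500):
    for chosen, rejected in preferences:
        # Pairwise (Bradley-Terry style) loss: -log sigmoid(score_chosen - score_rejected)
        diff = score(weights, chosen) - score(weights, rejected)
        g = 1.0 / (1.0 + math.exp(-diff)) - 1.0  # gradient of the loss w.r.t. diff
        for i in range(len(weights)):
            weights[i] -= 0.1 * g * (chosen[i] - rejected[i])

# The trained reward model now ranks preferred-style responses higher;
# PPO would then tune the language model to generate text maximizing this reward.
```

The design choice worth noticing: labellers only need to *rank* outputs, not write perfect ones, which makes collecting the preference data far cheaper than writing gold-standard answers.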
InstructGPT, which eventually led to ChatGPT, essentially tuned OpenAI’s transformers to produce more “realistic” results as defined and expected by humans. In particular, people were giving “instructions” to the model, rather than the input and output sequences theoretically expected by seq2seq models. However, as can be seen above, not that much technical or architectural innovation has occurred since Vaswani et al.’s 2017 paper introducing the self-attention-based transformer.
GPT-4 and beyond
Future avenues of research and development are likely to focus on areas such as:
- Multimodal data (inputting and outputting text with images, videos and so on)
- Solving the limited input sequence length problem
- More advanced architectures to improve performance
- Once new architectures are created, or compute power/requirements change, a return to larger models (for now, however, the trend is towards smaller models - this oscillates over time)
Funnily enough, this article was written a few days before GPT-4 was published, and GPT-4 has more or less taken all of the above into account, with the exception of significant architecture changes! For more information on GPT-4, see our write-up here.
The strength of training data remains an important factor in the quality of these models. At Genie AI we have access to unique, highly structured document drafting, collaboration, review and negotiation data, which means our proprietary transformer models outperform the GPT family of models on legal tasks.
You can sign up to our app here, or get early access to our legal AI here.