Software: 3.0 vs. 2.0 vs. 1.0


Sharvanath Pathak

It’s been a while since Karpathy wrote his first post on software 2.0. However, the bulk of software is still being written in Python. As that post states, one of the main drawbacks of software 2.0 is the loss of predictability and understandability. Another big drawback is the lack of high-quality training examples. However, the use of “in-context learning” with “few-shot” examples has made this problem less critical. Karpathy called this new flavor of software “software 3.0” in his latest talk.

Software 1.0: Heuristic and rule-based solutions. (Manually curated rules/algorithms, e.g. Python code.)

Software 2.0: Small ML-model-based solutions. (Trained on task-specific datasets, e.g. nltk.)

Software 3.0: Foundational-model-based solutions, via prompting and/or fine-tuning.

Connected components example (best suited for 1.0)

Problem: Let’s say the problem is finding connected components in a graph. The graph is given as nodes and edges, where nodes are [a, b, c, d, e, f] and edges are [(a,b), (a,c), (e,d)]. GPT-4o mini was able to come up with the basic reasoning to solve this problem (link to chat). However, imagine putting such a system in production for finding connected components. It is strictly suboptimal in terms of speed. In particular:

A hand-coded algorithm (software 1.0), if one exists, will almost always be faster and more accurate at solving a problem than an LLM-based approach (software 3.0). A smaller fine-tuned model (software 2.0) will likely be faster than the LLM but still unnecessarily inefficient.
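To make the point concrete, here is what the hand-coded 1.0 solution looks like: a minimal sketch using iterative DFS over an adjacency list, run on the example graph above. It executes in linear time with no model inference at all.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph."""
    # Build an undirected adjacency list.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Iterative DFS from each unvisited node collects one component.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(sorted(comp))
    return components


print(connected_components(["a", "b", "c", "d", "e", "f"],
                           [("a", "b"), ("a", "c"), ("e", "d")]))
# → [['a', 'b', 'c'], ['d', 'e'], ['f']]
```

A few dozen microseconds of deterministic computation, versus a multi-second, non-deterministic LLM call for the same answer.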

The only exception to the above statement is problems so gnarly that a 1.0 solution simply does not exist. It’s particularly easy to fall into the trap of attempting a 1.0 solution when the problem merely seems within reach. Let’s look at some such examples.

Stemming/Lemmatization example (best suited for 2.0)

There are a bunch of 1.0 rule-based approaches to these (https://en.wikipedia.org/wiki/Stemming). Traditional stemming algorithms like Porter and Snowball offer fast, rule-based ways to reduce words to a common root, but they often produce non-words (e.g., “happili” from “happily”) and sometimes conflate unrelated terms (“univers” for both “universal” and “university”). Machine-learning approaches, particularly for lemmatization, leverage extensive training data and often contextual understanding (e.g., part-of-speech tagging) to produce linguistically accurate, valid base forms (lemmas), such as correctly mapping “ran” to “run” or distinguishing “leaves” the noun from “leaves” the verb. Note that there is also enough high-quality training data that a 2.0 model works quite well for this problem; even LSTM-based models (e.g. Stanza) perform well. A foundational-model (3.0) approach is of course more accurate still, at the expense of higher latency.
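To see why rule-based (1.0) stemming hits a ceiling, here is a toy suffix-stripping stemmer. This is a deliberately simplified illustration, not the actual Porter algorithm (Porter, for instance, yields “happili” rather than “happi”), but it reproduces the same two failure modes: non-word outputs and conflation of unrelated terms.

```python
# Toy suffix-stripping rules, checked in order. Illustrative only;
# NOT the real Porter/Snowball rule sets.
SUFFIX_RULES = [("ly", ""), ("ity", ""), ("al", "")]


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


print(toy_stem("happily"))     # → happi   (a non-word)
print(toy_stem("universal"))   # → univers (conflated with "university")
print(toy_stem("university"))  # → univers (conflated with "universal")
```

No finite rule set fixes this in general, because the “right” base form depends on the lexicon and context, which is exactly what a trained 2.0 lemmatizer learns from data.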

Text-to-SQL example (best suited for 3.0)

Now let’s look at the text-to-SQL problem. Assume you have a specific dataset at work, with a bunch of tables, and you want to build a system that converts a given natural-language query into SQL. Note that just training an ML model on a text-to-SQL corpus is not enough. The primary reason is that the problem is more complex than the amount of training data at hand. For instance, look at the Spider or BIRD benchmarks: 2.0-based approaches have considerably lower accuracy than 3.0 (foundational model + prompting) based approaches.
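A 3.0-style solution here largely amounts to assembling the table schema and a few in-context examples into a prompt for a foundational model. Below is a minimal sketch of that prompt assembly; the schema, the example question/SQL pairs, and the prompt layout are all hypothetical, and the actual model call is omitted.

```python
def build_text_to_sql_prompt(schema, examples, question):
    """Assemble a few-shot, in-context-learning prompt for text-to-SQL.

    schema:   DDL text describing the available tables (hypothetical).
    examples: (question, sql) pairs used as few-shot demonstrations.
    question: the new natural-language query to convert.
    """
    parts = [
        "You are a text-to-SQL assistant. Use only the tables below.",
        "Schema:",
        schema,
        "",
    ]
    for q, sql in examples:
        parts += [f"Question: {q}", f"SQL: {sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


# Hypothetical single-table schema and one few-shot example.
schema = "CREATE TABLE orders (id INT, amount REAL, placed_at DATE);"
examples = [("How many orders are there?", "SELECT COUNT(*) FROM orders;")]
prompt = build_text_to_sql_prompt(schema, examples,
                                  "What is the total order amount?")
print(prompt)
```

The returned string would then be sent to a foundational model; swapping the schema swaps the “program,” with no retraining, which is what makes this workable when task-specific training data is scarce.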

Summary

Efficiency: Software 3.0 < Software 2.0 < Software 1.0

Efficacy for vague problems: {Software 3.0, Software 2.0} > Software 1.0

Software 3.0 is the better fit when training data is insufficient.
