Freelance coders, take solace: while AI models can perform a lot of the real-world coding tasks that companies contract out, they do so less effectively than a human.
At least that was the case two months ago, when researchers with Alabama-based engineering consultancy PeopleTec set out to compare how four LLMs performed on freelance coding jobs.
David Noever, chief scientist at PeopleTec, and Forrest McKee, AI/ML data scientist at PeopleTec, describe their project in a preprint paper titled, "Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale."
"We found that there is a great data set of genuine [freelance job] bids on Kaggle as a competition, and so we thought: why not put that to large language models and see what they can do?"
Using the Kaggle dataset of Freelancer.com jobs, the authors built a set of 1,115 programming and data analysis challenges that could be evaluated with automated tests. The tasks were also assigned a monetary value, averaging $306 (median $250), and the paper puts the total potential value of completing every freelance job at "roughly $1.6 million."
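The scoring logic is easy to picture from that description: run each model's submission against a task's automated tests, and credit the job's dollar value only when every test passes. Here is a minimal, hypothetical sketch in Python; the per-task directory layout, the value_usd field, and the pytest-based runner are assumptions for illustration, not PeopleTec's actual harness.

```python
# Hypothetical scoring harness for illustration only; field names and the
# pytest-based runner are assumptions, not PeopleTec's actual code.
import subprocess
import sys


def run_tests(task_dir: str) -> bool:
    """Run a task's automated tests against the model-generated solution in task_dir."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=task_dir,            # tests and the generated solution live together
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.returncode == 0  # the task only counts if every test passes


def score_benchmark(tasks: list[dict]) -> tuple[int, float]:
    """Tally solved tasks and the dollar value 'earned' by passing their tests."""
    solved, earned = 0, 0.0
    for task in tasks:
        if run_tests(task["dir"]):
            solved += 1
            earned += task["value_usd"]  # each task carries its listed job value
    return solved, earned
```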
Then they evaluated four models: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. The first two are commercial models; the latter two are open source. The authors estimate that a human software engineer would be able to solve more than 95 percent of the challenges. No model did that well, but Claude came closest.
"Claude 3.5 Haiku narrowly outperformed GPT-4o-mini, both in accuracy and in dollar earnings," the paper reports, noting that Claude managed to capture about $1.52 million in theoretical payments out of the possible $1.6 million.
"It solved 877 tasks with all tests passing, which is 78.7 percent of the benchmark – a very high score for such a diverse task set. GPT-4o-mini was close behind, solving 862 tasks (77.3 percent). Qwen 2.5 was the third best, solving 764 tasks (68.5 percent). Mistral 7B lagged behind, solving 474 tasks (42.5 percent)."
Inspired by OpenAI's SWE-Lancer benchmark
Noever told The Register that the project came about in response to OpenAI's SWE-Lancer benchmark, published in February.
"They had accumulated a million dollars' worth of software tasks that were genuinely market reflective of [what companies were actually asking for]," said Noever. "It was unlike any other benchmark we've seen, and you know there's millions of those. And so we wanted to make it more universal beyond just ChatGPT."
Overall, the models had much less success with OpenAI's SWE-Lancer benchmark than with the benchmark the researchers created, possibly because the problems in the OpenAI study were more difficult. Of SWE-Lancer's $1 million in total task value, Claude 3.5 Sonnet earned $403,325, OpenAI's o1 earned $380,350, and GPT-4o earned $303,525.
On one specific subset of tasks in the OpenAI study, even the best performing model produced mostly incorrect solutions.
"The best performing model, Claude 3.5 Sonnet, earns $208,050 on the SWE-Lancer Diamond set and resolves 26.2 percent of IC SWE issues; however, the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment," the OpenAI paper says.
Regardless, while AI models cannot replace freelance coders, Noever said people are already using them to help fulfill freelance software engineering tasks. "I don't know whether someone's completely automated the pipeline," he said. "But I think that's coming, and I think that could be months."
People, he said, are already using AI models to generate freelance job requirements. And those are being answered by AI models and scored by AI models. It's AI all the way down.
"It's really phenomenal to watch," he said.
One of the interesting findings to come out of this study, Noever said, was that open source models break down at around 30 billion parameters. "That's right at the limit of a consumer GPU," he said. "I think Codestral is probably one of the strongest [of these open source models], but it's not going to complete these tasks. …So as it plays out, I think it does take infrastructure. There's just no way around that."
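How tight that limit is comes down to simple arithmetic. The sketch below is a rough, weights-only estimate; the 24 GB figure for a high-end consumer card and the precision options are illustrative assumptions, not numbers from the paper.

```python
# Back-of-the-envelope VRAM needed just to hold a 30B-parameter model's weights.
# The 24 GB consumer-GPU figure and the precision options are illustrative
# assumptions; activations and KV cache memory are ignored.
PARAMS = 30e9
CONSUMER_GPU_GB = 24

for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    verdict = "fits on" if gb <= CONSUMER_GPU_GB else "exceeds"
    print(f"{label}: ~{gb:.0f} GB of weights, {verdict} a {CONSUMER_GPU_GB} GB card")
```

On that arithmetic, a 30 billion parameter model only squeezes onto a single consumer card once aggressively quantized, which is consistent with Noever's point that running anything stronger takes real infrastructure. ®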