OpenAI researchers have released findings indicating that even the most advanced AI models struggle significantly with coding tasks. Despite claims from OpenAI CEO Sam Altman that AI could outperform low-level software engineers by the end of the year, the research suggests otherwise.
The study used a benchmark named SWE-Lancer, which comprises over 1,400 software engineering tasks sourced from the freelance platform Upwork. The benchmark assessed three large language models (LLMs): OpenAI’s own o1 reasoning model, GPT-4o, and Anthropic’s Claude 3.5 Sonnet.
Two types of tasks were evaluated: individual tasks, which involved resolving bugs and implementing fixes, and management tasks, which required higher-level technical decision-making. Notably, the models had no internet access during testing, so they could not retrieve existing solutions online.
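To make the setup concrete, here is a minimal sketch of how a harness like this could grade a submission: apply the model’s patch in an isolated checkout, run the task’s verification tests, and count the task as solved only if everything passes. The `Task` fields and helper names below are illustrative assumptions, not OpenAI’s actual SWE-Lancer implementation.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    """One benchmark task (fields are illustrative, not SWE-Lancer's schema)."""
    task_id: str
    repo_dir: str        # isolated checkout of the project; no network access
    payout_usd: float    # dollar value the task was listed for on Upwork
    test_cmd: list[str]  # command that runs the task's verification tests

def apply_patch(repo_dir: str, patch: str) -> bool:
    """Try to apply a model-generated diff; False if it doesn't apply cleanly."""
    proc = subprocess.run(
        ["git", "apply", "-"],
        cwd=repo_dir,
        input=patch.encode(),
        capture_output=True,
    )
    return proc.returncode == 0

def grade_task(task: Task, patch: str) -> bool:
    """A task counts as solved only if the patch applies and all tests pass."""
    if not apply_patch(task.repo_dir, patch):
        return False
    proc = subprocess.run(task.test_cmd, cwd=task.repo_dir, capture_output=True)
    return proc.returncode == 0
```

Grading against the project’s own tests, rather than comparing text to a reference answer, is what lets a benchmark like this distinguish a genuine fix from a superficially plausible one.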
While the models tackled tasks valued at hundreds of thousands of dollars, their performance was limited. They managed to fix surface-level software issues but failed to locate bugs in larger projects or trace them to their root causes. Although the models worked faster than humans, they lacked an understanding of how widespread a bug was and the context in which it arose, producing solutions that were incorrect or incomplete.
Among the models, Claude 3.5 Sonnet outperformed the two OpenAI models, earning more on the benchmark’s payout-based metric even though the majority of its answers were still incorrect. The researchers emphasized that any AI model would need far higher reliability before it could be trusted with real-world coding tasks.
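The “more revenue” comparison reflects how the benchmark scores models: tasks are weighted by their real Upwork payout rather than counted equally, so a model with a low overall pass rate can still out-earn another if its successes land on higher-value tasks. A minimal illustration of that payout-weighted scoring, using hypothetical task IDs and dollar values:

```python
def earned_usd(payouts: dict[str, float], solved: dict[str, bool]) -> float:
    """Sum the payouts (task_id -> USD) of the tasks a model actually solved."""
    return sum(usd for task_id, usd in payouts.items()
               if solved.get(task_id, False))

# Two hypothetical models, each solving one of two tasks (same 50% pass rate):
payouts = {"fix-crash": 250.0, "build-feature": 4000.0}
print(earned_usd(payouts, {"fix-crash": True}))      # 250.0
print(earned_usd(payouts, {"build-feature": True}))  # 4000.0
```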
The findings underscore that while LLMs have made significant advancements, they are still not equipped to replace human engineers in software development. This gap remains despite ongoing discussions about AI’s potential to automate coding jobs.