The new ARC-AGI-2 benchmark presents a challenging test for artificial intelligence models, revealing that even the most advanced systems available today fall well short of the criteria for artificial general intelligence (AGI). The benchmark assesses not only the capabilities of AI models but also the efficiency and cost of running them.
AGI is generally defined as AI that can perform any cognitive task that humans can. The ARC Prize Foundation originally introduced ARC-AGI-1 to evaluate AI reasoning abilities, and last December a high score on that test from OpenAI's o3 model sparked discussions about the company's progress toward AGI.
However, the introduction of ARC-AGI-2 has raised the bar significantly. Current AI systems are unable to score more than a single digit out of 100, even though every question has been solved by at least two humans within two attempts.
ARC Prize Foundation president Greg Kamradt emphasized the importance of the new benchmark, saying it requires a blend of adaptability and efficiency to excel, which differentiates it from previous evaluations. “To beat it, you must demonstrate both a high level of adaptability and high efficiency,” he remarked.
Unlike other benchmarks that assess complex tasks, ARC-AGI-2 emphasizes relatively basic ones, such as making changes to an image based on prior examples. While current models score well on ARC-AGI-1, they fall short on these seemingly simpler challenges, which demand more intricate reasoning. For instance, OpenAI’s o3-low model achieves 75.7% on ARC-AGI-1 but manages only 4% on ARC-AGI-2.
The new benchmark also introduces a crucial perspective by evaluating the efficiency of AI problem-solving, factoring in operational costs. For instance, while human testers were compensated $17 per task, the estimated cost for OpenAI’s o3-low to complete each task is approximately $200.
Joseph Imperial from the University of Bath highlights that this focus on balancing performance with efficiency is a notable advancement in evaluating AI. He notes that this shift may lead to more sustainable AI development, addressing concerns about energy consumption in pursuit of performance.
Nevertheless, not all experts agree with the implications of ARC-AGI-2. Catherine Flick from the University of Staffordshire argues that framing it as a measure of intelligence may be misleading, as the benchmarks primarily evaluate the ability to accomplish specific tasks rather than general intelligence. She cautions against overinterpreting these scores as evidence of human-level intelligence, stating, “What they are doing is really just responding to a particular prompt accurately.”
The future of AGI benchmarks remains an open question. If a model were to pass ARC-AGI-2, discussion about the need for benchmarks to keep evolving, such as a potential ARC-AGI-3, would likely intensify. This ongoing debate suggests that the pursuit of true artificial general intelligence is far from over.