New ARC-AGI-2 Benchmark Exposes Limits of Leading AI Models


A new benchmark designed to test artificial general intelligence (AGI) has revealed that even the most advanced AI systems continue to fall far short of human-like reasoning. The ARC-AGI-2 test, developed by the Arc Prize Foundation and co-created by renowned AI researcher François Chollet, challenges models with visually grounded puzzles that require abstract pattern recognition, spatial reasoning, and inductive logic—skills essential for general intelligence.

The puzzles, which resemble matrix-style visual tests, present grids of colored squares that AI systems must analyze and complete by identifying the underlying rule. Unlike traditional benchmarks that reward scale and brute-force computation, ARC-AGI-2 has been specifically designed to test genuine problem-solving ability rather than sheer model size or parameter count. In its current iteration, the benchmark penalizes solutions that depend heavily on large-scale compute, putting a premium on efficiency and true cognitive reasoning.
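The grid-completion setup described above can be sketched in a few lines of code. This is a minimal illustration, assuming the public ARC JSON task format ("train" and "test" lists of input/output grid pairs, with cells as color indices 0-9); the toy task and the candidate rule here are invented for demonstration, far simpler than real ARC-AGI-2 puzzles:

```python
# Toy ARC-style task (invented example): the hidden rule is
# "recolor every 1 to 2". Real ARC-AGI-2 tasks are much harder.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 1]], "output": [[2, 2], [0, 2]]},
    ],
    "test": [{"input": [[1, 0], [0, 1]]}],
}

def candidate_rule(grid):
    """Hypothesized rule: recolor every 1 to 2, leave other cells unchanged."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

def fits_training_pairs(rule, task):
    """A rule is plausible only if it reproduces every training output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

if fits_training_pairs(candidate_rule, task):
    prediction = candidate_rule(task["test"][0]["input"])
    print(prediction)  # [[2, 0], [0, 2]]
```

The point of the sketch is the inductive step: a solver sees only a handful of demonstration pairs, must infer the transformation, and is scored on whether it produces the exact output grid for the held-out test input.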

The results have been sobering. While human participants averaged around 60% accuracy on the tasks, most leading AI models—including state-of-the-art large language models and multimodal systems—scored near 1%. These low results underscore a persistent challenge in the field: current AI systems remain fundamentally limited when faced with tasks that require flexible thinking, generalization, and adaptation beyond training data.

To incentivize breakthroughs, the Arc Prize Foundation has launched the Arc Prize 2025, offering a substantial reward for any team that can achieve 85% accuracy on ARC-AGI-2 while maintaining a cost of $0.42 or less per task. The goal is not just to push performance higher, but to encourage efficient and scalable solutions that could one day underpin real-world AGI systems.

The benchmark arrives at a time when claims of approaching AGI have become increasingly common in the tech world. However, ARC-AGI-2 acts as a sobering counterpoint, illustrating just how far current models remain from human-level cognition. As excitement grows around AI’s potential, tests like this play a vital role in grounding expectations and setting meaningful milestones for progress.

In highlighting the wide gap between current AI capabilities and true general intelligence, ARC-AGI-2 is poised to become a cornerstone benchmark in the years ahead—challenging researchers to move beyond scale and toward substance in the quest for intelligent machines.

Global Tech Insider