How smart are the latest AI models compared to humans? Let’s take a look at how the most competent AI systems compare with humans in various domains. The list below is regularly updated to reflect the latest developments.

Last update: 2025-06-28

Superhuman (Better than all humans)

Games: For many games (Chess, Go, Starcraft, Dota, Gran Turismo, etc.) the best AI is better than the best human.
Working memory: An average human can remember about 7 items (such as numbers) at a time. Gemini 1.5 Pro can read and remember 99% of 7 million words.
Reading speed: A model like Gemini 1.5 Pro can read an entire book in 30 seconds. It can learn an entirely new language and translate texts in half a minute.
Writing speed: AI models can write at speeds far surpassing any human, writing entire computer programmes in seconds.
Amount of knowledge: Modern LLMs know far more than any human, with knowledge spanning virtually every domain. There is no human whose breadth of knowledge comes close.

Better than most humans

Programming: o3 beats 99.9% of human coders in the very challenging Codeforces competition. It also solves 71.7% of the coding issues in the SWE-bench benchmark, which shows it can solve real-world software engineering problems very effectively.
Writing: In December 2023, an AI-written novel won an award at a national science fiction competition. The professor who used the AI crafted the narrative from a draft of 43,000 characters generated in just three hours with 66 prompts. The best language models have superhuman vocabulary and can write in many different styles.
Translating: The best language models can respond in, and translate between, all major languages fluently.
Creativity: Better than 99% of humans on the Torrance Tests of Creative Thinking, where relevant and useful ideas need to be generated. However, the test tasks were relatively small, and for larger projects (e.g. setting up a new business) AI is not autonomous enough yet.
Domain expertise: o3 correctly answers 87.7% of GPQA diamond questions, outperforming human domain experts (PhDs) who only get 69.7%.
Visual reasoning: o3 achieved a score of 87.5% (human average is 60%) on the ARC-AGI benchmark, which was specifically designed to be hard for large language models.
Maths: Gemini 2.5 Pro got a gold medal in the International Mathematical Olympiad, the world’s most prestigious math competition.
Persuasion: GPT-4 with access to personal information was able to increase participants’ agreement with their opponents’ arguments by a remarkable 81.7 percent compared to debates between humans, making it almost twice as persuasive as the human debaters.
IQ tests: With verbal IQ tests, LLMs have been outperforming 95 to 99% of humans for a while (score between 125 and 155). With non-verbal (pattern matching) IQ tests, the 2024 o1-preview model scored 120 on the Mensa test, beating 91% of humans.
Specialized knowledge: GPT-4 scores 75% in the Medical Knowledge Self-Assessment Program, while humans score between 65 and 75% on average. It scores better than 68 to 90% of law students on the bar exam.
Art: Image generation models have won art and even photography contests.
Research: GPT-4 can do autonomous chemical research and DeepMind has built an AI that has found a solution to an open mathematical problem. However, these architectures require a lot of human engineering and are not general.
Hacking: GPT-4 can autonomously hack websites and beats 89% of hackers in a Capture-the-Flag competition.
Using a web browser: Gemini 2.0 achieved 84% on the WebVoyager benchmark, outperforming humans (72%).
Being a convincing human in a chat: GPT-4.5 passed the Turing test, and was considered to be human more often than actual humans.
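
The percentile figures in the IQ tests item can be reproduced from the scores themselves. The following is a minimal sketch, assuming IQ scores are normally distributed with a mean of 100 and a standard deviation of 15 (a common convention, not something the linked sources are confirmed to use); the function name iq_percentile is purely illustrative.

```python
from statistics import NormalDist

# Assumption: IQ scores follow a normal distribution with mean 100 and SD 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_percentile(score: float) -> float:
    """Return the percentage of the population expected to score below `score`."""
    return 100 * IQ_SCALE.cdf(score)


# Scores mentioned in the IQ tests item above.
for score in (120, 125, 155):
    print(f"IQ {score}: higher than {iq_percentile(score):.1f}% of people")
```

Under this assumption, a score of 120 lands at roughly the 91st percentile and a score of 125 at roughly the 95th, which is consistent with the figures quoted in the list.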

Worse than most humans

Saying “I don’t know”. Virtually all Large Language Models have this problem of ‘hallucination’: making up information instead of saying they do not know. This might seem like a relatively minor shortcoming, but it’s a very important one. It makes LLMs unreliable and strongly limits their applicability. However, studies show that larger models hallucinate far less than smaller ones.
Dextrous movement. No robots can move around like a human can, but we’re getting closer. The Atlas robot can walk, throw objects and do somersaults. Google’s RT-2 can turn objectives into actions in the real world, like “move the cup to the wine bottle”. Tesla’s Optimus robot can fold clothes and Figure’s biped can make coffee.
Self-replication. All lifeforms on earth can replicate themselves. AI models could spread from computer to computer through the internet, but this requires a set of skills that AI models do not yet possess. A 2023 study lists a set of 12 tasks for self-replication, of which the tested models completed 4. In December 2024, a study showed that various open source models can self-replicate on a machine, given some tooling. In a 2025 study, Claude 3.7 Sonnet had a >50% score on 15/20 self-replication tasks. An AI that successfully self-replicates might lead to an AI takeover.
Continual learning. Current SOTA LLMs separate learning (‘training’) from doing (‘inference’). Although LLMs can learn using their context, they cannot update their weights while being used. Humans learn and do at the same time. However, there are multiple potential approaches to closing this gap. A 2024 study detailed some recent approaches for continual learning in LLMs.
Planning. LLMs are not yet very good at planning (e.g. reasoning about how to stack blocks on a table). However, larger models do perform way better than smaller ones.

The endpoint

As time progresses and capabilities improve, we move items from lower sections to the top section. When some specific dangerous capabilities are achieved, AI will pose new risks. At some point, AI will outcompete every human in every metric imaginable. When we have built this superintelligence, we will probably soon be dead. Let’s implement a pause to make sure we don’t get there.
