Maicon Matsubara
Specialist in AI & Intelligent Agents

// technical reference

LLM Benchmarks Guide

What each benchmark truly measures — and which one to use depending on your use case.

14 benchmarks
5 categories
2 languages

// benchmarks — click filters to highlight

SWE-bench
Agentic Coding

The model receives a real GitHub repo with a reported bug and must write the patch that fixes it — no hints about where the problem is.

Measures: ability to navigate complex codebases, understand broad context, plan multi-file edits and execute sequential actions autonomously.

Relevance
Agentic
5
Multi-Agent
3
Vibe Coding
5
General Assistant
1
Reasoning
2
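The SWE-bench evaluation loop above can be sketched in a few lines: apply the model's patch to a repo checkout, then run the task's fail-to-pass tests. This is a deliberately simplified sketch — the real harness also re-runs previously passing tests and isolates each task in a container — and the repo path and test command here are placeholders, not SWE-bench internals.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, then run the task's fail-to-pass
    tests. The task counts as resolved only if the tests now pass.
    (Simplified: the real harness also checks pass-to-pass tests and
    runs everything inside a per-task container.)"""
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir,
        input=patch, text=True, capture_output=True,
    )
    if apply.returncode != 0:          # patch did not apply cleanly
        return False
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0       # resolved iff tests pass
```

The "no hints" part of the benchmark lives outside this function: the model must first locate the bug across the whole repo before it can emit the patch this harness applies.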
HumanEval / MBPP
Coding

Isolated programming problems: given a description, generate the correct function. Both use Python; HumanEval has 164 handwritten problems, while MBPP targets beginner-to-intermediate difficulty.

Measures: code synthesis in a single function. Does not test agentic reasoning or real file editing. Good for "can the model code at all?"

Relevance
Agentic
1
Multi-Agent
Vibe Coding
4
General Assistant
1
Reasoning
2
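Functional-correctness checking of the kind HumanEval popularized can be sketched as: execute the generated function, then run the task's hidden unit tests against it. The sample task below is hypothetical, and real harnesses sandbox execution with timeouts and process isolation rather than a bare `exec`.

```python
def passes_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Run a generated function against its unit tests in a fresh
    namespace; any exception counts as failure. pass@1 is then the
    fraction of tasks whose first sample passes."""
    env: dict = {}
    try:
        exec(candidate_code, env)       # define the candidate function
        exec(test_code, env)            # defines check(fn)
        env["check"](env[entry_point])  # run the hidden tests
        return True
    except Exception:
        return False

# Hypothetical HumanEval-style task: the prompt asks for add(a, b).
solution = "def add(a, b):\n    return a + b\n"
tests = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"
```

Note how binary this is: a function that is almost right scores exactly the same as no answer, which is part of why these scores say little about agentic editing skill.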
MMLU
General Assistant

57 knowledge domains: medicine, law, history, physics, math, ethics... The model answers multiple-choice questions at university/grad level.

Measures: breadth of factual knowledge. Great indicator of whether the model is a well-rounded generalist. Does not test deep reasoning.

Relevance
Agentic
Multi-Agent
Vibe Coding
1
General Assistant
5
Reasoning
2
GPQA (Diamond)
Reasoning

Questions written by PhDs in physics, biology, and chemistry — so hard that skilled non-experts score only around 34% even with unrestricted web access. Diamond = the hardest, expert-validated subset.

Measures: deep and rigorous scientific reasoning. Signals advanced technical research capability. Does not directly test code or agentic behavior.

Relevance
Agentic
Multi-Agent
Vibe Coding
1
General Assistant
2
Reasoning
5
MATH / AIME
Reasoning

MATH: 12,500 competition problems (algebra, calculus, probability). AIME: the American Invitational Mathematics Examination, a qualifying round for the US olympiad — 15 problems per exam, integer answers, no multiple choice.

Measures: step-by-step mathematical reasoning, logical rigor, and multi-step problem solving. Direct benchmark for reasoning models (o1, R1, etc).

Relevance
Agentic
Multi-Agent
Vibe Coding
2
General Assistant
1
Reasoning
5
TAU-bench
Agentic Multi-Agent

Simulates customer support agents with real tools (databases, APIs). The agent must solve multi-step tasks by conversing with a simulated user.

Measures: tool use, decision-making in reactive environments, policy following, and error resilience. Benchmark for production agents.

Relevance
Agentic
5
Multi-Agent
4
Vibe Coding
1
General Assistant
2
Reasoning
1
GAIA
Agentic Multi-Agent

Real-world tasks humans could solve in minutes but that require the agent to web search, read PDFs, run code, and combine multiple information sources.

Measures: orchestration of heterogeneous tools, long-term planning, and real-world grounding. Widely used to evaluate agentic assistants like Deep Research.

Relevance
Agentic
4
Multi-Agent
5
Vibe Coding
2
General Assistant
2
Reasoning
3
WebArena
Agentic

The agent controls a real browser and performs tasks on simulated sites (Reddit, GitLab, e-commerce, maps). VisualWebArena adds visual perception (screenshots).

Measures: autonomous web navigation, real UI interaction, sequential action planning. Reference for computer-use agents and automation agents.

Relevance
Agentic
5
Multi-Agent
2
Vibe Coding
1
General Assistant
1
Reasoning
2
Chatbot Arena (LMSYS)
General Assistant

Real humans compare two anonymous models in open-ended conversations and vote for the best. The ELO ranking is computed from thousands of real preferences.

Measures: real human preference in everyday use — writing, conversation, instruction following, tone. Best proxy for "which model do people actually prefer".

Relevance
Agentic
2
Multi-Agent
1
Vibe Coding
2
General Assistant
5
Reasoning
2
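The rating mechanics behind the leaderboard can be sketched with the classic Elo update from a single pairwise vote (the Arena has since moved to fitting a Bradley-Terry model over all votes at once, but the per-vote Elo intuition is the same):

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update from a single human vote between models A and B.
    The expected score follows a logistic curve in the rating gap;
    K controls how far ratings move per comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

An upset (the lower-rated model winning) moves both ratings much more than a predictable result, which is what lets thousands of noisy votes converge to a stable ranking.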
MT-Bench
General Assistant

80 multi-turn conversations across 8 categories (writing, roleplay, extraction, reasoning, math, code, etc.). GPT-4 acts as a judge evaluating responses.

Measures: instruction following in real chat context and multi-turn coherence. Closer to real chatbot use than multiple-choice benchmarks.

Relevance
Agentic
2
Multi-Agent
1
Vibe Coding
2
General Assistant
4
Reasoning
2
LiveCodeBench
Coding

Competitive programming problems collected after model cutoffs (LeetCode, Codeforces, AtCoder) — avoids training data contamination.

Measures: ability to solve novel algorithm and data structure problems. More reliable than HumanEval for ranking recent models because its problems postdate their training data.

Relevance
Agentic
1
Multi-Agent
Vibe Coding
5
General Assistant
1
Reasoning
3
AgentBench
Multi-Agent

Suite of 8 environments: OS, database, web, games, shopping, and more. The model acts as an autonomous agent in each interactive environment.

Measures: generalization of agentic behavior across diverse environments — crucial for evaluating frameworks like AutoGPT, CrewAI and similar.

Relevance
Agentic
3
Multi-Agent
5
Vibe Coding
2
General Assistant
1
Reasoning
1
IFEval
General Assistant Agentic

Instructions with explicitly verifiable constraints: "respond in under 100 words", "use exactly 3 sections with headings", "do not use the word X".

Measures: precise instruction and constraint following — critical skill for both assistants and agents receiving detailed system prompts.

Relevance
Agentic
4
Multi-Agent
2
Vibe Coding
1
General Assistant
4
Reasoning
1
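What makes IFEval distinctive is that every constraint is machine-checkable — no LLM judge needed. A sketch of a checker for the three example constraints from this card (the heading regex assumes markdown-style sections, which is an assumption, not IFEval's exact rule set):

```python
import re

def check_constraints(response: str, max_words: int, n_sections: int,
                      forbidden: str) -> dict[str, bool]:
    """Programmatically verify a word limit, an exact section count
    (markdown headings), and a forbidden word. IFEval scores each
    instruction independently, so partial credit is visible."""
    words = response.split()
    headings = re.findall(r"^#+ ", response, flags=re.MULTILINE)
    return {
        "under_word_limit": len(words) < max_words,
        "exact_sections": len(headings) == n_sections,
        "avoids_word": forbidden.lower()
            not in (w.lower().strip(".,!?") for w in words),
    }
```

Because the verdict is deterministic, IFEval sidesteps the judge-bias problems of MT-Bench-style evaluation — at the cost of only testing constraints that can be expressed this mechanically.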
SimpleQA
General Assistant

Short, verifiable factual questions about the real world. Designed to measure calibration: the model should not confidently hallucinate when it does not know.

Measures: factual honesty and confidence calibration. A model with a high hallucination rate will score poorly here even with a high MMLU.

Relevance
Agentic
Multi-Agent
Vibe Coding
General Assistant
3
Reasoning
1
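The calibration angle shows up in how scores are aggregated: SimpleQA-style evaluations report accuracy among attempted answers separately from overall accuracy, so abstaining beats hallucinating. A sketch, assuming per-question grades labeled `correct` / `incorrect` / `not_attempted` (the label names are an assumption):

```python
def simpleqa_scores(grades: list[str]) -> dict[str, float]:
    """Aggregate per-question grades into the three headline numbers:
    overall accuracy, accuracy among attempted answers, and how often
    the model attempted at all. A well-calibrated model keeps
    accuracy_given_attempted high by declining when unsure."""
    n = len(grades)
    correct = grades.count("correct")
    attempted = n - grades.count("not_attempted")
    return {
        "accuracy": correct / n,
        "accuracy_given_attempted": correct / attempted if attempted else 0.0,
        "attempted_rate": attempted / n,
    }
```

Two models with identical overall accuracy can differ sharply here: one answers everything and is often wrong, the other answers less but is almost always right when it does.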

// summary by use case — priority benchmarks

Agentic
  • SWE-bench Verified
  • TAU-bench
  • WebArena / VisualWebArena
  • GAIA
  • IFEval
Multi-Agent
  • GAIA
  • AgentBench
  • TAU-bench
  • SWE-bench
Vibe Coding
  • SWE-bench
  • LiveCodeBench
  • HumanEval / MBPP
  • BigCodeBench
General Assistant
  • Chatbot Arena
  • MT-Bench
  • MMLU
  • SimpleQA
  • IFEval
Reasoning
  • GPQA Diamond
  • AIME 2024/2025
  • MATH-500
  • ARC-Challenge
Beware of marketing: companies choose which benchmarks to report. A model can have a high SWE-bench but a mediocre Chatbot Arena score — pick the benchmark aligned with YOUR use case.

Data contamination is a real issue: benchmark problems may have leaked into models' training data. LiveCodeBench and GPQA Diamond are more resistant — the former collects problems released after model cutoffs, and the latter was expert-written to be "Google-proof".