Mercury 2: the first reasoning model fast enough to pick up the phone

Mercury 2: the fastest reasoning model for voice agents

Mercury 2 vs Cerebras

Cerebras-class speed. None of the procurement.

Mercury 2 delivers ~1,000 tok/s on standard NVIDIA GPUs. Haiku-tier quality, no minimum contract, self-serve API key.

Comparable performance.
Fraction of the cost.

THROUGHPUT: ~1,000 tok/s on standard NVIDIA GPUs

QUALITY TIER: comparable to Haiku & GPT-5 Mini

PRICING: $0.25 in / $0.75 out per 1M tokens, pay as you go

MIN COMMITMENT: zero commitment, API key ready in 60s

Mercury 2 vs Cerebras.
Side by side.

| | Mercury 2 | GPT-OSS on Cerebras |
|---|---|---|
| Peak output speed | ~1,000 tok/s | ~1,200-1,700 tok/s |
| Quality tier | ~Claude Haiku 4.5 / GPT-5 mini | |
| Pricing / 1M tok | $0.25 in / $0.75 out | $0.35 in / $0.75 out |
| Prompt caching | Available - cached input $0.025/M tokens | Not currently offered - cached tokens billed at full $0.35/M |
| Rate-limit accounting | Elastic capacity metered on tokens you actually use | Reserves your max output up front - throttled before a token is generated |
| Architecture | Diffusion LLM on NVIDIA GPUs | Wafer-scale custom silicon (WSE-3) |
| Capacity model | Elastic, standard cloud GPU supply | Allocation-gated |

*Comparison reflects publicly disclosed information at time of publication. Quality and speed tier benchmarks via Artificial Analysis. Methodology in our docs.
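
The rate-limit accounting row can be made concrete with a toy model. The sketch below assumes a hypothetical per-window token budget and compares two schemes: one that charges each request for its prompt plus the full `max_tokens` reservation, and one that charges only for the tokens actually generated. The budget, request sizes, and function names are illustrative assumptions, not either provider's real limiter.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_tokens: int      # output ceiling sent with the request
    actual_output: int   # tokens the model really generated

def admitted(reqs, budget, reserve_max):
    """Count how many requests fit into one rate-limit window of `budget` tokens."""
    used = 0
    count = 0
    for r in reqs:
        # Reservation-style accounting charges the ceiling; usage-style charges real output.
        output_charge = r.max_tokens if reserve_max else r.actual_output
        charge = r.prompt_tokens + output_charge
        if used + charge > budget:
            break  # the next request would be throttled
        used += charge
        count += 1
    return count

# Hypothetical workload: short answers sent with a generous max_tokens ceiling.
workload = [Request(prompt_tokens=500, max_tokens=4000, actual_output=300)] * 100
BUDGET = 60_000  # assumed tokens per window

print("reserve max_tokens up front:", admitted(workload, BUDGET, reserve_max=True))
print("meter actual usage:         ", admitted(workload, BUDGET, reserve_max=False))
```

With these made-up numbers, the gap comes entirely from the difference between `max_tokens` and the real output length, so tightening `max_tokens` narrows it.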

Real-time reasoning

Coding agents

Sub-second loops

Autocomplete, multi-step agent loops, and refactors that land before the developer breaks flow.

Voice

Real-time pipelines

Conversational interfaces where latency is the product. Tightest latency budgets in AI.

Enterprise Search

Reasoning at retrieval speed

Multi-hop retrieval, reranking, and summarization without blowing the latency budget.

Verify it yourself

Don’t take our word for it. Run the harness on your traffic.

Same prompts, both endpoints, your requests. It prints p50/p95 latency and cost per request. If we're wrong about your workload, you'll know in ten minutes.

import json
import os
import statistics
import time

import requests

# One JSON object per line, each with a "prompt" field (50 production prompts).
with open("traffic_sample.jsonl") as f:
    PROMPTS = [json.loads(line)["prompt"] for line in f if line.strip()]

# "in" and "out" are USD per 1M tokens; update model and prices to match what you compare.
PROVIDERS = {
    "mercury-2": {"url": "https://api.inceptionlabs.ai/v1/chat/completions", "model": "mercury-2",    "key": os.environ["INCEPTION_API_KEY"], "in": 0.25, "out": 0.75},
    "cerebras":  {"url": "https://api.cerebras.ai/v1/chat/completions",      "model": "llama-3.3-70b", "key": os.environ["CEREBRAS_API_KEY"],  "in": 0.85, "out": 1.20},
}

def bench(p):
    """Send every prompt to one provider; return p50/p95 latency (s) and mean cost per request."""
    lat, cost = [], 0.0
    for prompt in PROMPTS:
        t = time.perf_counter()
        r = requests.post(p["url"], headers={"Authorization": f"Bearer {p['key']}"},
                          json={"model": p["model"], "max_tokens": 1000,
                                "messages": [{"role": "user", "content": prompt}]},
                          timeout=60)
        lat.append(time.perf_counter() - t)  # full round trip, non-streaming
        r.raise_for_status()                 # fail loudly on auth or rate-limit errors
        u = r.json()["usage"]
        cost += (u["prompt_tokens"] * p["in"] + u["completion_tokens"] * p["out"]) / 1e6
    # 99 cut points: index 49 is the 50th percentile, index 94 the 95th.
    q = statistics.quantiles(lat, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "$/req": cost / len(PROMPTS)}

for name, p in PROVIDERS.items():
    print(name, bench(p))
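
The harness bills every prompt token at the full input rate. For workloads with a large repeated prefix, cached input changes the per-request figure. Here is a minimal sketch using the list prices from the comparison table. The token counts are made up for illustration, and `cached_tokens` is a hypothetical argument rather than a field the harness reads from either API.

```python
def request_cost(prompt_tokens, completion_tokens, price_in, price_out,
                 cached_tokens=0, price_cached=None):
    """USD cost of one request; all prices are USD per 1M tokens."""
    if price_cached is None:
        price_cached = price_in  # no cache discount: cached tokens billed at the full input rate
    fresh_tokens = prompt_tokens - cached_tokens
    total = (fresh_tokens * price_in
             + cached_tokens * price_cached
             + completion_tokens * price_out)
    return total / 1e6

# Illustrative request: 8,000-token prompt, 6,000 of it a cached prefix, 500 output tokens.
mercury_2 = request_cost(8000, 500, 0.25, 0.75, cached_tokens=6000, price_cached=0.025)
gpt_oss_cerebras = request_cost(8000, 500, 0.35, 0.75, cached_tokens=6000)

print(f"Mercury 2:           ${mercury_2:.6f}")
print(f"GPT-OSS on Cerebras: ${gpt_oss_cerebras:.6f}")
```

The share of the prompt that is actually cached depends on your traffic, so measure it before relying on the discounted figure.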

Cerebras achieves its speed with wafer-scale custom chips, which bring capacity constraints and contract minimums with them.
Mercury 2 hits comparable throughput through a fundamentally different path: parallel diffusion-based generation running on widely available NVIDIA infrastructure.

Is the quality really there at this speed?

Mercury 2 sits in the speed-optimized tier alongside Claude Haiku 4.5 and GPT-5 Mini on Artificial Analysis’s Agentic Index. It does not compete with frontier models like Opus or GPT-5.

If your workload runs well on Haiku or Mini today, Mercury 2 will run it at roughly an order of magnitude more throughput. If you need frontier reasoning, you need a frontier model. We’re honest about which tier this is.
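
To put "an order of magnitude more throughput" in latency terms, output time is roughly reply length divided by decode speed. The sketch below uses the ~1,000 tok/s figure quoted on this page and an assumed 100 tok/s baseline. That baseline is simply a tenth of the quoted figure, not a measured number for any model, and the calculation ignores network time and time to first token.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Approximate time spent generating output, ignoring network and time to first token."""
    return output_tokens / tokens_per_second

FAST_TPS = 1000  # figure quoted on this page
SLOW_TPS = 100   # assumed baseline, one order of magnitude lower

for tokens in (50, 200, 1000):
    fast = decode_seconds(tokens, FAST_TPS)
    slow = decode_seconds(tokens, SLOW_TPS)
    print(f"{tokens:>5} tokens: {fast:.2f}s at {FAST_TPS} tok/s vs {slow:.2f}s at {SLOW_TPS} tok/s")
```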

Benchmark it on your own workload

Spin up an API key, drop in the OpenAI-compatible endpoint, and run it against Cerebras on your real prompts.
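
A minimal sketch of what dropping in the endpoint can look like, using only the Python standard library. The URL and model name are the ones the harness on this page uses. The request body assumes the usual chat-completions shape, and the helper names `build_request` and `complete` are illustrative.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.inceptionlabs.ai/v1"  # same host the harness on this page calls
MODEL = "mercury-2"

def build_request(prompt, max_tokens=256):
    """Build (but do not send) a chat-completions request for one user prompt."""
    body = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('INCEPTION_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def complete(prompt):
    """Send the request and return the first choice's text (assumes the standard response shape)."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Calling `complete("your prompt here")` sends one request. Pointing `BASE_URL` and `MODEL` at another OpenAI-compatible provider reuses the same code for a side-by-side run.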
