Mercury 2 vs Cerebras

Cerebras-class speed. None of the procurement.


Mercury 2 delivers ~1,000 tok/s on standard NVIDIA GPUs. Haiku-tier quality, no minimum contract, self-serve API key.


Comparable performance.
Fraction of the cost.


THROUGHPUT
~1,000 tok/s on standard NVIDIA GPUs

QUALITY TIER
Comparable in quality to Haiku & GPT-5 Mini

PRICING
$0.25 in / $0.75 out per 1M tokens
Pay as you go

MIN COMMITMENT
Zero commitment
API key ready in 60s

Mercury 2 vs Cerebras.
Side by side.

|                    | Mercury 2                              | Cerebras                                                             |
|--------------------|----------------------------------------|----------------------------------------------------------------------|
| Architecture       | Diffusion LLM on NVIDIA GPUs           | Wafer-scale custom silicon (WSE-3)                                   |
| Model              | Mercury 2 (one proprietary model)      | Open-weight catalog you select (Llama, Qwen3, gpt-oss, GLM, Kimi K2) |
| Quality tier       | ~Claude Haiku 4.5 / GPT-5 mini         | Depends on the model you run: entry to frontier-class                |
| Peak output speed  | ~1,000 tok/s                           | ~1,800 tok/s                                                         |
| Pricing / 1M tok   | $0.25 in / $0.75 out, flat             | $0.10–$2.75, ~25× spread by model                                    |
| Path to production | Self-serve, pay as you go, no contract | Self-serve dev tier; enterprise contract for production capacity     |
| Capacity model     | Elastic, standard cloud GPU supply     | Allocation-gated; Code Pro/Max self-serve sold out¹                  |
| API                | OpenAI-compatible drop-in              | OpenAI-compatible                                                    |

*Comparison reflects publicly disclosed information at time of publication. Quality and speed tier benchmarks via Artificial Analysis. Methodology in our docs.
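To turn the headline figures into per-request numbers, here is a back-of-envelope sketch. The prices and peak speeds come from the comparison above; the request shape (2,000 prompt tokens, 500 completion tokens) is a hypothetical chosen for illustration, and the timing ignores network, queueing, and time to first token.

```python
# Back-of-envelope estimate from the comparison's headline figures.
# Token counts are illustrative assumptions, not measurements.

MERCURY_IN, MERCURY_OUT = 0.25, 0.75   # USD per 1M tokens, from the comparison
MERCURY_TOK_S = 1_000                  # approximate peak output speed, from the comparison
CEREBRAS_TOK_S = 1_800                 # approximate peak output speed, from the comparison


def request_cost(prompt_tokens, completion_tokens, price_in, price_out):
    """Cost in USD of one request at per-1M-token prices."""
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000


def generation_seconds(completion_tokens, tokens_per_second):
    """Time to emit the completion at a steady output rate (no network or queueing)."""
    return completion_tokens / tokens_per_second


if __name__ == "__main__":
    prompt_tokens, completion_tokens = 2_000, 500  # hypothetical request shape
    cost = request_cost(prompt_tokens, completion_tokens, MERCURY_IN, MERCURY_OUT)
    print(f"Mercury 2 cost per request: ${cost:.6f}")
    print(f"Mercury 2 generation time: {generation_seconds(completion_tokens, MERCURY_TOK_S):.2f} s")
    print(f"Cerebras generation time:  {generation_seconds(completion_tokens, CEREBRAS_TOK_S):.2f} s")
```

Peak speeds are upper bounds, so treat these as best-case figures and rely on measured latency from your own traffic.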

Verify it yourself

Don’t take our word for it. Run the harness on your traffic.

Same prompts, both endpoints, your requests. It prints p50/p95 latency and cost per request. If we're wrong about your workload, you'll know in ten minutes.

import json
import os
import statistics
import time

import requests

# 50 production prompts, one JSON object per line with a "prompt" field.
with open("traffic_sample.jsonl") as f:
    PROMPTS = [json.loads(line)["prompt"] for line in f if line.strip()]

# "in" / "out" are prices in USD per 1M tokens; update them for the model you test.
PROVIDERS = {
    "mercury-2": {"url": "https://api.inceptionlabs.ai/v1/chat/completions", "model": "mercury-2",    "key": os.environ["INCEPTION_API_KEY"], "in": 0.25, "out": 0.75},
    "cerebras":  {"url": "https://api.cerebras.ai/v1/chat/completions",      "model": "llama-3.3-70b", "key": os.environ["CEREBRAS_API_KEY"],  "in": 0.85, "out": 1.20},
}

def bench(p):
    """Send every prompt to one provider; return p50/p95 latency (seconds) and mean cost per request."""
    lat, cost = [], 0.0
    for prompt in PROMPTS:
        t = time.perf_counter()
        r = requests.post(p["url"], headers={"Authorization": f"Bearer {p['key']}"},
                          json={"model": p["model"], "max_tokens": 1000,
                                "messages": [{"role": "user", "content": prompt}]},
                          timeout=60)
        lat.append(time.perf_counter() - t)  # full round trip, non-streaming
        r.raise_for_status()                 # fail loudly on auth or rate-limit errors
        u = r.json()["usage"]
        cost += (u["prompt_tokens"] * p["in"] + u["completion_tokens"] * p["out"]) / 1e6
    # n=100 yields 99 cut points: q[49] is the median, q[94] the 95th percentile.
    q = statistics.quantiles(lat, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "$/req": cost / len(PROMPTS)}

for name, p in PROVIDERS.items():
    print(name, bench(p))
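The harness expects a traffic_sample.jsonl file with one JSON object per line and a "prompt" field. If your logs are in another shape, a small helper along these lines can produce that file. The function names are hypothetical, and where the prompts come from is up to you.

```python
import json


def write_traffic_sample(prompts, path="traffic_sample.jsonl"):
    """Write one JSON object per line with a "prompt" field, the shape the harness reads."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # json.dumps escapes embedded newlines, so each prompt stays on one line.
            f.write(json.dumps({"prompt": prompt}) + "\n")
    return path


def read_traffic_sample(path="traffic_sample.jsonl"):
    """Read the prompts back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]
```

Sample prompts that reflect your real mix of short and long requests; a skewed sample will skew the p95.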

Real-time reasoning

Coding agents

Sub-second loops

Autocomplete, multi-step agent loops, and refactors that land before the developer breaks flow.

Voice

Real-time pipelines

Conversational interfaces where latency is the product. Tightest latency budgets in AI.

Enterprise Search

Reasoning at retrieval speed

Multi-hop retrieval, reranking, and summarization without blowing the latency budget.

Cerebras achieves its speed with wafer-scale custom chips, which come with capacity constraints and contract minimums.
Mercury 2 hits comparable throughput through a fundamentally different path: parallel diffusion-based generation running on widely available NVIDIA infrastructure.
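For intuition on why parallel generation helps, here is a toy count of sequential model passes. It is a simplified illustration and not a description of Mercury 2's actual algorithm: the assumption that each pass finalizes a fixed number of positions, and the parameter name tokens_finalized_per_pass, are hypothetical.

```python
import math


def autoregressive_passes(num_tokens):
    """Token-by-token decoding: one sequential forward pass per generated token."""
    return num_tokens


def parallel_refinement_passes(num_tokens, tokens_finalized_per_pass):
    """Toy parallel decoding: each pass finalizes several positions at once."""
    return math.ceil(num_tokens / tokens_finalized_per_pass)


if __name__ == "__main__":
    n = 512  # hypothetical completion length
    print("autoregressive passes:", autoregressive_passes(n))
    print("parallel passes (8 per pass):", parallel_refinement_passes(n, 8))
```

Fewer sequential passes only translate into lower latency if each pass stays cheap on the GPU, which is why measured throughput matters more than this count.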

Is the quality really there at this speed?

Mercury 2 sits in the speed-optimized tier alongside Claude Haiku 4.5 and GPT-5 Mini on Artificial Analysis’s Agentic Index. It does not compete with frontier models like Opus or GPT-5.

If your workload runs well on Haiku or Mini today, Mercury 2 will run it at roughly an order of magnitude more throughput. If you need frontier reasoning, you need a frontier model. We’re honest about which tier this is.

Benchmark it on your own workload

Spin up an API key, drop in the OpenAI-compatible endpoint, run it against Cerebras on your real prompts.
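As a sketch of what "drop-in" means in practice, the request below is identical for both providers apart from the base URL, model name, and API key. The URLs, model names, and environment variable names are taken from the harness on this page; check each provider's docs for current values. The response shape assumed here is the standard OpenAI chat completions format.

```python
import json
import os
import urllib.request

# (base URL, model, env var holding the key), as used in the harness on this page.
ENDPOINTS = {
    "mercury-2": ("https://api.inceptionlabs.ai/v1", "mercury-2", "INCEPTION_API_KEY"),
    "cerebras": ("https://api.cerebras.ai/v1", "llama-3.3-70b", "CEREBRAS_API_KEY"),
}


def build_chat_request(provider, prompt, api_key, max_tokens=1000):
    """Build (but do not send) an OpenAI-style chat completion request."""
    base_url, model, _ = ENDPOINTS[provider]
    body = {"model": model, "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    for provider, (_, _, key_var) in ENDPOINTS.items():
        api_key = os.environ.get(key_var)
        if not api_key:
            print(f"{provider}: set {key_var} to send a live request")
            continue
        req = build_chat_request(provider, "Say hello in five words.", api_key)
        with urllib.request.urlopen(req, timeout=60) as resp:
            reply = json.load(resp)
        # Assumes the OpenAI response shape: choices[0].message.content
        print(provider, reply["choices"][0]["message"]["content"])
```

If you already use an OpenAI-compatible client library, the equivalent change is usually just its base URL, API key, and model settings.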