Techmeme 2026-05-01: CAISI Evaluation of DeepSeek V4 Pro (Summary)

Generated by Codex with GPT-5

What happened

Techmeme surfaced this item on May 3, 2026. The direct source used here is NIST’s May 1, 2026 report, CAISI Evaluation of DeepSeek V4 Pro.

The Center for AI Standards and Innovation (CAISI) evaluated DeepSeek V4 Pro, DeepSeek’s latest open-weight model, across cyber, software engineering, natural science, abstract reasoning, and math tasks. CAISI’s headline finding is deliberately two-sided: DeepSeek V4 Pro is the most capable Chinese model the group has evaluated so far, but it still appears to trail the leading US frontier models by about eight months in aggregate capability.

That matters because it cuts against the most optimistic reading of DeepSeek’s own launch materials. DeepSeek’s self-reported results made V4 Pro look roughly comparable to recent frontier models such as Anthropic Opus 4.6 and OpenAI GPT-5.4. CAISI’s benchmark suite tells a different story. On the agency’s mix of public, semi-private, and internally built evaluations, DeepSeek V4 Pro looks closer to GPT-5, a model CAISI describes as roughly eight months older than the current frontier.

The evaluation is especially interesting because the gap is not uniform. DeepSeek V4 Pro looks strong in math and competitive in some science benchmarks. It scores 97% on OTIS-AIME-2025, 96% on PUMaC 2024, 96% on SMT 2025, and 90% on GPQA-Diamond. It is also close to GPT-5.4 mini on SWE-Bench Verified, at 74% versus 73%.

The weaker points show up in the tasks that are harder to reduce to public leaderboard performance. DeepSeek V4 Pro scores 44% on PortBench, CAISI’s internal software-engineering benchmark for porting command-line tools between programming languages, compared with 60% for Opus 4.6 and 78% for GPT-5.5. On the CTF-Archive-Diamond cyber benchmark, DeepSeek V4 Pro is at 32%, level with GPT-5.4 mini but behind Opus 4.6 at 46% and GPT-5.5 at 71%. On ARC-AGI-2’s semi-private abstract reasoning set, DeepSeek V4 Pro scores 46%, behind Opus 4.6 at 63% and GPT-5.5 at 79%.

CAISI summarizes the aggregate result with an Item Response Theory-inspired capability score. In that framing, GPT-5.5 is far ahead at 1260 Elo, Opus 4.6 sits at 999, DeepSeek V4 Pro lands at 800, and GPT-5.4 mini lands at 749. That makes DeepSeek V4 Pro look meaningfully below the frontier, but also slightly above the cheaper US reference model that CAISI used for cost comparisons.
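One way to get a feel for these gaps is to read the scores on the conventional Elo scale, where a 400-point difference corresponds to 10-to-1 odds. That reading is an assumption: this summary does not say how CAISI defines its scale, so the sketch below is an interpretive aid rather than CAISI’s method. The names expected_score and gaps_versus are illustrative.

```python
# Assumption: the quoted scores behave like conventional Elo ratings
# (400-point logistic scale). The source does not confirm this.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Capability scores as quoted in the summary above.
SCORES = {
    "GPT-5.5": 1260,
    "Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
    "GPT-5.4 mini": 749,
}


def gaps_versus(reference: str) -> dict[str, tuple[int, float]]:
    """For each other model: (point gap to reference, expected score against it)."""
    ref = SCORES[reference]
    return {
        name: (score - ref, expected_score(score, ref))
        for name, score in SCORES.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, (gap, exp) in gaps_versus("DeepSeek V4 Pro").items():
        print(f"{name}: {gap:+d} points, expected score {exp:.2f}")
```

Under that assumption, the 51-point lead over GPT-5.4 mini is a modest edge, while the 460-point deficit to GPT-5.5 is a lopsided one.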

The cost comparison is the second half of the story. DeepSeek V4 Pro was more cost efficient than GPT-5.4 mini on five of the seven benchmarks CAISI included in its cost analysis. Across those benchmarks, DeepSeek ranged from 53% less expensive to 41% more expensive than GPT-5.4 mini. The token prices used in the analysis were developer-reported: DeepSeek V4 Pro at $1.74 per million uncached input tokens, $0.0145 per million cached input tokens, and $3.48 per million output tokens; GPT-5.4 mini at $0.75, $0.075, and $4.50 on the same dimensions.
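The quoted prices show why the comparison can swing either way: DeepSeek V4 Pro is more expensive per uncached input token but cheaper per cached input token and per output token, so the result depends on the token mix of each task. The sketch below works through one case. The token counts are hypothetical, not CAISI data, and it assumes both models consume the same number of tokens on a task, which real runs need not do. The names TokenPrices, run_cost, and relative_cost are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """USD per million tokens, as quoted in the summary above."""
    uncached_input: float
    cached_input: float
    output: float


DEEPSEEK_V4_PRO = TokenPrices(uncached_input=1.74, cached_input=0.0145, output=3.48)
GPT_5_4_MINI = TokenPrices(uncached_input=0.75, cached_input=0.075, output=4.50)


def run_cost(prices: TokenPrices, uncached_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one run given raw token counts."""
    total = (
        uncached_in * prices.uncached_input
        + cached_in * prices.cached_input
        + out * prices.output
    )
    return total / 1_000_000


def relative_cost(a: float, b: float) -> float:
    """Fractional cost of a relative to b; negative means a is cheaper."""
    return a / b - 1.0


if __name__ == "__main__":
    # Hypothetical agentic run with heavy prompt-cache reuse (not from the report).
    tokens = dict(uncached_in=200_000, cached_in=2_000_000, out=100_000)
    deepseek = run_cost(DEEPSEEK_V4_PRO, **tokens)
    mini = run_cost(GPT_5_4_MINI, **tokens)
    print(f"DeepSeek V4 Pro: ${deepseek:.3f}")
    print(f"GPT-5.4 mini:    ${mini:.3f}")
    print(f"Relative cost:   {relative_cost(deepseek, mini):+.1%}")
```

Shifting the same run toward uncached input raises DeepSeek’s relative cost, while shifting it toward cached input or output lowers it.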

Why it matters

The useful thing about this evaluation is that it reframes DeepSeek V4 Pro from “frontier parity” to “frontier pressure.”

DeepSeek’s own report emphasized a model that could compete with the strongest closed systems while remaining open-weight and efficient. CAISI’s evaluation makes that claim look too generous, at least on harder reasoning and agentic workloads. But the evaluation does not dismiss DeepSeek. It says the opposite in a more precise way: DeepSeek V4 Pro is the strongest Chinese model CAISI has measured, it is competitive with a cheaper US reference model, and it may be cheaper to run for many tasks at similar capability.

That is a more strategically useful conclusion than a simple winner-loser benchmark story. If a model is clearly worse than GPT-5.5 but cheaper and strong enough for software, science, and math workloads, it can still reshape the market. Enterprises do not always buy the smartest possible model. They buy the model that clears the quality bar at the best combination of price, latency, control, and deployment constraints. Open weights and lower operating costs can matter as much as leaderboard rank when a company is trying to run agentic workflows at scale.

The evaluation also shows why public benchmarks are becoming less sufficient. DeepSeek looks better on the benchmarks it highlighted than on CAISI’s held-out or less visible tests. That does not prove bad faith; every lab naturally presents its strongest numbers. But it does show why serious model comparisons need independent suites, controlled scaffolding, token-budget assumptions, and tasks that are not already part of the public optimization loop. Otherwise, the market ends up comparing press releases rather than usable capability.

There are caveats. CAISI is a US government body evaluating a Chinese model in a politically charged race, so the institutional context matters. The report also notes method-specific details that affect comparability: its SWE-Bench results can differ from other evaluators because of prompts, scaffolding, and token budgets; ARC-AGI-2 is semi-private rather than fully private; PortBench is not yet publicly described in depth; and two benchmarks were excluded from the cost comparison. The evaluation is useful evidence, not a final measurement of model quality.

Even with those caveats, the direction is clear. DeepSeek V4 Pro appears less like proof that open Chinese models have caught the US frontier and more like proof that they are close enough to matter economically. That distinction is important. The frontier labs may still have the best raw capability, especially on agentic reasoning, cyber, and abstract tasks. But if open-weight competitors continue improving while holding down cost, the premium for closed frontier access will come under pressure.

Takeaway

The strongest idea in this Techmeme-surfaced NIST piece is that the AI race is no longer captured by a single frontier ranking.

DeepSeek V4 Pro looks behind the best US models by CAISI’s measurement, but ahead of where many skeptics would have placed an open Chinese model only a short time ago. It is not a clear parity moment. It is also not a dismissal. It is a sign that the competitive gap is now complicated: US frontier labs lead on the hardest aggregate capability measures, while DeepSeek is pushing hard on openness, cost efficiency, and enough capability to be dangerous in practical markets.

That makes the evaluation a useful correction to the hype around DeepSeek’s own launch. The model may not be as close to GPT-5.5 or Opus 4.6 as selective public benchmarks implied. But if it can deliver performance near that of a smaller frontier-lab model such as GPT-5.4 mini at attractive economics, it still changes how buyers, labs, and policymakers should think about the next phase of the AI race. The pressure point is not just who has the best model. It is who can make capable models cheap, available, and deployable enough to become infrastructure.