
DeepSeek-V4 claims GPT-4.5 performance at $0.80 per million tokens - 84% cheaper

Chinese AI lab DeepSeek released V4, a 600B-parameter MoE model that matches GPT-4.5 Turbo on coding benchmarks while undercutting OpenAI's API pricing by 84%. The model ships with open weights under Apache 2.0, positioning it for enterprises with data sovereignty requirements. History suggests skepticism - verify benchmarks independently before migration.

DeepSeek-V4 Released: Performance Claims and Pricing

DeepSeek released V4 this morning - a 600-billion-parameter Mixture-of-Experts model claiming GPT-4.5 Turbo-level performance at $0.80 per million input tokens. OpenAI charges $5.00 per million input tokens for GPT-4.5 Turbo.

The headline benchmark: 94.1% on HumanEval (Python coding), surpassing GPT-4.5's 92.8%. Math reasoning (MATH benchmark) hit 78.2% versus OpenAI's 76.5%. General knowledge (MMLU-Pro) trails slightly at 89.4% versus 89.9%.

The model activates 45B parameters per token during inference despite the 600B total count - sparse activation cuts compute per token, though all 600B weights still need to be resident in memory for serving. DeepSeek ships the base weights under Apache 2.0; the fine-tuned chat version uses a stricter community license.
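
To make the sparse-activation idea concrete, here is a minimal top-k routed MoE layer in PyTorch. This is an illustrative sketch of the general technique, not DeepSeek's actual architecture - all dimensions, names, and the routing scheme are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Top-k routed Mixture-of-Experts feed-forward layer (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Score every expert, keep only the top k.
        scores, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)          # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed here
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# With 16 experts and k=2, only 1/8 of expert parameters run per token -
# the same principle behind "45B active out of 600B total".
layer = SparseMoELayer(d_model=64, d_ff=256, n_experts=16, k=2)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```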

What This Means in Practice

Three deployment scenarios matter:

API switching: DeepSeek's API uses OpenAI's format. Changing the base URL could cut costs 84% if benchmarks hold. The fine print: verify performance on your specific workloads first. Independent benchmarks matter more than lab reports.
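
Because the API follows the OpenAI wire format, switching can be as small as pointing the standard client at a different base URL. A hedged sketch - the endpoint URL and the "deepseek-v4" model name are assumptions to confirm against DeepSeek's docs:

```python
from openai import OpenAI

# Same client library; only the endpoint and key change.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com",  # assumed endpoint - verify
)

resp = client.chat.completions.create(
    model="deepseek-v4",  # assumed model name - verify
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```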

Local deployment: The 7B quantized version runs on Snapdragon Gen 5 chips. For enterprises with data sovereignty requirements - fintech, healthcare, defense - this enables on-premises inference without API calls to external servers.
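
For the fully-local path, the usual pattern is a quantized weights file loaded by an on-device runtime such as llama-cpp-python. The GGUF file name below is hypothetical - no specific quantized release is confirmed here - but the shape of the code is the point: weights on disk, inference in process, no network calls leaving the machine.

```python
from llama_cpp import Llama

# Hypothetical quantized weights file - substitute the actual release.
llm = Llama(model_path="./deepseek-v4-7b-q4.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this clause: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```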

Hybrid architectures: Run reasoning-heavy tasks on DeepSeek, keep latency-sensitive workloads on established providers. The 128k context window handles document analysis without chunking.
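
In practice a hybrid setup often reduces to a small routing table. The sketch below assumes two OpenAI-compatible endpoints; the model names are illustrative and the routing keys would come from your own task taxonomy.

```python
from openai import OpenAI

# Both endpoints speak the OpenAI wire format; model names are illustrative.
deepseek = OpenAI(api_key="DS_KEY", base_url="https://api.deepseek.com")
openai_client = OpenAI(api_key="OA_KEY")

ROUTES = {
    "analysis": (deepseek, "deepseek-v4"),     # reasoning-heavy, cost-driven
    "chat": (openai_client, "gpt-4.5-turbo"),  # latency-sensitive, user-facing
}

def complete(task_type: str, prompt: str) -> str:
    client, model = ROUTES[task_type]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```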

The Trade-offs

DeepSeek's efficiency comes from China's hardware constraints - U.S. export controls forced optimization. Their recent manifold-constrained hyper-connections paper shows stability improvements during training, but the flagship R2 model was delayed from its 2025 target due to chip shortages.

U.S. scrutiny increased last week when a House committee raised national security concerns over DeepSeek's alleged PLA integration for military applications. Enterprises routing production traffic through Chinese-backed infrastructure should review compliance requirements.

The "Silent Reasoning" module - chain-of-thought processing without token output - cuts API costs further. Clever engineering. Whether it translates to production reliability is the open question.

History Suggests Caution

We've seen this pattern before. A new model claims breakthrough performance at fraction-of-the-cost pricing. Early benchmarks look strong. Six months later, edge cases emerge and the total cost of ownership story gets complicated.

DeepSeek's track record on coding is solid - their previous models performed well on HumanEval. This release appears to extend that strength. But enterprises migrating production workloads should:

  • Run internal benchmarks on representative tasks (a minimal harness sketch follows this list)
  • Test failure modes and edge cases
  • Calculate true TCO including monitoring and debugging costs
  • Review data routing and sovereignty implications
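
For the first item, a harness can be very small: run the same prompts against both candidates and compare pass rates and latencies. Clients, model names, and validators below are placeholders to be replaced with your own production tasks.

```python
import time
from openai import OpenAI

# Placeholder clients - substitute real keys, endpoints, and model names.
CANDIDATES = {
    "deepseek-v4": OpenAI(api_key="DS_KEY", base_url="https://api.deepseek.com"),
    "gpt-4.5-turbo": OpenAI(api_key="OA_KEY"),
}

TASKS = [
    # (prompt, validator) pairs drawn from real production traffic
    ("Extract the invoice total from: ...", lambda out: "total" in out.lower()),
]

for model, client in CANDIDATES.items():
    passed, latencies = 0, []
    for prompt, check in TASKS:
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        latencies.append(time.perf_counter() - t0)
        passed += bool(check(resp.choices[0].message.content))
    print(f"{model}: {passed}/{len(TASKS)} passed, "
          f"avg latency {sum(latencies) / len(latencies):.2f}s")
```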

The efficiency breakthrough is real - Chinese labs trained competitive models on ~1% of U.S. resources. Whether that translates to enterprise reliability at scale, we'll see.