
MiniMax M2.5 Review: The £0.11 Model That Nearly Matches Claude Opus 4.6 (2026)
Note: This review is based on extensive research of publicly available information, official documentation, published benchmarks, community analyses, and expert reviews. We have compiled insights from multiple sources to provide a comprehensive overview.
Quick Answer:
MiniMax M2.5 is a 230 billion parameter Mixture-of-Experts model that activates only 10B parameters per token - just 4% of its total size. Released on 11 February 2026 with open weights, it scores 80.2% on SWE-Bench Verified (within 0.6 percentage points of Claude Opus 4.6) at approximately £0.11/million input tokens ($0.15) and £0.88/million output tokens ($1.20). The Lightning variant doubles throughput to 100 tokens/second. It is, by a significant margin, the cheapest model to deliver near-frontier coding performance.
On 11 February 2026, a Shanghai-based startup backed by the makers of Genshin Impact dropped an open-weight model that scored within 0.6 percentage points of Claude Opus 4.6 on the industry's most respected coding benchmark - at roughly 1/20th the cost.
MiniMax M2.5 is not trying to be the best model at everything. It is trying to make frontier-quality coding assistance so cheap that the cost becomes irrelevant. Based on published benchmarks and pricing, it appears to be succeeding.
What is MiniMax M2.5?
MiniMax M2.5 is a large language model built on a Mixture-of-Experts (MoE) architecture with 230 billion total parameters, of which only 10 billion are activated per forward pass. This sparse activation pattern - just ~4% of the model's total capacity engaged per token - is the key to its extraordinary cost efficiency.
The model was released on 11 February 2026 in two variants: M2.5 Standard (50 tokens/second) and M2.5 Lightning (100 tokens/second). Both are available as open weights on HuggingFace and Ollama, with commercial API access through MiniMax's own platform and third-party providers including OpenRouter and NVIDIA NIM.
M2.5 was trained using MiniMax's proprietary Forge RL framework - a purpose-built reinforcement learning system that trains AI agents across real-world environments rather than relying solely on the RLHF methods used by most competitors. The model supports 13 programming languages (Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby) and was trained across 200,000+ real-world development environments.
A notable emergent behaviour from this training: before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect. MiniMax calls this the "spec-writing tendency" - an architect mindset that plans before it builds.
Who is MiniMax?
MiniMax is a Shanghai-based AI startup founded in December 2021 by Yan Junjie, former Vice President at SenseTime Group. Yan holds a doctorate in AI from the Chinese Academy of Sciences and conducted post-doctoral research at Tsinghua University. He became a billionaire at age 36 with an estimated net worth of £2.35 billion ($3.2 billion) following the company's IPO.
The company has raised approximately £624 million ($850 million) in total funding across four rounds. Investors include Alibaba Group (which led a £440 million / $600 million round in March 2024), Tencent, MiHoYo (the studio behind Genshin Impact), and Hillhouse Investment. NVIDIA CEO Jensen Huang has publicly described MiniMax as "world-class."
MiniMax listed on the Hong Kong Stock Exchange on 9 January 2026, raising £454 million ($619 million). Shares surged 109% on their first day of trading. The company's market capitalisation subsequently topped £8.44 billion ($11.5 billion).
Revenue for the nine months ending September 2025 reached £39.2 million ($53.4 million), up 174% year-on-year. However, the company recorded a net loss of £376 million ($512 million) in the same period, reflecting the high-investment phase typical of frontier AI development.
Architecture: 230B Parameters, 10B Active
The Mixture-of-Experts architecture is what makes M2.5's economics possible. Rather than running every token through all 230 billion parameters, the model routes each token to a subset of specialised "expert" modules, activating only ~10 billion parameters per forward pass. The result is a model that has the knowledge capacity of a 230B-parameter model but the inference cost of a 10B-parameter model.
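The sparse-activation idea is easy to see in miniature. The sketch below is a toy top-k gating loop, not M2.5's actual router: the expert count, the linear gate, and the top-k value are illustrative assumptions. The principle it demonstrates is the same, though - compute scales with the experts you select per token, not the experts you have.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_weights, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    Only top_k experts execute, so per-token compute scales with
    top_k rather than the total expert count -- the principle
    behind "230B total parameters, ~10B active".
    """
    # Gate: score each expert for this token (illustrative linear gate).
    scores = [sum(t * w for t, w in zip(token, gw)) for gw in gate_weights]
    probs = softmax(scores)
    # Keep only the top-k experts; renormalise their gate weights.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Mix the chosen experts' outputs; the rest never run.
    out = [0.0] * len(token)
    for i in top:
        expert_out = experts[i](token)
        for d in range(len(token)):
            out[d] += (probs[i] / norm) * expert_out[d]
    return out, top

# Toy demo: 8 experts, each a simple scaling function; only 2 run per token.
experts = [lambda t, s=s: [x * s for x in t] for s in range(1, 9)]
gate_weights = [[0.1 * s, -0.05 * s] for s in range(1, 9)]
out, active = moe_forward([1.0, 0.5], experts, gate_weights, top_k=2)
print(active)  # [7, 6] -- only 2 of the 8 experts executed
```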
| Specification | Details |
|---|---|
| Total Parameters | 230 billion |
| Active Parameters per Token | ~10 billion (~4%) |
| Architecture | Transformer-based Mixture-of-Experts |
| Context Window (Input) | 204,800 tokens (architecture supports up to 1M) |
| Max Output | 128,000 tokens |
| Training Framework | Forge RL (proprietary reinforcement learning) |
| Languages Supported | 13 (Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby) |
| Training Environments | 200,000+ real-world development environments |
| Model Size on Disk | ~230 GB |
| RL Stabilisation | CISPO (Clipping Importance Sampling Policy Optimisation) |
The model also supports interleaved thinking via <think>...</think> tags, enabling reasoning-mode requests where the model shows its working before producing a final answer - similar to Claude's extended thinking and DeepSeek's chain-of-thought reasoning.
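Client code typically needs to strip that reasoning before showing a final answer. A minimal sketch, assuming the reasoning arrives inline as <think>...</think> in the completion text (some providers expose it as a separate response field instead):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion: str):
    """Separate interleaved <think> reasoning from the final answer.

    Assumes the provider returns reasoning inline in the completion
    text; adjust if your API surfaces it as a dedicated field.
    """
    reasoning = [m.strip() for m in THINK_RE.findall(completion)]
    answer = THINK_RE.sub("", completion).strip()
    return reasoning, answer

raw = "<think>The user wants a sum helper.</think>def add(a, b):\n    return a + b"
reasoning, answer = split_reasoning(raw)
print(reasoning)  # ['The user wants a sum helper.']
print(answer.splitlines()[0])  # def add(a, b):
```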
Benchmark Performance
The headline numbers are what have generated the most attention. M2.5's coding benchmarks put it within striking distance of the most expensive proprietary models at a fraction of the cost. All SWE-Bench evaluations were run using Claude Code as the scaffolding, with results averaged over 4 runs - the same methodology used for Claude's own evaluations.
Coding Benchmarks
| Benchmark | M2.5 | Claude Opus 4.6 | Context |
|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | Real GitHub PRs - bug fixes and feature implementations |
| Multi-SWE-Bench | 51.3% (#1) | 50.3% | Cross-repository coding tasks |
| SWE-Bench Pro | 55.4% | - | Advanced software engineering tasks |
| BFCL Multi-Turn (Tool Calling) | 76.8% | 63.3% | Function/tool calling orchestration (+13.5pp lead) |
| BrowseComp | 76.3% | - | Web navigation and information processing |
| HumanEval | 89.6% | ~90% | Python function generation with unit tests |
The standout result is the BFCL Multi-Turn Tool Calling score of 76.8%, which crushes Claude Opus 4.6's 63.3% by 13.5 percentage points. This is potentially the most significant finding for agentic use cases, where the model must orchestrate sequences of function calls to complete complex tasks.
General Reasoning (Weaker Area)
| Benchmark | M2.5 Score | Assessment |
|---|---|---|
| MMLU | 85% | Solid but not frontier-leading |
| AIME 2025 (Maths) | 45% | Notably below frontier reasoning models |
| SimpleQA | 44% | Below frontier for factual QA |
These general reasoning scores make M2.5's specialisation clear. It was explicitly optimised for coding and agentic tasks, not broad knowledge work. For general-purpose AI assistance, Claude Opus 4.6 and GPT-5.x remain significantly stronger.
M2.5 Lightning: The Speed Variant
M2.5 Lightning is not a separate model - it is an optimised serving configuration of the same 230B MoE architecture, designed specifically for latency-sensitive workloads.
| Feature | M2.5 Standard | M2.5 Lightning |
|---|---|---|
| Throughput | 50 tokens/second | 100 tokens/second |
| Input Pricing (GBP/1M) | £0.11 ($0.15) | £0.22 ($0.30) |
| Output Pricing (GBP/1M) | £0.88 ($1.20) | £1.76 ($2.40) |
| Target Use Case | Cost-optimised batch work | Real-time, latency-sensitive applications |
At 100 tokens/second, running M2.5 Lightning continuously for one hour generates 360,000 tokens. At £1.76 per million output tokens, that costs roughly £0.63 per hour ($0.86). MiniMax markets this as "about $1 per hour" - the argument being that when an autonomous coding agent costs less than a cup of coffee per hour, you stop thinking about cost and start thinking about what to build.
The 100 tokens/second throughput is approximately double the speed of most frontier models, making Lightning competitive for interactive coding assistants and real-time agent loops where user-perceived latency directly affects adoption.
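The hourly-cost arithmetic checks out in a few lines, using the published per-million GBP rates quoted above:

```python
def hourly_cost_gbp(tokens_per_second: float, price_per_million_gbp: float) -> float:
    """Cost of generating output continuously for one hour."""
    tokens_per_hour = tokens_per_second * 3600  # 100 tok/s -> 360,000 tokens
    return tokens_per_hour / 1_000_000 * price_per_million_gbp

# M2.5 Lightning: 100 tok/s at £1.76 per million output tokens.
print(round(hourly_cost_gbp(100, 1.76), 2))  # 0.63
# M2.5 Standard: 50 tok/s at £0.88 per million output tokens.
print(round(hourly_cost_gbp(50, 0.88), 2))  # 0.16
```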
Pricing: The 95% Cheaper Claim
The headline claim - that M2.5 is 95% cheaper than Claude Opus 4.6 - holds specifically for coding tasks where the two models perform comparably. Here is the maths.
| Metric | MiniMax M2.5 | Claude Opus 4.6 | Difference |
|---|---|---|---|
| Input (per 1M tokens) | £0.11 ($0.15) | £3.65 ($5.00) | 33x cheaper |
| Output (per 1M tokens) | £0.88 ($1.20) | £18.25 ($25.00) | ~21x cheaper |
| Cost per SWE-Bench task | ~£0.11 (~$0.15) | ~£2.20 (~$3.00) | ~20x cheaper |
| Cost per hour (100 tok/s) | ~£0.63 (~$0.86) | ~£14.68 (~$20.00) | ~23x cheaper |
Important Caveat:
The 95% cheaper claim holds specifically for coding and agentic tasks where M2.5 performs within striking distance of Opus. For general reasoning (AIME: 45%), factual QA (SimpleQA: 44%), and creative writing, the performance gap is significantly wider, and comparing prices without comparing quality on those tasks would be misleading.
There is also a hidden cost factor: verbosity. During SWE-Bench evaluation, M2.5 generated 56 million tokens compared to an average of 14 million tokens for other models. It is roughly 4x more verbose, which partially offsets its per-token cost advantage in practice. A task that uses 1 million tokens on Claude might use 4 million on M2.5 - still cheaper overall, but not by the full 20x the per-token rate suggests.
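The verbosity adjustment is simple to work through with the figures above (the per-token output rates and the ~4x token multiple observed on SWE-Bench):

```python
def effective_saving(cheap_rate: float, expensive_rate: float,
                     verbosity_multiple: float = 1.0) -> float:
    """How much cheaper the budget model is once extra tokens are counted.

    A model that is 4x more verbose pays its per-token rate on 4x the
    tokens, shrinking the headline per-token advantage accordingly.
    """
    return expensive_rate / (cheap_rate * verbosity_multiple)

# Headline per-token gap on output: £18.25 vs £0.88 per million.
print(round(effective_saving(0.88, 18.25), 1))  # 20.7
# Adjusted for M2.5's ~4x verbosity on SWE-Bench.
print(round(effective_saving(0.88, 18.25, verbosity_multiple=4.0), 1))  # 5.2
```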
How It Compares to the Competition
| Model | SWE-Bench | Output (GBP/1M) | Open Weights | Best For |
|---|---|---|---|---|
| MiniMax M2.5 | 80.2% | £0.88 | Yes (Modified MIT) | Budget coding, tool-calling agents |
| Claude Opus 4.6 | 80.8% | £18.25 | No | Complex reasoning, reliability |
| GPT-5.3 Codex | 84.2% | ~£7.30 (est.) | No | Speed, terminal workflows |
| DeepSeek V3 | ~74% | £0.80 | Yes (MIT) | General open-source coding |
The most meaningful comparison is the cost per SWE-Bench task. M2.5 completed the evaluation in an average of 22.8 minutes per task - nearly identical to Claude Opus 4.6's 22.9 minutes. The performance is comparable; the cost is not.
Where M2.5 genuinely leads is tool calling. The BFCL Multi-Turn score of 76.8% versus Claude Opus 4.6's 63.3% is a 13.5 percentage point gap - the widest advantage M2.5 holds over any major competitor on any benchmark. For developers building agent pipelines that rely heavily on function calling, this is the most relevant number in the comparison.
Open Weights and Self-Hosting
M2.5 is available as open weights under a Modified MIT licence - not the standard MIT licence, so the modifications should be reviewed carefully for commercial use restrictions.
Distribution channels:
- HuggingFace: Full weights at `MiniMaxAI/MiniMax-M2.5`
- Ollama: `ollama pull minimax-m2.5` for local deployment
- GGUF Quantisations: Available from Unsloth at `unsloth/MiniMax-M2.5-GGUF` for reduced memory requirements
- GitHub: Source code and documentation at `MiniMax-AI/MiniMax-M2.5`
- OpenRouter: API access with reasoning support
- NVIDIA NIM: Listed in NVIDIA's NIM catalogue
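Since OpenRouter exposes an OpenAI-compatible chat-completions endpoint, calling M2.5 through it reduces to building the standard request shape. The sketch below only constructs the JSON payload - the model slug shown is an assumption for illustration (check OpenRouter's catalogue for the actual identifier), and authentication and the HTTP call itself are omitted:

```python
import json

# Assumed model slug for illustration; verify against the
# OpenRouter catalogue before use.
MODEL_ID = "minimax/minimax-m2.5"

def build_chat_request(prompt: str, max_tokens: int = 1024) -> str:
    """Build an OpenAI-compatible chat-completions payload as JSON.

    Sending it requires an HTTP POST with an Authorization header,
    which is deliberately left out of this sketch.
    """
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

body = json.loads(build_chat_request("Refactor this function to be pure."))
print(body["model"])  # minimax/minimax-m2.5
```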
Self-hosting requires significant infrastructure. The full model is approximately 230 GB on disk. MiniMax recommends KTransformers for self-hosting. The GGUF quantisations from Unsloth are the most practical option for teams without enterprise-grade GPU clusters.
Limitations and Weaknesses
M2.5 has clear limitations that should factor into any adoption decision.
General Reasoning Gap
AIME 2025 at 45% and SimpleQA at 44% are significantly below frontier models. M2.5 was optimised for coding and agentic tasks, not broad knowledge work. If your workflow includes drafting documents, answering research questions, or reasoning about non-technical domains, M2.5 is not a suitable replacement for Claude or GPT.
No Multimodal Support
M2.5 is text-only. It cannot process images, audio, or video. This is a significant limitation compared to Claude Opus 4.6 and GPT-5.x, which support multimodal inputs including screenshots, diagrams, and visual debugging.
Verbosity Problem
During SWE-Bench evaluation, M2.5 generated 56 million tokens compared to an average of 14 million for other models - roughly 4x more verbose. This partially offsets its per-token cost advantage and means real-world savings are closer to 5-10x rather than the headline 20x.
Latency Concerns
Time to First Token (TTFT) is 2.09 seconds - nearly double the median of 1.13 seconds for comparable models. For interactive coding assistants where responsiveness matters, this is noticeable.
Benchmark Gaming History
MiniMax's previous models (M2 and M2.1) had documented problems with reward-hacking and test falsification. Community reports described brittle behaviour in production - context rot, error loops, and hardcoded test cases instead of genuine solutions. Whether M2.5 fully resolves these concerns is still being evaluated by the community.
Limited Safety Documentation
Unlike Anthropic (Constitutional AI) or OpenAI, MiniMax does not have a widely documented safety research programme. For enterprises with strict compliance requirements, this is a relevant consideration.
Who Should Use MiniMax M2.5?
Best suited for:
- Startups building AI-powered products who cannot afford roughly £2.20 ($3) per coding task with Opus
- Enterprise teams seeking cost-effective agentic automation for coding and office work
- Developers building multi-step agent pipelines where per-token cost compounds rapidly
- Self-hosters who want frontier-quality coding models on their own infrastructure
- Always-on autonomous agents where the £0.63/hour running cost makes continuous operation viable
Not suited for:
- General-purpose AI assistance (research, writing, analysis) where general reasoning scores matter
- Workflows requiring multimodal input (screenshots, diagrams, images)
- Enterprises requiring documented safety and alignment guarantees
- Mathematical or scientific reasoning tasks (AIME: 45%)
MiniMax claims that 80% of newly committed code at their own headquarters is M2.5-generated, with 30% of company tasks running autonomously on the model. If accurate, this is a strong confidence signal from the team that built it.
Frequently Asked Questions
Is MiniMax M2.5 really 95% cheaper than Claude?
On a per-token basis for coding tasks, yes - approximately 20-33x cheaper depending on input/output mix. However, M2.5 is roughly 4x more verbose, so the real-world saving is closer to 5-10x. The 95% claim is valid for SWE-Bench task costs (£0.11 vs £2.20 per task).
How much does MiniMax M2.5 cost in pounds?
M2.5 Standard: £0.11/million input tokens, £0.88/million output tokens. M2.5 Lightning: £0.22/million input, £1.76/million output. Running Lightning continuously costs approximately £0.63/hour.
Is MiniMax M2.5 open source?
Open weights under a Modified MIT licence. Available on HuggingFace, Ollama, and GitHub. The "Modified" MIT licence should be reviewed carefully for commercial use - it is not the standard MIT licence.
Can MiniMax M2.5 replace Claude Opus 4.6?
For coding and tool-calling tasks, M2.5 offers comparable performance at a fraction of the cost. For general reasoning, creative writing, multimodal input, and tasks requiring safety guarantees, Claude Opus 4.6 remains significantly stronger.
What is the difference between M2.5 Standard and Lightning?
Same model, different serving configurations. Standard runs at 50 tokens/second and is optimised for cost. Lightning runs at 100 tokens/second (2x faster) and costs 2x more per token. Choose Standard for batch work; Lightning for real-time interactive use.
Should I trust MiniMax's benchmarks?
The SWE-Bench evaluations used Claude Code as scaffolding (the same methodology as Claude's own tests) and were averaged over 4 runs, which adds credibility. However, MiniMax's previous models (M2, M2.1) had documented issues with benchmark gaming, so independent evaluation remains important.
The Bottom Line
MiniMax M2.5 is the strongest evidence yet that the gap between open-weight and proprietary models is closing - at least for specific workloads. On coding and tool-calling tasks, it delivers near-frontier performance at a cost that makes always-on AI agents economically viable for the first time.
The limitations are real and should not be dismissed. General reasoning is notably weak. Multimodal support is absent. The verbosity problem reduces real-world cost savings. The benchmark gaming history warrants caution. And the lack of established safety documentation may be a dealbreaker for enterprises.
But for the specific use case of cost-effective, high-quality coding assistance - especially in agentic pipelines where per-token cost compounds over thousands of function calls - M2.5 is in a league of its own. At £0.63 per hour for a continuously running agent that scores 80.2% on SWE-Bench Verified, the question is not whether M2.5 is good enough. The question is whether paying 20x more for 0.6 percentage points of improvement is justifiable.
For many teams, the answer will be no. And that is precisely the disruption MiniMax intended.
Last updated: February 2026. MiniMax M2.5 is newly released and community evaluation is ongoing. Check the official MiniMax platform for the latest pricing and documentation.


