MiniMax M3: How to Use It for Free, Full Benchmarks & Complete Review (2026)

Aditya Kachhawa

37 min read
AI & Machine Learning
MiniMax M3 free open-weights AI model — frontier coding, 1M token context and native video input, released June 2026
TL;DR — MiniMax M3 launched June 1, 2026 as the first open-weights AI model to combine three things that have never existed together in one free model: frontier-level coding, a genuine 1 million token context window, and native understanding of text, images, and video all at once. Vendor-reported benchmarks put it ahead of GPT-5.5 and Gemini 3.1 Pro on real-world software engineering. The weights are live on Hugging Face. You can start using it right now for free through MiniMax's chat interface, OpenRouter, or by downloading the weights yourself. This article walks you through all of it.

Something quietly significant has been happening in AI over the last year or so. The gap between what big proprietary labs like OpenAI and Anthropic offer and what you can get completely free has been shrinking fast. Models that would have cost serious money to access two years ago are now available as free downloads.

MiniMax M3 is the sharpest example of that shift yet. A Shanghai-based lab most people outside China have never heard of just released a model that does three things at once that no free model has done before: serious coding, a working million-token memory, and native video understanding. All free to download.

This guide is written for everyone whether you've never used an AI model before or you spend your days wiring LLMs into production pipelines. I'll explain the concepts clearly without dumbing them down, and give you the code and commands you actually need.

What Is MiniMax M3?

MiniMax M3 is the flagship AI model from MiniMax (officially Xiyu Technology), a Shanghai-based AI lab founded in 2021. They've been listed on the Hong Kong Stock Exchange since January 2026 so this isn't a scrappy startup, it's a public company with serious research and deployment infrastructure.

If you're new to AI: Think of M3 as a very powerful AI brain that can read text, look at images, watch videos, write code, answer questions, and help you solve complex problems. You can use it through a chat interface in your browser similar to how you'd use ChatGPT or plug it into your own software via an API.

M3 launched on June 1, 2026, with the API going live on May 31. The open weights — the actual model files you can download and run yourself are now publicly available on Hugging Face, confirmed at approximately 428 billion total parameters (more on what that means below).

Here's what makes M3 stand out: it's the first model of any kind to deliver all three of these at once:

  1. Frontier-level coding --> 59% on SWE-Bench Pro (vendor-reported), beating GPT-5.5 and Gemini 3.1 Pro
  2. A genuine 1 million token context window --> powered by a new architecture that cuts compute cost to 1/20th of its predecessor
  3. Native multimodality --> text, images, and video as inputs, trained together from the ground up, not bolted on

That combination hasn't existed in an open-weights model before.

Key Specs at a Glance

PropertyDetails
Model NameMiniMax M3
DeveloperMiniMax (Xiyu Technology), Shanghai
Release DateJune 1, 2026 (API); weights followed within 10 days
Total Parameters~428 Billion (Sparse MoE)
Active Parameters per Query~23 Billion
Context WindowUp to 1,000,000 tokens (512K guaranteed; >512K access-gated)
Max Output Tokens512,000 tokens
ArchitectureSparse MoE + MiniMax Sparse Attention (MSA)
Input ModalitiesText, Image, Video
Output ModalityText only
LicenseMiniMax Community License (read before commercial use)
Weights AvailableYes — HuggingFace: MiniMaxAI/MiniMax-M3
API Pricing (promo)$0.30 / $1.20 per 1M input/output tokens
API Pricing (standard)$0.60 / $2.40 per 1M input/output tokens
Subscription Entry$20/month (Plus plan, ~1.7B tokens)
Inference FrameworksSGLang (recommended), vLLM, Transformers, KTransformers

↔️ Scroll horizontally to see all columns

Quick note on "428 billion parameters": Parameters are the internal numerical values the model has learned during training basically, the size of its knowledge and reasoning capability. More parameters generally means more capable, but also bigger files and more computing power required. For comparison, GPT-3 had 175 billion. M3 is more than twice that size. The clever part is that it only uses about 23 billion of those parameters per question a design choice called Mixture-of-Experts that keeps it fast and affordable to run.

The Three Things That Make M3 Different

1. Coding Performance That Actually Holds Up

Before diving into numbers, a quick caveat worth knowing: software engineering benchmarks are vendor-reported. Labs run their own models on their own infrastructure with their own tooling. That doesn't make the results fake, but it means you should treat them as directional, not gospel and test on your own work before committing.

With that said: MiniMax's SWE-Bench Pro score of 59.0% is consistent with independent evaluation from Artificial Analysis (who rated M3 at 55 on their Intelligence Index) and has been covered by multiple tech outlets. Both data points point to a genuinely strong coding model.

What does SWE-Bench Pro actually measure? It's a benchmark that gives an AI real GitHub issues from real software projects and asks it to write the code that fixes the bug or adds the feature. Not toy examples actual production code problems. A 59% score means M3 solved 59 out of every 100 such problems correctly.

What I find more convincing than the benchmark number is what MiniMax chose to highlight as a real-world demo: M3 reportedly reproduced the experiments of an entire ICLR machine learning research paper in 12 hours and ran autonomously for 24 hours straight on a CUDA kernel optimization task, making 1,959 tool calls without stopping or losing context. You can dismiss a benchmark number. A 24-hour uninterrupted run is a different kind of signal.

2. A 1 Million Token Context That Is Actually Affordable to Use

For beginners: A "context window" is how much text an AI can read and remember at once. Most models cap out around 128,000–200,000 tokens (roughly 100,000–150,000 words). A 1 million token context means M3 can hold roughly 750,000 words in memory at once that's like reading seven full novels before answering your question. For developers, this means you could feed it your entire codebase and have it reason about all of it simultaneously.

But context windows are often marketing. A 1M window that costs $50 to fill is a feature nobody uses in practice.

M3's costs are different. At promo pricing ($0.30/$1.20 per million tokens), a typical coding agent task consuming 500,000 input and 100,000 output tokens costs approximately $0.27. The equivalent task on Claude Opus costs roughly $5.00. That's an 18× cost difference and it's what moves long context from marketing to reality.

The architecture behind this is called MiniMax Sparse Attention (MSA) explained in detail later, but the headline is: it delivers 9× faster prefill, 15× faster decoding, and 1/20th the per-token compute compared to M3's predecessor at 1M context length.

3. Native Multimodality — Not Bolted On

For beginners: "Multimodal" means the AI understands multiple types of input not just text, but also images and video. Many AI models claim image support, but it was added as an afterthought.

M3 is different because MiniMax trained it on text, images, and video from the very first training step --> a design they call "mixed-modality training." The practical difference is significant: when you give M3 a screenshot of a UI and ask it to write code that replicates it, it doesn't translate the image into a text description first. It reasons across both simultaneously. When you upload a video of a bug and ask for a fix, it processes the visual and temporal information natively.

No other open-weights model at this capability level does all three modalities natively as of June 2026.

How to Use MiniMax M3 for Free — 6 Methods

Here's every legitimate way to use MiniMax M3 without paying, ordered from easiest to most technical. No credit card required for any of these.

Method 1: MiniMax Chat — Zero Setup, Works in Your Browser

The simplest option. MiniMax runs a consumer chat product that anyone can use immediately.

MiniMax M3 free chat interface at chat.minimax.io showing conversation window with image and video upload support, no credit card required, June 2026
MiniMax Chat at chat.minimax.io — free access to MiniMax M3 with no credit card required. Supports text, image, and video input natively. Screenshot: June 2026

Step-by-step:

  1. Go to chat.minimax.io
  2. Click "Sign Up" — only your email address is required
  3. Verify your email
  4. You're in — MiniMax M3 is the default model
  5. Start chatting, paste code, upload images, or upload a short video

What you get on the free tier:

  • Full M3 access for chat and coding tasks
  • Image and video upload supported
  • Daily message limits apply (enough to properly evaluate the model)

Best for: Anyone who just wants to try M3 right now whether you're a developer or just curious. This is also the best way to test the image and video features without any setup.

Method 2: MiniMax Code — The Free Coding Agent (Best for Developers)

MiniMax built a dedicated coding agent called MiniMax Code at code.minimax.io. Think of it as their version of Claude Code an AI that can work through your actual project files, not just answer questions about code in isolation.

Step-by-step:

  1. Go to code.minimax.io
  2. Sign in with your MiniMax account (same login as chat.minimax.io)
  3. Optional --> install the CLI to connect it to local projects:
Bash / Terminal
npm install -g @minimax/code
minimax-code login
  1. Point it at your project:
Bash / Terminal
cd your-project
minimax-code
  1. Ask it to write, refactor, debug, or document code across your entire repo

What you get free:

  • Full M3 coding agent experience
  • Multi-file context handling (this is where the 1M context window shines)
  • Tool calling and terminal execution
  • Free daily quota for testing

This is the same setup MiniMax used for the 24-hour autonomous engineering run mentioned earlier. It's built for deep, long-running work not one-off code generation.

Method 3: OpenRouter — No MiniMax Account Needed

OpenRouter is a single platform that gives you access to 400+ AI models through one unified API. If you want to try M3 without creating a MiniMax account, this is your fastest path and the easiest way to compare M3 head-to-head with other models before committing to any single provider.

OpenRouter playground with MiniMax M3 model selected showing model ID minimax slash minimax-m3 at 0.30 dollars per million input tokens with free trial credits, June 2026
MiniMax M3 on OpenRouter Playground — free trial credits, no MiniMax account needed. Model string: minimax/minimax-m3. Pricing: $0.30/1M input tokens at promo rate. Source: openrouter.ai, June 2026

Step-by-step:

  1. Go to openrouter.ai
  2. Sign up with Google or email --> you'll get free credits automatically
  3. Go to openrouter.ai/playground
  4. Search for minimax and select MiniMax: MiniMax M3
  5. Type your prompt and run

For developers — Python (OpenAI SDK):

Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"]
)

response = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[
        {"role": "user", "content": "Explain how MiniMax Sparse Attention works in simple terms."}
    ]
)

print(response.choices[0].message.content)

Current pricing on OpenRouter: $0.30/1M input, $1.20/1M output (promo rate verify current rates at openrouter.ai before committing).

Best for: Developers who want to evaluate M3 against other models side-by-side, or anyone who wants a single API key that switches between models without juggling accounts.

Method 4: NVIDIA NIM Free API — No Account Required (Fastest Developer Path)

This is the fastest zero-cost API path, confirmed live. NVIDIA hosts MiniMax M3 on their NIM (NVIDIA Inference Microservices) platform at build.nvidia.com/minimaxai/minimax-m3 and provides a free API endpoint, no MiniMax account, no credit card, just an NVIDIA developer key. The endpoint uses the standard OpenAI-compatible chat completions format, so existing OpenAI SDK code works with two-line changes.

Nvidia minimax m3 free api endpoint for individuals
MiniMax M3 on NVIDIA NIM — free API endpoint, no MiniMax account needed. Source: build.nvidia.com, June 2026

Step-by-step:

  1. Go to build.nvidia.com/minimaxai/minimax-m3
  2. Click "Generate API Key" --> a free NVIDIA developer account is all you need
  3. Install the requests library (or the OpenAI SDK both work): pip install openai
  4. Use it in your code:

Python — using requests (as shown on NVIDIA's official page):

Python
import requests
import os

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json"
}

payload = {
    "model": "minimaxai/minimax-m3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function that parses nested JSON safely, with full error handling."}
    ],
    "max_tokens": 4096,
    "temperature": 1.00,
    "top_p": 0.95,
    "stream": False
}

response = requests.post(invoke_url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])

Python — using the OpenAI SDK (same endpoint, cleaner syntax):

Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"]
)

response = client.chat.completions.create(
    model="minimaxai/minimax-m3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function that parses nested JSON safely, with full error handling."}
    ],
    temperature=1.0,
    max_tokens=4096
)

print(response.choices[0].message.content)

Sending images or video (NVIDIA NIM supports multimodal input):

Python
# Pass a list of content parts — URLs or base64 data URIs both work
payload["messages"] = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe what's happening in this video."},
        {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
    ]}
]

Key details confirmed from official NVIDIA docs:

PropertyValue
Endpointhttps://integrate.api.nvidia.com/v1/chat/completions
Model stringminimaxai/minimax-m3 (lowercase, with provider prefix)
Auth headerAuthorization: Bearer $NVIDIA_API_KEY
Free endpoint✅ Available (confirmed on build.nvidia.com)
StreamingSupported — set "stream": true and Accept: text/event-stream
Multimodal✅ Text, image, and video input supported
Context window1,000,000 tokens
Non-commercial useThis endpoint is governed by NVIDIA's API Trial Terms + MiniMax Non-Commercial License
API referencedocs.api.nvidia.com/nim/reference/minimaxai-minimax-m3

↔️ Scroll horizontally to see all columns

⚠️ License note: NVIDIA's hosted MiniMax M3 endpoint is explicitly marked for non-commercial use on the NIM model card. The governing licenses are the NVIDIA API Trial Terms of Service and the MiniMax Non-Commercial License. For commercial production use, switch to Method 5 (MiniMax's own API) or Method 6 (self-hosted weights under the Community License read that license carefully).

Best for: Developers who want zero-friction API access to M3 right now no MiniMax account, no billing setup. One NVIDIA key gets you access to M3 and dozens of other frontier models on the same platform. Ideal for evaluation, prototyping, and non-commercial projects.

Method 5: MiniMax Official API (Best for Commercial Production Use)

The official MiniMax API at platform.minimax.io is the right path when you need commercial licensing, direct pricing control, or features like the service_tier=priority option for stable high-load throughput. The API is fully OpenAI-compatible if you've used the OpenAI SDK before, you change two lines.

Step-by-step:

  1. Go to platform.minimax.io
  2. Sign up and verify your email
  3. Navigate to "API Keys" and create a key (pay-as-you-go)
  4. Install the OpenAI SDK if you haven't already: pip install openai
  5. Use it in your code:
Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

response = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Write a Python function that parses nested JSON safely, with full error handling."
        }
    ],
    temperature=1.0,
    max_tokens=4096
)

print(response.choices[0].message.content)

Alternatively, use the Anthropic SDK (MiniMax supports both formats — confirmed in official docs):

Python
import anthropic
import os

# Set these environment variables before running:
# export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
# export ANTHROPIC_API_KEY=your_minimax_api_key

client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that parses nested JSON safely, with full error handling."
        }
    ]
)

print(message.content[0].text)

Pricing note (confirmed from official MiniMax blog): MiniMax bills at a standard rate for calls with ≤512K input tokens, and at a higher long-context rate for calls above 512K. Promo pricing during launch was $0.30/$1.20 per million input/output tokens (standard tier). Verify current rates at platform.minimax.io before building.

MiniMax M3 API setup on platform.minimax.io showing API key creation and official pricing for MiniMax M3 access in 2026
MiniMax Open Platform — create an API key to access MiniMax M3 via OpenAI-compatible or Anthropic-compatible APIs. Usage requires account balance, a Token Plan, purchased Credits, or Team-assigned resources. Source: platform.minimax.io, June 2026

Best for: Commercial production workloads, teams that need SLA and priority tier access (service_tier=priority), and developers who want direct billing control without routing through a third-party aggregator.

Method 6: Download Weights and Self-Host (For Teams with the Hardware)

If you have data privacy requirements, or you want zero per-query cost at scale, you can run M3 entirely on your own infrastructure.

Bash / Terminal
# Install git-lfs (required for large model files)
git lfs install

# Download the weights from Hugging Face
huggingface-cli download MiniMaxAI/MiniMax-M3 --local-dir ./MiniMax-M3

Hardware requirements:

SetupVRAM NeededContext Supported
FP8, 4× A100 80GB GPUs~320GBUp to ~256K tokens
FP8, 8× H100 80GB GPUs~640GBUp to 1M tokens
INT4 quantized (GGUF)~100GB+Up to 128K (no MSA)

↔️ Scroll horizontally to see all columns

This is data-center hardware not something you'd run on a personal machine. If you're an individual developer, Methods 1–5 are more practical.

MiniMax M3 official Hugging Face model card at huggingface.co/MiniMaxAI showing 428 billion parameter open-weights files available for download under MiniMax Community License, June 2026
MiniMax M3 on Hugging Face — 428B open-weights model, freely downloadable. License: MiniMax Community License (commercial use requires review). Source: huggingface.co/MiniMaxAI/MiniMax-M3, June 2026

Don't have A100s or H100s in-house? Most developers rent cloud GPUs to test large open-weights models like MiniMax M3 before buying hardware.

→ Rent GPU compute on Runpod: https://runpod.io?ref=39rjwy2z) -->hourly A100/H100 GPU instances for testing self-hosted MiniMax M3 without long-term hardware commitment.
Disclosure: Referral link. I may earn a commission if you sign up, at no extra cost to you.

License note: MiniMax M3 uses the MiniMax Community License, not MIT or Apache 2.0. Commercial use restrictions apply. Read the full license at huggingface.co/MiniMaxAI/MiniMax-M3 before building a product on the self-hosted weights.

MiniMax M3 Benchmark Results

All scores below are from MiniMax's official published results unless noted otherwise. Vendor-reported scores are directional signals treat them as such, and test on your own workloads before making production decisions.

MiniMax M3 benchmark results chart showing 59 percent SWE-Bench Pro, 66 percent Terminal-Bench 2.1, 74.2 percent MCP Atlas scores versus competing AI models June 2026
MiniMax M3 official benchmark results — 59% SWE-Bench Pro (beats GPT-5.5), 66% Terminal-Bench 2.1, 74.2% MCP Atlas. Note: vendor-reported scores run on MiniMax's own infrastructure. Source: MiniMax official announcement, June 1, 2026

Coding and Software Engineering

BenchmarkM3 ScoreWhat It Measures
SWE-Bench Pro59.0%Fixing real GitHub issues in real production codebases
Terminal-Bench 2.166.0%Completing tasks autonomously in a terminal/CLI environment
SWE-fficiency34.8%Solving software problems efficiently (not just correctly)
KernelBench Hard28.8%Optimizing GPU (CUDA) kernel code — very hard, niche benchmark
MCP Atlas74.2%Calling external tools correctly as an AI agent

↔️ Scroll horizontally to see all columns

Source: MiniMax official announcement, June 1, 2026.

Autonomous Browsing

BenchmarkMiniMax M3Claude Opus 4.7What It Tests
BrowseComp83.579.3Autonomous web browsing and information retrieval
SVG-BenchSurpasses Opus 4.7ReferenceWriting code to generate SVG graphics

↔️ Scroll horizontally to see all columns

The BrowseComp result is notable M3 beating Claude Opus 4.7 by 4 points on autonomous browsing. This isn't just about clicking links. It requires the model to hold its goal in mind across hundreds of tool calls without getting confused or losing the thread. That kind of sustained focus is genuinely hard.

Independent Third-Party Evaluation (Artificial Analysis)

These numbers come from Artificial Analysis — an independent firm that evaluates AI models using their own benchmarks, not the vendor's. This is the more reliable comparison for cross-model ranking.

rtificial Analysis Intelligence Index chart showing MiniMax M3 ranked number 1 among 217 models in its pricing tier with a score of 55 as of June 2026
MiniMax M3 on the Artificial Analysis Intelligence Index — independent evaluation scoring M3 at 55, ranked #1 in its pricing tier among 217 models. Source: Artificial Analysis (artificialanalysis.ai/models/minimax-m3), June 2026
MetricScore / Result
AA Intelligence Index55
Ranking#1 of 217 models in its pricing tier
Output tokens per evaluation91M (very verbose — peers average ~23M)
Speed85.4 tokens/second (above average)
Time to first token3.49 seconds (slightly slower to start than peers)

↔️ Scroll horizontally to see all columns

Source: Artificial Analysis, artificialanalysis.ai/models/minimax-m3

One thing worth flagging: M3 generated 91 million tokens across the AA evaluation, compared to an average of 23 million for its peers. It thinks out loud extensively. For complex reasoning tasks that's often an advantage. For production cost, it's a factor you need to account for.

The Long-Horizon Tests (The Most Convincing Ones)

Beyond benchmark percentages, here's what M3 actually did when left to run:

  • ICLR paper reproduction: Autonomously replicated the full experimental pipeline of a machine learning research paper in 12 hours from reading the paper to running experiments to producing results
  • CUDA kernel optimization: Ran for 24 consecutive hours, made 1,959 tool calls, kept improving performance after multiple plateaus, never reset or lost context

These aren't pass/fail benchmark scores. They're descriptions of real behavior. That kind of sustained, coherent, multi-day agentic work is genuinely hard to fake with benchmark tuning.

MiniMax M3 vs GPT-5.5 vs Claude Opus 4.7

Honest comparison based on sourced data. Where competitor figures weren't available from cited sources, the cell is left blank rather than filled with estimates.

DimensionMiniMax M3GPT-5.5Claude Opus 4.7
SWE-Bench Pro59.0% (vendor)Below 59% (per MiniMax)Close to M3 (per MiniMax)
BrowseComp83.579.3
AA Intelligence Index55 (#1 in price tier)
Context Window1M tokensNot published
Input ModalitiesText + Image + VideoText + ImageText + Image
Open WeightsYes (Community License)NoNo
API Input Price$0.30/1M (promo)~$10/1M~$5/1M
API Output Price$1.20/1M (promo)~$30/1M~$25/1M
Video InputYesNoNo
Self-HostingYes (license check required)NoNo
Cost per Task (est.)~$0.03–$0.27Significantly higherSignificantly higher

↔️ Scroll horizontally to see all columns

The honest bottom line: the competitor benchmark comparisons are vendor-reported and run on different infrastructure. What's independently confirmed is the AA ranking, the pricing, the architecture capabilities, and the multimodal features. The cost story is real and significant regardless of exact benchmark margins.

The MSA Architecture Explained Simply

This section explains why M3 can offer a 1 million token context window at $0.30/1M tokens when nobody else can match that. It's worth understanding even if you're not a developer.

MiniMax Sparse Attention MSA architecture diagram showing KV block selection that reduces per-token compute by 20x at 1 million token context window
MiniMax Sparse Attention (MSA) — the architecture that makes M3's 1M context window practically usable. Cuts per-token compute to 1/20th, delivers 9× faster prefill and 15× faster decoding vs M2. Source: MiniMax official blog (minimax.io/blog/minimax-m3)

The problem with normal AI attention:

Every standard AI model works by comparing each word (token) it's reading against every other word in the context. This is called "attention." If you have 1 million tokens, that means 1 trillion pairs to compare and the compute required grows quadratically (double the context = 4× the cost). This is why long context has historically been expensive or slow.

What MiniMax Sparse Attention (MSA) does differently:

Think of the difference between searching a library by reading every book vs using a card catalogue. Standard attention reads every book. MSA builds an index first.

M3 groups the key-value memory (all the tokens it's read) into blocks, uses a lightweight index to identify only the relevant blocks for each new question, then runs full precise attention only on those selected blocks ignoring the 99% that aren't relevant.

The results at 1M context vs M3's predecessor (M2):

  • 9× faster prefill (loading the context into memory)
  • 15× faster decoding (generating the response)
  • 1/20th the per-token compute cost

And critically: MiniMax ran this selection on uncompressed, full-precision key-values not a lossy compressed version. The precision is maintained. Only the search scope is narrowed.

One practical note for developers: The GGUF/llama.cpp quantized version of M3 doesn't yet support MSA it falls back to standard dense attention. If you're running locally with llama.cpp, you won't get the long-context speed benefits. Use SGLang or vLLM for full MSA performance.

How to Run MiniMax M3 Locally

This section is for developers with access to data-center grade hardware. If you're an individual developer, skip to the API section — hosted access is more practical.

Hardware Requirements

ConfigurationVRAM RequiredContext Supported
FP8 precision, 4× A100 80GB~320GBUp to ~256K tokens
FP8 precision, 8× H100 80GB~640GBUp to 1M tokens
INT4 quantized (GGUF)~100GB+Up to 128K (no MSA)

↔️ Scroll horizontally to see all columns

M3 has 428B total parameters but only ~23B activate per query. The VRAM requirement is mainly for storing the weights, not running them.

No in-house data center? Most developers rent GPU time to experiment before committing to on-premise hardware. Cloud GPU rental lets you spin up an 8× A100 cluster, run your benchmarks, and pay only for the hours you use.

→ Rent GPU compute on Runpod: https://runpod.io?ref=39rjwy2z) — hourly A100/H100 GPU instances for testing self-hosted MiniMax M3 without long-term hardware commitment.
Disclosure: Referral link. I may earn a commission if you sign up, at no extra cost to you.

Bash / Terminal
# Install SGLang
pip install sglang[all]

# Download the weights
huggingface-cli download MiniMaxAI/MiniMax-M3 --local-dir ./MiniMax-M3

# Launch with 8 GPUs and 512K context
python -m sglang.launch_server \
  --model-path ./MiniMax-M3 \
  --tp 8 \
  --context-length 524288 \
  --trust-remote-code

vLLM Deployment

Bash / Terminal
pip install vllm

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --trust-remote-code

Querying Your Local Server

Once SGLang or vLLM is running, it exposes an OpenAI-compatible endpoint on localhost. Use the same OpenAI SDK pattern:

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant built by MiniMax."
        },
        {
            "role": "user",
            "content": "Analyze the performance bottleneck in this codebase and suggest optimizations."
        }
    ],
    temperature=1.0,
    max_tokens=4096
)

print(response.choices[0].message.content)

Quantized (GGUF) — Lower Hardware Option

Unsloth MiniMax M3 GGUF quantized versions on Hugging Face showing available quantization levels for local llama.cpp inference, with note that MSA sparse attention is not yet supported, June 2026
Unsloth's GGUF-quantized MiniMax M3 on Hugging Face — lower VRAM option for local inference via llama.cpp. Important: MSA sparse attention not yet supported; model falls back to standard dense attention for long-context use. Source: huggingface.co/unsloth/MiniMax-M3-GGUF, June 2026

Unsloth has released GGUF quantized versions at huggingface.co/unsloth/MiniMax-M3-GGUF. Note: as of June 2026, llama.cpp doesn't yet support MSA, quantized versions fall back to dense attention, so long-context performance won't match the full model.

Bash / Terminal
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server

./build/bin/llama-cli -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M

MiniMax M3 API Setup — Complete Code Guide

MiniMax M3's API is fully compatible with both the OpenAI SDK and the Anthropic SDK. If you already use either, you only need to change the base URL and model name.

Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

def chat_with_m3(prompt, system_prompt="You are a helpful assistant."):
    response = client.chat.completions.create(
        model="MiniMax-M3",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=1.0,
        max_tokens=4096
    )
    return response.choices[0].message.content

# Example usage
result = chat_with_m3(
    "Write a complete FastAPI app with JWT authentication and PostgreSQL integration.",
    system_prompt="You are an expert software engineer. Always write production-ready code."
)
print(result)

Streaming Responses (Better UX for Long Outputs)

Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

def stream_m3(prompt):
    stream = client.chat.completions.create(
        model="MiniMax-M3",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        stream=True
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print()

stream_m3("Explain the entire history of neural network architectures in depth.")

Multimodal: Screenshot to Code

One of M3's most practical use cases — give it an image of a UI and ask it to write the code. M3 accepts images as base64 or URL, same format as the OpenAI vision API:

Python
from openai import OpenAI
import base64
import os

client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key=os.environ["MINIMAX_API_KEY"]
)

def image_to_code(image_path, instruction):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="MiniMax-M3",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        }
                    },
                    {
                        "type": "text",
                        "text": instruction
                    }
                ]
            }
        ],
        temperature=1.0,
        max_tokens=8192
    )
    return response.choices[0].message.content

code = image_to_code(
    "ui-mockup.png",
    "Write complete React + Tailwind CSS code to replicate this UI exactly. Include all components."
)
print(code)

Using the Anthropic SDK Format

MiniMax also supports the Anthropic Messages API format, useful if you're already using Claude tooling and want to swap in M3 with minimal changes:

Python
import anthropic
import os

# Configure these environment variables:
# export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
# export ANTHROPIC_API_KEY=your_minimax_api_key

client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M3",
    max_tokens=2048,
    system="You are an expert software engineer.",
    messages=[
        {
            "role": "user",
            "content": "Write a thread-safe connection pool in Python with health checks and automatic reconnection."
        }
    ]
)

print(message.content[0].text)

Via OpenRouter (One API Key for All Models)

If you use OpenRouter, M3 is available immediately with no separate MiniMax account:

Python
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"]
)

response = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[
        {"role": "user", "content": "Write a thread-safe connection pool in Python."}
    ]
)

print(response.choices[0].message.content)

Note on model string casing: When calling the MiniMax API directly (api.minimax.io/v1), use "MiniMax-M3" (capitalized). When calling via OpenRouter, use "minimax/minimax-m3" (lowercase). This is confirmed in official documentation.

Pricing Breakdown

Understanding M3's pricing requires knowing about the two tiers within the 1M context window.

Pay-As-You-Go API

Context LengthInput /1M tokensOutput /1M tokensCache Read /1M
≤512K tokens$0.30 promo / $0.60 standard$1.20 promo / $2.40 standard$0.06
>512K tokens$0.60 promo / $1.20 standard$2.40 promo / $4.80 standard$0.12

↔️ Scroll horizontally to see all columns

Important: Inputs above 512K tokens are currently access-gated. You need to contact MiniMax directly for access if you need the full 1M window via API.

Monthly Subscription Plans

PlanMonthly PriceToken QuotaBest For
Plus$20/month~1.7B tokensTesting and light production
Max$50/month~5.1B tokensRegular development
Ultra$120/month~9.8B tokensHeavy production workloads

↔️ Scroll horizontally to see all columns

Verify current pricing at platform.minimax.io before committing — promo rates are time-limited.

Real Cost Comparison

A realistic coding agent task (500K input + 100K output tokens):

ModelCost per TaskNotes
MiniMax M3 (promo)~$0.27Best current deal
MiniMax M3 (standard)~$0.54After promo ends
Claude Opus 4.7~$5.00At standard pricing
GPT-5.5~$8.00At standard pricing

↔️ Scroll horizontally to see all columns

At promo pricing, M3 costs about 5% of Claude Opus and 3.4% of GPT-5.5 for the same task. Even at standard rates, it's still a fraction of the cost.

Horizontal bar chart comparing AI model cost per coding task in 2026 — MiniMax M3 at 0.27 dollars versus Claude Opus 4.7 at 5 dollars and GPT-5.5 at 8 dollars for a 500K token input task
Cost per task comparison (500K input + 100K output tokens): MiniMax M3 at ~$0.27 vs Claude Opus 4.7 at ~$5.00 and GPT-5.5 at ~$8.00. Data: techaffiliate.in based on published API pricing, June 2026. Promo pricing — verify at platform.minimax.io.

One caveat: because M3 is verbose (~91M output tokens across evaluations vs 23M average), the per-task cost adds up faster than the per-token price alone suggests. Model your actual workload before assuming the savings.

Real-World Use Cases

Large codebase analysis --> Feed an entire enterprise repository to M3 and ask it to identify security vulnerabilities, performance bottlenecks, or architectural improvements. The 1M context window means it can reason about the whole codebase, not just individual files.

Screenshot to working code --> Take a screenshot of any web interface and ask M3 to write the React, Vue, or HTML/CSS code to replicate it. Native multimodal training means it understands visual layout natively — not through a text translation step.

Video debugging --> Record your screen while a bug occurs and upload the video. M3 can watch what happened and suggest the fix without you having to describe it in text.

Autonomous research workflows --> The ICLR paper reproduction result suggests M3 can handle long-horizon research tasks: reading papers, designing experiments, running analysis, and producing results within a single coherent session.

Multi-language codebase work --> MiniMax trained M3 on real production code across 10+ languages: Python, TypeScript, Rust, Go, Java, C++, Kotlin, PHP, Lua, and Dart. Not toy examples, actual production repositories.

Self-hosted inference for data-sensitive workloads --> If your organization can't send code to external APIs (compliance, IP protection, regulated industries), the open weights let you run M3 entirely on your own infrastructure. Budget the hardware honestly and read the Community License carefully.

Which Coding Agent Should You Use in 2026?

This section is for developers actively evaluating coding tools not just M3, but the whole landscape. It's the question I get most after publishing model reviews.

M3's 59.0% SWE-Bench Pro score puts it in direct competition with the model backends powering tools like Cursor and Windsurf. Here's an honest comparison of the full tools model capability plus UX plus pricing:

FeatureMiniMax CodeCursorWindsurf (Codeium)
Underlying ModelMiniMax M3 (native)User's choice — M3, Claude, GPTCodeium's models + user choice
Best Benchmark (backend)59.0% SWE-Bench (M3)Depends on model chosenVaries
IDE IntegrationCLI + web-basedFull VS Code forkFull VS Code fork
Context Window1M tokens (M3 native)Depends on modelDepends on model
Video / Image InputYes, native (M3)Yes, via model choiceLimited
Free TierYes — daily quotaYes — limitedYes — generous
Paid Plan Entry$20/month$20/month (Pro)$15/month (Pro)
Best ForLong-context agentic work, multimodal inputTeams already in VS Code, model flexibilityFast autocomplete + low-cost agents

↔️ Scroll horizontally to see all columns

The honest take:

If you're building long-running agents or need to analyze entire repositories in one pass, MiniMax Code with M3 native is the right call — nothing else gives you 1M context natively at this price.

If you're a developer who lives in VS Code and wants the flexibility to switch between M3, Claude, and GPT depending on the task, Cursor is the better daily driver. The UX is more polished and the IDE integration runs deeper.

If your team wants the most generous free tier before committing to a subscription, Windsurf is worth evaluating first.

The smart order: Start with MiniMax Code free → if you want VS Code integration, add Cursor free → upgrade whichever you actually use every day.

Community Reports and Early Impressions

A few honest observations from developers in the first weeks after launch:

On verbosity: Responses tend to be thorough — sometimes to the point of being longer than necessary. M3 averages about 91M tokens across Artificial Analysis's evaluation suite vs a 23M average for peers. In practice, consider using a lower max_tokens setting and prompting for concise output when brevity matters.

On server availability: Post-launch demand has been high. Some users in the Asia-Pacific region reported timeouts through the official hosted API. Routing through OpenRouter (which distributes across multiple providers) can improve reliability during peak periods.

On the promo pricing: The $0.30/1M input promo rate is time-limited. Standard rates are double ($0.60/$1.20), and the rates above 512K context double again. Model your actual long-context workload costs before committing production workflows to M3.

On the Community License: Unlike GLM-5.2 (MIT) or Qwen3 (Apache 2.0), MiniMax's license has commercial use conditions. If you're building a product on the self-hosted weights, read the full license at huggingface.co/MiniMaxAI/MiniMax-M3 before shipping.

Pros and Cons

What's Great

  • First open-weights model combining frontier coding + 1M context + native video input simultaneously
  • 59.0% SWE-Bench Pro (vendor-reported) --> strongest open-weights coding score at launch
  • Beats Claude Opus 4.7 on BrowseComp (83.5 vs 79.3) independently notable
  • MSA architecture makes 1M context genuinely affordabl not just advertised
  • Multiple free access paths --> no credit card required for any of them
  • Open weights on Hugging Face --> downloadable and self-hostable
  • $0.30/$1.20 promo pricing is 5–10% of frontier proprietary model costs
  • Compatible with both OpenAI and Anthropic SDK formats --> minimal migration effort
  • Natively trained on mixed modalities --> not a text model with vision grafted on

What to Know Before You Commit

  • SWE-Bench scores are vendor-reported — independent verification still pending for some benchmarks
  • Community License, not MIT or Apache — commercial use conditions apply; read before building products
  • Very verbose — ~91M output tokens per evaluation vs ~23M average; production costs can surprise you
  • Full 1M context via API is access-gated above 512K — not self-serve yet
  • GGUF/llama.cpp versions don't support MSA — true long-context performance requires SGLang or vLLM
  • Server congestion after launch — expect some API reliability issues under high demand
  • Requires data-center hardware for local deployment — not consumer-grade
  • Chinese company data privacy considerations for cloud API use

Who Should Use MiniMax M3?

M3 is the right choice if you:

  • Are a developer who wants frontier-level coding AI at a fraction of the cost — at promo pricing, it's the most capable model per dollar for coding tasks right now
  • Need to analyze large codebases end-to-end — the 1M context window is the only way to hold an entire enterprise repo in a single context
  • Work with UI mockups, screenshots, or design files — native vision means you can go from screenshot directly to production code
  • Are building long-running AI agents — the 24-hour autonomous run capability is matched by very few models at any price
  • Need open weights for compliance or data residency — self-hostable (read the Community License first)
  • Are cost-sensitive but unwilling to compromise on capability — the per-task math is genuinely compelling

Consider a different model if you:

  • Need a pure MIT license with no commercial restrictions → use GLM-5.2
  • Need the absolute lowest per-task cost → DeepSeek V4 Flash (~$0.05/task) is cheaper
  • Work primarily with text-only coding without multimodal needs → Kimi K2.7 Code is more efficient
  • Need maximum abstract reasoning → frontier proprietary models still lead here
  • Don't have GPU infrastructure and need local deployment → M3's 428B parameters aren't consumer hardware friendly even quantized

Frequently Asked Questions

Is MiniMax M3 free?

Yes, in several ways. MiniMax Chat at chat.minimax.io offers free access with daily limits — no credit card needed. MiniMax Code at code.minimax.io has a free tier for coding tasks. NVIDIA NIM at build.nvidia.com/minimaxai/minimax-m3 provides a confirmed free API endpoint — just a free NVIDIA developer account required, no MiniMax account needed. OpenRouter provides free trial credits. The official MiniMax API at platform.minimax.io is pay-as-you-go (no guaranteed free credits — check current onboarding). The model weights are free to download on Hugging Face under the Community License. The $20/month Plus plan is the cheapest paid option if you exceed free limits.

What does 428 billion parameters mean?

Parameters are the internal numbers the model has learned during training — they represent its knowledge and reasoning capability. M3 has 428 billion of them in total, but only activates about 23 billion per query (using a design called Mixture-of-Experts). More parameters generally means more capable — but also bigger files and more computing power to run. For reference, GPT-3 had 175 billion.

Is MiniMax M3 open source?

It's open-weights, which is different from fully open source. The model files are publicly downloadable, but the MiniMax Community License has commercial use conditions. "Open weights" means you can download and inspect the model. "Open source" (like MIT or Apache) means you can use it commercially with no restrictions. M3 is the former, not the latter. Read the license before building a commercial product on the weights.

How does it compare to GLM-5.2?

Both are strong open-weights models from Chinese labs released in June 2026. GLM-5.2 (Z.ai) has a pure MIT license — more permissive for commercial use — but is text-only. MiniMax M3 has better multimodal capabilities (images + video) and a slightly different strength profile. For pure coding with unrestricted commercial licensing, GLM-5.2 is likely the safer pick. For multimodal and agentic workflows, M3 is the better fit.

Can I use M3 without a credit card?

Yes. chat.minimax.io, platform.minimax.io free credits, and OpenRouter trial credits all require only an email address. No credit card is needed for any of the free access methods in this article.

What hardware do I need to run it locally?

For full model inference with MSA at long context: minimum 4× A100 80GB GPUs (FP8). For quantized GGUF inference: ~100GB+ VRAM across multiple GPUs. M3 is not runnable on consumer hardware even at heavy quantization. If you don't have data-center grade hardware, use the hosted API or rent GPU compute by the hour via services like Runpod.

What happened to M2.7?

M2.7 (229B parameters) is still available via API and on Hugging Face. M3 is a generational upgrade — it brings back sparse attention that M2 had removed, adds native video input, and significantly improves coding. M2.7 is faster and cheaper per token; it's a viable option for workloads where M3's specific capabilities (1M context, video, frontier coding) aren't required.

Which SDK should I use — OpenAI or Anthropic?

Both work with M3. Use the OpenAI SDK (base_url="https://api.minimax.io/v1", model "MiniMax-M3") if you're building new integrations — it's the primary path. Use the Anthropic SDK (ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic) if you're coming from Claude tooling and want minimal code changes. On OpenRouter, use "minimax/minimax-m3" (lowercase) regardless of SDK.

Final Verdict

MiniMax M3 is not the most powerful AI model in the world. Proprietary models from Anthropic and OpenAI still lead on abstract reasoning and certain benchmarks.

What's genuinely remarkable is the combination at this access level and price. A 428B open-weights model that understands video, holds 1 million tokens of context, ran autonomously for 24 hours making nearly 2,000 tool calls — downloadable free from Hugging Face, usable through a free browser interface with no credit card.

Two years ago, that capability combination would have been the exclusive property of an expensive proprietary subscription. Today it's a free download.

That's the real story. Not the exact benchmark percentages. The access.

If you just want to try it, start with Method 1 — you'll be chatting with M3 in under two minutes. If you're a developer who wants instant API access with no MiniMax account, Method 4 (NVIDIA NIM) gets you a working API key in five minutes. If you're evaluating M3 alongside other models, Method 3 (OpenRouter) gives you a single key that switches between providers. If you're building something commercial and need direct billing, Method 5 (MiniMax Official API) is the production path. And if you have data residency requirements, Method 6 (self-hosted weights) is the answer — just budget the hardware honestly, consider cloud GPU rental for initial testing, and read the Community License carefully.

Use CaseToolLink
Try M3 free, no setupMiniMax Chatchat.minimax.io
Coding agent, freeMiniMax Codecode.minimax.io
API access, compare modelsOpenRouteropenrouter.ai
Free API, no MiniMax accountNVIDIA NIMbuild.nvidia.com/minimaxai/minimax-m3
Official API + commercial useMiniMax Platformplatform.minimax.io
GPU compute for local testingRunpodrunpod.io
Model weightsHugging Facehuggingface.co/MiniMaxAI/MiniMax-M3

↔️ Scroll horizontally to see all columns

Use MiniMax M3 free right now:

  • Chat interface: chat.minimax.io
  • Coding agent: code.minimax.io
  • Free API (no MiniMax account): build.nvidia.com/minimaxai/minimax-m3
  • API (commercial): platform.minimax.io
  • Model weights: huggingface.co/MiniMaxAI/MiniMax-M3
  • OpenRouter: openrouter.ai/minimax/minimax-m3

Found this helpful? Share it with others!

Affiliate Disclosure

TechAffiliate may earn a commission if you purchase through our links. This helps support our work but does not influence our reviews. We always provide honest assessments of all products.

Comments (0)

Leave a Comment

No comments yet

Be the first to share your thoughts!