Best field report is 28.0 tok/s; keep ranking movement provisional until Bench evidence hardens.
Bench: Mac Studio M4 Ultra 256GB batch planned
Best local LLMs for this Mac
MacBook Pro M5 Max 128GB 16-inch, ranked for coding with a most-capable bias, using the best available runtime evidence and focused on the current market set.
Current ranking evidence
Fresh releases stay visible, but sparse evidence remains explicit.
Best field report is 85.5 tok/s; keep ranking movement provisional until Bench evidence hardens.
Bench: Mac Studio M4 Ultra 256GB batch planned
Best field report is 203.1 tok/s; keep ranking movement provisional until Bench evidence hardens.
Bench: Mac Studio M4 Ultra 256GB batch planned
Best field report is 44.0 tok/s; keep ranking movement provisional until Bench evidence hardens.
Bench: Mac Studio M4 Ultra 256GB batch planned

| Rank | Model | Score | Quant | Tok/s | Runtime | Evidence | Headroom | Context | Why it ranks here |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 4 31B (30.7B parameters) | 290 | 8bit | 26.0 tok/s | MLX | Estimated; first-party M5 batch queued | 91.4 GB | 87k | Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 91.4 GB headroom leaves workable context margin. |
| 2 | Qwen3.5-27B (27B parameters) | 283 | 8bit | 31.6 tok/s | MLX | Estimated; first-party M5 batch queued | 100.4 GB | 262k | Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 31.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 100.4 GB headroom leaves workable context margin. |
| 3 | Qwen3.6-27B (27B parameters) | 280 | 8bit | 16.6 tok/s | Ollama | Estimated; first-party M5 batch queued | 100.4 GB | 262k | Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 16.6 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 100.4 GB headroom leaves workable context margin. |
| 4 | Devstral Small 2 24B (24B parameters) | 274 | 8bit | 23.4 tok/s | llama.cpp | Estimated; first-party M5 batch queued | 103.9 GB | 262k | Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 23.4 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 103.9 GB headroom leaves workable context margin. |
| 5 | Qwen3.6-35B-A3B (3B active / 35B total) | 252 | 8bit | 55.0 tok/s | MLX | Estimated; first-party M5 batch queued | 94.3 GB | 262k | Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 55.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 94.3 GB headroom leaves workable context margin. |
| 6 | Mistral Small 4 119B (6.5B active / 119B total) | 193 | Q6_K | 42.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 32.1 GB | 32k | Recent frontier candidate in the current catalog. Q6_K is the highest practical quality here. 42.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 32.1 GB headroom leaves workable context margin. |
| 7 | MiniMax M2.7 (229B parameters) | 422 | 3bit (source-backed MLX MiniMax-M2.7-3bit, 112 GB min) | 40.4 tok/s | Best available | Field signal; first-party M5 batch queued | 16.0 GB | 116k | Recent model release in the current catalog. 3bit is the highest practical quality here. 40.4 tok/s only from Apple Silicon field signals. 16.0 GB headroom leaves workable context margin. |
| 8 | Gemma 4 26B-A4B (3.8B active / 25.2B total) | 252 | 8bit | 50.0 tok/s | MLX | Estimated; first-party M5 batch queued | 102.2 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 50.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 102.2 GB headroom leaves workable context margin. |
| 9 | Nemotron Cascade 2 30B-A3B (3B active / 30B total) | 250 | 8bit | 28.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 99.2 GB | 1000k | Recent model release in the current catalog. 8bit is the highest practical quality here. 28.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 99.2 GB headroom leaves workable context margin. |
| 10 | Qwen3.5-35B-A3B (3B active / 35B total) | 249 | 8bit | 48.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 94.3 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 48.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 94.3 GB headroom leaves workable context margin. |
| 11 | GLM-4.7-Flash (3B active / 30B total) | 247 | 8bit | 58.0 tok/s | llama.cpp | Estimated; first-party M5 batch queued | 92.2 GB | 90k | Recent model release in the current catalog. 8bit is the highest practical quality here. 58.0 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 92.2 GB headroom leaves workable context margin. |
| 12 | Nemotron-3-Nano-30B-A3B (3.5B active / 30B total) | 246 | 8bit | 43.7 tok/s | llama.cpp | Estimated; first-party M5 batch queued | 99.2 GB | 1000k | Recent model release in the current catalog. 8bit is the highest practical quality here. 43.7 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 99.2 GB headroom leaves workable context margin. |
| 13 | Magistral Small (24B parameters) | 242 | 8bit | Measure it | Best available | Fit-first; first-party M5 batch queued | 103.9 GB | 41k | 8bit is the highest practical quality here. Speed still needs direct speed coverage. 103.9 GB headroom leaves workable context margin. |
| 14 | Ministral 3 14B (14B parameters) | 232 | 8bit | 40.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 113.2 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 40.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 113.2 GB headroom leaves workable context margin. |
| 15 | Gemma 4 E4B (8B parameters) | 230 | 8bit | 128.0 tok/s | MLX | Estimated; first-party M5 batch queued | 119.4 GB | 131k | Recent model release in the current catalog. 8bit is the highest practical quality here. 128.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 119.4 GB headroom leaves workable context margin. |
| 16 | Qwen3.5-9B (9B parameters) | 230 | 8bit | 78.0 tok/s | MLX | Estimated; first-party M5 batch queued | 118.1 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 78.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 118.1 GB headroom leaves workable context margin. |
| 17 | Ministral 3 8B (8B parameters) | 223 | 8bit | 72.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 119.0 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 72.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 119.0 GB headroom leaves workable context margin. |
| 18 | gpt-oss 20B (3.6B active / 21B total) | 217 | 8bit | Measure it | MLX | Fit-first; first-party M5 batch queued | 107.6 GB | 131k | 8bit is the highest practical quality here. Speed still needs direct speed coverage. 107.6 GB headroom leaves workable context margin. |
| 19 | Qwen3.5-122B-A10B (10B active / 122B total) | 197 | Q6_K | 60.6 tok/s | MLX | Estimated; Bitter Mill import queued | 33.6 GB | 165k | Recent model release in the current catalog. Q6_K is the highest practical quality here. 60.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 33.6 GB headroom leaves workable context margin. |
| 20 | Llama 4 Scout 17B-16E (17B active / 109B total) | 196 | 8bit | 26.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 24.5 GB | 37k | 8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 24.5 GB headroom leaves workable context margin. |
| 21 | Qwen3-Coder-Next (3B active / 80B total) | 192 | 8bit | 74.3 tok/s | MLX | Community row; first-party M5 batch queued | 52.2 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 74.3 tok/s benchmark-backed on MLX backend. 52.2 GB headroom leaves workable context margin. |
| 22 | GLM-4.5-Air (12B active / 106B total) | 190 | 8bit | 18.0 tok/s | LM Studio | Estimated; first-party M5 batch queued | 27.3 GB | 55k | 8bit is the highest practical quality here. 18.0 tok/s estimated from nearby benchmark coverage, with LM Studio wrapper on mixed as the best runtime hint. 27.3 GB headroom leaves workable context margin. |
| 23 | Gemma 4 E2B (5.1B parameters) | 170 | 8bit | 158.0 tok/s | MLX | Estimated; first-party M5 batch queued | 122.5 GB | 131k | Recent model release in the current catalog. 8bit is the highest practical quality here. 158.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.5 GB headroom leaves workable context margin. |
| 24 | Qwen3.5-4B (4B parameters) | 167 | 8bit | 148.0 tok/s | MLX | Estimated; first-party M5 batch queued | 122.8 GB | 262k | Recent model release in the current catalog. 8bit is the highest practical quality here. 148.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.8 GB headroom leaves workable context margin. |
| 25 | gpt-oss 120B (5.1B active / 117B total) | 164 | Q6_K | 7.0 tok/s | Ollama | Estimated; first-party M5 batch queued | 37.6 GB | 131k | Q6_K is the highest practical quality here. 7.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 37.6 GB headroom leaves workable context margin. |
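
The Headroom column appears to be the Mac's 128 GB of unified memory minus the model's minimum memory requirement. Row 7 is consistent with that reading (112 GB minimum, 16.0 GB headroom), but the page does not state the formula, so treat it as an assumption. A minimal sketch of the arithmetic, with illustrative function names that are not part of Silicon Score:

```python
UNIFIED_MEMORY_GB = 128.0  # MacBook Pro M5 Max 128GB, from the page title


def headroom_gb(min_memory_gb: float, total_gb: float = UNIFIED_MEMORY_GB) -> float:
    """Memory left after loading the model (assumed definition of Headroom)."""
    return round(total_gb - min_memory_gb, 1)


def implied_footprint_gb(headroom: float, total_gb: float = UNIFIED_MEMORY_GB) -> float:
    """Invert the same relation to recover the footprint a table row implies."""
    return round(total_gb - headroom, 1)


# Row 7 (MiniMax M2.7, 3bit): 112 GB minimum on a 128 GB machine.
print(headroom_gb(112.0))          # 16.0
# Row 1 (Gemma 4 31B, 8bit): 91.4 GB headroom implies this footprint.
print(implied_footprint_gb(91.4))  # 36.6
```

Under this reading, a small headroom figure signals that context length, not raw fit, is the constraint to check first.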
Featured Environment
Why this Mac is featured
The featured Mac is the MacBook Pro M5 Max 128GB 16-inch, the current first-party frontier inference environment. Use the ranking above to switch to any other Mac.
Featured Mac: MacBook Pro M5 Max 128GB 16-inch
Qwen3.6-27B is the current agent-capability answer on the featured Mac. The table keeps fit, quantization, runtime, speed, and evidence status together so field signals do not look like first-party truth.
Runtime and quantization drilldown
MacBook Pro M5 Max 128GB 16-inch currently has 33 direct benchmark rows and 45 matching field-speed claims. Use the runtime selector and Bench before treating any speed winner as settled.
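
One way to act on that caution is to pick a speed winner only among rows whose evidence is benchmark-backed, and to report the overall fastest row separately as provisional. The sketch below uses four rows copied from the ranking table. The `Row` type, the `TRUSTED` set, and the choice to treat only "Community row" as benchmark-backed are assumptions for illustration, not Silicon Score's actual logic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Row:
    model: str
    tok_s: Optional[float]  # None where the table says "Measure it"
    runtime: str
    evidence: str


ROWS = [
    Row("Gemma 4 E2B", 158.0, "MLX", "Estimated"),
    Row("Qwen3-Coder-Next", 74.3, "MLX", "Community row"),
    Row("MiniMax M2.7", 40.4, "Best available", "Field signal"),
    Row("Magistral Small", None, "Best available", "Fit-first"),
]

# Assumption: only this evidence label counts as benchmark-backed.
TRUSTED = {"Community row"}


def fastest(rows: Iterable[Row], evidence: Optional[Set[str]] = None) -> Optional[Row]:
    """Return the fastest row with a speed value, optionally filtered by evidence label."""
    candidates = [
        r for r in rows
        if r.tok_s is not None and (evidence is None or r.evidence in evidence)
    ]
    return max(candidates, key=lambda r: r.tok_s, default=None)


settled = fastest(ROWS, TRUSTED)
provisional = fastest(ROWS)
print("benchmark-backed winner:", settled.model if settled else "none yet")
print("provisional winner:", provisional.model if provisional else "none yet")
```

Keeping the two answers separate mirrors the table's intent: an estimated 158.0 tok/s is shown, but it does not displace a measured 74.3 tok/s as the settled result.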
Next featured environment
The default is a pointer, not a fork. Mac Studio M4 Ultra 256GB is planned for June 2026; the same Mac-first structure moves there only after arrival validation and clean first-party evidence, while the current featured record remains available as a normal machine page and Bench audit target.
If the real decision is local Mac versus rented GPU economics, compare the hardware path in Worth with AI Datacenter Index.
Frequently asked questions
- Why does Silicon Score default to MacBook Pro M5 Max 128GB 16-inch right now?
- MacBook Pro M5 Max 128GB 16-inch is the local featured machine in hand for 2026 Apple Silicon inference. That makes it the best default lab question for now: what is the strongest model that fits, which runtime is fastest, and how much evidence backs the answer?
- How should I read field signals versus benchmark rows?
- Field signals are directional reports from Apple Silicon practitioners and competitor surfaces. They stay labeled until Silicon Score reproduces the setup with clean local recordings, methodology, and provenance suitable for Bench.
- How much unified memory do you need for local LLMs on a Mac?
- Unified memory determines which quantization tier fits cleanly, how much context you can keep live, and whether a recommendation stays practical after launch. In practice, 16GB can be useful for compact models, 24GB to 48GB opens more working room, and 64GB-plus tiers are where larger frontier setups become more credible. Use Fit to verify the exact model and headroom instead of relying on RAM labels alone.
- What does tok/s mean and how much do I need?
- Tokens per second measures how fast a model generates text. Around 8 to 15 tok/s usually feels interactive, while heavier coding, agent, or batch workflows benefit from more. Silicon Score ranks Macs with benchmark-backed or explicitly labeled estimated speed so you can trade responsiveness against model quality with the evidence visible.
- Is local LLM inference on a Mac cheaper than using an API?
- If you run daily, want predictable privacy, or repeatedly use the same model class, local hardware can make sense. If your usage is spiky or you expect to burst into much larger models, compare the Mac path against rented GPU economics on AI Datacenter Index before you over-buy hardware.
- What quantization format should I use on a Mac?
- GGUF remains a common path for llama.cpp and Ollama on Macs, but there is no single best quantization for every machine. The practical answer depends on memory ceiling, runtime support, and how much quality loss you can tolerate. Start from the quantization and runtime surfaced in Rankings and Bench instead of assuming one preset wins everywhere.
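
As a rough illustration of the memory and tok/s answers above, the sketch below estimates the weight footprint for a given parameter count and bit width, and the wall-clock time to generate a reply at a given speed. It is a back-of-the-envelope approximation under stated assumptions: it counts weights only (no KV cache, runtime overhead, or quantization metadata), it uses 1 GB = 10^9 bytes, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8 (weights only)."""
    return params_billion * bits_per_weight / 8


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` output tokens at a steady decode speed."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return tokens / tok_per_s


# A 27B-parameter model at 8 bits vs 4 bits per weight.
print(weight_memory_gb(27, 8))  # 27.0
print(weight_memory_gb(27, 4))  # 13.5

# A 500-token reply at the low end of "interactive" (8 tok/s).
print(generation_seconds(500, 8))  # 62.5
```

Real footprints run higher than the weights-only figure, which is why checking the exact model and headroom in Fit remains the safer path.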