Best local LLMs for this Mac

25 current modelsCatalog current through April 22, 2026Benchmark evidence through April 27, 2026

MacBook Pro M5 Max 128GB 16-inch ranked for coding with a most capable bias, using the best available runtime evidence. focused on the current market set.

Use the strongest current runtime evidence for each row.Largest fit: MiniMax M2.7 at 3bit (229B parameters)Fastest read: Gemma 4 E2B at 158.0 tok/s on MLXRanking evidence: Gemma 4 31B, Qwen3.6-27B, Qwen3.6-35B-A3B +1 are current candidates; sparse rows stay labeled until first-party evidence lands.Next featured Mac: Mac Studio M4 Ultra 256GB planned for June 2026; current default changes after arrival validation and clean first-party evidence.25 historical baseline rows hidden

Current ranking evidence

Fresh releases stay visible, but sparse evidence remains explicit.

Gemma 4 31B

released 2026-04-02 · 5 official specs captured · 4 benchmark rows · 6 Apple Silicon field sources · first-party measurement queued · Mac Studio M4 Ultra 256GB batch planned

Best field report is 28.0 tok/s; keep ranking movement provisional until Bench evidence hardens.

Bench: Mac Studio M4 Ultra 256GB batch planned
Qwen3.6-27B

released 2026-04-22 · 5 official specs captured · 1 benchmark row · 9 Apple Silicon field sources · first-party measurement queued · Mac Studio M4 Ultra 256GB batch planned

Best field report is 85.5 tok/s; keep ranking movement provisional until Bench evidence hardens.

Bench: Mac Studio M4 Ultra 256GB batch planned
Qwen3.6-35B-A3B

released 2026-04-15 · 5 official specs captured · 4 benchmark rows · 15 Apple Silicon field sources · first-party measurement queued · Mac Studio M4 Ultra 256GB batch planned

Best field report is 203.1 tok/s; keep ranking movement provisional until Bench evidence hardens.

Bench: Mac Studio M4 Ultra 256GB batch planned
Mistral Small 4 119B

released 2026-03-16 · 6 official specs captured · 3 benchmark rows · 1 Apple Silicon field source · first-party measurement queued · Mac Studio M4 Ultra 256GB batch planned

Best field report is 44.0 tok/s; keep ranking movement provisional until Bench evidence hardens.

Bench: Mac Studio M4 Ultra 256GB batch planned
RankModelScoreQuantTok/sRuntimeEvidenceHeadroomContextWhy it ranks here
1Gemma 4 31B30.7B parameters2908bit 26.0 tok/s Fastest evidence path: 8bit · 26.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued91.4 GB87kRecent frontier candidate in the current catalog. 8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 91.4 GB headroom leaves workable context margin.
2Qwen3.5-27B27B parameters2838bit 31.6 tok/s Fastest evidence path: 8bit · 31.6 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued100.4 GB262kRecent frontier candidate in the current catalog. 8bit is the highest practical quality here. 31.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 100.4 GB headroom leaves workable context margin.
3Qwen3.6-27B27B parameters2808bit 16.6 tok/s Fastest evidence path: 8bit · 16.6 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued100.4 GB262kRecent frontier candidate in the current catalog. 8bit is the highest practical quality here. 16.6 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 100.4 GB headroom leaves workable context margin.
4Devstral Small 2 24B24B parameters2748bit 23.4 tok/s Fastest evidence path: 8bit · 23.4 tok/s · llama.cpp · Estimatedllama.cppEstimatedFirst-party M5 batch queued103.9 GB262kRecent frontier candidate in the current catalog. 8bit is the highest practical quality here. 23.4 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 103.9 GB headroom leaves workable context margin.
5Qwen3.6-35B-A3B3B active / 35B total2528bit 55.0 tok/s Fastest evidence path: 8bit · 55.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued94.3 GB262kRecent frontier candidate in the current catalog. 8bit is the highest practical quality here. 55.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 94.3 GB headroom leaves workable context margin.
6Mistral Small 4 119B6.5B active / 119B total193Q6_K 42.0 tok/s Fastest evidence path: Q6_K · 42.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued32.1 GB32kRecent frontier candidate in the current catalog. Q6_K is the highest practical quality here. 42.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 32.1 GB headroom leaves workable context margin.
7MiniMax M2.7229B parameters4223bit Source-backed MLX MiniMax-M2.7-3bit - 112 GB min40.4 tok/s Fastest evidence path: 3bit · 40.4 tok/s · Best available · Field signalBest availableField signalFirst-party M5 batch queued16.0 GB116kRecent model release in the current catalog. 3bit is the highest practical quality here. 40.4 tok/s only from Apple Silicon field signals. 16.0 GB headroom leaves workable context margin.
8Gemma 4 26B-A4B3.8B active / 25.2B total2528bit 50.0 tok/s Fastest evidence path: 8bit · 50.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued102.2 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 50.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 102.2 GB headroom leaves workable context margin.
9Nemotron Cascade 2 30B-A3B3B active / 30B total2508bit 28.0 tok/s Fastest evidence path: 8bit · 28.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued99.2 GB1000kRecent model release in the current catalog. 8bit is the highest practical quality here. 28.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 99.2 GB headroom leaves workable context margin.
10Qwen3.5-35B-A3B3B active / 35B total2498bit 48.0 tok/s Fastest evidence path: 8bit · 48.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued94.3 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 48.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 94.3 GB headroom leaves workable context margin.
11GLM-4.7-Flash3B active / 30B total2478bit 58.0 tok/s Fastest evidence path: 8bit · 58.0 tok/s · llama.cpp · Estimatedllama.cppEstimatedFirst-party M5 batch queued92.2 GB90kRecent model release in the current catalog. 8bit is the highest practical quality here. 58.0 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 92.2 GB headroom leaves workable context margin.
12Nemotron-3-Nano-30B-A3B3.5B active / 30B total2468bit 43.7 tok/s Fastest evidence path: 8bit · 43.7 tok/s · llama.cpp · Estimatedllama.cppEstimatedFirst-party M5 batch queued99.2 GB1000kRecent model release in the current catalog. 8bit is the highest practical quality here. 43.7 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 99.2 GB headroom leaves workable context margin.
13Magistral Small24B parameters2428bit Measure it Best availableFit-firstFirst-party M5 batch queued103.9 GB41k8bit is the highest practical quality here. Speed still needs direct speed coverage. 103.9 GB headroom leaves workable context margin.
14Ministral 3 14B14B parameters2328bit 40.0 tok/s Fastest evidence path: 8bit · 40.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued113.2 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 40.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 113.2 GB headroom leaves workable context margin.
15Gemma 4 E4B8B parameters2308bit 128.0 tok/s Fastest evidence path: 8bit · 128.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued119.4 GB131kRecent model release in the current catalog. 8bit is the highest practical quality here. 128.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 119.4 GB headroom leaves workable context margin.
16Qwen3.5-9B9B parameters2308bit 78.0 tok/s Fastest evidence path: 8bit · 78.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued118.1 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 78.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 118.1 GB headroom leaves workable context margin.
17Ministral 3 8B8B parameters2238bit 72.0 tok/s Fastest evidence path: 8bit · 72.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued119.0 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 72.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 119.0 GB headroom leaves workable context margin.
18gpt-oss 20B3.6B active / 21B total2178bit Measure it MLXFit-firstFirst-party M5 batch queued107.6 GB131k8bit is the highest practical quality here. Speed still needs direct speed coverage. 107.6 GB headroom leaves workable context margin.
19Qwen3.5-122B-A10B10B active / 122B total197Q6_K 60.6 tok/s Fastest evidence path: Q6_K · 60.6 tok/s · MLX · EstimatedMLXEstimatedBitter Mill import queued33.6 GB165kRecent model release in the current catalog. Q6_K is the highest practical quality here. 60.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 33.6 GB headroom leaves workable context margin.
20Llama 4 Scout 17B-16E17B active / 109B total1968bit 26.0 tok/s Fastest evidence path: 8bit · 26.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued24.5 GB37k8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 24.5 GB headroom leaves workable context margin.
21Qwen3-Coder-Next3B active / 80B total1928bit 74.3 tok/s Fastest evidence path: 8bit · 74.3 tok/s · MLX · Community rowMLXCommunity rowFirst-party M5 batch queued52.2 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 74.3 tok/s benchmark-backed on MLX backend. 52.2 GB headroom leaves workable context margin.
22GLM-4.5-Air12B active / 106B total1908bit 18.0 tok/s Fastest evidence path: 8bit · 18.0 tok/s · LM Studio · EstimatedLM StudioEstimatedFirst-party M5 batch queued27.3 GB55k8bit is the highest practical quality here. 18.0 tok/s estimated from nearby benchmark coverage, with LM Studio wrapper on mixed as the best runtime hint. 27.3 GB headroom leaves workable context margin.
23Gemma 4 E2B5.1B parameters1708bit 158.0 tok/s Fastest evidence path: 8bit · 158.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued122.5 GB131kRecent model release in the current catalog. 8bit is the highest practical quality here. 158.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.5 GB headroom leaves workable context margin.
24Qwen3.5-4B4B parameters1678bit 148.0 tok/s Fastest evidence path: 8bit · 148.0 tok/s · MLX · EstimatedMLXEstimatedFirst-party M5 batch queued122.8 GB262kRecent model release in the current catalog. 8bit is the highest practical quality here. 148.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.8 GB headroom leaves workable context margin.
25gpt-oss 120B5.1B active / 117B total164Q6_K 7.0 tok/s Fastest evidence path: Q6_K · 7.0 tok/s · Ollama · EstimatedOllamaEstimatedFirst-party M5 batch queued37.6 GB131kQ6_K is the highest practical quality here. 7.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 37.6 GB headroom leaves workable context margin.

Why this Mac is featured

The featured Mac is the MacBook Pro M5 Max 128GB 16-inch, the current first-party frontier inference environment. Use the ranking above to switch to any other Mac.

Featured Mac: MacBook Pro M5 Max 128GB 16-inch

Qwen3.6-27B is the current agent-capability answer on the featured Mac. The table keeps fit, quantization, runtime, speed, and evidence status together so field signals do not look like first-party truth.

Runtime and quantization drilldown

MacBook Pro M5 Max 128GB 16-inch currently has 33 direct benchmark rows and 45 matching field-speed claims. Use the runtime selector and Bench before treating any speed winner as settled.

Next featured environment

The default is a pointer, not a fork. Mac Studio M4 Ultra 256GB is planned for June 2026; the same Mac-first structure moves there only after arrival validation and clean first-party evidence, while the current featured record remains available as a normal machine page and Bench audit target.

If the real decision is local Mac versus rented GPU economics, compare the hardware path in Worth with AI Datacenter Index.

Frequently asked questions

Why does Silicon Score default to MacBook Pro M5 Max 128GB 16-inch right now?
MacBook Pro M5 Max 128GB 16-inch is the local featured machine in hand for 2026 Apple Silicon inference. That makes it the best default lab question for now: what is the strongest model that fits, which runtime is fastest, and how much evidence backs the answer?
How should I read field signals versus benchmark rows?
Field signals are directional reports from Apple Silicon practitioners and competitor surfaces. They stay labeled until Silicon Score reproduces the setup with clean local recordings, methodology, and provenance suitable for Bench.
How much unified memory do you need for local LLMs on a Mac?
Unified memory determines which quantization tier fits cleanly, how much context you can keep live, and whether a recommendation stays practical after launch. In practice, 16GB can be useful for compact models, 24GB to 48GB opens more working room, and 64GB-plus tiers are where larger frontier setups become more credible. Use Fit to verify the exact model and headroom instead of relying on RAM labels alone.
What does tok/s mean and how much do I need?
Tokens per second measures how fast a model generates text. Around 8 to 15 tok/s usually feels interactive, while heavier coding, agent, or batch workflows benefit from more. Silicon Score ranks Macs with benchmark-backed or explicitly labeled estimated speed so you can trade responsiveness against model quality with the evidence visible.
Is local LLM inference on a Mac cheaper than using an API?
If you run daily, want predictable privacy, or repeatedly use the same model class, local hardware can make sense. If your usage is spiky or you expect to burst into much larger models, compare the Mac path against rented GPU economics on AI Datacenter Index before you over-buy hardware.
What quantization format should I use on a Mac?
GGUF remains a common path for llama.cpp and Ollama on Macs, but there is no single best quantization for every machine. The practical answer depends on memory ceiling, runtime support, and how much quality loss you can tolerate. Start from the quantization and runtime surfaced in Rankings and Bench instead of assuming one preset wins everywhere.