Why does Silicon Score default to MacBook Pro M5 Max 128GB 16-inch right now?

The MacBook Pro M5 Max 128GB 16-inch is the featured machine Silicon Score has in hand for 2026 Apple Silicon inference testing. That makes it the best default for the core lab questions right now: what is the strongest model that fits, which runtime is fastest, and how much evidence backs the answer?

How should I read field signals versus benchmark rows?

Field signals are directional reports from Apple Silicon practitioners and competitor sites. They stay labeled as field signals until Silicon Score reproduces the setup with clean local recordings, a documented methodology, and provenance suitable for Bench.

How much unified memory do you need for local LLMs on a Mac?

Unified memory determines which quantization tier fits cleanly, how much context you can keep live, and whether a recommendation stays practical after launch. In practice, 16GB can be useful for compact models, 24GB to 48GB opens more working room, and 64GB-plus tiers are where larger frontier setups become more credible. Use Fit to verify the exact model and headroom instead of relying on RAM labels alone.
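
As a rough illustration of why the quantization tier matters for fit, the sketch below estimates the size of the model weights alone from parameter count and bits per weight. It is not Silicon Score's Fit calculation and will not reproduce the headroom figures on this page: the function names and the `reserved_gb` default are hypothetical, and the estimate ignores KV cache, runtime overhead, and quantization metadata.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


def rough_headroom_gb(unified_memory_gb: float, params_billions: float,
                      bits_per_weight: float, reserved_gb: float = 8.0) -> float:
    """Memory left for context after weights and a reserved slice.

    reserved_gb is a hypothetical allowance for the OS and other apps,
    not a measured value.
    """
    weights = weight_footprint_gb(params_billions, bits_per_weight)
    return unified_memory_gb - reserved_gb - weights


if __name__ == "__main__":
    # Compare the same 27B dense model at two quantization tiers.
    for bits in (8, 4):
        weights = weight_footprint_gb(27, bits)
        left = rough_headroom_gb(128, 27, bits)
        print(f"27B at {bits}-bit: ~{weights:.1f} GB weights, ~{left:.1f} GB left on 128 GB")
```

Halving the bits per weight roughly halves the weight footprint, which is why a model that is tight at 8-bit can fit comfortably at 4-bit. Treat the output as a sanity check before opening Fit, not a replacement for it.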

What does tok/s mean and how much do I need?

Tokens per second measures how fast a model generates text. Around 8 to 15 tok/s usually feels interactive, while heavier coding, agent, or batch workflows benefit from more. Silicon Score ranks Macs with benchmark-backed or explicitly labeled estimated speed so you can trade responsiveness against model quality with the evidence visible.
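
To make the tok/s figure concrete, the arithmetic below converts a generation speed into the wait for a reply of a given length. It is a minimal sketch: `seconds_to_generate` is a hypothetical helper, the 500-token reply length is an arbitrary example, and prompt processing time before the first token is ignored.

```python
def seconds_to_generate(output_tokens: int, tok_per_s: float) -> float:
    """Time to stream output_tokens at a steady generation speed."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return output_tokens / tok_per_s


if __name__ == "__main__":
    reply_tokens = 500  # arbitrary example length, not a benchmark setting
    for speed in (8.0, 15.0, 55.0):
        wait = seconds_to_generate(reply_tokens, speed)
        print(f"{speed:5.1f} tok/s: about {wait:.0f} s for {reply_tokens} tokens")
```

The same reply that takes about a minute at 8 tok/s takes under ten seconds at 55 tok/s, which is the gap that matters for agent loops and long code generations.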

Is local LLM inference on a Mac cheaper than using an API?

If you run daily, want predictable privacy, or repeatedly use the same model class, local hardware can make sense. If your usage is spiky or you expect to burst into much larger models, compare the Mac path against rented GPU economics on AI Datacenter Index before you over-buy hardware.
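
One way to frame the comparison is a simple break-even estimate: how many months of API spend it takes to equal the hardware price. The sketch below is an assumption-laden illustration, not a Silicon Score or AI Datacenter Index calculation. The function name is hypothetical and every number in the example is a placeholder to replace with your own quotes and usage.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      monthly_tokens_millions: float,
                      api_price_per_million: float,
                      monthly_power_cost: float = 0.0) -> Optional[float]:
    """Months until avoided API spend covers the hardware price.

    Returns None when running locally never pays back, i.e. when the
    monthly API bill does not exceed the local running cost.
    """
    monthly_api_bill = monthly_tokens_millions * api_price_per_million
    monthly_saving = monthly_api_bill - monthly_power_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Placeholder inputs only; none of these are real prices.
    months = break_even_months(hardware_cost=6000.0,
                               monthly_tokens_millions=100.0,
                               api_price_per_million=2.0)
    print("never pays back" if months is None else f"break-even in ~{months:.0f} months")
```

Steady daily usage shortens the payback period, while spiky usage lengthens it or removes it entirely. The estimate also leaves out resale value, model quality differences, and privacy, so it is one input to the decision rather than the answer.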

What quantization format should I use on a Mac?

GGUF remains a common path for llama.cpp and Ollama on Macs, but there is no single best quantization for every machine. The practical answer depends on memory ceiling, runtime support, and how much quality loss you can tolerate. Start from the quantization and runtime surfaced in Rankings and Bench instead of assuming one preset wins everywhere.

Best local LLMs for this Mac

25 ranked models · Newest listed release April 22, 2026 · Latest benchmark source April 27, 2026

Models for the MacBook Pro M5 Max 128GB 16-inch, ranked for coding with the most capable models favored, using the strongest available runtime evidence for each model. Older baseline models are hidden.

25 older models hidden.

Evidence notes for recent models

See sources, gaps, and supporting ranking details.

Use the strongest current runtime evidence for each row. Largest fit: MiniMax M2.7 at 3bit (229B parameters). Fastest read: Gemma 4 E2B at 158.0 tok/s on MLX. Evidence is still sparse for Gemma 4 31B, Qwen3.6-27B, Qwen3.6-35B-A3B, and one more model; those rows remain provisional.

Gemma 4 31B

released 2026-04-02 · 5 official specs captured · 4 benchmark rows · 6 Apple Silicon field sources · first-party measurement queued

Best field report: 28.0 tok/s. The ranking stays provisional until a first-party measurement is available.

Qwen3.6-27B

released 2026-04-22 · 5 official specs captured · 1 benchmark row · 9 Apple Silicon field sources · first-party measurement queued

Best field report: 85.5 tok/s. The ranking stays provisional until a first-party measurement is available.

Qwen3.6-35B-A3B

released 2026-04-15 · 5 official specs captured · 4 benchmark rows · 15 Apple Silicon field sources · first-party measurement queued

Best field report: 203.1 tok/s. The ranking stays provisional until a first-party measurement is available.

Mistral Small 4 119B

released 2026-03-16 · 6 official specs captured · 3 benchmark rows · 1 Apple Silicon field source · first-party measurement queued

Best field report: 44.0 tok/s. The ranking stays provisional until a first-party measurement is available.

Rank	Model	Score	Quant	Tok/s	Runtime	Evidence	Headroom	Context	Why it ranks here
1	Gemma 4 31B (30.7B parameters)	290	8bit	26.0 tok/s	MLX	Estimated; first-party M5 batch queued	91.4 GB	87k	Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 91.4 GB headroom leaves workable context margin.
2	Qwen3.5-27B (27B parameters)	283	8bit	31.6 tok/s	MLX	Estimated; first-party M5 batch queued	100.4 GB	262k	Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 31.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 100.4 GB headroom leaves workable context margin.
3	Qwen3.6-27B (27B parameters)	280	8bit	16.6 tok/s	Ollama	Estimated; first-party M5 batch queued	100.4 GB	262k	Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 16.6 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 100.4 GB headroom leaves workable context margin.
4	Devstral Small 2 24B (24B parameters)	274	8bit	23.4 tok/s	llama.cpp	Estimated; first-party M5 batch queued	103.9 GB	262k	Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 23.4 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 103.9 GB headroom leaves workable context margin.
5	Qwen3.6-35B-A3B (3B active / 35B total)	252	8bit	55.0 tok/s	MLX	Estimated; first-party M5 batch queued	94.3 GB	262k	Recent frontier candidate in the current catalog. 8bit is the highest practical quality here. 55.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 94.3 GB headroom leaves workable context margin.
6	Mistral Small 4 119B (6.5B active / 119B total)	193	Q6_K	42.0 tok/s	Ollama	Estimated; first-party M5 batch queued	32.1 GB	32k	Recent frontier candidate in the current catalog. Q6_K is the highest practical quality here. 42.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 32.1 GB headroom leaves workable context margin.
7	MiniMax M2.7 (229B parameters)	422	3bit (source-backed MLX MiniMax-M2.7-3bit, 112 GB min)	40.4 tok/s	Best available	Field signal; first-party M5 batch queued	16.0 GB	116k	Recent model release in the current catalog. 3bit is the highest practical quality here. 40.4 tok/s only from Apple Silicon field signals. 16.0 GB headroom leaves workable context margin.
8	Gemma 4 26B-A4B (3.8B active / 25.2B total)	252	8bit	50.0 tok/s	MLX	Estimated; first-party M5 batch queued	102.2 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 50.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 102.2 GB headroom leaves workable context margin.
9	Nemotron Cascade 2 30B-A3B (3B active / 30B total)	250	8bit	28.0 tok/s	Ollama	Estimated; first-party M5 batch queued	99.2 GB	1000k	Recent model release in the current catalog. 8bit is the highest practical quality here. 28.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 99.2 GB headroom leaves workable context margin.
10	Qwen3.5-35B-A3B (3B active / 35B total)	249	8bit	48.0 tok/s	Ollama	Estimated; first-party M5 batch queued	94.3 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 48.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 94.3 GB headroom leaves workable context margin.
11	GLM-4.7-Flash (3B active / 30B total)	247	8bit	58.0 tok/s	llama.cpp	Estimated; first-party M5 batch queued	92.2 GB	90k	Recent model release in the current catalog. 8bit is the highest practical quality here. 58.0 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 92.2 GB headroom leaves workable context margin.
12	Nemotron-3-Nano-30B-A3B (3.5B active / 30B total)	246	8bit	43.7 tok/s	llama.cpp	Estimated; first-party M5 batch queued	99.2 GB	1000k	Recent model release in the current catalog. 8bit is the highest practical quality here. 43.7 tok/s estimated from nearby benchmark coverage, with llama.cpp backend as the best runtime hint. 99.2 GB headroom leaves workable context margin.
13	Magistral Small (24B parameters)	242	8bit	Not yet measured	Best available	Fit-first; first-party M5 batch queued	103.9 GB	41k	8bit is the highest practical quality here. Speed still needs direct speed coverage. 103.9 GB headroom leaves workable context margin.
14	Ministral 3 14B (14B parameters)	232	8bit	40.0 tok/s	Ollama	Estimated; first-party M5 batch queued	113.2 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 40.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 113.2 GB headroom leaves workable context margin.
15	Gemma 4 E4B (8B parameters)	230	8bit	128.0 tok/s	MLX	Estimated; first-party M5 batch queued	119.4 GB	131k	Recent model release in the current catalog. 8bit is the highest practical quality here. 128.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 119.4 GB headroom leaves workable context margin.
16	Qwen3.5-9B (9B parameters)	230	8bit	78.0 tok/s	MLX	Estimated; first-party M5 batch queued	118.1 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 78.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 118.1 GB headroom leaves workable context margin.
17	Ministral 3 8B (8B parameters)	223	8bit	72.0 tok/s	Ollama	Estimated; first-party M5 batch queued	119.0 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 72.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 119.0 GB headroom leaves workable context margin.
18	gpt-oss 20B (3.6B active / 21B total)	217	8bit	Not yet measured	MLX	Fit-first; first-party M5 batch queued	107.6 GB	131k	8bit is the highest practical quality here. Speed still needs direct speed coverage. 107.6 GB headroom leaves workable context margin.
19	Qwen3.5-122B-A10B (10B active / 122B total)	197	Q6_K	60.6 tok/s	MLX	Estimated; Bitter Mill import queued	33.6 GB	165k	Recent model release in the current catalog. Q6_K is the highest practical quality here. 60.6 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 33.6 GB headroom leaves workable context margin.
20	Llama 4 Scout 17B-16E (17B active / 109B total)	196	8bit	26.0 tok/s	Ollama	Estimated; first-party M5 batch queued	24.5 GB	37k	8bit is the highest practical quality here. 26.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 24.5 GB headroom leaves workable context margin.
21	Qwen3-Coder-Next (3B active / 80B total)	192	8bit	74.3 tok/s	MLX	Community row; first-party M5 batch queued	52.2 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 74.3 tok/s benchmark-backed on MLX backend. 52.2 GB headroom leaves workable context margin.
22	GLM-4.5-Air (12B active / 106B total)	190	8bit	18.0 tok/s	LM Studio	Estimated; first-party M5 batch queued	27.3 GB	55k	8bit is the highest practical quality here. 18.0 tok/s estimated from nearby benchmark coverage, with LM Studio wrapper on mixed as the best runtime hint. 27.3 GB headroom leaves workable context margin.
23	Gemma 4 E2B (5.1B parameters)	170	8bit	158.0 tok/s	MLX	Estimated; first-party M5 batch queued	122.5 GB	131k	Recent model release in the current catalog. 8bit is the highest practical quality here. 158.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.5 GB headroom leaves workable context margin.
24	Qwen3.5-4B (4B parameters)	167	8bit	148.0 tok/s	MLX	Estimated; first-party M5 batch queued	122.8 GB	262k	Recent model release in the current catalog. 8bit is the highest practical quality here. 148.0 tok/s estimated from nearby benchmark coverage, with MLX backend as the best runtime hint. 122.8 GB headroom leaves workable context margin.
25	gpt-oss 120B (5.1B active / 117B total)	164	Q6_K	7.0 tok/s	Ollama	Estimated; first-party M5 batch queued	37.6 GB	131k	Q6_K is the highest practical quality here. 7.0 tok/s estimated from nearby benchmark coverage, with Ollama wrapper on llama.cpp as the best runtime hint. 37.6 GB headroom leaves workable context margin.

Using Silicon Score

From ranking to decision

Rankings starts with the MacBook Pro M5 Max 128GB 16-inch, the Mac used for current first-party testing. You can switch to any listed Mac above.

Start with MacBook Pro M5 Max 128GB 16-inch

Qwen3.6-27B is the top-ranked answer for agent work in this catalog. The table keeps fit, speed, runtime, and evidence status together so an estimate never looks measured.

Check how strong the evidence is

MacBook Pro M5 Max 128GB 16-inch has 33 direct benchmark rows and 45 matching field reports in the current data. Measured, sourced, and estimated results stay labeled.

Use the Mac you own—or the one you may buy

Already have a Mac? Select it above to see what fits. Still shopping? Browse the machine pages and compare headroom, speed, and evidence before paying for memory you may not need.

If the real decision is local Mac versus rented GPU economics, compare the hardware path in Worth with AI Datacenter Index.