M5 MacBook — LLM Inference Benchmarks
How fast is the M5 chip family for local LLM inference? M5 base and M5 Max community reference data are both live. M5 Max (32-core, 36 GB) hits 61.6 tok/s on Llama 3.1 8B — matching M2 Ultra.
M5 vs M4 vs M3 — side-by-side (early data)
Early community reference benchmarks from LocalScore. All runs use Q4_K - Medium quantization. These come from community submissions, not factory lab runs — treat them as directional until verified.
| Model | M5 (32 GB) | M4 (32 GB) | M3 (24 GB) |
|---|---|---|---|
| Llama 3.2 1B Instruct | 98.4 tok/s | 75.6 tok/s | 64.7 tok/s |
| Llama 3.1 8B Instruct | 22.3 tok/s | 16.8 tok/s | 10.2 tok/s |
| Qwen 2.5 14B Instruct | 11.5 tok/s | 8.6 tok/s | 6.1 tok/s |
M5 source: LocalScore accelerator/2912. M4/M3 sources: LocalScore community aggregation. Factory lab data will replace or supplement these rows.
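The generation-over-generation speedups implied by the table can be read off directly. A minimal sketch using the tok/s values above:

```python
# tok/s values copied from the comparison table above (Q4_K - Medium).
table = {
    "Llama 3.2 1B": {"M5": 98.4, "M4": 75.6, "M3": 64.7},
    "Llama 3.1 8B": {"M5": 22.3, "M4": 16.8, "M3": 10.2},
    "Qwen 2.5 14B": {"M5": 11.5, "M4": 8.6, "M3": 6.1},
}

for model, r in table.items():
    vs_m4 = (r["M5"] / r["M4"] - 1) * 100   # percent gain over M4
    vs_m3 = (r["M5"] / r["M3"] - 1) * 100   # percent gain over M3
    print(f"{model:<14} +{vs_m4:.0f}% vs M4, +{vs_m3:.0f}% vs M3")
```

On these rows, M5 lands roughly 30–34% ahead of M4 across model sizes, with a much larger gap to M3.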
Visual comparison — Llama 3.1 8B Instruct
Bars show tok/s relative to M5 (32 GB): M4 at ~75% and M3 at ~46% of the M5 figure. All Q4_K - Medium.
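The bar comparison can be rebuilt in text from the Llama 3.1 8B row of the table above — a minimal sketch:

```python
# Llama 3.1 8B Instruct, Q4_K - Medium tok/s from the comparison table.
results = {"M5 (32 GB)": 22.3, "M4 (32 GB)": 16.8, "M3 (24 GB)": 10.2}

baseline = results["M5 (32 GB)"]
for chip, tps in results.items():
    pct = tps / baseline * 100
    bar = "#" * round(pct / 2)   # one '#' per 2% of the M5 baseline
    print(f"{chip:<12} {bar:<50} {tps:.1f} tok/s ({pct:.0f}%)")
```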
What we know about M5 so far
Confirmed specs (M5 MacBook)
- 10-core CPU (4 performance + 6 efficiency cores)
- 10-core GPU (base configuration)
- Unified memory: 16 GB, 24 GB, or 32 GB
- Memory bandwidth: significantly higher than M4 base
- Neural Engine: 38 TOPS (claimed)
- Built on 3nm process (TSMC N3P)
LLM inference implications
- ~33% higher tok/s vs M4 base on Llama 3.1 8B (measured LocalScore data)
- 32 GB RAM tier enables 14B models at Q8 (13.4 GB) and some 32B at Q3
- Memory bandwidth is the primary driver for this gain
- Model fit is identical to M4 base at same RAM tier
- First-party benchmarks from the 128 GB factory lab unit will add 32B and potentially 70B models
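The model-fit points above follow from a standard rule of thumb: weight bytes ≈ parameters × bits-per-weight / 8, plus headroom for KV cache and runtime overhead. A rough sketch — the 20% overhead and the fraction of unified memory usable by the GPU are assumptions, not measured values:

```python
GB = 1024**3

def fits(params_b: float, bits: float, ram_gb: float, usable_frac: float = 0.75) -> bool:
    """Rough fit check: params_b is billions of parameters, bits is the
    effective bits per weight for the quantization level. usable_frac
    approximates how much unified memory macOS lets the GPU address."""
    weight_gb = params_b * 1e9 * bits / 8 / GB
    need_gb = weight_gb * 1.2  # assumed ~20% headroom for KV cache etc.
    return need_gb <= ram_gb * usable_frac

print(fits(14, 8, 32))    # 14B at Q8 on a 32 GB tier
print(fits(32, 3.5, 32))  # 32B at ~Q3 on a 32 GB tier
print(fits(70, 4.5, 32))  # 70B at ~Q4 needs far more than 32 GB
```

By this estimate a 14B model at 8 bits weighs in around 13 GB, consistent with the figure quoted above.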
Should you upgrade?
For the M5 base MacBook: probably not. The ~30–65% throughput gain is real, but the RAM ceiling stays at 32 GB, which limits you to the same model sizes as before. If your bottleneck is running larger models (32B+), M5 base doesn't change that. If you're running 8B–14B models and want faster responses, the upgrade is meaningful.
M5 Max is a different story. At 36 GB, M5 Max (32-core) hits 61.6 tok/s on Llama 3.1 8B — matching M2 Ultra and beating M3 Max 30-core by 64%. It's the highest-bandwidth option at 36 GB. See M5 Max vs M3 Max →
All M5 benchmark rows
Community reference rows from LocalScore. Factory lab rows (128 GB unit) coming soon. Sorted by avg tok/s.
| Chip | Model | Quant | Avg tok/s | Source |
|---|---|---|---|---|
| M5 Max (32-core GPU, 36 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | 229.0 tok/s | ref |
| M5 (10-core GPU, 32 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | 98.4 tok/s | ref |
| M5 (10-core GPU, 16 GB) | Llama 3.2 1B Instruct | Q4_K - Medium | 98.1 tok/s | ref |
| M5 Max (32-core GPU, 36 GB) | Llama 3.1 8B Instruct | Q4_K - Medium | 61.6 tok/s | ref |
| M5 Max (32-core GPU, 36 GB) | Qwen 2.5 14B Instruct | Q4_K - Medium | 34.3 tok/s | ref |
| M5 (10-core GPU, 32 GB) | Llama 3.1 8B Instruct | Q4_K - Medium | 22.3 tok/s | ref |
| M5 (10-core GPU, 32 GB) | Qwen 2.5 14B Instruct | Q4_K - Medium | 11.5 tok/s | ref |
Factory lab unit (128 GB) will add Qwen 3 32B, Llama 3.3 70B, and quantization ladder rows. Subscribe to be notified.
Data
benchmarks.json — full dataset · chips.json — chip summaries · benchmarks.csv — CSV export
Rows marked "ref" are community-submitted reference benchmarks. Factory lab rows are marked "factory lab" and are first-party verified.