Running 32B LLMs on Apple Silicon
Dense 32B models and MoE 30B models make up the most popular "serious quality" tier for local inference in 2025–2026. Here is what you need to run them, and how fast they go.
Dense 32B vs MoE 30B: two very different beasts
The 27B–32B parameter tier now has two architectures with radically different performance profiles:
Dense 32B (Qwen 3 32B, Gemma 3 27B)
- All 32B parameters active per token
- Q4_K_M weight file: ~20 GB RAM
- Speed is memory-bandwidth limited: ~22 tok/s on M4 Max 64 GB
- Higher quality on reasoning tasks — all weights contribute to every token
- Same RAM/bandwidth dynamics as any dense model: more RAM = more quant options
MoE 30B-A3B (Qwen 3 30B A3B)
- Only 3B parameters active per token (3B active / 30B total)
- Q4 weight file: ~18 GB RAM (full 30B structure stored, only 3B read per token)
- Speed feels like a 3B model: 92 tok/s on M4 Max 64 GB — 4× faster than dense 32B
- Quality is surprisingly close to dense 32B on many tasks
- Ideal for chat and Q&A where speed matters; dense wins on complex reasoning
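The speed gap between the two architectures falls out of memory bandwidth alone. A minimal sketch of the estimate: per decoded token, a bandwidth-bound model must read all of its active weights once, so throughput is capped at bandwidth divided by active-weight bytes. The 546 GB/s figure is Apple's published spec for the 40-core M4 Max, and the bytes-per-parameter value is an assumed effective rate for Q4-class quants; treat both as illustrative.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
# Assumption: each generated token reads every *active* weight exactly once,
# so tok/s <= bandwidth / bytes_of_active_weights. Real throughput is lower
# (KV cache reads, MoE routing, kernel overhead).

def est_tok_per_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec for bandwidth-limited decoding."""
    active_gb = active_params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / active_gb

M4_MAX_BW = 546  # GB/s, 40-core M4 Max (Apple spec; assumption for this sketch)

# Dense 32B at Q4 (~0.56 bytes/param effective -> ~18 GB read per token)
print(est_tok_per_s(32, 0.56, M4_MAX_BW))  # ~30 tok/s ceiling vs 22 measured
# MoE 30B-A3B at Q4: only ~3B params read per token
print(est_tok_per_s(3, 0.56, M4_MAX_BW))   # ~325 tok/s ceiling vs 92 measured
```

The measured numbers landing well below the ceiling (especially for the MoE, where routing and shared layers add overhead) is expected; the point is the 10x gap in bytes read per token, which is why the MoE runs roughly 4x faster in practice.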
If you have 24 GB+ RAM and haven't tried Qwen 3 30B A3B yet, it should be your first 30B-class model. At Q4, it fits in 18 GB, runs at 92 tok/s on M4 Max (vs 22 tok/s for dense 32B), and benchmarks show it's competitive with Qwen 3 32B on instruction-following and general Q&A. For deep reasoning, the dense 32B has an edge.
Measured benchmark data — 27B–32B models
All published results for this model tier. Sorted by tok/s descending.
| Chip | Model | Quant | Avg tok/s | Source |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q4 | 92.1 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q5 | 84.9 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q6 | 76.7 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 30B A3B | Q4_K_M | 70.2 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q8 | 52.6 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 32B | Q4_K_M | 22.0 tok/s | factory lab |
| M4 Max (128 GB) | Gemma 3 27B | Q8_0 | 14.5 tok/s | ref |
| M4 Max (32-core GPU) | Qwen 3 32B | iQ2_K_S | 13.2 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 32B | Q4_K_M | 11.7 tok/s | ref |
Source: benchmarks.json. Factory lab data for Qwen 3 32B Q4_K_M on M4 Max 40-core 64 GB. All other rows are reference runs from community sources.
RAM requirements for 32B-class models
| Model | Quant | RAM needed | Fits in | Notes |
|---|---|---|---|---|
| Qwen 3 30B A3B (MoE) | Q4 | ~18 GB | 24 GB+ | 92 tok/s on M4 Max; best quality-per-RAM in this tier |
| Qwen 3 32B (dense) | iQ2_K_S | ~11 GB | 16 GB+ | Very compressed; quality lower than Q4 |
| Gemma 3 27B (dense) | Q4_K_M | ~17 GB | 24 GB+ | Fits in 24 GB with room for context |
| Qwen 3 32B (dense) | Q4_K_M | ~20 GB | 24–36 GB | 24 GB is tight; 36 GB is comfortable |
| Qwen 3 30B A3B (MoE) | Q8 | ~34 GB | 36 GB+ | High quality MoE — needs 36 GB minimum |
| Qwen 3 32B (dense) | Q5_K_M | ~24 GB | 36 GB+ | Best quality/RAM balance for dense 32B |
| Gemma 3 27B (dense) | Q8_0 | ~29 GB | 36 GB+ | Near full-precision quality |
| Qwen 3 32B (dense) | Q8_0 | ~34 GB | 36–48 GB | Near lossless; fits in 36 GB with care |
RAM values are approximate and will vary by tool and KV cache settings. Add ~2–4 GB for OS/runtime overhead when planning your config.
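The "Fits in" column above can be reduced to a single check: weights plus KV cache plus OS/runtime overhead must stay under physical RAM. A minimal sketch, using the ~2–4 GB overhead margin from the note above; the default KV cache allowance is an assumption for a moderate context, not a measured value.

```python
# Rough "will it fit?" check mirroring the RAM table. All figures are
# approximations: weight-file sizes vary by quant tooling, and the
# overhead default is the planning margin suggested in the text.

def fits(weight_gb: float, ram_gb: int,
         kv_cache_gb: float = 2.0, overhead_gb: float = 3.0) -> bool:
    """True if weights + KV cache + OS/runtime overhead fit in RAM."""
    return weight_gb + kv_cache_gb + overhead_gb <= ram_gb

print(fits(20, 24))  # Qwen 3 32B Q4_K_M on 24 GB -> False ("tight", as noted)
print(fits(20, 36))  # same model on 36 GB -> True
print(fits(18, 24))  # Qwen 3 30B A3B Q4 on 24 GB -> True, little margin
```

With these default margins, dense 32B Q4 fails the check at 24 GB, matching the table's "24 GB is tight" note; shrinking the KV cache allowance (shorter context) is what makes it loadable in practice.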
Which chips can run 32B models?
16–24 GB (tight, or not at all)
- M4 (16 GB) — can load iQ2 Qwen 3 32B (~11 GB), but quality is degraded and the base M4's memory bandwidth limits speed
- M4 Pro (24 GB) — Qwen 3 32B Q4 (~20 GB) technically fits, context is limited
- M5 (16 GB) — same as M4 16 GB tier; iQ2 only
- M5 (32 GB) — Qwen 3 30B A3B Q4 fits well (~18 GB); dense 32B Q4 is tight
- Best model for this tier: Qwen 3 30B A3B Q4 — ~18 GB; 92 tok/s on M4 Max, proportionally slower on lower-bandwidth chips
36 GB (sweet spot)
- M5 Max (32-core, 36 GB) — newest and fastest in this tier (61.6 tok/s on an 8B model)
- M3 Max (30-core, 36 GB) / M4 Max (32-core, 36 GB)
- Qwen 3 32B Q4_K_M (~20 GB) fits with headroom
- Qwen 3 30B A3B Q8 (~34 GB) fits with care
- Gemma 3 27B Q5_K_M (~22 GB) fits comfortably
- M4 Max 36 GB: 22 tok/s on Qwen 3 32B Q4 (comparable to 64 GB)
48–64 GB (comfortable)
- M4 Max 48/64 GB: Qwen 3 32B Q5 (~24 GB) with full context headroom
- M4 Max 64 GB: 22 tok/s on Qwen 3 32B Q4 (factory lab verified)
- Can run dense 32B Q8 (~34 GB) at 48 GB
- Ideal for daily 32B use — weight + KV cache + OS all fit easily
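The "weight + KV cache + OS" breakdown is worth quantifying, since KV cache is what quietly consumes the headroom at long contexts. A sketch of the standard transformer KV-cache formula; the Qwen 3 32B shape used below (64 layers, 8 KV heads via GQA, head_dim 128) is an assumption taken from the published model config, so check your GGUF metadata for exact values.

```python
# KV cache growth is why "the weights fit" is not the whole story.
# Standard formula: 2 (K and V) x layers x kv_heads x head_dim x tokens
# x bytes per element. Model shape below is an assumed Qwen 3 32B config.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for an FP16 cache by default."""
    b = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return b / 1e9

# Qwen 3 32B, 32k-token context, FP16 cache:
print(kv_cache_gb(64, 8, 128, 32768))  # ~8.6 GB on top of the ~20 GB weights
```

At ~8.6 GB for a 32k FP16 cache, a 48–64 GB machine can hold dense 32B Q4 weights plus a long context comfortably, while a 36 GB machine has to trade context length against quant size. Quantized (Q8) KV caches roughly halve this figure.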
96–128 GB+ (future-proof)
- Can run Qwen 3 32B at Q8_0 (~34 GB) alongside large context windows
- Headroom for 70B Q4 as well — covers the full consumer LLM range
- M3 Ultra and M4 Max 128 GB options in this tier
- Overkill for 32B alone, but right for users running multiple models
Qwen 3 32B vs Qwen 3 30B A3B — which should you run?
Choose Qwen 3 32B (dense) when:
- You're doing complex reasoning, math, or multi-step coding
- Output consistency matters more than speed
- You have 36 GB+ RAM and can spare it
- You want the most capable 32B model available on Metal
- You're using extended thinking mode (requires more sustained compute)
Choose Qwen 3 30B A3B (MoE) when:
- You want 90+ tok/s in this quality tier — 4× faster than dense 32B
- You have 24 GB RAM — it fits where dense 32B struggles
- You're doing chat, Q&A, summarization, or creative writing
- You want the best model that runs on an M4 Pro 24 GB
- Speed-to-quality tradeoff: MoE wins on this metric at 24 GB configs
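The two checklists above reduce to a small rule of thumb. The thresholds and task labels in this sketch are illustrative assumptions summarizing the guidance in this section, not benchmarked cutoffs.

```python
# Rule-of-thumb model picker for the 27B-32B tier, condensing the two
# checklists above. Thresholds are illustrative, not benchmarked.

def pick_model(ram_gb: int, task: str) -> str:
    deep_tasks = {"reasoning", "math", "multi-step coding"}
    if ram_gb >= 36 and task in deep_tasks:
        return "Qwen 3 32B (dense) Q4_K_M or Q5_K_M"
    if ram_gb >= 24:
        return "Qwen 3 30B A3B (MoE) Q4"  # fast default for this tier
    return "Qwen 3 32B iQ2_K_S (degraded) or a smaller model"

print(pick_model(36, "math"))  # dense 32B: deep reasoning, enough RAM
print(pick_model(24, "chat"))  # MoE: speed wins where quality is close
print(pick_model(16, "chat"))  # below the tier's comfortable floor
```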
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv — CSV export