Running 32B LLMs on Apple Silicon
Dense 32B models and MoE 30B models make up the most popular "serious quality" tier for local inference in 2025–2026. Here is what you need to run them, and how fast they go.
Dense 32B vs MoE 30B: two very different beasts
The 27B–32B parameter tier now has two architectures with radically different performance profiles:
Dense 32B (Qwen 3 32B, Gemma 3 27B)
- All 32B parameters active per token
- Q4_K_M weight file: ~20 GB RAM
- Speed is memory-bandwidth limited: ~22 tok/s on M4 Max 64 GB
- Higher quality on reasoning tasks — all weights contribute to every token
- Same RAM/bandwidth dynamics as any dense model: more RAM = more quant options
MoE 30B-A3B (Qwen 3 30B A3B)
- Only 3B parameters active per token (3B active / 30B total)
- Q4 weight file: ~18 GB RAM (full 30B structure stored, only 3B read per token)
- Speed feels like a 3B model: 92 tok/s on M4 Max 64 GB — 4× faster than dense 32B
- Quality is surprisingly close to dense 32B on many tasks
- Ideal for chat and Q&A where speed matters; dense wins on complex reasoning
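The speed gap between the two architectures falls out of memory bandwidth alone. A minimal sketch of the estimate: per decoded token, a bandwidth-bound model must read all of its active weights once, so throughput is capped at bandwidth divided by active-weight bytes. The 546 GB/s figure is Apple's published spec for the 40-core M4 Max, and the bytes-per-parameter value is an assumed effective rate for Q4-class quants; treat both as illustrative.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
# Assumption: each generated token reads every *active* weight exactly once,
# so tok/s <= bandwidth / bytes_of_active_weights. Real throughput is lower
# (KV cache reads, MoE routing, kernel overhead).

def est_tok_per_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec for bandwidth-limited decoding."""
    active_gb = active_params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / active_gb

M4_MAX_BW = 546  # GB/s, 40-core M4 Max (Apple spec; assumption for this sketch)

# Dense 32B at Q4 (~0.56 bytes/param effective -> ~18 GB read per token)
print(est_tok_per_s(32, 0.56, M4_MAX_BW))  # ~30 tok/s ceiling vs 22 measured
# MoE 30B-A3B at Q4: only ~3B params read per token
print(est_tok_per_s(3, 0.56, M4_MAX_BW))   # ~325 tok/s ceiling vs 92 measured
```

The measured numbers landing well below the ceiling (especially for the MoE, where routing and shared layers add overhead) is expected; the point is the 10x gap in bytes read per token, which is why the MoE runs roughly 4x faster in practice.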
If you have 24 GB+ RAM and haven't tried Qwen 3 30B A3B yet, it should be your first 30B-class model. At Q4, it fits in 18 GB, runs at 92 tok/s on M4 Max (vs 22 tok/s for dense 32B), and benchmarks show it's competitive with Qwen 3 32B on instruction-following and general Q&A. For deep reasoning, the dense 32B has an edge.
Measured benchmark data — 27B–32B models
All published results for this model tier. Sorted by tok/s descending.
| Chip | Model | Quant | Avg tok/s | Source |
|---|---|---|---|---|
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q4 | 92.1 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q5 | 84.9 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q6 | 76.7 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 30B A3B | Q4_K_M | 70.2 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q8 | 52.6 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 32B | Q4_K_M | 22.0 tok/s | factory lab |
| M4 Max (128 GB) | Gemma 3 27B | Q8_0 | 14.5 tok/s | ref |
| M4 Max (32-core GPU) | Qwen 3 32B | iQ2_K_S | 13.2 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 32B | Q4_K_M | 11.7 tok/s | ref |
Source: benchmarks.json. Factory lab data for Qwen 3 32B Q4_K_M on M4 Max 40-core 64 GB. All other rows are reference runs from community sources.
RAM requirements for 32B-class models
| Model | Quant | RAM needed | Fits in | Notes |
|---|---|---|---|---|
| Qwen 3 30B A3B (MoE) | Q4 | ~18 GB | 24 GB+ | 92 tok/s on M4 Max; best quality-per-RAM in this tier |
| Qwen 3 32B (dense) | iQ2_K_S | ~11 GB | 16 GB+ | Very compressed; quality lower than Q4 |
| Gemma 3 27B (dense) | Q4_K_M | ~17 GB | 24 GB+ | Fits in 24 GB with room for context |
| Qwen 3 32B (dense) | Q4_K_M | ~20 GB | 24–36 GB | 24 GB is tight; 36 GB is comfortable |
| Qwen 3 30B A3B (MoE) | Q8 | ~34 GB | 36 GB+ | High quality MoE — needs 36 GB minimum |
| Qwen 3 32B (dense) | Q5_K_M | ~24 GB | 36 GB+ | Best quality/RAM balance for dense 32B |
| Gemma 3 27B (dense) | Q8_0 | ~29 GB | 36 GB+ | Near full-precision quality |
| Qwen 3 32B (dense) | Q8_0 | ~34 GB | 36–48 GB | Near lossless; fits in 36 GB with care |
RAM values are approximate and will vary by tool and KV cache settings. Add ~2–4 GB for OS/runtime overhead when planning your config.
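The "Fits in" column above can be reduced to a single check: weights plus KV cache plus OS/runtime overhead must stay under physical RAM. A minimal sketch, using the ~2–4 GB overhead margin from the note above; the default KV cache allowance is an assumption for a moderate context, not a measured value.

```python
# Rough "will it fit?" check mirroring the RAM table. All figures are
# approximations: weight-file sizes vary by quant tooling, and the
# overhead default is the planning margin suggested in the text.

def fits(weight_gb: float, ram_gb: int,
         kv_cache_gb: float = 2.0, overhead_gb: float = 3.0) -> bool:
    """True if weights + KV cache + OS/runtime overhead fit in RAM."""
    return weight_gb + kv_cache_gb + overhead_gb <= ram_gb

print(fits(20, 24))  # Qwen 3 32B Q4_K_M on 24 GB -> False ("tight", as noted)
print(fits(20, 36))  # same model on 36 GB -> True
print(fits(18, 24))  # Qwen 3 30B A3B Q4 on 24 GB -> True, little margin
```

With these default margins, dense 32B Q4 fails the check at 24 GB, matching the table's "24 GB is tight" note; shrinking the KV cache allowance (shorter context) is what makes it loadable in practice.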
Which chips can run 32B models?
16–24 GB (tight, or not at all)
- M4 (16 GB) — can load iQ2 Qwen 3 32B (~11 GB), but quality is degraded and the base M4's memory bandwidth limits speed
- M4 Pro (24 GB) — Qwen 3 32B Q4 (~20 GB) technically fits, context is limited
- M5 (16 GB) — same as M4 16 GB tier; iQ2 only
- M5 (32 GB) — Qwen 3 30B A3B Q4 fits well (~18 GB); dense 32B Q4 is tight
- Best model for this tier: Qwen 3 30B A3B Q4 — ~18 GB; 92 tok/s on M4 Max, proportionally slower on lower-bandwidth chips
36 GB (sweet spot)
- M5 Max (32-core, 36 GB) — newest and fastest in this tier (61.6 tok/s on an 8B model)
- M3 Max (30-core, 36 GB) / M4 Max (32-core, 36 GB)
- Qwen 3 32B Q4_K_M (~20 GB) fits with headroom
- Qwen 3 30B A3B Q8 (~34 GB) fits with care
- Gemma 3 27B Q5_K_M (~22 GB) fits comfortably
- M4 Max 36 GB: 22 tok/s on Qwen 3 32B Q4 (comparable to 64 GB)
48–64 GB (comfortable)
- M4 Max 48/64 GB: Qwen 3 32B Q5 (~24 GB) with full context headroom
- M4 Max 64 GB: 22 tok/s on Qwen 3 32B Q4 (factory lab verified)
- Can run dense 32B Q8 (~34 GB) at 48 GB
- Ideal for daily 32B use — weight + KV cache + OS all fit easily
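The "weight + KV cache + OS" breakdown is worth quantifying, since KV cache is what quietly consumes the headroom at long contexts. A sketch of the standard transformer KV-cache formula; the Qwen 3 32B shape used below (64 layers, 8 KV heads via GQA, head_dim 128) is an assumption taken from the published model config, so check your GGUF metadata for exact values.

```python
# KV cache growth is why "the weights fit" is not the whole story.
# Standard formula: 2 (K and V) x layers x kv_heads x head_dim x tokens
# x bytes per element. Model shape below is an assumed Qwen 3 32B config.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for an FP16 cache by default."""
    b = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return b / 1e9

# Qwen 3 32B, 32k-token context, FP16 cache:
print(kv_cache_gb(64, 8, 128, 32768))  # ~8.6 GB on top of the ~20 GB weights
```

At ~8.6 GB for a 32k FP16 cache, a 48–64 GB machine can hold dense 32B Q4 weights plus a long context comfortably, while a 36 GB machine has to trade context length against quant size. Quantized (Q8) KV caches roughly halve this figure.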
96–128 GB+ (future-proof)
- Can run Qwen 3 32B at Q8_0 (~34 GB) alongside large context windows
- Headroom for 70B Q4 as well — covers the full consumer LLM range
- M3 Ultra and M4 Max 128 GB options in this tier
- Overkill for 32B alone, but right for users running multiple models
Qwen 3 32B vs Qwen 3 30B A3B — which should you run?
Choose Qwen 3 32B (dense) when:
- You're doing complex reasoning, math, or multi-step coding
- Output consistency matters more than speed
- You have 36 GB+ RAM and can spare it
- You want the most capable 32B model available on Metal
- You're using extended thinking mode (requires more sustained compute)
Choose Qwen 3 30B A3B (MoE) when:
- You want 90+ tok/s in this quality tier — 4× faster than dense 32B
- You have 24 GB RAM — it fits where dense 32B struggles
- You're doing chat, Q&A, summarization, or creative writing
- You want the best model that runs on an M4 Pro 24 GB
- Speed-to-quality tradeoff: MoE wins on this metric at 24 GB configs
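The two checklists above reduce to a small rule of thumb. The thresholds and task labels in this sketch are illustrative assumptions summarizing the guidance in this section, not benchmarked cutoffs.

```python
# Rule-of-thumb model picker for the 27B-32B tier, condensing the two
# checklists above. Thresholds are illustrative, not benchmarked.

def pick_model(ram_gb: int, task: str) -> str:
    deep_tasks = {"reasoning", "math", "multi-step coding"}
    if ram_gb >= 36 and task in deep_tasks:
        return "Qwen 3 32B (dense) Q4_K_M or Q5_K_M"
    if ram_gb >= 24:
        return "Qwen 3 30B A3B (MoE) Q4"  # fast default for this tier
    return "Qwen 3 32B iQ2_K_S (degraded) or a smaller model"

print(pick_model(36, "math"))  # dense 32B: deep reasoning, enough RAM
print(pick_model(24, "chat"))  # MoE: speed wins where quality is close
print(pick_model(16, "chat"))  # below the tier's comfortable floor
```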
Data
benchmarks.json — full dataset · models.json — model summaries · benchmarks.csv — CSV export