
Running 32B LLMs on Apple Silicon

Dense 32B models and MoE 30B models are the most popular "serious quality" tier for local inference in 2025–2026. Here is what you need to run them, and how fast they go.

  • ~20 GB: Qwen 3 32B Q4_K_M weight footprint
  • 22 tok/s: Qwen 3 32B Q4 on M4 Max 64 GB (factory lab)
  • 36 GB: Minimum unified memory to run 32B Q4 comfortably
  • 92 tok/s: Qwen 3 30B A3B (MoE) Q4 on M4 Max 64 GB

Dense 32B vs MoE 30B: two very different beasts

The 27B–32B parameter tier now has two architectures with radically different performance profiles:

Dense 32B (Qwen 3 32B, Gemma 3 27B)

  • All 32B parameters active per token
  • Q4_K_M weight file: ~20 GB RAM
  • Speed is memory-bandwidth limited: ~22 tok/s on M4 Max 64 GB
  • Higher quality on reasoning tasks — all weights contribute to every token
  • Same RAM/bandwidth dynamics as any dense model: more RAM = more quant options

MoE 30B-A3B (Qwen 3 30B A3B)

  • Only 3B parameters active per token (3B active / 30B total)
  • Q4 weight file: ~18 GB RAM (full 30B structure stored, only 3B read per token)
  • Speed feels like a 3B model: 92 tok/s on M4 Max 64 GB — 4× faster than dense 32B
  • Quality is surprisingly close to dense 32B on many tasks
  • Ideal for chat and Q&A where speed matters; dense wins on complex reasoning
Key insight: Qwen 3 30B A3B (MoE) runs at 8B speeds with near-32B quality.
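
The 4× gap falls out of arithmetic: decode speed on Apple Silicon is memory-bandwidth bound, so the ceiling is roughly bandwidth divided by the bytes of active weights streamed per token. A back-of-envelope sketch (the ~546 GB/s figure is Apple's published bandwidth for the full M4 Max; ~1.9 GB assumes the MoE's ~3B active parameters at Q4 — both are assumptions for illustration):

```python
# Back-of-envelope decode-speed ceiling for a bandwidth-bound model:
# every decoded token streams the active weights once from memory.

def tok_per_s_ceiling(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    return bandwidth_gb_s / active_weight_gb

BW = 546.0  # GB/s, full M4 Max (assumed from Apple's published spec)

dense = tok_per_s_ceiling(BW, 20.0)  # dense 32B Q4: all ~20 GB read per token
moe = tok_per_s_ceiling(BW, 1.9)     # MoE 30B-A3B: ~3B active params ≈ 1.9 GB at Q4

print(f"dense 32B ceiling: {dense:.0f} tok/s (measured: 22)")
print(f"MoE 30B-A3B ceiling: {moe:.0f} tok/s (measured: 92)")
```

Measured numbers land well below the ceiling because attention compute, KV cache reads, and kernel overhead also cost time per token, but the ratio between the two architectures is what your model choice controls.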

If you have 24 GB+ RAM and haven't tried Qwen 3 30B A3B yet, it should be your first 30B-class model. At Q4, it fits in 18 GB, runs at 92 tok/s on M4 Max (vs 22 tok/s for dense 32B), and benchmarks show it's competitive with Qwen 3 32B on instruction-following and general Q&A. For deep reasoning, the dense 32B has an edge.

Measured benchmark data — 27B–32B models

All published results for this model tier. Sorted by tok/s descending.

| Chip | Model | Quant | RAM req. | Avg tok/s | Source |
| --- | --- | --- | --- | --- | --- |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q4 | ~18 GB | 92.1 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q5 | ~22 GB | 84.9 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q6 | ~26 GB | 76.7 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 30B A3B | Q4_K_M | ~18 GB | 70.2 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 30B A3B | Q8 | ~34 GB | 52.6 tok/s | ref |
| M4 Max (40-core GPU, 64 GB) | Qwen 3 32B | Q4_K_M | ~20 GB | 22.0 tok/s | factory lab |
| M4 Max (128 GB) | Gemma 3 27B | Q8_0 | ~29 GB | 14.5 tok/s | ref |
| M4 Max (32-core GPU) | Qwen 3 32B | iQ2_K_S | ~11 GB | 13.2 tok/s | ref |
| M4 Max (128 GB) | Qwen 3 32B | Q4_K_M | ~20 GB | 11.7 tok/s | ref |

Source: benchmarks.json. Factory lab data for Qwen 3 32B Q4_K_M on M4 Max 40-core 64 GB. All other rows are reference runs from community sources.

RAM requirements for 32B-class models

| Model | Quant | RAM needed | Fits in | Notes |
| --- | --- | --- | --- | --- |
| Qwen 3 30B A3B (MoE) | Q4 | ~18 GB | 24 GB+ | Runs at 8B speeds — best quality-per-RAM in this tier |
| Qwen 3 32B (dense) | iQ2_K_S | ~11 GB | 16 GB+ | Very compressed; quality lower than Q4 |
| Gemma 3 27B (dense) | Q4_K_M | ~17 GB | 24 GB+ | Fits in 24 GB with room for context |
| Qwen 3 32B (dense) | Q4_K_M | ~20 GB | 24–36 GB | 24 GB is tight; 36 GB is comfortable |
| Qwen 3 30B A3B (MoE) | Q8 | ~34 GB | 36 GB+ | High-quality MoE — needs 36 GB minimum |
| Qwen 3 32B (dense) | Q5_K_M | ~24 GB | 36 GB+ | Best quality/RAM balance for dense 32B |
| Gemma 3 27B (dense) | Q8_0 | ~29 GB | 36 GB+ | Near full-precision quality |
| Qwen 3 32B (dense) | Q8_0 | ~34 GB | 36–48 GB | Near lossless; fits in 36 GB with care |

RAM values are approximate and will vary by tool and KV cache settings. Add ~2–4 GB for OS/runtime overhead when planning your config.
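
That overhead estimate can be folded into a quick planner. This is a rough sketch only: the KV-cache formula is the standard one for GQA transformers, and the defaults (64 layers, 8 KV heads, head dim 128, fp16 cache) are assumed values for Qwen 3 32B — check the model card before relying on them.

```python
# Rough RAM planner for a 32B-class model: weights + KV cache + overhead.
# Architecture defaults below are assumptions for Qwen 3 32B, not verified specs.

def kv_cache_gb(ctx_tokens: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """K and V are each cached per layer per token (fp16 by default)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token / 1e9

def total_ram_gb(weight_gb: float, ctx_tokens: int, overhead_gb: float = 3.0) -> float:
    return weight_gb + kv_cache_gb(ctx_tokens) + overhead_gb

# Qwen 3 32B Q4_K_M (~20 GB weights) with a 32k context:
print(f"{total_ram_gb(20.0, 32_768):.1f} GB")  # ≈ 31.6 GB, so 36 GB is comfortable
```

Runtimes that quantize the KV cache to 8-bit roughly halve the cache term, which is one way a tighter config squeezes in dense 32B at shorter contexts.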

Which chips can run 32B models?

16–24 GB (tight, or not at all)

  • M4 (16 GB) — can load iQ2 Qwen 3 32B at ~13 tok/s, but quality is degraded
  • M4 Pro (24 GB) — Qwen 3 32B Q4 (~20 GB) technically fits, context is limited
  • M5 (16 GB) — same as M4 16 GB tier; iQ2 only
  • M5 (32 GB) — Qwen 3 30B A3B Q4 fits well (~18 GB); dense 32B Q4 is tight
  • Best model for this tier: Qwen 3 30B A3B Q4 — 18 GB, 92 tok/s on M4 Max

36 GB (sweet spot)

  • M5 Max (32-core, 36 GB) — new fastest in this tier; 61.6 tok/s on 8B — full benchmarks →
  • M3 Max (30-core, 36 GB) / M4 Max (32-core, 36 GB)
  • Qwen 3 32B Q4_K_M (~20 GB) fits with headroom
  • Qwen 3 30B A3B Q8 (~34 GB) fits with care
  • Gemma 3 27B Q5_K_M (~22 GB) fits comfortably
  • M4 Max 36 GB: 22 tok/s on Qwen 3 32B Q4 (comparable to 64 GB)

48–64 GB (comfortable)

  • M4 Max 48/64 GB: Qwen 3 32B Q5 (~24 GB) with full context headroom
  • M4 Max 64 GB: 22 tok/s on Qwen 3 32B Q4 (factory lab verified)
  • Can run dense 32B Q8 (~34 GB) at 48 GB
  • Ideal for daily 32B use — weight + KV cache + OS all fit easily

96–128 GB+ (future-proof)

  • Can run Qwen 3 32B at Q8_0 (~34 GB) alongside large context windows
  • Headroom for 70B Q4 as well — covers the full consumer LLM range
  • M3 Ultra and M4 Max 128 GB options in this tier
  • Overkill for 32B alone, but right for users running multiple models

Qwen 3 32B vs Qwen 3 30B A3B — which should you run?

Choose Qwen 3 32B (dense) when:

  • You're doing complex reasoning, math, or multi-step coding
  • Output consistency matters more than speed
  • You have 36 GB+ RAM and can spare it
  • You want the most capable 32B model available on Metal
  • You're using extended thinking mode (requires more sustained compute)

Choose Qwen 3 30B A3B (MoE) when:

  • You want 90+ tok/s in this quality tier — 4× faster than dense 32B
  • You have 24 GB RAM — it fits where dense 32B struggles
  • You're doing chat, Q&A, summarization, or creative writing
  • You want the best model that runs on an M4 Pro 24 GB
  • Speed-to-quality tradeoff: MoE wins on this metric at 24 GB configs
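
The two checklists reduce to a small decision helper. A sketch only: `pick_model` is a hypothetical function, and its thresholds come straight from the RAM table above.

```python
# Hypothetical picker encoding the guidance above: choose by RAM budget and
# whether the workload is reasoning-heavy. Thresholds follow the RAM table.

def pick_model(ram_gb: int, reasoning_heavy: bool = False) -> str:
    if ram_gb >= 36 and reasoning_heavy:
        return "Qwen 3 32B Q5_K_M (dense, ~24 GB)"
    if ram_gb >= 36:
        return "Qwen 3 30B A3B Q8 (MoE, ~34 GB)"
    if ram_gb >= 24:
        return "Qwen 3 30B A3B Q4 (MoE, ~18 GB)"
    if ram_gb >= 16:
        return "Qwen 3 32B iQ2_K_S (dense, ~11 GB, degraded quality)"
    return "No 30B-class model fits; look at the 8B tier"

print(pick_model(24))                        # MoE Q4: the M4 Pro 24 GB pick
print(pick_model(64, reasoning_heavy=True))  # dense Q5 with headroom
```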

benchmarks.json — full dataset  ·  models.json — model summaries  ·  benchmarks.csv — CSV export

See all benchmarks →