Local model memory: bits, dense, MoE

I use one small memory check before getting excited about a local model: count the weights first, then leave room for everything else.

The model size on the card is not enough by itself. A 7B model, a 70B model, and a MoE model with “only 16B active parameters” can mean very different things for compute, but the memory question starts with the same boring calculation: how many parameters have to be loaded, and how many bits are used for each weight?

[Figure: Dense and MoE local model memory. Dense panel: all weights are loaded and all layers are active, so memory and compute track together. MoE panel: all experts are loaded but only the top-k run, so compute drops while the weights still exist.]
MoE routing saves work per token. It does not make the unused expert weights disappear.

The rough weight number is:

weight memory ~= total_parameters x bits_per_weight / 8
8B model at Q4   ~= 8 x 4  / 8 = 4 GB of weights
8B model at FP16 ~= 8 x 16 / 8 = 16 GB of weights
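
The same arithmetic as a minimal Python sketch. The function name `weight_memory_gb` is my own, and it assumes parameters are given in billions and that 1 GB means 10^9 bytes, which is why the result lands directly in GB.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in GB (assuming 1 GB = 10**9 bytes)."""
    # billions of parameters x bits per weight / 8 bits per byte = billions of bytes
    return total_params_billions * bits_per_weight / 8


examples = [
    ("8B at Q4", 8, 4),
    ("8B at Q8", 8, 8),
    ("8B at FP16", 8, 16),
    ("70B at Q4", 70, 4),
]

for label, params_b, bits in examples:
    print(f"{label}: ~{weight_memory_gb(params_b, bits):.0f} GB of weights")
```

Real quantized files usually come out a little above this figure because of stored scales and metadata, so treat the result as a floor.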

That number is only the weight floor. Real inference also needs KV cache, temporary buffers, driver/runtime overhead, and sometimes extra space for quantization scales or dequantized work buffers. Long context makes the KV cache matter more. A model with 4 GB of weights should not be planned as a 4 GB run.
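
To get a feel for why long context matters, here is a rough KV cache estimate. It is a sketch under assumptions: a standard transformer that caches one key and one value tensor per layer, a 16-bit cache with no cache quantization or sliding window, and a made-up model shape (32 layers, 8 KV heads, head dimension 128) that does not describe any particular model.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
    batch_size: int = 1,
) -> float:
    """Rough KV cache size in GB (1 GB = 10**9 bytes)."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * context_tokens * bytes_per_element * batch_size
    )
    return total_bytes / 1e9


# Hypothetical model shape, for illustration only.
for context in (8_192, 32_768, 131_072):
    size = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=context)
    print(f"{context:>7} tokens: ~{size:.1f} GB of KV cache")
```

The cache grows linearly with both context length and batch size, so the same weights can need very different totals depending on how the model is run.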

Example                     Weights only   Safer plan
Dense 8B, Q4                ~4 GB          6-8 GB
Dense 8B, Q8                ~8 GB          10-12 GB
Dense 70B, Q4               ~35 GB         45-55 GB
MoE 64B total, top-2, Q4    ~32 GB         40+ GB

For a dense model, the total parameter count and the active parameter count are basically the same thing. An 8B dense model uses the whole model to produce tokens, and the whole model has to fit.

For a MoE model, there are two different numbers:

Number               What it means
Total parameters     Shared layers plus every expert weight that has to be available to the runtime. This is the memory number.
Active parameters    Shared layers plus the experts selected for a token. This is mostly the compute number.

That is the boundary I care about. “64B MoE, 16B active” is a useful compute claim. It is not a promise that the model has the memory footprint of a 16B dense model. The unused experts still have to live somewhere: VRAM, CPU RAM, or an offload path. Offloading can reduce VRAM pressure, but it trades that pressure for system memory, data movement, and latency.
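
The two numbers can be made concrete with a small sketch. The split below is hypothetical: 8B of shared parameters plus 14 experts of 4B each (summed across layers) with top-2 routing, chosen only because it reproduces "64B total, 16B active". The class and field names are mine, not any runtime's API.

```python
from dataclasses import dataclass


@dataclass
class MoEShape:
    shared_b: float   # shared (non-expert) parameters, in billions
    expert_b: float   # parameters per expert, summed across layers, in billions
    n_experts: int    # experts that must be available to the runtime
    top_k: int        # experts selected per token

    @property
    def total_b(self) -> float:
        # Memory number: shared layers plus every expert.
        return self.shared_b + self.n_experts * self.expert_b

    @property
    def active_b(self) -> float:
        # Compute number: shared layers plus only the selected experts.
        return self.shared_b + self.top_k * self.expert_b


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8


moe = MoEShape(shared_b=8, expert_b=4, n_experts=14, top_k=2)  # hypothetical split
print(f"total: {moe.total_b:.0f}B, active: {moe.active_b:.0f}B")
print(f"weights to hold at Q4: ~{weight_memory_gb(moe.total_b, 4):.0f} GB")
print(f"a 16B dense model at Q4: ~{weight_memory_gb(moe.active_b, 4):.0f} GB")
```

The gap between the last two lines is the memory that the "16B active" claim does not account for.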

So the quick local-model rule is simple: start with total parameters times bits per weight, divided by 8 to get bytes, then leave at least 20-30 percent extra space. If the context is long, the batch is larger than one, or the runtime is doing aggressive offload, leave more.
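
As a final sketch, the rule can be turned into a small planning helper. The names `plan_memory_gb` and `fits` are hypothetical, and the 20-30 percent headroom is treated as a minimum, so long context, larger batches, or offload can push the real requirement higher than what this prints.

```python
def plan_memory_gb(
    total_params_billions: float,
    bits_per_weight: float,
    headroom: tuple[float, float] = (0.20, 0.30),  # minimum extra space, per the rule
) -> tuple[float, float, float]:
    """Return (weights, low plan, high plan) in GB, with 1 GB = 10**9 bytes."""
    weights = total_params_billions * bits_per_weight / 8
    low, high = headroom
    return weights, weights * (1 + low), weights * (1 + high)


def fits(budget_gb: float, total_params_billions: float, bits_per_weight: float) -> bool:
    """Conservative check: the budget must cover the high end of the plan."""
    _, _, high_plan = plan_memory_gb(total_params_billions, bits_per_weight)
    return budget_gb >= high_plan


for label, params_b, bits, budget in [
    ("Dense 8B, Q4 on 8 GB", 8, 4, 8),
    ("MoE 64B total, Q4 on 24 GB", 64, 4, 24),
]:
    weights, low, high = plan_memory_gb(params_b, bits)
    verdict = "fits" if fits(budget, params_b, bits) else "does not fit"
    print(f"{label}: weights ~{weights:.0f} GB, plan {low:.1f}-{high:.1f} GB, {verdict}")
```

For a MoE model, pass the total parameter count, not the active count.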