
Quantization

Also called: 8-bit quantization, 4-bit quantization, INT8

Quantization is the process of reducing the precision of a model's parameters (e.g. from 32-bit floats down to 8-bit or 4-bit integers) to shrink memory and speed up inference. A 4-bit quantized 70B model fits on a single GPU instead of needing eight; the accuracy loss is usually small.

A model trained with 32-bit floating-point weights stores 4 bytes per parameter. A 70B model is therefore ~280 GB of weights, which only fits on a high-end multi-GPU node. The same 70B model in 16-bit precision (a common production format) is 140 GB. In 8-bit, 70 GB. In 4-bit, 35 GB. The compute and bandwidth scale similarly.
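
For a sense of the arithmetic, here is a back-of-envelope sketch in plain Python (weights only; activations and the KV cache add overhead on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone: params * bits / 8 bytes, in GB."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B at {bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 70B at 32-bit: 280 GB
# 70B at 16-bit: 140 GB
# 70B at  8-bit: 70 GB
# 70B at  4-bit: 35 GB
```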

Quantization is the math trick that makes that shrinkage possible without retraining. The basic idea: for each weight matrix, find the min and max values, then map that range onto the coarser grid of values the lower bit width can represent. A weight that was, say, 0.0314159 might become 0.0312 in 8-bit and 0.03 in 4-bit. The values are slightly off, but the model still works.
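
A minimal sketch of that min-max (affine) scheme in NumPy. This is the naive per-tensor version, for illustration only; real quantizers typically work per channel or per group of weights:

```python
import numpy as np

def quantize_minmax(w: np.ndarray, bits: int):
    """Naive min-max quantization: map [w.min(), w.max()] onto
    the 2**bits - 1 integer steps a bits-wide code can represent."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2 ** bits - 1)  # real-valued width of one step
    q = np.round((w - lo) / scale).astype(np.int32)  # integer codes
    return q, scale, lo

def dequantize(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q * scale + lo
```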

The accuracy hit varies. Naive 8-bit quantization usually costs less than 1% on benchmarks. Naive 4-bit can cost 2-5%. Modern techniques (activation-aware quantization, mixed-precision schemes, calibration sets) often close most of that gap. For 4-bit specifically, GPTQ and AWQ are the standard approaches.
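
A quick way to see why 4-bit is lossier than 8-bit: round-trip a matrix through the sketch above and measure the error. Random Gaussian weights here, so the numbers are only indicative, not a benchmark:

```python
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

for bits in (8, 4):
    q, scale, lo = quantize_minmax(w, bits)  # from the sketch above
    err = float(np.abs(dequantize(q, scale, lo) - w).mean())
    print(f"{bits}-bit mean abs error: {err:.2e}")
# The 4-bit error is ~17x the 8-bit error: 15 integer steps
# instead of 255 over the same [min, max] range.
```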

The wins are dramatic: 4x to 8x memory savings, 2x to 4x throughput improvements, and smaller (cheaper) GPUs for the same model. Local LLM enthusiasts can run 70B models on consumer hardware only because of quantization.

The trade-offs are real but acceptable for most use cases:

  • Slight accuracy degradation, more visible on hard tasks
  • Quantized models are harder to fine-tune than full-precision ones
  • Some inputs (very rare tokens, edge cases) are more sensitive to precision loss than others

Frontier models behind major labs' APIs aren't something you quantize yourself, but the labs often serve quantized weights internally to reduce inference cost. A label like "fp8" or "int8" in a benchmark is a hint that quantization is happening behind the scenes.
