Quantisation is how a 70-billion-parameter model that needs about 140GB of RAM at 16-bit precision gets compressed into something that fits in roughly 40GB. The trade-off is precision: quantisation reduces the numerical accuracy of each model weight from 16 bits down to 8, 6, 4, or even fewer bits. Less precision means a smaller file and faster inference, at the cost of some response quality. Understanding which quantisation level to use, and what you are giving up, is the difference between a local AI setup that actually works and one that frustrates you into stopping.
In short: Q4_K_M is the practical default for most hardware. It fits 7B models in 5GB of RAM with acceptable quality loss. Q6_K is the sweet spot for quality if you have the RAM. Q8_0 is near-lossless but needs twice the RAM of Q4. Start with Q4_K_M unless you have 16GB or more free for AI inference, in which case Q6_K or Q8_0 is worth the upgrade.
What Quantisation Actually Does
A language model consists of billions of numerical values called weights. In their original training form, each weight is stored as a 16-bit or 32-bit floating point number (FP16 or FP32). These high-precision numbers require significant storage and RAM. A 7B parameter model in FP16 format requires approximately 14GB of RAM to load.
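The 14GB figure follows from simple arithmetic: parameters multiplied by bits per weight, divided by 8 to get bytes. A minimal sketch of that calculation is below; the function name `model_size_gb` is made up for illustration, and it counts weight storage only (no context window or runtime overhead).

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts only the weights themselves; file metadata, context
    buffers, and runtime overhead are deliberately ignored.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16_7b = model_size_gb(7, 16)    # the 7B FP16 case described above
fp16_70b = model_size_gb(70, 16)  # a 70B model at the same precision
print(f"7B at FP16:  {fp16_7b:.1f} GB")
print(f"70B at FP16: {fp16_70b:.1f} GB")
```

The same function can be called with a quantised bit depth, but treat the result as a floor: published quantised files tend to come out somewhat larger than this naive estimate.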
Quantisation maps those high-precision weights to lower-precision representations. An 8-bit integer can represent 256 distinct values. A 4-bit integer can only represent 16 values. Compressing from FP16 to 4-bit is roughly a 4x reduction in size, but the model is now approximating what it used to represent exactly. In practice, language models are surprisingly robust to this approximation because most weights cluster around small values and the information content is highly redundant across billions of parameters.
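To make the idea of mapping weights onto a small set of values concrete, here is a minimal sketch of symmetric round-to-nearest quantisation on one small block of weights. It is a generic illustration, not the scheme llama.cpp actually uses for k-quants; the function names and the sample weights are invented for the example.

```python
def quantise_block(weights, bits=4):
    """Map floats to signed integer codes sharing one scale factor.

    Symmetric scheme: codes run from -levels to +levels, so 4 bits
    uses 15 of the 16 available values.
    """
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale


def dequantise_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Illustrative weights clustered around small values.
weights = [0.12, -0.03, 0.40, -0.27, 0.01, 0.08, -0.15, 0.33]
codes, scale = quantise_block(weights, bits=4)
restored = dequantise_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("largest rounding error:", round(max_error, 4))
```

Each restored weight lands within half a quantisation step of the original, which is the approximation the paragraph above describes.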
The result: a 7B model at Q4_K_M quantisation fits in approximately 4.5 to 5GB of RAM and runs noticeably faster than the same model in FP16. The quality reduction is real but modest for most tasks. A model asked to summarise text or answer factual questions at Q4_K_M will produce responses indistinguishable from Q8_0 for most users. Tasks requiring precise reasoning, complex code generation, or sustained logical chains show more degradation.
Decoding the Naming System: What Q4_K_M Actually Means
Model files distributed in GGUF format (the standard for Ollama and llama.cpp) use a naming convention that combines the quantisation bit depth, the quantisation method, and the size variant. The format is Q[bits]_[method]_[size].
The bit number (Q4, Q5, Q6, Q8) indicates the target bits per weight. Higher numbers mean more precision and larger files. Q4 uses approximately 4 to 4.5 bits per weight when method overhead is included. Q8 uses approximately 8 bits per weight.
The K suffix indicates k-quants, a newer quantisation approach developed by the llama.cpp team. K-quants use different quantisation levels for different parts of the model. Attention layers, which handle context and reasoning, are quantised less aggressively than feed-forward layers. This produces significantly better quality than the older uniform quantisation at the same average bit depth. If you see a Q4 file without the K suffix, it uses the older method and produces lower quality at the same RAM usage. Always prefer K-quants where available.
The M, S, L suffix stands for Medium, Small, and Large. These variants apply different mixes of quantisation levels within the k-quant framework. K_M (Medium) is the recommended default for most use cases, balancing quality and size well. K_S (Small) saves a small amount of RAM by being slightly more aggressive. K_L (Large) retains more quality at a slightly larger size. For most hardware, K_M is the right choice.
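The naming convention can be decoded mechanically. The sketch below is a hypothetical helper (`parse_quant_name` is not part of Ollama or llama.cpp) that splits a name into the three parts described above. The pattern only covers the kinds of names discussed in this article, such as Q4_K_M, Q6_K, and Q8_0.

```python
import re

# Q<bits>_<K or a digit>[_<S|M|L>] -- an assumption covering only the
# names discussed in this article, not every GGUF quantisation type.
_QUANT_RE = re.compile(r"^Q(?P<bits>\d+)_(?P<method>K|\d)(?:_(?P<size>[SML]))?$")
_SIZES = {"S": "Small", "M": "Medium", "L": "Large"}


def parse_quant_name(name: str) -> dict:
    """Split a quantisation name into bits, method, and size variant."""
    match = _QUANT_RE.match(name.upper())
    if match is None:
        raise ValueError(f"Unrecognised quantisation name: {name}")
    size = match.group("size")
    return {
        "bits": int(match.group("bits")),
        "k_quant": match.group("method") == "K",
        "size_variant": _SIZES[size] if size else None,
    }


for name in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(name, parse_quant_name(name))
```

A name with no size letter, such as Q6_K, simply reports no size variant.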
| Quantisation | Characteristics |
|---|---|
| Q4_K_S | ~4.3 bits/weight. Smallest Q4 variant. Tight on quality. Use when RAM is the hard constraint and K_M does not fit |
| Q4_K_M | ~4.5 bits/weight. Recommended default. Best balance of RAM usage, speed, and quality for hardware with limited RAM |
| Q5_K_M | ~5.5 bits/weight. Noticeable quality improvement over Q4 for reasoning tasks. Good choice when you have a few extra GB of RAM |
| Q6_K | ~6.6 bits/weight. High quality, close to FP16 output. Recommended when 16GB or more is available for inference |
| Q8_0 | ~8 bits/weight. Near-lossless compared to FP16. Largest quantised format. For hardware with abundant RAM and where quality is paramount |
| FP16 (unquantised) | 16 bits/weight. Original training precision. Requires 2x the RAM of Q8_0. Rarely practical on consumer hardware |
RAM Requirements by Model Size and Quantisation Level
The RAM required for a model is the model file size plus overhead for the context window and runtime. A useful approximation for context window overhead is 0.5 to 1.5GB depending on context length settings. The figures below are the model file requirements without context overhead.
RAM Required by Model Size and Quantisation Level
| Model size | Q4_K_M | Q5_K_M | Q6_K | Q8_0 |
|---|---|---|---|---|
| 7B model | ~4.5GB | ~5.7GB | ~6.6GB | ~8.0GB |
| 13B model | ~8.0GB | ~9.8GB | ~11.5GB | ~14.0GB |
| 34B model | ~20GB | ~25GB | ~29GB | ~34GB |
| 70B model | ~38GB | ~48GB | ~58GB | ~70GB |
| Hardware minimum (model only) | 5GB free RAM for 7B | 6GB free RAM for 7B | 7GB free RAM for 7B | 9GB free RAM for 7B |
Available RAM is not the same as total RAM. Your OS, running services, and other applications consume RAM before the model loads. On a NAS, expect 1.5 to 3GB consumed by the system. On a Windows mini-PC, expect 4 to 6GB consumed at idle. Subtract this from your total RAM to determine what is actually available for model inference.
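The subtraction is worth doing explicitly before downloading anything. This is a minimal sketch using hypothetical helper names; the system overhead and context overhead values are the rough ranges quoted in this article, not measurements, so substitute figures from your own machine.

```python
def available_ram_gb(total_gb: float, system_gb: float) -> float:
    """RAM left for inference after the OS and services take their share."""
    return max(total_gb - system_gb, 0.0)


def fits_in_ram(model_gb: float, available_gb: float, context_gb: float = 1.5) -> bool:
    """True if the model plus context overhead fits without swapping.

    context_gb defaults to the upper end of the 0.5 to 1.5GB range
    quoted above, which is the cautious choice.
    """
    return model_gb + context_gb <= available_gb


# Example: an 8GB NAS with 3GB used by the system, loading a 7B Q4_K_M (~4.5GB).
free_gb = available_ram_gb(total_gb=8, system_gb=3)
print("available:", free_gb, "GB")
print("fits with short context:", fits_in_ram(4.5, free_gb, context_gb=0.5))
print("fits with long context: ", fits_in_ram(4.5, free_gb, context_gb=1.5))
```

In this example the model fits only with the smallest context allowance, which is the kind of tight margin that tends to end in swapping.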
Which Quantisation Level for Which Hardware
The right quantisation level depends entirely on how much RAM is available after the system takes its share. The goal is to fit the model entirely in RAM. A model that overflows into swap (disk-based memory) runs 10 to 100 times slower and becomes unusable in practice.
| RAM available for AI | Recommended model and quantisation |
|---|---|
| 4 to 5GB available for AI (entry NAS, shared system) | 7B at Q4_K_S only. Expect 1 to 2 tokens/sec. Models larger than 7B will not fit |
| 6 to 7GB available for AI (NAS with 8GB total, mini-PC entry) | 7B at Q4_K_M comfortably. Best practical option for this hardware tier |
| 8 to 10GB available for AI (QNAP TS-473A 8GB, capable mini-PC) | 7B at Q6_K or Q8_0. 13B at Q4_K_M with caution. 13B may be tight depending on context settings |
| 12 to 14GB available for AI (NAS with 16GB, mid-range mini-PC) | 7B at Q8_0. 13B at Q5_K_M comfortably. Best all-round tier for quality without compromise |
| 16 to 24GB available for AI (capable mini-PC, 32GB system) | 13B at Q8_0. 34B at Q4_K_M. Strong quality across common model sizes |
| 32GB or more available for AI (high-end mini-PC or desktop) | 34B at Q5_K_M or Q6_K. 34B at Q8_0 (~34GB) and 70B at Q4_K_M (~38GB) need around 40GB or more once context overhead is added. Serious hardware for demanding use cases |
What Quality Difference You Actually Notice
The quality difference between Q4_K_M and Q8_0 on the same model is smaller than most people expect. On conversational tasks, summarisation, and simple question answering, the two outputs are largely indistinguishable. The gap becomes meaningful in specific scenarios.
Complex multi-step reasoning is where aggressive quantisation hurts most. Tasks like solving logic puzzles, following long chains of conditional instructions, or generating syntactically complex code show measurable degradation at Q4 compared to Q6 or Q8. The model at Q4 is more likely to lose track of constraints introduced earlier in the conversation, particularly in long context windows.
Creative writing shows moderate degradation at Q4, primarily in vocabulary diversity and sentence structure variation. The model defaults to more common phrasings and loses some of the stylistic range available at higher precision.
Factual question answering shows the least degradation. A model asked about documented facts performs similarly across quantisation levels because the information is encoded redundantly across many weights, making it robust to precision reduction.
The practical implication: for most home AI use cases, Q4_K_M is genuinely good enough. If you are using local AI for coding assistance or complex reasoning tasks and quality matters, Q5_K_M or Q6_K is worth the extra RAM if your hardware supports it. Q8_0 is rarely necessary unless you are doing systematic evaluation or fine-tuning work.
Speed Differences Between Quantisation Levels
Higher-precision quantisation levels (more bits, better quality) are slower, not faster. Q4_K_M generates tokens faster than Q8_0 on the same model because less data has to be read and processed per weight during inference. The difference is not enormous, but it is measurable.
On a mid-range mini-PC with 16GB RAM running a 7B model, the approximate token generation speeds are: Q4_K_M at 14 to 18 tokens per second, Q5_K_M at 12 to 15 tokens per second, Q6_K at 10 to 13 tokens per second, and Q8_0 at 9 to 11 tokens per second. All of these are fast enough for responsive conversation. The gap becomes more noticeable on slower hardware like a NAS, where the processor is a stronger bottleneck and any extra data processing overhead compounds.
On a NAS like the Synology DS925+ or QNAP TS-473A, the difference between Q4_K_M and Q6_K on a 7B model may be 1 to 2 tokens per second, which is significant when the baseline is already only 2 to 4 tokens per second. On this hardware tier, Q4_K_M is the correct choice not just for RAM reasons but for speed.
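The reason a small absolute drop matters on slow hardware is that it is a large fraction of the total. The sketch below works that out using values picked from inside the ranges quoted above. The specific numbers, the 200-token reply length, and the function names are illustrative assumptions rather than benchmarks.

```python
def relative_slowdown(baseline_tps: float, drop_tps: float) -> float:
    """Fraction of generation speed lost when throughput falls by drop_tps."""
    return drop_tps / baseline_tps


def seconds_for_reply(tokens: int, tps: float) -> float:
    """Time to generate a reply of the given length at a given speed."""
    return tokens / tps


# NAS: 3 tokens/sec baseline, losing 1.5 tokens/sec at the higher-precision level.
nas_loss = relative_slowdown(baseline_tps=3.0, drop_tps=1.5)
# Mini-PC: 16 tokens/sec baseline, losing 4.5 tokens/sec.
pc_loss = relative_slowdown(baseline_tps=16.0, drop_tps=4.5)

print(f"NAS loses {nas_loss:.0%} of its speed")
print(f"Mini-PC loses {pc_loss:.0%} of its speed")
print(f"200-token reply on the NAS: {seconds_for_reply(200, 3.0):.0f}s "
      f"vs {seconds_for_reply(200, 1.5):.0f}s")
```

On these assumed numbers the NAS loses half its speed while the mini-PC loses closer to a quarter, even though the mini-PC's absolute drop is larger.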
Which Model to Download: A Practical Decision
When downloading models from Hugging Face or through Ollama's model library, use this decision process. First, determine your available RAM after system overhead. Second, identify the largest model size that fits at Q4_K_M with that RAM. Third, if RAM headroom exists after fitting the model, consider upgrading to Q5_K_M or Q6_K for better quality rather than jumping to a larger model at Q4_K_M.
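The three steps can be sketched as a small lookup. The RAM figures are copied from the model size table earlier in this article. The function name, the default 1GB context allowance, and the choice to stop upgrading at Q6_K are assumptions made for the example.

```python
# Model-only RAM figures in GB, copied from the table in this article.
MODEL_RAM_GB = {
    "7B":  {"Q4_K_M": 4.5,  "Q5_K_M": 5.7,  "Q6_K": 6.6},
    "13B": {"Q4_K_M": 8.0,  "Q5_K_M": 9.8,  "Q6_K": 11.5},
    "34B": {"Q4_K_M": 20.0, "Q5_K_M": 25.0, "Q6_K": 29.0},
    "70B": {"Q4_K_M": 38.0, "Q5_K_M": 48.0, "Q6_K": 58.0},
}
SIZE_ORDER = ["7B", "13B", "34B", "70B"]
UPGRADE_ORDER = ["Q4_K_M", "Q5_K_M", "Q6_K"]


def choose_model(available_gb: float, context_gb: float = 1.0):
    """Return (size, quant) following the three-step process, or None."""
    # Step one is the caller's job: available_gb is RAM after system overhead.
    budget = available_gb - context_gb
    # Step two: the largest model size that fits at Q4_K_M.
    fitting = [s for s in SIZE_ORDER if MODEL_RAM_GB[s]["Q4_K_M"] <= budget]
    if not fitting:
        return None
    size = fitting[-1]
    # Step three: spend any remaining headroom on a higher-precision level.
    quant = "Q4_K_M"
    for candidate in UPGRADE_ORDER:
        if MODEL_RAM_GB[size][candidate] <= budget:
            quant = candidate
    return size, quant


for free_gb in (4.0, 6.0, 13.0):
    print(f"{free_gb}GB available ->", choose_model(free_gb))
```

Returning `None` when nothing fits is deliberate: it is better to learn that up front than to load a model that spills into swap.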
A 7B model at Q6_K generally produces better outputs than a 13B model at Q4_K_S when both fit within the same RAM budget. Quantisation quality within the same model architecture matters more than raw parameter count when comparing across quantisation levels. The exception is very large parameter count differences: a 70B at Q4_K_M will outperform a 7B at Q8_0 on complex tasks, because the additional knowledge capacity of the larger model outweighs the precision advantage of the smaller one.
For Ollama running on a NAS, the recommended starting point is a 7B to 8B model at Q4_K_M. Good options include Llama 3.1 8B Q4_K_M (available via Ollama as llama3.1:8b), Mistral 7B Q4_K_M, and Qwen2.5 7B Q4_K_M. These fit on systems with 6GB or more of available RAM and produce acceptable quality for most tasks. For a capable mini-PC with 16GB RAM, a 13B model at Q4_K_M or Q5_K_M is achievable and represents a noticeable quality jump. Llama 3.3 70B at Q4_K_M needs roughly 38GB for the model alone, so it belongs on hardware with 40GB or more available for inference.
What is the difference between Q4 and Q4_K_M?
Q4 without a suffix uses an older, uniform quantisation method that applies the same 4-bit precision to every weight in the model. Q4_K_M uses k-quants, a newer method that applies different precision levels to different parts of the model. K-quants protect the most quality-sensitive layers (attention heads and embeddings) with higher precision while compressing less critical layers more aggressively. The result is noticeably better output quality at the same or slightly larger file size. If both Q4 and Q4_K_M are available for a model, Q4_K_M is the better choice in almost every case.
What quantisation level should I use on a NAS?
For most NAS hardware, Q4_K_M is the correct choice. NAS devices have limited RAM available for inference after the OS and services take their share. A NAS with 8GB total RAM typically has 5 to 6GB available for a model. Q4_K_M for a 7B model requires approximately 4.5 to 5GB, which fits comfortably. Q5_K_M or Q6_K require 5.7GB and 6.6GB respectively and may cause the model to run slowly or fail to load if available RAM is tight. See the local AI NAS guide for current NAS hardware recommendations with their available RAM figures.
Is Q8_0 worth it or is Q4_K_M good enough?
For most conversational and summarisation tasks, Q4_K_M is good enough. The quality difference is small enough that most users cannot reliably tell the outputs apart in blind tests. Q8_0 is worth considering for complex reasoning tasks, code generation, and work where precision matters, provided your hardware has the RAM to support it without impacting inference speed. If you have 16GB or more available for inference, Q6_K is a better practical choice than Q8_0: you get most of the quality benefit at significantly lower RAM usage and faster inference speed.
Does quantisation affect how fast the model responds?
Yes. Lower-bit quantisation is faster because the processor handles less data per operation. Q4_K_M is faster than Q6_K, which is faster than Q8_0, on the same hardware and model. The speed difference matters more on slower hardware like a NAS than on a capable mini-PC with a modern processor. On NAS hardware running at 2 to 4 tokens per second, the difference between Q4_K_M and Q6_K might be 1 to 2 tokens per second, which is a large share of the total. On a mini-PC running at 15 to 20 tokens per second, the gap might be 2 to 4 tokens per second: larger in absolute terms, but a much smaller share of the total, so it will not be noticeable in conversation.
Can I run a 13B model on 8GB of RAM?
A 13B model at Q4_K_M requires approximately 8GB of RAM for the model itself. On a system with 8GB total RAM, after OS overhead there is not enough space for the model plus context. This causes the model to use disk-based swap, which reduces inference speed to near-unusable levels. For reliable 13B inference, 16GB of system RAM is the practical minimum, giving approximately 10 to 12GB for the model after OS overhead. On NAS hardware with 16GB total RAM, a 13B model fits only if the OS and services leave around 12GB free, and the slower processor makes generation sluggish even then. The guide on what runs on each RAM tier covers this in more detail.