I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.
I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:
- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- Both configurations use the same total number of bits, yet K4V8 degrades perplexity about 7× more than K8V4 (6.06% vs. 0.86%)
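For anyone checking the 7× figure: it is the ratio of the two perplexity losses quoted above. The sketch below also shows how a relative perplexity change is commonly computed from per-token log-probabilities. The helper names are hypothetical and not taken from the KVSplit repo.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase_pct(ppl_baseline, ppl_quantized):
    """Perplexity change of a quantized run relative to the baseline, in percent."""
    return 100.0 * (ppl_quantized - ppl_baseline) / ppl_baseline

# Perplexity losses quoted in the post, in percent.
k8v4_loss_pct = 0.86
k4v8_loss_pct = 6.06

# How many times larger the K4V8 degradation is than the K8V4 degradation.
ratio = k4v8_loss_pct / k8v4_loss_pct
print(f"K4V8 loses {ratio:.2f}x more perplexity than K8V4")
```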
This means you can run LLMs with roughly 2-3× longer context on the same Mac. KV cache memory grows linearly with sequence length, so the absolute savings grow as context grows.
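A back-of-the-envelope sketch of where a 59% figure can come from. It assumes an FP16 baseline and llama.cpp-style block formats (32 elements per block plus one FP16 scale, so q8_0 costs 8.5 bits per element and q4_0 costs 4.5). The layer and head counts are TinyLlama-like values picked for illustration, and the function name is hypothetical.

```python
# Effective bits per element, including per-block scale overhead.
# Assumption: blocks of 32 elements with one FP16 scale each
# (q8_0 = 34 bytes per block, q4_0 = 18 bytes per block).
BITS_PER_ELEM = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, k_type, v_type):
    """Size of the K and V caches in bytes for one sequence."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * seq_len
    total_bits = elems_per_tensor * (BITS_PER_ELEM[k_type] + BITS_PER_ELEM[v_type])
    return total_bits / 8

# Illustrative, TinyLlama-like shape at 8K context (assumed values).
shape = dict(n_layers=22, n_kv_heads=4, head_dim=64, seq_len=8192)

fp16 = kv_cache_bytes(**shape, k_type="f16", v_type="f16")
k8v4 = kv_cache_bytes(**shape, k_type="q8_0", v_type="q4_0")
k4v8 = kv_cache_bytes(**shape, k_type="q4_0", v_type="q8_0")

reduction = 1 - k8v4 / fp16        # fraction of KV cache memory saved
context_multiplier = fp16 / k8v4   # context that fits in the same KV budget

print(f"FP16: {fp16 / 2**20:.1f} MiB, K8V4: {k8v4 / 2**20:.1f} MiB")
print(f"reduction: {reduction:.1%}, context multiplier: {context_multiplier:.2f}x")
```

Under these assumptions K8V4 and K4V8 occupy exactly the same space, and the multiplier applies to the KV cache only, not to the model weights.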
Implementation was straightforward:
1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)
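To make step 2 concrete, here is a minimal sketch of applying one quantization routine to K and V at different bit-widths. It uses a simplified symmetric round-to-nearest scheme with per-block scales, not llama.cpp's actual kernels, and random data rather than real attention tensors, so it only illustrates the mechanism and says nothing about why keys are more sensitive. All names are hypothetical.

```python
import random

def quantize_block(block, bits):
    """Symmetric round-to-nearest quantization of one block; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8-bit, 7 for 4-bit
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / qmax                      # one scale shared by the whole block
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]

def fake_quantize(values, bits, block_size=32):
    """Quantize and dequantize a flat list of floats block by block."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size], bits))
    return out

def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

rng = random.Random(0)
keys = [rng.gauss(0.0, 1.0) for _ in range(1024)]    # stand-in for a K tensor
values = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for a V tensor

# Same routine, different precision per tensor (the K8V4 configuration).
k8 = fake_quantize(keys, bits=8)
v4 = fake_quantize(values, bits=4)

print(f"K @ 8-bit RMS error: {rms_error(keys, k8):.5f}")
print(f"V @ 4-bit RMS error: {rms_error(values, v4):.5f}")
```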
Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.
GitHub: https://github.com/dipampaul17/KVSplit
A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include the fork as a git submodule rather than making it a "git clone and apply patch" step.
A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependence on Homebrew Python.