✨ TL;DR
GSQ is a new scalar quantization method for large language models that uses Gumbel-Softmax relaxation to jointly optimize grid assignments and scales, achieving accuracy comparable to complex vector quantization methods while remaining compatible with existing inference kernels. It successfully quantizes models to 2-3 bits per parameter and scales to trillion-parameter mixture-of-experts models.
Current weight quantization methods for LLMs face a fundamental trade-off. Simple scalar quantization techniques like GPTQ and AWQ are widely deployed and easy to implement but hit an accuracy ceiling at 3-4 bits per parameter. Meanwhile, advanced vector- and trellis-quantized methods like QTIP, GPTVQ, and AQLM achieve better accuracy at low bit-widths (2-3 bits) but are difficult to implement, hard to scale, and have limited adoption in practice. This creates a gap between what is theoretically possible and what is practically deployable, especially for local inference scenarios where extreme compression is needed.
GSQ introduces a post-training scalar quantization method that jointly optimizes per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete quantization grid. The key innovation is matching the cardinality of the continuous relaxation to the small number of quantization levels available at the target bit-width (e.g., 3 levels for ternary up to 8 levels at 3 bits per parameter). This makes the relaxation tight and the optimization tractable. The method uses symmetric scalar grids with group-wise quantization, ensuring full compatibility with existing scalar inference kernels while achieving the accuracy benefits typically associated with more complex vector quantization approaches.
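To make the core idea concrete, here is a minimal sketch of a Gumbel-Softmax soft assignment of a weight group to a small symmetric grid. This is an illustrative toy, not the paper's implementation: the distance-based logits, the temperature value, and the function name are assumptions, and the real method optimizes logits and scales jointly over a calibration objective rather than in a single forward pass.

```python
import numpy as np

def gumbel_softmax_assign(w, levels, scale, tau=0.5, rng=None):
    """Soft-assign each weight to one of K grid levels via Gumbel-Softmax.

    w:      (n,) weights in one quantization group
    levels: (K,) symmetric integer grid, e.g. [-1, 0, 1] for ternary
    scale:  per-group scale (jointly optimized in GSQ; fixed here)
    tau:    temperature; lower tau pushes assignments toward one-hot
    """
    rng = rng or np.random.default_rng(0)
    grid = scale * levels                       # (K,) dequantized grid points
    # Hypothetical logits: a level closer to w gets a higher logit.
    logits = -(w[:, None] - grid[None, :]) ** 2
    g = rng.gumbel(size=logits.shape)           # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z -= z.max(axis=1, keepdims=True)           # numerically stable softmax
    y = np.exp(z)
    probs = y / y.sum(axis=1, keepdims=True)    # (n, K) soft assignments
    w_soft = probs @ grid                       # differentiable relaxed weights
    w_hard = grid[probs.argmax(axis=1)]         # hard pick at inference time
    return w_soft, w_hard

w = np.array([0.9, -0.05, -1.1, 0.4])
levels = np.array([-1.0, 0.0, 1.0])             # ternary: K = 3 levels
w_soft, w_hard = gumbel_softmax_assign(w, levels, scale=1.0)
```

Because K is tiny at 2-3 bits (3-8 levels), the softmax over levels is cheap and the soft assignments stay close to one-hot, which is why the relaxation is tight at these bit-widths.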