Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
By: Mayssam Naji
Vision Transformers (ViTs) have recently come to the fore in image classification, achieving strong results on the ImageNet benchmark. Their prowess is rooted in their ability to model long-range dependencies in images. Despite this performance, they impose a hefty toll in parameter count and computation, making them hard to deploy in real-world applications where resources are restricted. This is where fully quantized vision transformers, known as Q-ViTs, step in, aspiring to reshape the realm of image classification and computer vision.
Q-ViT: Quantized Vision Transformers
The primary driving force behind Q-ViT is the demand for more efficient transformer models. Earlier attempts to compress transformers relied on network pruning, compact network design, and quantization, but these often sacrificed the model's overall accuracy.
Quantization emerged as a favorite due to its applicability on AI chips: reducing the bit-width of network parameters and activations paves the way for efficient inference. Nevertheless, while quantization-aware training (QAT) methods worked well for Convolutional Neural Network (CNN) models, they remained largely unexplored for low-bit quantization of vision transformers. This void called for a compression method that didn't sacrifice the model's accuracy, leading to the advent of Q-ViT.
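To make the bit-width reduction concrete, here is a minimal sketch of a generic signed uniform quantizer. This is an illustration of the general idea only, not Q-ViT's actual quantizer, which adds learnable parameters trained end to end.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Fake-quantize a tensor to a signed grid of the given bit-width.

    A generic uniform quantizer for illustration; Q-ViT's quantizer
    additionally learns its scale during training.
    """
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit signed
    scale = np.abs(x).max() / qmax        # map the dynamic range onto the grid
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized ("fake-quantized") values

x = np.random.randn(8, 8).astype(np.float32)
x4 = uniform_quantize(x, bits=4)
# Rounding error is bounded by half a quantization step.
print(np.abs(x - x4).max() <= 0.5 * np.abs(x).max() / (2 ** 3 - 1))
```

During QAT, such "fake quantization" is inserted into the forward pass so the network learns weights that survive the rounding, while gradients flow through via a straight-through estimator.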
Q-ViT addresses the distorted distributions of quantized attention modules through two primary strategies: the Information Rectification Module (IRM) and the Distribution Guided Distillation (DGD) scheme.
- Information Rectification Module (IRM): This module rectifies the quantized representations in the attention module so that their information entropy is maximized, enabling the quantized model to retain the representation power needed to describe input images.
- Distribution Guided Distillation (DGD): This strategy rectifies the distribution mismatch that arises when distilling from a full-precision teacher. It selects appropriate activations and draws knowledge from patch similarity matrices for accurate optimization.
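The IRM idea above can be sketched numerically: a badly shifted query tensor collapses into a single quantization bin (near-zero entropy), while rectifying its distribution before quantization spreads values across bins and raises the entropy. The `gamma`/`beta` names and the exact standardization are illustrative assumptions, not the paper's precise parameterization.

```python
import numpy as np

def entropy_bits(q):
    """Empirical entropy (in bits) of a discrete quantized tensor."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def quantize(x, bits=2):
    """Plain signed uniform quantizer, returning integer levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax)

def rectify(x, gamma=1.0, beta=0.0):
    """IRM-style rectification: standardize, then apply a learnable
    scale (gamma) and shift (beta). Hypothetical parameterization."""
    return gamma * (x - x.mean()) / (x.std() + 1e-6) + beta

rng = np.random.default_rng(0)
# A badly shifted query tensor: almost all values land in one bin.
q_raw = rng.normal(2.0, 0.05, size=(64, 64))
h_before = entropy_bits(quantize(q_raw, bits=2))
h_after = entropy_bits(quantize(rectify(q_raw), bits=2))
print(h_before, h_after)  # rectification raises the entropy
```

In the actual model, the rectification parameters are learned jointly with the network so that the entropy-maximizing distribution is found during training rather than fixed by hand.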
Remarkably, these methods have facilitated the creation of an accurate and low-bit ViT model that competes on an equal footing with full-precision counterparts.
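The DGD side of the scheme can likewise be sketched: instead of matching raw attention activations, the student is penalized for deviating from the teacher's patch-to-patch similarity matrix, with rows normalized so the scale mismatch between quantized and full-precision features factors out. This is a hedged sketch of the idea, not the paper's exact loss.

```python
import numpy as np

def dgd_similarity_loss(student_q, teacher_q):
    """Distillation loss on query similarity matrices (DGD-style sketch).

    Builds each model's patch-to-patch similarity matrix Q @ Q.T,
    L2-normalizes its rows, and penalizes the mean squared difference
    between student and teacher.
    """
    def norm_sim(q):
        sim = q @ q.T
        return sim / (np.linalg.norm(sim, axis=-1, keepdims=True) + 1e-6)
    diff = norm_sim(student_q) - norm_sim(teacher_q)
    return float((diff ** 2).mean())

rng = np.random.default_rng(1)
teacher = rng.normal(size=(16, 32))     # teacher queries: 16 patches, dim 32
s_close = teacher + 0.01 * rng.normal(size=teacher.shape)
s_far = rng.normal(size=teacher.shape)
# A student whose similarity structure matches the teacher scores lower.
print(dgd_similarity_loss(s_close, teacher) < dgd_similarity_loss(s_far, teacher))
```

Matching the similarity structure rather than raw values is what lets a low-bit student imitate a full-precision teacher despite their very different activation distributions.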
Figure 1 visualizes an overview of the Q-ViT system, which incorporates two key components: the Information Rectification Module (IRM) and the Distribution Guided Distillation (DGD) scheme.
Li et al. [1] conducted an array of experiments on the ILSVRC12 ImageNet classification dataset, encompassing 1.2 million training images and 50k validation images spanning 1,000 classes. The models were trained with a batch size of 512, a learning rate of 2e-4, and the LAMB optimizer over 300 epochs.
The methods were evaluated on the popular DeiT and Swin Transformer backbones. In the ablation study, combining IRM and DGD yielded the best results, reaching 83.2% accuracy and demonstrating that Q-ViT can achieve performance comparable to full-precision counterparts with ultra-low-bit weights and activations.
Q-ViT has notably set the stage for the study of fully quantized ViTs. It has showcased performance gains for transformers on diverse, large-scale datasets, especially in computer vision tasks, and the researchers have put commendable effort into elucidating the theory and application behind each stage, as shown in the paper.
Q-ViT, a methodology devised to enhance fully quantized Vision Transformers (ViTs) by increasing the compression ratio while maintaining competitive performance, has made real strides in image classification; its application to real-life scenarios where text and image data are paired, however, remains to be explored. Q-ViT models deliver performance on par with their full-precision equivalents while using ultra-low-bit weights and activations. This research provides valuable insights and efficient solutions to critical issues in ViT full quantization, paving the way for future work on extreme compression of ViTs.
[1] Li, Y., Xu, S., Zhang, B., Cao, X., Gao, P., & Guo, G. (2023). Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. Beihang University; Zhongguancun Laboratory; Shanghai Artificial Intelligence Laboratory; Institute of Deep Learning, Baidu Research; National Engineering Laboratory for Deep Learning Technology and Application, Beijing, P.R. China.