Quantization and Compression

Post-Training Quantization (PTQ)

Precision reduction technique applied to an already trained model, without requiring complete retraining. It converts high-precision weights and activations (e.g., FP32) to lower-precision representations (e.g., INT8) to optimize inference.
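
As a minimal sketch of the FP32-to-INT8 conversion described above, the following assumes symmetric per-tensor quantization with a single scale derived from the largest absolute weight. The function names are illustrative and do not come from any particular library.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to INT8 codes."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]   # already-trained weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half a quantization step (`scale / 2`). Real PTQ pipelines usually also calibrate activation ranges on sample data, which this sketch omits.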

Quantization-Aware Training (QAT)

Method where quantization and dequantization operations are integrated into the computational graph during training. This allows the model to adapt to precision loss, minimizing performance degradation compared to PTQ.
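
The idea can be illustrated with a toy one-weight model. This is a sketch under several assumptions not stated above: a fixed, hypothetical step size `scale`, a squared-error loss, and the commonly used straight-through estimator, which treats the rounding step as the identity when computing gradients.

```python
def fake_quantize(x, scale, qmin=-127, qmax=127):
    """Quantize then immediately dequantize: the result is still a float,
    but it can only take values on the quantization grid."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

# Toy model y = w * x. The data was generated with a weight of 0.75.
scale = 0.25                      # hypothetical fixed step size
w = 0.1                           # latent full-precision weight
data = [(1.0, 0.75), (2.0, 1.5), (3.0, 2.25)]
lr = 0.05

for _ in range(50):
    wq = fake_quantize(w, scale)  # the forward pass sees the quantized weight
    grad = sum(2 * (wq * x - y) * x for x, y in data) / len(data)
    w -= lr * grad                # straight-through: d(wq)/dw is treated as 1

deployed = fake_quantize(w, scale)
```

The latent weight `w` stays in full precision during training, and only its quantized image `deployed` would be shipped for inference.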

Binarized Neural Networks (BNN)

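Extreme form of quantization where weights and/or activations are constrained to one of two values (+1 or -1), so that each fits in a single bit. It enables significant computational and memory gains: with binary weights, multiplications reduce to additions and subtractions, and when activations are binary as well, dot products reduce to bitwise XNOR and popcount operations.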
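
A small sketch of how a dot product between two ±1 vectors can be computed with bit operations only. The bit packing convention (bit set means +1) and the helper names are assumptions made for illustration.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: bit i is 1 where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR marks the positions where both signs agree (product = +1).
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Agreements contribute +1, disagreements -1: agree - (n - agree).
    return 2 * agree - n

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
```

Both `reference` and `fast` give the same result, but the second path uses no multiplications.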

Structured Pruning

Compression technique that removes entire weight structures, such as filters, channels, or attention heads, rather than individual weights. Because the remaining layers stay dense and simply become smaller, it accelerates computation on standard hardware without sparse kernels, which unstructured pruning generally cannot do.
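
As an illustration, the sketch below removes whole rows of a weight matrix (standing in for filters or output channels). Ranking rows by L1 norm is an assumed importance criterion, one of several in common use.

```python
def prune_rows(weight, keep):
    """Keep the `keep` rows (e.g. output channels) with the largest L1 norm."""
    norms = [sum(abs(v) for v in row) for row in weight]
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])          # preserve the original row order
    return [weight[i] for i in kept], kept

weight = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [-0.5, 0.6, 0.4],
    [0.03, -0.02, 0.0],
]
pruned, kept = prune_rows(weight, keep=2)
```

The result is a smaller dense matrix. In a real network the matching input columns of the following layer would have to be removed as well.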

Unstructured Pruning

Compression method that eliminates individual weights in the network, typically those with the smallest magnitude. Although it can reduce model size, the resulting irregular sparsity accelerates computation only with sparse kernels or hardware that supports sparsity.
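
A minimal sketch of magnitude pruning on a flat list of weights. The `sparsity` parameter (the fraction of weights to zero out) is an illustrative name.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

weights = [0.5, -0.02, 0.3, 0.01, -0.7, 0.04, 0.9, -0.05]
sparse = magnitude_prune(weights, sparsity=0.5)
```

The zeros land at irregular positions, and the list keeps its original length. Storage or speed gains appear only once a sparse format or sparse kernel exploits them.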

Low-Rank Matrix Factorization

Compression technique that decomposes a large weight matrix into two or more smaller matrices. It reduces the number of parameters and matrix multiplication operations, thus accelerating dense and convolutional layers.
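
The sketch below shows the two effects on a deliberately tiny example: fewer stored parameters, and one large matrix-vector product replaced by two small ones. The matrix here is exactly rank 1 by construction. Trained weights are only approximately low-rank, so in practice the factors would come from an approximation such as a truncated SVD, which is not shown.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def factored_params(m, n, r):
    """Parameters of an m x n matrix stored as (m x r) times (r x n)."""
    return r * (m + n)

# A rank-1 4x3 matrix W = A B, stored as its two factors.
A = [[1.0], [2.0], [0.5], [-1.0]]      # 4 x 1
B = [[3.0, 0.0, -2.0]]                 # 1 x 3
W = [[a[0] * b for b in B[0]] for a in A]

x = [1.0, 2.0, 3.0]
dense_out = matvec(W, x)
factored_out = matvec(A, matvec(B, x))  # two small products instead of one large

dense_count = 1024 * 1024
low_rank_count = factored_params(1024, 1024, r=64)
```

For a 1024 x 1024 layer at rank 64, the factored form stores 8 times fewer parameters.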

Knowledge Distillation

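Compression process where a small model (the student) is trained to reproduce the outputs of a larger, already trained model (the teacher).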

Huffman Encoding for Weights

Lossless compression method that applies the Huffman coding algorithm to model weights. It assigns shorter binary codes to the most frequent weights, reducing file size on disk without affecting inference speed.
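
A compact sketch of Huffman coding applied to a list of already quantized weights, using only the standard library. The example weight distribution is made up for illustration.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return {symbol: bitstring} for the given sequence of symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie breaker, {symbol: partial code}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weights (integer codes) with one dominant value.
weights = [0] * 12 + [1] * 4 + [-1] * 3 + [5] * 1
code = huffman_code(weights)
encoded_bits = sum(len(code[w]) for w in weights)
fixed_bits = 8 * len(weights)                # 8 bits per weight without coding
```

Here the most frequent weight receives a 1-bit code, and the 20 weights take 32 bits instead of 160. The code table must be stored alongside the bitstream so the weights can be decoded at load time.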

Weight Sharing

Compression technique that groups weights into clusters and replaces each weight with the index of its cluster centroid. This reduces the number of bits needed to store each weight and enables the use of lookup tables during inference.
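
One way to build such a codebook is one-dimensional k-means, sketched below. The clustering method and the linear initialisation of centroids are assumptions; the definition above does not prescribe how clusters are found.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalars into k centroids; return (centroids, index per value)."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the value range (requires k >= 2).
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]

weights = [-1.02, -0.98, -1.0, 0.01, -0.01, 0.0, 0.99, 1.01, 1.0, 0.02]
codebook, indices = kmeans_1d(weights, k=3)
shared = [codebook[i] for i in indices]   # lookup-table reconstruction
```

With 3 centroids, each weight is stored as a 2-bit index plus a shared table of 3 floats, instead of one float per weight.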

Tucker Decomposition

Form of tensor decomposition applied to weight tensors (for example, the 4D kernels of convolutional layers) to compress them. It decomposes a tensor into a smaller core tensor and one factor matrix per decomposed mode, significantly reducing the number of parameters and computational cost.
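
The parameter savings can be estimated with simple arithmetic. The sketch below assumes the Tucker-2 variant, which decomposes only the input-channel and output-channel modes of a convolution kernel and leaves the small spatial modes intact. The layer size and ranks are made-up examples.

```python
def conv_params(c_out, c_in, k):
    """Parameters of a dense k x k convolution kernel."""
    return c_out * c_in * k * k

def tucker2_params(c_out, c_in, k, r_out, r_in):
    """Core tensor (r_out x r_in x k x k) plus two factor matrices."""
    core = r_out * r_in * k * k
    factors = c_out * r_out + c_in * r_in
    return core + factors

original = conv_params(256, 256, 3)
compressed = tucker2_params(256, 256, 3, r_out=64, r_in=64)
ratio = original / compressed
```

For this example the kernel shrinks from 589,824 to 69,632 parameters. The achievable ranks, and therefore the accuracy cost, depend on the trained weights.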

CP Decomposition (CANDECOMP/PARAFAC)

Tensor decomposition method that expresses a tensor as a sum of rank-one tensors, each formed as the outer product of one vector per mode. It is used to compress convolutional layers by approximating the weight tensor with a reduced number of components.
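
A small sketch of the reconstruction for a 3-way tensor, followed by a parameter count for a convolution-sized example. The factor matrices, layer dimensions, and rank are illustrative values, and fitting the factors to a trained tensor is not shown.

```python
def cp_reconstruct(A, B, C):
    """Rebuild T[i][j][k] = sum over r of A[i][r] * B[j][r] * C[k][r]."""
    rank = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

def cp_params(dims, rank):
    """One factor matrix of shape (dim x rank) per tensor mode."""
    return rank * sum(dims)

# Two rank-one components (columns of A, B, C) summed into a 2 x 3 x 2 tensor.
A = [[1, 0], [2, 1]]
B = [[1, 1], [0, 2], [1, 0]]
C = [[1, 2], [3, 0]]
T = cp_reconstruct(A, B, C)

dense_kernel = 256 * 256 * 3 * 3
cp_kernel = cp_params((256, 256, 3, 3), rank=64)
```

At toy sizes the factors can be larger than the tensor itself. The savings appear at realistic layer sizes, where the factor count grows with the sum of the dimensions rather than their product.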

Variable Neural Network (VNN)

Model architecture where the number of active channels in each layer can vary dynamically based on resource constraints. It allows for flexible trade-offs between accuracy and computational cost at runtime.
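
The mechanism can be sketched with a two-layer perceptron whose hidden width is chosen per call. The `width` argument, and the convention that the first channels are the active ones, are assumptions made for illustration.

```python
def dense(x, weight, bias, n_out):
    """Evaluate only the first n_out output channels of a dense layer.
    zip stops at len(x), so columns of inactive input channels are never read."""
    return [sum(w * v for w, v in zip(weight[o], x)) + bias[o] for o in range(n_out)]

def forward(x, params, width):
    """Run a 2-layer MLP using only the first `width` hidden channels."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in dense(x, w1, b1, width)]   # ReLU
    return dense(hidden, w2, b2, len(w2))

w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]   # 4 hidden x 2 inputs
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0, 1.0]]                              # 1 output x 4 hidden
b2 = [0.0]
params = [(w1, b1), (w2, b2)]

x = [2.0, 1.0]
full = forward(x, params, width=4)   # all channels active
half = forward(x, params, width=2)   # half the hidden channels, fewer operations
```

The same stored weights serve both configurations. For the narrower one to stay accurate, the network has to be trained with the different widths in mind, which this sketch does not cover.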

Blockwise Quantization

Technique that divides weight or activation tensors into smaller blocks and applies independent quantization to each block. It better captures local magnitude variations, reducing overall quantization error.
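
The effect is easiest to see on a tensor with an outlier. The sketch below compares one scale for the whole tensor against one scale per block, assuming symmetric quantization with 127 positive levels. The values and the block size are made up.

```python
def quantize_dequantize(values, levels=127):
    """Symmetric quantization with one scale for `values`, then dequantization."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def blockwise(values, block_size, levels=127):
    """Quantize each consecutive block with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], levels))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# The first block holds small values; the second contains a large outlier.
weights = [0.011, -0.023, 0.017, 0.008, 12.7, -3.1, 5.4, 0.9]
per_tensor = quantize_dequantize(weights)
per_block = blockwise(weights, block_size=4)
per_tensor_err = mean_abs_error(weights, per_tensor)
per_block_err = mean_abs_error(weights, per_block)
```

With a single scale, the outlier forces a step size so coarse that every small weight rounds to zero. Per-block scales keep the small values distinguishable, at the cost of storing one scale per block.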

8-bit Floating Point Representation (FP8)

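Low-precision data format using 8 bits to represent floating-point numbers, with different variants for training and inference: E4M3 (4 exponent bits, 3 mantissa bits) favors precision, while E5M2 (5 exponent bits, 2 mantissa bits) favors dynamic range. Because the spacing between representable values scales with magnitude, it can offer better trade-offs than integer formats for workloads whose values span a wide range.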
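
To make the two variants concrete, the sketch below decodes finite FP8 bit patterns. It assumes the exponent biases of the OCP 8-bit floating point specification (7 for E4M3, 15 for E5M2) and deliberately ignores special values such as NaN and infinity.

```python
def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode a finite FP8 bit pattern laid out as sign | exponent | mantissa."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    frac = mantissa / (1 << man_bits)
    if exponent == 0:                        # subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - bias) * frac
    return sign * 2.0 ** (exponent - bias) * (1.0 + frac)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15)

e4m3_max = decode_e4m3(0b0_1111_110)         # largest finite E4M3 value
e5m2_max = decode_e5m2(0b0_11110_11)         # largest finite E5M2 value
# Gap between 1.0 and the next representable value in each format.
e4m3_step = decode_e4m3(0b0_0111_001) - decode_e4m3(0b0_0111_000)
e5m2_step = decode_e5m2(0b0_01111_01) - decode_e5m2(0b0_01111_00)
```

E5M2 reaches much larger magnitudes than E4M3, while E4M3 resolves values near 1.0 twice as finely.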

Structured N:M Sparsity

Pruning scheme where, for every block of M weights, exactly N weights are preserved (N < M). This regular pattern is designed to be efficiently accelerated by specialized matrix computation units (Tensor Cores) in modern GPUs.
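
A minimal sketch of the pattern for N = 2 and M = 4, keeping the weights with the largest magnitude in each group. The magnitude criterion is an assumption; the definition above only fixes how many weights survive per block.

```python
def prune_n_m(weights, n=2, m=4):
    """In every consecutive group of m weights, keep the n largest magnitudes."""
    assert len(weights) % m == 0, "length must be a multiple of m"
    out = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse = prune_n_m(weights, n=2, m=4)
```

Every group of four ends up with exactly two nonzero entries, so hardware can store the two values plus small position indices and skip the zeros in a predictable way.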
