What is the Impact of Quantisation on Memory Utilisation?

Aspect	Impact on Memory Utilization
Large Language Models (LLMs)	Reduced Model Size: Quantization reduces the number of bits used to represent each weight and activation. For example, converting weights from 32-bit floating point (FP32) to 8-bit integer (INT8) cuts their storage by a factor of 4, apart from a small overhead for scale and zero-point parameters. Lower Memory Footprint: The reduced precision lowers the overall memory footprint for storing model parameters and, when activations are quantized as well, the intermediate activations produced during inference and training. Increased Batch Sizes: With less memory taken up by the model, larger batches fit within the same memory budget, which can improve throughput.
Vector Databases	Compact Embeddings: Quantization reduces the size of the vector embeddings stored in the database. For instance, converting each vector component from a 32-bit float to an 8-bit integer decreases storage requirements by up to 75%. Efficient Indexing: Smaller vectors allow more efficient indexing and faster retrieval, because each lookup moves less data through memory and more vectors fit in cache. Scalability: The reduction in memory usage lets the system handle larger datasets and more vectors on the same hardware.
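
To make the 4x (75%) figure in the table concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single vector. It uses only Python's standard library. The 768-element dimension, the random test data, and the helper names quantize_int8 and dequantize_int8 are assumptions for illustration, not part of any particular model or vector database. Real systems typically use per-channel or per-block scales and optimized kernels.

```python
from array import array
import random

def quantize_int8(values):
    """Symmetric quantization: one float scale per vector, int8 codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero vector
    scale = max_abs / 127.0
    codes = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float32 values from the int8 codes."""
    return array("f", (c * scale for c in codes))

random.seed(0)
dim = 768  # hypothetical embedding dimension
vec_fp32 = array("f", (random.uniform(-1.0, 1.0) for _ in range(dim)))
vec_int8, scale = quantize_int8(vec_fp32)

bytes_fp32 = len(vec_fp32) * vec_fp32.itemsize  # 4 bytes per element
bytes_int8 = len(vec_int8) * vec_int8.itemsize  # 1 byte per element
saving = 1 - bytes_int8 / bytes_fp32

# Worst-case rounding error introduced by quantization
restored = dequantize_int8(vec_int8, scale)
max_error = max(abs(a - b) for a, b in zip(vec_fp32, restored))

print(f"FP32: {bytes_fp32} bytes, INT8: {bytes_int8} bytes, saving: {saving:.0%}")
print(f"scale: {scale:.6f}, max reconstruction error: {max_error:.6f}")
```

The saving is "up to" 75% because the scale (and, in asymmetric schemes, a zero point) must be stored alongside the codes. The reconstruction error is bounded by about half the scale, which is the accuracy cost paid for the smaller footprint.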

Cost Efficiency: Lower memory usage translates to reduced hardware costs, as less RAM and storage are required.
Energy Efficiency: Lower memory usage often results in lower power consumption, since less data has to be stored and moved.
Performance Improvements: Reduced memory usage can lead to faster data access and processing times, as more data can fit into the faster levels of the memory hierarchy (e.g., caches).
Deployment Flexibility: Models and databases with lower memory footprints can be deployed on a wider range of devices, including edge devices with limited resources.
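
The cost, batch-size, and deployment points above come down to simple arithmetic, sketched below. All the numbers are hypothetical: a 7-billion-parameter model, a 24 GB device, and 0.5 GB of activation memory per sample are assumptions chosen only to illustrate the calculation, and the function names are made up for this example. The sketch counts weights only and ignores quantization metadata, framework overhead, and caches.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "int8": 1.0}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def max_batch_size(budget_gb, weights_gb, per_sample_gb):
    """Largest batch that fits in the memory left after loading the weights."""
    free_gb = budget_gb - weights_gb
    return max(0, int(free_gb // per_sample_gb))

num_params = 7e9      # hypothetical model size
budget_gb = 24.0      # hypothetical device memory
per_sample_gb = 0.5   # hypothetical activation memory per sample

for dtype in ("fp32", "int8"):
    weights_gb = weight_memory_gb(num_params, dtype)
    batch = max_batch_size(budget_gb, weights_gb, per_sample_gb)
    print(f"{dtype}: weights {weights_gb:.1f} GB, max batch size {batch}")
```

Under these assumptions the FP32 weights (28 GB) do not fit on the device at all, while the INT8 weights (7 GB) leave room for a batch of 34 samples. This is the sense in which quantization widens the range of hardware a model can run on.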

An Architect's vision