Vector compression (most commonly implemented as Vector Quantization) is the process of reducing the memory footprint of vector embeddings while retaining enough information to find “nearest neighbors” accurately.

In production vector databases, RAM is the most expensive bottleneck. A raw 1536-dimensional vector (such as one from OpenAI’s text-embedding-3-small) takes up roughly 6KB of memory. If you have 100 million vectors, that is roughly 600GB of RAM just to load the index, which is prohibitively expensive for most companies.

Compression reduces this size by 4x, 32x, or even 64x, often with less than 2-3% loss in retrieval accuracy.
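The arithmetic behind those numbers is worth seeing once. A quick back-of-the-envelope check in plain Python (the compression ratios are the ones discussed below):

```python
dims = 1536                     # text-embedding-3-small output size
bytes_per_float32 = 4
num_vectors = 100_000_000

one_vector_bytes = dims * bytes_per_float32            # 6,144 bytes ≈ 6 KB
raw_index_gb = one_vector_bytes * num_vectors / 1e9    # ≈ 614 GB

print(f"one raw vector:    {one_vector_bytes} bytes")
print(f"100M raw vectors:  {raw_index_gb:.0f} GB")
print(f"int8 (4x):         {raw_index_gb / 4:.0f} GB")
print(f"binary (32x):      {raw_index_gb / 32:.0f} GB")
```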

Here are the fundamental ways this is done, ranked from simplest to most advanced.


1. Scalar Quantization (SQ)

The “Low Resolution” Approach

This is the most popular “safe” default. It works by reducing the precision of the numbers in your vector.

Standard vectors use float32 (32-bit decimal numbers like 0.1234567). SQ maps these to int8 (8-bit integers like 12) or even int4.

  • How it works: It looks at the range of values in your vector (e.g., min -1.0, max 1.0) and divides that range into 256 “buckets” (for int8). Every detailed decimal is snapped to the nearest bucket ID (see the sketch after this list).

  • The Gain: 4x compression (32 bits → 8 bits).

  • The Trade-off: Minimal accuracy loss; usually unnoticeable for RAG applications.
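A minimal NumPy sketch of that bucketing (real engines typically calibrate the min/max per dimension or per segment and use quantiles rather than the raw extremes, but the idea is the same):

```python
import numpy as np

def scalar_quantize(vec: np.ndarray):
    """Map float32 values onto 256 uint8 buckets spanning [min, max]."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 255.0
    codes = np.round((vec - lo) / scale).astype(np.uint8)   # 1 byte per dimension
    return codes, lo, scale

def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Recover an approximation of the original vector."""
    return codes.astype(np.float32) * scale + lo

vec = np.random.randn(1536).astype(np.float32)
codes, lo, scale = scalar_quantize(vec)
print(codes.nbytes, "bytes instead of", vec.nbytes)          # 1536 vs 6144
print("max error:", np.abs(dequantize(codes, lo, scale) - vec).max())
```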

2. Binary Quantization (BQ)

The “Extreme” Approach

This turns your vector into a string of 0s and 1s. It is incredibly fast because computers can calculate the distance between binary strings (Hamming distance) much faster than floating-point math.

  • How it works: It looks at every dimension. Is the number greater than 0? Make it a 1. Otherwise, make it a 0. (A sketch follows this list.)

  • The Gain: 32x compression (32 bits → 1 bit).

  • The Trade-off: High information loss. This usually only works well with very large vectors (1000+ dimensions) like OpenAI’s or Cohere’s, where there is enough redundancy that “roughly positive” vs “roughly negative” is enough to define meaning.
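A NumPy sketch of the thresholding plus a Hamming-distance comparison (production engines use SIMD popcount instructions, but the logic is identical):

```python
import numpy as np

def binary_quantize(vec: np.ndarray) -> np.ndarray:
    """1 bit per dimension: 1 if the value is positive, else 0."""
    return np.packbits(vec > 0)                  # 1536 dims -> 192 bytes

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count the differing bits between two packed binary codes."""
    return int(np.unpackbits(a ^ b).sum())

a = np.random.randn(1536).astype(np.float32)
b = np.random.randn(1536).astype(np.float32)
ca, cb = binary_quantize(a), binary_quantize(b)
print(ca.nbytes, "bytes instead of", a.nbytes)   # 192 vs 6144 (32x)
print("hamming distance:", hamming(ca, cb))
```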

3. Product Quantization (PQ)

The “Codebook” Approach

This is the classic industry-standard compression method (popularized by FAISS). Instead of compressing each number individually, it looks at sub-sections of the vector and finds patterns.

  • How it works (sketched in code after this list):

    1. Split: Break a 1000-dimension vector into 10 chunks of 100 dimensions each.

    2. Cluster: For each chunk, look at all your data and find 256 common “patterns” (centroids).

    3. Replace: Instead of saving the raw numbers, you just save the ID of the closest pattern (e.g., “Pattern #42”).

  • The Gain: High compression (commonly 8x-64x, depending on how many chunks and centroids you use) with good accuracy preservation.

  • The Trade-off: It requires a “training” phase to learn the patterns before you can index data.
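A toy end-to-end version of the split → cluster → replace pipeline, using scikit-learn’s k-means for the codebooks (FAISS’s IndexPQ does all of this far more efficiently and also handles query-time distance computations; this sketch only shows the moving parts):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(data, n_chunks=8, n_centroids=256):
    """Learn one codebook (a set of centroids) per chunk of the vector."""
    chunks = np.split(data, n_chunks, axis=1)
    return [KMeans(n_clusters=n_centroids, n_init=1, random_state=0).fit(c)
            for c in chunks]

def encode(vec, codebooks):
    """Replace each chunk with the ID of its nearest centroid (1 byte each)."""
    chunks = np.split(vec.reshape(1, -1), len(codebooks), axis=1)
    return np.array([cb.predict(c)[0] for cb, c in zip(codebooks, chunks)],
                    dtype=np.uint8)

train = np.random.randn(10_000, 128).astype(np.float32)  # the "training" phase data
codebooks = train_codebooks(train)
code = encode(train[0], codebooks)
print(code)   # e.g. [ 17 203   4 ...] -> 8 bytes instead of 128 * 4 = 512 bytes
```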

4. Matryoshka Representation Learning (MRL)

The “Truncation” Approach (The Modern Way)

This is a newer technique that happens at the model level (e.g., OpenAI’s newer models support this). “Matryoshka” refers to Russian nesting dolls.

  • How it works: The embedding model is trained so that the most important information is crammed into the start of the vector. If you have a 1536-dimension vector but only want to use 256 dimensions, you literally just chop off the end and re-normalize (see the sketch after this list).

  • The Gain: Flexible compression. You can store the full vector for precision but use the short version for extremely fast initial searches.

  • The Trade-off: You must use an embedding model trained specifically for this (e.g., text-embedding-3, nomic-embed).
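In code, the truncation really is just a slice; the one detail people forget is re-normalizing afterwards so cosine and dot-product scores stay comparable. (With OpenAI’s text-embedding-3 models you can also request a shorter vector directly via the API’s dimensions parameter.)

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions, then re-normalize to unit length."""
    short = vec[:dims]
    return short / np.linalg.norm(short)

full = np.random.randn(1536).astype(np.float32)   # stand-in for a real embedding
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)             # 6x smaller
print(short.shape, float(np.linalg.norm(short)))  # (256,) 1.0
```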


Summary Comparison

Technique    | Compression | Speed   | Complexity | Best Use Case
Scalar (SQ)  | 4x          | Fast    | Low        | Default choice. Good balance for most apps.
Binary (BQ)  | 32x         | Blazing | Medium     | Massive datasets (10M+) with high-dim vectors.
Product (PQ) | 8x - 64x    | Medium  | High       | Legacy systems or when you need specific memory tuning.
Matryoshka   | Variable    | Fast    | Low        | If your model supports it (OpenAI/Nomic) and you want flexibility.

Which one should you use?

If you are using a modern vector database (like Pinecone, Weaviate, or Qdrant), start with Scalar Quantization (int8). It is often a simple toggle that saves you 75% of your RAM costs with virtually no downside.
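For example, with Qdrant’s Python client the toggle looks roughly like this (an illustrative sketch; check your client version’s documentation for the exact field names):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    # int8 scalar quantization; always_ram keeps the compressed vectors in memory
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        )
    ),
)
```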

Why RAM and not Disk?

1. The “Random Hop” Problem (Latency)

Standard databases (like SQL) usually know exactly where to look (e.g., “Go to row 500”). Vector databases, however, use an algorithm called HNSW (Hierarchical Navigable Small World). Think of HNSW as a network of friends.  

To find the “nearest neighbor,” the algorithm hops from one vector to another, checking distances.

  • “Are you close? No. Is your neighbor close? Maybe. Let’s check his neighbor.”

  • This hopping is essentially random: it jumps all over the memory address space (see the toy walk sketched below).
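To make that access pattern concrete, here is a toy greedy walk over a pre-built neighbor graph (a single-layer caricature of HNSW; the point is that every hop touches a vector at an essentially arbitrary memory location):

```python
import numpy as np

def greedy_search(vectors, neighbors, query, entry_point, max_hops=100):
    """Hop to whichever linked neighbor is closest to the query,
    stopping once no neighbor improves on the current best distance.
    `neighbors[i]` is the list of node ids linked to node i."""
    current = entry_point
    best = np.linalg.norm(vectors[current] - query)
    for _ in range(max_hops):
        dists = [(np.linalg.norm(vectors[n] - query), n)   # random memory reads
                 for n in neighbors[current]]
        dist, node = min(dists)
        if dist >= best:
            break                                          # no closer neighbor
        current, best = node, dist
    return current, best
```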

Why Disk Fails Here:

  • RAM: fast random access (on the order of nanoseconds).

  • Disk (SSD): fast sequential reads, but much slower random access (tens to hundreds of microseconds; milliseconds on spinning disks).

If the database had to go to the disk for every single “hop” in the graph, a search that currently takes 10ms would take 500ms+.
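The rough math behind that claim (the figures are illustrative orders of magnitude, not benchmarks):

```python
reads_per_query = 5_000     # vector fetches a graph search might perform (illustrative)
ram_latency_s   = 100e-9    # ~100 ns per random RAM access
ssd_latency_s   = 100e-6    # ~100 µs per random SSD read

print(f"RAM: ~{reads_per_query * ram_latency_s * 1e3:.1f} ms spent fetching vectors")
print(f"SSD: ~{reads_per_query * ssd_latency_s * 1e3:.0f} ms spent fetching vectors")
# -> roughly 0.5 ms vs 500 ms: the distance math is cheap, the random I/O is not
```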

2. The Solution: “Hot” Index vs. “Cold” Storage

To get that speed, databases traditionally load the entire graph structure (the index) into RAM.  

  • Disk: Used for Storage. It holds the raw data so it doesn’t get lost if the power goes out.

  • RAM: Used for Search. When you start the app, the DB loads the vector index into RAM so it can “hop” at lightning speed.  

Because RAM is expensive (roughly 10x-50x more per GB than SSD), databases are moving to hybrid models.

  • Qdrant (Memmap): Qdrant uses mmap (memory mapping), which lets it treat a file on disk as if it were part of RAM; the operating system pages data in only when it is actually touched. It keeps the “navigation links” (the small stuff) in RAM, but leaves the heavy vector data on disk until the last possible moment (see the numpy.memmap sketch after this list).

    • Result: You can store more data, but it is slightly slower.

  • Pinecone (Serverless): They separate storage and compute completely. Your data sits in “cold” object storage (like S3/Blob) which is cheap. When you query, they quickly load just the relevant slice into a “hot” cache layer.  

  • DiskANN: This is a newer algorithm (used by Azure and others) specifically designed to run on SSDs. It organizes the data so that the “hops” happen in a way that minimizes disk reads, allowing for massive datasets with very little RAM.
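A sketch of the memory-mapping trick using numpy.memmap (this is not Qdrant’s actual code, just the underlying OS mechanism: the file presents itself as an in-memory array, and the operating system pages chunks in from disk only when they are touched):

```python
import numpy as np

dims, count = 1536, 100_000

# Back the array with a file on disk instead of allocating ~600 MB of RAM.
vectors = np.memmap("vectors.f32", dtype=np.float32, mode="w+",
                    shape=(count, dims))

query = np.random.randn(dims).astype(np.float32)
dist = np.linalg.norm(vectors[42_000] - query)   # only this row is paged in from disk
print(dist)
```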

Summary

  • Storage (Disk): Keeps data safe and durable.

  • Search (RAM): Required because calculating distance involves jumping randomly across millions of points, which is too slow for disks to handle in real-time.