RAG systems have a context problem
Multi-Vector Retrieval Details with Mixedbread's Aamir Shakir
RAG systems have a context problem. I talked with Aamir Shakir, the founder of Mixedbread, for a deep dive into the research and engineering behind modern retrieval systems. I've been using Mixedbread's tools, like mgrep. They claim to cut tokens in half, speed up retrieval, and improve quality. After running my own experiments, I found they were right.
We went beyond hybrid search and re-rankers to the architectural shift of multi-vector retrieval. This post summarizes the theory, engineering challenges, and practical applications.
If you missed the first talk on using mgrep for agentic workflows, you
can watch it here: mgrep with Founding Engineer
Rui.
What is Mixedbread & Multi-Vector Search?
Mixedbread began as an applied research lab built on a simple
hypothesis: AI will only be as useful as its context. Without
context, an AI is like a new employee on day one. This "context
problem" is a search and retrieval problem.
AI models have advanced fast, but retrieval tech still relies on concepts from 20 years ago. Mixedbread's goal is to modernize retrieval.
The core of their approach is multi-vector search.
Traditional retrieval-augmented generation (RAG) typically follows this
path:
- Take a document.
- Split it into chunks.
- Create a single vector embedding for each chunk.
This process compresses complex information into a single vector.
Multi-vector search, particularly models like ColBERT, changes this.
Instead of one vector per chunk, it creates one vector per token.
For a sentence like "I love bread," a traditional model produces one
vector. A multi-vector model produces roughly three, one per token (the
exact count depends on the tokenizer), preserving far more granular
information.
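To make the per-token idea concrete, here is a minimal sketch of the late-interaction ("MaxSim") scoring that ColBERT-style models use: each query token picks its best-matching document token, and those maxima are summed. The 4-dimensional embeddings below are made up for illustration; a real model would produce them.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """For each query token, take its best document-token match, then sum."""
    sims = normalize(query_tokens) @ normalize(doc_tokens).T  # (n_query, n_doc)
    return float(sims.max(axis=1).sum())

# Toy token embeddings (hypothetical values, one row per token).
query = np.array([[1.0, 0.0, 0.0, 0.0],    # "bread"
                  [0.0, 1.0, 0.0, 0.0]])   # "love"
doc_a = np.array([[0.0, 0.0, 1.0, 0.0],    # "I"
                  [0.0, 1.0, 0.0, 0.0],    # "love"
                  [1.0, 0.0, 0.0, 0.0]])   # "bread"
doc_b = np.array([[0.0, 0.0, 1.0, 0.0],    # "I"
                  [0.0, 0.0, 0.0, 1.0],    # "bake"
                  [1.0, 0.0, 0.0, 0.0]])   # "bread"

score_a = maxsim_score(query, doc_a)  # both query tokens find an exact match
score_b = maxsim_score(query, doc_b)  # only "bread" finds a match
print(score_a, score_b)
```

Because every query token is matched separately, a document that misses one of the query's tokens scores visibly lower, which is the granularity a single pooled vector tends to blur.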
Why Multi-Vector Outperforms Traditional RAG
The limitations of older methods highlight why a new approach is needed.
- Keyword Search (BM25) anchors on exact keywords, making it robust
  for niche domains with specific terminology. It fails with semantics,
  synonyms, abbreviations (like "RAG" vs. "retrieval augmented
  generation"), and context ("Apple" the fruit vs. "Apple" the company).
- Single-Vector Search compresses paragraphs into a single point.
  It captures the topic but blurs nuance. If a paragraph covers
  politics, food, and sports, you may only retain the main topic and
  lose the details. It's also sensitive to "out-of-distribution" data.
  If the model hasn't seen a term, or OCR errors introduce strange
  characters, it guesses where to place the vector, losing meaning.
Multi-vector search combines the best of both worlds.
- Granularity: By representing every token, it captures the
  keyword-level precision of BM25, making it robust to
  out-of-distribution terms.
- Semantics: Since each token's representation is a dense vector, it
  also captures the semantic meaning and context, like a
  traditional embedding.
This approach provides a powerful hybrid search "out of the box." Because it retains more information, it generalizes well to new domains, complex data, and long-context retrieval.
Aamir shared benchmarks where their ColBERT-style model (trained on
300-token docs) outperformed models designed for long-context retrieval
on documents with tens of thousands of tokens.
To build a strong foundation in traditional RAG, including BM25 and
semantic search, check out my
course.
All the content is free to access.
Making Multi-Vector Practical with Quantization
If multi-vector is so powerful, why wasn't it the standard all along? The primary barriers were infrastructure and cost. Storing a vector for every token generates massive data, making it expensive and slow without the right engineering.
This is where quantization becomes critical. Quantization converts high-precision numbers (like 32-bit floats) into lower-precision formats to save storage and speed up computation.
Aamir explained two common techniques:
- Int8 Quantization: Store 8-bit integers per dimension instead of
  32-bit floats. Map values to 256 buckets based on min/max. This cuts
  storage by 4x and can speed computation by 8-10x with little loss in
  retrieval quality.
- Binary (1-bit) Quantization: Store a 1 or 0 per dimension. This
  reduces storage by 32x. Instead of cosine similarity, you can use
  Hamming distance (XOR and popcount), which is extremely fast. This
  can lead to performance loss if the model isn't optimized for it.
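Both techniques fit in a few lines of NumPy. This is a minimal sketch, not Mixedbread's implementation: the random vectors stand in for real embeddings, the per-dimension min/max calibration is one common choice among several, and the function names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128)).astype(np.float32)  # stand-in embeddings

def quantize_int8(x):
    """Map each dimension onto 256 buckets spanning its observed min/max."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # avoid division by zero on constant dimensions
    q = np.round((x - lo) / scale) - 128  # bucket index shifted into [-128, 127]
    return q.astype(np.int8), lo, scale

def dequantize_int8(q, lo, scale):
    """Approximate reconstruction of the original floats."""
    return (q.astype(np.float32) + 128) * scale + lo

def quantize_binary(x):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(x > 0, axis=1)

def hamming_to_all(query_bits, doc_bits):
    """XOR the packed bits, then count the set bits (popcount) per document."""
    return np.unpackbits(np.bitwise_xor(doc_bits, query_bits), axis=1).sum(axis=1)

q8, lo, scale = quantize_int8(vectors)
bits = quantize_binary(vectors)

print(vectors.nbytes // q8.nbytes)    # storage ratio float32 vs int8
print(vectors.nbytes // bits.nbytes)  # storage ratio float32 vs 1-bit
nearest = int(np.argmin(hamming_to_all(bits[0], bits)))
```

The storage ratios follow directly from the bit widths (32 vs. 8 vs. 1 bit per dimension). The speedups quoted above depend on hardware and kernels, so this sketch does not try to reproduce them.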
Mixedbread found a trick to mitigate binary loss: keep document vectors
binary, but keep the query vector higher precision (float32 or int8).
That drops performance loss from ~40% to ~5%. Query precision matters
more than storing both at low precision.
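One way to picture this asymmetric setup: unpack the stored document bits into +1/-1 signs and take a dot product with the full-precision query. The sketch below is an assumption about how such scoring could look, not Mixedbread's actual kernel; the data is random and `asymmetric_scores` is a name I made up.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 128
docs = rng.normal(size=(500, dim)).astype(np.float32)
doc_bits = np.packbits(docs > 0, axis=1)  # stored form: 1 bit per dimension

# A float32 query that is a noisy copy of document 7.
query = docs[7] + 0.1 * rng.normal(size=dim).astype(np.float32)

def asymmetric_scores(query, doc_bits, dim):
    """Score a float query against 1-bit documents.

    Each stored bit becomes +1 or -1, so the query keeps its per-dimension
    magnitudes while the documents stay at 1 bit of storage.
    """
    signs = np.unpackbits(doc_bits, axis=1, count=dim).astype(np.float32) * 2.0 - 1.0
    return signs @ query

best = int(np.argmax(asymmetric_scores(query, doc_bits, dim)))
print(best)
```

The query is computed once per search, so keeping it at higher precision costs almost nothing compared with the storage saved on the document side.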
Mixedbread and Hugging Face co-authored a post on this topic, showing
how to achieve a 40x speedup and 62x cost reduction.
I also wrote a post that breaks down the fundamentals of quantization
for multi-vector retrieval.
Mixedbread's Architecture & Semantic Chunking
With these techniques, Mixedbread built an end-to-end system for speed and scale. Indexing the entire React codebase (60 million tokens) takes a couple of minutes.
Here's an overview of their architecture:
- Ingestion & Chunking: When a file is uploaded, it's chunked based
  on semantics (more on this later).
- Inference: Chunks are sent to GPUs running a custom inference
  engine with CUDA kernels, enabling massive parallelization and
  low-latency embedding generation.
- Storage & Caching: Embeddings are quantized and stored. The system
  uses a two-step retrieval process (fast, lossy first pass; full
  precision second pass) and multi-tier caching from S3 to hard drives,
  NVMe SSDs, and in-memory for hot data.
A query typically takes around 60 milliseconds end-to-end (P95).
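The two-step retrieval idea can be sketched as follows: a cheap Hamming-distance pass over binary codes narrows the corpus to a shortlist, and only that shortlist is rescored at full precision. For brevity this uses one vector per document and random data; the function name and the `first_pass_k` / `final_k` parameters are hypothetical, and a real multi-vector system would rescore with per-token vectors.

```python
import numpy as np

def two_stage_search(query, docs, doc_bits, first_pass_k=50, final_k=5):
    """Lossy first pass on binary codes, exact second pass on the shortlist."""
    query_bits = np.packbits(query > 0)

    # Pass 1: Hamming distance (XOR + popcount) against every document.
    dists = np.unpackbits(np.bitwise_xor(doc_bits, query_bits), axis=1).sum(axis=1)
    candidates = np.argsort(dists)[:first_pass_k]

    # Pass 2: full-precision dot product, but only for the candidates.
    exact = docs[candidates] @ query
    order = np.argsort(-exact)[:final_k]
    return candidates[order]

rng = np.random.default_rng(2)
docs = rng.normal(size=(2000, 64)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
doc_bits = np.packbits(docs > 0, axis=1)

# Query: a slightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.normal(size=64).astype(np.float32)
top = two_stage_search(query, docs, doc_bits)
print(top)
```

The first pass touches every document but only reads 1 bit per dimension; the expensive float math runs on 50 candidates instead of 2,000.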
Smart Chunking for Any Data Type
A key part of Mixedbread's system is its approach to parsing and chunking different file types.
- Code: They parse the Abstract Syntax Tree (AST) to create
  semantically meaningful chunks, grouping related functions or classes
  together.
- PDFs: PDFs are hard to parse due to tables, columns, and charts.
  Mixedbread takes a screenshot of each page and embeds the image,
  preserving layout and content. They also use LLMs to create
  contextual summaries to link pages together.
- Video: A transformer-based shot detection model analyzes frames to
  identify scene changes, creating logical chunks based on the visual
  narrative.
- Text/Markdown: They use contextualization methods to ensure each
  chunk contains relevant surrounding information, a technique inspired
  by research from Anthropic.
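As an illustration of AST-based chunking for code, here is a minimal sketch using Python's standard `ast` module. It is not Mixedbread's parser: it handles only Python, treats each top-level function or class as one chunk (so a class stays together with its methods), and `chunk_python_source` is a name I chose.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text covered by this node.
                "text": ast.get_source_segment(source, node),
            })
    return chunks

example = '''
import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

chunks = chunk_python_source(example)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Chunk boundaries follow the structure of the program rather than a fixed token count, so a function is never cut in half.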
Processing an entire document first and then chunking at the embedding
level is sometimes called "late chunking."
I wrote a post that covers the concept of late
chunking with a
minimal implementation.
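The core of late chunking fits in a few lines. This is a rough sketch and not the implementation from the linked post: it assumes a long-context encoder has already produced one embedding per token for the whole document (a small hand-made array stands in for that output here), and it mean-pools the token embeddings inside each chunk's span.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Pool document-level token embeddings into one vector per chunk.

    token_embeddings: (n_tokens, dim), computed over the WHOLE document.
    spans: (start, end) token offsets per chunk, end exclusive.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0) for start, end in spans])

# Placeholder for encoder output: 6 tokens, 2 dimensions.
token_embeddings = np.arange(12, dtype=float).reshape(6, 2)
spans = [(0, 2), (2, 6)]  # chunk 1 = tokens 0-1, chunk 2 = tokens 2-5

chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors)
```

Because the encoder saw the full document before pooling, each token embedding already carries context from outside its own chunk.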
The Role of Re-rankers & Cross-Encoders
Even with a strong retriever like ColBERT, re-ranking can boost quality.
Aamir confirmed they use cross-encoders internally.
A cross-encoder looks at the query and a candidate document together,
making a more accurate relevance judgment than a retriever that embeds
the document in isolation.
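The retrieve-then-rerank flow looks roughly like this. The scoring function below is a deliberately crude stand-in (word overlap) so the example runs without downloading a model; in practice `joint_score` would be a trained cross-encoder that reads the query and the document together. Both function names are hypothetical.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           joint_score: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the best top_k."""
    scored = [(joint_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_joint_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Candidates as they might come back from a first-stage retriever.
candidates = [
    "bread prices in 2020",
    "training embedding models",
    "how to bake sourdough bread",
]
reranked = rerank("bake bread", candidates, toy_joint_score)
print(reranked)
```

The key structural point is that the scorer sees the query and each candidate as a pair, which is why it only runs on a short candidate list rather than the whole corpus.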
The next frontier is list-wise re-ranking, where the model sees the
query and the entire list of candidates at once. It can answer
questions like "which is the fastest?" by comparing all options, but
it's currently too slow and expensive for most production systems.
Aamir is also excited about learnable scoring functions. Instead of
burning GPUs to create complex embeddings only to compare them with
cosine similarity, the scoring function itself could be learned,
improving relevance.
How to Get Started in Retrieval Research
Retrieval is more accessible than training foundational LLMs. You can start with a MacBook. Aamir's advice:
- Read the Fundamentals: Start with the original SBERT
  (Sentence-BERT) paper to understand
  the basics of modern embedding models.
- Learn by Doing: Use libraries like sentence-transformers to train
  your own models. The documentation is excellent.
- Read In-Depth Guides: The Mixedbread
  blog offers deep dives into their
  training techniques.
- Stay Updated: Follow resources like the Information Retrieval
  Substack to keep up with the latest
  research.
- Embrace the Struggle: Build things yourself. Don't rely on AI to
  write all the code. The learning happens when you debug PyTorch and
  CUDA errors.
Conclusion
Multi-vector search is heavier and harder to engineer than standard RAG. With quantization making it affordable, the quality gains are now accessible. If you're hitting a ceiling with semantic search, this is the architecture to investigate next. Mixedbread offers an API that does it for you.
My conversation with Aamir reinforced that AI quality is tied to context quality. As models get smarter, the tools we use to feed them information must get smarter too.
If you're working on complex retrieval problems, the techniques discussed here are the new baseline. If you're an engineer passionate about building high-performance, distributed systems, Mixedbread is hiring.
Explore their work and open positions at
mixedbread.com.