Most RAG Systems Have a Context Problem
Multi-Vector Retrieval Details with Mixedbread's Aamir Shakir
Most RAG systems have a context problem. I talked with Aamir Shakir, the founder of Mixedbread, for a deep dive into the research and engineering behind modern retrieval systems. I've been using Mixedbread's tools, especially mgrep, and I'm impressed. They claim to cut tokens in half, speed up retrieval, and improve quality. After running my own experiments, I found they were right.
We went beyond hybrid search and re-rankers to the architectural shift of multi-vector retrieval.
This post summarizes the theory, the engineering challenges, and the practical applications of building a state-of-the-art retrieval system.
If you missed the first talk on using mgrep for agentic workflows, you can watch it here: mgrep with Founding Engineer Rui.
What is Mixedbread & Multi-Vector Search?
Mixedbread began as an applied research lab built on a simple hypothesis: AI will only be as useful as its context. Without context, an AI is like a new employee on day one. This "context problem" is a search and retrieval problem.
AI models have advanced fast, but retrieval tech still relies on concepts from 20 years ago. Mixedbread's goal is to modernize retrieval to match today's AI.
The core of their approach is multi-vector search.
Traditional retrieval-augmented generation (RAG) typically follows this path:
- Take a document.
- Split it into chunks.
- Create one single vector embedding for each chunk.
This process compresses complex information into a single vector. Multi-vector search, particularly models like ColBERT, changes this.
Instead of one vector per chunk, it creates one vector per token. For a sentence like "I love bread," a traditional model produces one vector. A multi-vector model produces three, preserving far more granular information.
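To make the distinction concrete, here is a minimal sketch using a generic Hugging Face encoder as a stand-in (not Mixedbread's model): the single-vector path pools all token embeddings into one vector per chunk, while the multi-vector path keeps one vector per token.

```python
# Minimal sketch: single-vector vs. multi-vector representations.
# The model name is an illustrative stand-in, not Mixedbread's production model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love bread", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)

# Single-vector approach: mean-pool everything into ONE embedding per chunk.
single_vector = token_embeddings.mean(dim=0)   # shape: (hidden_dim,)

# Multi-vector (ColBERT-style) approach: keep ONE embedding PER token.
multi_vectors = token_embeddings               # shape: (seq_len, hidden_dim)

print(single_vector.shape, multi_vectors.shape)
```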
Why Multi-Vector Outperforms Traditional RAG
The limitations of older methods highlight why a new approach is needed.
- Keyword Search (BM25): Anchors on exact keywords, making it robust for niche domains with specific terminology. It fails with semantics, synonyms, abbreviations (like "RAG" vs. "retrieval augmented generation"), and context ("Apple" the fruit vs. "Apple" the company).
- Single-Vector Search: Compresses paragraphs into a single point. It captures the topic but blurs nuance. If a paragraph covers politics, food, and sports, you may only retain the main topic and lose the details. It's also sensitive to "out-of-distribution" data. If the model hasn't seen a term, or OCR errors introduce strange characters, it guesses where to place the vector, losing meaning.
Multi-vector search combines the best of both worlds.
- Granularity: By representing every token, it captures the keyword-level precision of BM25, making it robust to out-of-distribution terms.
- Semantics: Since each token's representation is a dense vector, it also captures the semantic meaning and context, like a traditional embedding.
This approach provides a powerful hybrid search "out of the box." Because it retains more information, it generalizes well to new domains, complex data, and long-context retrieval.
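The scoring side of this, not covered in detail above, is usually ColBERT's "MaxSim" late interaction: each query token is matched against its best document token, and the per-token maxima are summed. A minimal sketch, with random vectors standing in for real embeddings:

```python
# MaxSim late-interaction scoring (ColBERT-style), sketched with random vectors.
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> float:
    """query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim)."""
    # Normalize so dot products are cosine similarities.
    q = torch.nn.functional.normalize(query_vecs, dim=-1)
    d = torch.nn.functional.normalize(doc_vecs, dim=-1)
    sim = q @ d.T                         # (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token...
    best_per_query_token = sim.max(dim=1).values
    # ...and the document's score is the sum over query tokens.
    return best_per_query_token.sum().item()

query = torch.randn(5, 128)               # e.g. 5 query tokens
doc_a, doc_b = torch.randn(300, 128), torch.randn(40, 128)
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```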
Aamir shared benchmarks where their ColBERT-style model (trained on 300-token docs) outperformed models designed for long-context retrieval on documents with tens of thousands of tokens.
To build a strong foundation in traditional RAG, including BM25 and semantic search, check out my course. All the content is free to access.
Making Multi-Vector Practical with Quantization
If multi-vector is so powerful, why wasn't it the standard all along? The primary barriers were infrastructure and cost. Storing a vector for every token generates massive data, making it expensive and slow without the right engineering.
This is where quantization becomes critical. Quantization is the process of converting high-precision numbers (like 32-bit floats) into lower-precision formats to save storage and speed up computation.
Aamir explained two common techniques:
- Int8 Quantization: Store 8-bit integers per dimension instead of 32-bit floats, mapping values to 256 buckets based on the observed min/max. This cuts storage by 4x and can speed up computation by 8-10x with little loss in retrieval quality.
- Binary (1-bit) Quantization: Store a single 1 or 0 per dimension. This reduces storage by 32x. Instead of cosine similarity, you can use Hamming distance (XOR and popcount), which is extremely fast. This can lead to a performance loss if the model isn't optimized for it.
Mixedbread found a trick to mitigate the binary loss: keep document vectors binary, but keep the query vector at higher precision (float32 or int8). That drops the performance loss from roughly 40% to about 5%. Keeping the query precise matters more than compressing both sides.
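A rough numpy sketch of both schemes and the asymmetric trick; the min/max calibration and sign-based binarization here are simplified assumptions, not Mixedbread's exact recipe.

```python
# Sketch of int8 and binary quantization, plus asymmetric (float query vs. binary doc)
# scoring. Thresholds and calibration are simplified for illustration.
import numpy as np

def int8_quantize(vecs: np.ndarray):
    """Map each value into 256 buckets between the observed min and max."""
    lo, hi = vecs.min(), vecs.max()
    buckets = np.round((vecs - lo) / (hi - lo) * 255.0).astype(np.uint8)
    return buckets, lo, hi

def binary_quantize(vecs: np.ndarray) -> np.ndarray:
    """One bit per dimension: positive -> 1, otherwise 0, packed into bytes (32x smaller)."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_distance(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """XOR the packed bytes and count the set bits (popcount)."""
    return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

def asymmetric_score(query_float: np.ndarray, doc_bits: np.ndarray) -> float:
    """Keep the query at full precision, map document bits to +/-1, take a dot product."""
    doc_signs = np.unpackbits(doc_bits)[: query_float.shape[0]].astype(np.float32) * 2.0 - 1.0
    return float(query_float @ doc_signs)

doc = np.random.randn(128).astype(np.float32)
query = np.random.randn(128).astype(np.float32)

doc_int8, lo, hi = int8_quantize(doc)   # 4x smaller than float32
doc_bits = binary_quantize(doc)         # 32x smaller than float32
print(hamming_distance(binary_quantize(query), doc_bits))  # symmetric: fastest, lossiest
print(asymmetric_score(query, doc_bits))                    # asymmetric: recovers most quality
```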
Mixedbread and Hugging Face co-authored a post on this topic, showing how to achieve a 40x speedup and 62x cost reduction.
I also wrote a post that breaks down the fundamentals of quantization for multi-vector retrieval.
Mixedbread's Architecture & Semantic Chunking
With these techniques, Mixedbread built an end-to-end system for speed and scale. Indexing the entire React codebase (60 million tokens) takes a couple of minutes.
Here's an overview of their architecture:
- Ingestion & Chunking: When a file is uploaded, it's chunked based on semantics (more on this later).
- Inference: Chunks are sent to GPUs running a custom inference engine with CUDA kernels, enabling massive parallelization and low-latency embedding generation.
- Storage & Caching: Embeddings are quantized and stored. The system uses a two-step retrieval process (a fast, lossy first pass, then a full-precision second pass) and multi-tier caching, from S3 to hard drives, NVMe SSDs, and in-memory storage for hot data.
A query typically takes around 60 milliseconds end-to-end (P95).
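To illustrate the two-step idea in isolation, here is a toy sketch that assumes one vector per chunk for brevity: a lossy Hamming-distance pass over binary codes shortlists candidates, and only the shortlist is rescored at full precision. The sizes and shortlist length are made up.

```python
# Toy two-step retrieval: fast lossy pass over binary codes, then full-precision rescoring.
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((10_000, 128)).astype(np.float32)  # full-precision store
doc_bits = np.packbits(doc_vectors > 0, axis=-1)                     # compact binary index

def search(query: np.ndarray, top_k: int = 10, shortlist: int = 200) -> np.ndarray:
    # Step 1: cheap Hamming-distance pass over the binary index.
    q_bits = np.packbits(query > 0)
    hamming = np.unpackbits(np.bitwise_xor(doc_bits, q_bits), axis=-1).sum(axis=-1)
    candidates = np.argsort(hamming)[:shortlist]
    # Step 2: exact dot-product rescoring on the shortlist only.
    exact = doc_vectors[candidates] @ query
    return candidates[np.argsort(-exact)[:top_k]]

print(search(rng.standard_normal(128).astype(np.float32)))
```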
Smart Chunking for Any Data Type
A key part of Mixedbread's system is its sophisticated approach to parsing and chunking different file types, so users don't need a Ph.D. in data processing.
- Code: They parse the Abstract Syntax Tree (AST) to create semantically meaningful chunks, grouping related functions or classes together (see the sketch after this list).
- PDFs: PDFs are hard to parse due to tables, columns, and charts. Mixedbread takes a screenshot of each page and embeds the image, preserving layout and content. They also use LLMs to create contextual summaries that link pages together.
- Video: A transformer-based shot-detection model analyzes frames to identify scene changes, creating logical chunks based on the visual narrative.
- Text/Markdown: They use contextualization methods to ensure each chunk contains relevant surrounding information, a technique inspired by research from Anthropic.
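As a toy illustration of AST-based chunking, here is a Python-only sketch built on the standard-library ast module; Mixedbread's parser handles many languages and groups related symbols far more intelligently.

```python
# Minimal AST-based chunking for Python source using the standard library.
import ast

def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, keeping each unit intact."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

example = '''
def load(path):
    return open(path).read()

class Indexer:
    def add(self, doc):
        self.docs.append(doc)
'''
for chunk in chunk_python_source(example):
    print("--- chunk ---")
    print(chunk)
```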
This idea of processing entire documents and then chunking at the embedding level is sometimes called "late chunking."
I wrote a post that covers the concept of late chunking with a minimal implementation.
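Here is a minimal late-chunking sketch, assuming a generic Hugging Face encoder as a stand-in: the whole document is embedded in one pass, so every token sees the full context, and chunk embeddings are pooled afterwards from the contextualized token embeddings.

```python
# Late chunking sketch: embed the whole document once, then pool per-chunk spans
# from the contextualized token embeddings. The model name is an illustrative stand-in.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

document = "Berlin is the capital of Germany. The city has 3.8 million inhabitants."
chunks = ["Berlin is the capital of Germany.", "The city has 3.8 million inhabitants."]

# One forward pass over the full document.
enc = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]
with torch.no_grad():
    token_embs = model(**enc).last_hidden_state[0]

# Pool the token embeddings that fall inside each chunk's character span.
chunk_embeddings = []
start = 0
for chunk in chunks:
    begin = document.index(chunk, start)
    end = begin + len(chunk)
    mask = (offsets[:, 0] >= begin) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
    chunk_embeddings.append(token_embs[mask].mean(dim=0))
    start = end

print(torch.stack(chunk_embeddings).shape)   # (num_chunks, hidden_dim)
```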
The Role of Re-rankers & Cross-Encoders
Even with a strong retriever like ColBERT, re-ranking can boost quality. Aamir confirmed they use cross-encoders internally.
A cross-encoder looks at the query and a candidate document together, making a more accurate relevance judgment than a retriever that embeds the document in isolation.
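For reference, here is a minimal re-ranking sketch using the public sentence-transformers CrossEncoder class with an MS MARCO checkpoint; Mixedbread's internal cross-encoders are their own models.

```python
# Re-ranking retrieved candidates with a cross-encoder.
# The checkpoint is a public MS MARCO model, not Mixedbread's internal re-ranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does binary quantization speed up retrieval?"
candidates = [
    "Binary quantization stores one bit per dimension and compares vectors with Hamming distance.",
    "The React codebase contains roughly 60 million tokens.",
    "Cross-encoders read the query and document together for a more accurate relevance score.",
]

# Score each (query, candidate) pair jointly, then sort by score.
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```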
The next frontier is list-wise re-ranking, where the model sees the query and the entire list of candidates at once. It can answer questions like "which is the fastest?" by comparing all options, but it's currently too slow and expensive for most production systems.
Aamir is also excited about learnable scoring functions. Instead of burning GPUs to create complex embeddings only to compare them with cosine similarity, the scoring function itself could be learned, improving relevance.
How to Get Started in Retrieval Research
Retrieval is more accessible than training foundational LLMs. You can get started with a MacBook. Aamir's advice:
- Read the Fundamentals: Start with the original SBERT (Sentence-BERT) paper to understand the basics of modern embedding models.
- Learn by Doing: Use libraries like sentence-transformers to train your own models; the documentation is excellent (a minimal training sketch follows this list).
- Read In-Depth Guides: The Mixedbread blog offers deep dives into their training techniques.
- Stay Updated: Follow resources like the Information Retrieval Substack to keep up with the latest research.
- Embrace the Struggle: Build things yourself. Don't rely on AI to write all the code. The learning happens when you debug PyTorch and CUDA errors.
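As a starting point for the "learn by doing" advice, here is a minimal fine-tuning sketch using sentence-transformers' classic fit API with in-batch negatives; the data is a toy placeholder, and real training needs a proper dataset and evaluation.

```python
# Minimal fine-tuning sketch with sentence-transformers (classic fit API, toy data).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy (query, relevant passage) pairs; MultipleNegativesRankingLoss treats the
# other passages in the same batch as negatives.
train_examples = [
    InputExample(texts=["what is multi-vector search?",
                        "Multi-vector search keeps one embedding per token instead of one per chunk."]),
    InputExample(texts=["what is binary quantization?",
                        "Binary quantization stores a single bit per embedding dimension."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=1)
model.save("my-retrieval-model")
```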
Conclusion
Multi-vector search is heavier and harder to engineer than standard RAG, but with quantization making it affordable, its quality gains are now within reach. If you're hitting a ceiling with semantic search, this is the architecture to investigate next. Mixedbread offers an API that does it for you.
My conversation with Aamir reinforced that AI quality is tied to context quality. As models get smarter, the tools we use to feed them information must get smarter too.
If you're working on complex retrieval problems, the techniques discussed here are the new baseline. If you're an engineer passionate about building high-performance, distributed systems, Mixedbread is hiring.
Explore their work and open positions at mixedbread.com.