Find Where the Evidence First Goes Missing
Most teams evaluate RAG and agentic systems from the wrong end: the final answer. They score the output, argue about the prompt, and miss the earlier handoff where the evidence disappeared.
Final-answer review tells you whether the system failed. It usually does not tell you where. By the time you read the answer, even a simple search-backed app or tool-using agent has already rewritten the request, searched or called tools, ranked candidates, selected context, and generated from whatever survived.
The useful question is this: at which step did the correct evidence disappear, change meaning, or lose priority?
A useful trace frames the system as a chain of handoffs, letting you inspect what went in and what came out at each step. The source may never have been retrieved, may have been buried during re-ranking, may have dropped out of the selected context, or may have reached the model and been ignored. Each failure has a different fix, but they all look identical in the final output.
The examples below use the same trace shape twice: first on a traditional RAG-style pipeline (query rewriting, hybrid search, re-ranking), then on a Codex trace (tool calls, file reads, and text generation).
The handoffs worth tracing
A useful trace viewer helps you ignore most of the system at first. Start with just the inputs and outputs of each request. Other logs may matter later, but only after the trace shows which handoff failed.
In a traditional RAG pipeline, those handoffs are usually query rewrite, candidate retrieval (keyword and/or semantic search), merging and re-ranking, top-k filtering, and answer generation. In an agent trace, the same role may be played by tool calls: grep, file reads, browser steps, database queries, or MCP outputs. The review question stays the same.
- Original request: Capture the user question and any hidden constraints: permissions, product area, time range, language, account tier, task type, or explicit tool limits.
- Search or tool input: Capture the query, tool arguments, file path, URL, database query, or prompt that drove the next step.
- Raw output: Capture the candidates or tool output before later filtering. Keep more than the final top few.
- Selection step: Capture the before and after when the system ranks, merges, filters, trims, summarizes, or chooses which branch to follow.
- Final context and answer: Capture the exact evidence sent to the answer model and the final answer. Do not settle for document IDs.
- Review judgment: Mark the first bad transition. That is the fastest way to turn a bad answer into an engineering task.
The minimum useful trace
Start with the handoffs you can see. For each handoff, record four things:
- Input: What question, query, candidate list, document set, or context block entered this step?
- Operation: What ran? Keyword search, grep, semantic search, hybrid search, re-ranking, an agent tool call, a candidate filter, a prompt, a policy check.
- Output: What came out? Include the final context sent to the model and whatever metadata helps you make sense of it.
- Reviewer note: Did the right evidence survive this handoff? If not, what disappeared, lost priority, or became unsupported?
Then mark the first handoff where the correct evidence was gone, changed meaning, or lost priority. That is enough to debug most failures.
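The four-field record above can be kept as a plain data structure. A minimal sketch, with hypothetical step names and field choices that are one reasonable layout rather than a required schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Handoff:
    """One step in the evidence path: input -> operation -> output."""
    operation: str           # e.g. "query rewrite", "merge", "context packing"
    input: str               # what entered this step
    output: str              # what came out
    evidence_survived: bool  # reviewer judgment for this handoff
    note: str = ""           # what disappeared, lost priority, or became unsupported

def first_bad_transition(trace: List[Handoff]) -> Optional[Handoff]:
    """Return the first handoff where the correct evidence was lost, or None."""
    for step in trace:
        if not step.evidence_survived:
            return step
    return None

# Hypothetical trace mirroring the audit-log example below:
trace = [
    Handoff("query rewrite", "Can I shorten audit log retention to 30 days?",
            "audit log retention 30 days shorten", True),
    Handoff("merge", "keyword + semantic candidates",
            "data-retention-policy at rank 1", False,
            "audit-log-retention fell from rank 1 to rank 6"),
    Handoff("answer", "generic data-retention paragraph",
            "Yes, 30 days is allowed.", False,
            "answered from the wrong evidence"),
]

print(first_bad_transition(trace).operation)  # -> merge
```

Everything after the first failed handoff is suspect, which is why the finder stops at the first `False` rather than collecting all of them.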
Example 1: a simple search trace
Start with a small search-backed answer system: keyword search plus semantic search, merged before the final context is sent to the model.
The user asks:
Can I shorten audit log retention to 30 days?
The final answer is not hard. The trace matters because it shows whether the correct source survived each handoff.
| Step | Input | Operation | Output | What the reviewer learns |
|---|---|---|---|---|
| Request | Can I shorten audit log retention to 30 days? | User question | One policy question with a requested retention period | The answer needs the audit-log retention rule, not a generic data-retention policy. |
| Query rewrite | User question | LLM rewrite | audit log retention 30 days shorten | The rewrite preserved the key constraint. |
| Keyword retrieval | Rewritten query | Keyword search top 20 | audit-log-retention at rank 1, security-dashboard at rank 7 | Exact keyword search found the right source. |
| Semantic retrieval | Original question | Semantic search top 20 | data-retention-policy at rank 1, audit-log-retention at rank 4 | Semantic search found related policy docs, but ranked the precise source lower. |
| Merge | Keyword and semantic candidates | Combined ranking | audit-log-retention at rank 1, data-retention-policy at rank 2 | The merge kept the precise source above the broader related source. |
| Context selection | Top merged candidates | Context packer | Exact audit-log paragraph sent to the model | The answer sentence survived packing and trimming. |
| Answer | Final context | Answer prompt | No. Audit log retention cannot be shortened to 30 days. | The answer is supported by the context the model actually saw. |
This healthy trace is small, but it separates the bugs. If keyword search missed the source, fix query coverage. If semantic search buried the precise policy under broad retention docs, fix semantic search or source weighting. If the combined ranking put data-retention-policy first, fix merge logic. If the final context dropped the audit-log paragraph, fix context packing.
The failed version is more useful:
| Step | Input | Operation | Output | What the reviewer learns |
|---|---|---|---|---|
| Keyword retrieval | Rewritten query | Keyword search top 20 | audit-log-retention at rank 1 | The right source was found. Retrieval coverage was not the first bug. |
| Semantic retrieval | Original question | Semantic search top 20 | data-retention-policy at rank 1, audit-log-retention at rank 4 | Semantic search preferred the broader policy. That is not fatal yet. |
| Merge | Keyword and semantic candidates | Combined ranking | data-retention-policy at rank 1, audit-log-retention at rank 6 | The right source lost priority during merge. This is the first bad handoff. |
| Context selection | Top merged candidates | Context packer | Generic data-retention paragraph sent to the model | The final context preserved the earlier mistake. |
| Answer | Final context | Answer prompt | Yes, 30 days is allowed under the general retention policy. | The model answered from the wrong evidence. The answer prompt is not the first fix. |
The first bad handoff is the merge. Fix source weighting, merge logic, or context selection so the precise audit-log policy reaches the answer before you touch the answer prompt.
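One common merge strategy is reciprocal rank fusion, where each candidate scores the sum of `weight / (k + rank)` across the lists it appears in. A sketch with invented document IDs matching the example; the `k=60` constant and the list contents are assumptions, not values from the trace:

```python
def rrf_merge(ranked_lists, weights=None, k=60):
    """Reciprocal rank fusion: score = sum(weight / (k + rank)) over lists."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword  = ["audit-log-retention", "security-dashboard"]
semantic = ["data-retention-policy", "backup-policy", "gdpr-overview",
            "audit-log-retention"]

# Even with equal weights, the precise source wins: it sits at rank 1 in
# the keyword list and still appears mid-list in the semantic one, while
# the broad policy doc only appears once.
print(rrf_merge([keyword, semantic])[0])  # -> audit-log-retention
```

If the precise source still loses after fusion, raising the keyword list's weight is a smaller, more auditable change than rewriting the answer prompt.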
Example 2: an agent investigation trace
The same shape holds when the question is less tidy and the system has to investigate:
Why did Acme Corp's VIP conversation get assigned to General Support outside business hours, and how should we change routing so this does not happen again?
The source set has synthetic support records and a local clone of Chatwoot docs: incident records, customer profiles, policy docs, admin notes, and product documentation.
The answer the records support: Acme was assigned to General Support because the default Support Email assignment ran before VIP enrichment. The conversation was assigned at 22:17:34Z. The vip label arrived at 22:18:02Z. No reassignment ran after that label update. Business hours were not the deciding factor. The fix is to apply renewal and VIP labels before team assignment, then add a fallback reassignment rule for late VIP labels.
The trace matters because each wrong answer points to a different fix. If support search missed the ticket and conversation, fix retrieval coverage. If docs hits dominated the selected context, fix ranking or source weighting. If context selection dropped the routing-change note, fix context assembly. If the final context contained the right evidence and the model still blamed business hours, fix generation instructions.
For an agent trace, the unit is often the tool call, not the search result. Tool-call arguments are inputs. Tool outputs are evidence. Assistant commentary can become reviewer notes, but it should not be treated as evidence.
The raw Codex trace behind this answer is not a normal search-service log. It has session metadata, user messages, assistant commentary, tool calls, tool outputs, and the final response. It still decomposes cleanly because every tool call has an input and output.
The user instructed the agent to search only local files, run at most five rg commands, and answer with sources and uncertainty. The trace lets a reviewer check both retrieval quality and agent discipline.
| Step | Input | Operation | Output | What the reviewer learns |
|---|---|---|---|---|
| Session constraints | Read-only workspace, no network, local search roots | Codex session metadata | The agent can only inspect local files | The trace explains why there are no web sources or live product checks. |
| User task | Chatwoot support-routing question plus evidence requirements | User prompt | Search roots, rg limit, source requirements, uncertainty requirement | The reviewer can judge the agent against explicit constraints. |
| File inventory | rg --files corpus/support-records chatwoot-docs | Shell tool call | Available support records and docs files | The agent first checks source-set shape instead of guessing paths. |
| Broad incident search | Acme, VIP, business hours, General Support, routing, after hours | Shell rg call | Incident, policy, business-hours, and routing-order records | The first substantive search finds the core evidence. |
| Docs mechanism search | inbox, assignment, automation, rule, label, business hours | Shell rg call over docs | Related but mixed Chatwoot docs hits | The trace exposes a weak branch before it reaches the answer. |
| Exact source reads | Ticket, conversation, VIP policy, business-hours policy, routing-change note | sed file reads | Full local evidence for the important records | The agent moves from search hits to source text before answering. |
| Follow-up docs search | auto assignment, automation rule, inbox, default routing, rule order | Shell rg call over docs | Weak confirmation and coverage limits | The agent checks the likely missing doc area instead of overstating. |
| Final answer | Selected evidence from the tool outputs | Assistant response | Cause, recommended routing change, sources, uncertainty | The reviewer can map every claim back to prior tool output. |
You are not limited to a clean service that logs search results in one field. A coding agent might use rg, sed, jq, browser tools, database queries, test runners, or custom MCP tools. The handoff questions stay the same:
What was the input? What operation ran? What came out? What should the reviewer notice?
The final answer is valid only if earlier tool output supports it.
This also catches agent-specific failures that a normal retrieval dashboard would miss.
Did the agent obey the search constraint? Did it inspect the right roots? Did it burn searches on broad queries before reading exact records? Did it use the noisy docs branch too heavily? Did it preserve uncertainty from missing docs coverage?
A final-answer review can miss all of that. A trace makes it visible.
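Because the trace records every tool call, constraint checks like the rg budget can be mechanical. A sketch over hypothetical tool-call records (the commands and file names are invented for illustration):

```python
# Hypothetical tool-call records pulled from an agent trace.
tool_calls = [
    {"tool": "shell", "command": "rg --files corpus/support-records chatwoot-docs"},
    {"tool": "shell", "command": "rg -i 'Acme|VIP|business hours' corpus/support-records"},
    {"tool": "shell", "command": "rg -i 'assignment|automation|label' chatwoot-docs"},
    {"tool": "shell", "command": "sed -n '1,120p' corpus/support-records/ticket.md"},
    {"tool": "shell", "command": "rg -i 'auto assignment|rule order' chatwoot-docs"},
]

RG_BUDGET = 5  # the user's explicit constraint: at most five rg commands

rg_calls = [c for c in tool_calls if c["command"].startswith("rg")]
print(f"rg calls used: {len(rg_calls)} of {RG_BUDGET}")
assert len(rg_calls) <= RG_BUDGET, "agent exceeded its search budget"
```

The same pattern works for the other discipline questions: filter the tool calls by root path, by query breadth, or by whether a read followed a search.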
What this catches
A step-level trace turns vague answer quality into specific failure modes.
The source was never found. The query was bad, the index missed it, filters were wrong, or the source set lacked coverage.
The source was found and then buried. Re-ranking, combined scoring, or deduplication made the right record invisible.
The source was selected but the useful sentence was trimmed. Chunking, summarization, or context packing removed the part that answered the question.
The answer ignored the source. The model had enough evidence but failed to use it, contradicted it, or failed to cite it.
The agent answered from the wrong branch. One tool call found the answer, another noisy branch dominated the final response.
The system failed to notice missing coverage. The trace should record important absences. "We did not find the automation rule guide" is a real finding.
Do not put those bugs in one bucket called "hallucination."
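The point of the separate buckets is that each one routes to a different part of the system. A minimal mapping, paraphrasing the failure modes above (the labels are invented shorthand):

```python
# Each failure mode maps to a different fix area; one "hallucination"
# bucket hides that structure.
FIX_AREA = {
    "source never found":         "query rewriting, index coverage, filters, corpus",
    "found then buried":          "re-ranking, merge scoring, deduplication",
    "selected but trimmed":       "chunking, summarization, context packing",
    "answer ignored the source":  "generation instructions, citation prompting",
    "wrong branch dominated":     "tool-output weighting, branch selection",
    "missing coverage unnoticed": "absence reporting in the trace",
}

print(FIX_AREA["found then buried"])
```

A reviewer note that names one of these buckets is immediately actionable; a note that says "hallucinated" is not.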
Practical trace rules
You do not need a tracing platform to get value. Start with the handoffs you already control.
Do not start with every available event. Start with the evidence path. Pull in latency, token counts, retries, cache behavior, planner state, or model settings only when they explain the first bad transition.
Use stable IDs for every source. A trace with only prose snippets is hard to audit.
Keep raw candidates before filtering. The dropped documents often explain the failure.
Record ranks and scores when they exist. They help separate retrieval coverage from ranking quality.
Save the exact context sent to the model. Document IDs are not enough.
Separate expected sources from retrieved sources. The expected source is the eval target. The retrieved source is what the system actually found.
Record filters and permissions. A correct search can look broken when tenant, access, date, or product filters silently remove the right record.
Mark the first bad transition. Do not make every trace a long essay. One clear reviewer note is usually enough.
Use the same shape for success and failure. Success traces teach you what a healthy path looks like. Failure traces show where that path diverged.
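Several of these rules (stable IDs, raw candidates kept, scores recorded, exact context saved) can live in one small logging helper. A sketch, not a definitive schema; the field names and example values are assumptions:

```python
import time

def log_handoff(trace, operation, input_, output, *, candidates=None,
                scores=None, note=""):
    """Append one handoff to a trace; keep raw candidates and scores."""
    trace.append({
        "ts": time.time(),
        "operation": operation,
        "input": input_,
        "output": output,                    # for the final step: the exact
                                             # context text, not document IDs
        "raw_candidates": candidates or [],  # pre-filter list, stable IDs
        "scores": scores or {},              # separate coverage from ranking
        "reviewer_note": note,               # first bad transition goes here
    })

trace = []
log_handoff(trace, "semantic search", "audit log retention question",
            ["data-retention-policy", "audit-log-retention"],
            candidates=["data-retention-policy", "backup-policy",
                        "audit-log-retention"],
            scores={"data-retention-policy": 0.82,
                    "audit-log-retention": 0.74})

print(trace[0]["raw_candidates"])
```

Because `raw_candidates` holds the pre-filter list, the dropped `backup-policy` entry is still auditable even though it never reached the output.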
The review question
When a search-backed answer system or tool-using agent is wrong, ask this:
At which step did the correct evidence disappear, change meaning, or lose priority?
That question is more useful than "was the answer good?"
Answer quality matters. It is just the last observable result. The engineering work happens earlier, in the handoffs between steps.
Take five bad answers. Trace each handoff. Mark the first bad transition. After that, the next fix is usually obvious.