
Find Where the Evidence First Goes Missing

May 17, 2026

Most teams evaluate RAG and agentic systems from the wrong end: the final answer. They score the output, argue about the prompt, and miss the earlier handoff where the evidence disappeared.

Final-answer review tells you whether the system failed. It usually does not tell you where. By the time you read the answer, even a simple search-backed app or tool-using agent has already rewritten the request, searched or called tools, ranked candidates, selected context, and generated from whatever survived.

The useful question is this: at which step did the correct evidence disappear, change meaning, or lose priority?

A useful trace frames the system as a chain of handoffs, letting you inspect what went in and what came out at each step. The source may never have been retrieved, may have been buried during re-ranking, may have been dropped from the selected context, or may have reached the model and been ignored. Each failure has a different fix, but they all look identical in the final output.

The examples below use the same trace shape twice: first on a traditional RAG-style pipeline (query rewriting, hybrid search, re-ranking), then on a Codex trace (tool calls, file reads, and text generation).

The handoffs worth tracing

A useful trace viewer helps you ignore most of the system at first. Start with just the inputs and outputs of each request. Other logs may matter later, but only after the trace shows which handoff failed.

In a traditional RAG pipeline, those handoffs are usually query rewrite, candidate retrieval (keyword and/or semantic search), merging and re-ranking, top-k filtering, and answer generation. In an agent trace, the same role may be played by tool calls: grep, file reads, browser steps, database queries, or MCP outputs. The review question stays the same.

Original request: Capture the user question and any hidden constraints: permissions, product area, time range, language, account tier, task type, or explicit tool limits.

Search or tool input: Capture the query, tool arguments, file path, URL, database query, or prompt that drove the next step.

Raw output: Capture the candidates or tool output before later filtering. Keep more than the final top few.

Selection step: Capture the before and after when the system ranks, merges, filters, trims, summarizes, or chooses which branch to follow.

Final context and answer: Capture the exact evidence sent to the answer model and the final answer. Do not settle for document IDs.

Review judgment: Mark the first bad transition. That is the fastest way to turn a bad answer into an engineering task.

The minimum useful trace

Start with the handoffs you can see. For each handoff, record four things:

  1. Input: What question, query, candidate list, document set, or context block entered this step?
  2. Operation: What ran? Keyword search, grep, semantic search, hybrid search, re-ranking, an agent tool call, a candidate filter, a prompt, a policy check.
  3. Output: What came out? Include the final context sent to the model and whatever metadata helps you make sense of it.
  4. Reviewer note: Did the right evidence survive this handoff? If not, what disappeared, lost priority, or became unsupported?

Then mark the first handoff where the correct evidence disappeared, changed meaning, or lost priority. That is enough to debug most failures.
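In code, that record can be as small as a dataclass. A minimal sketch in Python, with assumed field names rather than any standard tracing schema:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Handoff:
    operation: str           # e.g. "keyword_search", "merge", "context_packer"
    input: Any               # query, candidate list, or context block that entered
    output: Any              # candidates, tool output, or context that came out
    reviewer_note: str = ""  # what disappeared, lost priority, or went unsupported
    evidence_survived: bool = True  # the reviewer's judgment on this transition

def first_bad_handoff(trace: list[Handoff]) -> Handoff | None:
    """Return the first handoff where the correct evidence did not survive."""
    return next((h for h in trace if not h.evidence_survived), None)
```

The reviewer sets `evidence_survived` by hand; nothing here is automated yet. The value is in forcing one judgment per transition.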

Example 1: a simple search trace

Start with a small search-backed answer system: keyword search plus semantic search, merged before the final context is sent to the model.

The user asks:

Can I shorten audit log retention to 30 days?

The final answer is not hard. The trace matters because it shows whether the correct source survived each handoff.

| Step | Input | Operation | Output | What the reviewer learns |
| --- | --- | --- | --- | --- |
| Request | "Can I shorten audit log retention to 30 days?" | User question | One policy question with a requested retention period | The answer needs the audit-log retention rule, not a generic data-retention policy. |
| Query rewrite | User question | LLM rewrite | "audit log retention 30 days shorten" | The rewrite preserved the key constraint. |
| Keyword retrieval | Rewritten query | Keyword search, top 20 | audit-log-retention at rank 1, security-dashboard at rank 7 | Exact keyword search found the right source. |
| Semantic retrieval | Original question | Semantic search, top 20 | data-retention-policy at rank 1, audit-log-retention at rank 4 | Semantic search found related policy docs, but ranked the precise source lower. |
| Merge | Keyword and semantic candidates | Combined ranking | audit-log-retention at rank 1, data-retention-policy at rank 2 | The merge kept the precise source above the broader related source. |
| Context selection | Top merged candidates | Context packer | Exact audit-log paragraph sent to the model | The answer sentence survived packing and trimming. |
| Answer | Final context | Answer prompt | "No. Audit log retention cannot be shortened to 30 days." | The answer is supported by the context the model actually saw. |

This healthy trace is small, but it separates the bugs. If keyword search missed the source, fix query coverage. If semantic search buried the precise policy under broad retention docs, fix semantic search or source weighting. If the combined ranking put data-retention-policy first, fix merge logic. If the final context dropped the audit-log paragraph, fix context packing.
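That mapping from first bad step to first fix is stable enough to write down. A sketch using hypothetical step labels for this pipeline:

```python
# Hypothetical step labels for this example pipeline,
# mapped to the component worth fixing first.
FIX_FOR_FIRST_BAD_STEP = {
    "keyword_retrieval": "query rewriting and index coverage",
    "semantic_retrieval": "embedding model or source weighting",
    "merge": "merge logic and combined scoring",
    "context_selection": "context packing and trimming",
    "answer": "generation prompt and instructions",
}
```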

The failed version is more useful:

| Step | Input | Operation | Output | What the reviewer learns |
| --- | --- | --- | --- | --- |
| Keyword retrieval | Rewritten query | Keyword search, top 20 | audit-log-retention at rank 1 | The right source was found. Retrieval coverage was not the first bug. |
| Semantic retrieval | Original question | Semantic search, top 20 | data-retention-policy at rank 1, audit-log-retention at rank 4 | Semantic search preferred the broader policy. That is not fatal yet. |
| Merge | Keyword and semantic candidates | Combined ranking | data-retention-policy at rank 1, audit-log-retention at rank 6 | The right source lost priority during merge. This is the first bad handoff. |
| Context selection | Top merged candidates | Context packer | Generic data-retention paragraph sent to the model | The final context preserved the earlier mistake. |
| Answer | Final context | Answer prompt | "Yes, 30 days is allowed under the general retention policy." | The model answered from the wrong evidence. The answer prompt is not the first fix. |

The first bad handoff is the merge. Fix source weighting, merge logic, or context selection so the precise audit-log policy reaches the answer before you touch the answer prompt.
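One common shape for that fix is a weighted merge. A sketch of weighted reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs; the weights are illustrative and would be tuned against traced failures like this one:

```python
def weighted_rrf(ranked_lists: dict[str, list[str]],
                 weights: dict[str, float],
                 k: int = 60) -> list[str]:
    """Merge ranked lists: score(doc) = sum of weight / (k + rank)."""
    scores: dict[str, float] = {}
    for retriever, docs in ranked_lists.items():
        w = weights.get(retriever, 1.0)
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Up-weighting exact keyword hits keeps audit-log-retention
# above the broader data-retention-policy after the merge.
merged = weighted_rrf(
    {"keyword": ["audit-log-retention", "security-dashboard"],
     "semantic": ["data-retention-policy", "audit-log-retention"]},
    weights={"keyword": 2.0, "semantic": 1.0},
)
```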

Example 2: an agent investigation trace

The same shape holds when the question is less tidy and the system has to investigate:

Why did Acme Corp's VIP conversation get assigned to General Support outside business hours, and how should we change routing so this does not happen again?

The source set has synthetic support records and a local clone of Chatwoot docs: incident records, customer profiles, policy docs, admin notes, and product documentation.

The correct answer, assembled from those records: Acme was assigned to General Support because the default Support Email assignment ran before VIP enrichment. The conversation was assigned at 22:17:34Z. The vip label arrived at 22:18:02Z. No reassignment ran after that label update. Business hours were not the deciding factor. The fix is to apply renewal and VIP labels before team assignment, then add a fallback reassignment rule for late VIP labels.

The trace matters because each wrong answer points to a different fix. If support search missed the ticket and conversation, fix retrieval coverage. If docs hits dominated the selected context, fix ranking or source weighting. If context selection dropped the routing-change note, fix context assembly. If the final context contained the right evidence and the model still blamed business hours, fix generation instructions.

For an agent trace, the unit is often the tool call, not the search result. Tool-call arguments are inputs. Tool outputs are evidence. Assistant commentary can become reviewer notes, but it should not be treated as evidence.

The raw Codex trace behind this answer is not a normal search-service log. It has session metadata, user messages, assistant commentary, tool calls, tool outputs, and the final response. It still decomposes cleanly because every tool call has an input and output.

The user instructed the agent to search only local files, run at most five rg commands, and answer with sources and uncertainty. The trace lets a reviewer check both retrieval quality and agent discipline.

| Step | Input | Operation | Output | What the reviewer learns |
| --- | --- | --- | --- | --- |
| Session constraints | Read-only workspace, no network, local search roots | Codex session metadata | The agent can only inspect local files | The trace explains why there are no web sources or live product checks. |
| User task | Chatwoot support-routing question plus evidence requirements | User prompt | Search roots, rg limit, source requirements, uncertainty requirement | The reviewer can judge the agent against explicit constraints. |
| File inventory | rg --files corpus/support-records chatwoot-docs | Shell tool call | Available support records and docs files | The agent first checks source-set shape instead of guessing paths. |
| Broad incident search | Acme, VIP, business hours, General Support, routing, after hours | Shell rg call | Incident, policy, business-hours, and routing-order records | The first substantive search finds the core evidence. |
| Docs mechanism search | inbox, assignment, automation, rule, label, business hours | Shell rg call over docs | Related but mixed Chatwoot docs hits | The trace exposes a weak branch before it reaches the answer. |
| Exact source reads | Ticket, conversation, VIP policy, business-hours policy, routing-change note | sed file reads | Full local evidence for the important records | The agent moves from search hits to source text before answering. |
| Follow-up docs search | auto assignment, automation rule, inbox, default routing, rule order | Shell rg call over docs | Weak confirmation and coverage limits | The agent checks the likely missing doc area instead of overstating. |
| Final answer | Selected evidence from the tool outputs | Assistant response | Cause, recommended routing change, sources, uncertainty | The reviewer can map every claim back to prior tool output. |
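Reshaping a raw tool-call log into rows like these is mostly bookkeeping. A sketch, assuming a simplified event format (paired `tool_call` and `tool_output` events) rather than Codex's actual log schema:

```python
def tool_events_to_handoffs(events: list[dict]) -> list[dict]:
    """Pair tool-call arguments (inputs) with tool outputs (evidence).

    Assistant commentary becomes a reviewer note, never evidence.
    """
    handoffs, pending = [], None
    for ev in events:
        if ev["type"] == "tool_call":
            pending = {"operation": ev["name"], "input": ev["args"]}
        elif ev["type"] == "tool_output" and pending is not None:
            pending["output"] = ev["content"]
            handoffs.append(pending)
            pending = None
        elif ev["type"] == "assistant_commentary" and handoffs:
            handoffs[-1]["note"] = handoffs[-1].get("note", "") + ev["content"]
    return handoffs
```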

You are not limited to a clean service that logs search results in one field. A coding agent might use rg, sed, jq, browser tools, database queries, test runners, or custom MCP tools. The handoff questions stay the same:

What was the input? What operation ran? What came out? What should the reviewer notice?

The final answer is valid only if earlier tool output supports it.
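When sources carry stable IDs, that rule is checkable. A sketch with assumed field names:

```python
def unsupported_sources(cited_ids: list[str],
                        tool_outputs: list[dict]) -> list[str]:
    """Return cited source IDs that never appeared in earlier tool output."""
    seen = {sid for out in tool_outputs for sid in out.get("source_ids", [])}
    return [sid for sid in cited_ids if sid not in seen]
```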

This also catches agent-specific failures that a normal retrieval dashboard would miss.

Did the agent obey the search constraint? Did it inspect the right roots? Did it burn searches on broad queries before reading exact records? Did it use the noisy docs branch too heavily? Did it preserve uncertainty from missing docs coverage?
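Those checks are mechanical once tool calls are handoff records. A sketch, assuming each record stores the rg paths it searched; the limit and roots come from the user's instructions above:

```python
def check_search_discipline(
    handoffs: list[dict],
    max_rg_calls: int = 5,
    allowed_roots: tuple[str, ...] = ("corpus/", "chatwoot-docs/"),
) -> list[str]:
    """Flag violations of the explicit agent constraints recorded in the trace."""
    violations = []
    rg_calls = [h for h in handoffs if h["operation"] == "rg"]
    if len(rg_calls) > max_rg_calls:
        violations.append(f"{len(rg_calls)} rg calls, limit is {max_rg_calls}")
    for call in rg_calls:
        for path in call["input"].get("paths", []):
            if not path.startswith(allowed_roots):
                violations.append(f"search outside allowed roots: {path}")
    return violations
```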

A final-answer review can miss all of that. A trace makes it visible.

What this catches

A step-level trace turns vague answer quality into specific failure modes.

The source was never found. The query was bad, the index missed it, filters were wrong, or the source set lacked coverage.

The source was found and then buried. Re-ranking, combined scoring, or deduplication made the right record invisible.

The source was selected but the useful sentence was trimmed. Chunking, summarization, or context packing removed the part that answered the question.

The answer ignored the source. The model had enough evidence but failed to use it, contradicted it, or failed to cite it.

The agent answered from the wrong branch. One tool call found the answer, but another, noisier branch dominated the final response.

The system failed to notice missing coverage. The trace should record important absence. "We did not find the automation rule guide" is a real finding.

Do not put those bugs in one bucket called "hallucination."
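Naming the buckets keeps them separate. The labels below are ours, not a standard taxonomy:

```python
from enum import Enum

class EvidenceFailure(Enum):
    NEVER_FOUND = "source never retrieved"
    FOUND_THEN_BURIED = "retrieved, then lost priority in ranking or merge"
    SELECTED_BUT_TRIMMED = "selected, but the useful sentence was cut"
    IGNORED_BY_MODEL = "in context, but unused, contradicted, or uncited"
    WRONG_BRANCH = "a noisy branch dominated the final response"
    UNNOTICED_ABSENCE = "missing coverage never recorded as a finding"
```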

Practical trace rules

You do not need a tracing platform to get value. Start with the handoffs you already control.

Do not start with every available event. Start with the evidence path. Pull in latency, token counts, retries, cache behavior, planner state, or model settings only when they explain the first bad transition.

Use stable IDs for every source. A trace with only prose snippets is hard to audit.

Keep raw candidates before filtering. The dropped documents often explain the failure.

Record ranks and scores when they exist. They help separate retrieval coverage from ranking quality.

Save the exact context sent to the model. Document IDs are not enough.

Separate expected sources from retrieved sources. The expected source is the eval target. The retrieved source is what the system actually found. A sketch of this check follows these rules.

Record filters and permissions. A correct search can look broken when tenant, access, date, or product filters silently remove the right record.

Mark the first bad transition. Do not make every trace a long essay. One clear reviewer note is usually enough.

Use the same shape for success and failure. Success traces teach you what a healthy path looks like. Failure traces show where that path diverged.
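For evals where each question has a known expected source, the first-bad-transition check from the rule above can be automated. A sketch, assuming every retrieval-side handoff records the candidate IDs it emitted:

```python
def first_drop(expected_id: str, handoffs: list[dict]) -> str | None:
    """Return the first operation whose output no longer contains the expected source."""
    for h in handoffs:
        if expected_id not in h["output_ids"]:
            return h["operation"]
    return None  # the expected source survived into the final context

first_drop("audit-log-retention", [
    {"operation": "keyword_search",
     "output_ids": ["audit-log-retention", "security-dashboard"]},
    {"operation": "merge", "output_ids": ["data-retention-policy"]},
])  # -> "merge"
```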

The review question

When a search-backed answer system or tool-using agent is wrong, ask this:

At which step did the correct evidence disappear, change meaning, or lose priority?

That question is more useful than "was the answer good?"

Answer quality matters. It is just the last observable result. The engineering work happens earlier, in the handoffs between steps.

Take five bad answers. Trace each handoff. Mark the first bad transition. After that, the next fix is usually obvious.
