Citations in the Key of RAG

A Problem

What are some ways I can implement source citations in my RAG chatbot?

A slackbot answering a user query and displaying its sources

Or: here’s-a-bunch-of-things-I-tried-and-kinda-worked

The Context

  • You should already have some baseline familiarity with building a “RAG” system. Acknowledging that “RAG” is, like just about everything else, a very overloaded term right now, I’m defining “RAG” as: semantic cosine-similarity search over a vector database (pgvector).
  • You’re building a “chat with your docs” chatbot, at a relatively small scale. This particular example will use:
    • Slack Bolt framework
    • Typescript / AI SDK
    • OpenAI provider

A chatbot is pretty simple to implement. It’s just an API call to OpenAI. For “RAG”, give it access to a tool that exposes the data. Run that in a “loop”. Let it call the tool to search as it needs. Then return the response to the user. The LLM reads the content returned from the knowledge base tool and determines the answer.

For example:

   "What's our return policy?"
                |
                v
          +------------+
          |    LLM     |
          +------------+
                |
                v
      searches knowledge base
                |
                v
   +--------------------------------------+
   | Similar documents (3)                |
    | • slop.md#returns (Chunk 12)         |
   |   "...Returns accepted within 30..." |
   | • faq.md#refunds    (Chunk 3)        |
   |   "...Refund period is 30 days..."   |
   | • slack-thread#915 (Chunk 1)         |
   |   "...Escalations contact ops..."    |
   +--------------------------------------+
           |                    |
           | relevant chunks    | ignored chunk
           v                    v
   +------------------+      +----------------+
    | Chunks 12 & 3    |      | Chunk 1 ignored|
   +------------------+      +----------------+
           |
           v
      LLM ingests
           |
           v
   "Our return policy is 30 days."
          (cites slop.md#returns, faq.md#refunds)
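
With the AI SDK, that loop is only a few lines. A minimal sketch, assuming AI SDK v5; the search_knowledge_base tool, its return shape, and the hard-coded result are illustrative stand-ins, not the real bot’s code:

import { generateText, stepCountIs, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const searchKnowledgeBase = tool({
	description: "Search the knowledge base for documents relevant to a query.",
	inputSchema: z.object({ query: z.string() }),
	execute: async ({ query }) => {
		// Stand-in for the real retrieval: a cosine-similarity query against pgvector
		return [{ title: "slop.md#returns", snippet: "...Returns accepted within 30..." }];
	},
});

const result = await generateText({
	model: openai("gpt-4o-mini"),
	system: "Answer questions using the knowledge base tool.",
	prompt: "What's our return policy?",
	tools: { search_knowledge_base: searchKnowledgeBase },
	stopWhen: stepCountIs(5), // keep looping until the model answers or hits 5 steps
});

// result.text is what goes back to the user in Slack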

The challenge: how do you accurately identify which documents were actually used vs. which were just retrieved?

while also working within these constraints:

  • Budget - the agent has to be as low-cost as possible, meaning the dumbest, cheapest model
  • Latency - while also being as fast as possible
  • and accurate
  • and a UX that doesn’t suck

The Naive Approach

The obvious first attempt: show all documents returned by the vector search.
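
In code, the naive version is just: collect a title from every search hit in every step and call that the source list. A sketch, assuming the generateText result and tool from the earlier sketch and AI SDK v5’s steps/toolResults shape:

// Naive: every retrieved document becomes a "source", whether the model used it or not
const retrievedSources = result.steps
	.flatMap((step) => step.toolResults)
	.filter((r) => r.toolName === "search_knowledge_base")
	.flatMap((r) => (r.output as { title: string }[]).map((doc) => doc.title));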

The issue is that the retrieval step often returns 3-5 documents based on semantic similarity. But the LLM might:

  • Use only 1-2 of those documents to answer the question
  • Find that none of the documents contain the answer and respond “I don’t have that information”
  • Use parts of some documents but ignore others

If you show all retrieved documents as “sources,” you’re not being honest. The user experience looks like “I’m sorry, I can’t find any information about the return policy. Sources: Unrelated Document 1, Unrelated Document 2, Unrelated Document 3”

What we need: A way to identify which specific documents the LLM actually used when generating its response, not just which documents we retrieved.

The Core Challenge

The LLM needs to see both the question AND the tool results (the retrieved documents) to accurately determine which sources informed its answer. You can’t just look at the final answer text and extract citations—there’s no way to know which documents were used without seeing them in context.

This means whatever solution we build needs to happen during the agent loop, not after.

In order to display this data in a pretty manner via the Slack API, we also need a simple JSON structure:

	{
		"response": "the actual response the user sees",
		"sources": ["Source 1", "Source 2"]
	}
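
For what it’s worth, rendering that shape in Slack is straightforward with Block Kit: a section block for the answer, a context block for the sources. A sketch, where reply is the parsed object above and say comes from the Bolt handler:

await say({
	text: reply.response, // plain-text fallback for notifications
	blocks: [
		{ type: "section", text: { type: "mrkdwn", text: reply.response } },
		{
			type: "context",
			elements: [{ type: "mrkdwn", text: `Sources: ${reply.sources.join(", ")}` }],
		},
	],
});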

To recap the problem for the 30th time: how can we extract structured data from an LLM response that used tool calling?

Attempt 1: XML Tags

The approach: Prompt engineering. Have the LLM output <source> XML tags inline with the response, then parse them. Beg it to output the sources.

Example: “When calling the <search_tool>, YOU MUST ALWAYS!!!!!! cite the source using a <source> tag. For example: <source>Slop Factory</source>”

Then, after the call, manually parse the string, pull out the <source> tags, strip them from the response text, and return a new JSON object with the cleaned response and a sources array.
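
The parsing half is the easy part. A sketch:

// Pull every <source> tag out of the completion, then strip the tags from the visible text
const SOURCE_TAG = /<source>(.*?)<\/source>/g;

function extractSources(raw: string): { response: string; sources: string[] } {
	const sources = [...raw.matchAll(SOURCE_TAG)].map((match) => match[1].trim());
	const response = raw.replace(SOURCE_TAG, "").trim();
	return { response, sources: [...new Set(sources)] };
}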

Result with GPT-4o-mini: Unreliable. This model won’t reliably follow the format instructions. It worked for some simple queries at first, but had too many failures over time not following the prompt.

Result with GPT-5-mini: Performed significantly better at only a $0.10 increase in cost. The model followed instructions and output the XML tags correctly and much more consistently.

The catch: GPT-5-mini is painfully slow because it’s a reasoning model. Benchmarks showed 31 seconds (with the default “medium” reasoning) vs 6.6 seconds for the same query—4.68x slower than GPT-4o-mini. I later discovered that setting the reasoning_effort to “low” significantly improved performance relative to “medium” but not enough.

Another side reason I didn’t want to use GPT-5-mini: the free tier of GitHub Models has a much lower context limit and I really like being able to prototype and iterate without having to pay per token. All of this costs fractions of a penny, but psychologically, it’s exhausting knowing every fuck up has a price tag.

Verdict: So while gpt-5-mini worked well, our UX and latency constraints require us to keep searching. Users won’t wait 31 seconds for a Slack response.

Attempt 2: Structured Output

The approach: Use experimental_output with GPT-4o-mini and a Zod schema to extract citations. The AI SDK has an abstraction over OpenAI’s structured output feature that is very experimental, but it does force JSON output. It’s literally part of the API.
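
Roughly what that looks like, as a sketch assuming AI SDK v5 and the same search tool from the earlier sketch:

import { generateText, Output, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const result = await generateText({
	model: openai("gpt-4o-mini"),
	prompt: userMessage, // the incoming Slack message text (placeholder)
	tools: { search_knowledge_base: searchKnowledgeBase },
	stopWhen: stepCountIs(5),
	// Layers a schema on top of normal text generation + tool calling
	experimental_output: Output.object({
		schema: z.object({
			response: z.string(),
			sources: z.array(z.string()),
		}),
	}),
});

const { response, sources } = result.experimental_output;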

Initial result: It worked… okay.

The problems:

  1. Simple messages like “hi” or “hello” would randomly fail Zod validation
  2. Hit the infinite newline bug where GPT-4o outputs \n characters endlessly instead of JSON, burning tokens behind the scenes.
  3. The AI SDK’s experimental_output uses response_format: { type: "json_object" } which is less reliable than strict schema mode

Verdict: Fast but too brittle for production. Random failures are unacceptable. Different models also have different results.

Attempt 3: Two-Phase Approach

Another constraint I haven’t mentioned yet is that methods like the AI SDK’s generateObject don’t support tool calling. So that’s off the table for us, but in theory, we could’ve just used generateObject() to get back an object matching a JSON schema.

We can’t use generateObject directly or rely on experimental_output, but we can chain a second call and have the LLM format the final response into JSON.

Alternatively, we could also try and brute force some regex but part of the challenge with using a low-power model like 4o-mini is that the output is unpredictable. Sometimes it’ll inline a source, sometimes it’ll display it as “Source: ”. Adding more instructions also adds more cost in input tokens and decreases performance in other areas.

The approach:

  1. Let the agent produce a response with sources inline
  2. Take the complete response and run it through a second LLM call to reformat it as JSON (see the sketch below).
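
A sketch of phase two, where finalText is whatever the agent produced in phase one:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Second call: no tools, just reshape the finished answer into the JSON we need
const { object } = await generateObject({
	model: openai("gpt-4o-mini"),
	schema: z.object({
		response: z.string(),
		sources: z.array(z.string()),
	}),
	prompt: `Extract the user-facing answer and any cited source names from this message:\n\n${finalText}`,
});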

Advantages: GPT-4o-mini seemed capable of extracting sources reliably when given the full response text.

The dealbreaker: This fundamentally cannot work with streaming. You need the complete response before you can reformat it.

And unfortunately, it turns out that streaming isn’t optional for our use case. It’s the only path to acceptable latency. GPT-4o-mini is already the fastest OpenAI model in my testing. Every other model tested was 3-4 seconds slower. The AI SDK adds negligible overhead. Streaming is the only way to improve perceived latency.
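
For context, streaming into Slack looks roughly like this: post a placeholder message, then chat.update it as chunks arrive, throttled so Slack’s rate limits don’t bite. A sketch; client, say, userMessage, and the search tool all come from the surrounding Bolt handler and earlier sketches:

import { streamText, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";

const stream = streamText({
	model: openai("gpt-4o-mini"),
	prompt: userMessage,
	tools: { search_knowledge_base: searchKnowledgeBase },
	stopWhen: stepCountIs(5),
});

const placeholder = await say("Thinking...");
let buffer = "";
let lastUpdate = 0;

for await (const delta of stream.textStream) {
	buffer += delta;
	if (Date.now() - lastUpdate > 1000) {
		// Throttle edits to roughly one per second
		await client.chat.update({ channel: placeholder.channel!, ts: placeholder.ts!, text: buffer });
		lastUpdate = Date.now();
	}
}

// One final update with the complete text
await client.chat.update({ channel: placeholder.channel!, ts: placeholder.ts!, text: buffer });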

Verdict: Can’t sacrifice streaming for this approach. But if we didn’t need streaming, this would be a totally viable solution as well.

The Solution: cite_sources Tool

Going back to the drawing board, the main issue I had with 4o-mini was that it was inconsistent in outputting proper citations when it used sources. Prompt engineering alone wasn’t working.

Instead of relying on non-deterministic magicks, we can just add some old-fashioned guardrails using tools.

The approach: Create a cite_sources tool that the agent calls to explicitly declare which sources it used.

import { tool } from "ai";
import { z } from "zod";

export const citeSourcesTool = tool({
	name: "cite_sources",
	description:
		"Cite source documents that you referenced in your response. Call this when you've used specific information from the knowledge base.",
	inputSchema: z.object({
		sources: z
			.array(z.string())
			.describe(
				'List of document names you cited. For example: "Slop Corner", "Slop Review", "Slop Policy"',
			),
	}),
	execute: async ({ sources }) => {
		// This tool is primarily for metadata collection
		// The actual source list is captured by monitoring tool calls
		return `Successfully cited ${sources.length} source(s)`;
	},
});

The tool itself actually does nothing. Similar to the implicit prompting technique that the sequential thinking MCP uses, it just boxes the LLM into thinking about its sources. The tool call also makes it nice and easy for us to parse out later without having to do a bunch of cursed regex.
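
Pulling the citations out afterwards is a few lines against the recorded steps. A sketch, assuming AI SDK v5 (with streamText you’d await result.steps after the stream finishes):

// Collect every sources array the model declared via cite_sources across all steps
const citedSources = result.steps
	.flatMap((step) => step.toolCalls)
	.filter((call) => call.toolName === "cite_sources")
	.flatMap((call) => (call.input as { sources: string[] }).sources);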

The breakthrough: AND we can make it deterministically enforceable.

// pseudocode
if (stopReason !== 'tool_call' && searchAcmeWikiWasCalled && citeSourcesWasNotCalled) {
  // Force the model to call cite_sources
  createNewModelMessage({ forceToolCall: 'cite_sources' });
  makeOneLastLLMCall();
}

          +--------------------+
          |   LLM (agent)      |
          +--------------------+
                    |
                    | calls tool
                    v
      "search tool: {{query}}"
                    |
                    v
          +--------------------+
          |  Search tool       |
          |  returns snippets  |
          +--------------------+
                    |
                    v
          +--------------------+
          |   LLM drafts       |
          |   final message    |
          +--------------------+
                    |
                    v
      +------------------------------+
      | Interceptor checks final turn|
      | search called? cite missing? |
      +------------------------------+
        |                         |
        | no                      | yes
        v                         v
  publish final reply     inject system hint
                               |
                               v
                     "You must call cite_sources"
                               |
                               v
                   LLM issues cite_sources call
                               |
                               v
                     publish final reply (with cites)

If the agent used the search tool and finished without calling cite_sources, we detect that and force the tool call.

Most of the time, just having the tool exposed was enough to get the model to call it reliably, but this hook is there just in case it doesn’t.
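
Concretely, the fallback can be one extra, non-streamed call with toolChoice pinned to cite_sources. A sketch, where messages is the conversation so far (including the tool results and the assistant’s final reply) and citeSourcesTool is the tool defined above:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const followUp = await generateText({
	model: openai("gpt-4o-mini"),
	messages: [
		...messages,
		{
			role: "system",
			content: "You used the knowledge base search but did not call cite_sources. Call it now.",
		},
	],
	tools: { cite_sources: citeSourcesTool },
	// Pinning toolChoice means the model can only respond by calling cite_sources
	toolChoice: { type: "tool", toolName: "cite_sources" },
});

const forced = followUp.toolCalls.find((call) => call.toolName === "cite_sources");
const sources = forced ? (forced.input as { sources: string[] }).sources : [];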

The schema for the citation tool also allows the input to be empty. So in the legitimate case where no information was found, the model just passes an empty array. The injected system message only tells the model to call the tool.

I acknowledge that this isn’t a completely foolproof solution—the model could still get confused over whether or not it’s supposed to include the sources it read and didn’t use.

The Journey

What started as “just add citations” became a week-long exploration of model capabilities, speed constraints, and UX requirements:

  1. GPT-4o-mini won’t follow format instructions → tried GPT-5-mini
  2. GPT-5-mini is too slow (31s) → tried structured output
  3. Structured output is brittle → tried two-phase approach
  4. Two-phase breaks streaming → streaming is non-negotiable → needed new approach
  5. cite_sources tool → deterministic, fast, streaming-compatible

My takeaway: tools are more reliable than output formats. Having the model call a tool to declare citations is natural and deterministic. We can detect when it’s needed and force it rather than hoping the model follows instructions.