Inside the RAG
A walkthrough of every step the terminal AI takes between you typing ai what's the polyglot sandbox? and a streaming answer landing in the terminal. Built for learning, with diagrams.
The model knows a lot of things, but it doesn't know the contents of this portfolio. RAG fixes that without retraining. This is what every piece of the pipeline is doing, why hybrid retrieval beats pure semantic, and where it still has rough edges.
Why RAG at all
Asking Gemini "what's in the polyglot sandbox post?" with no context produces a confident hallucination. As far as the model knows, the post might not exist; if some version did make it into training data, it's almost certainly stale. Two ways to fix that:
- Fine-tune. Teach the model the portfolio. Slow, expensive, rebuilt every time content changes, and the model still doesn't really know when it's lying.
- Retrieval-Augmented Generation (RAG). Keep the model as-is; on each question, look up relevant passages from the portfolio and paste them into the prompt. The model reads them and answers grounded in them.
RAG wins almost every time for content that changes faster than you'd retrain. Three things have to be good for it to feel magical: the right passages have to get retrieved, the prompt has to present them clearly, and the model has to be told to ground in them and admit when they don't cover the question.
The big picture
Two timelines run in this system. Offline, when content is written, chunks get embedded and stored. Online, on every question, two retrievers race, results are fused, the top few passages are pasted into the prompt, and the model streams an answer.
Build time — making the index
scripts/build-rag-index.mjs walks blog posts and project descriptions, splits them into roughly 600-character chunks with overlap, calls Gemini's gemini-embedding-001 for each, and writes everything to server/data/rag-index.json. Three design choices worth flagging:
- Chunk size matters more than people expect. Too small (50 tokens) and the chunk has no context — "this" refers to nothing. Too big (2000 tokens) and the embedding becomes an average of unrelated topics and matches nothing well.
- Overlap. Each chunk shares a few sentences with its neighbour, so a topic that straddles a chunk boundary doesn't get cut in half.
- Stable IDs. Chunks are named {slug}-{n}, so rebuilds replace existing entries deterministically. No churn on unrelated pages when one post changes.
Vectors are stored in plain JSON. At ~115 × 768 floats that's about 700KB. At this scale a JSON file beats every "real" vector database for setup cost. Once the corpus crosses a million chunks you need pgvector or qdrant, but that's a future problem.
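To make the build step concrete, here is a minimal sketch under those design choices. The helper names and the splitting heuristic are illustrative, not the real script:

```ts
// Illustrative sketch of the offline build, not the real build-rag-index.mjs.
import { writeFile } from 'node:fs/promises'

// embed() is sketched in the next section; assume one 768-dim vector per chunk.
declare function embed(text: string): Promise<number[]>

interface IndexedChunk {
  id: string          // stable ID: {slug}-{n}
  source: string      // e.g. /blog/polyglot-sandbox
  text: string
  embedding: number[]
}

// ~600-character chunks with a little overlap so boundary-straddling topics survive.
function chunkText(text: string, size = 600, overlap = 100): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size))
  }
  return chunks
}

async function buildIndex(docs: { slug: string; source: string; text: string }[]) {
  const out: IndexedChunk[] = []
  for (const doc of docs) {
    const pieces = chunkText(doc.text)
    for (let n = 0; n < pieces.length; n++) {
      out.push({
        id: `${doc.slug}-${n}`,            // rebuilds overwrite, never duplicate
        source: doc.source,
        text: pieces[n],
        embedding: await embed(pieces[n]),
      })
    }
  }
  await writeFile('server/data/rag-index.json', JSON.stringify({ chunks: out }))
}
```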
Embeddings — turning text into vectors
An embedding model is trained on enormous amounts of text with the objective that two passages with similar meaning get nearby vectors, regardless of word choice. "How is the polyglot sandbox structured?" and "Walk me through the code-challenges architecture" share almost no surface tokens but live next to each other in embedding space.
The output is an array of 768 floats. You can think of each dimension as a learned axis: maybe one fires on "Python", another on "browser sandbox", another on "regret expressed in past tense." Nobody can read the axes directly; they emerge from training. What matters is that distances between vectors mean something.
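Getting that vector is a single API call. Here is a hedged sketch of what embed() might look like against Gemini's REST embedContent endpoint; the endpoint and field names follow my reading of the public API docs, and the project's actual client code may differ:

```ts
// Hedged sketch: embeds one passage via Gemini's embedContent REST endpoint.
// Endpoint and field names are assumptions from the public docs, not project code.
async function embed(text: string, apiKey = process.env.GEMINI_API_KEY): Promise<number[]> {
  const url =
    'https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:embedContent'
  const res = await fetch(`${url}?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      content: { parts: [{ text }] },
      outputDimensionality: 768, // assumption: trim to the 768-dim vectors the index stores
    }),
  })
  if (!res.ok) throw new Error(`embedContent failed: ${res.status}`)
  const json = await res.json()
  return json.embedding.values as number[]
}
```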
Cosine similarity, geometrically
Once everything is a vector, "which chunk is most relevant?" reduces to "which vector points the same way as the query?". Cosine similarity measures the angle between two vectors: 1 means same direction, 0 means perpendicular, -1 means opposite.
The implementation is a dozen lines:
```ts
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) return 0
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb)
  return denom > 0 ? dot / denom : 0
}
```

No SIMD, no GPU, no library. At 115 chunks of 768 floats this runs in well under a millisecond. Vector databases exist because at scale this loop turns into a billion-element product per query and you need ANN indexes like HNSW. We are not at that scale.
The single-vector retrieval (V1)
The first version was: embed the query, score every chunk by cosine, sort, return the top 4 above a threshold of 0.55.
```ts
// V1 — single-vector retrieval
const queryVec = await embed(query)
const scored = chunks.map(c => ({ chunk: c, score: cosine(queryVec, c.embedding) }))
scored.sort((a, b) => b.score - a.score)
return scored.slice(0, 4).filter(s => s.score >= 0.55)
```

This works for most "what's in this post about X" questions. Then it stops working.
Where pure semantic search fails
Consider:
- "what does
safeFetchdo?" - "show me the redis-rate-limiting post"
- "what's the AbortController timeout in the puzzle solver?"
The query is dominated by identifiers — a function name, a slug, a class name. Embeddings are great at meaning; they're surprisingly weak at exact tokens. The vector for safeFetch doesn't point reliably toward chunks containing the literal string "safeFetch" — it points toward chunks that discuss safe fetching, which may not even mention the function. Result: relevant chunks score 0.4 (below threshold) and irrelevant chunks about "safety" score 0.6 (passed through).
Lexical search has the opposite weakness: it can't see that "polyglot" and "multi-language" are the same idea. But for exact identifiers, it's perfect. Every serious RAG system at scale runs both. Pure-vector demos look good; pure-vector in production is fragile.
BM25 in two minutes
BM25 is the lexical retriever. It's the algorithm Elasticsearch defaults to, the one Lucene ships, the one your search bar at work probably uses. From 1994. It scores how relevant a document is to a query based on three things: how often a query term appears in the document (tf), how rare that term is across the corpus (idf), and the document's length.
For a single query term t and a document d:

score(t, d) = idf(t) · f(t, d) · (k₁ + 1) / (f(t, d) + k₁ · (1 − b + b · |d| / avgdl))

where f(t, d) is the frequency of t in d, |d| is the length of d in tokens, and avgdl is the average document length across the corpus. The tuning knobs k₁ = 1.5 and b = 0.75 are the folk-standard defaults.
Total document score = sum over all query terms.
Worked example
Query: pyodide watchdog timeout. Two candidate chunks:
| chunk | tokens | tf("pyodide") | tf("watchdog") | tf("timeout") | BM25 |
|---|---|---|---|---|---|
| polyglot-sandbox-3 | 312 | 2 | 3 | 2 | 7.84 |
| terminal-overview-1 | 198 | 0 | 1 | 2 | 2.31 |
The first chunk wins decisively because it contains the rare term pyodide, which the second lacks. The chunk length only mildly modulates the score — that's the 1 - b + b · |d|/avgdl term doing its job.
```ts
function bm25Search(query: string, idx: BM25Index, topN: number) {
  const k1 = 1.5, b = 0.75
  const qTokens = [...new Set(tokenize(query))]
  const out = []
  for (let i = 0; i < idx.docLengths.length; i++) {
    let s = 0
    const tf = idx.termFreqs[i], dl = idx.docLengths[i]
    for (const t of qTokens) {
      const f = tf.get(t) ?? 0
      if (f === 0) continue
      const w = idx.idf.get(t) ?? 0
      s += w * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / idx.avgdl))
    }
    if (s > 0) out.push({ docIdx: i, score: s })
  }
  return out.sort((a, b) => b.score - a.score).slice(0, topN)
}
```

One small twist: the BM25 index here weights the chunk's title by repeating it three times before the body during tokenization. Cheap way to bias matches toward chunks whose title aligns with the query.
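For reference, here is roughly how such an index could be built. The BM25Index shape is inferred from the search function above, and the tokenizer is a deliberately naive assumption:

```ts
// Hedged sketch of the index build. BM25Index is inferred from bm25Search above.
interface BM25Index {
  docLengths: number[]
  termFreqs: Map<string, number>[]
  idf: Map<string, number>
  avgdl: number
}

// Naive tokenizer assumption: lowercase, split on non-alphanumerics.
const tokenize = (s: string) => s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

function buildBM25Index(chunks: { title: string; text: string }[]): BM25Index {
  const termFreqs: Map<string, number>[] = []
  const docLengths: number[] = []
  const docFreq = new Map<string, number>()

  for (const c of chunks) {
    // Title repeated 3× so title matches weigh more than body matches.
    const tokens = tokenize(`${c.title} ${c.title} ${c.title} ${c.text}`)
    const tf = new Map<string, number>()
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1)
    for (const t of tf.keys()) docFreq.set(t, (docFreq.get(t) ?? 0) + 1)
    termFreqs.push(tf)
    docLengths.push(tokens.length)
  }

  const N = chunks.length
  const idf = new Map<string, number>()
  for (const [t, df] of docFreq) {
    // Standard BM25 idf, with +1 inside the log to keep it positive.
    idf.set(t, Math.log(1 + (N - df + 0.5) / (df + 0.5)))
  }
  return {
    docLengths,
    termFreqs,
    idf,
    avgdl: docLengths.reduce((a, b) => a + b, 0) / Math.max(N, 1),
  }
}
```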
Reciprocal Rank Fusion
Now there are two ranked lists — one from cosine, one from BM25. They use different scoring scales. Cosine is 0–1; BM25 is unbounded. You can't add them directly, and normalizing them is hard because the distributions differ per query.
Reciprocal Rank Fusion ignores scores entirely and uses only rank positions:
RRF(d) = Σ over each ranking of 1 / (k + rank(d))

where rank starts at 1, and k is a constant (usually 60) that softens the contribution of the top items so a single ranker can't dominate.
A document at rank 1 in one list and rank 5 in another scores 1/(60+1) + 1/(60+5) ≈ 0.0317. A document at rank 1 in only one list and missing from the other scores 1/(60+1) ≈ 0.0164. Appearing in both rankings always beats appearing in just one.
```ts
function rrfFuse(rankings: ScoredDoc[][], k = 60) {
  const fused = new Map<number, number>()
  for (const ranking of rankings) {
    ranking.forEach((r, rank) => {
      fused.set(r.docIdx, (fused.get(r.docIdx) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return fused
}
```

The full hybrid pipeline
What happens, in order, every time you type ai <question>:
```ts
const [vectorRanking, bm25Ranking] = await Promise.all([
  embedAndRank(query, apiKey, chunks),
  Promise.resolve(bm25Search(query, bm25Cache, 20)),
])

const fused = rrfFuse([vectorRanking.slice(0, 20), bm25Ranking])
const orderedIdx = [...fused.entries()]
  .sort((a, b) => b[1] - a[1])
  .map(([idx]) => idx)

// Drop weak vector-only matches; BM25 matches always pass through.
const bm25Set = new Set(bm25Ranking.map(r => r.docIdx))
const results = []
for (const idx of orderedIdx) {
  const cos = cosineByIdx.get(idx) ?? 0
  if (!bm25Set.has(idx) && cos < 0.45) continue
  results.push({ chunk: chunks[idx], score: cos, match: ... })
  if (results.length >= k) break
}
```

The two retrievers run in Promise.all. If embedding fails (rate-limited, network), BM25 still produces results — graceful degradation. The match field on each result tags whether the chunk came from the vector retriever, BM25, or both, useful for debugging "why did this chunk show up?".
How the context plugs into the prompt
Once the top 4 chunks are picked, they're formatted as a context block and injected after the persona system prompt. The model sees:
```
You are Eric's portfolio assistant. ...
[full persona prompt]

## Retrieved context
The following passages were retrieved from Eric's portfolio for this query.
Use them to ground your answer; cite by writing the source path inline
(e.g. /blog/polyglot-sandbox). If they don't cover what was asked, say so plainly.

[1] Source: /blog/polyglot-sandbox (Three Languages, One Sandbox)
Pyodide is around 10MB compressed and takes 3-6 seconds to initialize on a
cold cache. Per-run iframes would mean every Run Tests click forces a
multi-second pause...

---

[2] Source: /blog/polyglot-sandbox (Three Languages, One Sandbox)
The watchdog is an AbortController on the fetch, with a generous 25-second
timeout because compilation is part of the loop, not just execution...
```

Three details that matter more than they look:
- Numbered passages ([1], [2], …) so the model can refer back to them mentally without quoting the whole thing.
- Source path inline with the chunk so citations come out as /blog/polyglot-sandbox, which the terminal renders as a clickable link.
- "If they don't cover what was asked, say so plainly." This is the most load-bearing sentence in the entire prompt. Without it the model will make stuff up to fill the void rather than admit retrieval missed.
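Pulling those details together, the formatting step might look something like this sketch; the function name and exact wording are illustrative, not the project's actual code:

```ts
// Hedged sketch: turn the top retrieved chunks into the numbered context block.
interface RetrievedChunk {
  source: string   // e.g. /blog/polyglot-sandbox
  title: string
  text: string
}

function buildContextBlock(chunks: RetrievedChunk[]): string {
  const header =
    '## Retrieved context\n' +
    "The following passages were retrieved from Eric's portfolio for this query.\n" +
    'Use them to ground your answer; cite by writing the source path inline.\n' +
    "If they don't cover what was asked, say so plainly.\n"
  const passages = chunks
    .map((c, i) => `[${i + 1}] Source: ${c.source} (${c.title})\n${c.text}`)
    .join('\n---\n')
  return `${header}\n${passages}`
}
```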
Tools and the multi-round loop
RAG handles "what's in this content?" Tools handle "what's going on right now?" The model has three: search_blog, list_projects, and fetch_url (over an 8-host allowlist). Each conversation turn runs a multi-round loop, capped at 3 rounds: stream the model's tokens and collect any functionCalls; if any fired, execute them server-side, send tool-result events back to the client, feed the responses to the model, and re-stream.
The 3-round cap is a safety valve. Tool-calling models occasionally get stuck calling the same tool over and over; the cap means a stuck loop costs at most three round-trips before the user gets something.
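Roughly, and with hypothetical stand-ins for the real streaming and tool-dispatch calls (streamModel, executeTool), the loop looks like this:

```ts
// Hedged sketch of the multi-round tool loop. streamModel and executeTool are
// hypothetical stand-ins, not the project's real functions.
type Message = Record<string, unknown>
declare function streamModel(
  messages: Message[],
  emit: (event: unknown) => void,
): Promise<{ text: string; functionCalls: { name: string; args: unknown }[] }>
declare function executeTool(name: string, args: unknown): Promise<unknown>

const MAX_ROUNDS = 3 // safety valve against a stuck tool loop

async function answerTurn(messages: Message[], emit: (event: unknown) => void) {
  for (let round = 0; round < MAX_ROUNDS; round++) {
    // Stream tokens to the client while collecting any tool calls the model emits.
    const { text, functionCalls } = await streamModel(messages, emit)
    if (functionCalls.length === 0) return text // plain answer: done

    // Execute each tool server-side and surface the result to the client.
    const toolResponses: { name: string; response: unknown }[] = []
    for (const call of functionCalls) {
      const result = await executeTool(call.name, call.args)
      emit({ type: 'tool-result', name: call.name, result })
      toolResponses.push({ name: call.name, response: result })
    }
    // Feed the tool outputs back to the model and re-stream on the next round.
    messages = [...messages, { role: 'model', functionCalls }, { role: 'tool', toolResponses }]
  }
}
```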
What's not yet built
Three obvious next moves, in order of effort:
- A retrieve tool the model can call directly. Retrieval runs once per turn today. Letting the model issue follow-up retrievals — like a person typing different search queries — is a real upgrade.
- Query rewriting / multi-query expansion. Before retrieving, ask the model for 2–3 paraphrases, embed each, union and dedupe. Helps with vague or terse queries (sketched after this list).
- A reranker. Retrieve a wider top-N, then run a cheap LLM call that scores each chunk against the original question and sorts. Worth it once the corpus grows past a few hundred chunks; probably overkill at 115.
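The second item is easy to sketch on top of the existing pipeline. Nothing below is built; rewriteQuery and hybridRetrieve are hypothetical stand-ins for one extra model call and the hybrid retriever described above:

```ts
// Hedged sketch of multi-query expansion (not built).
declare function rewriteQuery(query: string): Promise<string[]>
declare function hybridRetrieve(
  query: string,
  k: number,
): Promise<{ chunk: { id: string }; score: number }[]>

async function expandedRetrieve(query: string, k = 4) {
  const variants = [query, ...(await rewriteQuery(query))]
  const seen = new Set<string>()
  const merged: { chunk: { id: string }; score: number }[] = []
  for (const q of variants) {
    for (const r of await hybridRetrieve(q, 8)) {
      if (seen.has(r.chunk.id)) continue // union + dedupe by stable chunk ID
      seen.add(r.chunk.id)
      merged.push(r)
    }
  }
  return merged.slice(0, k) // hand the model the same top-k budget as before
}
```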
The takeaway
Most of what makes RAG good at scale isn't the embedding model. It's the surrounding scaffolding: chunking, hybrid retrieval, citation formatting, query rewriting, and the model gracefully saying "I don't know." Models keep getting better; the scaffolding is what separates a demo from a system.