A production architecture guide — chunking, grounding, and evaluation that actually works on spoken content.

RAG for Video Transcripts: Architecture, Chunking & Timestamp Grounding

Published by Hatem Mezlini

Why RAG over video is a distinct problem

On the surface, RAG over video looks like RAG over text with an extra ingestion step. In practice, three properties of spoken content break the defaults that work for document RAG:

  1. Time-locatable answers. The user wants the exact moment where a claim was made, not just a passage. Every chunk in your index must carry a timestamp or the product UX is strictly worse than transcript-search.
  2. Noisy language. Spoken text has filler, false starts, and disfluency that embeddings are not optimized for. Hybrid retrieval and light normalization materially improve recall.
  3. Platform heterogeneity. YouTube, TikTok, Instagram, Facebook, X each expose transcripts through different channels with different quality. Your ingestion layer has to normalize or your retrieval quality will be a lottery.

Reference architecture

Video URL
   │
   ▼
┌────────────────────────────────────────────────┐
│  Transcript API (URL in → timestamped JSON)    │
│   → segments: [{start, end, text, speaker?}]   │
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────┐
│ Normalize (filler strip, lowercase, whitespace)│
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────────────────────────────────────────┐
│  Chunker (300–600 tokens, ~50 overlap,         │
│   break on semantic boundaries, keep metadata) │
└────────────────────────────────────────────────┘
   │
   ▼
┌────────────┐    ┌──────────────────────────────┐
│ Embedder   │    │ BM25 / keyword index         │
│ (dense)    │    │ (rare technical terms)       │
└────────────┘    └──────────────────────────────┘
   │                 │
   └──── Hybrid retrieval (top 20) ─────┐
                                          ▼
                               ┌────────────────────┐
                               │ Cross-encoder      │
                               │ rerank → top 3–5   │
                               └────────────────────┘
                                          │
                                          ▼
                               ┌────────────────────┐
                               │ LLM answer w/      │
                               │ [video:sec] cites  │
                               └────────────────────┘

Every stage is a place where video-specific details matter — the ingestion normalizer, the chunker's metadata carry-through, the reranker choice, the citation format. The rest of this post walks through each of them in order.

Ingestion — get timestamps, preserve them

You have two paths: build it yourself (official platform APIs for captioned content + a speech-to-text provider of your choice for the rest, plus all the glue between them) or use a managed endpoint that takes a video URL and returns timestamped segments. Either way, your ingestion output must be a list of segments with {start, end, text} — do not flatten to a blob of text at this stage.

For a complete walkthrough of the ingestion side — concurrency, retries, cost math — see the bulk YouTube transcript guide.

Normalization — cheap, reversible, embed-only

Spoken language carries filler that hurts embedding quality. A minimal normalizer:

  • Lowercase (optional — modern embeddings are mostly case-insensitive).
  • Strip bracketed caption annotations: [music], [laughter], [applause].
  • Collapse repeated whitespace.
  • Remove filler tokens — um, uh, you know — keeping a small allow-list so semantically meaningful uses (e.g. "you know the answer") survive.

Keep an un-normalized copy for display and re-use at answer time. Embed the normalized version.
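A sketch of that normalizer — regex-based and deliberately conservative; the filler and annotation lists here are illustrative starting points, not exhaustive:

```python
import re

# Illustrative starting lists — extend per platform.
ANNOTATIONS = re.compile(r"\[(?:music|laughter|applause)\]", re.IGNORECASE)
FILLERS = re.compile(r"\b(?:um|uh|you know)\b,?\s*", re.IGNORECASE)

def normalize(text: str) -> str:
    """Embed-only cleanup; keep the original string for display."""
    text = ANNOTATIONS.sub(" ", text)
    text = FILLERS.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()
```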

Chunking — semantic boundaries + metadata carry-through

Target 300–600 tokens per chunk with ~50 token overlap. Prefer natural pauses (long inter-segment gaps), punctuation boundaries (end-of-sentence markers), or explicit topic shifts (speaker change, scene change if your ingestion reports it) over fixed token windows.

from dataclasses import dataclass

@dataclass
class TranscriptChunk:
    chunk_id: str
    video_id: str
    start_sec: float      # earliest segment start in this chunk
    end_sec: float        # latest segment end in this chunk
    text: str             # normalized text for embedding
    display_text: str     # un-normalized text for the UI
    speaker: str | None   # if diarization is available
    chunk_index: int      # ordinal position inside the video

Every chunk keeps video_id + start_sec — that pair is the grounding primitive. With it, your answer UI can render "jump to 12:34 in the video" and your evaluation harness can verify retrieval against a golden set of expected timestamps.
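The boundary heuristics above can be sketched as a greedy chunker — illustrative only: it approximates tokens as whitespace-split words, breaks on long inter-segment pauses or the token budget, omits the ~50-token overlap for brevity, and returns plain dicts rather than TranscriptChunk instances:

```python
def chunk_segments(segments, video_id, max_tokens=500, gap_sec=2.0):
    """Greedy chunker: start a new chunk at a long inter-segment pause
    or when the (word-approximated) token budget would be exceeded."""
    chunks, cur, tokens = [], [], 0

    def flush():
        nonlocal tokens
        if cur:
            chunks.append({
                "video_id": video_id,
                "chunk_index": len(chunks),
                "start_sec": cur[0]["start"],
                "end_sec": cur[-1]["end"],
                "text": " ".join(s["text"] for s in cur),
            })
            cur.clear()
        tokens = 0

    for seg in segments:
        n = len(seg["text"].split())
        if cur and (seg["start"] - cur[-1]["end"] >= gap_sec
                    or tokens + n > max_tokens):
            flush()
        cur.append(seg)
        tokens += n
    flush()
    return chunks
```

Because chunks are built from whole segments, `start_sec`/`end_sec` fall out of the accumulation for free — no separate alignment step is needed.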

Embedding — boring is good

In 2026 the pragmatic defaults are:

  • text-embedding-3-large (OpenAI) — 3,072 dimensions, strong out-of-the-box quality, priced per token.
  • voyage-3 / voyage-3-large (Voyage AI) — specialized retrieval models, competitive on MTEB.
  • BGE-M3 (open source) — multi-functional (dense + sparse + multi-vector), runs on-prem.

Pin the model + dimension in config. Versioning matters: when you upgrade the embedder, re-index the whole corpus — do not mix vectors from different model versions in the same index.
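Pinning can be as simple as a config dict whose values feed the physical index name, so a version bump forces a fresh index — the field names here are illustrative, not a required schema:

```python
# Pinned embedder config — bump index_version whenever model or dimension changes.
EMBEDDING = {
    "model": "text-embedding-3-large",
    "dimensions": 3072,
    "index_version": 2,
}

def index_name(cfg: dict) -> str:
    """Bake the pinned config into the index name so vectors from two
    embedder versions can never end up in the same index."""
    return f'chunks__{cfg["model"]}__d{cfg["dimensions"]}__v{cfg["index_version"]}'
```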

Retrieval — hybrid, reranked

Spoken content rewards hybrid retrieval because:

  • Dense vectors catch paraphrase ("how do I spin up a server" ≈ "launch an instance").
  • BM25 catches rare technical terms and proper nouns that the embedder smooths over.

Retrieve top 15–20 candidates from the hybrid fusion, then rerank with a cross-encoder (bge-reranker-v2-m3, Cohere Rerank-3) down to top 3–5 to pass into the generation prompt. Reranking is the single highest-ROI lever in most RAG systems — it trades a small latency hit for a materially higher ceiling on answer quality.
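One common way to fuse the two result lists without calibrating their scores is Reciprocal Rank Fusion — a sketch over ranked ID lists; the constant k=60 is the conventional default, and only ranks (not raw scores) matter:

```python
def rrf_fuse(dense_ids, bm25_ids, k=60, top_n=20):
    """Reciprocal Rank Fusion: score each doc as the sum of 1/(k + rank)
    over every ranking it appears in, then sort by fused score."""
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The fused top-20 then goes to the cross-encoder; RRF only decides which candidates are worth the reranker's latency budget.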

Generation — grounded citations are non-optional

The prompt contract:

SYSTEM
You are a video Q&A assistant. Answer ONLY from the provided chunks.
Each chunk is labelled [video_id:start_sec]. When you make a claim,
cite the chunk like this: [abc123:472]. If no chunk supports the
claim, say "I don't know" — do not invent citations.

USER
Chunks:
[abc123:0] In the first segment, the host explains ...
[abc123:472] The guest argues that serverless does not ...
[def456:88]  A separate video on the same topic notes ...

Question: What does the guest say about serverless cost?

On the server you parse the [video_id:start_sec] tokens in the answer and render them as clickable deep-links (e.g. /watch?v=abc123&t=472s). A downstream moderator can flag any answer that ships without at least one citation — that is your guardrail against hallucination.
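Server-side parsing can be a single regex pass — a sketch assuming the /watch?v=…&t=…s deep-link shape shown above; it also returns the citation count so the zero-citation guardrail has something to check:

```python
import re

CITE = re.compile(r"\[([A-Za-z0-9_-]+):(\d+)\]")

def render_citations(answer: str, base: str = "/watch") -> tuple[str, int]:
    """Turn [video_id:start_sec] tokens into deep-links and count them,
    so an answer with zero citations can be flagged downstream."""
    count = 0

    def repl(match: re.Match) -> str:
        nonlocal count
        count += 1
        vid, sec = match.group(1), match.group(2)
        return f'<a href="{base}?v={vid}&t={sec}s">{sec}s</a>'

    return CITE.sub(repl, answer), count
```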

Evaluation — the step everyone skips

Build a golden set. 50–200 query/expected-timestamp pairs is enough to catch almost any regression. For each pair you know the correct video + the correct ~30-second window; the evaluation harness checks whether the retrieved top-k contains that window. Track two numbers on every change:

  • Retrieval@k — did the correct timestamp make it into the top-5?
  • Citation accuracy — when the LLM emits a citation, does the timestamp it cites actually support the claim?

Manual spot checks at the stage-gate before a release will not catch 80% of retrieval regressions. A golden set is the cheapest, highest-leverage eval investment in the whole system.
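The harness itself is small — a sketch where `retrieve(query)` is assumed to return chunks carrying video_id/start_sec/end_sec, and a hit means the chunk overlaps the expected ~30-second window:

```python
def retrieval_at_k(golden, retrieve, k=5, window_sec=30.0):
    """Fraction of golden queries whose expected timestamp window is
    covered by at least one of the top-k retrieved chunks."""
    hits = 0
    for item in golden:
        for chunk in retrieve(item["query"])[:k]:
            overlaps = (chunk["start_sec"] <= item["timestamp_sec"] + window_sec
                        and chunk["end_sec"] >= item["timestamp_sec"])
            if chunk["video_id"] == item["video_id"] and overlaps:
                hits += 1
                break
    return hits / len(golden)
```

Run it on every index or model change; a drop of even a few points on a 200-pair golden set is a regression worth investigating before release.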

The process, distilled

  1. Ingest transcripts with timestamps intact. Use a Universal Transcript Retrieval API (or a DIY pipeline combining the official platform APIs with your preferred speech-to-text provider) that returns segments in the form {start, end, text}. Do NOT flatten to plain text at ingestion — you will need timestamps downstream.
  2. Normalize and clean. Strip hesitations and filler (um, uh, you know), collapse repeated whitespace, fix obvious caption errors with a cheap LLM pass if needed. Keep an un-normalized copy for display; embed the normalized version.
  3. Chunk on semantic boundaries. Target 300–600 tokens per chunk with ~50 token overlap. Prefer natural pauses, topic shifts, or end-of-utterance markers over fixed token windows. Carry {video_id, start_sec, end_sec, speaker?} as metadata on every chunk.
  4. Embed with a current model. OpenAI text-embedding-3-large, Voyage voyage-3, or BGE-M3. Keep embedding model and dimension pinned in config so re-indexing is deterministic.
  5. Index for hybrid search. Store the chunk vector in a vector DB (pgvector, Qdrant, Pinecone) alongside a BM25 / keyword index. Spoken language benefits from hybrid — vector catches paraphrase, BM25 catches rare technical terms.
  6. Retrieve top-k with metadata. For a user query, retrieve top 10–20 chunks from the hybrid index and rerank with a cross-encoder (bge-reranker-v2-m3, Cohere Rerank) down to top 3–5. Pass the chunks with their timestamps into the generation prompt.
  7. Generate with grounded citations. Prompt the LLM to answer only from the provided chunks and cite in the form [video_id:start_sec]. Parse citations server-side and render them as deep-links in the UI. Reject or flag responses with no citation.
  8. Evaluate continuously. Build a golden set of 50–200 query/expected-timestamp pairs and compute retrieval@k and answer-citation accuracy on every index or model change. Manual spot checks alone do not catch regressions at scale.
