
How to Chunk YouTube Transcripts for RAG (and Why 30 Seconds Is Wrong)

INDXR.AI Editorial
Published April 16, 2026 · Updated April 24, 2026

The chunk size you pick for YouTube transcripts matters more than your embedding model. That's the finding of a 2025 peer-reviewed study from Vectara, published at NAACL, which tested 25 chunking configurations across 48 embedding models (arxiv.org/abs/2410.13070). Chunking strategy influenced retrieval quality as much as, or more than, the choice of model.

Most developers default to 30 seconds because granularity feels useful. Thirty seconds of spoken English produces roughly 75 words — approximately 100 tokens. That's below the 256-token floor where embedding models start to produce semantically meaningful vectors. You're embedding fragments, and your retrieval quality reflects it.

How Many Tokens Is 30 Seconds of Speech?

Spoken English averages 130–160 words per minute. YouTube creators trend toward the faster end. Using OpenAI's cl100k_base tokenizer (~1.33 tokens per word):

| Duration | Words (~150 WPM) | Tokens | For RAG? |
| --- | --- | --- | --- |
| 30s | ~75 | ~100 | ❌ Below 256-token floor |
| 60s | ~150 | ~200 | ⚠️ Minimum viable |
| 90s | ~225 | ~300 | ✅ Inside sweet spot |
| 120s | ~300 | ~400 | ✅ Research-backed optimum |
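The arithmetic is easy to sanity-check yourself. The sketch below uses the table's assumptions (150 WPM, ~1.33 tokens per word) for the estimate, plus tiktoken's cl100k_base encoding for an exact count on real transcript text; the function names and default ratios are illustrative, not fixed constants.

```python
import tiktoken

def estimate_chunk_tokens(duration_seconds: float,
                          words_per_minute: float = 150.0,
                          tokens_per_word: float = 1.33) -> int:
    """Estimate the token count of a transcript chunk from its duration."""
    words = words_per_minute / 60.0 * duration_seconds
    return round(words * tokens_per_word)

def exact_tokens(text: str) -> int:
    """Exact count using OpenAI's cl100k_base tokenizer."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

for seconds in (30, 60, 90, 120):
    print(f"{seconds}s -> ~{estimate_chunk_tokens(seconds)} tokens")
# 30s -> ~100, 60s -> ~200, 90s -> ~299, 120s -> ~399
```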

LangChain's YoutubeLoader defaults to chunk_size_seconds=120 for exactly this reason. INDXR.AI offers 30s, 60s, 90s, and 120s presets — the 30s option exists for short-form content and granular navigation, but for most RAG workloads 60s or above is the right starting point.
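If you're already in the LangChain ecosystem, the chunked transcript mode looks like this in recent langchain_community versions (the URL is a placeholder; chunk_size_seconds is written out even though 120 is the default):

```python
from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.youtube import TranscriptFormat

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    transcript_format=TranscriptFormat.CHUNKS,   # one Document per chunk
    chunk_size_seconds=120,                      # the library default
)
docs = loader.load()
print(len(docs), docs[0].metadata)  # each chunk carries start-time metadata
```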

What the Research Says

Vectara NAACL 2025 tested 25 chunking configurations across 48 embedding models. Key finding: chunking strategy influenced retrieval quality as much as or more than the embedding model. Larger fixed-size chunks generally outperformed smaller ones. Semantic chunking did not reliably beat well-chosen fixed-size chunking.

NVIDIA's benchmark tested chunk sizes from 128 to 2,048 tokens across query types. Factoid queries performed best at 256–512 tokens. Analytical queries performed better at 512–1,024 tokens. For YouTube transcripts, where queries tend to be topic-based, 256–512 tokens — roughly 75 to 150 seconds of speech at typical rates — is the appropriate target.

Chroma Research found that RecursiveCharacterTextSplitter at 400 tokens achieved ~89% recall — competitive with more complex approaches at a fraction of the cost. Token-range target matters more than algorithm sophistication.

Microsoft Azure AI Search recommends 512 tokens with 25% overlap as a baseline. For audio transcripts with shorter sentences, 300–400 tokens (90–120 seconds) often performs comparably.

Fixed-Time vs. Semantic Chunking

Semantic chunking detects topic shifts and adjusts boundaries accordingly. For audio transcripts specifically, it underperforms.

The Vectara paper found semantic chunking failed to justify its computational cost. A 2026 benchmark by Vecta found semantic chunking produced an average chunk size of only 43 tokens — far below optimal — with 54% accuracy. Fixed-size chunking at 512 tokens achieved 69% accuracy at a fraction of the compute.

For transcripts, there's an additional reason to prefer fixed-time: timestamp alignment. Semantic chunkers adjust boundaries based on text similarity, which can produce chunks that span awkward time ranges and break the clean mapping between text and video timestamp. Lose that mapping and you lose the ability to cite sources with deep links — one of the most valuable things about video-based RAG.

The approach that works: time-based chunking with sentence-boundary snapping. Target a duration, but adjust the boundary to land on a complete sentence. This gives you predictable token ranges, clean sentence boundaries, and preserved timestamps.
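Here is a minimal sketch of that approach, assuming you already have a punctuated transcript split into timestamped sentences. The Sentence type and function names are illustrative, not INDXR.AI's internals: accumulate sentences until the chunk crosses the target duration, then close it at that sentence boundary.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    start: float  # seconds from the start of the video
    end: float
    text: str

def _to_chunk(sents: list[Sentence]) -> dict:
    """Collapse a run of sentences into one chunk with clean timestamps."""
    return {
        "start": sents[0].start,
        "end": sents[-1].end,
        "text": " ".join(s.text for s in sents),
    }

def chunk_by_time(sentences: list[Sentence],
                  target_seconds: float = 90.0) -> list[dict]:
    """Fixed-time chunking with sentence-boundary snapping.

    Sentences are never split: a chunk closes at the first sentence
    boundary at or past the target duration, so every chunk keeps a
    predictable token range and a clean (start, end) for deep links.
    """
    chunks: list[dict] = []
    current: list[Sentence] = []
    for sent in sentences:
        current.append(sent)
        if sent.end - current[0].start >= target_seconds:
            chunks.append(_to_chunk(current))
            current = []
    if current:  # flush the trailing partial chunk
        chunks.append(_to_chunk(current))
    return chunks
```

Because the boundary only ever moves forward to the next sentence end, chunks run slightly over target rather than under it, which is the right bias given the token floor discussed above.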

Overlap: 15% Is the Research-Backed Default

Overlap repeats a portion of one chunk at the start of the next, helping retrieval when a relevant passage spans a boundary.

NVIDIA tested 10%, 15%, and 20% overlap; 15% performed best for dense embedding retrieval. Microsoft Azure recommends 25% as a conservative starting point. For 60-second chunks, 15% overlap is 9 seconds. For 120-second chunks, it's 18 seconds — roughly one to two sentences carried over.

One important caveat: a 2026 analysis using SPLADE sparse retrieval found overlap provided no measurable benefit for sparse methods. If you're using BM25 or SPLADE, set overlap to 0%. The overlap_seconds field in INDXR.AI's output tells you what was applied so you can deduplicate if needed.
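Extending the chunk_by_time sketch above (same illustrative types; overlap_ratio=0.15 per the NVIDIA result, or 0.0 for sparse retrieval): after closing a chunk, seed the next one with the sentences that fall inside the trailing overlap window.

```python
def chunk_with_overlap(sentences: list[Sentence],
                       target_seconds: float = 90.0,
                       overlap_ratio: float = 0.15) -> list[dict]:
    """Time-based chunks where each chunk begins with the sentences
    covering roughly the last overlap_ratio of the previous chunk.
    Set overlap_ratio=0.0 for BM25/SPLADE pipelines."""
    overlap_seconds = target_seconds * overlap_ratio  # 13.5s at 90s / 15%
    chunks: list[dict] = []
    current: list[Sentence] = []
    for sent in sentences:
        current.append(sent)
        if sent.end - current[0].start >= target_seconds:
            chunks.append(_to_chunk(current))
            # carry over the sentences inside the trailing overlap window
            cutoff = sent.end - overlap_seconds
            current = [s for s in current if s.end > cutoff]
    # flush the tail, but skip it if it is pure carried-over overlap
    if current and (not chunks or current[-1].end > chunks[-1]["end"]):
        chunks.append(_to_chunk(current))
    return chunks
```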

Why Transcript Source Quality Affects Chunking

Auto-generated YouTube captions have no punctuation. Text arrives as lowercase words without sentence boundaries. This affects chunking in two ways.

First, sentence-boundary snapping requires sentences — which requires punctuation. Without it, chunk edges are arbitrary cuts through the text stream.

Second, accuracy varies significantly. Auto-captions achieve 60–95% word accuracy depending on audio quality. Errors propagate into embeddings. A misheard technical term becomes a poor retrieval anchor.

AssemblyAI transcription adds punctuation and capitalization, improves accuracy — particularly for accents, fast speech, and technical vocabulary — and enables sentence-level overlap via sentence boundary detection. For RAG pipelines where retrieval quality matters, the source transcript quality is part of the chunking equation.
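The punctuation dependency is easy to see with a naive splitter (a deliberately simple regex, not production sentence detection):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

punctuated = "Transformers use attention. Each head learns its own projection."
auto_caption = "transformers use attention each head learns its own projection"

print(split_sentences(punctuated))    # 2 sentences -> 2 snap points
print(split_sentences(auto_caption))  # 1 unbroken blob -> nothing to snap to
```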

The Practical Defaults

| Parameter | Recommended default | Source |
| --- | --- | --- |
| Chunk duration | 60–90s | Token range analysis; NVIDIA benchmark |
| Strategy | Fixed-time + sentence-boundary snap | Vectara NAACL 2025 |
| Overlap | 15% | NVIDIA benchmark |
| Avoid | Under 60s for most workloads | Below 200-token threshold |

These aren't universal rules. Short-form content (under 5 minutes) may benefit from 30s chunks for granularity. Analytical queries over long lectures may benefit from 120s. Start at 60s and adjust based on your retrieval quality.

INDXR.AI's RAG JSON export handles the chunking, overlap, and metadata — download and load directly into your vector database. For the full output schema, see YouTube Transcript JSON Export. For credit packages, see the pricing page.
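As a sketch of that last mile, here's how a chunked export could be loaded into Chroma. The text/start/end field names are assumptions for illustration; check the export schema doc for the actual keys.

```python
import json
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("youtube_chunks")

with open("transcript_chunks.json") as f:
    chunks = json.load(f)  # assumed: a list of chunk objects

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=[c["text"] for c in chunks],  # assumed field names
    metadatas=[{"start": c["start"], "end": c["end"]} for c in chunks],
)

# Retrieve, then cite the source with a timestamped deep link
results = collection.query(query_texts=["how does attention work?"], n_results=3)
start = results["metadatas"][0][0]["start"]
print(f"https://www.youtube.com/watch?v=VIDEO_ID&t={int(start)}s")  # placeholder ID
```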

Frequently Asked Questions

Why does chunk size matter more than the embedding model?
The Vectara NAACL 2025 study tested this directly. An excellent embedding model can't compensate for chunks too small to carry semantic content.
Does semantic chunking ever beat fixed-size for transcripts?
Rarely. The Vectara paper found it didn't consistently outperform. The Vecta 2026 benchmark found it produced dangerously small fragments (43 tokens average) with poor accuracy. For transcripts, fixed-time with sentence snapping is both simpler and more effective.
What chunk size for videos under 5 minutes?
Consider 30s. With 120s chunks, a 5-minute video produces only 2–3 chunks, which limits retrieval granularity.
Does overlap help with sparse retrieval?
No. Set it to 0% for BM25 or SPLADE. Overlap benefits dense embedding models specifically.

Sources