The Problem
You built a RAG system and split documents every 500 characters. Now retrieved chunks return half a sentence, or cut a critical code block mid-function. The LLM can't make sense of fragments, so answers are garbage even when the right documents were retrieved.
Chunking makes or breaks RAG. Most systems fail here.
Naive chunking (every N characters) destroys semantic meaning. You wouldn't rip pages randomly from a book and expect to understand the story. But that's what character-count splitting does to your knowledge base.
The Core Insight
Chunks should be semantic units, not arbitrary character counts.
Think of chunks like paragraphs in writing: each should convey a complete thought. A sentence fragment is useless. A full paragraph with context is valuable. Chunk boundaries should respect meaning.
Good chunking preserves: context, completeness, and retrievability.
The Walkthrough
The Chunking Hierarchy
From worst to best:
| Strategy | How It Works | Quality | When To Use |
|---|---|---|---|
| Fixed Character Count | Split every N chars | ❌ Poor | Never (too naive) |
| Fixed Token Count | Split every N tokens | ⚠️ Basic | Quick prototypes only |
| Sentence-Based | Split on sentence boundaries | ✅ Good | Prose, documentation |
| Paragraph-Based | Split on \n\n | ✅ Better | Structured text |
| Semantic Chunking | Group by topic/meaning | ✅ Best | Production systems |
| Structure-Aware (Code) | Split on functions/classes | ✅ Best for code | Codebase RAG |
Sentence-Based Chunking (Good Baseline)
```python
def chunk_by_sentences(text: str, target_size: int = 500) -> list[str]:
    """
    Chunk text by sentences, targeting a size but never splitting mid-sentence.
    """
    sentences = split_sentences(text)  # Use a proper sentence tokenizer
    chunks = []
    current_chunk = []
    current_size = 0
    for sentence in sentences:
        sentence_size = len(sentence)
        # If adding this sentence exceeds the target, finalize the current chunk
        if current_size + sentence_size > target_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_size = sentence_size
        else:
            current_chunk.append(sentence)
            current_size += sentence_size
    # Add the final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
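The `split_sentences` helper above is assumed rather than defined. A minimal regex-based sketch is below; a production system should use a real sentence tokenizer (e.g. nltk or spaCy), since this naive version mishandles abbreviations like "e.g." and "Dr.":

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # Abbreviations ("e.g.", "Dr.") will be split incorrectly.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```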
Semantic Chunking (Production-Grade)
Use embeddings to find natural topic boundaries:
```python
def semantic_chunk(text: str, similarity_threshold: float = 0.7) -> list[str]:
    """
    Chunk text where semantic similarity drops (topic changes).
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    embeddings = embed_sentences(sentences)  # Batch-embed all sentences
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare the current sentence's embedding with the previous one
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        # If similarity drops, the topic changed - start a new chunk
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Why Semantic Chunking Works
When writing flows naturally, sentences about the same topic have high embedding similarity. When topics shift, similarity drops. This creates natural boundaries that preserve meaning.
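The `cosine_similarity` call above is the standard dot-product ratio between two embedding vectors. A minimal pure-Python sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction (same topic), values near 0 mean unrelated content.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```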
Code-Specific Chunking
Code has structure. Use it:
```python
def chunk_code_by_structure(code: str, language: str) -> list[dict]:
    """
    Chunk code by logical units (functions, classes, methods).
    """
    tree = parse_ast(code, language)  # Use tree-sitter or similar
    chunks = []
    for node in tree.root_node.children:
        if node.type in ('function_definition', 'class_definition'):
            chunks.append({
                'type': node.type,
                'name': extract_name(node),
                'code': extract_code(node),
                'docstring': extract_docstring(node),
                'start_line': node.start_point[0],
                'end_line': node.end_point[0],
            })
    return chunks
```
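tree-sitter covers many languages, but for Python source alone the stdlib `ast` module gives the same structure with no dependencies. A concrete sketch of the same idea (the `parse_ast`/`extract_*` helpers above are assumed; this version is self-contained):

```python
import ast

def chunk_python_by_ast(code: str) -> list[dict]:
    # Stdlib-only structure-aware chunking for Python: each top-level
    # function or class becomes one chunk with its metadata.
    tree = ast.parse(code)
    lines = code.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                'type': type(node).__name__,
                'name': node.name,
                'code': "\n".join(lines[node.lineno - 1:node.end_lineno]),
                'docstring': ast.get_docstring(node),
                'start_line': node.lineno,
                'end_line': node.end_lineno,
            })
    return chunks
```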
The Overlap Strategy
Prevent context loss at boundaries:
```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """
    Create overlapping chunks to preserve context at boundaries.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        # Find a natural boundary (sentence end) near the target end
        boundary = find_sentence_boundary(text, end, window=50)
        chunks.append(text[start:boundary])
        if boundary >= len(text):
            break
        # Next chunk starts earlier (overlap) to carry context across the
        # boundary; max() guarantees forward progress (no infinite loop)
        start = max(boundary - overlap, start + 1)
    return chunks
```
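The `find_sentence_boundary` helper above is assumed. A minimal sketch that scans a window around the target position for the nearest sentence end:

```python
def find_sentence_boundary(text: str, target: int, window: int = 50) -> int:
    # Look for the last sentence end (". ") within `window` characters
    # of `target`; fall back to `target` (clamped) if none is found.
    lo = max(0, target - window)
    hi = min(len(text), target + window)
    idx = text.rfind(". ", lo, hi)
    if idx != -1:
        return idx + 1  # keep the period with the left-hand chunk
    return min(target, len(text))
```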
The Overlap Tradeoff
Pros: Prevents losing meaning at chunk boundaries
Cons: Increases storage (redundant text) and retrieval noise
Sweet spot: 10-20% overlap for most use cases
Failure Patterns
1. The Character Counter
Symptom: Chunks split mid-sentence, mid-word, mid-thought.
Fix: Use sentence boundaries at minimum. Never split on character count alone.
2. The Micro-Chunk Problem
Symptom: Chunks are 2-3 sentences, no context, retrieval is noisy.
Fix: Aim for 300-800 tokens per chunk. Smaller chunks lose context, larger chunks dilute signal.
3. The Mega-Chunk
Symptom: Chunks are 5000 tokens, retrieval returns too much irrelevant context.
Fix: Split down into the 300-800 token range. Use hierarchical retrieval if the full context is needed.
4. The No-Overlap Gap
Symptom: Critical information spans two chunks, retrieval misses it.
Fix: Add 10-15% overlap. Small cost, big improvement in boundary cases.
Practical Chunking Pipeline
```python
def production_chunking_pipeline(document: str, document_id: str, doc_type: str) -> list[dict]:
    """
    Production-grade chunking with metadata and overlap.
    """
    # Step 1: Clean and normalize
    text = clean_text(document)
    # Step 2: Choose a strategy based on document type
    if doc_type == "code":
        structural = chunk_code_by_structure(text, language="python")  # detect language in practice
        raw_chunks = [c['code'] for c in structural]  # flatten dicts to text
    elif doc_type == "markdown":
        raw_chunks = chunk_by_headers(text)  # Use markdown structure
    else:
        raw_chunks = semantic_chunk(text)
    # Step 3: Add overlap
    chunks_with_overlap = add_overlap(raw_chunks, overlap_ratio=0.15)
    # Step 4: Add metadata
    enriched_chunks = []
    for i, chunk in enumerate(chunks_with_overlap):
        enriched_chunks.append({
            'id': f"{document_id}_chunk_{i}",
            'content': chunk,
            'metadata': {
                'doc_id': document_id,
                'doc_type': doc_type,
                'chunk_index': i,
                'total_chunks': len(chunks_with_overlap),
            },
        })
    # Step 5: Embed
    embeddings = embed_chunks([c['content'] for c in enriched_chunks])
    for chunk, embedding in zip(enriched_chunks, embeddings):
        chunk['embedding'] = embedding
    return enriched_chunks
```
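The pipeline's markdown branch calls a `chunk_by_headers` helper that isn't defined above. A minimal sketch that splits a markdown document at its headers, so each chunk is one section (header line plus body):

```python
import re

def chunk_by_headers(markdown: str) -> list[str]:
    # Start a new chunk at every markdown header (#, ##, ... ######),
    # keeping the header line attached to the section it introduces.
    chunks = []
    current: list[str] = []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```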
Quick Reference
Chunking Guidelines:
- Size: 300-800 tokens per chunk (sweet spot)
- Overlap: 10-20% (100-150 tokens)
- Boundaries: Respect semantic units (sentences, paragraphs, functions)
- Metadata: Include source doc, chunk index, type
Strategy by Content Type:
- Prose/Docs: Semantic chunking or paragraph-based
- Code: AST-based (functions, classes)
- Structured (JSON/YAML): Logical blocks
- Markdown: Header-based hierarchy
Quality Checks:
- Can you understand the chunk in isolation?
- Does it contain a complete thought/unit?
- Would splitting it differently improve retrieval?
Rule of Thumb:
If a human can't make sense of your chunk without context, your retrieval system won't either. Chunks should be self-contained knowledge units.