Chunking Strategies That Actually Work

Module 12: RAG & Vector Databases | Expansion Guide


The Problem

You built a RAG system. Split documents every 500 characters. Retrieved chunks return half a sentence, or split a critical code block mid-function. The LLM can't make sense of fragments, so answers are garbage despite having the right documents.

Chunking makes or breaks RAG. Most systems fail here.

Naive chunking (every N characters) destroys semantic meaning. You wouldn't rip pages randomly from a book and expect to understand the story. But that's what character-count splitting does to your knowledge base.
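To see the failure concretely, here is a minimal sketch of naive fixed-size splitting (shown only to illustrate the failure mode, not as something to use):

```python
# Naive fixed-size splitting -- the "worst" strategy in the table below.
def chunk_by_chars(text: str, n: int) -> list[str]:
    # Cut every n characters, with no regard for words or sentences.
    return [text[i:i + n] for i in range(0, len(text), n)]

text = "Retrieval quality depends on chunk boundaries respecting meaning."
print(chunk_by_chars(text, 20))
# First chunk is "Retrieval quality de" -- cut mid-word, mid-thought.
```

Every boundary lands wherever the counter runs out, so most chunks start and end in the middle of a word or idea.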

The Core Insight

Chunks should be semantic units, not arbitrary character counts.

Think of chunks like paragraphs in writing: each should convey a complete thought. A sentence fragment is useless. A full paragraph with context is valuable. Chunk boundaries should respect meaning.

Good chunking preserves: context, completeness, and retrievability.

The Walkthrough

The Chunking Hierarchy

From worst to best:

| Strategy | How It Works | Quality | When To Use |
|----------|--------------|---------|-------------|
| Fixed Character Count | Split every N chars | ❌ Poor | Never (too naive) |
| Fixed Token Count | Split every N tokens | ⚠️ Basic | Quick prototypes only |
| Sentence-Based | Split on sentence boundaries | ✅ Good | Prose, documentation |
| Paragraph-Based | Split on `\n\n` | ✅ Better | Structured text |
| Semantic Chunking | Group by topic/meaning | ✅ Best | Production systems |
| Structure-Aware (Code) | Split on functions/classes | ✅ Best for code | Codebase RAG |

Sentence-Based Chunking (Good Baseline)

def chunk_by_sentences(text: str, target_size: int = 500) -> list[str]:
    """
    Chunk text by sentences, targeting size but never splitting mid-sentence.
    """
    sentences = split_sentences(text)  # Use proper sentence tokenizer
    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence)

        # If adding this sentence exceeds target, finalize current chunk
        if current_size + sentence_size > target_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_size = sentence_size
        else:
            current_chunk.append(sentence)
            current_size += sentence_size

    # Add final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
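The code above assumes a `split_sentences` helper. A real system would use a proper tokenizer (e.g. nltk's `sent_tokenize` or spaCy); as a rough stand-in for experimentation, a regex splitter looks like this:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # A real tokenizer (nltk, spaCy) also handles abbreviations
    # like "e.g." and "Dr.", which this regex will split incorrectly.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

This is enough to make the sentence-based chunker runnable on clean prose, but swap it out before trusting it on messy documents.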

Semantic Chunking (Production-Grade)

Use embeddings to find natural topic boundaries:

def semantic_chunk(text: str, similarity_threshold: float = 0.7) -> list[str]:
    """
    Chunk text where semantic similarity drops (topic changes).
    """
    sentences = split_sentences(text)
    if not sentences:
        return []
    embeddings = embed_sentences(sentences)  # Batch embed all sentences

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Compare current sentence embedding with previous
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])

        # If similarity drops, it's a new topic - start new chunk
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
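The `cosine_similarity` call above is standard; with numpy it is a one-liner (assuming the embeddings are plain vectors):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    # 1.0 means identical direction (same topic); values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Many embedding APIs return unit-normalized vectors, in which case the denominator is 1 and this reduces to a dot product.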

Why Semantic Chunking Works

When writing flows naturally, sentences about the same topic have high embedding similarity. When topics shift, similarity drops. This creates natural boundaries that preserve meaning.

Code-Specific Chunking

Code has structure. Use it:

def chunk_code_by_structure(code: str, language: str) -> list[dict]:
    """
    Chunk code by logical units (functions, classes, methods).
    """
    tree = parse_ast(code, language)  # Use tree-sitter or similar
    chunks = []

    for node in tree.root_node.children:
        if node.type in ['function_definition', 'class_definition']:
            chunks.append({
                'type': node.type,
                'name': extract_name(node),
                'code': extract_code(node),
                'docstring': extract_docstring(node),
                'start_line': node.start_point[0],
                'end_line': node.end_point[0]
            })

    return chunks
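The sketch above assumes a tree-sitter-style `parse_ast` that works across languages. For Python-only sources, the standard-library `ast` module gives the same idea with no external dependencies (the helper name `chunk_python_code` is my own):

```python
import ast

def chunk_python_code(code: str) -> list[dict]:
    # One chunk per top-level function or class -- the structure-aware
    # strategy, restricted to Python, using only the stdlib.
    tree = ast.parse(code)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                'type': type(node).__name__,
                'name': node.name,
                'code': ast.get_source_segment(code, node),
                'docstring': ast.get_docstring(node),
                'start_line': node.lineno,
                'end_line': node.end_lineno,
            })
    return chunks
```

For multi-language codebases, tree-sitter (with per-language grammars) is the usual choice; the chunk schema stays the same.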

The Overlap Strategy

Prevent context loss at boundaries:

def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """
    Create overlapping chunks to preserve context at boundaries.
    """
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size

        # Find natural boundary (sentence end) near target end
        boundary = find_sentence_boundary(text, end, window=50)

        chunks.append(text[start:boundary])
        if boundary >= len(text):
            break

        # Next chunk starts earlier (overlap) to include context;
        # max() guarantees forward progress so the loop always terminates
        start = max(boundary - overlap, start + 1)

    return chunks
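`find_sentence_boundary` is assumed above; one possible stand-in scans backwards from the target position for sentence-ending punctuation:

```python
def find_sentence_boundary(text: str, target: int, window: int = 50) -> int:
    # Return the index just past the last ., !, or ? within `window`
    # characters before `target`; fall back to `target` if none is found.
    end = min(target, len(text))
    lo = max(0, end - window)
    for i in range(end - 1, lo - 1, -1):
        if text[i] in '.!?':
            return i + 1
    return end
```

The fallback matters: without it, a long run of text with no punctuation (a table, a code block) would stall the chunker.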

The Overlap Tradeoff

Pros: Prevents losing meaning at chunk boundaries
Cons: Increases storage (redundant text) and retrieval noise
Sweet spot: 10-20% overlap for most use cases
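The storage cost is easy to estimate: with overlap ratio r, each chunk advances only (1 − r) of its length through the document, so total stored text grows by roughly a factor of 1/(1 − r):

```python
# Approximate storage overhead for a given overlap ratio r:
# effective stride per chunk = (1 - r) * chunk_size,
# so stored text grows by ~ 1 / (1 - r).
for r in (0.10, 0.15, 0.20):
    overhead = 1 / (1 - r) - 1
    print(f"{r:.0%} overlap -> ~{overhead:.0%} extra storage")
```

So 10-20% overlap costs roughly 11-25% extra storage, which is usually a fair trade for fewer boundary misses.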

Failure Patterns

1. The Character Counter

Symptom: Chunks split mid-sentence, mid-word, mid-thought.

Fix: Use sentence boundaries at minimum. Never split on character count alone.

2. The Micro-Chunk Problem

Symptom: Chunks are 2-3 sentences, no context, retrieval is noisy.

Fix: Aim for 300-800 tokens per chunk. Smaller chunks lose context, larger chunks dilute signal.

3. The Mega-Chunk

Symptom: Chunks are 5000 tokens, retrieval returns too much irrelevant context.

Fix: Split to 500-1000 tokens. Use hierarchical retrieval if full context needed.

4. The No-Overlap Gap

Symptom: Critical information spans two chunks, retrieval misses it.

Fix: Add 10-15% overlap. Small cost, big improvement in boundary cases.

Practical Chunking Pipeline

def production_chunking_pipeline(document: str, doc_type: str, document_id: str) -> list[dict]:
    """
    Production-grade chunking with metadata and overlap.
    """
    # Step 1: Clean and normalize
    text = clean_text(document)

    # Step 2: Choose strategy based on type
    if doc_type == "code":
        # Language would come from file metadata in practice
        raw_chunks = chunk_code_by_structure(text, language="python")
    elif doc_type == "markdown":
        raw_chunks = chunk_by_headers(text)  # Use markdown structure
    else:
        raw_chunks = semantic_chunk(text)

    # Step 3: Add overlap
    chunks_with_overlap = add_overlap(raw_chunks, overlap_ratio=0.15)

    # Step 4: Add metadata
    enriched_chunks = []
    for i, chunk in enumerate(chunks_with_overlap):
        enriched_chunks.append({
            'id': f"{document_id}_chunk_{i}",
            'content': chunk,
            'metadata': {
                'doc_id': document_id,
                'doc_type': doc_type,
                'chunk_index': i,
                'total_chunks': len(chunks_with_overlap)
            }
        })

    # Step 5: Embed
    embeddings = embed_chunks([c['content'] for c in enriched_chunks])
    for chunk, embedding in zip(enriched_chunks, embeddings):
        chunk['embedding'] = embedding

    return enriched_chunks

Quick Reference

Chunking Guidelines:

  - Target 300-800 tokens per chunk; smaller loses context, larger dilutes signal
  - Add 10-15% overlap between adjacent chunks
  - Never split mid-sentence; chunk boundaries should respect meaning

Strategy by Content Type:

  - Prose and documentation: sentence-based or semantic chunking
  - Structured text: paragraph-based (split on \n\n)
  - Markdown: split on headers
  - Code: structure-aware (functions, classes, methods)

Quality Checks:

  1. Can you understand the chunk in isolation?
  2. Does it contain a complete thought/unit?
  3. Would splitting it differently improve retrieval?

Rule of Thumb:

If a human can't make sense of your chunk without context, your retrieval system won't either. Chunks should be self-contained knowledge units.