The Problem
Your codebase has 50,000 files. You want an AI agent to understand it. You naively chunk every file, embed everything, and query it. The results are terrible: the system retrieves config.json when you ask about authentication, misses the actual auth module, and returns test fixtures as production examples.
Not all code is equally important. Treating it equally makes retrieval useless.
Most code RAG systems fail because they don't understand code semantics. They treat code like prose - it's not. Code has structure, dependencies, and hierarchies that pure text embeddings miss.
The Core Insight
Code RAG needs structural awareness, not just semantic similarity.
Think of a codebase like a graph: functions call each other, modules import dependencies, tests reference implementations. Flat text embeddings lose this structure. Production code RAG preserves it.
The winning approach: combine text embeddings with code structure metadata.
The Walkthrough
Architecture Overview
```
Codebase Knowledge Base
├─ File Prioritization (what to index)
├─ Code Chunking (how to split)
├─ Metadata Extraction (structure + context)
├─ Dependency Graph (relationships)
├─ Embedding + Indexing (vector DB)
└─ Incremental Updates (stay fresh)
```
Step 1: File Prioritization
Don't index everything. Prioritize by importance:
| Priority | File Types | Why |
|---|---|---|
| High | Core business logic, API routes, services | Where actual work happens |
| Medium | Utils, helpers, models, schemas | Reusable components |
| Low | Tests (index selectively), docs | Reference, not implementation |
| Skip | node_modules, build artifacts, configs | Noise, not signal |
```python
def should_index_file(file_path: str) -> bool:
    """Decide whether a file should be indexed."""
    # Skip dependencies and build artifacts
    skip_patterns = ['node_modules', 'venv', '.git', 'dist', 'build']
    if any(pattern in file_path for pattern in skip_patterns):
        return False
    # Skip config files unless they're code
    if file_path.endswith(('.json', '.yml', '.yaml', '.env')):
        return False
    # Index source code
    code_extensions = ['.js', '.ts', '.py', '.java', '.go', '.rs']
    return any(file_path.endswith(ext) for ext in code_extensions)
```
Step 2: Semantic Code Chunking
Chunk by logical units (functions, classes), not lines:
```python
from tree_sitter_languages import get_parser  # bundles prebuilt grammars

def chunk_code_semantically(code: str, language: str) -> list[dict]:
    """Parse code into semantic chunks (functions, classes, methods)."""
    parser = get_parser(language)  # e.g. 'python', 'javascript'
    tree = parser.parse(bytes(code, 'utf8'))
    chunks = []

    def extract_chunks(node, parent_context=""):
        if node.type in ('function_definition', 'class_definition', 'method_definition'):
            # Extract the full source text for this unit
            chunk_code = code[node.start_byte:node.end_byte]
            # Extract metadata (grammar-specific helpers, defined elsewhere)
            name = extract_name(node, code)
            docstring = extract_docstring(node, code)
            params = extract_parameters(node, code)
            chunks.append({
                'type': node.type,
                'name': name,
                'code': chunk_code,
                'docstring': docstring,
                'parameters': params,
                'parent_context': parent_context,
                'start_line': node.start_point[0],
                'end_line': node.end_point[0],
            })
            # Recurse for nested definitions
            for child in node.children:
                extract_chunks(child, parent_context=name)
        else:
            for child in node.children:
                extract_chunks(child, parent_context)

    extract_chunks(tree.root_node)
    return chunks
```

(The `tree-sitter-languages` package ships compiled grammars, avoiding the manual `Language` build step of the raw `tree_sitter` API. Node type names vary by grammar, so extend the tuple above per language; `extract_name`, `extract_docstring`, and `extract_parameters` are grammar-specific helpers left to the reader.)
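For a Python-only codebase, the same idea works with the standard-library `ast` module and no grammar setup. A minimal sketch:

```python
import ast

def chunk_python_semantically(code: str) -> list[dict]:
    """Split Python source into function/class chunks using the stdlib ast module."""
    tree = ast.parse(code)
    lines = code.splitlines()
    chunks = []

    def visit(node, parent_context=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                chunks.append({
                    'type': type(child).__name__,
                    'name': child.name,
                    'code': '\n'.join(lines[child.lineno - 1:child.end_lineno]),
                    'docstring': ast.get_docstring(child) or '',
                    'parent_context': parent_context,
                    'start_line': child.lineno,
                    'end_line': child.end_lineno,
                })
                # Recurse so methods and nested defs carry their parent's name
                visit(child, parent_context=child.name)
            else:
                visit(child, parent_context)

    visit(tree)
    return chunks
```

The trade-off: `ast` only parses Python, while tree-sitter covers dozens of languages with one traversal pattern.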
Step 3: Rich Metadata Extraction
Add context beyond the code itself:
```python
def extract_metadata(file_path: str, chunk: dict) -> dict:
    """Add rich metadata for better retrieval."""
    return {
        **chunk,
        'metadata': {
            'file_path': file_path,
            'module': extract_module_path(file_path),
            'imports': extract_imports(chunk['code']),
            'calls_to': extract_function_calls(chunk['code']),
            'complexity': calculate_complexity(chunk['code']),
            'has_tests': check_for_tests(chunk['name'], file_path),
            'last_modified': get_git_last_modified(file_path),
            'primary_author': get_git_primary_author(file_path)
        }
    }
```
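The helpers referenced above are left abstract; for Python, two of them (`extract_imports` and `extract_function_calls`) can be sketched with the stdlib `ast` module:

```python
import ast

def extract_imports(code: str) -> list[str]:
    """Collect imported module names from a Python chunk."""
    modules = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return sorted(set(modules))

def extract_function_calls(code: str) -> list[str]:
    """Collect names of functions called inside a Python chunk."""
    calls = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                calls.append(node.func.id)       # plain call: foo()
            elif isinstance(node.func, ast.Attribute):
                calls.append(node.func.attr)     # method call: obj.foo()
    return sorted(set(calls))
```

Attribute calls are recorded by method name only, so `os.path.join` becomes `join`; a production version would resolve qualified names.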
Step 4: Dependency Graph Integration
Track relationships between code units:
```python
import networkx as nx

class CodebaseGraph:
    """Graph of code dependencies for enhanced retrieval."""

    def __init__(self):
        self.graph = nx.DiGraph()

    def add_function(self, func_name: str, metadata: dict):
        """Add a function node with metadata."""
        self.graph.add_node(func_name, **metadata)

    def add_dependency(self, caller: str, callee: str):
        """Add an edge from caller to callee."""
        self.graph.add_edge(caller, callee)

    def get_related_functions(self, func_name: str, depth: int = 2) -> list[str]:
        """Get functions related to this one (callers + callees) within `depth` hops."""
        # nx.descendants/nx.ancestors take no depth argument,
        # so run a breadth-first search with a cutoff instead.
        callees = nx.single_source_shortest_path_length(
            self.graph, func_name, cutoff=depth)
        callers = nx.single_source_shortest_path_length(
            self.graph.reverse(copy=False), func_name, cutoff=depth)
        return [f for f in {*callees, *callers} if f != func_name]
```
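A quick, self-contained demo of the depth-limited lookup on a toy call graph (function names are illustrative; note that `nx.descendants`/`nx.ancestors` accept no depth argument, so a cutoff BFS does the limiting):

```python
import networkx as nx

# Toy call graph: handle_request -> authenticate -> check_token -> query_db,
# and authenticate also calls log_event
g = nx.DiGraph()
g.add_edges_from([
    ('handle_request', 'authenticate'),
    ('authenticate', 'check_token'),
    ('check_token', 'query_db'),
    ('authenticate', 'log_event'),
])

def related(graph: nx.DiGraph, func: str, depth: int = 2) -> list[str]:
    """Callers and callees within `depth` hops (BFS with a cutoff)."""
    callees = nx.single_source_shortest_path_length(graph, func, cutoff=depth)
    callers = nx.single_source_shortest_path_length(
        graph.reverse(copy=False), func, cutoff=depth)
    return sorted(({*callees, *callers}) - {func})

print(related(g, 'authenticate', depth=1))
# ['check_token', 'handle_request', 'log_event']
```

Raising `depth` to 2 also pulls in `query_db`, two calls away.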
Step 5: Hybrid Retrieval Strategy
Combine vector search with graph traversal:
```python
def retrieve_relevant_code(query: str, top_k: int = 5) -> list[dict]:
    """Hybrid retrieval: semantic + structural."""
    # Step 1: vector search for semantically similar chunks (over-fetch)
    vector_results = vector_db.search(query, top_k=top_k * 2)
    # Step 2: expand with the dependency graph, avoiding duplicates
    expanded_results, seen = [], set()
    for result in vector_results:
        func_name = result['metadata']['name']
        # Add the matched function
        if func_name not in seen:
            expanded_results.append(result)
            seen.add(func_name)
        # Add related functions from the graph
        related = codebase_graph.get_related_functions(func_name, depth=1)
        for rel_func in related[:2]:  # top 2 related
            related_chunk = vector_db.get_by_name(rel_func)
            if related_chunk and rel_func not in seen:
                expanded_results.append(related_chunk)
                seen.add(rel_func)
    # Step 3: re-rank by relevance + importance
    ranked = rerank_results(expanded_results, query)
    return ranked[:top_k]
```
Why Hybrid Retrieval Works
Vector search finds: "What code is semantically similar to the query?"
Graph expansion adds: "What other code is structurally related?"
Together: You get both the answer and its context.
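A toy, runnable version of the two-step idea (hard-coded similarity scores and a hand-written call graph stand in for a real embedding model and parser; all names are illustrative):

```python
# Stand-in scores an embedding model might produce for the query
# "how are requests authenticated?"
vector_scores = {'authenticate': 0.91, 'check_token': 0.55, 'parse_config': 0.40}

# Call graph as adjacency lists: who calls whom, and the reverse
calls = {'handle_request': ['authenticate'], 'authenticate': ['check_token']}
called_by = {'authenticate': ['handle_request'], 'check_token': ['authenticate']}

def hybrid_retrieve(top_k: int = 2) -> list[str]:
    # Step 1: vector search -- take the best semantic matches
    hits = sorted(vector_scores, key=vector_scores.get, reverse=True)[:top_k]
    # Step 2: graph expansion -- pull in direct callers and callees
    expanded = list(hits)
    for name in hits:
        for neighbor in calls.get(name, []) + called_by.get(name, []):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded

print(hybrid_retrieve())
# ['authenticate', 'check_token', 'handle_request']
```

Vector search alone never surfaces `handle_request` (it shares no vocabulary with the query), yet it is exactly the context an agent needs to see how authentication is invoked.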
Step 6: Incremental Updates
Re-indexing 50k files on every commit is wasteful. Update smartly:
```python
def incremental_update(changed_files: list[str]):
    """Re-index only the files that changed."""
    for file_path in changed_files:
        # Remove old chunks for this file
        vector_db.delete_where(metadata={'file_path': file_path})
        # Re-chunk and re-index
        code = read_file(file_path)
        chunks = chunk_code_semantically(code, detect_language(file_path))
        for chunk in chunks:
            metadata = extract_metadata(file_path, chunk)
            embedding = embed_code(chunk['code'])
            vector_db.add(embedding, metadata)
        # Update the dependency graph
        update_graph_for_file(file_path, chunks)
```
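A changed file can also stale its *dependents*: callers elsewhere whose context no longer matches. The dependency graph can be walked in reverse to find them. A sketch, where `graph` (function → functions it calls) and `file_of` (function → defining file) are hypothetical structures your indexer would maintain:

```python
def files_to_refresh(changed_files: list[str], graph: dict, file_of: dict) -> list[str]:
    """Changed files plus files containing direct callers of their functions."""
    affected = set(changed_files)
    # Invert the call graph: callee -> callers
    callers_of = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers_of.setdefault(callee, []).append(caller)
    # Any function defined in a changed file may have stale callers elsewhere
    for func, path in file_of.items():
        if path in affected.intersection(changed_files):
            for caller in callers_of.get(func, []):
                affected.add(file_of[caller])
    return sorted(affected)
```

Whether to cascade one hop (as here) or further is a cost/freshness trade-off; one hop is usually enough for retrieval quality.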
Production-Ready Pipeline
```python
class CodebaseKnowledgeBase:
    """Production code RAG system."""

    def __init__(self, repo_path: str):
        self.repo_path = repo_path
        self.vector_db = init_vector_db()
        self.graph = CodebaseGraph()

    def index_codebase(self):
        """Initial indexing of the entire codebase."""
        files = find_source_files(self.repo_path)
        for file_path in tqdm(files):
            if not should_index_file(file_path):
                continue
            self._index_file(file_path)

    def _index_file(self, file_path: str):
        """Index a single file."""
        code = read_file(file_path)
        language = detect_language(file_path)
        chunks = chunk_code_semantically(code, language)
        for chunk in chunks:
            # Enrich with metadata
            enriched = extract_metadata(file_path, chunk)
            # Embed and store (docstring may be empty)
            embedding = embed_code(chunk['code'] + (chunk['docstring'] or ''))
            self.vector_db.add(embedding, enriched)
            # Update the graph
            self.graph.add_function(chunk['name'], enriched)
            # Add dependencies (calls_to lives on the enriched metadata)
            for callee in enriched['metadata']['calls_to']:
                self.graph.add_dependency(chunk['name'], callee)

    def query(self, question: str, top_k: int = 5) -> list[dict]:
        """Query the knowledge base."""
        return retrieve_relevant_code(question, top_k)

    def watch_for_changes(self):
        """Watch the repo for changes and update incrementally."""
        # Use git hooks or a file watcher
        for changed_file in watch_git_changes(self.repo_path):
            incremental_update([changed_file])
```
Failure Patterns
1. The Kitchen Sink Index
Symptom: You indexed node_modules, configs, 10k test fixtures - retrieval is garbage.
Fix: Be selective. Index production code, skip noise.
2. The Line-Count Chunker
Symptom: Functions split mid-implementation, context lost.
Fix: Use AST-based chunking. Respect code structure.
3. The Flat Embedding
Symptom: Retrieves similar code but misses dependencies.
Fix: Build dependency graph. Expand results structurally.
4. The Stale Index
Symptom: Agent suggests code from 50 commits ago.
Fix: Incremental updates on file changes. Git hooks or watchers.
The Embedding Model Matters
Code-specific embedding models (CodeBERT, StarEncoder, and newer code-tuned models) generally outperform general-purpose embeddings like OpenAI's text-embedding-3 on code retrieval benchmarks, often by a substantial margin. If code RAG is your core use case, evaluate a specialized model before defaulting to a general one.
Quick Reference
Code RAG Pipeline:
- Filter files (skip deps, configs, build artifacts)
- Chunk by structure (AST parsing, not lines)
- Extract metadata (imports, calls, complexity)
- Build dependency graph (caller/callee relationships)
- Embed with code-specific model
- Hybrid retrieval (vector + graph)
- Incremental updates (re-index only changed files)
Key Metadata to Track:
- File path and module structure
- Function/class name and signature
- Imports and dependencies
- Function calls (who calls what)
- Docstrings and inline comments
- Git metadata (author, last modified)
Retrieval Strategy:
- Vector search for semantic matches
- Graph expansion for related code
- Re-rank by relevance + importance
- Return top-k with full context
Rule of Thumb:
Code isn't just text - it's a graph. Pure text embeddings miss structure. Combine semantic search with dependency graphs for production-grade code retrieval.