Introduction¶
Code Graph is the foundational feature of the Code Graph Knowledge System, providing intelligent code intelligence capabilities without requiring vector embeddings or large language models. It works in all deployment modes (minimal, standard, and full), making it the most accessible and performant feature for code analysis.
Unlike traditional code search tools that rely on simple text matching, Code Graph uses Neo4j's graph database and native fulltext indexing to understand code structure, file relationships, and dependency chains. This enables powerful capabilities like impact analysis, smart search, and context generation for AI assistants.
What is Code Graph?¶
Code Graph is a graph-based representation of your codebase stored in Neo4j. When you ingest a repository, the system:
- Scans code files across your repository (Python, TypeScript, JavaScript, Go, Rust, Java, etc.)
- Creates graph nodes for repositories, files, and symbols (functions, classes)
- Establishes relationships like IMPORTS, CALLS, DEFINED_IN, IN_REPO
- Indexes content using Neo4j's native fulltext search for fast retrieval
- Calculates metrics like file size, language, and change frequency
The result is a queryable graph that understands:
- Which files import other files
- Which functions call which other functions
- What would break if you modify a specific file
- Which files are most central to your codebase
Key Features¶
1. Repository Ingestion¶
Transform your codebase into a searchable graph database.
Modes:
- Incremental (60x faster): Only process changed files using git diff
- Full: Complete re-ingestion for non-git projects or first-time setup
Supported Languages:
- Python (
.py) - TypeScript/JavaScript (
.ts,.tsx,.js,.jsx) - Go (
.go) - Rust (
.rs) - Java (
.java) - C/C++ (
.c,.cpp,.h,.hpp) - C# (
.cs) - Ruby (
.rb) - PHP (
.php) - Swift (
.swift) - Kotlin (
.kt) - Scala (
.scala)
What gets indexed:
- File paths (for pattern matching)
- Programming language
- File size
- File content (for files < 100KB)
- SHA hash (for change detection)
- Git commit information (in incremental mode)
2. Fulltext Search¶
Find relevant files using Neo4j's native fulltext search engine. Unlike vector-based semantic search, fulltext search:
- Works without embeddings or LLM
- Provides instant results (< 100ms)
- Supports fuzzy matching and relevance scoring
- Scales to large repositories (10,000+ files)
Search capabilities:
- Keyword matching in file paths
- Language filtering
- Relevance ranking
- Multi-term queries
- Path pattern matching
3. Impact Analysis¶
Understand the blast radius of code changes before making them. Impact analysis traverses the dependency graph to find:
- Direct dependents: Files that directly import your file
- Transitive dependents: Files that indirectly depend on your file
- Function callers: Code that calls functions you're modifying
- Import chains: Complete dependency paths
This is critical for:
- Refactoring: Know what you'll break
- Code review: Understand change implications
- Testing strategy: Identify affected test suites
- Architecture analysis: Map system boundaries
4. Context Packing¶
Generate curated, token-budget-aware context bundles for AI assistants. Context packing solves the problem of "what code should I show the LLM?"
Features:
- Budget-aware: Respects token limits (500-10,000 tokens)
- Stage-specific: Different content for plan/review/implement stages
- Smart ranking: Prioritizes most relevant files
- Deduplication: Removes redundant references
- Category limits: Balances files vs symbols vs guidelines
Use cases:
- Claude Desktop integration via MCP
- VS Code Copilot context
- Custom AI agents
- Automated code review
- Documentation generation
Architecture¶
Graph Schema¶
Code Graph uses the following Neo4j schema:
Nodes:
- Repo: Repository root
- Properties: id, created, file_count
- File: Source code file
- Properties: repoId, path, lang, size, content, sha, updated
- Constraints: Composite key (repoId, path)
- Symbol: Function or class
- Properties: id, name, type, lang
- Constraints: Unique id
Relationships:
- (File)-[:IN_REPO]->(Repo): File belongs to repository
- (File)-[:IMPORTS]->(File): File imports another file
- (Symbol)-[:DEFINED_IN]->(File): Symbol defined in file
- (Symbol)-[:CALLS]->(Symbol): Symbol calls another symbol
Indexes¶
Code Graph creates several indexes for optimal performance:
- Fulltext Index (
file_text): - Indexes: File path, language
- Used for: Fast fulltext search
-
Type: Neo4j native fulltext
-
Property Indexes:
file_path: Exact path lookupsfile_repo: Filter by repository-
symbol_name: Symbol name lookups -
Composite Keys:
(repoId, path): Unique file identification- Allows same filename in different repos
Performance Characteristics¶
| Operation | Small Repo (<1K files) | Medium Repo (1K-10K files) | Large Repo (>10K files) |
|---|---|---|---|
| Full Ingestion | 5-10s | 30-60s | 2-5min |
| Incremental Update | <1s | 1-3s | 3-10s |
| Fulltext Search | <50ms | <100ms | <200ms |
| Impact Analysis | <100ms | <200ms | <500ms |
| Context Pack | <200ms | <300ms | <500ms |
Scalability:
- Tested with repositories up to 50,000 files
- Neo4j graph database scales horizontally
- Fulltext index automatically optimized
- Memory usage: ~50MB per 1,000 files
Integration Points¶
1. MCP Server (Model Context Protocol)¶
Code Graph provides 4 MCP tools for AI assistants:
code_graph_ingest_repo: Ingest repositorycode_graph_related: Find related filescode_graph_impact: Analyze impactcontext_pack: Build context bundle
Compatible with:
- Claude Desktop
- VS Code with MCP extension
- Any MCP-compatible client
2. REST API¶
All Code Graph features are available via HTTP REST API:
POST /api/v1/code-graph/ingest - Ingest repository
POST /api/v1/code-graph/search - Fulltext search
POST /api/v1/code-graph/impact - Impact analysis
POST /api/v1/code-graph/context-pack - Build context pack
3. Direct Service Access¶
For custom integrations, use Python services directly:
from src.codebase_rag.services.code import graph_service
from src.codebase_rag.services.code import code_ingestor
from src.codebase_rag.services.ranker import ranker
from src.codebase_rag.services.code import pack_builder
Deployment Modes¶
Code Graph works identically across all deployment modes:
Minimal Mode¶
What's included: - Neo4j database only - Code Graph (all features) - No embeddings or LLM required
Resource requirements: - Docker image: ~500MB - Memory: 512MB minimum - CPU: 1 core minimum - Startup time: ~5 seconds
Best for: - Individual developers - Learning the system - CI/CD integration - Budget-conscious deployments
Standard Mode¶
What's included: - Neo4j database - Code Graph (all features) - Memory Store (manual management) - Embedding model (for memory search)
Additional capabilities: - Memory Store for project knowledge - Vector search for memories - Still no LLM required for Code Graph
Full Mode¶
What's included: - Everything from Standard - LLM integration - Memory auto-extraction - Knowledge RAG
Additional capabilities: - Memory extraction from git commits - Knowledge base Q&A - Advanced AI features
Note: Code Graph features work identically in all modes. Only additional features change.
Use Cases¶
1. Understanding Unfamiliar Codebases¶
Scenario: You've joined a new team and need to understand a large codebase quickly.
Workflow: 1. Ingest the repository 2. Search for key terms (e.g., "authentication", "database") 3. Use impact analysis to understand dependencies 4. Generate context packs for specific areas
Benefits: - No need to read thousands of files - Quickly identify entry points - Understand system architecture - Find related code automatically
2. Refactoring with Confidence¶
Scenario: You need to refactor a core module but don't know what depends on it.
Workflow: 1. Run impact analysis on the file you want to change 2. Review all dependent files (direct and transitive) 3. Assess the blast radius 4. Plan your refactoring strategy
Benefits: - Know exactly what you'll break - Identify all test files to update - Plan migration strategy - Avoid surprise breakages
3. AI-Assisted Development¶
Scenario: You want to use Claude Desktop to help with development but need relevant context.
Workflow: 1. Ingest your repository 2. Use MCP tools in Claude Desktop 3. Generate context packs for specific tasks 4. Ask Claude questions with full context
Benefits: - AI has relevant code context - Token budget automatically managed - No manual copy-pasting - Stay within LLM context limits
4. Code Review Assistance¶
Scenario: Reviewing a pull request that touches multiple files.
Workflow: 1. Run impact analysis on changed files 2. Identify all affected components 3. Search for related test files 4. Generate review context pack
Benefits: - Complete picture of PR impact - Don't miss hidden dependencies - Find affected tests automatically - Better review quality
5. Architectural Analysis¶
Scenario: Need to understand system architecture and identify tightly coupled components.
Workflow: 1. Ingest the repository 2. Query the graph for high-degree nodes (many connections) 3. Analyze import/call patterns 4. Identify architectural boundaries
Benefits: - Discover hidden dependencies - Identify refactoring opportunities - Understand layer violations - Plan architecture improvements
Comparison with Alternatives¶
vs. grep/ripgrep¶
| Feature | grep/ripgrep | Code Graph |
|---|---|---|
| Text search | ✅ Excellent | ✅ Good |
| Relationship analysis | ❌ None | ✅ Full support |
| Impact analysis | ❌ Manual | ✅ Automatic |
| Ranking | ❌ None | ✅ Relevance-based |
| Scalability | ⚠️ Slows on large repos | ✅ Constant time |
When to use grep: Quick one-off searches, simple text matching
When to use Code Graph: Understanding relationships, impact analysis, repeated searches
vs. ctags/universal-ctags¶
| Feature | ctags | Code Graph |
|---|---|---|
| Symbol indexing | ✅ Excellent | ✅ Good |
| Cross-file analysis | ❌ Limited | ✅ Full support |
| Dependency tracking | ❌ None | ✅ Complete |
| Search capabilities | ⚠️ Basic | ✅ Advanced |
| Graph traversal | ❌ None | ✅ Full support |
When to use ctags: Editor integration, local navigation
When to use Code Graph: Cross-file analysis, dependency tracking, impact analysis
vs. Vector-based semantic search¶
| Feature | Vector Search | Code Graph |
|---|---|---|
| Semantic understanding | ✅ Excellent | ⚠️ Limited |
| Relationship analysis | ❌ None | ✅ Full support |
| Setup complexity | ⚠️ High (embeddings) | ✅ Low (no LLM) |
| Performance | ⚠️ Slower | ✅ Fast |
| Resource requirements | ⚠️ High | ✅ Low |
When to use Vector Search: Natural language queries, semantic similarity
When to use Code Graph: Structural analysis, fast searches, resource-constrained environments
vs. Language Server Protocol (LSP)¶
| Feature | LSP | Code Graph |
|---|---|---|
| Real-time analysis | ✅ Excellent | ⚠️ Batch |
| Cross-file features | ✅ Good | ✅ Excellent |
| Language support | ⚠️ Per-language | ✅ Universal |
| Historical analysis | ❌ None | ✅ Git integration |
| AI integration | ❌ Limited | ✅ Native |
When to use LSP: Editor integration, real-time feedback, language-specific features
When to use Code Graph: Cross-language analysis, historical changes, AI assistance
Best Practices¶
1. Ingestion Strategy¶
For active development: - Use incremental mode for fast updates - Run ingestion on every pull request - Automate with CI/CD hooks
For initial setup: - Use full mode first time - Verify ingestion completed successfully - Check Neo4j for expected file count
For large repositories (>10K files): - Use incremental mode exclusively - Schedule full re-ingestion monthly - Monitor ingestion performance
2. Search Optimization¶
For best search results: - Use specific terms (not single letters) - Include file extensions for language filtering - Combine multiple keywords - Use path segments for targeted search
Examples: - Good: authentication service typescript - Bad: auth ts - Good: api/routes payment - Bad: pay
3. Impact Analysis¶
When running impact analysis: - Start with depth=1 for direct dependencies - Increase to depth=2 for transitive dependencies - Rarely go beyond depth=3 (too much noise) - Focus on high-score results first
Interpreting results: - score=1.0: Direct CALLS relationship, depth 1 - score=0.9: Direct IMPORTS relationship, depth 1 - score=0.7: Transitive CALLS, depth 2 - score<0.5: Indirect dependencies, lower priority
4. Context Packing¶
Budget recommendations: - Planning: 500-1000 tokens (high-level overview) - Review: 1000-2000 tokens (focused analysis) - Implementation: 1500-3000 tokens (detailed context) - Large context: 3000-10000 tokens (comprehensive)
Stage selection: - plan: Project structure, entry points, key files - review: Code quality, patterns, conventions - implement: Detailed implementation, symbols, logic
5. Performance Tuning¶
For optimal performance: - Keep files under 100KB (for content indexing) - Exclude generated files (node_modules, build/) - Run incremental updates frequently - Monitor Neo4j memory usage
Troubleshooting slow queries: - Check Neo4j indexes are created - Verify fulltext index exists - Reduce search result limit - Add more specific search terms
Limitations¶
Current Limitations¶
- No semantic understanding: Code Graph uses fulltext search, not embeddings
- Can't find synonyms or related concepts
- Requires keyword matching
-
No natural language queries
-
Limited symbol analysis: Basic function/class detection only
- No deep AST parsing
- No type inference
-
No cross-language call graphs (yet)
-
File size limits: Files > 100KB are not content-indexed
- Path and metadata still indexed
- Impact analysis still works
-
Just no fulltext search of content
-
No real-time updates: Ingestion is batch-based
- Not suitable for editor integration
- Run manually or via CI/CD
- Use incremental mode for faster updates
Planned Improvements¶
v0.8 (Next Release): - Enhanced AST parsing for better symbol detection - Cross-language call graph analysis - Real-time file watching (optional)
v0.9 (Future): - Hybrid vector + fulltext search - AI-powered code summarization - Natural language query support
v1.0 (Long-term): - Multi-repo dependency tracking - Language-specific analyzers - Performance profiling integration
Getting Started¶
Ready to use Code Graph? Check out these guides:
- Repository Ingestion - Learn how to ingest your codebase
- Search and Discovery - Master fulltext search and ranking
- Impact Analysis - Understand dependencies and blast radius
- Context Packing - Generate AI-friendly context bundles
FAQ¶
Does Code Graph require an LLM or embeddings?¶
No. Code Graph works with Neo4j alone. It uses native fulltext indexing, not vector embeddings or LLMs.
What deployment mode do I need?¶
Any mode. Code Graph works identically in minimal, standard, and full deployment modes.
How is this different from GitHub Copilot?¶
Code Graph is a knowledge management system, not a code completion tool. It helps you understand your codebase structure, dependencies, and relationships. It can feed context to Copilot, but doesn't replace it.
Can I use this with private/confidential code?¶
Yes. Code Graph runs entirely on-premise or in your infrastructure. No code is sent to external services. It's completely self-hosted.
How much disk space does it need?¶
Approximately 10-20% of your source code size. A 1GB repository typically requires 100-200MB of Neo4j storage.
Does it work with monorepos?¶
Yes. Code Graph handles monorepos well. You can ingest the entire monorepo and search across all projects, or ingest individual projects separately.
Can I query the graph directly?¶
Yes. You can access Neo4j Browser at http://localhost:7474 and run Cypher queries directly. See the Neo4j documentation for query syntax.
What if my language isn't supported?¶
Files are still indexed by path and metadata, just without language-specific symbol extraction. Fulltext search and impact analysis still work. Language support is expanding in future releases.
Next Steps¶
- Ingestion Guide: Learn how to ingest your first repository
- Search Guide: Master search and discovery techniques
- Impact Analysis: Understand code dependencies
- Context Packing: Generate AI context bundles