Improve graph: entity types, traversal, ingestion pipeline, REST API, tests, and scoring by ChristianKniep · Pull Request #1139 · MemMachine/MemMachine
Purpose of the change
Adds a knowledge graph layer to MemMachine, enabling multi-hop relationship traversal, entity-typed nodes, semantic feature relationships, graph analytics, and node deduplication on top of the existing Neo4j vector store. This allows the system to answer queries that require following connections across memories rather than relying solely on vector similarity scoring. For example, it can link Bob to Project Atlas by discovering that Bob is a TensorFlow expert and that Project Atlas uses TensorFlow.
Description
This PR introduces end-to-end knowledge graph capabilities across the storage, application, and API layers:
Graph infrastructure (neo4j_vector_graph_store.py, data_types.py, graph_traversal_store.py):
- `RELATED_TO` edges are created between semantically similar features during ingestion, controlled by a configurable cosine-similarity threshold (`related_to_threshold`, default `0.70`)
- Entity-type labels are applied to nodes (`ENTITY_TYPE_Person`, `ENTITY_TYPE_Concept`, `ENTITY_TYPE_Event`, etc.) and exposed as a filter parameter on the search API
- Multi-hop traversal, graph-filtered vector search, shortest-path queries, subgraph (ego-graph) extraction
- GDS-powered analytics: PageRank, Louvain community detection, degree centrality, betweenness centrality
- Background node deduplication with configurable `SAME_AS` proposals and merge/dismiss resolution via the API
- Near-duplicate `RELATED_TO` edge suppression at `similarity >= 0.99` to avoid noise
Semantic ingestion pipeline (semantic_ingestion.py, semantic_relationship_storage.py, feature_relationship_types.py):
- After each ingestion cycle, semantic features are cross-linked with typed edges: `RELATED_TO`, `CONTRADICTS`, `IMPLIES`, `SUPERSEDES`
- A new `SemanticRelationshipStorage` protocol exposes relationship CRUD and contradiction detection
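To illustrate the typed edges and contradiction detection, here is a small in-memory sketch. It is not the `SemanticRelationshipStorage` protocol itself: the class names, field names, and the `find_contradictions` helper are hypothetical, and treating `CONTRADICTS` as symmetric is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class RelationshipType(str, Enum):
    """The four edge types named in the description."""
    RELATED_TO = "RELATED_TO"
    CONTRADICTS = "CONTRADICTS"
    IMPLIES = "IMPLIES"
    SUPERSEDES = "SUPERSEDES"


@dataclass(frozen=True)
class FeatureRelationship:
    """Hypothetical record for one typed edge between two features."""
    source_id: str
    target_id: str
    type: RelationshipType
    confidence: float = 1.0


def find_contradictions(relationships, feature_ids):
    """Return sorted, de-duplicated CONTRADICTS pairs within feature_ids.

    Assumption: CONTRADICTS is symmetric, so (a, b) and (b, a) collapse
    into a single pair with the smaller ID first.
    """
    wanted = set(feature_ids)
    pairs = set()
    for rel in relationships:
        if rel.type is not RelationshipType.CONTRADICTS:
            continue
        if rel.source_id in wanted and rel.target_id in wanted:
            pairs.add(tuple(sorted((rel.source_id, rel.target_id))))
    return sorted(pairs)
```

Only pairs where both endpoints belong to the requested feature set are reported, which mirrors the "within a feature set" wording of the contradictions endpoint.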
Episode store deduplication (episode_sqlalchemy_store.py):
- Adds a `content_hash` column (SHA-256 of `session_key + producer_id + content`) with `ON CONFLICT DO NOTHING` upsert on both PostgreSQL and SQLite
- Includes an online migration that backfills existing rows and adds the unique constraint idempotently on startup
- `Episode.is_new` flag allows callers to distinguish newly inserted episodes from deduplicated returns
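The deduplication idea can be sketched with the standard library alone. This is an illustration under assumptions, not the SQLAlchemy code in `episode_sqlalchemy_store.py`: the table layout and function names are hypothetical, and the hash input is a plain concatenation that mirrors the description (the real implementation may add separators or normalisation).

```python
import hashlib
import sqlite3


def content_hash(session_key, producer_id, content):
    """SHA-256 hex digest of the concatenated fields (plain concatenation is an assumption)."""
    payload = (session_key + producer_id + content).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def insert_episode(conn, session_key, producer_id, content):
    """Insert an episode; return True if a row was written, False if it was a duplicate."""
    cursor = conn.execute(
        "INSERT INTO episodes (session_key, producer_id, content, content_hash) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(content_hash) DO NOTHING",
        (session_key, producer_id, content,
         content_hash(session_key, producer_id, content)),
    )
    # rowcount is 1 for a fresh insert and 0 when the conflict clause skipped it.
    return cursor.rowcount == 1


# Hypothetical schema: only the UNIQUE content_hash column matters for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes ("
    "id INTEGER PRIMARY KEY, session_key TEXT, producer_id TEXT, "
    "content TEXT, content_hash TEXT UNIQUE)"
)
first = insert_episode(conn, "s1", "alice", "hello")   # new row
second = insert_episode(conn, "s1", "alice", "hello")  # same hash, skipped
```

The boolean returned here plays the role that `Episode.is_new` plays for callers: it separates a fresh insert from a deduplicated one.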
REST API (graph_router.py, ~1,900 lines) — new /memories/graph route group:
- `POST /memories/graph/search/multi-hop`: multi-hop traversal from an anchor node
- `POST /memories/graph/search/filtered`: graph-filtered vector similarity search
- `POST /memories/graph/relationships`: create typed feature relationships
- `POST /memories/graph/relationships/get`: query relationships with direction and confidence filters
- `POST /memories/graph/relationships/delete`: delete a specific relationship
- `POST /memories/graph/contradictions`: find all `CONTRADICTS` pairs within a feature set
- `POST /memories/graph/dedup/proposals`: list duplicate node proposals
- `POST /memories/graph/dedup/resolve`: merge or dismiss duplicate pairs
- `POST /memories/graph/analytics/pagerank`: compute PageRank (requires GDS)
- `POST /memories/graph/analytics/communities`: Louvain community detection (requires GDS)
- `POST /memories/graph/analytics/stats`: graph statistics (node/edge counts, degree, type distribution)
- `POST /memories/graph/analytics/shortest-path`: shortest path between two nodes
- `POST /memories/graph/analytics/degree-centrality`: degree centrality ranking
- `POST /memories/graph/analytics/betweenness`: betweenness centrality (requires GDS)
- `POST /memories/graph/analytics/subgraph`: ego-graph/subgraph extraction
Configuration (database_conf.py, configuration/__init__.py):
- New Neo4j knobs: `gds_enabled`, `gds_default_damping_factor`, `gds_default_max_iterations`, `pagerank_auto_enabled`, `pagerank_trigger_threshold`, `dedup_trigger_threshold`, `dedup_embedding_threshold`, `dedup_property_threshold`, `dedup_auto_merge`
- New semantic memory knob: `related_to_threshold`
Migration utilities (neo4j_migration.py):
- One-shot helpers for upgrading existing Neo4j databases: `audit_duplicate_uids`, `resolve_duplicate_uids`, `apply_uniqueness_constraints`, `backfill_entity_type_labels`
Documentation (docs/open_source/graph.mdx + four experiment pages):
- Updated graph capability overview and four experimental result pages comparing baseline vector search against graph-enriched search
Bruno collection (tools/bruno/):
- Full end-to-end Bruno API collection covering health, ingestion, standard search, graph search, graph analytics, relationship CRUD, and deduplication flows across 7 folders
Dependencies: No new runtime Python dependencies. The Neo4j GDS plugin is optional; all non-analytics endpoints work with a vanilla Neo4j instance.
Fixes/Closes
Fixes #(issue number)
Type of change
- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Refactor (does not change functionality, e.g., code style improvements, linting)
- Documentation update
- Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
- Security (improves security without changing functionality)
How Has This Been Tested?
- Unit Test
- Integration Test
- End-to-end Test
- Test Script (please provide)
- Manual verification (list step-by-step instructions)
| Test file | What it covers |
|---|---|
| `test_neo4j_knowledge_graph.py` | Entity types, `RELATED_TO` edge creation, traversal |
| `test_neo4j_knowledge_graph_integration.py` | Integration against a live Neo4j instance |
| `test_neo4j_pagerank_pipeline.py` | Background PageRank pipeline |
| `test_neo4j_shortest_path.py` | Shortest-path queries |
| `test_neo4j_subgraph_extraction.py` | Ego-graph / subgraph extraction |
| `test_neo4j_degree_centrality.py` | Degree centrality |
| `test_neo4j_betweenness_centrality.py` | Betweenness centrality (GDS) |
| `test_neo4j_cross_collection_traversal.py` | Cross-collection traversal |
| `test_neo4j_gds_refinements.py` | GDS edge cases and path-quality scoring |
| `test_neo4j_graph_stats.py` | Graph stats endpoint |
| `test_graph_data_types.py` | Data type unit tests |
| `test_episode_dedup.py` | Episode content-hash deduplication |
| `test_neo4j_feature_relationships_integration.py` | Semantic relationship storage |
| `test_neo4j_graph_relationships_integration.py` | Graph relationship integration |
| `test_neo4j_utils.py` | Neo4j utility helpers |
| `test_semantic_memory_graph_enrichment.py` | Semantic memory graph enrichment |
| `test_semantic_prompt_template.py` | Prompt template |
| `test_declarative_memory_entity_types.py` | Entity type filtering in declarative memory |
| `test_declarative_memory_graph_search.py` | Graph-assisted declarative search |
| `test_graph_router.py` | REST API graph router unit tests |
| `test_graph_integration.py` | REST API graph integration tests |
Test Results: All unit tests pass locally. Integration tests require a running Neo4j instance. GDS analytics tests additionally require the Neo4j GDS plugin.
Checklist
- I have signed the commit(s) within this pull request
- My code follows the style guidelines of this project (See STYLE_GUIDE.md)
- I have performed a self-review of my own code
- I have commented my code
- I have made corresponding changes to the documentation
- My changes generate no new warnings
- I have added unit tests that prove my fix is effective or that my feature works
- New and existing unit tests pass locally with my changes
- Any dependent changes have been merged and published in downstream modules
- I have checked my code and corrected any misspellings
Maintainer Checklist
- Confirmed all checks passed
- Contributor has signed the commit(s)
- Reviewed the code
- Run, Tested, and Verified the change(s) work as expected
Screenshots/Gifs
N/A
Further comments
- GDS analytics endpoints (`/analytics/pagerank`, `/analytics/communities`, `/analytics/betweenness`) require the Neo4j Graph Data Science plugin and `gds_enabled: true` in the Neo4j configuration. All other graph endpoints work with a standard Neo4j instance.
- The episode content-hash deduplication migration runs automatically on startup and is safe to apply to existing PostgreSQL and SQLite databases.
- `RELATED_TO` edges with `similarity >= 0.99` are suppressed to avoid near-duplicate noise between identical or near-identical semantic features.
- Path-quality scoring is applied during multi-hop traversal: a result at hop distance d receives a score of `score_decay^d` (default `score_decay = 0.7`), and paths crossing low-similarity `RELATED_TO` edges are penalised via the `path_quality` field on `MultiHopResult`.
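The hop-distance part of the scoring can be sketched as a breadth-first traversal over an in-memory adjacency map. This is a simplified illustration, not the Neo4j traversal in this PR: the function name and the `max_hops` parameter are hypothetical, and the `path_quality` penalty is not modelled because its formula is not given here.

```python
from collections import deque


def multi_hop_scores(adjacency, anchor, max_hops=2, score_decay=0.7):
    """Breadth-first traversal returning {node: score_decay ** hop_distance}.

    The anchor itself sits at distance 0 and therefore scores 1.0.
    Each node keeps the score of its shortest path from the anchor.
    """
    scores = {anchor: 1.0}
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in scores:  # first visit is the shortest path in BFS
                scores[neighbor] = score_decay ** (depth + 1)
                queue.append((neighbor, depth + 1))
    return scores


# Toy graph based on the example from the purpose section.
graph = {"Bob": ["TensorFlow"], "TensorFlow": ["Project Atlas"]}
scores = multi_hop_scores(graph, "Bob")
```

With the default decay, a one-hop neighbour scores 0.7 and a two-hop neighbour scores about 0.49, so closer results rank higher before any path-quality adjustment.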