Improve graph: entity types, traversal, ingestion pipeline, REST API, tests, and scoring by ChristianKniep · Pull Request #1139 · MemMachine/MemMachine

Purpose of the change

Adds a knowledge graph layer to MemMachine on top of the existing Neo4j vector store, enabling multi-hop relationship traversal, entity-typed nodes, semantic feature relationships, graph analytics, and node deduplication. This lets the system answer queries that require following connections across memories rather than relying solely on vector similarity scoring. For example, it can connect Bob to Project Atlas by traversing two facts: Bob is a TensorFlow expert, and Project Atlas uses TensorFlow.

Description

This PR introduces end-to-end knowledge graph capabilities across the storage, application, and API layers:

Graph infrastructure (neo4j_vector_graph_store.py, data_types.py, graph_traversal_store.py):

RELATED_TO edges are created between semantically similar features during ingestion, controlled by a configurable cosine-similarity threshold (related_to_threshold, default 0.70)
Entity-type labels are applied to nodes (ENTITY_TYPE_Person, ENTITY_TYPE_Concept, ENTITY_TYPE_Event, etc.) and exposed as a filter parameter on the search API
Multi-hop traversal, graph-filtered vector search, shortest-path queries, and subgraph (ego-graph) extraction
GDS-powered analytics: PageRank, Louvain community detection, degree centrality, betweenness centrality
Background node deduplication with configurable SAME_AS proposals and merge/dismiss resolution via the API
Near-duplicate RELATED_TO edge suppression at similarity >= 0.99 to avoid noise
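
As a rough illustration of how the two similarity bounds above interact, the following sketch decides whether a pair of feature embeddings would get a RELATED_TO edge. The function names and the inclusive lower bound are assumptions made for this example; only the 0.70 default and the >= 0.99 suppression cutoff come from this PR.

```python
import math

RELATED_TO_THRESHOLD = 0.70   # default related_to_threshold from this PR
NEAR_DUPLICATE_CUTOFF = 0.99  # edges at or above this similarity are suppressed


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_link(a, b, threshold=RELATED_TO_THRESHOLD, cutoff=NEAR_DUPLICATE_CUTOFF):
    """Hypothetical helper: True if the pair is similar enough to link
    but not so similar that the edge would be near-duplicate noise.
    Assumption: the lower bound is inclusive."""
    similarity = cosine_similarity(a, b)
    return threshold <= similarity < cutoff
```

With these bounds, identical embeddings are not linked, orthogonal embeddings are not linked, and only the band in between produces an edge.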

Semantic ingestion pipeline (semantic_ingestion.py, semantic_relationship_storage.py, feature_relationship_types.py):

After each ingestion cycle, semantic features are cross-linked with typed edges: RELATED_TO, CONTRADICTS, IMPLIES, SUPERSEDES
A new SemanticRelationshipStorage protocol exposes relationship CRUD and contradiction detection
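
To make the typed edges and contradiction detection concrete, here is a minimal in-memory sketch. `RelationshipType`, `FeatureEdge`, and `find_contradictions` are hypothetical names for this example and are not the actual `SemanticRelationshipStorage` protocol; treating CONTRADICTS as symmetric is also an assumption.

```python
from enum import Enum
from typing import NamedTuple


class RelationshipType(str, Enum):
    # The four edge types named in this PR.
    RELATED_TO = "RELATED_TO"
    CONTRADICTS = "CONTRADICTS"
    IMPLIES = "IMPLIES"
    SUPERSEDES = "SUPERSEDES"


class FeatureEdge(NamedTuple):
    # Hypothetical edge record for illustration only.
    source_id: str
    target_id: str
    rel_type: RelationshipType


def find_contradictions(edges, feature_ids):
    """Return the unique CONTRADICTS pairs whose endpoints are both in feature_ids.

    Assumption: a contradiction is symmetric, so (a, b) and (b, a)
    collapse into a single sorted pair.
    """
    wanted = set(feature_ids)
    pairs = set()
    for edge in edges:
        if (
            edge.rel_type is RelationshipType.CONTRADICTS
            and edge.source_id in wanted
            and edge.target_id in wanted
        ):
            pairs.add(tuple(sorted((edge.source_id, edge.target_id))))
    return sorted(pairs)
```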

Episode store deduplication (episode_sqlalchemy_store.py):

Adds a content_hash column (SHA-256 of session_key + producer_id + content) with ON CONFLICT DO NOTHING upsert on both PostgreSQL and SQLite
Includes an online migration that backfills existing rows and adds the unique constraint idempotently on startup
Episode.is_new flag allows callers to distinguish newly inserted episodes from deduplicated returns
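
The deduplication flow above can be sketched with the standard-library `sqlite3` module. The table layout, the helper names, and the NUL separator used when joining the three fields are assumptions for this example; the PR only states that the hash is a SHA-256 over session_key, producer_id, and content, and that inserts use ON CONFLICT DO NOTHING.

```python
import hashlib
import sqlite3


def content_hash(session_key, producer_id, content):
    # Assumption: fields are joined with a NUL separator so that
    # ("ab", "c") and ("a", "bc") cannot produce the same digest.
    payload = "\x00".join((session_key, producer_id, content))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_store():
    """Create an in-memory table with a unique content_hash column (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE episodes ("
        "id INTEGER PRIMARY KEY, session_key TEXT, producer_id TEXT, "
        "content TEXT, content_hash TEXT UNIQUE)"
    )
    return conn


def insert_episode(conn, session_key, producer_id, content):
    """Insert an episode; return True if it was new, False if deduplicated."""
    digest = content_hash(session_key, producer_id, content)
    before = conn.total_changes
    conn.execute(
        "INSERT INTO episodes (session_key, producer_id, content, content_hash) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(content_hash) DO NOTHING",
        (session_key, producer_id, content, digest),
    )
    # total_changes only grows when a row was actually written,
    # which plays the role of the is_new flag here.
    return conn.total_changes > before
```

Calling `insert_episode` twice with the same arguments leaves a single row in the table; the second call reports that the episode was not new.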

REST API (graph_router.py, ~1,900 lines) — new /memories/graph route group:

POST /memories/graph/search/multi-hop — multi-hop traversal from an anchor node
POST /memories/graph/search/filtered — graph-filtered vector similarity search
POST /memories/graph/relationships — create typed feature relationships
POST /memories/graph/relationships/get — query relationships with direction and confidence filters
POST /memories/graph/relationships/delete — delete a specific relationship
POST /memories/graph/contradictions — find all CONTRADICTS pairs within a feature set
POST /memories/graph/dedup/proposals — list duplicate node proposals
POST /memories/graph/dedup/resolve — merge or dismiss duplicate pairs
POST /memories/graph/analytics/pagerank — compute PageRank (requires GDS)
POST /memories/graph/analytics/communities — Louvain community detection (requires GDS)
POST /memories/graph/analytics/stats — graph statistics (node/edge counts, degree, type distribution)
POST /memories/graph/analytics/shortest-path — shortest path between two nodes
POST /memories/graph/analytics/degree-centrality — degree centrality ranking
POST /memories/graph/analytics/betweenness — betweenness centrality (requires GDS)
POST /memories/graph/analytics/subgraph — ego-graph/subgraph extraction
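
Only three of the routes above are marked as requiring GDS. A client or test harness could encode that split roughly as follows; `requires_gds` and `is_available` are hypothetical helpers for this example, not functions from graph_router.py.

```python
GRAPH_PREFIX = "/memories/graph"

# The three routes this PR marks as "requires GDS".
GDS_REQUIRED = frozenset(
    {
        "/analytics/pagerank",
        "/analytics/communities",
        "/analytics/betweenness",
    }
)


def requires_gds(path):
    """True if the route (with or without the /memories/graph prefix) needs GDS."""
    suffix = path[len(GRAPH_PREFIX):] if path.startswith(GRAPH_PREFIX) else path
    return suffix in GDS_REQUIRED


def is_available(path, gds_enabled):
    """True if the route can be served given the gds_enabled setting."""
    return gds_enabled or not requires_gds(path)
```

For instance, an integration test suite could use `is_available` to skip the PageRank, community, and betweenness cases when running against a vanilla Neo4j instance.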

Configuration (database_conf.py, configuration/__init__.py):

New Neo4j knobs: gds_enabled, gds_default_damping_factor, gds_default_max_iterations, pagerank_auto_enabled, pagerank_trigger_threshold, dedup_trigger_threshold, dedup_embedding_threshold, dedup_property_threshold, dedup_auto_merge
New semantic memory knob: related_to_threshold

Migration utilities (neo4j_migration.py):

One-shot helpers for upgrading existing Neo4j databases: audit_duplicate_uids, resolve_duplicate_uids, apply_uniqueness_constraints, backfill_entity_type_labels

Documentation (docs/open_source/graph.mdx + four experiment pages):

Updated graph capability overview and four experimental result pages comparing baseline vector search against graph-enriched search

Bruno collection (tools/bruno/):

Full end-to-end Bruno API collection covering health, ingestion, standard search, graph search, graph analytics, relationship CRUD, and deduplication flows across 7 folders

Dependencies: No new runtime Python dependencies. The Neo4j GDS plugin is optional: only the PageRank, community-detection, and betweenness-centrality endpoints require it, and all other endpoints work with a vanilla Neo4j instance.

Fixes/Closes

Fixes #(issue number)

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactor (does not change functionality, e.g., code style improvements, linting)
Documentation update
Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
Security (improves security without changing functionality)

How Has This Been Tested?

Unit Test
Integration Test
End-to-end Test
Test Script (please provide)
Manual verification (list step-by-step instructions)

| Test file | What it covers |
| --- | --- |
| `test_neo4j_knowledge_graph.py` | Entity types, `RELATED_TO` edge creation, traversal |
| `test_neo4j_knowledge_graph_integration.py` | Integration against a live Neo4j instance |
| `test_neo4j_pagerank_pipeline.py` | Background PageRank pipeline |
| `test_neo4j_shortest_path.py` | Shortest-path queries |
| `test_neo4j_subgraph_extraction.py` | Ego-graph / subgraph extraction |
| `test_neo4j_degree_centrality.py` | Degree centrality |
| `test_neo4j_betweenness_centrality.py` | Betweenness centrality (GDS) |
| `test_neo4j_cross_collection_traversal.py` | Cross-collection traversal |
| `test_neo4j_gds_refinements.py` | GDS edge cases and path-quality scoring |
| `test_neo4j_graph_stats.py` | Graph stats endpoint |
| `test_graph_data_types.py` | Data type unit tests |
| `test_episode_dedup.py` | Episode content-hash deduplication |
| `test_neo4j_feature_relationships_integration.py` | Semantic relationship storage |
| `test_neo4j_graph_relationships_integration.py` | Graph relationship integration |
| `test_neo4j_utils.py` | Neo4j utility helpers |
| `test_semantic_memory_graph_enrichment.py` | Semantic memory graph enrichment |
| `test_semantic_prompt_template.py` | Prompt template |
| `test_declarative_memory_entity_types.py` | Entity type filtering in declarative memory |
| `test_declarative_memory_graph_search.py` | Graph-assisted declarative search |
| `test_graph_router.py` | REST API graph router unit tests |
| `test_graph_integration.py` | REST API graph integration tests |

Test Results: All unit tests pass locally. Integration tests require a running Neo4j instance. GDS analytics tests additionally require the Neo4j GDS plugin.

Checklist

I have signed the commit(s) within this pull request
My code follows the style guidelines of this project (See STYLE_GUIDE.md)
I have performed a self-review of my own code
I have commented my code
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Maintainer Checklist

Confirmed all checks passed
Contributor has signed the commit(s)
Reviewed the code
Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

GDS analytics endpoints (/analytics/pagerank, /analytics/communities, /analytics/betweenness) require the Neo4j Graph Data Science plugin and gds_enabled: true in the Neo4j configuration. All other graph endpoints work with a standard Neo4j instance.
The episode content-hash deduplication migration runs automatically on startup and is safe to apply to existing PostgreSQL and SQLite databases.
RELATED_TO edges with similarity >= 0.99 are suppressed to avoid near-duplicate noise between identical or near-identical semantic features.
Path-quality scoring is applied during multi-hop traversal: a result at hop distance d receives a score of score_decay^d (default score_decay = 0.7), and paths crossing low-similarity RELATED_TO edges are penalised via the path_quality field on MultiHopResult.
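
For reference, the hop-distance decay described above works out as in this small sketch. `hop_score` is a hypothetical helper name; the formula score_decay^d and the 0.7 default are from this PR, while the path_quality penalty is not modelled here because its exact formula is not given in this description.

```python
def hop_score(distance, score_decay=0.7):
    """Decay applied to a result found `distance` hops from the anchor node."""
    if distance < 0:
        raise ValueError("hop distance must be non-negative")
    return score_decay ** distance


# Decay factors for the first few hop distances with the default score_decay.
decay_by_hop = {d: round(hop_score(d), 3) for d in range(4)}
```

With the default decay, the anchor itself keeps a factor of 1.0, a one-hop result gets 0.7, a two-hop result 0.49, and a three-hop result 0.343, so distant results need a correspondingly stronger match to outrank nearby ones.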