# SERF

SERF is an open-source framework for semantic entity resolution — identifying when two or more records refer to the same real-world entity using large language models, sentence embeddings, and agentic AI.
SERF runs multiple rounds of entity resolution until the dataset converges to a stable state, with DSPy agents controlling all phases dynamically.
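A minimal sketch of the convergence loop described above, where a toy `resolve_once` stands in for one full SERF pass (block → match → merge) — the function names here are illustrative, not SERF's actual API:

```python
# Hypothetical sketch of iterative resolution until convergence.
# `resolve_once` is a toy stand-in: it merges records sharing a
# normalized name. The real pass runs blocking, matching, and merging.

def resolve_once(records: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for r in records:
        key = r["name"].strip().lower()
        merged.setdefault(key, r)  # keep the first record per key
    return list(merged.values())

def resolve_until_stable(records: list[dict], max_iterations: int = 10) -> list[dict]:
    """Repeat resolution rounds until the record count stops shrinking."""
    for _ in range(max_iterations):
        resolved = resolve_once(records)
        if len(resolved) == len(records):  # converged: no new merges
            return resolved
        records = resolved
    return records
```

In the real pipeline the convergence decision is made by a DSPy agent rather than a simple count comparison.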
## Features
### Phase 0 — Agentic Control
DSPy ReAct agents dynamically orchestrate the entire pipeline, adjusting blocking parameters, selecting matching strategies, and deciding when convergence is reached.
### Phase 1 — Semantic Blocking
Clusters records using multilingual-e5-base sentence embeddings and FAISS IVF to create efficient blocks for comparison. Auto-scales block size across iterations.
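The core idea can be sketched as follows. This is an illustration only, assuming precomputed embedding vectors: the real pipeline uses sentence-transformers and FAISS `IndexIVFFlat`, while this sketch substitutes a plain NumPy k-means so the clustering step is visible end to end (`semantic_blocks` is a hypothetical name, not SERF's API):

```python
# Illustrative embedding-space blocking via a tiny k-means loop.
# Each resulting cluster is one "block" of candidate matches.
import numpy as np

def semantic_blocks(embeddings: np.ndarray, n_blocks: int, n_iter: int = 10) -> list[list[int]]:
    """Cluster record embeddings into blocks of candidate matches."""
    rng = np.random.default_rng(0)
    centroids = embeddings[rng.choice(len(embeddings), n_blocks, replace=False)]
    for _ in range(n_iter):
        # Assign each record to its nearest centroid (one block per centroid).
        dists = np.linalg.norm(embeddings[:, None] - centroids[None, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids from their members.
        for k in range(n_blocks):
            members = embeddings[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return [np.flatnonzero(assign == k).tolist() for k in range(n_blocks)]
```

FAISS IVF performs essentially this assignment at scale, with an inverted index over the centroids instead of a dense distance matrix.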
### Phase 2 — Schema Alignment, Matching and Merging
All three operations happen in a single LLM prompt, expressed as DSPy signatures with the BAMLAdapter for structured output formatting. Block-level matching lets the LLM see all records in a block simultaneously and make holistic decisions.
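The structured output of such a block-level decision might be shaped like the sketch below. These class and field names are hypothetical, chosen only to illustrate what "all three operations in one prompt" returns — they are not SERF's actual schema:

```python
# Hypothetical shape of a block-level resolution result: the LLM sees every
# record in a block at once and returns a schema mapping, match groups, and
# one merged record per group.
from dataclasses import dataclass, field

@dataclass
class MatchGroup:
    record_ids: list[str]   # records judged to be the same entity
    merged: dict[str, str]  # merged canonical field values
    confidence: float       # LLM-reported confidence

@dataclass
class BlockResolution:
    schema_mapping: dict[str, str] = field(default_factory=dict)  # aligned field names
    groups: list[MatchGroup] = field(default_factory=list)

def singletons(resolution: BlockResolution, all_ids: set[str]) -> set[str]:
    """Records the LLM left unmatched within the block."""
    matched = {rid for g in resolution.groups for rid in g.record_ids}
    return all_ids - matched
```

With the BAMLAdapter, a typed result like this is parsed directly from the LLM's structured output rather than assembled by hand.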
### Phase 3 — Edge Resolution
For knowledge graphs: merging nodes can leave duplicate edges behind; SERF deduplicates and merges them with LLM guidance.
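The underlying problem can be sketched deterministically. When two nodes merge, edges whose endpoints are remapped can collapse into duplicates; the sketch below dedupes by `(source, target, type)`, whereas SERF additionally uses an LLM to merge the duplicates' properties (function and field names here are illustrative):

```python
# Hypothetical edge-resolution sketch: remap endpoints through the node
# merge map, then keep one edge per (src, dst, type) key.

def resolve_edges(edges: list[dict], node_map: dict[str, str]) -> list[dict]:
    """Remap edge endpoints after node merging, then deduplicate."""
    seen: dict[tuple, dict] = {}
    for e in edges:
        src = node_map.get(e["src"], e["src"])
        dst = node_map.get(e["dst"], e["dst"])
        key = (src, dst, e["type"])
        if key not in seen:  # first edge wins in this toy version
            seen[key] = {**e, "src": src, "dst": dst}
    return list(seen.values())
```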
## Architecture
| Component | Technology |
|---|---|
| Package Manager | uv |
| Data Processing | PySpark 4.x |
| LLM Framework | DSPy 3.x with BAMLAdapter |
| Embeddings | multilingual-e5-base via sentence-transformers |
| Vector Search | FAISS IndexIVFFlat |
| Linting/Formatting | Ruff |
| Type Checking | zuban (mypy-compatible) |
## Quick Start

### Installation
```bash
git clone https://github.com/Graphlet-AI/serf.git
cd serf
uv sync --extra dev
```

### Docker

```bash
# Build
docker compose build

# Run any serf command
docker compose run serf benchmark --dataset dblp-acm

# Run benchmarks
docker compose --profile benchmark up

# Run tests
docker compose --profile test up

# Analyze a dataset (put your file in data/)
docker compose run serf analyze --input data/input.csv --output data/er_config.yml
```
Set your API key in a .env file or export it:
```bash
echo "GEMINI_API_KEY=your-key" > .env
```
## System Requirements
- Python 3.12+
- Java 11/17/21 (for PySpark)
- 4GB+ RAM recommended
## CLI Usage

```bash
# Profile a dataset
serf analyze --input data/companies.parquet

# Run the full ER pipeline
serf resolve --input data/entities.csv --output data/resolved/ --iteration 1

# Run individual phases
serf block --input data/entities.csv --output data/blocks/ --method semantic
serf match --input data/blocks/ --output data/matches/ --iteration 1
serf eval --input data/matches/

# Benchmark against standard datasets
serf download --dataset dblp-acm
serf benchmark --dataset dblp-acm --output data/results/
```
## Python API

```python
from serf.block.pipeline import SemanticBlockingPipeline
from serf.match.matcher import EntityMatcher
from serf.eval.metrics import evaluate_resolution

# Block
pipeline = SemanticBlockingPipeline(target_block_size=50)
blocks, metrics = pipeline.run(entities)

# Match
matcher = EntityMatcher(model="gemini/gemini-2.0-flash")
resolutions = await matcher.resolve_blocks(blocks)

# Evaluate
metrics = evaluate_resolution(predicted_pairs, ground_truth_pairs)
```
## DSPy Interface

```python
import dspy

from serf.dspy.signatures import BlockMatch
from serf.dspy.baml_adapter import BAMLAdapter

lm = dspy.LM("gemini/gemini-2.0-flash", api_key=GEMINI_API_KEY)
dspy.configure(lm=lm, adapter=BAMLAdapter())

matcher = dspy.ChainOfThought(BlockMatch)
result = matcher(block_records=block_json, schema_info=schema, few_shot_examples=examples)
```
## Benchmark Results
Performance on standard ER benchmarks from the Leipzig Database Group. Blocking uses multilingual-e5-base name-only embeddings + FAISS IVF. Matching uses Gemini 2.0 Flash via DSPy BlockMatch.
| Dataset | Domain | Left | Right | Matches | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| DBLP-ACM | Bibliographic | 2,616 | 2,294 | 2,224 | 0.8849 | 0.5809 | 0.7014 |
Blocking uses name-only embeddings for tighter semantic clusters. All matching decisions are made by the LLM — no embedding similarity thresholds.
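As a sanity check on the table above, pairwise F1 is the harmonic mean of precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of pairwise precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# DBLP-ACM row from the table above
print(round(f1_score(0.8849, 0.5809), 4))  # → 0.7014
```

Here precision and recall are computed over record *pairs*: a predicted match pair counts as a true positive only if it appears in the ground-truth pairs.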
## Project Structure

```
src/serf/
├── cli/        # Click CLI commands
├── dspy/       # DSPy types, signatures, agents, adapter
├── block/      # Semantic blocking (embeddings, FAISS, normalization)
├── match/      # UUID mapping, LLM matching, few-shot examples
├── merge/      # Field-level entity merging
├── edge/       # Edge resolution for knowledge graphs
├── eval/       # Metrics, benchmark datasets
├── analyze/    # Dataset profiling, field detection
├── spark/      # PySpark schemas, utils, Iceberg, graph components
├── config.py   # Configuration management
└── logs.py     # Logging
```
## Configuration
All configuration is centralized in config.yml:
```python
from serf.config import config

model = config.get("models.llm")  # "gemini/gemini-2.0-flash"
block_size = config.get("er.blocking.target_block_size")  # 50
```
## Development

```bash
# Install dependencies
uv sync

# Run tests
uv run pytest tests/

# Lint and format
uv run ruff check --fix src tests
uv run ruff format src tests

# Type check
uv run zuban check src tests

# Pre-commit hooks
pre-commit install
pre-commit run --all-files
```
## References
- Jurney, R. (2024). "The Rise of Semantic Entity Resolution." Towards Data Science.
- Khattab, O. et al. (2024). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR 2024.
- Li, Y. et al. (2021). "Ditto: A Simple and Efficient Entity Matching Framework." VLDB 2021.
- Mudgal, S. et al. (2018). "Deep Learning for Entity Matching: A Design Space Exploration." SIGMOD 2018.
- Papadakis, G. et al. (2020). "Blocking and Filtering Techniques for Entity Resolution: A Survey." ACM Computing Surveys.
## License
Apache License 2.0. See LICENSE for details.