Cleora AI is a general-purpose open-source model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data. Created by the Synerise.com team.

The Graph Embedding Engine

Cleora computes all possible random walks in a single matrix multiplication.
No negative sampling. No GPU. No noise. Just fast, deterministic, production-grade embeddings.


pip install pycleora

#1 Accuracy. Every Dataset.
Tested on 5 canonical academic datasets against 7 competing algorithms — Cleora wins on accuracy on every single dataset,
and is the only algorithm that scales to every graph without crashing.

240x Faster Than GraphSAGE · 50x Less Memory Than NetMF · ~5 MB Install · 0 GPUs Required

Achievements

1️⃣st place at SIGIR eCom Challenge 2020

2️⃣nd place and Best Paper Award at WSDM Booking.com Challenge 2021

2️⃣nd place at Twitter Recsys Challenge 2021

3️⃣rd place at KDD Cup 2021

Installation

pip install pycleora

Optional extras:

pip install pycleora[viz]       # matplotlib for visualization
pip install pycleora[full]      # matplotlib + networkx + tqdm

Quick Start

from pycleora import SparseMatrix, embed, find_most_similar

edges = ["alice item_laptop", "alice item_mouse", "bob item_keyboard"]
graph = SparseMatrix.from_iterator(iter(edges), "complex::reflexive::product")

embeddings = embed(graph, feature_dim=256, num_iterations=40)

similar = find_most_similar(graph, embeddings, "alice", top_k=5)
for r in similar:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")

embed() defaults to feature_dim=256, num_iterations=40, and whitening after every propagation step.
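
For intuition about what a similarity lookup like the one above involves, here is a minimal NumPy sketch of cosine-similarity top-k search. It is not pycleora's implementation: the function name `top_k_cosine` and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def top_k_cosine(entity_ids, embeddings, query_id, top_k=5):
    """Return up to top_k (entity_id, cosine similarity) pairs, excluding the query."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_idx = entity_ids.index(query_id)
    scores = unit @ unit[query_idx]
    # Sort by descending similarity and drop the query itself.
    order = [i for i in np.argsort(-scores) if i != query_idx][:top_k]
    return [(entity_ids[i], float(scores[i])) for i in order]

# Hypothetical toy data: three entities in a 2-dimensional space.
ids = ["alice", "bob", "carol"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
neighbors = top_k_cosine(ids, vecs, "alice", top_k=2)
for entity_id, similarity in neighbors:
    print(f"{entity_id}: {similarity:.4f}")
```

A brute-force scan like this is fine for small entity sets; larger ones usually call for an approximate nearest-neighbor index.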

Step-by-Step Example

The high-level embed() function wraps the Markov propagation loop for convenience. Here's the full manual version, which gives you complete control over the process:

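from pycleora import SparseMatrix, whiten_embeddings
import numpy as np
import pandas as pd
import random

# Generate example data: 100 random customer-product interactions
customers = [f"Customer_{i}" for i in range(1, 20)]
products = [f"Product_{j}" for j in range(1, 20)]

data = {
    "customer": random.choices(customers, k=100),
    "product": random.choices(products, k=100),
}

df = pd.DataFrame(data)
# Build one hyperedge per customer: the list of products they interacted with
customer_products = df.groupby('customer')['product'].apply(list).values
# Convert each hyperedge to Cleora's whitespace-separated input format
cleora_input = map(lambda x: ' '.join(x), customer_products)

# Build the sparse Markov transition matrix for the hypergraph
mat = SparseMatrix.from_iterator(cleora_input, columns='complex::reflexive::product')

# Entity ids in the matrix; row i of the embeddings belongs to mat.entity_ids[i]
print(mat.entity_ids)

# Built-in deterministic initialization with 256 dimensions
embeddings = mat.initialize_deterministically(256)

NUM_ITERATIONS = 40

# Each iteration: propagate, L2-normalize every row, then whiten
for i in range(NUM_ITERATIONS):
    embeddings = mat.left_markov_propagate(embeddings)
    embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
    embeddings = whiten_embeddings(embeddings)

# Print every entity with its embedding vector
for entity, embedding in zip(mat.entity_ids, embeddings):
    print(entity, embedding)

# Dot product of the first two embeddings as a similarity score
print(np.dot(embeddings[0], embeddings[1]))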

CLI

pycleora embed --input graph.tsv --output embeddings.npz --dim 256 --iterations 40
pycleora info --input graph.tsv
pycleora similar --input graph.tsv --entity alice --top-k 10
pycleora benchmark --dataset karate_club

Key Advantages

No Negative Sampling

Unlike DeepWalk, Node2Vec, and LINE, Cleora doesn't approximate random walks with negative sampling. It computes all walks exactly via matrix multiplication. Less noise, higher accuracy, perfect reproducibility.
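
The claim that matrix multiplication covers all walks can be checked on a toy graph. The sketch below uses plain NumPy rather than Cleora itself: it compares the k-th power of a row-normalized transition matrix with a brute-force sum over every individual walk. The 4-node graph and the helper name are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical 4-node undirected toy graph as an adjacency matrix.
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-normalize so each row is a probability distribution over neighbors.
transition = adjacency / adjacency.sum(axis=1, keepdims=True)

def walk_probability_by_enumeration(start, end, length):
    """Sum the probability of every individual walk from start to end."""
    n = transition.shape[0]
    total = 0.0
    for middle in itertools.product(range(n), repeat=length - 1):
        path = (start, *middle, end)
        prob = 1.0
        for a, b in zip(path, path[1:]):
            prob *= transition[a, b]
        total += prob
    return total

steps = 3
by_matrix = np.linalg.matrix_power(transition, steps)
by_enumeration = np.array([
    [walk_probability_by_enumeration(i, j, steps) for j in range(4)]
    for i in range(4)
])
print(np.allclose(by_matrix, by_enumeration))
```

Enumeration grows exponentially with walk length, while the matrix route costs one multiplication per step. Sampling-based methods approximate the same quantity by drawing a subset of walks.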

240x Faster Than GraphSAGE

Zomato reported embedding generation in under 5 minutes with Cleora, compared to 20 hours with GraphSAGE on the same dataset. Rust core with adaptive parallelism makes every CPU cycle count.

Deterministic Embeddings

Same input always produces the same output. No random seeds, no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.
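
One common way to get seed-free determinism is to derive each initial value from a hash of the entity identifier. The sketch below illustrates that idea only. The function `hash_init` is hypothetical, and this is not a description of what initialize_deterministically does internally.

```python
import hashlib
import numpy as np

def hash_init(entity_ids, dim):
    """Map each (entity, dimension) pair to a reproducible value in [-1, 1]."""
    out = np.empty((len(entity_ids), dim), dtype=np.float32)
    for row, entity in enumerate(entity_ids):
        for col in range(dim):
            digest = hashlib.sha256(f"{entity}:{col}".encode("utf-8")).digest()
            # Interpret the first 8 bytes as an unsigned integer, scaled to [0, 1).
            value = int.from_bytes(digest[:8], "big") / 2**64
            out[row, col] = 2.0 * value - 1.0
    return out

first = hash_init(["alice", "bob"], dim=4)
second = hash_init(["bob", "alice"], dim=4)
# The same entity gets the same vector regardless of run or input order.
print(np.array_equal(first[0], second[1]))
```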

Heterogeneous Hypergraphs

Natively handles multi-type nodes and edges, bipartite graphs, and hypergraphs. TSV input with typed columns like complex::reflexive::product. No graph preprocessing needed.
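
To make the hypergraph idea concrete, the sketch below expands each whitespace-separated line into pairwise co-occurrences (a clique expansion) using only the standard library. This is one plausible reading of how a reflexive column relates entities to each other. Cleora's actual edge weighting may differ, and the sample lines are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one hyperedge per line, entities separated by whitespace.
lines = [
    "item_laptop item_mouse item_keyboard",
    "item_mouse item_keyboard",
]

cooccurrence = Counter()
for line in lines:
    entities = sorted(set(line.split()))
    # Clique expansion: every unordered pair in the hyperedge co-occurs once.
    for a, b in combinations(entities, 2):
        cooccurrence[(a, b)] += 1

for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```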

~5 MB, Zero Dependencies

The entire library is ~5 MB. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.

Stable & Inductive

Embeddings are stable across runs and support inductive learning: new nodes can be embedded without retraining the entire graph. Production-ready from day one.
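
Because a propagation step replaces a node's vector with the normalized average of its neighbors, a previously unseen entity can be placed in the same space from its neighbors alone. The following is a minimal sketch of that idea under stated assumptions: the helper `embed_new_entity` and the 3-dimensional vectors are hypothetical, and whitening is left out.

```python
import numpy as np

def embed_new_entity(neighbor_vectors):
    """Average the neighbors' embeddings and rescale to unit L2 norm."""
    mean = np.mean(neighbor_vectors, axis=0)
    return mean / np.linalg.norm(mean)

# Hypothetical trained embeddings for three known products.
known = {
    "item_laptop": np.array([0.6, 0.8, 0.0]),
    "item_mouse": np.array([0.8, 0.6, 0.0]),
    "item_desk": np.array([0.0, 0.0, 1.0]),
}

# A new entity that co-occurred with the laptop and the mouse.
new_vector = embed_new_entity([known["item_laptop"], known["item_mouse"]])
print(new_vector)
```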

Supported Algorithms

Algorithm	Type	Description
Cleora	Spectral / Random Walk	Iterative Markov propagation with per-iteration whitening — all random walks in one matrix multiplication
ProNE	Spectral	Fast spectral propagation with Chebyshev polynomial approximation
RandNE	Random Projection	Gaussian random projection for very fast, approximate embeddings
NetMF	Matrix Factorization	Network Matrix Factorization — factorizes the DeepWalk matrix explicitly
DeepWalk	Random Walk	Classic random walk + skip-gram approach
Node2Vec	Random Walk	Biased random walks with tunable BFS/DFS exploration
HOPE	Matrix Factorization	High-Order Proximity preserved Embedding
GraRep	Matrix Factorization	Graph Representations with Global Structural Information
MLP	Neural Classifier	2-layer MLP classifier in pure numpy/scipy — no PyTorch needed

All algorithms are unified under a single API. Switch between methods by changing one parameter:

pycleora embed --input graph.tsv --output out.npz --algorithm cleora
pycleora embed --input graph.tsv --output out.npz --algorithm prone
pycleora embed --input graph.tsv --output out.npz --algorithm node2vec

Advanced Embedding Modes

Beyond the standard algorithms, Cleora supports several advanced embedding strategies:

Multiscale embeddings — concatenates embeddings from different iteration depths (e.g. scales [10, 20, 30, 40]) to capture both local and global graph structure simultaneously
Attention-weighted propagation — uses softmax-normalized dot-product attention during propagation, dynamically weighting neighbor contributions
Supervised refinement — fine-tunes unsupervised embeddings using positive/negative entity pairs with a triplet margin loss
Directed graph embeddings — handles asymmetric relationships where edge direction matters
Weighted graph embeddings — incorporates edge weights into the propagation step
Node feature integration — initializes embeddings with external features (text, image, numeric) before propagation
PCA whitening — built-in whitening after every iteration by default to decorrelate embedding dimensions and improve downstream task performance
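
As an illustration of the multiscale mode listed above, the sketch below takes snapshots of the embeddings at several iteration depths and concatenates them. It uses plain NumPy on a dense toy matrix, omits whitening for brevity, and the function names are hypothetical rather than pycleora's API.

```python
import numpy as np

def propagate(transition, embeddings):
    """One propagation step followed by row-wise L2 normalization."""
    out = transition @ embeddings
    return out / np.linalg.norm(out, axis=1, keepdims=True)

def multiscale(transition, initial, scales):
    """Concatenate snapshots taken after each iteration count in scales."""
    snapshots = []
    current = initial
    for step in range(1, max(scales) + 1):
        current = propagate(transition, current)
        if step in scales:
            snapshots.append(current)
    combined = np.concatenate(snapshots, axis=1)
    return combined / np.linalg.norm(combined, axis=1, keepdims=True)

# Hypothetical toy inputs: a 3-node path graph with self-loops, 4 dimensions.
adjacency = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
transition = adjacency / adjacency.sum(axis=1, keepdims=True)
initial = np.random.default_rng(0).uniform(-1, 1, size=(3, 4))
result = multiscale(transition, initial, scales=[1, 2, 4])
print(result.shape)  # (3, 12)
```

Early snapshots reflect close neighborhoods and later ones reflect wider context, which is why the concatenated vector carries both.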

Batteries Included

pycleora ships with a comprehensive set of built-in modules:

Module	What it does
`pycleora.community`	Community detection (Louvain)
`pycleora.classify`	MLP and Label Propagation classifiers — no PyTorch needed
`pycleora.sampling`	6 graph sampling methods
`pycleora.tuning`	Grid search and random search for hyperparameter tuning
`pycleora.compress`	Embedding compression (PQ, scalar quantization)
`pycleora.io_utils`	Save/load embeddings (NPZ, CSV, TSV), NetworkX conversion
`pycleora.viz`	Embedding visualization (UMAP, t-SNE projections)
`pycleora.metrics`	Evaluation metrics for embeddings
`pycleora.benchmark`	Compare algorithms with time, memory, and accuracy metrics
`pycleora.ensemble`	Combine embeddings from multiple algorithms
`pycleora.align`	Embedding alignment across graphs
`pycleora.search`	Nearest-neighbor entity search
`pycleora.stats`	Graph statistics and degree analysis
`pycleora.preprocess`	Graph preprocessing and filtering
`pycleora.hetero`	Heterogeneous graph utilities
`pycleora.generators`	Synthetic graph generators for testing
`pycleora.datasets`	Real-world benchmark datasets (Facebook, Cora, CiteSeer, PubMed, PPI, roadNet-CA, and more)

See the full API reference for details on every function and parameter.

Case Study: Zomato

From 20 hours to under 5 minutes — powering recommendations for 80M+ users across 500+ cities.

Zomato's ML team needed graph embeddings to power "People Like You" restaurant recommendations. Their initial approach with GraphSAGE took ~20 hours just to process customer-restaurant interaction data for a single city region — making it impossible to scale across 500+ cities.

Pipeline:

Customer-Restaurant Graph — Bipartite graph of customer orders and restaurant interactions
Cleora Embeddings (< 5 minutes) — 197x faster than DeepWalk, no sampling of positive/negative examples
EMDE Density Estimation — Customer preferences modeled as probability density functions
Production Recommendations — Restaurant recommendations, search ranking, dish suggestions, and "People Like You" lookalikes

Results:

Metric	Value
Speed vs DeepWalk	197x faster
Embedding generation	< 5 min
Cities scaled to	500+
GPUs required	0

Read the full Zomato blog post →

Benchmarks

Benchmarked against 7 competing algorithms on 5 real-world datasets (ego-Facebook, Cora, CiteSeer, PubMed, PPI) plus a 2M-node scale test. All datasets are genuine academic benchmarks from SNAP, Planetoid, and DGL. Cleora wins on accuracy on every single dataset.

Full interactive benchmark results at cleora.ai/benchmarks.

Classification Accuracy

Dataset	Nodes	Cleora	NetMF	DeepWalk	Node2Vec	HOPE	GraRep	ProNE	RandNE
ego-Facebook	4K	0.990	0.957	0.958	0.958	0.890	T/O	0.075	0.212
Cora	2.7K	0.861	0.839	0.835	0.835	0.821	0.809	0.179	0.247
CiteSeer	3.3K	0.824	0.810	0.806	0.806	0.740	0.756	0.189	0.244
PubMed	19.7K	0.879	OOM	T/O	T/O	T/O	OOM	0.339	0.351
PPI	3.9K	1.000	OOM	T/O	T/O	T/O	OOM	0.023	0.073

Only 3 of 8 algorithms survive at 19.7K nodes. HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all crash or time out. Cleora achieves perfect accuracy on PPI (50 classes).

Memory Efficiency

Dataset	Cleora	Best Competitor	Factor
ego-Facebook (4K)	22 MB	572 MB	26x less
Cora (2.7K)	14 MB	227 MB	16x less
CiteSeer (3.3K)	16 MB	294 MB	18x less
PubMed (19.7K)	97 MB	175 MB	Only 3 survived
roadNet-CA (2M)	4.1 GB	—	Only Cleora finished

Scale Test: roadNet-CA (2 Million Nodes)

2 million nodes. 31 seconds. Every other algorithm crashes with out-of-memory. Cleora is the only library that survives at this scale on a single CPU.

Library Comparison

Feature	pycleora 3.2	PyG	KarateClub	DGL	Node2Vec	StellarGraph
CPU-only (no GPU needed)	Yes	Optional	Yes	Optional	Yes	Optional
Rust-powered core	Yes	No (C++)	No	No (C++)	No	No (TF)
No negative sampling needed	Yes	No	No	No	No	No
Deterministic output	Yes	No	No	No	No	No
Node2Vec / DeepWalk	Built-in	Yes	Yes	Yes	Yes	Yes
MLP classifier (no PyTorch)	MLP	Requires PyTorch	No	Requires PyTorch	No	Requires TF
Graph sampling	6 methods	Yes	No	Yes	No	Yes
Hyperparameter tuning	Grid + Random	Manual	No	Manual	No	Manual
Install size	~5 MB	~500 MB+	~15 MB	~400 MB+	~2 MB	~600 MB+
Actively maintained	Yes	Yes	Yes	Yes	Yes	Archived

Use Cases

Recommendation Systems — Products, content, restaurants, videos
Knowledge Graphs — Entity and relation embeddings
Customer Lookalikes — Find users with similar behavior patterns
Entity Resolution — Match entities across data sources
Fraud Detection — Detect anomalous patterns in transaction graphs
Social Networks — Community detection and link prediction
Drug Discovery — Molecule and protein interaction networks
Supply Chain — Supplier and logistics graph analysis

See cleora.ai/use-cases for detailed walkthroughs with code examples.

How It Works

Input Data — Feed edge lists, interaction logs, or knowledge triples. Cleora accepts any TSV with typed columns.
Hypergraph Construction — Builds a heterogeneous hypergraph where a single edge can connect multiple entities of different types.
Sparse Markov Matrix — Constructs a sparse transition matrix (99%+ sparse). Rows normalized so each row sums to 1.
Single Matrix Multiplication = All Walks — One sparse matrix multiplication captures every possible random walk of a given length. No sampling, no noise.
L2-Normalized + Whitened Propagation — Each iteration replaces every node's embedding with the L2-normalized average of its neighbors and then whitens the embedding space. The default configuration runs 40 iterations at 256 dimensions.
Embeddings Ready — Dense, deterministic embedding vectors for every entity. Same input always yields same output.
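
The steps above can be sketched end to end in a few lines of NumPy. This is an illustrative toy under stated assumptions, not Cleora's Rust implementation: it uses a dense matrix instead of a sparse one, a seeded random generator in place of deterministic initialization, and a textbook PCA whitening. The names `whiten` and `embed_sketch` are hypothetical.

```python
import numpy as np

def whiten(x, eps=1e-6):
    """PCA-whiten columns: zero mean, approximately identity covariance."""
    centered = x - x.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs / np.sqrt(eigvals + eps)

def embed_sketch(adjacency, dim, iterations, seed=0):
    # Rows normalized so each row sums to 1 (a Markov transition matrix).
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    # A seeded generator stands in for deterministic initialization.
    rng = np.random.default_rng(seed)
    emb = rng.uniform(-1, 1, size=(adjacency.shape[0], dim))
    for _ in range(iterations):
        emb = transition @ emb                              # average of neighbors
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize rows
        emb = whiten(emb)                                   # decorrelate dimensions
    return emb

# Hypothetical toy graph: two triangles joined by one edge, plus self-loops.
adjacency = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)
embeddings = embed_sketch(adjacency, dim=3, iterations=5)
print(embeddings.shape)  # (6, 3)
```

Nodes 0 and 1 have identical neighborhoods in this toy graph, so they end up with identical vectors.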

Also Used By

Synerise — AI/ML platform processing billions of e-commerce events daily. Cleora powers core recommendation and personalization: product embeddings from terabytes of transactions, substitute vs. complement detection, customer segmentation, cold-start solving — all on CPU in minutes.

Dailymotion — Video platform with 350M+ monthly visitors. Personalized video recommendations with improved relevance and catalog coverage.

ML Competitions — Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020 — beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks.

FAQ

Q: What should I embed?

A: Any entities that interact with each other, co-occur or can be said to be present together in a given context. Examples can include: products in a shopping basket, locations frequented by the same people at similar times, employees collaborating together, chemical molecules being present in specific circumstances, proteins produced by the same bacteria, drug interactions, co-authors of the same academic papers, companies occurring together in the same LinkedIn profiles.

Q: How should I construct the input?

A: What works best is grouping entities that co-occur in a similar context and feeding them in as whitespace-separated lines. Using the complex::reflexive modifier is a good idea. E.g. if you have product data, you can group the products by shopping baskets or by users. If you have URLs, you can group them by browser sessions, or by (user, time window) pairs. Check out the usage example above. Grouping products by customers is just one possibility.
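
As a small illustration of the session-based grouping mentioned in this answer, the sketch below turns a click log into whitespace-separated lines using only the standard library. The event tuples and URL paths are hypothetical.

```python
from collections import defaultdict

# Hypothetical click log: (session_id, url) pairs in arrival order.
events = [
    ("s1", "/home"), ("s1", "/laptops"), ("s2", "/home"),
    ("s1", "/checkout"), ("s2", "/mice"),
]

sessions = defaultdict(list)
for session_id, url in events:
    sessions[session_id].append(url)

# One whitespace-separated line per session; entities must not contain spaces.
cleora_lines = [" ".join(urls) for urls in sessions.values()]
print(cleora_lines)
# ['/home /laptops /checkout', '/home /mice']
```

A list like `cleora_lines` could then be wrapped in `iter(...)` and passed to SparseMatrix.from_iterator, as in the Quick Start.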

Q: Can I embed users and products simultaneously, to compare them with cosine similarity?

A: No. This is methodologically wrong and stems from outdated matrix factorization approaches. Come up with good product embeddings first, then create user embeddings from them. Feeding two columns (e.g. user product) into Cleora will result in a bipartite graph. Similar products will be close to each other and similar users will be close to each other, but users and products will not necessarily be similar to each other.
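
The answer does not prescribe how to derive user embeddings from product embeddings. One simple baseline, shown here purely as an assumption, is a count-weighted average of the vectors of the products each user interacted with. All identifiers and numbers below are hypothetical.

```python
import numpy as np

# Hypothetical product embeddings (rows), already unit length.
product_ids = ["item_laptop", "item_mouse", "item_desk"]
product_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

# Hypothetical purchase counts: rows are users, columns follow product_ids.
user_ids = ["alice", "bob"]
counts = np.array([[2, 0, 0],    # alice bought the laptop twice
                   [0, 1, 1]])   # bob bought the mouse and the desk

# Count-weighted sum of product vectors, then L2-normalize each user row.
user_vecs = counts @ product_vecs
user_vecs = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
print(dict(zip(user_ids, user_vecs.round(3).tolist())))
```

Users built this way live in the product embedding space by construction, so user-to-product cosine similarity becomes meaningful.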

Q: What embedding dimensionality to use?

A: The default is 256. For larger production systems we often work from 1024 to 4096, but 256 is the baseline shipped by the library.

Q: How many iterations of Markov propagation should I use?

A: The default is 40 whitening-enhanced propagation steps. If you want more local, co-occurrence-style behavior you can dial that down manually; higher values bias more toward contextual similarity.

Q: How do I incorporate external information, e.g. entity metadata, images, texts into the embeddings?

A: Just initialize the embedding matrix with your own vectors coming from a ViT, sentence-transformers, or a random projection of your numeric features. In that scenario, fewer Markov iterations than the default 40 often work best.
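
For the random-projection option, a minimal NumPy sketch might look like the following. The function `project_features`, the seed, and the feature values are hypothetical. The resulting rows would need to be ordered to match the entity ids of the matrix before being used as the initial embeddings.

```python
import numpy as np

def project_features(features, dim, seed=0):
    """Project numeric features to `dim` columns with a fixed Gaussian matrix."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((features.shape[1], dim)) / np.sqrt(dim)
    projected = features @ projection
    # L2-normalize each row so all entities start on the unit sphere.
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)

# Hypothetical numeric features: 3 entities, 5 raw features each.
raw = np.array([
    [1.0, 0.0, 2.0, 0.5, 0.0],
    [0.9, 0.1, 2.1, 0.4, 0.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
])
initial = project_features(raw, dim=8)
print(initial.shape)  # (3, 8)
```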

Q: My embeddings don't fit in memory, what do I do?

A: Cleora operates on dimensions independently. Initialize your embeddings with a smaller number of dimensions, run Cleora, persist to disk, then repeat. You can concatenate the resulting embedding vectors at the end, but remember to normalize them after concatenation!
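
The chunking approach can be sketched with NumPy on a toy matrix. The first check shows that the propagation step itself treats columns independently. The rest processes two blocks of dimensions separately and renormalizes after concatenation. The helper `run_chunk` is hypothetical and whitening is omitted. Because row normalization looks at all columns of a block, a chunked run is a valid embedding but not numerically identical to a full-width run.

```python
import numpy as np

# Hypothetical toy transition matrix (rows sum to 1) and initial embeddings.
adjacency = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
transition = adjacency / adjacency.sum(axis=1, keepdims=True)
initial = np.random.default_rng(0).uniform(-1, 1, size=(3, 8))

# The propagation step itself treats every column independently.
full = transition @ initial
halves = np.concatenate(
    [transition @ initial[:, :4], transition @ initial[:, 4:]], axis=1
)
print(np.allclose(full, halves))

def run_chunk(chunk, iterations=3):
    """Propagate and L2-normalize one block of dimensions."""
    for _ in range(iterations):
        chunk = transition @ chunk
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
    return chunk

# Process 4 dimensions at a time (each result could be persisted to disk),
# then concatenate and renormalize the rows of the combined matrix.
chunks = [run_chunk(initial[:, start:start + 4]) for start in range(0, 8, 4)]
combined = np.concatenate(chunks, axis=1)
combined /= np.linalg.norm(combined, axis=1, keepdims=True)
print(combined.shape)  # (3, 8)
```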

Q: Is there a minimum number of entity occurrences?

A: No, an entity A co-occurring just 1 time with some other entity B will get a proper embedding, i.e. B will be the most similar to A. The other way around, A will be highly ranked among nearest neighbors of B, which may or may not be desirable, depending on your use case. Feel free to prune your input to Cleora to eliminate low-frequency items.
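
Pruning low-frequency entities before building the matrix can be done with a simple counting pass. The sketch below is a hypothetical preprocessing step, not a pycleora function. The threshold `min_count` and the sample lines are illustrative.

```python
from collections import Counter

# Hypothetical hyperedges, one whitespace-separated line each.
lines = ["a b c", "a b", "a d"]
min_count = 2

counts = Counter(entity for line in lines for entity in line.split())
pruned = []
for line in lines:
    kept = [entity for entity in line.split() if counts[entity] >= min_count]
    # Lines left with fewer than two entities carry no co-occurrence; drop them.
    if len(kept) >= 2:
        pruned.append(" ".join(kept))
print(pruned)
# ['a b', 'a b']
```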

Q: Are there any edge cases where Cleora can fail?

A: Cleora works best for relatively sparse hypergraphs. If all your hyperedges contain some very common entity X, e.g. a shopping bag, then it will degrade the quality of embeddings by degenerating shortest paths in the random walk. It is a good practice to remove such entities from the hypergraph.
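
Removing an overly common entity can follow the same pattern as frequency pruning, but with an upper bound on the share of hyperedges an entity appears in. The threshold `max_fraction` and the basket contents below are hypothetical.

```python
from collections import Counter

# Hypothetical baskets; "shopping_bag" appears in every one of them.
lines = [
    "shopping_bag laptop mouse",
    "shopping_bag mouse keyboard",
    "shopping_bag desk",
]
max_fraction = 0.9  # hypothetical threshold; tune for your data

# Count in how many hyperedges each entity appears (once per line).
line_frequency = Counter(
    entity for line in lines for entity in set(line.split())
)
too_common = {
    entity for entity, n in line_frequency.items()
    if n / len(lines) > max_fraction
}

filtered = [
    " ".join(entity for entity in line.split() if entity not in too_common)
    for line in lines
]
print(filtered)
# ['laptop mouse', 'mouse keyboard', 'desk']
```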

Q: How can Cleora be so fast and accurate at the same time?

A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication). That's what we call a single iteration. The default configuration performs 40 such iterations with whitening after every step. Negative sampling and random selection of walks both tend to introduce a lot of noise; Cleora is free of those burdens.

Resources

Website: cleora.ai
API Reference: cleora.ai/api
Benchmarks: cleora.ai/benchmarks
Whitepaper: "Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"
GitHub: github.com/BaseModelAI/cleora
PyPI: pypi.org/project/pycleora

Cite

Please cite our paper (and the respective papers of the methods used) if you use this code in your own work:

@article{DBLP:journals/corr/abs-2102-02302,
  author    = {Barbara Rychalska and Piotr Babel and Konrad Goluchowski and Andrzej Michalowski and Jacek Dabrowski},
  title     = {Cleora: {A} Simple, Strong and Scalable Graph Embedding Scheme},
  journal   = {CoRR},
  year      = {2021}
}

License

MIT licensed. See LICENSE for details.

Contributing

Pull requests are welcome. For major changes, please open an issue first. Contact: cleora@synerise.com