GitHub - vbilsmith/breed2vec: Package to decompose FCI breed standards to facilitate semantic comparisons of dog breeds

breed2vec is a lightweight, reproducible pipeline for collecting Fédération Cynologique Internationale (FCI) dog breed standards, extracting text from PDFs, and analyzing semantic similarity between breeds using modern text representations. The goal of this project is not to infer ancestry or biological truth, but to evaluate how well document embeddings recover known historical, geographic, and morphological relationships from breed-standard text alone. The workflow is designed to be transparent and inspectable.

scrape → store → ingest PDFs → analyze

Why This Matters (AI Safety Context)

This project is a small, controlled testbed for evaluating how well text embeddings recover meaningful structure from domain‑specific documents. This kind of construct‑validity check is directly relevant to safety‑adjacent evaluation: if embeddings fail to recover known structure in a narrow domain, they are less trustworthy for high‑stakes interpretability or retrieval settings.

What This Demo Shows (30-second overview)

Using a small, interpretable set of breeds, the demo illustrates that:

Breeds with shared geographic and functional history (e.g., Labrador Retriever and Newfoundland) exhibit high semantic similarity.
Related breeds developed in different contexts (e.g., Golden Retriever) appear nearby but offset.
Morphologically and historically distinct breeds (e.g., Xoloitzcuintle) separate cleanly.
These relationships emerge without any biological labels—purely from text embeddings of breed standards.

This serves as a construct-validity sanity check for document-level embeddings in a fine-grained biological domain.

Quickstart for Reviewers

Read the Retriever Sandbox writeup: breed2vec/docs/retriever_sandbox.md.

Optionally run analysis on the provided breed list:

python -m breed2vec analyze --breeds breed2vec/breeds.txt

To avoid scraping, use a cached DB (see below).

Environment Setup

Create and activate the conda environment:

conda env create -f environment.yml
conda activate breed2vec

(Alternatively, dependencies can be installed via pip; see environment.yml for details.)

Minimal Working Example

This is the shortest path from scratch to analysis output.

Populate FCI group and breed metadata:

python -m breed2vec groups
python -m breed2vec breeds

Specify a small list of breeds: Edit breed2vec/breeds.txt, adding one breed per line using official FCI names.
Ingest breed standards (PDF download + text extraction):

python -m breed2vec ingest --breeds breed2vec/breeds.txt

Notes:

Internet access is required to download new PDFs.
If offline, use a cached DB (see below).

Run analysis:

python -m breed2vec analyze --breeds breed2vec/breeds.txt

This produces cosine similarity matrices and low-dimensional visualizations (e.g., PCA) over document embeddings.
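To make the analysis step concrete, here is a minimal, standard-library-only sketch of a cosine similarity matrix over TF‑IDF vectors. It is not the package's implementation: the function names, the tokenizer, the smoothed IDF formula, and the toy texts (which are not real FCI standard text) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (smoothed IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Pairwise cosine similarities as a list of rows."""
    vectors = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Toy snippets for illustration only, NOT real FCI standard text.
toy_standards = {
    "Breed A": "water dog with dense double coat and otter tail",
    "Breed B": "large water dog with dense double coat and webbed feet",
    "Breed C": "hairless dog with smooth skin and upright ears",
}
names = list(toy_standards)
matrix = similarity_matrix(list(toy_standards.values()))
for name, row in zip(names, matrix):
    print(name, [round(x, 2) for x in row])
```

In this toy input, the two texts that share most of their vocabulary score higher with each other than either does with the third, which is the kind of pattern the real analysis looks for at document scale.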

Project Structure (Short)

breed2vec/scrape/: scrape FCI group + breed metadata.
breed2vec/db/: sqlite schema + CRUD.
breed2vec/ingest/: PDF download and text extraction.
breed2vec/analyze/: analysis helpers (TF‑IDF / embeddings).
breed2vec/pipeline/: orchestration entrypoints.

For full detail, see breed2vec/MANIFEST.md.

Using a Cached Database (No Scraping Required)

If you have a cached fci_cache.db, you can skip scraping and run ingest/analyze directly.

Option A: environment variable

export BREED2VEC_DB_PATH="/path/to/fci_cache.db"
python -m breed2vec ingest --breeds breed2vec/breeds.txt
python -m breed2vec analyze --breeds breed2vec/breeds.txt

Option B: CLI flags

python -m breed2vec ingest --db-path /path/to/fci_cache.db --breeds breed2vec/breeds.txt
python -m breed2vec analyze --db-path /path/to/fci_cache.db --breeds breed2vec/breeds.txt
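The two options can be pictured as one lookup with a fallback chain. The sketch below is hypothetical: `resolve_db_path` is not a function from the package, and the precedence shown (CLI flag, then environment variable, then the default location listed under Outputs) is an assumption that this README does not confirm.

```python
import argparse
import os

# Default location taken from the Outputs section; using it as a fallback is an assumption.
DEFAULT_DB_PATH = "breed2vec/data/fci_cache.db"


def resolve_db_path(argv, environ=None):
    """Pick a DB path: --db-path flag, else BREED2VEC_DB_PATH, else the default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="resolve-db-path-sketch")
    parser.add_argument("--db-path", default=None)
    # parse_known_args ignores subcommands and other flags in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    if args.db_path:
        return args.db_path
    return environ.get("BREED2VEC_DB_PATH") or DEFAULT_DB_PATH


print(resolve_db_path(["ingest", "--db-path", "/tmp/a.db"], {}))
print(resolve_db_path(["ingest"], {"BREED2VEC_DB_PATH": "/tmp/b.db"}))
print(resolve_db_path(["ingest"], {}))
```

Passing `environ` explicitly keeps the function easy to test without mutating the real process environment.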

If you have a full cached data directory (db + pdfs + layout), use:

python -m breed2vec analyze --data-dir /path/to/data

Outputs

breed2vec/data/fci_cache.db: sqlite database for breed metadata and documents.
breed2vec/data/pdfs/: downloaded PDF standards.
breed2vec/data/layout/: optional layout traces for section extraction.
breed2vec/data/plots/<run_id>/: analysis outputs (cosine matrix, TF‑IDF tables, plots).
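Because the database is a plain sqlite file, it can be inspected without knowing its schema. The following sketch lists every table and its row count using only the standard library. `table_row_counts` is an illustrative helper, not part of the package, and the demo table name is made up; point the function at your own `fci_cache.db` to see the real tables.

```python
import os
import sqlite3
import tempfile


def table_row_counts(db_path):
    """Map each table name in the SQLite file to its row count."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        # Table names come from the database itself; they are quoted as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Demo on a throwaway database; "example_table" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    demo_conn = sqlite3.connect(demo_path)
    demo_conn.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY, name TEXT)")
    demo_conn.execute("INSERT INTO example_table (name) VALUES ('row one')")
    demo_conn.commit()
    demo_conn.close()
    demo_counts = table_row_counts(demo_path)

print(demo_counts)
```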

Notes on Data and Reproducibility

The FCI site is the source of truth for breed metadata and PDF standards.
The pipeline is designed to be incremental: re‑running ingest re‑downloads the PDFs but updates stored documents only when a file's hash has changed.
PCA visualizations are used for interpretability, not statistical inference.
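The hash-based incremental update mentioned above can be sketched as follows. This is an assumption-laden illustration, not the package's code: the `documents` table, its columns, the choice of SHA‑256, and the `record_if_changed` helper are all hypothetical.

```python
import hashlib
import sqlite3


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw PDF bytes."""
    return hashlib.sha256(data).hexdigest()


def record_if_changed(conn, breed, pdf_bytes):
    """Store the content hash; return True only if it is new or has changed."""
    new_hash = sha256_hex(pdf_bytes)
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE breed = ?", (breed,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged: text extraction could be skipped
    conn.execute(
        "INSERT OR REPLACE INTO documents (breed, content_hash) VALUES (?, ?)",
        (breed, new_hash),
    )
    conn.commit()
    return True


# In-memory demo with a hypothetical schema and placeholder bytes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (breed TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
)
first = record_if_changed(conn, "Example Breed", b"%PDF-1.4 version one")
second = record_if_changed(conn, "Example Breed", b"%PDF-1.4 version one")
third = record_if_changed(conn, "Example Breed", b"%PDF-1.4 version two")
print(first, second, third)
```

Comparing hashes rather than timestamps means a re-download of identical bytes leaves the stored record untouched, which keeps repeated runs reproducible.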

Limitations and Scope

This demo uses a small number of breeds for clarity and interpretability. Semantic similarity reflects shared descriptive language, not proof of genetic ancestry. The project is intended as an exploratory and diagnostic tool rather than a benchmark.

Next Steps (Planned)

Section extraction from layout traces (history vs morphology vs temperament).
Structured comparisons across section types.

Contact

If you are interested in this work, please feel free to reach out for a walkthrough or a curated subset of results.

Acknowledgements

Development was assisted by OpenAI tools (Codex / ChatGPT) for scaffolding, refactoring, and debugging support.