Public dataset of all Sefaria texts, hosted on Google Cloud Storage.
This repository is a lightweight index and set of tools for accessing the Sefaria text corpus. The actual text data (~26GB, ~85K files) lives in a public GCS bucket and can be downloaded without authentication.
Quick Start
Browse what's available
# List top-level formats and directories
./examples/browse_bucket.sh

# Drill into a specific category
./examples/browse_bucket.sh json/Talmud
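For environments without the helper script, the same kind of browsing can be sketched in Python against the Cloud Storage JSON API (`objects.list` with a `delimiter` of `/`). This is a rough sketch, not a description of how `browse_bucket.sh` works. It assumes the bucket permits anonymous listing as well as anonymous downloads, and the function names `listing_url`, `parse_listing`, and `browse` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BUCKET = "sefaria-export"
API = "https://storage.googleapis.com/storage/v1/b/{bucket}/o"


def listing_url(prefix="", bucket=BUCKET, page_token=None):
    """Build a JSON API URL that lists one 'directory' level under prefix."""
    params = {"prefix": prefix, "delimiter": "/"}
    if page_token:
        params["pageToken"] = page_token
    return API.format(bucket=bucket) + "?" + urllib.parse.urlencode(params)


def parse_listing(payload):
    """Split an objects.list response into (subdirectories, file names)."""
    dirs = payload.get("prefixes", [])
    files = [item["name"] for item in payload.get("items", [])]
    return dirs, files


def browse(prefix=""):
    """Fetch every result page for prefix (performs network requests).

    Assumption: anonymous listing is allowed on the bucket.
    """
    dirs, files, token = [], [], None
    while True:
        url = listing_url(prefix, page_token=token)
        with urllib.request.urlopen(url, timeout=30) as resp:
            payload = json.load(resp)
        page_dirs, page_files = parse_listing(payload)
        dirs += page_dirs
        files += page_files
        token = payload.get("nextPageToken")
        if not token:
            return dirs, files
```

Calling `browse("json/Talmud/")` would return the sub-prefixes and object names directly under that path. The trailing slash matters, because the prefix is matched as a plain string.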
Download a single text
curl -O "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json"

Download an entire category
# Using the helper script
./examples/download_category.sh Talmud        # all Talmud in JSON
./examples/download_category.sh Mishnah txt   # all Mishnah in TXT

# Or directly with gcloud/gsutil
gcloud storage cp -r "gs://sefaria-export/json/Talmud/" ./talmud/
gsutil -m cp -r "gs://sefaria-export/json/Talmud/" ./talmud/
Download everything
gcloud storage cp -r "gs://sefaria-export/" ./sefaria-data/

Use books.json to filter and download programmatically
import requests

books = requests.get(
    "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/books.json"
).json()

# Find all Talmud texts
for book in books["books"]:
    if "Talmud" in book["categories"]:
        print(book["title"], book.get("json_url"))
Or use the ready-made script:
# Download all English Mishnah texts as JSON
python examples/download_from_books_json.py --category Mishnah --language English

# Download a specific title
python examples/download_from_books_json.py --title "Genesis"

# List what's available without downloading
python examples/download_from_books_json.py --category Tanakh --list
Bucket Structure
The GCS bucket is organized hierarchically by format, category, title, language, and version:
gs://sefaria-export/
json/{categories}/{title}/{language}/{versionTitle}.json
txt/{categories}/{title}/{language}/{versionTitle}.txt
cltk-full/{categories}/{title}/{language}/{versionTitle}.json
cltk-flat/{categories}/{title}/{language}/{versionTitle}.json
schemas/{title}.json
links/links0.csv ... links12.csv
table_of_contents.json
Example paths
json/Tanakh/Torah/Genesis/English/merged.json
json/Talmud/Bavli/Seder Moed/Shabbat/Hebrew/merged.json
txt/Mishnah/Seder Zeraim/Mishnah Berakhot/English/merged.txt
schemas/Genesis.json
links/links0.csv
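Because several path components contain spaces (for example `Seder Moed`), a public download URL has to be percent-encoded. Below is a minimal sketch of building one from the layout above. It assumes the `https://storage.googleapis.com/sefaria-export/...` URL form used elsewhere in this README, and the helper name `text_url` is made up for illustration.

```python
from urllib.parse import quote

BASE_URL = "https://storage.googleapis.com/sefaria-export"


def text_url(fmt, categories, title, language, version="merged"):
    """Build a download URL from format, categories, title, language, version.

    fmt is one of "json", "txt", "cltk-full", "cltk-flat"; only "txt"
    uses the .txt extension, the others use .json (per the layout above).
    """
    ext = "txt" if fmt == "txt" else "json"
    parts = [fmt, *categories, title, language, f"{version}.{ext}"]
    # Encode each component separately so spaces and non-ASCII characters
    # are escaped while the "/" separators are preserved.
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


print(text_url("json", ["Talmud", "Bavli", "Seder Moed"], "Shabbat", "Hebrew"))
```

The resulting string can be passed to `curl -O` or any HTTP client.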
Formats
| Format | Description |
|---|---|
| json/ | Structured JSON with text content, verse-level arrays |
| txt/ | Plain text, one file per version |
| cltk-full/ | JSON formatted for the Classical Language Toolkit |
| cltk-flat/ | Flattened CLTK format |
| schemas/ | Schema/structure metadata for each text |
| links/ | CSV files of all intertextual connections |
Merged files
Each text directory includes a merged file (e.g., merged.json, merged.txt). This file combines the maximal content available from all versions, using Sefaria's merging logic. When a single complete version exists, the merged file is a copy of it. Use merged files when you want the most complete text available.
books.json
books.json is an index of every text in the bucket. Each entry contains:
{
"title": "Genesis",
"language": "English",
"versionTitle": "merged",
"categories": ["Tanakh", "Torah"],
"json_url": "https://storage.googleapis.com/sefaria-export/json/Tanakh/Torah/Genesis/English/merged.json",
"txt_url": "https://storage.googleapis.com/sefaria-export/txt/Tanakh/Torah/Genesis/English/merged.txt",
"cltk_full_url": "...",
"cltk_flat_url": "..."
}

This file is regenerated monthly by a GitHub Action, on the 2nd of each month (the day after the GCS export). The Action can also be triggered manually from the Actions tab.
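Once the index is loaded, the entry fields shown above are enough to select texts offline. The following sketch filters a list of entries by category, language, and merged-only status. The function name `filter_books` is hypothetical, and the sample entries are made-up stand-ins that only mirror the documented fields.

```python
def filter_books(entries, category=None, language=None, merged_only=False):
    """Return the entries that match every filter that was supplied."""
    selected = []
    for entry in entries:
        if category and category not in entry.get("categories", []):
            continue
        if language and entry.get("language") != language:
            continue
        if merged_only and entry.get("versionTitle") != "merged":
            continue
        selected.append(entry)
    return selected


# Made-up sample entries for illustration only.
sample = [
    {"title": "Genesis", "language": "English", "versionTitle": "merged",
     "categories": ["Tanakh", "Torah"]},
    {"title": "Genesis", "language": "Hebrew", "versionTitle": "merged",
     "categories": ["Tanakh", "Torah"]},
    {"title": "Mishnah Berakhot", "language": "English",
     "versionTitle": "Example Edition", "categories": ["Mishnah", "Seder Zeraim"]},
]

for entry in filter_books(sample, category="Tanakh", language="English"):
    print(entry["title"], entry["language"], entry["versionTitle"])
```

With the real index, `entries` would be the list of entries loaded from books.json, and each selected entry's `json_url` or `txt_url` could then be downloaded.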
Repository Contents
| Path | Description |
|---|---|
| books.json | Index of all texts with metadata and download URLs |
| scripts/generate_books_json.py | Generates books.json from the GCS bucket listing |
| examples/download_from_books_json.py | Filter and download texts using books.json |
| examples/download_category.sh | Download all texts in a category via gcloud |
| examples/browse_bucket.sh | Browse available categories and texts |
| .github/workflows/generate-books-json.yml | Monthly CI to regenerate books.json (also supports manual trigger) |
Related Projects
- Sefaria-Project - Sefaria's main application source code
- Sefaria API - REST API for accessing Sefaria data programmatically
License
See LICENSE.md.