Can we fine-tune small LLMs to compress context while preserving semantic equivalence across models?
This project trains LoRA adapters on small (3B–8B) language models to rewrite verbose LLM context into compressed representations that produce equivalent reasoning outputs from Claude, GPT, and Gemini — reducing token usage while maintaining answer quality.
Read the full write-up: Semantic Compression — Research Overview
Key Results
Summary of trained adapter evaluations on the full test set (2,497 examples from `Sudhendra/semantic-compression-sft`):
Compression (Token Ratio)
| Model | Params | Backend | LoRA Config | Avg Token Ratio | Avg Input Tokens | Avg Output Tokens |
|---|---|---|---|---|---|---|
| Nanbeige4.1-3B (8-bit) | 3B | MLX (local) | rank 8, lr 1e-4, 500 iters | 34.8% | 276.3 | 103.2 |
| Qwen3-8B | 8B | Tinker (cloud) | rank 16, lr 2e-4, 4962 steps | 38.6% | 237.5 | 98.1 |
Token ratio = `output_tokens / input_tokens`. Lower is better. A ratio of 34.8% means the compressed output uses ~35% of the original token count.
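As a minimal sketch of the metric (a hypothetical helper, not the project's own code):

```python
def token_ratio(input_tokens: int, output_tokens: int) -> float:
    # Lower is better: 0.35 means the compressed output
    # uses ~35% of the original token count.
    return output_tokens / input_tokens

print(token_ratio(300, 105))  # 0.35
```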
Extrinsic Evaluation (Downstream Task Performance)
The primary evaluation: does compressed context still let a task model answer correctly? Each example is evaluated twice — once with full context and once with compressed context — using the same task model and prompt. The delta between full-context and compressed-context accuracy measures the cost of compression.
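The delta bookkeeping described above can be sketched as follows; the function and field layout are illustrative assumptions, not the project's schema:

```python
def accuracy_delta(full_scores, compressed_scores):
    """Mean compressed-context accuracy minus mean full-context accuracy."""
    full = sum(full_scores) / len(full_scores)
    compressed = sum(compressed_scores) / len(compressed_scores)
    # Negative delta = accuracy lost to compression; near zero = compression is free.
    return compressed - full

print(accuracy_delta([1, 1, 0, 1], [1, 0, 0, 1]))  # -0.25
```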
Supported benchmarks:
| Benchmark | Task | Metric |
|---|---|---|
| HotPotQA | Multi-hop QA | Exact Match, Token F1 |
| Qasper | Long-document QA | Exact Match, Token F1 |
| DS1000 | Code generation | Exact Match, Token F1 |
Compressor baselines:
| Compressor | Description |
|---|---|
| `identity` | No compression (upper bound on accuracy) |
| `truncate` | Hard token-limit truncation |
| `extractive` | Top sentences by TF-IDF similarity to query |
| `adapter_local` | Learned LoRA adapter (MLX, local) |
| `adapter_tinker` | Learned LoRA adapter (Tinker, cloud) |
Results — 300 examples per benchmark, `gpt-4o-mini` as task model, seed 42. Each example is evaluated twice (full context vs compressed context); deltas show the accuracy cost of compression. See `docs/downstream-eval.md` for the full workflow.
HotPotQA (multi-hop QA, n = 300):
| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| `identity` | 0.043 | −0.003 | 0.177 | +0.000 | 1.00 | $0.129 |
| `truncate` | 0.040 | −0.007 | 0.093 | −0.087 | 0.31 | $0.085 |
| `extractive` | 0.030 | −0.020 | 0.068 | −0.112 | 0.20 | $0.079 |
| `adapter_local` | 0.027 | −0.020 | 0.104 | −0.075 | 0.26 | $0.084 |
| `adapter_tinker` | 0.010 | −0.037 | 0.067 | −0.109 | 0.31 | $0.087 |
DS1000 (code generation, n = 300):
| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| `identity` | 0.213 | 0.000 | 0.459 | +0.002 | 1.00 | $0.102 |
| `truncate` | 0.210 | +0.003 | 0.435 | −0.021 | 0.87 | $0.094 |
| `extractive` | 0.170 | −0.030 | 0.326 | −0.128 | 0.65 | $0.087 |
| `adapter_local` | 0.067 | −0.143 | 0.159 | −0.301 | 0.38 | $0.079 |
| `adapter_tinker` | 0.037 | −0.170 | 0.079 | −0.386 | 0.62 | $0.084 |
Key takeaways:
- Identity confirms the baseline — zero compression, near-zero delta (noise only).
- Truncate preserves accuracy well on DS1000 (ratio 0.87, ΔF1 −0.02) where contexts are shorter, but loses more on HotPotQA (ratio 0.31, ΔF1 −0.09) where multi-hop evidence spans the full context.
- Learned adapters achieve aggressive compression (0.26–0.62 ratio) but with significant accuracy drops, especially on code generation. The adapter was trained on conversational text, not code — domain mismatch explains the gap.
- Qasper results pending.
Intrinsic Equivalence (Cross-Model Semantic Fidelity)
A diagnostic evaluation measuring whether the compressed text preserves the same semantic content as the original verbose text. Both are sent to three frontier models (Claude, GPT, Gemini) for fact-extraction tasks; the outputs are compared through a 3-gate system:
| Gate | Metric | Threshold | What it catches |
|---|---|---|---|
| 1 | Embedding similarity (MiniLM-L6-v2) | >= 0.60 | Gross topic drift, garbled outputs |
| 2 | Fact overlap (atomic fact coverage) | >= 0.55 | Dropped facts, numbers, parameters |
| 3 | LLM judge | >= 0.75 | Subtle reasoning gaps, quality issues |
A sample passes only when the minimum score across all three models exceeds the threshold on every active gate.
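The pass rule can be sketched as follows; gate names and thresholds mirror the table above, but the data structure is illustrative:

```python
THRESHOLDS = {"embedding": 0.60, "fact_overlap": 0.55, "llm_judge": 0.75}

def sample_passes(scores):
    """scores: {gate_name: {model_name: score}}.
    Passes only if the minimum score across models clears every gate."""
    return all(min(scores[gate].values()) >= thr
               for gate, thr in THRESHOLDS.items())

sample = {
    "embedding":    {"claude": 0.81, "gpt": 0.79, "gemini": 0.77},
    "fact_overlap": {"claude": 0.60, "gpt": 0.58, "gemini": 0.61},
    "llm_judge":    {"claude": 0.80, "gpt": 0.76, "gemini": 0.74},  # gemini fails gate 3
}
print(sample_passes(sample))  # False
```

A single model dipping below a single threshold fails the whole sample, which is why the strictest gate (the LLM judge) dominates the pass rates reported below.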
| Model | Sample (seed 42) | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| Qwen3-8B (step 4500) | 300 | 2.0% | 0.390 | 0.413 |
| Nanbeige 3B (iter 500) | 300 | 1.3% | 0.402 | 0.434 |
Qwen3-8B gate-level breakdown
Gate failure rates (how often each gate was the bottleneck):
| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 15.3% | 0.782 |
| Fact overlap (>= 0.55) | 76.0% | 0.535 |
| LLM judge (>= 0.75) | 96.3% | 0.584 |
Without the LLM judge (gates 1+2 only), the pass rate would be 24.0%.
Per-model avg scores (across 300 samples):
| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.783 | 0.539 | 0.572 |
| GPT-4o-mini | 0.795 | 0.525 | 0.607 |
| Gemini Flash | 0.768 | 0.542 | 0.574 |
Per-domain results (3-gate, all models):
| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 2.1% | 0.320 | 0.304 |
| Mixed | 171 | 1.8% | 0.415 | 0.444 |
| Code | 34 | 2.9% | 0.459 | 0.450 |
Nanbeige 3B gate-level breakdown
Gate failure rates (how often each gate was the bottleneck):
| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 12.7% | 0.804 |
| Fact overlap (>= 0.55) | 74.0% | 0.556 |
| LLM judge (>= 0.75) | 97.0% | 0.596 |
Without the LLM judge (gates 1+2 only), the pass rate would be 25.7%.
Per-model avg scores (across 300 samples):
| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.805 | 0.553 | 0.594 |
| GPT-4o-mini | 0.817 | 0.548 | 0.618 |
| Gemini Flash | 0.790 | 0.566 | 0.577 |
Per-domain results (3-gate, all models):
| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 3.2% | 0.346 | 0.350 |
| Mixed | 171 | 0.6% | 0.419 | 0.444 |
| Code | 34 | 0.0% | 0.476 | 0.456 |
Samples are domain-stratified (95 NL / 171 mixed / 34 code) to match the test set distribution.
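The stratified draw can be sketched as below; the quotas come from this README, while the helper and the `domain` field name are assumptions:

```python
import random

def stratified_sample(examples, quotas, seed=42):
    """Draw a fixed number of examples per domain with a seeded RNG."""
    rng = random.Random(seed)
    picked = []
    for domain, n in quotas.items():
        pool = [e for e in examples if e["domain"] == domain]
        picked.extend(rng.sample(pool, n))
    return picked

examples = [{"domain": d, "id": i}
            for i, d in enumerate(["nl"] * 200 + ["mixed"] * 300 + ["code"] * 50)]
sample = stratified_sample(examples, {"nl": 95, "mixed": 171, "code": 34})
print(len(sample))  # 300
```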
Training run details
Nanbeige 3B — 500 iterations (MLX)
- Model: `mlx-community/Nanbeige4.1-3B-8bit`
- LoRA: rank 8, alpha 16, 16 layers
- Config: 500 iters, batch size 4, lr 1e-4
- Eval artifact: `models/eval/ratio_nanbeige_iter500.jsonl`
Qwen3-8B (Tinker cloud)
- Model: `Qwen/Qwen3-8B`
- LoRA: rank 16, alpha 32, dropout 0.05
- Config: 2 epochs, batch size 4, lr 2e-4
- Status: completed, early-stopped at step 4962 (planned 9924)
- Best checkpoint: step 4500 (val_loss = 0.2579)
- Train examples: 19,845
Quick Start
```bash
# Clone and setup
git clone https://github.com/Sudhendra/compression-layer.git
cd compression-layer
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Copy and configure environment variables
cp .env.example .env   # Edit .env with your API keys

# Run tests
pytest tests/ -v
```
Environment Variables
Create a `.env` file (see `.env.example`):

```bash
ANTHROPIC_API_KEY=sk-ant-...  # For Claude intrinsic eval
OPENAI_API_KEY=sk-...         # For GPT intrinsic eval + downstream task model
GOOGLE_API_KEY=...            # For Gemini intrinsic eval
HF_TOKEN=hf_...               # For HuggingFace dataset access
TINKER_API_KEY=tk_...         # For Tinker cloud training/inference
```
Reproduce Evaluation
1. Extrinsic Eval (Downstream Task Performance)
The primary evaluation. Measures whether compressed context preserves task accuracy.
Build a dataset:
```bash
python scripts/prepare_downstream_eval.py \
  --benchmark hotpotqa \
  --split validation \
  --limit 25 \
  --output data/eval/downstream/hotpotqa_validation_25.jsonl
```
Run baseline comparisons (identity, truncate, extractive):
```bash
# Identity baseline (full context, no compression — upper bound)
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor identity \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_identity_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Truncation baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor truncate \
  --truncate-tokens 256 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_truncate_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Extractive baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor extractive \
  --extractive-chars 1000 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_extractive_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume
```
Run learned compressor (requires adapter):
```bash
# Local MLX adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_local \
  --adapter-model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/.../adapter \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adapterlocal_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Tinker cloud adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_tinker \
  --checkpoint-path "tinker://.../weights/step-004500" \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adaptertinker_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume
```
See `docs/downstream-eval.md` for the full workflow, including budget rules and reporting metrics.
2. Compression Eval (Token Ratio)
Measures how much the adapter compresses input tokens.
Local MLX backend (for Nanbeige / local models):

```bash
python scripts/evaluate_tinker.py \
  --backend local \
  --model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/2026-03-01_20-53-41--iter-500/adapter \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/eval/ratio_nanbeige_iter500.jsonl \
  --resume
```

Tinker cloud backend (for Qwen3-8B):

```bash
python scripts/evaluate_tinker.py \
  --backend tinker \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --checkpoint-path "tinker://161b1f39-3e50-53c0-9d75-f5ce804db7eb:train:0/weights/step-004500" \
  --output models/eval/ratio_qwen3-8b_tinker_step004500.jsonl \
  --show-examples 0 \
  --resume
```

3. Convert to Validation Pairs
Transforms compression eval output into the format needed for intrinsic equivalence testing:
```bash
python - <<'PY'
import json
from pathlib import Path
from src.inference.domain_classifier import DomainClassifier

src = Path("models/eval/ratio_qwen3-8b_tinker_step004500.jsonl")
dst = Path("models/eval/ratio_qwen3-8b_pairs.jsonl")
clf = DomainClassifier()

with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        row = json.loads(line)
        verbose = row["input_text"]
        compressed = row["generated_output"]
        domain = clf.classify(verbose).value
        fout.write(json.dumps({
            "verbose": verbose,
            "compressed": compressed,
            "domain": domain
        }) + "\n")
PY
```
4. Intrinsic Equivalence Eval (Cross-Model Semantic Fidelity)
A diagnostic evaluation — tests whether compressed outputs produce equivalent reasoning across Claude, GPT, and Gemini. Useful for understanding where information is lost (which gate fails, which facts get dropped):
```bash
python scripts/validate_batch.py \
  --input models/eval/pairs_sample300_seed42_qwen_step004500.jsonl \
  --output models/eval/equiv_qwen_step004500_judge.jsonl \
  --models claude gpt gemini \
  --use-llm-judge \
  --save-all \
  --concurrency 3 \
  --resume
```
How It Works
Pipeline Overview
- Raw Corpora — Source material from code repositories and natural language documents
- Synthetic Generation — A trained adapter generates compression pairs from raw inputs, bootstrapping the training data
- Preprocessing — Pairs are cleaned and filtered by compression ratio to remove degenerate outputs
- SFT Training — LoRA fine-tuning on the filtered pairs teaches the model to compress while preserving semantics
- Compression Eval — Token ratio measurement (`output_tokens / input_tokens`) across the held-out test set
- Extrinsic Eval — Compressed context is substituted into downstream QA and code tasks; accuracy deltas measure the real-world cost of compression
- Intrinsic Equivalence — The compressed text is fed to Claude, GPT, and Gemini; a 3-gate scoring system diagnoses where semantic fidelity breaks down
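The ratio-based filtering in the preprocessing step above can be sketched as follows; the bounds and field names are illustrative assumptions, not the project's config:

```python
def keep_pair(pair, lo=0.05, hi=0.9):
    """Drop degenerate pairs: near-empty compressions or ones that barely compress."""
    ratio = pair["output_tokens"] / pair["input_tokens"]
    return lo <= ratio <= hi

pairs = [
    {"output_tokens": 100, "input_tokens": 300},  # ratio ~0.33 -> keep
    {"output_tokens": 2,   "input_tokens": 300},  # ratio ~0.007 -> drop (near-empty)
    {"output_tokens": 290, "input_tokens": 300},  # ratio ~0.97 -> drop (no compression)
]
print([keep_pair(p) for p in pairs])  # [True, False, False]
```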
For full data pipeline reproduction commands, see `docs/data-pipeline.md`.
Training
Training is supported on two backends:
- MLX (local) — Apple Silicon, quantized models (3B–4B). Uses `mlx-lm` for LoRA fine-tuning.
- Tinker (cloud) — Remote GPU training for larger models (8B+). Manages checkpoints, metrics, and early stopping.
```bash
# Local MLX training
python scripts/train_tinker.py \
  --backend local \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --local-model mlx-community/Qwen3-4B-Instruct-2507-8bit \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/local_run

# Cloud Tinker training
python scripts/train_tinker.py \
  --backend tinker \
  --model Qwen/Qwen3-8B \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/tinker
```
Project Structure
```
compression-layer/
├── src/
│   ├── validation/        # Cross-model intrinsic equivalence testing
│   ├── evaluation/        # Evaluation logic
│   │   └── downstream/    # Extrinsic eval: runner, dataset, scoring, baselines
│   │       └── benchmarks/ # HotPotQA, Qasper, DS1000 benchmark loaders
│   ├── generation/        # Compression pair generation
│   ├── training/          # Tinker + MLX training pipelines
│   ├── inference/         # Compression inference (local + cloud)
│   └── utils/             # Tokenizers, caching, cost tracking
├── scripts/               # CLI entry points
│   ├── train_tinker.py            # Training (local MLX / cloud Tinker)
│   ├── evaluate_tinker.py         # Compression ratio evaluation
│   ├── evaluate_downstream.py     # Extrinsic downstream evaluation
│   ├── prepare_downstream_eval.py # Benchmark dataset preparation
│   ├── validate_batch.py          # Intrinsic equivalence evaluation
│   └── ...
├── data/                  # Corpora and datasets (gitignored)
├── models/                # Checkpoints and eval artifacts (gitignored)
├── configs/               # YAML configurations
├── docs/                  # Documentation and plans
├── tests/                 # Test suite
└── assets/                # README images and diagrams
```
Contributing
We welcome contributions, especially in these areas:
- New domains — Extending compression beyond code and natural language (e.g., structured data, mathematical notation)
- Model experiments — Training adapters on different base models or with alternative LoRA configurations
- Evaluation methodology — Improving the intrinsic equivalence scoring, adding new downstream benchmarks, or refining thresholds
- Dataset expansion — Contributing to the HuggingFace dataset with new high-quality compression pairs
See CONTRIBUTING.md for setup instructions, branch strategy,
and development workflow.