Can we fine-tune small LLMs to compress context while preserving semantic equivalence across models?
This project trains LoRA adapters on small (3B–8B) language models to rewrite verbose LLM context into compressed representations that produce equivalent reasoning outputs from Claude, GPT, and Gemini — reducing token usage while maintaining answer quality.
Read the full write-up: Semantic Compression — Research Overview
Key Results
Summary of trained adapter evaluations on the full test set (2,497 examples from `Sudhendra/semantic-compression-sft`):
Compression (Token Ratio)
| Model | Params | Backend | LoRA Config | Avg Token Ratio | Avg Input Tokens | Avg Output Tokens |
|---|---|---|---|---|---|---|
| Nanbeige4.1-3B (8-bit) | 3B | MLX (local) | rank 8, lr 1e-4, 500 iters | 34.8% | 276.3 | 103.2 |
| Qwen3-8B | 8B | Tinker (cloud) | rank 16, lr 2e-4, 4962 steps | 38.6% | 237.5 | 98.1 |
Token ratio = `output_tokens / input_tokens`. Lower is better. A ratio of 34.8% means the compressed output uses ~35% of the original token count.
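As a minimal sketch of the metric (a hypothetical helper, not the project's own code):

```python
def token_ratio(input_tokens: int, output_tokens: int) -> float:
    # Lower is better: 0.35 means the compressed output
    # uses ~35% of the original token count.
    return output_tokens / input_tokens

print(token_ratio(300, 105))  # 0.35
```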
Extrinsic Evaluation (Downstream Task Performance)
The primary evaluation: does compressed context still let a task model answer correctly? Each example is evaluated twice — once with full context and once with compressed context — using the same task model and prompt. The delta between full-context and compressed-context accuracy measures the cost of compression.
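The delta bookkeeping described above can be sketched as follows; the function and field layout are illustrative assumptions, not the project's schema:

```python
def accuracy_delta(full_scores, compressed_scores):
    """Mean compressed-context accuracy minus mean full-context accuracy."""
    full = sum(full_scores) / len(full_scores)
    compressed = sum(compressed_scores) / len(compressed_scores)
    # Negative delta = accuracy lost to compression; near zero = compression is free.
    return compressed - full

print(accuracy_delta([1, 1, 0, 1], [1, 0, 0, 1]))  # -0.25
```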
Supported benchmarks:
| Benchmark | Task | Metric |
|---|---|---|
| HotPotQA | Multi-hop QA | Exact Match, Token F1 |
| Qasper | Long-document QA | Exact Match, Token F1 |
| DS1000 | Code generation | Exact Match, Token F1 |
Compressor baselines:
| Compressor | Description |
|---|---|
| `identity` | No compression (upper bound on accuracy) |
| `truncate` | Hard token-limit truncation |
| `extractive` | Top sentences by TF-IDF similarity to query |
| `adapter_local` | Learned LoRA adapter (MLX, local) |
| `adapter_tinker` | Learned LoRA adapter (Tinker, cloud) |
Results — 300 examples per benchmark, `gpt-4o-mini` as task model, seed 42. Each example is evaluated twice (full context vs compressed context); deltas show the accuracy cost of compression. See `docs/downstream-eval.md` for the full workflow.
HotPotQA (multi-hop QA, n = 300):
| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| `identity` | 0.043 | −0.003 | 0.177 | +0.000 | 1.00 | $0.129 |
| `truncate` | 0.040 | −0.007 | 0.093 | −0.087 | 0.31 | $0.085 |
| `extractive` | 0.030 | −0.020 | 0.068 | −0.112 | 0.20 | $0.079 |
| `adapter_local` | 0.027 | −0.020 | 0.104 | −0.075 | 0.26 | $0.084 |
| `adapter_tinker` | 0.010 | −0.037 | 0.067 | −0.109 | 0.31 | $0.087 |
DS1000 (code generation, n = 300):
| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| `identity` | 0.213 | 0.000 | 0.459 | +0.002 | 1.00 | $0.102 |
| `truncate` | 0.210 | +0.003 | 0.435 | −0.021 | 0.87 | $0.094 |
| `extractive` | 0.170 | −0.030 | 0.326 | −0.128 | 0.65 | $0.087 |
| `adapter_local` | 0.067 | −0.143 | 0.159 | −0.301 | 0.38 | $0.079 |
| `adapter_tinker` | 0.037 | −0.170 | 0.079 | −0.386 | 0.62 | $0.084 |
Key takeaways:
- Identity confirms the baseline — zero compression, near-zero delta (noise only).
- Truncate preserves accuracy well on DS1000 (ratio 0.87, ΔF1 −0.02) where contexts are shorter, but loses more on HotPotQA (ratio 0.31, ΔF1 −0.09) where multi-hop evidence spans the full context.
- Learned adapters achieve aggressive compression (0.26–0.62 ratio) but with significant accuracy drops, especially on code generation. The adapter was trained on conversational text, not code — domain mismatch explains the gap.
- Qasper results pending.
Intrinsic Equivalence (Cross-Model Semantic Fidelity)
A diagnostic evaluation measuring whether the compressed text preserves the same semantic content as the original verbose text. Both are sent to three frontier models (Claude, GPT, Gemini) for fact-extraction tasks; the outputs are compared through a 3-gate system:
| Gate | Metric | Threshold | What it catches |
|---|---|---|---|
| 1 | Embedding similarity (MiniLM-L6-v2) | >= 0.60 | Gross topic drift, garbled outputs |
| 2 | Fact overlap (atomic fact coverage) | >= 0.55 | Dropped facts, numbers, parameters |
| 3 | LLM judge | >= 0.75 | Subtle reasoning gaps, quality issues |
A sample passes only when the minimum score across all three models exceeds the threshold on every active gate.
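The pass rule can be sketched as follows; gate names and thresholds mirror the table above, but the data structure is illustrative:

```python
THRESHOLDS = {"embedding": 0.60, "fact_overlap": 0.55, "llm_judge": 0.75}

def sample_passes(scores):
    """scores: {gate_name: {model_name: score}}.
    Passes only if the minimum score across models clears every gate."""
    return all(min(scores[gate].values()) >= thr
               for gate, thr in THRESHOLDS.items())

sample = {
    "embedding":    {"claude": 0.81, "gpt": 0.79, "gemini": 0.77},
    "fact_overlap": {"claude": 0.60, "gpt": 0.58, "gemini": 0.61},
    "llm_judge":    {"claude": 0.80, "gpt": 0.76, "gemini": 0.74},  # gemini fails gate 3
}
print(sample_passes(sample))  # False
```

A single model dipping below a single threshold fails the whole sample, which is why the strictest gate (the LLM judge) dominates the pass rates reported below.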
| Model | Sample (seed 42) | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| Qwen3-8B (step 4500) | 300 | 2.0% | 0.390 | 0.413 |
| Nanbeige 3B (iter 500) | 300 | 1.3% | 0.402 | 0.434 |
Qwen3-8B gate-level breakdown
Gate failure rates (how often each gate was the bottleneck):
| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 15.3% | 0.782 |
| Fact overlap (>= 0.55) | 76.0% | 0.535 |
| LLM judge (>= 0.75) | 96.3% | 0.584 |
Without the LLM judge (gates 1+2 only), the pass rate would be 24.0%.
Per-model avg scores (across 300 samples):
| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.783 | 0.539 | 0.572 |
| GPT-4o-mini | 0.795 | 0.525 | 0.607 |
| Gemini Flash | 0.768 | 0.542 | 0.574 |
Per-domain results (3-gate, all models):
| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 2.1% | 0.320 | 0.304 |
| Mixed | 171 | 1.8% | 0.415 | 0.444 |
| Code | 34 | 2.9% | 0.459 | 0.450 |
Nanbeige 3B gate-level breakdown
Gate failure rates (how often each gate was the bottleneck):
| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 12.7% | 0.804 |
| Fact overlap (>= 0.55) | 74.0% | 0.556 |
| LLM judge (>= 0.75) | 97.0% | 0.596 |
Without the LLM judge (gates 1+2 only), the pass rate would be 25.7%.
Per-model avg scores (across 300 samples):
| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.805 | 0.553 | 0.594 |
| GPT-4o-mini | 0.817 | 0.548 | 0.618 |
| Gemini Flash | 0.790 | 0.566 | 0.577 |
Per-domain results (3-gate, all models):
| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 3.2% | 0.346 | 0.350 |
| Mixed | 171 | 0.6% | 0.419 | 0.444 |
| Code | 34 | 0.0% | 0.476 | 0.456 |
Samples are domain-stratified (95 NL / 171 mixed / 34 code) to match the test set distribution.
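The stratified draw can be sketched as below; the quotas come from this README, while the helper and the `domain` field name are assumptions:

```python
import random

def stratified_sample(examples, quotas, seed=42):
    """Draw a fixed number of examples per domain with a seeded RNG."""
    rng = random.Random(seed)
    picked = []
    for domain, n in quotas.items():
        pool = [e for e in examples if e["domain"] == domain]
        picked.extend(rng.sample(pool, n))
    return picked

examples = [{"domain": d, "id": i}
            for i, d in enumerate(["nl"] * 200 + ["mixed"] * 300 + ["code"] * 50)]
sample = stratified_sample(examples, {"nl": 95, "mixed": 171, "code": 34})
print(len(sample))  # 300
```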
Training run details
Nanbeige 3B — 500 iterations (MLX)
- Model: `mlx-community/Nanbeige4.1-3B-8bit`
- LoRA: rank 8, alpha 16, 16 layers
- Config: 500 iters, batch size 4, lr 1e-4
- Eval artifact: `models/eval/ratio_nanbeige_iter500.jsonl`
Qwen3-8B (Tinker cloud)
- Model: `Qwen/Qwen3-8B`
- LoRA: rank 16, alpha 32, dropout 0.05
- Config: 2 epochs, batch size 4, lr 2e-4
- Status: completed, early-stopped at step 4962 (planned 9924)
- Best checkpoint: step 4500 (val_loss = 0.2579)
- Train examples: 19,845
Quick Start
```bash
# Clone and setup
git clone https://github.com/Sudhendra/compression-layer.git
cd compression-layer
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Copy and configure environment variables
cp .env.example .env   # Edit .env with your API keys

# Run tests
pytest tests/ -v
```
Environment Variables
Create a `.env` file (see `.env.example`):

```bash
ANTHROPIC_API_KEY=sk-ant-...  # For Claude intrinsic eval
OPENAI_API_KEY=sk-...         # For GPT intrinsic eval + downstream task model
GOOGLE_API_KEY=...            # For Gemini intrinsic eval
HF_TOKEN=hf_...               # For HuggingFace dataset access
TINKER_API_KEY=tk_...         # For Tinker cloud training/inference
```
Reproduce Evaluation
1. Extrinsic Eval (Downstream Task Performance)
The primary evaluation. Measures whether compressed context preserves task accuracy.
Build a dataset:
```bash
python scripts/prepare_downstream_eval.py \
  --benchmark hotpotqa \
  --split validation \
  --limit 25 \
  --output data/eval/downstream/hotpotqa_validation_25.jsonl
```
Run baseline comparisons (identity, truncate, extractive):
```bash
# Identity baseline (full context, no compression — upper bound)
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor identity \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_identity_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Truncation baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor truncate \
  --truncate-tokens 256 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_truncate_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Extractive baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor extractive \
  --extractive-chars 1000 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_extractive_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume
```
Run learned compressor (requires adapter):
```bash
# Local MLX adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_local \
  --adapter-model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/.../adapter \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adapterlocal_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Tinker cloud adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_tinker \
  --checkpoint-path "tinker://.../weights/step-004500" \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adaptertinker_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume
```
See `docs/downstream-eval.md` for the full workflow, including budget rules and reporting metrics.
2. Compression Eval (Token Ratio)
Measures how much the adapter compresses input tokens.
Local MLX backend (for Nanbeige / local models):

```bash
python scripts/evaluate_tinker.py \
  --backend local \
  --model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/2026-03-01_20-53-41--iter-500/adapter \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/eval/ratio_nanbeige_iter500.jsonl \
  --resume
```

Tinker cloud backend (for Qwen3-8B):

```bash
python scripts/evaluate_tinker.py \
  --backend tinker \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --checkpoint-path "tinker://161b1f39-3e50-53c0-9d75-f5ce804db7eb:train:0/weights/step-004500" \
  --output models/eval/ratio_qwen3-8b_tinker_step004500.jsonl \
  --show-examples 0 \
  --resume
```

3. Convert to Validation Pairs
Transforms compression eval output into the format needed for intrinsic equivalence testing:
```bash
python - <<'PY'
import json
from pathlib import Path
from src.inference.domain_classifier import DomainClassifier

src = Path("models/eval/ratio_qwen3-8b_tinker_step004500.jsonl")
dst = Path("models/eval/ratio_qwen3-8b_pairs.jsonl")
clf = DomainClassifier()

with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        row = json.loads(line)
        verbose = row["input_text"]
        compressed = row["generated_output"]
        domain = clf.classify(verbose).value
        fout.write(json.dumps({
            "verbose": verbose,
            "compressed": compressed,
            "domain": domain
        }) + "\n")
PY
```
4. Intrinsic Equivalence Eval (Cross-Model Semantic Fidelity)
A diagnostic evaluation — tests whether compressed outputs produce equivalent reasoning across Claude, GPT, and Gemini. Useful for understanding where information is lost (which gate fails, which facts get dropped):
```bash
python scripts/validate_batch.py \
  --input models/eval/pairs_sample300_seed42_qwen_step004500.jsonl \
  --output models/eval/equiv_qwen_step004500_judge.jsonl \
  --models claude gpt gemini \
  --use-llm-judge \
  --save-all \
  --concurrency 3 \
  --resume
```
How It Works
Pipeline Overview
- Raw Corpora — Source material from code repositories and natural language documents
- Synthetic Generation — A trained adapter generates compression pairs from raw inputs, bootstrapping the training data
- Preprocessing — Pairs are cleaned and filtered by compression ratio to remove degenerate outputs
- SFT Training — LoRA fine-tuning on the filtered pairs teaches the model to compress while preserving semantics
- Compression Eval — Token ratio measurement (`output_tokens / input_tokens`) across the held-out test set
- Extrinsic Eval — Compressed context is substituted into downstream QA and code tasks; accuracy deltas measure the real-world cost of compression
- Intrinsic Equivalence — The compressed text is fed to Claude, GPT, and Gemini; a 3-gate scoring system diagnoses where semantic fidelity breaks down
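The ratio-based filtering in the preprocessing step above can be sketched as follows; the bounds and field names are illustrative assumptions, not the project's config:

```python
def keep_pair(pair, lo=0.05, hi=0.9):
    """Drop degenerate pairs: near-empty compressions or ones that barely compress."""
    ratio = pair["output_tokens"] / pair["input_tokens"]
    return lo <= ratio <= hi

pairs = [
    {"output_tokens": 100, "input_tokens": 300},  # ratio ~0.33 -> keep
    {"output_tokens": 2,   "input_tokens": 300},  # ratio ~0.007 -> drop (near-empty)
    {"output_tokens": 290, "input_tokens": 300},  # ratio ~0.97 -> drop (no compression)
]
print([keep_pair(p) for p in pairs])  # [True, False, False]
```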
For full data pipeline reproduction commands, see `docs/data-pipeline.md`.
Training
Training is supported on two backends:
- MLX (local) — Apple Silicon, quantized models (3B–4B). Uses `mlx-lm` for LoRA fine-tuning.
- Tinker (cloud) — Remote GPU training for larger models (8B+). Manages checkpoints, metrics, and early stopping.
```bash
# Local MLX training
python scripts/train_tinker.py \
  --backend local \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --local-model mlx-community/Qwen3-4B-Instruct-2507-8bit \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/local_run

# Cloud Tinker training
python scripts/train_tinker.py \
  --backend tinker \
  --model Qwen/Qwen3-8B \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/tinker
```
Project Structure
```
compression-layer/
├── src/
│   ├── validation/        # Cross-model intrinsic equivalence testing
│   ├── evaluation/        # Evaluation logic
│   │   └── downstream/    # Extrinsic eval: runner, dataset, scoring, baselines
│   │       └── benchmarks/ # HotPotQA, Qasper, DS1000 benchmark loaders
│   ├── generation/        # Compression pair generation
│   ├── training/          # Tinker + MLX training pipelines
│   ├── inference/         # Compression inference (local + cloud)
│   └── utils/             # Tokenizers, caching, cost tracking
├── scripts/               # CLI entry points
│   ├── train_tinker.py            # Training (local MLX / cloud Tinker)
│   ├── evaluate_tinker.py         # Compression ratio evaluation
│   ├── evaluate_downstream.py     # Extrinsic downstream evaluation
│   ├── prepare_downstream_eval.py # Benchmark dataset preparation
│   ├── validate_batch.py          # Intrinsic equivalence evaluation
│   └── ...
├── data/                  # Corpora and datasets (gitignored)
├── models/                # Checkpoints and eval artifacts (gitignored)
├── configs/               # YAML configurations
├── docs/                  # Documentation and plans
├── tests/                 # Test suite
└── assets/                # README images and diagrams
```
Contributing
We welcome contributions, especially in these areas:
- New domains — Extending compression beyond code and natural language (e.g., structured data, mathematical notation)
- Model experiments — Training adapters on different base models or with alternative LoRA configurations
- Evaluation methodology — Improving the intrinsic equivalence scoring, adding new downstream benchmarks, or refining thresholds
- Dataset expansion — Contributing to the HuggingFace dataset with new high-quality compression pairs
See CONTRIBUTING.md for setup instructions, branch strategy,
and development workflow.