compression-layer: a model-agnostic semantic compression layer that reduces token count for LLM inputs (memories, code, context) while preserving task performance across Claude, GPT, and Gemini.

License: MIT · Python 3.10+ · Dataset on HF

Can we fine-tune small LLMs to compress context while preserving semantic equivalence across models?

This project trains LoRA adapters on small (3B–8B) language models to rewrite verbose LLM context into compressed representations that produce equivalent reasoning outputs from Claude, GPT, and Gemini — reducing token usage while maintaining answer quality.

Read the full write-up: Semantic Compression — Research Overview

Pipeline: Raw Corpora → Synthetic Generation → Preprocessing → SFT Training → Compression Eval → Equivalence Eval


Key Results

Summary of trained adapter evaluations on the full test set (2,497 examples from Sudhendra/semantic-compression-sft):

Compression (Token Ratio)

| Model | Params | Backend | LoRA Config | Avg Token Ratio | Avg Input Tokens | Avg Output Tokens |
|---|---|---|---|---|---|---|
| Nanbeige4.1-3B (8-bit) | 3B | MLX (local) | rank 8, lr 1e-4, 500 iters | 34.8% | 276.3 | 103.2 |
| Qwen3-8B | 8B | Tinker (cloud) | rank 16, lr 2e-4, 4962 steps | 38.6% | 237.5 | 98.1 |

Token ratio = output_tokens / input_tokens. Lower is better. A ratio of 34.8% means the compressed output uses ~35% of the original token count.
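As a sketch, the reported average can be computed per example and then averaged (whether the pipeline averages per-example ratios or divides total output tokens by total input tokens is an assumption here; the two differ when example lengths vary):

```python
def token_ratio(input_tokens: int, output_tokens: int) -> float:
    """Compression ratio for one example: fraction of the original
    token budget the compressed output uses. Lower is better."""
    if input_tokens == 0:
        raise ValueError("input must be non-empty")
    return output_tokens / input_tokens

def avg_token_ratio(pairs):
    """Mean of per-example ratios (note: not total_output / total_input,
    which would weight long examples more heavily)."""
    ratios = [token_ratio(i, o) for i, o in pairs]
    return sum(ratios) / len(ratios)

# Illustrative (input_tokens, output_tokens) pairs
pairs = [(276, 103), (300, 90), (250, 120)]
avg = avg_token_ratio(pairs)  # ~0.384
```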

Extrinsic Evaluation (Downstream Task Performance)

The primary evaluation: does compressed context still let a task model answer correctly? Each example is evaluated twice — once with full context and once with compressed context — using the same task model and prompt. The delta between full-context and compressed-context accuracy measures the cost of compression.
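The paired scoring can be sketched as follows (a simplified illustration; the actual runner also tracks F1, compression ratio, and cost per example):

```python
def paired_deltas(full_scores, compressed_scores):
    """Aggregate a paired eval: the same examples scored twice, once with
    full context and once with compressed context. Returns (compressed
    accuracy, delta vs. full context); a negative delta is the accuracy
    cost of compression."""
    if len(full_scores) != len(compressed_scores):
        raise ValueError("runs must cover the same examples, in the same order")
    n = len(full_scores)
    full_acc = sum(full_scores) / n
    comp_acc = sum(compressed_scores) / n
    return comp_acc, comp_acc - full_acc

# Per-example EM scores (1 = exact match) from the two runs
full = [1, 0, 1, 1, 0]
compressed = [1, 0, 0, 1, 0]
acc, delta = paired_deltas(full, compressed)  # acc 0.4, delta -0.2
```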

Extrinsic Evaluation: Benchmark datasets → Paired evaluation (full vs compressed context) → Task model → Per-example scoring (EM, F1, compression ratio, cost) → Aggregate comparison

Supported benchmarks:

| Benchmark | Task | Metric |
|---|---|---|
| HotPotQA | Multi-hop QA | Exact Match, Token F1 |
| Qasper | Long-document QA | Exact Match, Token F1 |
| DS1000 | Code generation | Exact Match, Token F1 |
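Token F1 refers to the standard SQuAD-style bag-of-words F1 between prediction and reference; a minimal sketch (the project's exact answer normalization, e.g. punctuation or article stripping, is not specified here and is omitted):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token F1: harmonic mean of precision and recall
    over the bag-of-words overlap between prediction and reference."""
    pred_toks = prediction.lower().split()
    ref_toks = reference.lower().split()
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the eiffel tower", "eiffel tower")` scores 0.8 (precision 2/3, recall 1), while a fully disjoint answer scores 0.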

Compressor baselines:

| Compressor | Description |
|---|---|
| identity | No compression (upper bound on accuracy) |
| truncate | Hard token-limit truncation |
| extractive | Top sentences by TF-IDF similarity to query |
| adapter_local | Learned LoRA adapter (MLX, local) |
| adapter_tinker | Learned LoRA adapter (Tinker, cloud) |
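The extractive baseline's idea can be sketched stdlib-only as below (the project presumably uses a proper TF-IDF implementation; the sentence splitter, smoothing, and scoring here are simplifications):

```python
import math
import re
from collections import Counter

def tfidf_vectors(texts):
    """Simple smoothed TF-IDF vectors over a list of texts."""
    tokenized = [re.findall(r"\w+", t.lower()) for t in texts]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(texts)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({w: (c / (len(toks) or 1)) * math.log((1 + n) / (1 + df[w]))
                     for w, c in tf.items()})
    return vecs

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extractive_compress(context: str, query: str, top_k: int = 2) -> str:
    """Keep the top_k sentences most similar to the query, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    vecs = tfidf_vectors(sentences + [query])
    qvec = vecs[-1]
    best = sorted(range(len(sentences)),
                  key=lambda i: cosine(vecs[i], qvec), reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(best))
```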

Results — 300 examples per benchmark, gpt-4o-mini as task model, seed 42. Each example is evaluated twice (full context vs compressed context); deltas show the accuracy cost of compression. See docs/downstream-eval.md for the full workflow.

HotPotQA (multi-hop QA, n = 300):

| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| identity | 0.043 | −0.003 | 0.177 | +0.000 | 1.00 | $0.129 |
| truncate | 0.040 | −0.007 | 0.093 | −0.087 | 0.31 | $0.085 |
| extractive | 0.030 | −0.020 | 0.068 | −0.112 | 0.20 | $0.079 |
| adapter_local | 0.027 | −0.020 | 0.104 | −0.075 | 0.26 | $0.084 |
| adapter_tinker | 0.010 | −0.037 | 0.067 | −0.109 | 0.31 | $0.087 |

DS1000 (code generation, n = 300):

| Compressor | Compressed EM | ΔEM | Compressed F1 | ΔF1 | Ratio | Cost |
|---|---|---|---|---|---|---|
| identity | 0.213 | 0.000 | 0.459 | +0.002 | 1.00 | $0.102 |
| truncate | 0.210 | +0.003 | 0.435 | −0.021 | 0.87 | $0.094 |
| extractive | 0.170 | −0.030 | 0.326 | −0.128 | 0.65 | $0.087 |
| adapter_local | 0.067 | −0.143 | 0.159 | −0.301 | 0.38 | $0.079 |
| adapter_tinker | 0.037 | −0.170 | 0.079 | −0.386 | 0.62 | $0.084 |

Key takeaways:

  • Identity confirms the baseline — zero compression, near-zero delta (noise only).
  • Truncate preserves accuracy well on DS1000 (ratio 0.87, ΔF1 −0.02) where contexts are shorter, but loses more on HotPotQA (ratio 0.31, ΔF1 −0.09) where multi-hop evidence spans the full context.
  • Learned adapters achieve aggressive compression (0.26–0.62 ratio) but with significant accuracy drops, especially on code generation. The adapter was trained on conversational text, not code — domain mismatch explains the gap.
  • Qasper results pending.

Intrinsic Equivalence (Cross-Model Semantic Fidelity)

A diagnostic evaluation measuring whether the compressed text preserves the same semantic content as the original verbose text. Both are sent to three frontier models (Claude, GPT, Gemini) for fact-extraction tasks; the outputs are compared through a 3-gate system:

| Gate | Metric | Threshold | What it catches |
|---|---|---|---|
| 1 | Embedding similarity (MiniLM-L6-v2) | >= 0.60 | Gross topic drift, garbled outputs |
| 2 | Fact overlap (atomic fact coverage) | >= 0.55 | Dropped facts, numbers, parameters |
| 3 | LLM judge | >= 0.75 | Subtle reasoning gaps, quality issues |

A sample passes only when the minimum score across all three models meets or exceeds the threshold on every active gate.
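The pass rule can be sketched as follows (gate names and the layout of the scores dict are illustrative, not the project's actual schema):

```python
THRESHOLDS = {"embedding": 0.60, "fact_overlap": 0.55, "llm_judge": 0.75}

def passes(scores_by_model: dict, active_gates=tuple(THRESHOLDS)) -> bool:
    """scores_by_model: {model_name: {gate_name: score}}.
    A sample passes only if, for every active gate, the minimum score
    across all models clears that gate's threshold."""
    for gate in active_gates:
        min_score = min(scores[gate] for scores in scores_by_model.values())
        if min_score < THRESHOLDS[gate]:
            return False
    return True
```

Restricting `active_gates` to `("embedding", "fact_overlap")` reproduces the "gates 1+2 only" pass rates reported below.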

Intrinsic Equivalence: 3-Gate Scoring — Compression pair → Frontier model (Claude/GPT/Gemini) fact extraction → 3-gate comparison (embedding, fact overlap, LLM judge) → per-model aggregation → final pass/fail verdict

| Model | Sample (seed 42) | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| Qwen3-8B (step 4500) | 300 | 2.0% | 0.390 | 0.413 |
| Nanbeige 3B (iter 500) | 300 | 1.3% | 0.402 | 0.434 |

Qwen3-8B gate-level breakdown

Gate failure rates (how often each gate was the bottleneck):

| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 15.3% | 0.782 |
| Fact overlap (>= 0.55) | 76.0% | 0.535 |
| LLM judge (>= 0.75) | 96.3% | 0.584 |

Without the LLM judge (gates 1+2 only), the pass rate would be 24.0%.

Per-model avg scores (across 300 samples):

| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.783 | 0.539 | 0.572 |
| GPT-4o-mini | 0.795 | 0.525 | 0.607 |
| Gemini Flash | 0.768 | 0.542 | 0.574 |

Per-domain results (3-gate, all models):

| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 2.1% | 0.320 | 0.304 |
| Mixed | 171 | 1.8% | 0.415 | 0.444 |
| Code | 34 | 2.9% | 0.459 | 0.450 |

Nanbeige 3B gate-level breakdown

Gate failure rates (how often each gate was the bottleneck):

| Gate | Fail Rate | Avg Score |
|---|---|---|
| Embedding (>= 0.60) | 12.7% | 0.804 |
| Fact overlap (>= 0.55) | 74.0% | 0.556 |
| LLM judge (>= 0.75) | 97.0% | 0.596 |

Without the LLM judge (gates 1+2 only), the pass rate would be 25.7%.

Per-model avg scores (across 300 samples):

| Model | Embedding | Fact Overlap | LLM Judge |
|---|---|---|---|
| Claude Sonnet | 0.805 | 0.553 | 0.594 |
| GPT-4o-mini | 0.817 | 0.548 | 0.618 |
| Gemini Flash | 0.790 | 0.566 | 0.577 |

Per-domain results (3-gate, all models):

| Domain | n | Pass Rate | Avg Min-Equiv | Median Min-Equiv |
|---|---|---|---|---|
| NL | 95 | 3.2% | 0.346 | 0.350 |
| Mixed | 171 | 0.6% | 0.419 | 0.444 |
| Code | 34 | 0.0% | 0.476 | 0.456 |

Samples are domain-stratified (95 NL / 171 mixed / 34 code) to match the test set distribution.
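Domain-stratified sampling of this kind can be sketched as follows (field names and quotas are illustrative; the project's sampler may differ):

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, quotas, seed=42):
    """Draw a fixed number of rows per domain so the sample matches the
    test-set distribution, e.g. quotas={"nl": 95, "mixed": 171, "code": 34}."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row[key]].append(row)
    sample = []
    for domain, n in quotas.items():
        sample.extend(rng.sample(by_domain[domain], n))
    return sample
```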

Training run details

Nanbeige 3B — 500 iterations (MLX)

  • Model: mlx-community/Nanbeige4.1-3B-8bit
  • LoRA: rank 8, alpha 16, 16 layers
  • Config: 500 iters, batch size 4, lr 1e-4
  • Eval artifact: models/eval/ratio_nanbeige_iter500.jsonl

Qwen3-8B (Tinker cloud)

  • Model: Qwen/Qwen3-8B
  • LoRA: rank 16, alpha 32, dropout 0.05
  • Config: 2 epochs, batch size 4, lr 2e-4
  • Status: completed, early-stopped at step 4962 (planned 9924)
  • Best checkpoint: step 4500 (val_loss = 0.2579)
  • Train examples: 19,845

Quick Start

# Clone and setup
git clone https://github.com/Sudhendra/compression-layer.git
cd compression-layer

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your API keys

# Run tests
pytest tests/ -v

Environment Variables

Create a .env file (see .env.example):

ANTHROPIC_API_KEY=sk-ant-...   # For Claude intrinsic eval
OPENAI_API_KEY=sk-...          # For GPT intrinsic eval + downstream task model
GOOGLE_API_KEY=...             # For Gemini intrinsic eval
HF_TOKEN=hf_...                # For HuggingFace dataset access
TINKER_API_KEY=tk_...          # For Tinker cloud training/inference

Reproduce Evaluation

1. Extrinsic Eval (Downstream Task Performance)

The primary evaluation. Measures whether compressed context preserves task accuracy.

Build a dataset:

python scripts/prepare_downstream_eval.py \
  --benchmark hotpotqa \
  --split validation \
  --limit 25 \
  --output data/eval/downstream/hotpotqa_validation_25.jsonl

Run baseline comparisons (identity, truncate, extractive):

# Identity baseline (full context, no compression — upper bound)
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor identity \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_identity_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Truncation baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor truncate \
  --truncate-tokens 256 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_truncate_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Extractive baseline
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor extractive \
  --extractive-chars 1000 \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_extractive_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

Run learned compressor (requires adapter):

# Local MLX adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_local \
  --adapter-model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/.../adapter \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adapterlocal_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

# Tinker cloud adapter
python scripts/evaluate_downstream.py \
  --dataset data/eval/downstream/hotpotqa_validation_25.jsonl \
  --compressor adapter_tinker \
  --checkpoint-path "tinker://.../weights/step-004500" \
  --task-model gpt-4o-mini \
  --output models/eval/downstream_hotpotqa25_adaptertinker_gpt4omini.jsonl \
  --max-cost 0.50 \
  --resume

See docs/downstream-eval.md for the full workflow including budget rules and reporting metrics.

2. Compression Eval (Token Ratio)

Measures how much the adapter compresses input tokens.

Local MLX backend (for Nanbeige / local models):

python scripts/evaluate_tinker.py \
  --backend local \
  --model mlx-community/Nanbeige4.1-3B-8bit \
  --adapter-path models/runs/mlx/2026-03-01_20-53-41--iter-500/adapter \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/eval/ratio_nanbeige_iter500.jsonl \
  --resume

Tinker cloud backend (for Qwen3-8B):

python scripts/evaluate_tinker.py \
  --backend tinker \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --checkpoint-path "tinker://161b1f39-3e50-53c0-9d75-f5ce804db7eb:train:0/weights/step-004500" \
  --output models/eval/ratio_qwen3-8b_tinker_step004500.jsonl \
  --show-examples 0 \
  --resume

3. Convert to Validation Pairs

Transforms compression eval output into the format needed for intrinsic equivalence testing:

python - <<'PY'
import json
from pathlib import Path
from src.inference.domain_classifier import DomainClassifier

src = Path("models/eval/ratio_qwen3-8b_tinker_step004500.jsonl")
dst = Path("models/eval/ratio_qwen3-8b_pairs.jsonl")
clf = DomainClassifier()

with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        row = json.loads(line)
        verbose = row["input_text"]
        compressed = row["generated_output"]
        domain = clf.classify(verbose).value
        fout.write(json.dumps({
            "verbose": verbose,
            "compressed": compressed,
            "domain": domain
        }) + "\n")
PY

4. Intrinsic Equivalence Eval (Cross-Model Semantic Fidelity)

A diagnostic evaluation — tests whether compressed outputs produce equivalent reasoning across Claude, GPT, and Gemini. Useful for understanding where information is lost (which gate fails, which facts get dropped):

python scripts/validate_batch.py \
  --input models/eval/pairs_sample300_seed42_qwen_step004500.jsonl \
  --output models/eval/equiv_qwen_step004500_judge.jsonl \
  --models claude gpt gemini \
  --use-llm-judge \
  --save-all \
  --concurrency 3 \
  --resume

How It Works

Pipeline Overview

  1. Raw Corpora — Source material from code repositories and natural language documents
  2. Synthetic Generation — A trained adapter generates compression pairs from raw inputs, bootstrapping the training data
  3. Preprocessing — Pairs are cleaned and filtered by compression ratio to remove degenerate outputs
  4. SFT Training — LoRA fine-tuning on the filtered pairs teaches the model to compress while preserving semantics
  5. Compression Eval — Token ratio measurement (output_tokens / input_tokens) across the held-out test set
  6. Extrinsic Eval — Compressed context is substituted into downstream QA and code tasks; accuracy deltas measure the real-world cost of compression
  7. Intrinsic Equivalence — The compressed text is fed to Claude, GPT, and Gemini; a 3-gate scoring system diagnoses where semantic fidelity breaks down
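
Step 3's ratio filtering can be sketched as follows (the bounds and the word-count proxy for tokens are illustrative; the project's actual filter thresholds are not stated here):

```python
def filter_pairs(pairs, min_ratio=0.1, max_ratio=0.8):
    """Drop degenerate compression pairs: near-empty outputs (ratio too
    low to be faithful) and outputs that barely compress (ratio too high
    to be useful). Uses word counts as a rough stand-in for tokens."""
    kept = []
    for pair in pairs:
        verbose_len = max(len(pair["verbose"].split()), 1)
        ratio = len(pair["compressed"].split()) / verbose_len
        if min_ratio <= ratio <= max_ratio:
            kept.append(pair)
    return kept
```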

For full data pipeline reproduction commands, see docs/data-pipeline.md.

Training

Training is supported on two backends:

  • MLX (local) — Apple Silicon, quantized models (3B–4B). Uses mlx-lm for LoRA fine-tuning.
  • Tinker (cloud) — Remote GPU training for larger models (8B+). Manages checkpoints, metrics, and early stopping.

# Local MLX training
python scripts/train_tinker.py \
  --backend local \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --local-model mlx-community/Qwen3-4B-Instruct-2507-8bit \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/local_run

# Cloud Tinker training
python scripts/train_tinker.py \
  --backend tinker \
  --model Qwen/Qwen3-8B \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/tinker

Project Structure

compression-layer/
├── src/
│   ├── validation/        # Cross-model intrinsic equivalence testing
│   ├── evaluation/        # Evaluation logic
│   │   └── downstream/    # Extrinsic eval: runner, dataset, scoring, baselines
│   │       └── benchmarks/  # HotPotQA, Qasper, DS1000 benchmark loaders
│   ├── generation/        # Compression pair generation
│   ├── training/          # Tinker + MLX training pipelines
│   ├── inference/         # Compression inference (local + cloud)
│   └── utils/             # Tokenizers, caching, cost tracking
├── scripts/               # CLI entry points
│   ├── train_tinker.py    # Training (local MLX / cloud Tinker)
│   ├── evaluate_tinker.py # Compression ratio evaluation
│   ├── evaluate_downstream.py  # Extrinsic downstream evaluation
│   ├── prepare_downstream_eval.py  # Benchmark dataset preparation
│   ├── validate_batch.py  # Intrinsic equivalence evaluation
│   └── ...
├── data/                  # Corpora and datasets (gitignored)
├── models/                # Checkpoints and eval artifacts (gitignored)
├── configs/               # YAML configurations
├── docs/                  # Documentation and plans
├── tests/                 # Test suite
└── assets/                # README images and diagrams

Contributing

We welcome contributions, especially in these areas:

  • New domains — Extending compression beyond code and natural language (e.g., structured data, mathematical notation)
  • Model experiments — Training adapters on different base models or with alternative LoRA configurations
  • Evaluation methodology — Improving the intrinsic equivalence scoring, adding new downstream benchmarks, or refining thresholds
  • Dataset expansion — Contributing to the HuggingFace dataset with new high-quality compression pairs

See CONTRIBUTING.md for setup instructions, branch strategy, and development workflow.


License

MIT