Code repository for the paper: Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation
This repository provides the evaluation framework for the paper Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation, focusing on knowledge-intensive QA tasks. The core method is ProbeRAG, which is compared against several baseline and alternative approaches.
Features
- Automated QA Evaluation: computes F1 score and Exact Match (EM) automatically.
- Multi-model Support: works with LLaMA, Qwen, Mistral, and other open-source LLMs.
- Dataset Coverage: includes benchmarks such as FaithEval, ConFiQA, and SQuAD.
- Multiple Baselines: supports ProbeRAG, WO-Context, KRE, OPIN, CANOE, ContextDPO, ParamMute, etc.
Project Structure
src
├── dataset_wrapper.py # Dataset wrappers (RAGDataset, ConFiQA, FaithEval, Squad, etc.)
├── modules.py # Core modules (e.g., ConflictDetector)
├── classifier.py # Classifier Definition
├── prompt_template.py # Prompt Template
├── vector_extractor.py # Hidden State Extractor
├── utils.py # Utility functions (model loading, etc.)
scripts
├── evaluate_qa.py # Main evaluation script
├── classifier_training.py # Classifier training script
├── llm_sft.py # LLM fine-tuning script
├── prepare_sft_data.py # SFT data preparation script
├── train_sft.sh # LLM fine-tuning start script
Key Components
- QAEvaluator: the main evaluation class. It:
  - Generates prompts and gold answers
  - Runs inference with the target model
  - Computes evaluation metrics (F1, EM)
  - Saves detailed results to JSON
- Metrics (a minimal F1 sketch follows this list):
  - f1_score: token-level F1 based on word overlap
  - exact_match_score: exact match, with special handling for negation in counterfactual QA
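The F1 metric is standard SQuAD-style token overlap. As a rough illustration (not the repository's exact code; tokenization and normalization details may differ), the computation looks like this:

```python
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 based on word overlap (SQuAD-style sketch)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# With multiple gold answers, the maximum score over the answer list
# is typically reported for each question.
```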
Supported Datasets
- FaithEval / Counterfactual QA
- ConFiQA (MC, MR, QA subsets)
- SQuAD
Dataset paths are configured in dataset_dict, and new datasets can be added easily.
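The exact keys are defined in the code, but registering a dataset amounts to adding a name-to-path entry. The snippet below is a hypothetical illustration, not the actual contents of dataset_dict:

```python
# Hypothetical example entries; the real keys and file names in the repo may differ.
dataset_dict = {
    "faitheval": "data/faitheval/counterfactual.json",
    "confiqa_qa": "data/ConFiQA/QA.json",
    "squad": "data/squad/dev-v1.1.json",
}
```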
Supported Models
- Meta-LLaMA-3.1-8B-Instruct
- LLaMA-2-7B-Chat
- Qwen2.5-7B-Instruct
- Qwen3-8B
- Mistral-7B-Instruct-v0.3
Mappings are defined in model_dict for easy extension.
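model_dict follows the same pattern, mapping a short model alias to a Hugging Face model ID or a local checkpoint directory; these entries are illustrative only:

```python
# Illustrative aliases and paths; adjust to your local setup.
model_dict = {
    "llama3.1-8b-instruct": "meta-llama/Llama-3.1-8B-Instruct",
    "qwen2.5-7b-instruct": "checkpoints/Qwen2.5-7B-Instruct",
    "mistral-7b-instruct": "mistralai/Mistral-7B-Instruct-v0.3",
}
```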
Usage
- Prepare data and models
  - Place datasets under data/
  - Store model weights under checkpoints/ or specify their path
- Run evaluation
  python scripts/evaluate_qa.py
- Results
  - Results are saved to experiments/qa_exp/{dataset}/{method}-{model}-{date}.json
  - Each file includes overall metrics and per-case predictions
  - Example output:
[
  {
    "total_f1": 0.6123,
    "total_em": 0.4321
  },
  {
    "question": "...",
    "answer": ["gold answer 1", "gold answer 2"],
    "predict": "predicted answer",
    "f1": 0.75,
    "em": 1
  }
]
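Because the first element of the list is the aggregate summary and the remaining elements are per-case records, a result file can be inspected with a few lines of Python (the path below is only an example):

```python
import json

# Example path; substitute the actual output file.
path = "experiments/qa_exp/faitheval/ProbeRAG-llama3.1-8b-2025-01-01.json"
with open(path) as f:
    results = json.load(f)

summary, cases = results[0], results[1:]
print(f"F1: {summary['total_f1']:.4f}  EM: {summary['total_em']:.4f}")
print(f"Number of cases: {len(cases)}")
```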
Citation
If you find this work helpful, please cite our paper:
@article{gao2025probing,
title={Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation},
author={Gao, Linfeng and Bi, Baolong and Yuan, Zheng and Wang, Le and Chen, Zerui and Wei, Zhimin and Liu, Shenghua and Zhang, Qinggang and Su, Jinsong},
journal={arXiv preprint arXiv:2510.12460},
year={2025}
}
