If you like our project, please give us a star on GitHub for the latest updates!
InnoEval is an automated evaluation framework designed for assessing research ideas and innovation proposals. It leverages multi-agent systems and LLMs to comprehensively evaluate the novelty, feasibility, and significance of research contributions.
- Multi-Agent Pipeline: A chain of specialized agents (Extraction, Research, Grounding, Evaluation, Report) working together
- Multi-Source Grounding: Gathers evidence from web pages, code repositories, and academic papers to validate claims
- Persona-Based Evaluation: Simulates multiple reviewer perspectives for balanced and comprehensive assessment
- Flexible Input Modes: Supports both PDF URLs and direct text input for research ideas
- Batch Processing: Point-wise and group-wise evaluation for large-scale dataset analysis
Table of Contents
- Installation
- Quick Start
- Architecture
- Examples
- Configuration
- Acknowledgement
- Citation
Installation
1. Clone the Repository
git clone https://github.com/your-org/InnoEval.git
cd InnoEval
2. Create Virtual Environment
conda create -n innoeval python=3.10 -y
conda activate innoeval
3. Install Dependencies
pip install -r requirements.txt
4. Configure API Keys
Copy the example configuration file and fill in your API keys:
cd config/
cp LLM.env.example LLM.env
# Edit LLM.env with your API keys
Required API keys:
| Key | Description |
|---|---|
| DS_API_KEY | DeepSeek API key (primary LLM) |
| DS_API_BASE_URL | DeepSeek API base URL |
| OPENAI_API_KEY | OpenAI API key (optional) |
| GOOGLE_API_KEY | Google Search API key |
| SERPER_API_KEY | Serper API key for web search |
| JINA_API_KEY | Jina API key for content extraction |
| S2_API_KEY | Semantic Scholar API key |
| GH_TOKEN | GitHub token for repository analysis |
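Before launching a long pipeline run, it can save time to verify that the required keys are actually present in the environment. Below is a minimal sanity-check sketch; the `missing_keys` helper and the `REQUIRED_KEYS` list are our own illustration (not part of InnoEval), and they assume the variables from `LLM.env` have been exported into the process environment:

```python
import os

# Keys listed as required in the table above (illustrative list).
REQUIRED_KEYS = ["DS_API_KEY", "DS_API_BASE_URL", "GOOGLE_API_KEY",
                 "SERPER_API_KEY", "JINA_API_KEY", "S2_API_KEY", "GH_TOKEN"]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    return [k for k in required if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys(os.environ)
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All required keys are set.")
```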
Quick Start
1. Single Idea Evaluation
Run the complete pipeline for a single research idea:
cd InnoEval
python3 -m innoeval.pipeline.single_idea_pipeline
This executes the full 6-step pipeline:
- ExtractionAgent: Extract structured idea from PDF/text
- ResearchAgent: Search for related works (web, code, papers)
- Report Extraction: Build evidence reports from search results
- GroundingAgent: Map claims to supporting evidence
- EvaluationAgent: Multi-perspective quality assessment
- ReportAgent: Generate final evaluation report
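The six steps above form a linear hand-off, with each agent consuming the previous agent's output. The sketch below shows that flow schematically using stub functions; every name and return shape here is illustrative only, not InnoEval's actual API:

```python
# Schematic of the 6-step hand-off; each stage consumes the previous output.
def extract(raw):           # Step 1: ExtractionAgent
    return {"idea": raw.strip()}

def research(idea):         # Step 2: ResearchAgent
    return {"results": [f"related work for: {idea['idea']}"]}

def build_reports(res):     # Step 3: Report extraction
    return {"reports": res["results"]}

def ground(idea, reports):  # Step 4: GroundingAgent
    return {"claims": {idea["idea"]: reports["reports"]}}

def evaluate(claims):       # Step 5: EvaluationAgent
    return {"scores": {"novelty": 4}}

def report(evaluation):     # Step 6: ReportAgent
    return f"Final report: {evaluation['scores']}"

def run_pipeline(raw_text):
    idea = extract(raw_text)
    results = research(idea)
    reports = build_reports(results)
    claims = ground(idea, reports)
    evaluation = evaluate(claims)
    return report(evaluation)

print(run_pipeline("LLM-based code review"))
```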
2. Point-wise Dataset Evaluation
Evaluate an entire dataset of research papers:
python3 -m innoeval.pipeline.batch_pipeline
Results are saved to cache/dataset_conference_points/.
3. Group Dataset Evaluation
Process papers organized in groups:
python3 -m innoeval.pipeline.group_pipeline
Results are saved to cache/dataset_conference_groups/.
4. Group/Pair Evaluation
Run comparison evaluation on cached group results:
# Group-wise comparison and ranking
python3 -m innoeval.pipeline.group_evaluation
# Pair-wise comparison
python3 -m innoeval.pipeline.pair_evaluation
These scripts read from cache/dataset_conference_groups/ and do not re-run the pipeline.
Architecture
Directory Structure
InnoEval/
├── config/                       # Configuration files
│   ├── LLM.env                   # API keys (not tracked)
│   ├── LLM.env.example           # Example configuration
│   └── kaggle.json               # Kaggle API config
├── dataset/                      # Evaluation datasets
│   ├── conference_points.jsonl   # Point-wise dataset
│   ├── conference_groups.json    # Group-wise dataset
│   └── conference_pairs_*.json   # Pair datasets
├── cache/                        # Pipeline results cache
│   └── reviewer_personas.json    # Reviewer personas
└── innoeval/                     # Main package
    ├── mas/                      # Multi-Agent System
    │   ├── agents/               # Agent implementations
    │   │   ├── extraction_agent.py
    │   │   ├── research_agent.py
    │   │   ├── grounding_agent.py
    │   │   ├── evaluation_agent.py
    │   │   └── report_agent.py
    │   ├── models/               # LLM and model interfaces
    │   │   ├── model_factory.py
    │   │   └── bge_singleton.py
    │   └── tools/                # Utility tools
    │       ├── searchers/        # Web/code/paper search
    │       ├── querygen/         # Query generation
    │       ├── enricher/         # Content enrichment
    │       ├── grobid_refs/      # Reference extraction
    │       └── repo_analysis/    # GitHub repo analysis
    └── pipeline/                 # Pipeline implementations
        ├── single_idea_pipeline.py
        ├── batch_pipeline.py
        ├── group_pipeline.py
        ├── group_evaluation.py
        └── pair_evaluation.py
Pipeline Workflow
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Input: PDF URL  │────▶│ ExtractionAgent │────▶│   Idea Object   │
│  or Text Input  │     │    (Extract)    │     │  (structured)   │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Web Pages    │     │  ResearchAgent  │────▶│  SearchResults  │
│    Code Repos   │◀────│    (Search)     │     │   (enriched)    │
│      Papers     │     └─────────────────┘     └────────┬────────┘
└─────────────────┘                                      │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Claims Map   │◀────│ GroundingAgent  │◀────│  Reports Data   │
│   (evidence)    │     │   (Grounding)   │     │   (extracted)   │
└────────┬────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Personas    │────▶│ EvaluationAgent │────▶│EvaluationResult │
│   (reviewers)   │     │   (Evaluate)    │     │  (per-persona)  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │   ReportAgent   │
                                                │  (Synthesize)   │
                                                └────────┬────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │   Final Report  │
                                                │    (Markdown)   │
                                                └─────────────────┘
Evaluation Dimensions
The framework evaluates research ideas across five core dimensions:
| Dimension | Description |
|---|---|
| Clarity | How clearly the idea is presented and explained |
| Novelty | Originality and innovation compared to existing work |
| Validity | Soundness of methodology and theoretical foundations |
| Feasibility | Practical implementability with available resources |
| Significance | Potential impact and contribution to the field |
Custom evaluation metrics can be added through the user_metric parameter.
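One natural way to combine per-persona, per-dimension scores into a single profile is a per-dimension average. The sketch below is purely illustrative: it assumes scores on a 1-5 scale and a plain averaging scheme, which is not necessarily the aggregation EvaluationAgent actually uses:

```python
from statistics import mean

def aggregate(persona_scores):
    """persona_scores: list of {dimension: score} dicts, one per persona.
    Returns the mean score per dimension across personas."""
    dims = persona_scores[0].keys()
    return {d: mean(p[d] for p in persona_scores) for d in dims}

# Two hypothetical reviewer personas scoring the five core dimensions.
scores = [
    {"Clarity": 4, "Novelty": 3, "Validity": 4, "Feasibility": 5, "Significance": 3},
    {"Clarity": 5, "Novelty": 4, "Validity": 3, "Feasibility": 4, "Significance": 4},
]
print(aggregate(scores))
```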
Examples
Example 1: Evaluate from PDF URL
import asyncio
from pathlib import Path

from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline

async def evaluate_paper():
    pipeline = SingleIdeaPipeline(
        input_type="pdf",
        pdf_url="https://openreview.net/pdf?id=YOUR_PAPER_ID",
        cache_path=Path("cache/my_paper.json"),
        persona_path=Path("cache/reviewer_personas.json"),
        research_params={
            "title": "Your Paper Title",
            "after": "2022-01-01",
            "before": "2024-01-01",
            "depth": 3,
        },
        num_personas=5,
        get_future_paper=True,
    )
    result = await pipeline.run()
    print(result["final_report"])

asyncio.run(evaluate_paper())
Example 2: Evaluate from Text
import asyncio
from pathlib import Path

from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline

async def evaluate_idea():
    idea_text = """
    This paper introduces a novel approach to automated code review
    using large language models with retrieval-augmented generation...
    """
    pipeline = SingleIdeaPipeline(
        input_type="text",
        idea_text=idea_text,
        cache_path=Path("cache/my_idea.json"),
        research_params={
            "title": "LLM-based Code Review",
            "after": "2023-01-01",
            "before": "2024-12-01",
        },
        num_personas=3,
    )
    result = await pipeline.run()
    print(result["final_decision"])

asyncio.run(evaluate_idea())
Example 3: Custom Evaluation Metrics
# The evaluation agent supports custom metrics
eval_params = {
    "temperature": 0.7,
    "user_metric": [
        {
            "metric": "Reproducibility",
            "description": "Evaluate whether sufficient detail is provided for reproduction"
        },
        {
            "metric": "EthicalConsiderations",
            "description": "Assess potential ethical implications and mitigation strategies"
        }
    ]
}
Example 4: Batch Processing with Custom Dataset
# Create a JSONL file with format:
# {"paper_id": "xxx", "title": "...", "decision": "accept"}
# Then run:
# python3 -m innoeval.pipeline.batch_pipeline

# Or programmatically:
from pathlib import Path

from innoeval.pipeline.batch_pipeline import load_dataset, process_paper

items = load_dataset(Path("dataset/my_papers.jsonl"), num=10)
for item in items:
    print(f"Processing: {item.title}")
Configuration
LLM Configuration
The config/LLM.env file controls all API settings:
# Primary LLM (DeepSeek)
DS_API_KEY=your_deepseek_key
DS_API_BASE_URL=https://api.deepseek.com/v1

# OpenAI (alternative)
OPENAI_API_KEY=your_openai_key
OPENAI_API_BASE_URL=https://api.openai.com/v1

# Search APIs
GOOGLE_API_KEY=your_google_key
SERPER_API_KEY=your_serper_key
JINA_API_KEY=your_jina_key
S2_API_KEY=your_semantic_scholar_key

# GitHub
GH_TOKEN=your_github_token

# Kaggle (optional)
KAGGLE_CONFIG_DIR=./config
Model Configuration
The default model configuration in SingleIdeaPipeline:
model_config = {
    "models": {
        "default_provider": "dsr1",
        "dsr1": {
            "model_name": "deepseek-v3.2",
            "api_key": os.getenv("DS_API_KEY"),
            "base_url": os.getenv("DS_API_BASE_URL"),
            "max_tokens": 4096,
            "temperature": 0.7,
        },
    }
}
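Given a config dict in this shape, resolving the active provider's settings is a small two-step lookup. The `resolve_model` helper below is our own sketch of that lookup, not the actual logic in `model_factory.py`:

```python
def resolve_model(config):
    """Return the settings block for the configured default provider."""
    models = config["models"]
    provider = models["default_provider"]  # e.g. "dsr1"
    return models[provider]

# Minimal config in the shape shown above (api_key/base_url omitted).
model_config = {
    "models": {
        "default_provider": "dsr1",
        "dsr1": {"model_name": "deepseek-v3.2", "max_tokens": 4096, "temperature": 0.7},
    }
}
print(resolve_model(model_config)["model_name"])  # deepseek-v3.2
```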
Agent Parameters
| Agent | Key Parameters |
|---|---|
| ExtractionAgent | extract_temperature: 0.3 |
| ResearchAgent | top_k: 10, max_results_per_query: 5, web_max_results: 5, github_max_results: 5 |
| GroundingAgent | extract_temperature: 0.0 |
| EvaluationAgent | temperature: 0.7, num_personas: 5 |
| ReportAgent | temperature: 0.4 |
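Defaults like these are typically overridden per run. Since the parameter dicts above are flat, a shallow merge is sufficient; the helper below is a generic sketch, not InnoEval code:

```python
def with_overrides(defaults, overrides):
    """Shallow-merge: values in overrides win over defaults."""
    merged = dict(defaults)
    merged.update(overrides or {})
    return merged

# ResearchAgent defaults from the table above, with top_k overridden.
research_defaults = {"top_k": 10, "max_results_per_query": 5,
                     "web_max_results": 5, "github_max_results": 5}
print(with_overrides(research_defaults, {"top_k": 20}))
```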
Research Parameters
| Parameter | Type | Description |
|---|---|---|
| title | str | Paper title for search optimization |
| after | str | Search papers after this date (YYYY-MM-DD) |
| before | str | Search papers before this date (YYYY-MM-DD) |
| depth | int | Search depth (1-5) |
| web_temperature | float | Temperature for web search queries |
| code_temperature | float | Temperature for code search queries |
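Because `after`/`before` must be `YYYY-MM-DD` strings and `depth` is bounded to 1-5, a light validator can catch mistakes before an expensive search run. The function below is an illustrative helper, not part of the package:

```python
from datetime import datetime

def validate_research_params(params):
    """Raise ValueError on malformed dates or an out-of-range depth."""
    for key in ("after", "before"):
        if key in params:
            # strptime raises ValueError if the format is not YYYY-MM-DD
            datetime.strptime(params[key], "%Y-%m-%d")
    depth = params.get("depth", 1)
    if not 1 <= depth <= 5:
        raise ValueError(f"depth must be in 1-5, got {depth}")
    return True

print(validate_research_params({"after": "2022-01-01", "before": "2024-01-01", "depth": 3}))  # True
```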
Cache Structure
Pipeline results are cached in JSON format:
{
"extraction_result": {...},
"search_results_dict": {...},
"reports_data": {...},
"grounding_result": {...},
"evaluation_result": {...},
"final_report": "...",
"final_decision": "accept/reject",
"total_time": 123.45,
"total_token": 50000
}

Acknowledgement
This project builds upon and draws inspiration from the following open-source projects:
InternAgent
We thank the InternAgent project for providing foundational multi-agent architecture patterns and evaluation methodologies that influenced our pipeline design.
RepoMaster
We thank RepoMaster for the repository analysis toolkit that enables comprehensive code repository evaluation in our grounding process.
Citation
If you find our work helpful, please use the following citation:
@misc{qiao2026innoevalresearchideaevaluation,
title={InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem},
author={Shuofei Qiao and Yunxiang Wei and Xuehai Wang and Bin Wu and Boyang Xue and Ningyu Zhang and Hossein A. Rahmani and Yanshan Wang and Qiang Zhang and Keyan Ding and Jeff Z. Pan and Huajun Chen and Emine Yilmaz},
year={2026},
eprint={2602.14367},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.14367},
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
