GitHub - DpEpsilon/looplab: AI agent evals framework with durable execution, designed with AI Control research and rapid iteration in mind

looplab

New AI evaluation framework in development!

looplab is not yet ready for agentic AI evals or AI control evals, which are its primary use case. It is ready for non-agentic evals, including more complex ones involving multiple prompts or multiple models, such as debate evals (there is an example in the repo). The framework is not mature, however: the API is not stable, and most of the common eval structures supported by frameworks like AISI's Inspect are not available. Only the OpenAI and Anthropic completion APIs (including their batch APIs) are supported at this point.

Features:

  • Dataflow-like declarative program structure: high-level evaluation structure is easy to define tersely
  • Durable execution: Programs can be terminated and restarted from where you left off, even on a different computer!
  • Plays well with batch inference APIs: tasks are batched transparently, and you don't have to keep the program running with an internet connection while you wait for batches to complete.
  • For control evals, greater ability to define and manage the execution environment, e.g. forking Docker container filesystem state for agentic monitoring evals (DP's main research area), running Shade Arena evals without needing an external filesystem, etc. (work in progress)
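
To make the durable execution idea concrete, here is a minimal sketch of the general technique in plain Python. It is not looplab's API or implementation: `DurableStore`, `fake_model_call`, and the JSON results file are hypothetical names invented for illustration. The sketch persists each completed task's result under a stable task ID, so a restarted program skips work that already finished and only runs what is left.

```python
import json
import os
import tempfile
from typing import Callable


class DurableStore:
    """Hypothetical on-disk record of completed task results, keyed by task ID."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, str] = {}
        # On startup, reload whatever a previous run managed to finish.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.results = json.load(f)

    def run(self, task_id: str, fn: Callable[[], str]) -> str:
        """Return the saved result for task_id, or compute and persist it."""
        if task_id in self.results:
            return self.results[task_id]
        value = fn()
        self.results[task_id] = value
        self._flush()
        return value

    def _flush(self) -> None:
        # Write to a temp file, then rename, so a crash cannot leave a
        # half-written results file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)


def fake_model_call(prompt: str, calls: list[str]) -> str:
    """Stand-in for a model API call; records that it actually ran."""
    calls.append(prompt)
    return prompt.upper()


def demo() -> tuple[list[str], list[str], str]:
    """Run one task, simulate a restart, then replay it plus a new task."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "results.json")

        first_calls: list[str] = []
        store = DurableStore(path)
        store.run("q1", lambda: fake_model_call("a", first_calls))

        # Simulated restart: a fresh store object reading the same file.
        second_calls: list[str] = []
        store = DurableStore(path)
        replayed = store.run("q1", lambda: fake_model_call("a", second_calls))
        store.run("q2", lambda: fake_model_call("b", second_calls))
        return first_calls, second_calls, replayed
```

In `demo()`, the second store never calls the model for `q1` because its result is already on disk; only `q2` runs. Because the state lives in an ordinary file, copying that file to another machine would let the run continue there, which is the property the feature list describes. A real framework would also need to handle in-flight batch jobs and non-string results, which this sketch leaves out.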