GitHub - BinaryMuse/athanor: ⚗️ Experiment harness on the BEAM; written for AI research, flexible enough for any systematic experimentation

Athanor is an experiment harness, designed for (but not limited to) AI research, built as an Elixir/Phoenix umbrella application. It provides a framework for defining, configuring, executing, and monitoring experiments.

Features

Code-Defined Experiments - Define experiments as versioned Elixir modules
Real-Time Web UI - Monitor running experiments with live logs, results, and progress
Supervised Execution - Each run executes in isolation with graceful cancellation
MCP Server - Programmatic access via Model Context Protocol for AI agents
Flexible Results - Store arbitrary structured data for later analysis

Project Structure

The core umbrella contains two applications:

athanor - Core business logic and runtime system
athanor_web - Phoenix web interface with LiveView for real-time experiment management

Setup

# Install dependencies
mix setup

# Set up the database; see `config/dev.exs` for credentials
mix ecto.setup

# Start the server
iex -S mix phx.server

The web interface runs at http://localhost:4000 by default. To use a different port, set the PORT environment variable when starting the app.
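As a rough illustration of how a PORT override is usually resolved, the sketch below mirrors the convention that generated Phoenix apps follow in config/runtime.exs. It assumes Athanor follows that convention; the module name PortSketch is hypothetical and the repository's own config files remain authoritative.

```elixir
defmodule PortSketch do
  # Hypothetical helper, not part of Athanor: turn an environment-style
  # value into a port number, falling back to 4000 when it is unset.
  def resolve(nil), do: 4000
  def resolve(value) when is_binary(value), do: String.to_integer(value)

  # Read the PORT variable from the real environment.
  def from_env, do: resolve(System.get_env("PORT"))
end

IO.inspect(PortSketch.resolve("4001"), label: "port")
```

Under that assumption, starting the app with `PORT=4001 iex -S mix phx.server` would serve the UI on port 4001.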

Core Concepts

Code-Defined Experiments

Experiments are Elixir modules that use Athanor.Experiment. Each experiment defines its configuration schema and execution logic in code, making experiments versioned and reproducible.

Instance + Run Separation

Instance: A configured experiment with a name, description, and configuration values
Run: A single execution of an instance

This separation allows the same configuration to be executed multiple times for reproducibility.
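The relationship can be pictured with two plain structs. This is only a sketch: the struct and field names below are hypothetical and chosen to echo the description above, not taken from Athanor's schemas.

```elixir
defmodule SeparationSketch do
  defmodule Instance do
    # Hypothetical shape: a named, configured experiment.
    defstruct [:id, :name, :description, configuration: %{}]
  end

  defmodule Run do
    # Hypothetical shape: one execution that points back at its instance.
    defstruct [:id, :instance_id, status: :pending]
  end

  # Build `count` runs that all refer to the same instance,
  # and therefore to the same configuration.
  def runs_for(%Instance{id: instance_id}, count) do
    for n <- 1..count//1, do: %Run{id: n, instance_id: instance_id}
  end
end

instance = %SeparationSketch.Instance{
  id: 1,
  name: "baseline",
  configuration: %{"iterations" => 10}
}

runs = SeparationSketch.runs_for(instance, 3)
IO.inspect(Enum.map(runs, & &1.instance_id), label: "instance ids")
```

Every run carries only a reference to its instance, so repeating an experiment never requires copying or re-entering configuration.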

Supervised Execution

Each run executes in its own GenServer under a DynamicSupervisor, isolating failures and enabling cancellation.
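The isolation this buys can be seen with nothing but the standard library. The sketch below is not Athanor's implementation; SupervisionSketch.Worker is a hypothetical stand-in for a run process, used to show that one child crashing leaves its siblings alone and that a single child can be stopped on request.

```elixir
defmodule SupervisionSketch.Worker do
  # Hypothetical stand-in for a run process. :temporary means the
  # supervisor does not restart it after it exits.
  use GenServer, restart: :temporary

  def start_link(run_id), do: GenServer.start_link(__MODULE__, run_id)

  @impl true
  def init(run_id), do: {:ok, run_id}

  @impl true
  def handle_call(:run_id, _from, run_id), do: {:reply, run_id, run_id}
end

# One supervisor, one child process per run.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, run_a} = DynamicSupervisor.start_child(sup, {SupervisionSketch.Worker, "run-a"})
{:ok, run_b} = DynamicSupervisor.start_child(sup, {SupervisionSketch.Worker, "run-b"})

# Simulate a crash in run A and wait until it is really gone.
ref = Process.monitor(run_a)
Process.exit(run_a, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

# Run B is unaffected by the failure of run A.
still_running = GenServer.call(run_b, :run_id)
IO.inspect(still_running, label: "still running")

# Stopping a single run is a request to the supervisor for that one child.
:ok = DynamicSupervisor.terminate_child(sup, run_b)
```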

Creating an Experiment

Create a module that uses Athanor.Experiment:

defmodule MyExperiment do
  use Athanor.Experiment

  alias Athanor.Experiment

  @impl true
  def experiment do
    Experiment.Definition.new()
    |> Experiment.Definition.name("my_experiment")
    |> Experiment.Definition.description("Tests something interesting")
    |> Experiment.Definition.configuration(config())
  end

  defp config do
    Experiment.ConfigSchema.new()
    |> Experiment.ConfigSchema.field(:iterations, :integer,
      default: 10,
      min: 1,
      max: 100,
      label: "Iterations",
      description: "Number of test iterations"
    )
    |> Experiment.ConfigSchema.field(:model, :string,
      default: "gpt-4",
      label: "Model",
      required: true
    )
  end

  @impl true
  def run(ctx) do
    config = Athanor.Runtime.config(ctx)
    total = config["iterations"]

    Athanor.Runtime.log(ctx, :info, "Starting experiment with #{total} iterations")
    Athanor.Runtime.progress(ctx, 0, total)

    for i <- 1..total do
      # Check for cancellation
      if Athanor.Runtime.cancelled?(ctx), do: throw(:cancelled)

      # Do work...
      result = perform_iteration(config, i)

      # Record result and update progress
      Athanor.Runtime.result(ctx, "iteration_#{i}", result)
      Athanor.Runtime.progress(ctx, i, total)
    end

    Athanor.Runtime.complete(ctx)
  catch
    :cancelled -> {:error, "Cancelled by user"}
  end

  defp perform_iteration(config, i) do
    # Your experiment logic here
    %{iteration: i, model: config["model"], output: "..."}
  end
end

Experiments are auto-discovered by the system at runtime.

Runtime API

The Athanor.Runtime module provides the interface for experiments to interact with the harness during execution:

Configuration

# Get the instance configuration as a map
config = Athanor.Runtime.config(ctx)

Logging

# Log messages at different levels
Athanor.Runtime.log(ctx, :info, "Processing item")
Athanor.Runtime.log(ctx, :warn, "Retrying request", %{attempt: 2})
Athanor.Runtime.log(ctx, :error, "Failed to connect")

# Batch multiple log entries
Athanor.Runtime.log_batch(ctx, [
  {:info, "Step 1 complete", nil},
  {:info, "Step 2 complete", nil}
])

Results

Results are persisted to the database and displayed in the web UI:

# Store a result with a key and value
Athanor.Runtime.result(ctx, "model_response", %{
  input: prompt,
  output: response,
  tokens: token_count
})

Progress

Progress updates are broadcast to the web UI in real time:

# Update progress (current, total, optional message)
Athanor.Runtime.progress(ctx, 5, 100)
Athanor.Runtime.progress(ctx, 50, 100, "Halfway done")

Completion

# Mark the run as successfully completed
Athanor.Runtime.complete(ctx)

# Mark the run as failed with an error message
Athanor.Runtime.fail(ctx, "API rate limit exceeded")

Cancellation

# Check if the user has requested cancellation
if Athanor.Runtime.cancelled?(ctx) do
  # Clean up and exit
end
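Cancellation here is cooperative: the experiment has to poll for it between units of work. The example experiment earlier uses throw/catch for this; as an alternative, the sketch below stops the loop with Enum.reduce_while and keeps the partial results. CancellationSketch and its function arguments are hypothetical; the zero-arity check function stands in for a call such as Athanor.Runtime.cancelled?(ctx).

```elixir
defmodule CancellationSketch do
  # `cancelled_fun` is a zero-arity function standing in for a real
  # cancellation check; `work_fun` performs one iteration.
  def run(total, cancelled_fun, work_fun) do
    outcome =
      Enum.reduce_while(1..total//1, {:ok, []}, fn i, {:ok, acc} ->
        if cancelled_fun.() do
          # Stop early and report what was finished so far.
          {:halt, {:cancelled, acc}}
        else
          {:cont, {:ok, [work_fun.(i) | acc]}}
        end
      end)

    {status, acc} = outcome
    {status, Enum.reverse(acc)}
  end
end

# Simulate a user cancelling while iteration 3 is in flight.
{:ok, flag} = Agent.start_link(fn -> false end)

work = fn i ->
  if i == 3, do: Agent.update(flag, fn _ -> true end)
  i * i
end

result = CancellationSketch.run(10, fn -> Agent.get(flag, & &1) end, work)
IO.inspect(result, label: "result")
```

The check runs before each iteration, so an in-flight iteration always finishes; how often to poll is a trade-off between responsiveness and overhead.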

Executing Experiments

Via Web UI

Navigate to /experiments
Click "New" to create an instance
Select an experiment module and configure it
Click "Run" to execute
Watch logs, results, and progress update in real time

Programmatically

# Start a run for an existing instance
{:ok, run} = Athanor.Runtime.start_run(instance)

# Cancel a running experiment
Athanor.Runtime.cancel_run(run)

MCP Server

Athanor includes a Model Context Protocol (MCP) server that allows AI agents to programmatically manage experiments, runs, logs, and results. The server exposes 15 tools for complete experiment lifecycle management.

Endpoint: http://localhost:4000/mcp

Available Operations:

List, create, and update experiments
Discover available experiment modules and their schemas
Start, monitor, and cancel runs
Query logs and results

For detailed documentation on all available tools and usage examples, see docs/MCP_SERVER.md.

Quick Example

# Connect an MCP client to the server
# The client can then call tools like:
# - list_experiments
# - create_experiment
# - start_run
# - get_run_logs
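For readers unfamiliar with MCP, tool calls travel as JSON-RPC 2.0 messages whose method is "tools/list" or "tools/call". The sketch below only builds those envelopes as Elixir maps. McpRequestSketch is a hypothetical helper, and passing an empty argument map to list_experiments is an assumption; the real argument shapes are in docs/MCP_SERVER.md. A real session also begins with an initialize handshake, and encoding and transport are not shown.

```elixir
defmodule McpRequestSketch do
  # Build a JSON-RPC 2.0 envelope, the message format MCP is built on.
  def request(id, method, params \\ %{}) do
    %{"jsonrpc" => "2.0", "id" => id, "method" => method, "params" => params}
  end

  # Ask the server which tools it exposes.
  def list_tools(id), do: request(id, "tools/list")

  # Invoke one tool by name. The argument map is tool-specific;
  # an empty map here is an assumption for illustration.
  def call_tool(id, name, arguments \\ %{}) do
    request(id, "tools/call", %{"name" => name, "arguments" => arguments})
  end
end

IO.inspect(McpRequestSketch.list_tools(1), label: "list")
IO.inspect(McpRequestSketch.call_tool(2, "list_experiments"), label: "call")
```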

Analyzing Results

Results are stored as key/value pairs in the run_results table. Each result has:

run_id - The run it belongs to
key - A string identifier (e.g., "iteration_1", "model_response")
value - A JSONB column containing arbitrary data

This structure makes results easy to query and analyze outside of Athanor.
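Because keys are free-form strings, a common first step in analysis is grouping related keys together. The sketch below works on plain maps that mirror the three columns listed above; ResultsSketch and the sample rows are hypothetical, and the string keys inside each value reflect how decoded JSON typically arrives in Elixir.

```elixir
defmodule ResultsSketch do
  # Group values by key prefix, so "iteration_1" and "iteration_2"
  # both land under "iteration".
  def group_by_prefix(rows) do
    Enum.group_by(rows, &prefix(&1.key), & &1.value)
  end

  # Strip a trailing "_<digits>" suffix if the key has one.
  defp prefix(key), do: String.replace(key, ~r/_\d+$/, "")
end

# Hypothetical rows shaped like the run_results columns.
rows = [
  %{run_id: 1, key: "iteration_1", value: %{"output" => "a"}},
  %{run_id: 1, key: "iteration_2", value: %{"output" => "b"}},
  %{run_id: 1, key: "model_response", value: %{"tokens" => 12}}
]

grouped = ResultsSketch.group_by_prefix(rows)
IO.inspect(grouped, label: "grouped")
```

Choosing a consistent key scheme up front, such as a fixed prefix plus an index, keeps this kind of grouping trivial later.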

Querying Results

# Get all results for a run
Athanor.Experiments.list_results(run_id)

# Query directly with Ecto
import Ecto.Query

Athanor.Repo.all(
  from r in Athanor.Experiments.Result,
  where: r.run_id == ^run_id,
  where: r.key == "model_response"
)

Jupyter Notebook Analysis

Results can be loaded directly into a notebook for analysis, either Jupyter (Python) or Livebook (Elixir). With Jupyter:

import psycopg2
import pandas as pd

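# Adjust the connection string to match the credentials in config/dev.exs
conn = psycopg2.connect("postgresql://localhost/athanor_dev")

# Load results for a specific run; set run_id to that run's ID first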
df = pd.read_sql("""
    SELECT key, value, inserted_at
    FROM run_results
    WHERE run_id = %s
    ORDER BY inserted_at
""", conn, params=[run_id])

# The 'value' column contains JSON - expand it
df = pd.concat([df, pd.json_normalize(df['value'])], axis=1)

Or with Livebook (Elixir):

# In a Livebook connected to your Athanor node
results = Athanor.Experiments.list_results(run_id)

# Convert to a table for analysis (assumes every value is a JSON object, i.e. a map)
results
|> Enum.map(fn r -> Map.merge(%{key: r.key}, r.value) end)
|> Kino.DataTable.new()

Example: SubstrateShift

The substrate_shift app contains a complete example experiment that tests whether LLMs can detect when they're running on a different underlying model.

Configuration options:

runs_per_pair - Number of test runs per model pair
parallelism - Concurrent pairs to test
model_pairs - List of model pairs to compare

See apps/substrate_shift/lib/substrate_shift.ex for the full implementation.

Development

# Run tests
mix test

# Format code and run checks
mix precommit

# Start interactive shell with server
iex -S mix phx.server

Data Model

experiment_instances - Configured experiments with name, description, and configuration
experiment_runs - Execution records with status, timing, and error info
run_results - Key-value results from each run
run_logs - Log entries with level, message, and metadata