GitHub - BinaryMuse/athanor: ⚗️ Experiment harness on the BEAM; written for AI research, flexible enough for any systematic experimentation

Athanor

Athanor is an experiment harness, designed for (but not limited to) AI research, built as an Elixir/Phoenix umbrella application. It provides a framework for defining, configuring, executing, and monitoring experiments.

Features

  • Code-Defined Experiments - Define experiments as versioned Elixir modules
  • Real-Time Web UI - Monitor running experiments with live logs, results, and progress
  • Supervised Execution - Each run executes in isolation with graceful cancellation
  • MCP Server - Programmatic access via Model Context Protocol for AI agents
  • Flexible Results - Store arbitrary structured data for later analysis

Project Structure

The core umbrella contains two applications:

  • athanor - Core business logic and runtime system
  • athanor_web - Phoenix web interface with LiveView for real-time experiment management

Setup

# Install dependencies
mix setup

# Set up the database; see `config/dev.exs` for credentials
mix ecto.setup

# Start the server
iex -S mix phx.server

The web interface runs at http://localhost:4000 by default. To use a different port, set the PORT environment variable when starting the app.

Core Concepts

Code-Defined Experiments

Experiments are Elixir modules that use Athanor.Experiment. Each experiment defines its configuration schema and execution logic in code, making experiments versioned and reproducible.

Instance + Run Separation

  • Instance: A configured experiment with a name, description, and configuration values
  • Run: A single execution of an instance

This separation allows the same configuration to be executed multiple times for reproducibility.

Supervised Execution

Each run executes in its own GenServer under a DynamicSupervisor, isolating failures and enabling cancellation.
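
The isolation described above can be illustrated with plain OTP. This is a minimal sketch, not Athanor's actual implementation: the `RunSketch.Worker` module, its state shape, and the `:cancel` / `:cancelled?` messages are hypothetical names chosen for illustration. It only shows how one GenServer per run under a DynamicSupervisor keeps cancellation flags and crashes local to a single run.

```elixir
defmodule RunSketch.Worker do
  # Hypothetical stand-in for a run process; NOT Athanor's real module.
  # restart: :temporary means a crashed run is not restarted automatically.
  use GenServer, restart: :temporary

  def start_link(run_id), do: GenServer.start_link(__MODULE__, run_id)

  @impl true
  def init(run_id), do: {:ok, %{run_id: run_id, cancelled: false}}

  @impl true
  def handle_call(:cancel, _from, state), do: {:reply, :ok, %{state | cancelled: true}}
  def handle_call(:cancelled?, _from, state), do: {:reply, state.cancelled, state}
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# One process per run, started on demand.
{:ok, run_a} = DynamicSupervisor.start_child(sup, {RunSketch.Worker, "run-a"})
{:ok, run_b} = DynamicSupervisor.start_child(sup, {RunSketch.Worker, "run-b"})

# Cancelling one run flips only that process's flag.
:ok = GenServer.call(run_a, :cancel)
a_cancelled = GenServer.call(run_a, :cancelled?)
b_cancelled = GenServer.call(run_b, :cancelled?)

# Killing one run does not take down its sibling.
ref = Process.monitor(run_a)
Process.exit(run_a, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

sibling_alive = Process.alive?(run_b)
```

Here `a_cancelled` is true while `b_cancelled` stays false, and `run_b` is still alive after `run_a` has been killed.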

Creating an Experiment

Create a module that uses Athanor.Experiment:

defmodule MyExperiment do
  use Athanor.Experiment

  alias Athanor.Experiment

  @impl true
  def experiment do
    Experiment.Definition.new()
    |> Experiment.Definition.name("my_experiment")
    |> Experiment.Definition.description("Tests something interesting")
    |> Experiment.Definition.configuration(config())
  end

  defp config do
    Experiment.ConfigSchema.new()
    |> Experiment.ConfigSchema.field(:iterations, :integer,
      default: 10,
      min: 1,
      max: 100,
      label: "Iterations",
      description: "Number of test iterations"
    )
    |> Experiment.ConfigSchema.field(:model, :string,
      default: "gpt-4",
      label: "Model",
      required: true
    )
  end

  @impl true
  def run(ctx) do
    config = Athanor.Runtime.config(ctx)
    total = config["iterations"]

    Athanor.Runtime.log(ctx, :info, "Starting experiment with #{total} iterations")
    Athanor.Runtime.progress(ctx, 0, total)

    for i <- 1..total do
      # Check for cancellation
      if Athanor.Runtime.cancelled?(ctx), do: throw(:cancelled)

      # Do work...
      result = perform_iteration(config, i)

      # Record result and update progress
      Athanor.Runtime.result(ctx, "iteration_#{i}", result)
      Athanor.Runtime.progress(ctx, i, total)
    end

    Athanor.Runtime.complete(ctx)
  catch
    :cancelled -> {:error, "Cancelled by user"}
  end

  defp perform_iteration(config, i) do
    # Your experiment logic here
    %{iteration: i, model: config["model"], output: "..."}
  end
end

Experiments are auto-discovered by the system at runtime.
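
To make the field options in the example above more concrete (default, min, max, required), here is a rough sketch of how such constraints could be applied to user-supplied values. It is an assumption about the general idea only: `SchemaSketch`, `resolve/2`, the plain-map field representation, and the error atoms are all hypothetical and do not reflect how `Experiment.ConfigSchema` is actually implemented.

```elixir
defmodule SchemaSketch do
  # Hypothetical helper: fields are plain maps here, NOT Athanor's ConfigSchema.
  # Returns {:ok, config} with defaults filled in, or {:error, {field_name, reason}}.
  def resolve(fields, input) do
    Enum.reduce_while(fields, {:ok, %{}}, fn field, {:ok, acc} ->
      # Fall back to the field's default when the caller gave no value.
      value = Map.get(input, field.name, field[:default])

      case check(field, value) do
        :ok -> {:cont, {:ok, Map.put(acc, field.name, value)}}
        {:error, reason} -> {:halt, {:error, {field.name, reason}}}
      end
    end)
  end

  defp check(%{required: true}, nil), do: {:error, :required}
  defp check(_field, nil), do: :ok
  defp check(%{min: min}, value) when is_integer(value) and value < min, do: {:error, :below_min}
  defp check(%{max: max}, value) when is_integer(value) and value > max, do: {:error, :above_max}
  defp check(_field, _value), do: :ok
end

# Field definitions mirroring the options used in the example experiment.
fields = [
  %{name: "iterations", default: 10, min: 1, max: 100},
  %{name: "model", default: "gpt-4", required: true}
]

valid = SchemaSketch.resolve(fields, %{"iterations" => 25})
invalid = SchemaSketch.resolve(fields, %{"iterations" => 500})
```

In this sketch `valid` carries the supplied iteration count plus the default model, while `invalid` halts at the first field that violates a constraint.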

Runtime API

The Athanor.Runtime module provides the interface for experiments to interact with the harness during execution:

Configuration

# Get the instance configuration as a map
config = Athanor.Runtime.config(ctx)

Logging

# Log messages at different levels
Athanor.Runtime.log(ctx, :info, "Processing item")
Athanor.Runtime.log(ctx, :warn, "Retrying request", %{attempt: 2})
Athanor.Runtime.log(ctx, :error, "Failed to connect")

# Batch multiple log entries
Athanor.Runtime.log_batch(ctx, [
  {:info, "Step 1 complete", nil},
  {:info, "Step 2 complete", nil}
])

Results

Results are persisted to the database and displayed in the web UI:

# Store a result with a key and value
Athanor.Runtime.result(ctx, "model_response", %{
  input: prompt,
  output: response,
  tokens: token_count
})

Progress

Progress updates are broadcast to the web UI in real-time:

# Update progress (current, total, optional message)
Athanor.Runtime.progress(ctx, 5, 100)
Athanor.Runtime.progress(ctx, 50, 100, "Halfway done")

Completion

# Mark the run as successfully completed
Athanor.Runtime.complete(ctx)

# Mark the run as failed with an error message
Athanor.Runtime.fail(ctx, "API rate limit exceeded")

Cancellation

# Check if the user has requested cancellation
if Athanor.Runtime.cancelled?(ctx) do
  # Clean up and exit
end
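
As an alternative to throwing from inside a loop, a cancellation check can be combined with `Enum.reduce_while/3` so the loop stops cleanly and keeps the work done so far. The sketch below is self-contained: `CancelSketch`, `cancelled_fun`, and `work_fun` are hypothetical names, with `cancelled_fun` standing in for a call such as `Athanor.Runtime.cancelled?(ctx)`.

```elixir
defmodule CancelSketch do
  # cancelled_fun: zero-arity function standing in for a cancellation check.
  # work_fun: one-arity function standing in for a single iteration of work.
  def run(total, cancelled_fun, work_fun) do
    result =
      Enum.reduce_while(1..total, {:ok, []}, fn i, {:ok, acc} ->
        if cancelled_fun.() do
          # Stop early, returning whatever was completed before cancellation.
          {:halt, {:cancelled, Enum.reverse(acc)}}
        else
          {:cont, {:ok, [work_fun.(i) | acc]}}
        end
      end)

    case result do
      {:ok, acc} -> {:ok, Enum.reverse(acc)}
      cancelled -> cancelled
    end
  end
end

# Simulate a user cancelling after three iterations have been allowed to run.
{:ok, counter} = Agent.start_link(fn -> 0 end)
cancel_after_three = fn -> Agent.get_and_update(counter, fn n -> {n >= 3, n + 1} end) end

partial = CancelSketch.run(10, cancel_after_three, fn i -> i * i end)
complete = CancelSketch.run(3, fn -> false end, fn i -> i * i end)
```

With these inputs `partial` is `{:cancelled, [1, 4, 9]}` and `complete` is `{:ok, [1, 4, 9]}`.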

Executing Experiments

Via Web UI

  1. Navigate to /experiments
  2. Click "New" to create an instance
  3. Select an experiment module and configure it
  4. Click "Run" to execute
  5. Watch logs, results, and progress update in real-time

Programmatically

# Start a run for an existing instance
{:ok, run} = Athanor.Runtime.start_run(instance)

# Cancel a running experiment
Athanor.Runtime.cancel_run(run)

MCP Server

Athanor includes a Model Context Protocol (MCP) server that allows AI agents to programmatically manage experiments, runs, logs, and results. The server exposes 15 tools for complete experiment lifecycle management.

Endpoint: http://localhost:4000/mcp

Available Operations:

  • List, create, and update experiments
  • Discover available experiment modules and their schemas
  • Start, monitor, and cancel runs
  • Query logs and results

For detailed documentation on all available tools and usage examples, see docs/MCP_SERVER.md.

Quick Example

# Connect an MCP client to the server
# The client can then call tools like:
# - list_experiments
# - create_experiment
# - start_run
# - get_run_logs
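
For orientation, MCP tool invocations are JSON-RPC 2.0 requests using the `tools/call` method. The sketch below only builds that request envelope as an Elixir map; `McpSketch` is a hypothetical helper, and the empty argument map for `list_experiments` is an assumption. The actual tool arguments, the transport, and the session handshake are described in docs/MCP_SERVER.md and the MCP specification.

```elixir
defmodule McpSketch do
  # Hypothetical helper that builds the JSON-RPC 2.0 envelope for an MCP tool call.
  def tool_call(id, tool_name, arguments \\ %{}) do
    %{
      "jsonrpc" => "2.0",
      "id" => id,
      "method" => "tools/call",
      "params" => %{"name" => tool_name, "arguments" => arguments}
    }
  end
end

# Assumes list_experiments takes no required arguments.
request = McpSketch.tool_call(1, "list_experiments")
```

An MCP client library would normally construct and send this payload for you; the map is shown only to make the shape of a tool call visible.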

Analyzing Results

Results are stored as key/value pairs in the run_results table. Each result has:

  • run_id - The run it belongs to
  • key - A string identifier (e.g., "iteration_1", "model_response")
  • value - A JSONB column containing arbitrary data

This structure makes results easy to query and analyze outside of Athanor.

Querying Results

# Get all results for a run
Athanor.Experiments.list_results(run_id)

# Query directly with Ecto
import Ecto.Query

Athanor.Repo.all(
  from r in Athanor.Experiments.Result,
  where: r.run_id == ^run_id,
  where: r.key == "model_response"
)
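
Once results are loaded, keys such as "iteration_1" and "iteration_10" do not sort numerically as plain strings. The following sketch shows one way to order them by their numeric suffix. It runs on hand-written sample rows (illustrative data only, not real experiment output), and `ResultSketch` is a hypothetical helper. It assumes each row exposes `key` and `value` fields, with values decoded from JSONB as string-keyed maps.

```elixir
defmodule ResultSketch do
  # Keeps rows whose key looks like "iteration_<n>" and sorts them by <n>.
  def sort_iterations(rows) do
    rows
    |> Enum.filter(&String.starts_with?(&1.key, "iteration_"))
    |> Enum.sort_by(fn row ->
      "iteration_" <> n = row.key
      String.to_integer(n)
    end)
  end
end

# Illustrative rows only; real rows would come from a query on run_results.
rows = [
  %{key: "iteration_10", value: %{"iteration" => 10}},
  %{key: "iteration_2", value: %{"iteration" => 2}},
  %{key: "model_response", value: %{"output" => "..."}},
  %{key: "iteration_1", value: %{"iteration" => 1}}
]

sorted_keys = rows |> ResultSketch.sort_iterations() |> Enum.map(& &1.key)
```

Here `sorted_keys` is `["iteration_1", "iteration_2", "iteration_10"]`, whereas a plain string sort would place "iteration_10" before "iteration_2".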

Jupyter Notebook Analysis

Results can be loaded directly into a notebook for analysis. For example, in a Jupyter notebook with Python:

import psycopg2
import pandas as pd

# Adjust the connection string to match the credentials in config/dev.exs
conn = psycopg2.connect("postgresql://localhost/athanor_dev")

# Set run_id to the ID of the run you want to analyze before running the query

# Load results for a specific run
df = pd.read_sql("""
    SELECT key, value, inserted_at
    FROM run_results
    WHERE run_id = %s
    ORDER BY inserted_at
""", conn, params=[run_id])

# The 'value' column contains JSON objects - expand them into columns
df = pd.concat([df, pd.json_normalize(df['value'])], axis=1)

Or with Livebook (Elixir):

# In a Livebook connected to your Athanor node
results = Athanor.Experiments.list_results(run_id)

# Convert to a table for analysis
# (assumes each value is a map; JSONB objects decode with string keys)
results
|> Enum.map(fn r -> Map.put(r.value, "key", r.key) end)
|> Kino.DataTable.new()

Example: SubstrateShift

The substrate_shift app contains a complete example experiment that tests whether LLMs can detect when they're running on a different underlying model.

Configuration options:

  • runs_per_pair - Number of test runs per model pair
  • parallelism - Concurrent pairs to test
  • model_pairs - List of model pairs to compare

See apps/substrate_shift/lib/substrate_shift.ex for the full implementation.

Development

# Run tests
mix test

# Format code and run checks
mix precommit

# Start interactive shell with server
iex -S mix phx.server

Data Model

  • experiment_instances - Configured experiments with name, description, and configuration
  • experiment_runs - Execution records with status, timing, and error info
  • run_results - Key-value results from each run
  • run_logs - Log entries with level, message, and metadata