Athanor is an experiment harness, designed for (but not limited to) AI research, built as an Elixir/Phoenix umbrella application. It provides a framework for defining, configuring, executing, and monitoring experiments.
Features
- Code-Defined Experiments - Define experiments as versioned Elixir modules
- Real-Time Web UI - Monitor running experiments with live logs, results, and progress
- Supervised Execution - Each run executes in isolation with graceful cancellation
- MCP Server - Programmatic access via Model Context Protocol for AI agents
- Flexible Results - Store arbitrary structured data for later analysis
Project Structure
The core umbrella contains two applications:
- athanor - Core business logic and runtime system
- athanor_web - Phoenix web interface with LiveView for real-time experiment management
Setup
    # Install dependencies
    mix setup

    # Set up the database; see `config/dev.exs` for credentials
    mix ecto.setup

    # Start the server
    iex -S mix phx.server

The web interface runs at http://localhost:4000 by default. To use a different port, set the PORT environment variable when starting the app.
Core Concepts
Code-Defined Experiments
Experiments are Elixir modules that use Athanor.Experiment. Each experiment defines its configuration schema and execution logic in code, making experiments versioned and reproducible.
Instance + Run Separation
- Instance: A configured experiment with a name, description, and configuration values
- Run: A single execution of an instance
This separation allows the same configuration to be executed multiple times for reproducibility.
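The relationship can be pictured with a small sketch. The struct names and fields below are hypothetical illustrations of the idea, not Athanor's actual schemas: one instance holds the configuration, and every run points back at it.

```elixir
defmodule Sketch.Instance do
  # Hypothetical shape: a named, configured experiment.
  defstruct [:id, :name, :description, config: %{}]
end

defmodule Sketch.Run do
  # Hypothetical shape: one execution of an instance.
  defstruct [:id, :instance_id, status: :pending]
end

defmodule Sketch.Runs do
  alias Sketch.{Instance, Run}

  # Build a new run that points back at the instance it executes.
  def new_run(%Instance{id: instance_id}, run_id) do
    %Run{id: run_id, instance_id: instance_id}
  end

  def demo do
    instance = %Instance{id: 1, name: "baseline", config: %{"iterations" => 10}}

    # Two runs share one configuration, so their results are comparable.
    [new_run(instance, 101), new_run(instance, 102)]
  end
end
```

Calling `Sketch.Runs.demo()` yields two runs with different ids and the same `instance_id`.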
Supervised Execution
Each run executes in its own GenServer under a DynamicSupervisor, isolating failures and enabling cancellation.
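To illustrate why this layout isolates failures, here is a minimal sketch using only OTP. `Sketch.RunWorker` is a hypothetical stand-in for a per-run process and is not Athanor's actual worker module; the sketch only shows that killing one child of a DynamicSupervisor leaves its siblings running.

```elixir
defmodule Sketch.RunWorker do
  # Hypothetical per-run process. `restart: :temporary` means a dead run
  # is not restarted by the supervisor.
  use GenServer, restart: :temporary

  def start_link(run_id), do: GenServer.start_link(__MODULE__, run_id)

  @impl true
  def init(run_id), do: {:ok, run_id}

  @impl true
  def handle_call(:run_id, _from, run_id), do: {:reply, run_id, run_id}
end

defmodule Sketch.Isolation do
  def demo do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    # One worker process per run.
    {:ok, run_a} = DynamicSupervisor.start_child(sup, {Sketch.RunWorker, "run-a"})
    {:ok, run_b} = DynamicSupervisor.start_child(sup, {Sketch.RunWorker, "run-b"})

    # Simulate a crash of run A and wait until it has exited.
    ref = Process.monitor(run_a)
    Process.exit(run_a, :kill)

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} -> :ok
    end

    # Run B still answers calls after run A has died.
    {Process.alive?(run_a), GenServer.call(run_b, :run_id)}
  end
end
```

`Sketch.Isolation.demo()` returns `{false, "run-b"}`: the first run is gone, the second is unaffected.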
Creating an Experiment
Create a module that uses Athanor.Experiment:
    defmodule MyExperiment do
      use Athanor.Experiment

      alias Athanor.Experiment

      @impl true
      def experiment do
        Experiment.Definition.new()
        |> Experiment.Definition.name("my_experiment")
        |> Experiment.Definition.description("Tests something interesting")
        |> Experiment.Definition.configuration(config())
      end

      defp config do
        Experiment.ConfigSchema.new()
        |> Experiment.ConfigSchema.field(:iterations, :integer,
          default: 10,
          min: 1,
          max: 100,
          label: "Iterations",
          description: "Number of test iterations"
        )
        |> Experiment.ConfigSchema.field(:model, :string,
          default: "gpt-4",
          label: "Model",
          required: true
        )
      end

      @impl true
      def run(ctx) do
        config = Athanor.Runtime.config(ctx)
        total = config["iterations"]

        Athanor.Runtime.log(ctx, :info, "Starting experiment with #{total} iterations")
        Athanor.Runtime.progress(ctx, 0, total)

        for i <- 1..total do
          # Check for cancellation
          if Athanor.Runtime.cancelled?(ctx), do: throw(:cancelled)

          # Do work...
          result = perform_iteration(config, i)

          # Record result and update progress
          Athanor.Runtime.result(ctx, "iteration_#{i}", result)
          Athanor.Runtime.progress(ctx, i, total)
        end

        Athanor.Runtime.complete(ctx)
      catch
        :cancelled -> {:error, "Cancelled by user"}
      end

      defp perform_iteration(config, i) do
        # Your experiment logic here
        %{iteration: i, model: config["model"], output: "..."}
      end
    end
Experiments are auto-discovered by the system at runtime.
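The schema options used in the example (default, min, max, required) suggest how a submitted configuration might be checked before a run starts. The sketch below is a standalone illustration of that idea, written without Athanor; `Sketch.ConfigCheck` and its field list are hypothetical, and the real validation in Athanor.Experiment.ConfigSchema may behave differently.

```elixir
defmodule Sketch.ConfigCheck do
  # Hypothetical field specs mirroring the options used in the schema above.
  @fields [
    {"iterations", :integer, [default: 10, min: 1, max: 100]},
    {"model", :string, [default: "gpt-4", required: true]}
  ]

  # Fill in defaults, then collect one error message per invalid field.
  def validate(input) do
    config =
      Map.new(@fields, fn {name, _type, opts} ->
        {name, Map.get(input, name, opts[:default])}
      end)

    case Enum.flat_map(@fields, &check(&1, config)) do
      [] -> {:ok, config}
      errors -> {:error, errors}
    end
  end

  defp check({name, type, opts}, config) do
    value = config[name]

    cond do
      is_nil(value) and opts[:required] -> ["#{name} is required"]
      is_nil(value) -> []
      not type_ok?(type, value) -> ["#{name} must be of type #{type}"]
      opts[:min] && value < opts[:min] -> ["#{name} must be >= #{opts[:min]}"]
      opts[:max] && value > opts[:max] -> ["#{name} must be <= #{opts[:max]}"]
      true -> []
    end
  end

  defp type_ok?(:integer, v), do: is_integer(v)
  defp type_ok?(:string, v), do: is_binary(v)
end
```

With this sketch, `Sketch.ConfigCheck.validate(%{})` returns `{:ok, %{"iterations" => 10, "model" => "gpt-4"}}`, while `Sketch.ConfigCheck.validate(%{"iterations" => 500})` returns `{:error, ["iterations must be <= 100"]}`.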
Runtime API
The Athanor.Runtime module provides the interface for experiments to interact with the harness during execution:
Configuration
    # Get the instance configuration as a map
    config = Athanor.Runtime.config(ctx)
Logging
    # Log messages at different levels
    Athanor.Runtime.log(ctx, :info, "Processing item")
    Athanor.Runtime.log(ctx, :warn, "Retrying request", %{attempt: 2})
    Athanor.Runtime.log(ctx, :error, "Failed to connect")

    # Batch multiple log entries
    Athanor.Runtime.log_batch(ctx, [
      {:info, "Step 1 complete", nil},
      {:info, "Step 2 complete", nil}
    ])
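When an experiment produces many log lines in a tight loop, it can help to build the entries first and send them in chunks. The following is a rough sketch of that pattern; `Sketch.LogBuffer` is a hypothetical helper, and the `flush` callback stands in for a call to log_batch so that the example runs without Athanor.

```elixir
defmodule Sketch.LogBuffer do
  # Turn items into {level, message, metadata} tuples (the shape passed to
  # log_batch above) and hand them to `flush` in chunks of `size`.
  def flush_in_batches(items, size, flush) do
    items
    |> Enum.map(fn item -> {:info, "Processed #{item}", nil} end)
    |> Enum.chunk_every(size)
    |> Enum.each(flush)
  end
end

# Inside an experiment, `flush` would presumably be something like
# fn batch -> Athanor.Runtime.log_batch(ctx, batch) end
Sketch.LogBuffer.flush_in_batches(1..5, 2, &IO.inspect/1)
```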
Results
Results are persisted to the database and displayed in the web UI:
    # Store a result with a key and value
    Athanor.Runtime.result(ctx, "model_response", %{
      input: prompt,
      output: response,
      tokens: token_count
    })
Progress
Progress updates are broadcast to the web UI in real time:

    # Update progress (current, total, optional message)
    Athanor.Runtime.progress(ctx, 5, 100)
    Athanor.Runtime.progress(ctx, 50, 100, "Halfway done")
Completion
    # Mark the run as successfully completed
    Athanor.Runtime.complete(ctx)

    # Mark the run as failed with an error message
    Athanor.Runtime.fail(ctx, "API rate limit exceeded")
Cancellation
    # Check if the user has requested cancellation
    if Athanor.Runtime.cancelled?(ctx) do
      # Clean up and exit
    end
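Cancellation here is cooperative: the experiment has to poll the flag between units of work and stop on its own. A minimal standalone sketch of that loop follows. `Sketch.Cancellable` is hypothetical, and `cancelled_fun` stands in for a call to `Athanor.Runtime.cancelled?(ctx)` so the example runs without the harness.

```elixir
defmodule Sketch.Cancellable do
  # `cancelled_fun` is a zero-arity function polled before every step;
  # `work` is called once per step with the step number.
  def run(steps, cancelled_fun, work) do
    done =
      for i <- 1..steps do
        # Stop before starting step i and report how many steps finished.
        if cancelled_fun.(), do: throw({:cancelled, i - 1})
        work.(i)
      end

    {:ok, done}
  catch
    {:cancelled, completed} -> {:error, {:cancelled, completed}}
  end
end
```

For example, `Sketch.Cancellable.run(3, fn -> false end, &(&1 * 2))` returns `{:ok, [2, 4, 6]}`. If the flag flips while step 2 is running, the loop stops before step 3 and returns `{:error, {:cancelled, 2}}`. Work already in progress is never interrupted, so long steps should be kept short or should poll the flag themselves.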
Executing Experiments
Via Web UI
- Navigate to /experiments
- Click "New" to create an instance
- Select an experiment module and configure it
- Click "Run" to execute
- Watch logs, results, and progress update in real time
Programmatically
    # Start a run for an existing instance
    {:ok, run} = Athanor.Runtime.start_run(instance)

    # Cancel a running experiment
    Athanor.Runtime.cancel_run(run)
MCP Server
Athanor includes a Model Context Protocol (MCP) server that allows AI agents to programmatically manage experiments, runs, logs, and results. The server exposes 15 tools for complete experiment lifecycle management.
Endpoint: http://localhost:4000/mcp
Available Operations:
- List, create, and update experiments
- Discover available experiment modules and their schemas
- Start, monitor, and cancel runs
- Query logs and results
For detailed documentation on all available tools and usage examples, see docs/MCP_SERVER.md.
Quick Example
    # Connect an MCP client to the server
    # The client can then call tools like:
    # - list_experiments
    # - create_experiment
    # - start_run
    # - get_run_logs
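For orientation, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below only builds that request as an Elixir map. `Sketch.McpRequest` is a hypothetical helper, and calling list_experiments with no arguments is an assumption; docs/MCP_SERVER.md has the actual parameters. A real client would also encode the map as JSON and complete the MCP initialization handshake before calling tools.

```elixir
defmodule Sketch.McpRequest do
  # Build the JSON-RPC 2.0 envelope that MCP uses for tool invocations.
  def tool_call(id, tool_name, arguments \\ %{}) do
    %{
      "jsonrpc" => "2.0",
      "id" => id,
      "method" => "tools/call",
      "params" => %{"name" => tool_name, "arguments" => arguments}
    }
  end
end

# "list_experiments" is one of the tool names listed above; the empty
# argument map is an assumption made for this illustration.
IO.inspect(Sketch.McpRequest.tool_call(1, "list_experiments"))
```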
Analyzing Results
Results are stored as key/value pairs in the run_results table. Each result has:
- run_id - The run it belongs to
- key - A string identifier (e.g., "iteration_1", "model_response")
- value - A JSONB column containing arbitrary data
This structure makes results easy to query and analyze outside of Athanor.
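As a rough illustration of working with this shape, the sketch below flattens results into rows and counts them by a field. It operates on plain maps with `:key` and `:value` entries, which is an assumption about what loaded results look like; `Sketch.ResultTable` and the sample field names are hypothetical. It also assumes that values read back from a JSONB column carry string keys.

```elixir
defmodule Sketch.ResultTable do
  # Flatten each result into one row: the fields of its value plus its key.
  # Rows use string keys throughout to match the decoded JSON values.
  def to_rows(results) do
    Enum.map(results, fn %{key: key, value: value} ->
      Map.put(value, "key", key)
    end)
  end

  # Count how many results carry each distinct value of `field`.
  def count_by(results, field) do
    results
    |> Enum.map(fn %{value: value} -> Map.get(value, field) end)
    |> Enum.frequencies()
  end
end
```

Given two results whose values both contain `"model" => "gpt-4"`, `Sketch.ResultTable.count_by(results, "model")` returns `%{"gpt-4" => 2}`.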
Querying Results
    # Get all results for a run
    Athanor.Experiments.list_results(run_id)

    # Query directly with Ecto
    import Ecto.Query

    Athanor.Repo.all(
      from r in Athanor.Experiments.Result,
        where: r.run_id == ^run_id,
        where: r.key == "model_response"
    )
Jupyter Notebook Analysis
Results can be loaded directly into a notebook for analysis, either a Jupyter notebook with Python or a Livebook with Elixir. With Python:

    import psycopg2
    import pandas as pd

    conn = psycopg2.connect("postgresql://localhost/athanor_dev")

    # Load results for a specific run
    df = pd.read_sql("""
        SELECT key, value, inserted_at
        FROM run_results
        WHERE run_id = %s
        ORDER BY inserted_at
    """, conn, params=[run_id])

    # The 'value' column contains JSON - expand it
    df = pd.concat([df, pd.json_normalize(df['value'])], axis=1)
Or with Livebook (Elixir):
    # In a Livebook connected to your Athanor node
    results = Athanor.Experiments.list_results(run_id)

    # Convert to a table for analysis
    results
    |> Enum.map(fn r -> Map.merge(%{key: r.key}, r.value) end)
    |> Kino.DataTable.new()
Example: SubstrateShift
The substrate_shift app contains a complete example experiment that tests whether LLMs can detect when they're running on a different underlying model.
Configuration options:
- runs_per_pair - Number of test runs per model pair
- parallelism - Concurrent pairs to test
- model_pairs - List of model pairs to compare
See apps/substrate_shift/lib/substrate_shift.ex for the full implementation.
Development
    # Run tests
    mix test

    # Format code and run checks
    mix precommit

    # Start interactive shell with server
    iex -S mix phx.server
Data Model
- experiment_instances - Configured experiments with name, description, and configuration
- experiment_runs - Execution records with status, timing, and error info
- run_results - Key-value results from each run
- run_logs - Log entries with level, message, and metadata
