feat: add e2e workspace build duration metric by sreya · Pull Request #21739 · coder/coder

Adds a coderd_template_workspace_build_duration_seconds histogram that tracks
the full duration from workspace build creation to agent ready. This captures
the complete user-perceived build time, including provisioning and agent startup.

The metric is emitted when the agent reports ready/error/timeout via the
lifecycle API, ensuring each build is counted exactly once per replica.

Labels: template_name, organization_name, workspace_transition, status

Fixes #21621

…tric

- Add 'prebuild' label to distinguish prebuild creation from user builds
- Emit metric only when all agents are ready (multi-agent workspaces)
- Use worst status across all agents (error > timeout > success)
- Use MAX(ready_at) for accurate duration calculation
- Add EnableOpenMetrics to promhttp.HandlerOpts for protobuf scraping
- Remove debug logging from lifecycle.go

Removes the self-join by querying workspace_resources directly instead of
starting from workspace_agents. The agent's ResourceID is already available
at the call site, making this the more efficient approach.

Also adds dbauthz test for GetWorkspaceBuildMetricsByResourceID.

Co-authored-by: Danny Kopping <danny@coder.com>

Makes WorkspaceBuildDurationHistogram injectable on LifecycleAPI so
tests can use a per-test prometheus.Registry. Each metric-related test
now gathers metrics and asserts:

- Correct label values (template_name, org, transition, status, is_prebuild)
- Exactly one observation with positive duration
- No observations when AllAgentsReady is false

Replace custom helper functions with the existing promhelp package
(coderd/coderdtest/promhelp), which is the codebase standard for
Prometheus metric testing.

Each test now uses:
- promhelp.HistogramValue() to validate labels, sample count, and sum
- promhelp.MetricValue() to assert metric absence
- Per-test prometheus.Registry for isolation

Move metric name to an exported constant so tests reference it
instead of duplicating the string.

Use fixed buildCreatedAt and agentReadyAt times (90s apart) so tests
can assert the exact GetSampleSum() rather than just > 0.

Required for Prometheus to scrape native histograms via protobuf
format, used by the workspace build duration metric.

Not required for native histograms: protobuf scraping is controlled
by the Prometheus scrape_protocols config, not by the server-side handler.

Replace the global WorkspaceBuildDurationSeconds var with
NewBuildDurationHistogram(reg) so tests use the real histogram
definition (same buckets, native histogram config, labels) instead
of a separate test-only histogram.

The histogram is created in coderd.go when Prometheus is enabled,
stored on the API struct, and threaded through to the agent API
via Options.

- Replace package-level var with LifecycleMetrics struct and
  NewLifecycleMetrics constructor
- Make emitBuildDurationMetric a receiver on LifecycleAPI
- Unexport emitMetricsOnce
- Thread LifecycleMetrics through coderd -> agentapi Options
- Tests use NewLifecycleMetrics with per-test registries

sreya deleted the feat/workspace-build-e2e-metric branch on February 6, 2026 22:26