fix(conductor): increase retries in case ingress restarts by joroshiba · Pull Request #2270 · astriaorg/astria

Conversation

@joroshiba

Summary

Gives conductor a larger backup when attempting to fetch a block commit and validator set, and adds a debug log so we can see if getting errors which are not retried through logs.

Background

When a sequencer restarts or ingress has a bounce, we tend to get lagging firm commits because we can't fetch the sequencer information and conductor drops the block from its queue. This change increases the amount of time it will go, noting the retries here are on an exponential backoff.

Changes

  • Backoff max time increased from 10s to 5m on fetch commit and fetch validators
  • new debug log

greptile-apps[bot]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Increased retry backoff max delay from 10s to 5 minutes for fetching sequencer commits and validators to handle ingress/sequencer restarts better.

  • Increased max_delay from 10s to 300s (5 minutes) in fetch_commit_with_retry and fetch_validators_with_retry
  • Added debug log when non-retryable errors are encountered (but with incorrect syntax)
  • Uses exponential backoff starting at 100ms, doubling until reaching the 5-minute cap
  • Only retries on HTTP/Timeout errors; other errors break immediately

Confidence Score: 3/5

  • This PR has a syntax error that will prevent logging from working correctly
  • The retry logic change is sensible and addresses a real operational issue. However, there's a syntax error in the new debug log that needs to be fixed. Additionally, the hard-coded 300s value should ideally be configurable per the custom instructions, though this is a style concern rather than a blocker
  • Pay close attention to crates/astria-conductor/src/celestia/verify.rs - fix the debug log syntax error before merging

Important Files Changed

File Analysis

Filename Score Overview
crates/astria-conductor/src/celestia/verify.rs 3/5 Increased retry max delay from 10s to 5m and added debug log with incorrect field syntax

Sequence Diagram

sequenceDiagram
    participant Conductor
    participant CometBftRetryStrategy
    participant SequencerClient
    participant Sequencer

    Conductor->>SequencerClient: fetch_commit_with_retry(height)
    activate SequencerClient
    
    loop Retry with exponential backoff (max 5 min)
        SequencerClient->>Sequencer: commit(height)
        alt Retryable Error (HTTP/Timeout)
            Sequencer-->>SequencerClient: Error
            CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
            CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
            Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
            CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
        else Non-retryable Error
            Sequencer-->>SequencerClient: Error
            CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
            CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
            Note over CometBftRetryStrategy: New debug log added
            CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
            SequencerClient-->>Conductor: Return Error
        else Success
            Sequencer-->>SequencerClient: Commit Response
            SequencerClient-->>Conductor: Return Commit
        end
    end
    
    deactivate SequencerClient
    
    Conductor->>SequencerClient: fetch_validators_with_retry(height)
    activate SequencerClient
    
    loop Retry with exponential backoff (max 5 min)
        SequencerClient->>Sequencer: validators(height)
        alt Retryable Error (HTTP/Timeout)
            Sequencer-->>SequencerClient: Error
            CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
            CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
            Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
            CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
        else Non-retryable Error
            Sequencer-->>SequencerClient: Error
            CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
            CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
            Note over CometBftRetryStrategy: New debug log added
            CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
            SequencerClient-->>Conductor: Return Error
        else Success
            Sequencer-->>SequencerClient: Validators Response
            SequencerClient-->>Conductor: Return Validators
        end
    end
    
    deactivate SequencerClient
Loading

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

let retry_config = RetryFutureConfig::new(u32::MAX)
.custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100)))
.max_delay(Duration::from_secs(10))
.max_delay(Duration::from_secs(300))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments

Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/astria-conductor/src/celestia/verify.rs
Line: 298:298

Comment:
**style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments

**Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde))

How can I resolve this? If you propose a fix, please make it concise.
let retry_config = RetryFutureConfig::new(u32::MAX)
.custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100)))
.max_delay(Duration::from_secs(10))
.max_delay(Duration::from_secs(300))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments

Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/astria-conductor/src/celestia/verify.rs
Line: 333:333

Comment:
**style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments

**Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde))

How can I resolve this? If you propose a fix, please make it concise.

mycodecrafting

Fraser999

Labels

conductor

pertaining to the astria-conductor crate