fix(conductor): increase retries in case ingress restarts by joroshiba · Pull Request #2270 · astriaorg/astria
Conversation
Summary
Gives conductor a larger backup when attempting to fetch a block commit and validator set, and adds a debug log so we can see if getting errors which are not retried through logs.
Background
When a sequencer restarts or ingress has a bounce, we tend to get lagging firm commits because we can't fetch the sequencer information and conductor drops the block from its queue. This change increases the amount of time it will go, noting the retries here are on an exponential backoff.
Changes
- Backoff max time increased from 10s to 5m on fetch commit and fetch validators
- new debug log
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Increased retry backoff max delay from 10s to 5 minutes for fetching sequencer commits and validators to handle ingress/sequencer restarts better.
- Increased
max_delayfrom 10s to 300s (5 minutes) infetch_commit_with_retryandfetch_validators_with_retry - Added debug log when non-retryable errors are encountered (but with incorrect syntax)
- Uses exponential backoff starting at 100ms, doubling until reaching the 5-minute cap
- Only retries on HTTP/Timeout errors; other errors break immediately
Confidence Score: 3/5
- This PR has a syntax error that will prevent logging from working correctly
- The retry logic change is sensible and addresses a real operational issue. However, there's a syntax error in the new debug log that needs to be fixed. Additionally, the hard-coded 300s value should ideally be configurable per the custom instructions, though this is a style concern rather than a blocker
- Pay close attention to
crates/astria-conductor/src/celestia/verify.rs- fix the debug log syntax error before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| crates/astria-conductor/src/celestia/verify.rs | 3/5 | Increased retry max delay from 10s to 5m and added debug log with incorrect field syntax |
Sequence Diagram
sequenceDiagram
participant Conductor
participant CometBftRetryStrategy
participant SequencerClient
participant Sequencer
Conductor->>SequencerClient: fetch_commit_with_retry(height)
activate SequencerClient
loop Retry with exponential backoff (max 5 min)
SequencerClient->>Sequencer: commit(height)
alt Retryable Error (HTTP/Timeout)
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
else Non-retryable Error
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
Note over CometBftRetryStrategy: New debug log added
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
SequencerClient-->>Conductor: Return Error
else Success
Sequencer-->>SequencerClient: Commit Response
SequencerClient-->>Conductor: Return Commit
end
end
deactivate SequencerClient
Conductor->>SequencerClient: fetch_validators_with_retry(height)
activate SequencerClient
loop Retry with exponential backoff (max 5 min)
SequencerClient->>Sequencer: validators(height)
alt Retryable Error (HTTP/Timeout)
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Double delay (exponential backoff)
Note over CometBftRetryStrategy: Max delay capped at 300s (5min)<br/>Previously 10s
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Delay
else Non-retryable Error
Sequencer-->>SequencerClient: Error
CometBftRetryStrategy->>CometBftRetryStrategy: Check error type
CometBftRetryStrategy->>CometBftRetryStrategy: Log debug message
Note over CometBftRetryStrategy: New debug log added
CometBftRetryStrategy-->>SequencerClient: RetryPolicy::Break
SequencerClient-->>Conductor: Return Error
else Success
Sequencer-->>SequencerClient: Validators Response
SequencerClient-->>Conductor: Return Validators
end
end
deactivate SequencerClient
1 file reviewed, 3 comments
| let retry_config = RetryFutureConfig::new(u32::MAX) | ||
| .custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100))) | ||
| .max_delay(Duration::from_secs(10)) | ||
| .max_delay(Duration::from_secs(300)) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments
Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)
Prompt To Fix With AI
This is a comment left during a code review. Path: crates/astria-conductor/src/celestia/verify.rs Line: 298:298 Comment: **style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments **Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde)) How can I resolve this? If you propose a fix, please make it concise.
| let retry_config = RetryFutureConfig::new(u32::MAX) | ||
| .custom_backoff(CometBftRetryStrategy::new(Duration::from_millis(100))) | ||
| .max_delay(Duration::from_secs(10)) | ||
| .max_delay(Duration::from_secs(300)) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have XXX: This should probably be configurable comments
Context Used: Rule from dashboard - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... (source)
Prompt To Fix With AI
This is a comment left during a code review. Path: crates/astria-conductor/src/celestia/verify.rs Line: 333:333 Comment: **style:** hard-coded 300s (5min) max delay should be configurable via environment variable, similar to other timeout values in the codebase that have `XXX: This should probably be configurable` comments **Context Used:** Rule from `dashboard` - Hard-coded values that may vary between different networks (local, mocha, mainnet) should be configu... ([source](https://app.greptile.com/review/custom-context?memory=3a7f89a9-1179-4b2d-b733-e3acbebacdde)) How can I resolve this? If you propose a fix, please make it concise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters