Enhance Network Timeout Handling and Error Recovery for Production Reliability
Description:
Currently, tlock's timeout handling is basic: network operations use a fixed 5-second timeout (`const timeout = 5 * time.Second` in `networks/http/http.go`). A single hard-coded value can be problematic in production environments, where network conditions vary and failures need more robust handling than an immediate error.
Current Limitations:
- Fixed timeout duration:
```go
// networks/http/http.go
const timeout = 5 * time.Second
```
- Basic error handling without retries; a single failed request is returned to the caller immediately:
```go
// networks/http/http.go
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()

result, err := n.client.Get(ctx, roundNumber)
if err != nil {
	return nil, err
}
```
- Limited error context in network operations; any failure from `Signature` is reported as `ErrTooEarly`, and the underlying error is discarded:
```go
// tlock_age.go
signature, err := t.network.Signature(roundNumber)
if err != nil {
	return nil, fmt.Errorf(
		"%w: expected round %d > %d current round",
		ErrTooEarly,
		roundNumber,
		t.network.Current(time.Now()))
}
```
Proposed Changes:
- Configurable Timeout Settings
  - Add configuration options for different timeout types:
    ```go
    type NetworkConfig struct {
        DialTimeout      time.Duration
        RequestTimeout   time.Duration
        KeepAliveTimeout time.Duration
        RetryTimeout     time.Duration
    }
    ```
  - Allow timeout configuration through environment variables and CLI flags
  - Implement reasonable defaults for different network operations
- Retry Mechanism with Exponential Backoff
  - Implement a retry mechanism for transient failures:
    ```go
    type RetryConfig struct {
        MaxAttempts       int
        InitialDelay      time.Duration
        MaxDelay          time.Duration
        BackoffMultiplier float64
    }
    ```
  - Add exponential backoff for failed requests
  - Distinguish between retryable and non-retryable errors
- Enhanced Error Context
  - Create custom error types for different failure scenarios:
    ```go
    type NetworkError struct {
        Op          string
        RoundNumber uint64
        Attempt     int
        Timeout     time.Duration
        Err         error
    }
    ```
  - Add detailed error messages with:
    - Network endpoint information
    - Request timing details
    - Retry attempt count
    - Specific failure reason
- Monitoring and Logging
  - Add structured logging for network operations
  - Include metrics for:
    - Request latencies
    - Retry counts
    - Failure rates
    - Timeout occurrences
Implementation Details:
- Create a new network client configuration structure:
```go
type NetworkClientConfig struct {
	Timeouts NetworkConfig // dial, request, keep-alive and retry timeouts
	Retries  RetryConfig
	Logging  LogConfig
}
```
- Implement retry logic with context:
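```go
func (n *Network) getWithRetry(ctx context.Context, roundNumber uint64) (*Result, error) {
	var lastErr error
	for attempt := 0; attempt < n.config.Retries.MaxAttempts; attempt++ {
		// Back off before each retry (not before the first attempt) and stop
		// waiting as soon as the caller's context is done.
		if attempt > 0 {
			timer := time.NewTimer(n.calculateBackoff(attempt))
			select {
			case <-ctx.Done():
				timer.Stop()
				return nil, &NetworkError{
					Op:          "get_signature",
					RoundNumber: roundNumber,
					Attempt:     attempt,
					Err:         ctx.Err(),
				}
			case <-timer.C:
			}
		}

		// Bound each request individually so one hung call cannot use up
		// the whole retry budget.
		reqCtx, cancel := context.WithTimeout(ctx, n.config.Timeouts.RequestTimeout)
		result, err := n.client.Get(reqCtx, roundNumber)
		cancel()
		if err == nil {
			return result, nil
		}

		lastErr = &NetworkError{
			Op:          "get_signature",
			RoundNumber: roundNumber,
			Attempt:     attempt,
			Timeout:     n.config.Timeouts.RequestTimeout,
			Err:         err,
		}
		if !isRetryableError(err) {
			return nil, lastErr
		}
	}
	return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}
```
- Add configuration validation: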
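```go
func validateConfig(config NetworkClientConfig) error {
	if config.Timeouts.RequestTimeout < minTimeout {
		return fmt.Errorf("request timeout %v below minimum %v",
			config.Timeouts.RequestTimeout, minTimeout)
	}
	// getWithRetry needs at least one attempt; otherwise it would report
	// "max retries exceeded" without ever calling the client.
	if config.Retries.MaxAttempts < 1 {
		return fmt.Errorf("max attempts must be at least 1, got %d",
			config.Retries.MaxAttempts)
	}
	// Add other validation rules
	return nil
}
```
Benefits: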
- Improved reliability in unstable network conditions
- Better error handling and recovery
- More detailed error reporting for debugging
- Configurable behavior for different deployment environments
- Better monitoring and observability
Testing:
Add new test cases:
- Test retry behavior with simulated network failures
- Verify timeout configurations
- Test error handling with different network conditions
- Validate monitoring metrics
Acceptance Criteria:
- Configurable timeout settings implemented
- Retry mechanism with exponential backoff working
- Enhanced error messages with context
- Monitoring and logging improvements
- Test coverage for new functionality
- Documentation updated with new configuration options
- Backward compatibility maintained