feat: Implement Robust Network Timeout Handling with Retries #88

@pranavkonde

Enhance Network Timeout Handling and Error Recovery for Production Reliability

Description:

tlock currently applies a fixed 5-second timeout to network operations (const timeout = 5 * time.Second in networks/http/http.go) and does not retry failed requests. This is fragile in production environments, where network conditions vary and a single slow or dropped request causes the whole operation to fail.

Current Limitations:

  1. Fixed timeout duration:
// networks/http/http.go
const timeout = 5 * time.Second
  2. Basic error handling without retries:
// networks/http/http.go
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()

result, err := n.client.Get(ctx, roundNumber)
if err != nil {
    return nil, err
}
  3. Limited error context in network operations:
// tlock_age.go
signature, err := t.network.Signature(roundNumber)
if err != nil {
    return nil, fmt.Errorf(
        "%w: expected round %d > %d current round",
        ErrTooEarly,
        roundNumber,
        t.network.Current(time.Now()))
}

Proposed Changes:

  1. Configurable Timeout Settings
  • Add configuration options for different timeout types:
    type NetworkConfig struct {
        DialTimeout      time.Duration
        RequestTimeout   time.Duration
        KeepAliveTimeout time.Duration
        RetryTimeout     time.Duration
    }
  • Allow timeout configuration through environment variables and CLI flags
  • Implement reasonable defaults for different network operations
  2. Retry Mechanism with Exponential Backoff
  • Implement a retry mechanism for transient failures:
    type RetryConfig struct {
        MaxAttempts       int
        InitialDelay      time.Duration
        MaxDelay          time.Duration
        BackoffMultiplier float64
    }
  • Add exponential backoff for failed requests
  • Distinguish between retryable and non-retryable errors
  3. Enhanced Error Context
  • Create custom error types for different failure scenarios:
    type NetworkError struct {
        Op          string
        RoundNumber uint64
        Attempt     int
        Timeout     time.Duration
        Err         error
    }
  • Add detailed error messages with:
    • Network endpoint information
    • Request timing details
    • Retry attempt count
    • Specific failure reason
  4. Monitoring and Logging
  • Add structured logging for network operations
  • Include metrics for:
    • Request latencies
    • Retry counts
    • Failure rates
    • Timeout occurrences

Implementation Details:

  1. Create a new network client configuration structure:
type NetworkClientConfig struct {
    Timeouts NetworkConfig
    Retries  RetryConfig
    Logging  LogConfig
}
  2. Implement retry logic with context:
func (n *Network) getWithRetry(ctx context.Context, roundNumber uint64) (*Result, error) {
    var lastErr error
    for attempt := 0; attempt < n.config.Retries.MaxAttempts; attempt++ {
        // Back off before every retry, but not before the first attempt.
        if attempt > 0 {
            timer := time.NewTimer(n.calculateBackoff(attempt))
            select {
            case <-ctx.Done():
                // Stop waiting as soon as the caller cancels or the deadline passes.
                timer.Stop()
                return nil, &NetworkError{
                    Op:          "get_signature",
                    RoundNumber: roundNumber,
                    Attempt:     attempt,
                    Err:         ctx.Err(),
                }
            case <-timer.C:
            }
        }

        result, err := n.client.Get(ctx, roundNumber)
        if err == nil {
            return result, nil
        }
        lastErr = err

        // Non-retryable errors are returned immediately, with context attached.
        if !isRetryableError(err) {
            return nil, &NetworkError{
                Op:          "get_signature",
                RoundNumber: roundNumber,
                Attempt:     attempt,
                Err:         err,
            }
        }
    }
    return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}
  3. Add configuration validation:
func validateConfig(config NetworkClientConfig) error {
    if config.Timeouts.RequestTimeout < minTimeout {
        return fmt.Errorf("request timeout %v below minimum %v", 
            config.Timeouts.RequestTimeout, minTimeout)
    }
    // Add other validation rules
    return nil
}

Benefits:

  1. Improved reliability in unstable network conditions
  2. Better error handling and recovery
  3. More detailed error reporting for debugging
  4. Configurable behavior for different deployment environments
  5. Better monitoring and observability

Testing:

Add new test cases:

  • Test retry behavior with simulated network failures
  • Verify timeout configurations
  • Test error handling with different network conditions
  • Validate monitoring metrics

Acceptance Criteria:

  • Configurable timeout settings implemented
  • Retry mechanism with exponential backoff working
  • Enhanced error messages with context
  • Monitoring and logging improvements
  • Test coverage for new functionality
  • Documentation updated with new configuration options
  • Backward compatibility maintained
