Two phase shutdown process #576

@danilofuchs

Our production application has 3 main entrypoints:

  1. HTTP Routes (Koa)
  2. Jobs/Crons (pg-boss)
  3. Events (event-emitter2)

All within the same process in a Docker container.

When shutting down the application gracefully, we devised three steps:

  1. Stop route handlers and job workers and drain them
  2. Drain event listeners
  3. Close connections to application DB, Redis, Sentry

Route handlers, job workers and event listeners may want to publish jobs or dispatch events at any time. As the event bus is in-memory, it needs to be available until all handlers/workers are drained.
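The ordering constraint above can be sketched as a small phase runner. This is only an illustration: `runShutdown`, `Step`, and the placeholder step names are hypothetical and not part of pg-boss, Koa, or event-emitter2. Steps inside a phase run concurrently, and the next phase starts only after the previous one has fully settled.

```typescript
// Hypothetical helper: one async function per shutdown step.
type Step = () => Promise<void>;

// Each inner array is one phase. Steps within a phase run concurrently;
// phases run strictly one after another.
async function runShutdown(phases: Step[][]): Promise<void> {
  for (const steps of phases) {
    await Promise.all(steps.map((step) => step()));
  }
}

// Placeholder steps that only record their name. In a real application
// each one would drain or close the corresponding component.
const order: string[] = [];
const step = (name: string): Step => async () => {
  order.push(name);
};

const shutdownPlan: Step[][] = [
  [step("drain routes"), step("drain job workers")], // step 1
  [step("drain event listeners")],                   // step 2
  [step("close DB"), step("close Redis"), step("close Sentry")], // step 3
];
```

Calling `runShutdown(shutdownPlan)` from a signal handler would run the three steps in the order described above.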

Currently we use pg-boss stop({ wait: true, close: true, graceful: true }).

This causes issues if, in step 2, an event listener wants to publish a job: by then, the pg-boss DB connection is already closed.

Proposed solution

Expose a drain(timeout) method that stops workers (internally offWork() + wait with timeout) but keeps the DB open so new jobs may be published.

Allow stop() to close the DB connection even if the workers are already stopped.
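To make the intended semantics concrete, here is a toy model of the two-phase lifecycle. It is not pg-boss code: `TwoPhaseLifecycle`, `stopWorkers`, `closeDb`, and `canPublish` are hypothetical names used only to show the state transitions we have in mind (drain keeps publishing possible, stop closes the DB even after a drain).

```typescript
type State = "running" | "drained" | "closed";

// Toy model only. stopWorkers and closeDb are injected stand-ins for
// "offWork() + wait with timeout" and "close the DB pool".
class TwoPhaseLifecycle {
  private state: State = "running";
  private readonly stopWorkers: (timeoutMs: number) => Promise<void>;
  private readonly closeDb: () => Promise<void>;

  constructor(
    stopWorkers: (timeoutMs: number) => Promise<void>,
    closeDb: () => Promise<void>,
  ) {
    this.stopWorkers = stopWorkers;
    this.closeDb = closeDb;
  }

  // Publishing stays possible until the DB is actually closed.
  get canPublish(): boolean {
    return this.state !== "closed";
  }

  // Phase 1: stop workers, keep the DB open.
  async drain(timeoutMs: number): Promise<void> {
    if (this.state !== "running") return;
    await this.stopWorkers(timeoutMs);
    this.state = "drained";
  }

  // Phase 2: close the DB. Works whether or not drain() already ran.
  async stop(timeoutMs: number): Promise<void> {
    if (this.state === "closed") return;
    await this.drain(timeoutMs); // no-op when already drained
    await this.closeDb();
    this.state = "closed";
  }
}
```

The key difference from the current behavior is that a previous drain does not make stop() short-circuit: only the "closed" state does.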

Considered alternatives

We are going with alternative 3 for now, but would love to have your thoughts on this workflow!

1. Calling stop 2 times

// Drain
await this.boss.stop({
  close: false, // Keep DB connection open so we can still add jobs
  wait: true,
  graceful: true,
  timeout: timeoutMs,
});
// Close connections
await this.boss.stop({
  close: true,
  wait: true,
  graceful: true,
  timeout: timeoutMs,
});

This does not work, as stop() checks whether the this.stopped flag is set and short-circuits, so the second call never closes the connection.

2. Stopping via getDb().stop()

// Drain
await this.boss.stop({
  close: false, // Keep DB connection open so we can still add jobs
  wait: true,
  graceful: true,
  timeout: timeoutMs,
});
// Close connections
await this.boss.getDb().stop();

The issue here is that the Db interface does not (and should not) expose stop(), as it is an implementation detail. We could patch it on our side, but that seems shady.

3. Implement custom Db

Copy the pg-boss implementation of Db into our code and manage the pool manually.

4. offWork() + stop()

Call offWork() with each queue name (step 1) and then stop() after event listeners drain (step 3).

New jobs won't be processed during shutdown, but this does not wait for current jobs to drain. This means those jobs may still depend on event listeners that have already been shut down.

5. Custom worker tracking

We could track running jobs on our side and offWork/wait for them.
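A minimal sketch of what that tracking could look like, assuming every job handler is wrapped on our side. `InFlightTracker`, `track`, and `drain` are hypothetical names, not pg-boss APIs. `drain` only waits for jobs that were already running when it was called, so it is meant to be used after offWork() has stopped new jobs from starting.

```typescript
// Hypothetical helper: remembers promises of running job handlers.
class InFlightTracker {
  private readonly running = new Set<Promise<unknown>>();

  // Wrap a job handler invocation so it is counted while it runs.
  track<T>(work: () => Promise<T>): Promise<T> {
    const p = work();
    this.running.add(p);
    const forget = (): void => {
      this.running.delete(p);
    };
    p.then(forget, forget); // forget on success and on failure
    return p;
  }

  get size(): number {
    return this.running.size;
  }

  // Resolves true if all jobs running at call time settled,
  // false if timeoutMs elapsed first.
  async drain(timeoutMs: number): Promise<boolean> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timedOut = new Promise<boolean>((resolve) => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });
    // Swallow rejections here: a failed job still counts as finished.
    const settled = Promise.all(
      Array.from(this.running).map((p) =>
        p.then(() => undefined, () => undefined),
      ),
    ).then(() => true);
    try {
      return await Promise.race([settled, timedOut]);
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Each worker callback would then run as `tracker.track(() => handler(job))`, and step 1 of the shutdown becomes offWork() for every queue followed by `await tracker.drain(timeoutMs)`.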
