Supervisor

The supervisor is the orchestrator of your Ductwork infrastructure. As the parent process for each bin/ductwork instance, it launches and monitors all child processes, ensuring your pipelines continue running even when individual processes fail.

Overview

When you run bin/ductwork, you start a supervisor process that:

Launches one pipeline advancer process
Launches one job worker process for each configured pipeline
Monitors all child processes through heartbeat checks
Automatically restarts failed or hung processes
Coordinates graceful shutdown across all processes

The supervisor acts as the resilience layer, making Ductwork pipelines fault-tolerant without manual intervention.

Process Hierarchy

Each Ductwork instance creates the following process tree:

supervisor (bin/ductwork)
├── pipeline advancer
│   ├── thread for Pipeline A
│   ├── thread for Pipeline B
│   └── thread for Pipeline C
├── job worker (Pipeline A)
│   ├── worker thread 1
│   ├── worker thread 2
│   └── ...
├── job worker (Pipeline B)
│   ├── worker thread 1
│   ├── worker thread 2
│   └── ...
└── job worker (Pipeline C)
    ├── worker thread 1
    ├── worker thread 2
    └── ...

Responsibilities

Process Lifecycle Management

The supervisor manages the complete lifecycle of child processes:

Startup:

Read configuration from YAML file
Fork the pipeline advancer process
Fork one job worker process per configured pipeline
Register signal handlers for graceful shutdown
Enter monitoring loop

Monitoring:

Check heartbeats from each child process
Track process health and uptime
Detect crashes or hangs
Log process status changes

Recovery:

Automatically restart failed processes
Maintain pipeline availability during failures
Preserve pipeline state through process restarts

Shutdown:

Forward shutdown signals to all children
Wait for graceful shutdown with timeout
Terminate unresponsive processes
Clean up resources and exit
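
The four phases above can be sketched in a few lines of Ruby. This is an illustration only, not Ductwork's implementation: `MiniSupervisor` and `ChildSpec` are hypothetical names, children are launched with `Process.spawn` running placeholder commands rather than being forked, and failures are detected by reaping exited processes rather than by heartbeats.

```ruby
# Hypothetical description of a child: a label plus the command that launches it.
ChildSpec = Struct.new(:name, :command)

class MiniSupervisor
  attr_reader :children, :restarts

  def initialize(specs)
    @specs = specs
    @children = {}           # pid => ChildSpec
    @restarts = Hash.new(0)  # child name => restart count
  end

  # Startup: launch every configured child once.
  def start
    @specs.each { |spec| launch(spec) }
  end

  # Monitoring and recovery: one pass of the loop. Reap any child that has
  # exited and launch a replacement from the same spec.
  def check_children
    @children.keys.each do |pid|
      next unless Process.waitpid(pid, Process::WNOHANG)

      spec = @children.delete(pid)
      @restarts[spec.name] += 1
      launch(spec)
    end
  end

  # Shutdown (simplified): kill and reap everything still running.
  def stop
    @children.each_key do |pid|
      Process.kill("KILL", pid)
      Process.wait(pid)
    end
    @children.clear
  end

  private

  def launch(spec)
    pid = Process.spawn(*spec.command)
    @children[pid] = spec
  end
end
```

A caller would build the supervisor from a list of specs, call `start`, and then call `check_children` in a loop with a short sleep between passes. The `stop` method here kills children outright to keep the sketch short; it does not model a graceful shutdown.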

Heartbeat Monitoring

The supervisor continuously monitors child process health through periodic heartbeats. Each child process reports its status at regular intervals, confirming it’s alive and processing work.

Detection: If a child process does not report a heartbeat within 5 minutes, the supervisor treats it as failed, whether the cause is a crash, a hang, or a deadlock.

Recovery: The supervisor immediately spawns a replacement process to restore full pipeline capacity. The new process picks up where the previous one left off, resuming work on pending jobs.

Why 5 minutes? This timeout balances quick failure detection against tolerance for legitimately slow operations. Steps should typically complete in seconds, so the 5-minute window absorbs temporarily degraded performance without triggering false positives.
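
The detection rule can be expressed as a small check over the most recent heartbeat of each child. The following is a minimal sketch, assuming heartbeats are available as a hash of child name to `Time`; the `stale_children` helper and that data shape are hypothetical and say nothing about how Ductwork actually stores heartbeats.

```ruby
HEARTBEAT_TIMEOUT = 5 * 60 # seconds; the 5-minute window described above

# Returns the names of children whose last heartbeat is older than the timeout.
# `last_heartbeats` maps a child name to the Time of its most recent heartbeat.
def stale_children(last_heartbeats, now: Time.now, timeout: HEARTBEAT_TIMEOUT)
  last_heartbeats.select { |_name, last_seen| now - last_seen > timeout }.keys
end
```

A child whose last heartbeat is exactly 300 seconds old is still considered healthy in this sketch; only children past the window are returned as candidates for restart.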

Configuration

The supervisor’s behavior is controlled through config/ductwork.yml:

`pipelines`

Specifies which pipelines to run. The supervisor creates child processes based on this configuration.

default: &default
  pipelines:
    - EnrichUserDataPipeline
    - ProcessOrdersPipeline

Or use the wildcard to run all defined pipelines:

default: &default
  pipelines: "*"

Note: The supervisor creates a single pipeline advancer process (with one thread per pipeline) and one job worker process for each pipeline listed here.
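
To make the mapping from configuration to processes concrete, here is a rough sketch that turns a `pipelines` setting into the list of child processes shown in the process hierarchy. The `process_plan` helper and its `defined_pipelines` argument are hypothetical; in particular, how Ductwork resolves the `"*"` wildcard is not covered here, so the caller supplies the full list.

```ruby
require "yaml"

# Lists the child processes a supervisor would launch for one environment.
# `defined_pipelines` stands in for however the application enumerates its
# pipeline classes; it is an assumption of this sketch.
def process_plan(config_yaml, env:, defined_pipelines: [])
  pipelines = YAML.safe_load(config_yaml).fetch(env).fetch("pipelines")
  pipelines = defined_pipelines if pipelines == "*"

  ["pipeline advancer"] + pipelines.map { |name| "job worker (#{name})" }
end
```

With the two pipelines from the first example, the plan contains three entries: one pipeline advancer and two job workers.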

`supervisor.polling_timeout`

How long (in seconds) the supervisor sleeps between heartbeat checks.

Default: 1 second

default: &default
  supervisor:
    polling_timeout: 5

Tuning: Shorter intervals provide faster failure detection but increase CPU usage. Longer intervals reduce overhead but delay failure detection. The default (1 second) works well for most applications.

`supervisor.shutdown_timeout`

Maximum time (in seconds) to wait for child processes to shut down gracefully. After this timeout, remaining processes receive SIGKILL and terminate immediately.

Default: 30 seconds

default: &default
  supervisor:
    shutdown_timeout: 45

Important: This value should be larger than job_worker.shutdown_timeout to allow proper cascading. If the supervisor timeout is too short, workers won’t have time to finish their shutdown sequence.

Recommended values:

job_worker.shutdown_timeout: 20 seconds
supervisor.shutdown_timeout: 30 seconds (leaves a 10-second buffer)
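
The relationship between the two timeouts is easy to check at boot time. The sketch below assumes the configuration has already been loaded into a hash with explicit values for both keys; `shutdown_buffer` is a hypothetical helper, not part of Ductwork.

```ruby
# Returns the buffer (in seconds) between the two timeouts, or raises if the
# supervisor would give up before its job workers can finish shutting down.
def shutdown_buffer(config)
  supervisor = config.fetch("supervisor").fetch("shutdown_timeout")
  job_worker = config.fetch("job_worker").fetch("shutdown_timeout")

  unless supervisor > job_worker
    raise ArgumentError,
          "supervisor.shutdown_timeout (#{supervisor}s) must exceed " \
          "job_worker.shutdown_timeout (#{job_worker}s)"
  end

  supervisor - job_worker
end
```

With the recommended values (30 and 20) the helper returns 10. Equal values are rejected, since they leave the workers no margin before the supervisor escalates.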

Signal Handling

The supervisor responds to Unix signals for control and debugging:

TERM and INT - Graceful Shutdown

Triggers the graceful shutdown sequence:

Supervisor forwards signal to all child processes
Child processes begin their shutdown sequences
Supervisor waits up to supervisor.shutdown_timeout seconds
Processes still alive after timeout are killed with SIGKILL
Supervisor exits

# Send TERM signal
kill -TERM <supervisor_pid>

# Or INT signal (both behave identically)
kill -INT <supervisor_pid>
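
The forward, wait, and escalate sequence can be sketched as follows. This is an approximation under stated assumptions rather than Ductwork's code: `shutdown_children`, `send_signal`, and `reaped?` are hypothetical helpers, the pids must be direct children of the calling process, and `timeout` plays the role of `supervisor.shutdown_timeout`.

```ruby
# Sends a signal to one process, ignoring processes that are already gone.
def send_signal(signal, pid)
  Process.kill(signal, pid)
rescue Errno::ESRCH
  nil
end

# True once the child has exited and been reaped.
def reaped?(pid)
  !Process.waitpid(pid, Process::WNOHANG).nil?
rescue Errno::ECHILD
  true
end

# Forwards TERM, waits up to `timeout` seconds, then sends KILL to whatever
# is still running. Returns which pids exited on their own and which were killed.
def shutdown_children(pids, timeout:, poll_interval: 0.05)
  pids.each { |pid| send_signal("TERM", pid) }

  remaining = pids.dup
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until remaining.empty? || Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    remaining.reject! { |pid| reaped?(pid) }
    sleep(poll_interval) unless remaining.empty?
  end

  remaining.each do |pid|
    send_signal("KILL", pid)
    Process.wait(pid)
  end

  { exited: pids - remaining, killed: remaining }
end
```

A monotonic clock is used for the deadline so that wall-clock adjustments during shutdown cannot shorten or extend the wait.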

See Signal Handling for detailed shutdown behavior.

TTIN - Thread Backtrace Dump

Requests thread backtraces from all child processes for debugging hung or slow processes.

kill -TTIN <supervisor_pid>

The supervisor forwards this signal to all children, which dump their thread backtraces to the configured logger. This is invaluable for diagnosing performance issues or deadlocks in production.
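
For readers unfamiliar with the technique, a thread backtrace dump in Ruby can look roughly like this. It is a generic sketch, not Ductwork's handler: `thread_backtrace_dump` is a hypothetical helper, and the output goes to standard error instead of a configured logger, because a logger that takes a lock cannot be called from inside a trap handler.

```ruby
# Builds a plain-text dump of every live thread's backtrace.
def thread_backtrace_dump(threads = Thread.list)
  threads.map do |thread|
    header = "Thread #{thread.name || thread.object_id} (#{thread.status})"
    frames = Array(thread.backtrace).map { |frame| "  #{frame}" }
    [header, *frames].join("\n")
  end.join("\n\n")
end

# Write the dump to standard error when the process receives TTIN.
Signal.trap("TTIN") { $stderr.puts thread_backtrace_dump }
```

Naming worker threads (via `Thread#name=`) makes the dump much easier to read, since unnamed threads are identified only by object id.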

See TTIN Signal Handling for details.

Lifecycle Hooks

Ductwork.on_supervisor_start do
  Rails.logger.info "Ductwork supervisor starting"
  # Initialize monitoring, notify deployment tracking, etc.
end

Ductwork.on_supervisor_stop do
  Rails.logger.info "Ductwork supervisor shutting down"
  # Flush metrics, notify monitoring systems, etc.
end

These hooks run once per supervisor lifecycle: at the very beginning of startup and the very end of shutdown. Use them for initialization, cleanup, or integration with external systems.

See Lifecycle Hooks for all available hooks.

Monitoring

Track supervisor health and behavior by monitoring:

Process Metrics

Supervisor uptime
Number of child process restarts
Child process spawn rate
Failed startup attempts

Resource Usage

Supervisor CPU and memory usage
Total memory across all child processes
Open file descriptors
Database connection count

Heartbeat Status

Time since last heartbeat from each child
Heartbeat check frequency
Missed heartbeat count

Shutdown Behavior

Time to complete graceful shutdown
Number of processes killed after timeout
Shutdown success rate

Running Multiple Supervisors

You can run multiple bin/ductwork instances to isolate pipelines or scale horizontally:

Isolate Critical Pipelines

# Critical pipelines with dedicated resources
bin/ductwork -c config/ductwork.critical.yml

# Background pipelines on separate instance
bin/ductwork -c config/ductwork.background.yml

# config/ductwork.critical.yml
production:
  pipelines:
    - ProcessPaymentsPipeline
    - SendNotificationsPipeline
  job_worker:
    worker_count: 20

# config/ductwork.background.yml
production:
  pipelines:
    - GenerateReportsPipeline
    - CleanupDataPipeline
  job_worker:
    worker_count: 5

Scale Across Machines

Run separate supervisors on different servers for horizontal scaling:

# Server 1 - Handle user-facing pipelines
bin/ductwork -c config/ductwork.user_facing.yml

# Server 2 - Handle batch processing pipelines
bin/ductwork -c config/ductwork.batch.yml

Benefits:

Fault isolation (one failing pipeline doesn’t affect others)
Resource allocation (dedicate CPU/memory to specific pipelines)
Independent scaling (scale critical pipelines without scaling everything)
Deployment flexibility (deploy changes to specific pipeline groups)

Considerations:

More operational complexity
Higher total resource usage (overhead per supervisor)
Need coordination for monitoring across instances

Process Management

Integrate Ductwork with your process manager:

systemd

[Unit]
Description=Ductwork Pipeline Supervisor
After=network.target postgresql.service

[Service]
Type=simple
User=deploy
WorkingDirectory=/var/www/myapp
ExecStart=/var/www/myapp/bin/ductwork -c config/ductwork.yml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Docker

# Dockerfile
CMD ["bin/ductwork", "-c", "config/ductwork.production.yml"]

Kubernetes

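apiVersion: apps/v1
kind: Deployment
metadata:
  name: ductwork
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label name used here is illustrative
  selector:
    matchLabels:
      app: ductwork
  template:
    metadata:
      labels:
        app: ductwork
    spec:
      containers:
      - name: ductwork
        image: myapp:latest
        command: ["bin/ductwork"]
        args: ["-c", "config/ductwork.yml"]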

The supervisor’s resilient design makes it suitable for containerized environments and orchestration platforms.