Background Jobs

Automation, Workers, and Event-Driven Workflows


Not everything needs to happen immediately. Sending a welcome email can wait a few seconds. Generating a report can happen in the background. Syncing data with an external system doesn't need to block the user's request.

Background jobs move work out of the request-response cycle into a separate processing queue. This keeps your application responsive while handling heavy or slow operations asynchronously. The implementation seems straightforward until you encounter your first production failure at 3am, or discover that your email provider rate-limited you and 10,000 welcome emails never sent.


The Constraint: Why Synchronous Processing Fails

The standard web request lifecycle assumes operations complete quickly. PHP's default execution timeout is 30 seconds. Most load balancers time out at 60 seconds. Users expect pages to load in under 3 seconds. When you try to do heavy work inside a request, everything breaks.

Consider a real scenario: a user uploads a CSV containing 50,000 product records for import. Processing each row requires validation, database lookups for duplicate detection, image downloads from external URLs, and writes to multiple tables. Even at 50ms per row, that's 41 minutes of processing. The request times out. The browser shows an error. The user has no idea if anything imported or not.

The synchronous trap: HTTP requests are designed for fast responses. Trying to do slow work inside them leads to timeouts, memory exhaustion, and users staring at spinners until their browser gives up.

The constraints multiply when you consider external dependencies. Sending email through an SMTP server can take 2-5 seconds per message. Calling a payment API might take 10 seconds during peak load. Generating a PDF with complex charts might consume 500MB of memory. None of this belongs in a web request. This is one of the key considerations when scaling without chaos.

Background job processing exists to handle work that is slow, unreliable, resource-intensive, or time-insensitive. The user gets an immediate acknowledgement ("Your import is processing"), and the actual work happens separately, with proper handling for failures, retries, and resource management.


The Naive Approach: dispatch() and Hope

The tutorial version of background jobs is deceptively simple. Laravel's documentation shows a clean example: create a job class, call dispatch(), done. The job runs in the background. It works perfectly in development with the sync driver.

Then production happens.

Jobs fail silently

A job throws an exception. It retries three times. It fails permanently. Nobody notices for three days until a customer asks why their invoice never arrived. The failed_jobs table has 2,000 rows nobody has looked at.

Queues back up invisibly

A bulk email campaign dispatches 50,000 jobs. The single worker processes 10 per second. At that rate, the queue takes 83 minutes to drain. Meanwhile, password reset emails sit behind marketing emails. Users think the system is broken.

Duplicate operations on retry

A job sends an email, then calls an external API. The API call fails. The job retries. The email sends again. The customer receives two identical invoices. Or ten, if the API keeps failing.

Memory leaks crash workers

A job processes images, loading them into memory. After processing 500 jobs, the worker has consumed 2GB of RAM. It crashes. The supervisor restarts it. It crashes again. The queue backs up while workers cycle endlessly.

The naive approach treats job dispatch as "fire and forget". In reality, background jobs require the same (or more) attention to failure handling, observability, and operational concerns as your main application code. A job that fails in the background is often worse than code that fails in a request, because at least a request failure produces an error the user sees.


The Robust Pattern: Queues as Production Infrastructure

A production-grade job processing system treats the queue as critical infrastructure, not an afterthought. This means explicit queue topology, idempotent job design, comprehensive monitoring, and failure handling strategies that match business requirements.

Queue Topology

Not all jobs are equal. A password reset email must send within seconds. A weekly report can wait hours. A bulk data import can run overnight. Putting these on the same queue means urgent work waits behind bulk work.

Queue separation principle: Separate queues by latency requirements and resource consumption. High-priority, low-latency work should never share a queue with bulk processing.

A typical production configuration uses three to five queues.

| Queue   | Purpose                                                  | Worker Config                          |
|---------|----------------------------------------------------------|----------------------------------------|
| high    | Password resets, payment confirmations, security alerts  | 2+ workers, --tries=1, --timeout=30    |
| default | Welcome emails, notifications, standard webhooks         | 4+ workers, --tries=3, --timeout=60    |
| low     | Analytics processing, report generation, data sync       | 2 workers, --tries=3, --timeout=300    |
| bulk    | Mass emails, large imports, batch operations             | 1-2 workers, --tries=3, --timeout=900  |
| long    | Video transcoding, large file processing, ML inference   | 1 worker, --tries=1, --timeout=3600    |

Workers are assigned to queues explicitly. High-priority workers process only the high queue. Default workers process high and default (high first). This ensures a bulk email campaign never delays a password reset.
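As a concrete sketch, that assignment maps onto Laravel's queue:work command. The connection name and flag values here are illustrative, matching the table above:

```bash
# High-priority workers: only the high queue, fail fast.
php artisan queue:work redis --queue=high --tries=1 --timeout=30

# Default workers: drain high first, then default.
php artisan queue:work redis --queue=high,default --tries=3 --timeout=60

# Bulk worker: isolated so imports never block transactional work.
php artisan queue:work redis --queue=bulk --tries=3 --timeout=900
```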

Idempotent Job Design

Jobs run "at least once", not "exactly once". Network failures, worker crashes, and deployment restarts all cause jobs to run multiple times. If your job sends an email without checking whether it already sent, users receive duplicate messages.

Every job that performs a side effect must be idempotent: running it twice produces the same result as running it once.

Pattern: Idempotency keys

Generate a unique key for each operation. Before performing the operation, check if the key exists. If it does, skip. If not, perform the operation and record the key.

Store keys in Redis with TTL, or a dedicated database table. Key format: operation:entity:timestamp (e.g., invoice_email:order_123:2024-01-15).
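A minimal sketch of the key check inside a Laravel job. The SendInvoiceEmail job, Order model, and InvoiceMail mailable are hypothetical names; Cache::add() only writes when the key is absent, which is atomic on Redis:

```php
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Mail;

class SendInvoiceEmail implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue;

    public function __construct(public int $orderId) {}

    public function handle(): void
    {
        // Key format from above: operation:entity.
        $key = "invoice_email:order_{$this->orderId}";

        // Cache::add() writes only if the key is absent (atomic on Redis),
        // so a retried job sees the existing key and skips the send.
        if (! Cache::add($key, true, now()->addDays(7))) {
            return;
        }

        $order = Order::findOrFail($this->orderId); // hypothetical model
        Mail::to($order->customer_email)->send(new InvoiceMail($order));
    }
}
```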

Pattern: State checking

Before performing work, check if the work is still needed. Before sending an invoice email, check if the invoice status is still "pending_notification". If it's already "notified", skip.

This requires that jobs update state atomically after completing work. Use database transactions where possible.
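Sketched with a hypothetical Invoice model: lock the row, check the state, do the work, and flip the state in one transaction so concurrent attempts serialise:

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Mail;

public function handle(): void
{
    DB::transaction(function () {
        // lockForUpdate() blocks a concurrent retry until this attempt
        // commits, so only one attempt sees 'pending_notification'.
        $invoice = Invoice::lockForUpdate()->findOrFail($this->invoiceId);

        if ($invoice->status !== 'pending_notification') {
            return; // already handled by an earlier attempt
        }

        Mail::to($invoice->customer_email)->send(new InvoiceMail($invoice));

        $invoice->update(['status' => 'notified']);
    });
}
```

Note the SMTP call runs while the row lock is held, which is exactly what serialises concurrent attempts. For very slow side effects, a variation is to claim the row first (set a 'notifying' state), commit, then perform the work.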

Idempotency applies to the entire job, including all side effects. If a job sends an email, updates a database, and calls an external API, the retry must not re-send the email or re-call the API if those already succeeded.

Failure Handling Strategy

Jobs fail for different reasons, and different reasons require different responses.

| Failure Type | Example                                              | Response                                    |
|--------------|------------------------------------------------------|---------------------------------------------|
| Transient    | Network timeout, rate limit, temporary API outage    | Retry with exponential backoff              |
| Permanent    | Invalid email address, deleted record, bad data      | Fail immediately, log for review            |
| Resource     | Out of memory, disk full, connection pool exhausted  | Release job back to queue, alert operations |
| Dependency   | External service down, API returning 500s            | Delay retry, activate circuit breaker       |

Jobs should catch specific exceptions and respond appropriately. A validation exception should not retry. A connection exception should retry with backoff. An out-of-memory error should release the job and alert.
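A sketch of that decision table inside handle(), using Laravel's built-in exception types; pushToCrm() is a hypothetical helper:

```php
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Validation\ValidationException;

public function handle(): void
{
    try {
        $this->pushToCrm(); // hypothetical external call
    } catch (ValidationException $e) {
        // Permanent: bad data will not improve on retry.
        $this->fail($e);
    } catch (ConnectionException $e) {
        // Transient: put the job back with a growing delay.
        $this->release(30 * $this->attempts());
    }
    // Anything else bubbles up and counts against --tries.
}
```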

Dead letter queues: After maximum retries, failed jobs go to a dead letter queue (Laravel's failed_jobs table). This is not a black hole. Production systems need daily review of failed jobs, automated categorisation of failure reasons, and processes to retry or discard after investigation.


Common Job Patterns

Certain patterns appear across most applications. Understanding these patterns and their failure modes saves debugging time later.

Email Dispatch

The most common background job. Appears simple until you consider: SMTP server rate limits, bounce handling, duplicate prevention, and send tracking.

Implementation checklist
  • Idempotency: Track sent emails by recipient + email type + entity ID. Check before sending.
  • Rate limiting: Most SMTP providers limit sends per second. Use rate limiters on the job or queue level.
  • Bounce handling: Subscribe to bounce webhooks. Update email status. Prevent re-sending to bounced addresses.
  • Retry strategy: SMTP failures are usually transient. Retry 3-5 times with exponential backoff.
  • Payload: Store email ID or entity ID, not rendered content. Render at send time for fresh data.
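The rate-limiting item above maps onto Laravel's RateLimited job middleware. A sketch, with the 'smtp' limiter name, the per-minute figure, and the SendCampaignEmail class as illustrative assumptions:

```php
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\RateLimiter;

// In a service provider's boot(): define the limiter once, matched
// to the SMTP provider's allowance.
RateLimiter::for('smtp', fn () => Limit::perMinute(300));

// In the email job: when the limit is hit, the middleware releases
// the job back onto the queue instead of failing it.
class SendCampaignEmail implements ShouldQueue
{
    public function middleware(): array
    {
        return [new RateLimited('smtp')];
    }

    public function handle(): void
    {
        // idempotency check + send, as described above
    }
}
```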

Data Import

CSV or Excel imports with thousands of rows. The naive approach processes all rows in one job. When the job fails at row 5,000, you start over.

Implementation checklist
  • Chunking: Parent job reads file, dispatches child jobs for chunks of 100-500 rows.
  • Progress tracking: Store import status in database. Update as chunks complete. Show progress to user.
  • Error collection: Child jobs report errors back to parent. Collect all errors, don't fail on first.
  • Rollback strategy: Define what happens if 10% of rows fail. Rollback all? Keep good rows? User choice?
  • File storage: Store uploaded file in permanent storage (S3) before processing. Jobs reference storage path, not local temp file.
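A parent-job sketch of the chunking and progress items. The Import model, ImportProductChunk job, and CsvReader helper are hypothetical:

```php
use Illuminate\Support\Facades\Storage;

public function handle(): void
{
    // File was stored on S3 at dispatch time; read it from there.
    $rows = CsvReader::rows(Storage::get($this->path)); // hypothetical reader

    $chunks = array_chunk($rows, 250);

    // Record the expected chunk count so the UI can show progress.
    Import::whereKey($this->importId)->update(['total_chunks' => count($chunks)]);

    foreach ($chunks as $index => $chunk) {
        ImportProductChunk::dispatch($this->importId, $index, $chunk)
            ->onQueue('bulk');
    }
}
```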

Report Generation

Building PDFs or Excel files from large datasets. Memory-intensive, slow, and prone to timeouts.

Implementation checklist
  • Streaming: Use streaming writes (Laravel Excel's FromQuery with chunk). Never load 100,000 rows into memory.
  • Temporary storage: Write to temporary file, then move to permanent storage. User downloads from storage URL.
  • Notification: Email user when report is ready. Include download link with expiry.
  • Timeout: Set generous timeout (15-30 minutes for large reports). Monitor duration.
  • Caching: If the same report is requested multiple times, serve cached version. Invalidate on data change.

External API Sync

Pushing data to or pulling from external systems. CRM updates, payment gateway calls, inventory sync. The external system is outside your control. See our API integrations guide for handling unreliable external services.

Implementation checklist
  • Circuit breaker: If the external API returns 5 consecutive 500 errors, stop trying. Alert operations. Resume after cooldown.
  • Rate limiting: Respect API rate limits. Use Redis-based rate limiter. Queue jobs when limit exceeded.
  • Retry with backoff: 1 second, 5 seconds, 30 seconds, 2 minutes, 10 minutes. Stop after 5 attempts.
  • Payload freshness: Store entity ID, not data snapshot. Fetch current data at execution time. Avoid sync conflicts.
  • Conflict resolution: Define what happens when external system has newer data. Last write wins? Merge? Alert?
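A minimal cache-backed circuit breaker matching the thresholds above; the crm:* keys and pushToCrm() helper are hypothetical:

```php
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Support\Facades\Cache;

public function handle(): void
{
    if (Cache::get('crm:circuit_open')) {
        $this->release(300); // breaker open: check again after the cooldown
        return;
    }

    try {
        $this->pushToCrm(); // hypothetical API call
        Cache::forget('crm:failures');
    } catch (ConnectionException $e) {
        if (Cache::increment('crm:failures') >= 5) {
            // Five consecutive failures: open the breaker, stop hammering.
            Cache::put('crm:circuit_open', true, now()->addMinutes(10));
            Cache::forget('crm:failures');
        }

        $this->release(60);
    }
}
```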

Webhook Processing

Receiving webhooks from external services (Stripe, payment providers, shipping services). The external service expects a fast 200 response. Processing might take minutes.

Implementation checklist
  • Immediate ack: Controller stores raw webhook payload to database, dispatches job, returns 200. Total time: under 500ms.
  • Signature verification: Verify webhook signature before storing. Reject invalid webhooks at controller level.
  • Deduplication: External services retry webhooks. Store webhook ID. Skip processing if already processed.
  • Ordering: Webhooks can arrive out of order. Check event timestamps. Handle "order cancelled" arriving before "order created".
  • Replay capability: Store raw payload permanently. Enable manual replay if processing logic changes.
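A controller sketch covering the ack, verification, and deduplication items. WebhookSignature, WebhookEvent, and ProcessWebhookEvent are hypothetical names:

```php
use Illuminate\Http\Request;

class WebhookController
{
    public function __invoke(Request $request)
    {
        // Reject invalid signatures before storing anything.
        abort_unless(WebhookSignature::valid($request), 400); // hypothetical check

        // Deduplicate on the provider's event ID.
        $event = WebhookEvent::firstOrCreate(
            ['external_id' => $request->input('id')],
            ['payload' => $request->getContent()]
        );

        // Only dispatch processing the first time we see this event.
        if ($event->wasRecentlyCreated) {
            ProcessWebhookEvent::dispatch($event->id);
        }

        // Fast 200 ack; the heavy work happens in the job.
        return response()->json(['received' => true]);
    }
}
```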

Job Prioritisation and Scheduling

Beyond queue separation, individual jobs sometimes need priority within a queue, and many applications need scheduled jobs that run at specific times.

Priority Within Queues

Some queue backends support per-job priority natively: Beanstalkd, for example, gives every job a numeric priority, so a priority 1 job runs before a priority 10 job even if it was dispatched later. With Redis-backed queues (including Laravel Horizon), priority is modelled through queue ordering, with workers polling urgent queues first.

Priority use cases: Premium customer emails before free tier. Paid account sync before trial accounts. Single-item orders before bulk orders. The business defines priority; the queue respects it.

Priority within a queue is not a substitute for queue separation. Use separate queues for fundamentally different workloads (transactional email vs bulk email). Use priority for ordering within similar workloads (premium customers first).

Scheduled Jobs

Many applications need jobs that run at specific times: daily reports at 6am, weekly summaries on Monday, monthly billing on the 1st, hourly cache warming.

Laravel's scheduler handles this through a single cron entry that runs every minute. The scheduler checks which jobs are due and dispatches them.

Overlap prevention

A daily report takes 45 minutes. If the scheduler runs at 6am and the job isn't finished by 7am, should a second instance start? Usually not. Use withoutOverlapping() to prevent concurrent runs of the same scheduled job.
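In scheduler terms (classic Console Kernel style), assuming a hypothetical GenerateDailyReport job:

```php
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // One run at a time: if yesterday's report is still generating at
    // 06:00, skip this run rather than starting a second instance.
    $schedule->job(new GenerateDailyReport, 'low')
        ->dailyAt('06:00')
        ->withoutOverlapping();
}
```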

Timezone handling

A daily summary should send at 9am in the user's timezone, not the server's. Store user timezones. Schedule per-timezone or use delayed dispatch to send at the right local time.

Run verification

Did the scheduled job actually run? Log scheduled job execution. Alert if expected jobs don't run. Monitor last successful run timestamp. A scheduled job that silently stops running is worse than one that fails loudly.

Catch-up behaviour

If the server was down during scheduled run time, should the job run immediately when the server comes back? Sometimes yes (missed billing run). Sometimes no (sending "good morning" at 3pm is worse than not sending).

Delayed Dispatch

Jobs can be dispatched with a delay: "send this email in 30 minutes", "process this refund in 24 hours", "check if user completed onboarding in 7 days".

Delayed jobs remain in the queue until their dispatch time. This has implications for queue size and for deployments. If you deploy code changes, jobs dispatched before the deployment run with the old code until they execute.
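Delayed dispatch itself is a one-liner; the job names here are hypothetical:

```php
// Stays on the queue until the delay elapses, then processes normally.
SendFollowUpEmail::dispatch($order->id)->delay(now()->addMinutes(30));
CheckOnboardingProgress::dispatch($user->id)->delay(now()->addDays(7));
```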


Monitoring and Observability

A queue without monitoring is a black box. Jobs fail, queues back up, workers crash, and nobody knows until users complain. Production queue systems need comprehensive observability.

Queue Health Metrics

The essential metrics for any queue system.

| Metric          | What It Measures                       | Alert Threshold (Example)            |
|-----------------|----------------------------------------|--------------------------------------|
| Queue depth     | Jobs waiting to be processed           | High queue > 10, default queue > 100 |
| Wait time       | Time from dispatch to processing start | High queue > 30s, default > 5min     |
| Processing time | Time from start to completion          | Job-specific, flag outliers          |
| Throughput      | Jobs processed per minute              | Below baseline for time of day       |
| Failure rate    | Percentage of jobs failing             | > 1% on any queue                    |
| Worker count    | Active workers per queue               | Below expected for deployment        |

Laravel Horizon provides these metrics out of the box for Redis queues. For database queues, you'll need custom instrumentation or tools like Laravel Telescope in production mode.

Job-Level Observability

Beyond queue metrics, individual jobs need tracing.

What to log for each job
  • Job ID: Unique identifier for this job instance
  • Job class: Which job type is running
  • Queue: Which queue it ran on
  • Payload summary: Key identifiers (user ID, order ID) without sensitive data
  • Dispatch time: When the job was created
  • Start time: When processing began
  • End time: When processing completed (success or failure)
  • Outcome: Success, failure, or retry
  • Attempt number: Which retry is this?
  • Error message: If failed, the exception message and stack trace

This data enables debugging ("why did this user's invoice not send?") and analysis ("which job types fail most often?"). Store it in your logging system (ELK, Datadog, CloudWatch) with appropriate retention.

Failed Job Management

Failed jobs need a process, not just a table.

Daily review

Someone checks the failed_jobs table daily. Categorise failures: bad data (fix data, retry), bug (fix code, retry), external system (wait, retry), permanent (discard). Track failure categories over time.

Automated triage

Parse exception types. Auto-retry transient failures after a delay. Auto-discard known-permanent failures. Alert on new exception types that need investigation.

Alerting Strategy

Not every queue metric needs an alert. Alert on conditions that require human intervention.

Alert: High-priority queue depth exceeds threshold for more than 2 minutes. Password resets are delayed.
Alert: Worker count drops to zero on any production queue. Jobs are not processing.
Alert: Failure rate exceeds 5% over 15 minutes. Something is systematically broken.
Don't alert: Individual job failure. Log it, review daily, but don't page someone at 2am.
Don't alert: Bulk queue depth during expected bulk operations. Bulk jobs are supposed to queue up.

Scaling Workers and Performance

When queue depth grows faster than workers can process, you need more capacity. But scaling workers isn't always the right answer.

Horizontal vs Vertical Scaling

Adding more workers (horizontal) helps when you're CPU-bound or I/O-bound on parallelisable work. But some bottlenecks don't yield to more workers.

More workers help when
  • Jobs are I/O bound (waiting for APIs, email servers)
  • Jobs are CPU bound and workers are at capacity
  • Work is parallelisable (independent jobs)
  • External rate limits haven't been reached
More workers don't help when
  • Jobs are hitting database locks (serialisation required)
  • External API rate limits are the bottleneck
  • Memory is the constraint (each worker needs RAM)
  • Jobs need ordering (can't parallelise)

Worker Configuration

Worker configuration affects throughput, resource usage, and reliability.

Key worker settings
  • --sleep: How long to wait when the queue is empty. Lower values (1s) mean faster pickup. Higher values (5s) reduce Redis/database load.
  • --timeout: Maximum time a job can run. Set per-queue based on expected job duration. Workers are killed after timeout.
  • --tries: Maximum retry attempts before moving to failed_jobs. Set based on job type and failure characteristics.
  • --memory: Maximum memory before worker restarts. Prevents memory leaks from accumulating. 128-256MB is typical.
  • --max-jobs: Restart worker after processing N jobs. Cleans up any accumulated state. 500-1000 is typical.
  • --max-time: Restart worker after N seconds. Ensures workers pick up code changes. 3600 (1 hour) is typical.
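These flags usually live in a Supervisor program definition. A sketch for the default tier; paths, user, and process count are illustrative:

```ini
; /etc/supervisor/conf.d/worker-default.conf
[program:worker-default]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/app/artisan queue:work redis --queue=high,default --tries=3 --timeout=60 --memory=256 --max-jobs=1000 --max-time=3600
numprocs=4
autostart=true
autorestart=true
stopwaitsecs=90   ; longer than --timeout, so jobs finish on SIGTERM
user=www-data
```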

Database Connection Management

Long-running workers hold database connections. With 20 workers, that's 20 persistent connections. If each job also opens connections for queries, connection pools exhaust quickly.

Connection hygiene: Workers should release connections after each job. In Laravel, use --max-jobs to restart workers periodically. For persistent connections, consider connection pooling (PgBouncer for PostgreSQL) between workers and the database.

Memory Management

PHP's memory model means memory allocated during job processing isn't always released. Image processing, PDF generation, and large data operations can accumulate memory across jobs.

The standard mitigation: restart workers after processing a fixed number of jobs or after a memory threshold. This clears accumulated memory without losing work (the current job completes before restart).


Event-Driven Workflows

Background jobs often form parts of larger workflows. A single business event triggers multiple jobs, some running in parallel, some in sequence, some conditionally.

Event-Based Dispatch

Instead of dispatching jobs directly from controllers, dispatch events. Listeners attached to events dispatch their jobs. This decouples the event source from the actions it triggers.

Example: Order placed

Controller dispatches OrderPlaced event. Listeners respond independently:

  • SendOrderConfirmation: Dispatches email job to high queue
  • ReserveInventory: Dispatches inventory job to default queue
  • NotifyWarehouse: Dispatches notification job to default queue
  • UpdateAnalytics: Dispatches analytics job to low queue
  • TriggerWebhooks: Dispatches webhook jobs to bulk queue

Adding a new action (e.g., notify Slack) means adding a new listener. No controller changes required.

Job Chains

Some workflows require sequential execution. Process file, then validate contents, then import records, then send confirmation. Each step depends on the previous.

Job chaining handles this natively in Laravel. The chain only continues if each job succeeds. If any job fails, the chain stops (with configurable behaviour for partial completion).
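A sketch of that pipeline as a chain; the job names are hypothetical:

```php
use Illuminate\Support\Facades\Bus;
use Illuminate\Support\Facades\Log;

Bus::chain([
    new ProcessUploadedFile($importId),
    new ValidateImportContents($importId),
    new ImportRecords($importId),
    new SendImportConfirmation($importId),
])
// If any link fails, the rest of the chain never runs.
->catch(fn (\Throwable $e) => Log::warning('Import chain stopped', ['error' => $e->getMessage()]))
->onQueue('bulk')
->dispatch();
```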

Job Batches

Some workflows require parallel execution with aggregation. Import 1,000 products, then send a summary email when all complete. Process 50 images, then mark the gallery as ready.

Job batches track completion of multiple jobs dispatched together. Define callbacks for: all succeeded, any failed, all completed (regardless of outcome). The batch tracks progress and provides completion percentage.
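A batch sketch for the import example. Jobs in a batch must use the Batchable trait and the job_batches table must exist; SendImportSummary is a hypothetical job:

```php
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Illuminate\Support\Facades\Log;
use Throwable;

$batch = Bus::batch($chunkJobs) // e.g. one ImportProductChunk per chunk
    ->then(fn (Batch $batch) => SendImportSummary::dispatch($batch->id))
    ->catch(fn (Batch $batch, Throwable $e) => Log::error('Chunk failed', ['batch' => $batch->id]))
    ->finally(fn (Batch $batch) => Log::info('Import finished', ['progress' => $batch->progress()]))
    ->onQueue('bulk')
    ->dispatch();

// $batch->progress() gives a completion percentage for UI polling.
```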

Saga Pattern

Complex workflows that span multiple services need coordination and compensation. If step 4 fails, steps 1-3 might need to be undone. For more complex orchestration patterns, see our workflow engines page.

When to use sagas: Multi-service transactions (reserve inventory, charge payment, allocate shipping), long-running processes with human steps, workflows where partial completion is worse than no completion. For simpler workflows, job chains are sufficient.


Deployment and Operations

Deploying applications with background workers requires coordination. Workers process jobs with old code while the deployment installs new code.

Zero-Downtime Deployment

The standard approach: signal workers to terminate gracefully after finishing their current job, deploy new code, start new workers.

1. Send SIGTERM to worker processes. Workers finish their current job, then exit.

2. Wait for all workers to terminate. Time out after a reasonable period (job timeout plus a buffer).

3. Deploy new code. Clear caches. Run migrations if needed.

4. Start new workers. Supervisor ensures the correct number restart.

For Laravel Horizon, the horizon:terminate command handles graceful shutdown. For raw workers, supervisorctl stop sends SIGTERM.

Job Serialisation Across Deployments

A job dispatched before deployment runs after deployment. If the job class changed (different constructor signature, renamed class, moved namespace), deserialisation fails.

Safe: Adding new optional constructor parameters with defaults.
Safe: Changing job logic (the payload deserialises, execution differs).
Unsafe: Removing constructor parameters that are in serialised payloads.
Unsafe: Changing class namespace without aliasing.
Unsafe: Renaming job classes without migration strategy.

For breaking changes, drain the queue before deployment (let existing jobs finish, stop dispatching new ones), or version your job classes.


Infrastructure Choices

The choice of queue backend affects performance, reliability, and operational complexity.

Queue Backends

| Backend    | Best For                                                                                  | Trade-offs                                                                         |
|------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|
| Redis      | Most production applications. Fast, supports priorities, integrates with Horizon.          | Requires Redis infrastructure. Memory-based (data loss risk without persistence).   |
| Database   | Simpler deployments. No additional infrastructure. Works with existing database backups.   | Slower polling. Table locks under high load. No priority support.                   |
| Amazon SQS | AWS-native applications. Managed service. High availability.                               | No priority support. 256KB payload limit. Delayed delivery capped at 15 minutes.    |
| Beanstalkd | Simple queue requirements. Low overhead.                                                   | Less ecosystem tooling. Fewer operational features.                                 |

For most Laravel applications, Redis with Horizon provides the best balance of features, performance, and observability. Database queues are acceptable for low-volume applications or simpler infrastructure requirements. SQS makes sense when the rest of the stack is AWS-native.


Common Pitfalls

These issues appear repeatedly across projects. Knowing them in advance saves debugging time.

Too much in the payload

Storing entire Eloquent models or large objects in job payloads. Leads to serialisation issues, stale data (the model changed after dispatch), and excessive queue memory. Store IDs; fetch fresh data at execution time.

No idempotency

Jobs that assume exactly-once execution. Leads to duplicate emails, double charges, repeated API calls on retry. Every job that performs side effects must be idempotent.

Blocking queues

One slow job type blocking all other jobs. A 30-minute report generation blocks password reset emails. Use separate queues for different latency requirements.

No monitoring

Queues backing up without alerting. Problems discovered only when users complain. Production queues need dashboards, metrics, and alerts.

Silent failures

Jobs that catch all exceptions and return success. The job "succeeds" but the work wasn't done. Only catch specific exceptions you can handle. Let others bubble up.

Ignoring failed_jobs

The failed_jobs table grows to 50,000 rows. Nobody looks at it. Permanent failures, bugs, and transient issues all mixed together. Failed jobs need a daily review process.


The Business Link

Background job systems require upfront investment: queue infrastructure, monitoring, failure handling, operational processes. The investment pays off in several ways.

  • User experience that doesn't degrade under load Pages respond instantly. Heavy operations happen invisibly. Users don't wait for email servers or report generation.
  • Reliability through retry and recovery Transient failures don't become user-visible errors. External API outages don't lose data. Work gets done eventually, automatically.
  • Scalability without architectural changes Need more email sending capacity? Add workers. Peak load causing delays? Scale workers horizontally. The architecture doesn't change.
  • Visibility into what the system is doing Queue dashboards show work in progress. Metrics show throughput and latency. Alerts notify before users complain.
  • Zero-downtime deployments for background work Deploy new code without losing jobs. Workers restart gracefully. Queued work survives infrastructure changes.

The alternative is worse: slow pages that timeout, lost emails that nobody knows about, errors that require manual intervention, and scaling that requires rewriting the application. Doing job processing properly upfront avoids these costs later.


Further Reading

  • Laravel Horizon - Official queue dashboard and monitoring for Redis-backed queues.
  • Laravel Queues - Core documentation for job dispatch, processing, and failure handling.
  • Redis Persistence - Understanding queue durability with RDB and AOF persistence options.

Build Your Job Processing System

We implement background job processing that handles work asynchronously with proper failure handling, monitoring, and operational processes. Emails, file processing, data syncing, heavy calculations: moved out of the request cycle. Reliable, monitorable, scalable.

Let's talk about background processing →