From Blocking to Background: Fixing Slow Approval Routes with a Queue
Business Impact
Improved async reliability for operational jobs and made user-level production issues easier to trace and debug.
The Context
Some internal operational flows did not belong inside the main request cycle.
The most visible example was an approval route that needed to send email after the action completed. When that email dispatch happened inline, the API response had to wait for mail processing before returning to the user. That made the route slower than it should have been and created unnecessary coupling between a critical user action and an unreliable external dependency.
Report generation had a similar shape. It was useful work, but not work that should delay a user-facing response.
At the same time, debugging production issues was harder than it should have been because logs were not structured around user activity in a way that made tracing errors practical.
The project needed two things:
- a more reliable async job model for background work
- better observability for debugging production incidents
The Challenge
The main issues were straightforward but important:
- Approval routes were slowed down by inline follow-up work
  Email sending happened before the API response returned, so route latency depended on mail processing and external delivery behavior.
- Operational failures needed clearer visibility
  When users hit errors, it was too easy to end up with logs that showed the exception but not enough request or user context to debug efficiently.
- Debugging needed to be faster in production
  Support and engineering workflows improve a lot when logs can be filtered by the logged-in user, affected request path, or job execution context.
The Solution
I built the internal service around NestJS, then separated background execution from the main application flow and improved observability across the stack.
1. BullMQ for Email and Reporting Jobs
I used BullMQ to move background work into queues for:
- email delivery
- report generation
- retryable async tasks
The main improvement was simple: after an approval action succeeded, the API could return immediately while email dispatch continued in the background.
That made the route faster and removed unnecessary waiting from the user path.
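Concretely, the handoff looks roughly like this. This is a minimal sketch, not the project's actual code: the queue is modeled as a small structural type covering the slice of the BullMQ `Queue.add` API the route needs, and the names (`completeApproval`, the `"approval-notification"` job name) are assumptions.

```typescript
// Minimal structural type for the piece of the BullMQ Queue API this
// sketch uses; in the real service this would be a bullmq Queue.
type EmailQueue = {
  add: (name: string, data: unknown) => Promise<unknown>;
};

// Hypothetical approval handler: persist the decision, enqueue the
// email, and return without waiting on mail processing.
export async function completeApproval(
  emailQueue: EmailQueue,
  approvalId: string,
  notifyAddress: string,
) {
  // ... persist the approval decision here ...

  // Hand the email off to the queue; the HTTP response no longer
  // depends on SMTP latency or external delivery behavior.
  await emailQueue.add("approval-notification", {
    approvalId,
    to: notifyAddress,
  });

  return { status: "approved", approvalId };
}
```

With BullMQ itself, `emailQueue` would be a `new Queue("email", { connection })`, and `add` resolves as soon as the job is persisted in Redis, not when the email is actually sent.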
This was also the point where a simple async function was not enough.
Plain async handling would remove some blocking, but it would still keep the task too close to the web process. That means failed work is easier to lose during restarts, retries need to be handled manually, and background execution is harder to inspect operationally.
BullMQ stood out here because it gave me the right operational behavior for this kind of work:
- retries for transient failures
- clear visibility into queued, active, failed, and completed jobs
- Redis-backed queues that were simple to run and reason about
- support for delayed or scheduled background work when needed
In other words:
- plain async helps with non-blocking execution
- BullMQ adds durability, retries, visibility, and control
The queue-based setup also made it easier to:
- retry failed jobs safely
- manage job execution separately from user requests
- keep reporting and notification workloads from degrading the main application experience
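As a sketch of what that control looks like in BullMQ terms: jobs carry options such as `attempts` and `backoff` (real BullMQ job options; the specific values below are illustrative assumptions), and a worker runs the processor outside the web process.

```typescript
// Retry/cleanup defaults for background jobs. "attempts", "backoff",
// "removeOnComplete", and "removeOnFail" are standard BullMQ job
// options; the values here are illustrative.
export const defaultJobOptions = {
  attempts: 3,                                   // retry transient failures
  backoff: { type: "exponential", delay: 5000 }, // wait 5s, 10s, 20s between tries
  removeOnComplete: { count: 1000 },             // keep recent history inspectable
  removeOnFail: false,                           // keep failed jobs visible for debugging
};

// Processor body a worker would run for each job. With bullmq this
// would be attached via: new Worker("email", processEmailJob, { connection }).
export async function processEmailJob(job: {
  name: string;
  data: { to: string; approvalId: string };
}) {
  // Throwing here marks the attempt as failed; BullMQ re-runs the job
  // according to "attempts" and "backoff" above.
  // await mailer.send(...)  // hypothetical mail client call
  return { delivered: true, to: job.data.to };
}
```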
2. Centralized Logging with Grafana, Loki, and Alloy
For observability, I set up logging using:
- Alloy to collect and forward logs
- Loki for centralized log storage
- Grafana for querying and inspection
This created a cleaner path from application logs to production debugging.
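A minimal Alloy pipeline for this shape looks roughly like the following. `local.file_match`, `loki.source.file`, and `loki.write` are real Alloy components; the log paths and the Loki URL are assumptions, not the project's actual configuration.

```
// Collect application log files and forward them to Loki.
local.file_match "app_logs" {
  path_targets = [{ __path__ = "/var/log/app/*.log" }]
}

loki.source.file "app_logs" {
  targets    = local.file_match.app_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```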
3. User-Aware Log Context
The most useful part of the setup was attaching enough application context to logs so debugging could start from the user, not just from the stack trace.
That included logging context such as:
- logged-in user information
- request-level context
- service and job execution metadata
With that in place, it became much easier to trace:
- which user hit an issue
- what path or action triggered it
- which background job or service path was involved
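In a NestJS service this typically means enriching each log line from middleware or an interceptor. A minimal sketch of the context object, with an assumed request shape (a `user` attached by an auth guard, an `x-request-id` header set upstream) and hypothetical field names:

```typescript
// Assumed request shape: an auth guard has attached `user`, and the
// ingress layer sets an x-request-id header.
type AuthedRequest = {
  method: string;
  url: string;
  headers: Record<string, string | undefined>;
  user?: { id: string; email: string };
};

// Build the context merged into every log line for this request (and,
// when relevant, for the background job it enqueued).
export function buildLogContext(
  req: AuthedRequest,
  job?: { queue: string; id: string },
) {
  return {
    userId: req.user?.id ?? "anonymous",
    userEmail: req.user?.email,
    requestId: req.headers["x-request-id"],
    method: req.method,
    path: req.url,
    ...(job ? { jobQueue: job.queue, jobId: job.id } : {}),
  };
}
```

With JSON logs shipped to Loki, an investigation can then start from the user in Grafana with a LogQL query along the lines of `{app="approvals"} | json | userId="u_123"` (the `app` label and field names are assumptions).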
That is a much better debugging model than searching raw logs with minimal context.
The Result
The final system improved both delivery reliability and operational visibility.
- Approval routes no longer waited on email delivery before responding
- Email and reporting jobs ran outside the main request path
- Async work became easier to retry and reason about
- Centralized logs made production debugging faster
- User-linked logging context made issue tracing more practical during investigation
The project reinforced a pattern I trust: background work should be queued, and observability should be designed around how engineers actually debug systems in production.