From Blocking to Background: Fixing Slow Approval Routes with a Queue
Business Impact
Improved async reliability for operational jobs and made user-level production issues easier to trace and debug.
The Context
Some internal operational flows did not belong inside the main request cycle.
The most visible example was an approval route that needed to send email after the action completed. When that email dispatch happened inline, the API response had to wait for mail processing before returning to the user. That made the route slower than it should have been and created unnecessary coupling between a critical user action and an unreliable external dependency.
Report generation had a similar shape. It was useful work, but not work that should delay a user-facing response.
At the same time, debugging production issues was harder than it should have been because logs were not structured around user activity in a way that made tracing errors practical.
The project needed two things:
- a more reliable async job model for background work
- better observability for debugging production incidents
The Challenge
The main issues were straightforward but important:
- Approval routes were slowed down by inline follow-up work
  Email sending happened before the API response returned, so route latency depended on mail processing and external delivery behavior.
- Operational failures needed clearer visibility
  When users hit errors, it was too easy to end up with logs that showed the exception but not enough request or user context to debug efficiently.
- Debugging needed to be faster in production
  Support and engineering workflows improve a lot when logs can be filtered by the logged-in user, affected request path, or job execution context.
The Solution
I built the internal service around NestJS, then separated background execution from the main application flow and improved observability across the stack.
1. BullMQ for Email and Reporting Jobs
I used BullMQ to move background work into queues for:
- email delivery
- report generation
- retryable async tasks
The main improvement was simple: after an approval action succeeded, the API could return immediately while email dispatch continued in the background.
That made the route faster and removed unnecessary waiting from the user path.
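Concretely, the handoff looks roughly like this. This is a minimal sketch, not the project's actual code: the queue is modeled as a small structural type covering the slice of the BullMQ `Queue.add` API the route needs, and the names (`completeApproval`, the `"approval-notification"` job name) are assumptions.

```typescript
// Minimal structural type for the piece of the BullMQ Queue API this
// sketch uses; in the real service this would be a bullmq Queue.
type EmailQueue = {
  add: (name: string, data: unknown) => Promise<unknown>;
};

// Hypothetical approval handler: persist the decision, enqueue the
// email, and return without waiting on mail processing.
export async function completeApproval(
  emailQueue: EmailQueue,
  approvalId: string,
  notifyAddress: string,
) {
  // ... persist the approval decision here ...

  // Hand the email off to the queue; the HTTP response no longer
  // depends on SMTP latency or external delivery behavior.
  await emailQueue.add("approval-notification", {
    approvalId,
    to: notifyAddress,
  });

  return { status: "approved", approvalId };
}
```

With BullMQ itself, `emailQueue` would be a `new Queue("email", { connection })`, and `add` resolves as soon as the job is persisted in Redis, not when the email is actually sent.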
This was also the point where a simple async function was not enough.
Plain async handling would remove some blocking, but it would still keep the task too close to the web process. That means failed work is easier to lose during restarts, retries need to be handled manually, and background execution is harder to inspect operationally.
BullMQ stood out here because it gave me the right operational behavior for this kind of work:
- retries for transient failures
- clear visibility into queued, active, failed, and completed jobs
- Redis-backed queues that were simple to run and reason about
- support for delayed or scheduled background work when needed
In other words:
- plain async helps with non-blocking execution
- BullMQ adds durability, retries, visibility, and control
The queue-based setup also made it easier to:
- retry failed jobs safely
- manage job execution separately from user requests
- keep reporting and notification workloads from degrading the main application experience
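As a sketch of what that control looks like in BullMQ terms: jobs carry options such as `attempts` and `backoff` (real BullMQ job options; the specific values below are illustrative assumptions), and a worker runs the processor outside the web process.

```typescript
// Retry/cleanup defaults for background jobs. "attempts", "backoff",
// "removeOnComplete", and "removeOnFail" are standard BullMQ job
// options; the values here are illustrative.
export const defaultJobOptions = {
  attempts: 3,                                   // retry transient failures
  backoff: { type: "exponential", delay: 5000 }, // wait 5s, 10s, 20s between tries
  removeOnComplete: { count: 1000 },             // keep recent history inspectable
  removeOnFail: false,                           // keep failed jobs visible for debugging
};

// Processor body a worker would run for each job. With bullmq this
// would be attached via: new Worker("email", processEmailJob, { connection }).
export async function processEmailJob(job: {
  name: string;
  data: { to: string; approvalId: string };
}) {
  // Throwing here marks the attempt as failed; BullMQ re-runs the job
  // according to "attempts" and "backoff" above.
  // await mailer.send(...)  // hypothetical mail client call
  return { delivered: true, to: job.data.to };
}
```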
2. Centralized Logging with Grafana, Loki, and Alloy
For observability, I set up logging using:
- Alloy to collect and forward logs
- Loki for centralized log storage
- Grafana for querying and inspection
This created a cleaner path from application logs to production debugging.
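A minimal Alloy pipeline for this shape looks roughly like the following. `local.file_match`, `loki.source.file`, and `loki.write` are real Alloy components; the log paths and the Loki URL are assumptions, not the project's actual configuration.

```
// Collect application log files and forward them to Loki.
local.file_match "app_logs" {
  path_targets = [{ __path__ = "/var/log/app/*.log" }]
}

loki.source.file "app_logs" {
  targets    = local.file_match.app_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```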
3. User-Aware Log Context
The most useful part of the setup was attaching enough application context to logs so debugging could start from the user, not just from the stack trace.
That included logging context such as:
- logged-in user information
- request-level context
- service and job execution metadata
With that in place, it became much easier to trace:
- which user hit an issue
- what path or action triggered it
- which background job or service path was involved
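In a NestJS service this typically means enriching each log line from middleware or an interceptor. A minimal sketch of the context object, with an assumed request shape (a `user` attached by an auth guard, an `x-request-id` header set upstream) and hypothetical field names:

```typescript
// Assumed request shape: an auth guard has attached `user`, and the
// ingress layer sets an x-request-id header.
type AuthedRequest = {
  method: string;
  url: string;
  headers: Record<string, string | undefined>;
  user?: { id: string; email: string };
};

// Build the context merged into every log line for this request (and,
// when relevant, for the background job it enqueued).
export function buildLogContext(
  req: AuthedRequest,
  job?: { queue: string; id: string },
) {
  return {
    userId: req.user?.id ?? "anonymous",
    userEmail: req.user?.email,
    requestId: req.headers["x-request-id"],
    method: req.method,
    path: req.url,
    ...(job ? { jobQueue: job.queue, jobId: job.id } : {}),
  };
}
```

With JSON logs shipped to Loki, an investigation can then start from the user in Grafana with a LogQL query along the lines of `{app="approvals"} | json | userId="u_123"` (the `app` label and field names are assumptions).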
That is a much better debugging model than searching raw logs with minimal context.
The Result
The final system improved both delivery reliability and operational visibility.
- Approval routes no longer waited on email delivery before responding
- Email and reporting jobs ran outside the main request path
- Async work became easier to retry and reason about
- Centralized logs made production debugging faster
- User-linked logging context made issue tracing more practical during investigation
The project reinforced a pattern I trust: background work should be queued, and observability should be designed around how engineers actually debug systems in production.