Published June 4, 2026 in Tech

Building observability into Notion’s dead-letter queue

By Maya Lekhi

Engineering, Notion

There’s a background task behind almost every action in Notion. Sending a notification when someone @-mentions you, indexing new pages so they show up in search, generating embeddings for AI, syncing a calendar event, exporting a workspace—all of these run through Notion’s job queue.

The queue retries tasks that fail, but some tasks exhaust their retries and end up in a dead-letter queue (DLQ): a holding area for failed work that needs investigation before it can be recovered. A dead task often represents something a user is waiting on, so being able to inspect and replay these tasks quickly is directly tied to the reliability of the product.

For a long time, the only way to interact with the DLQ was through the AWS CLI with direct S3 access. Debugging a failure meant having the right credentials, knowing which S3 bucket to look in, and constructing queries by hand. The process was painful enough that many failures were never thoroughly investigated. Beyond the friction, broad direct S3 access is something we’ve been actively working to restrict.

To manage this, we built the DLQ Explorer. The DLQ Explorer is part of an effort to give engineers a controlled interface to data they need without requiring raw bucket access. This post covers the infrastructure behind it and the design decisions that shaped it. At a high level, the goal was to make dead tasks as easy to inspect and recover as any other production surface, without introducing new data pipelines or operational overhead.

Handling failed tasks

A diagram demonstrating the task lifecycle, from enqueued through completed or dead task

Notion’s job queue distributes work across the infrastructure in a way that prevents any single workspace or team from monopolizing workers. When a task exhausts its retries, it is written to a per-cell S3 bucket as a DeadTask record. Each record contains everything you’d need to understand what happened:

The event name and task group
The workspace ID and actor
The full task payload as JSON
The target cell
The failure category
The specific error, including stack trace
When the task was enqueued, how many times it was attempted, and when it landed in the DLQ
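
To make the record shape concrete, here is a minimal sketch of how a DeadTask record might be modeled and parsed. The field names, the JSON encoding, and the parse_dead_task helper are assumptions for illustration; the post only lists what a record contains, not its schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class DeadTask:
    """Hypothetical shape of a DeadTask record (field names are assumed)."""
    event_name: str
    task_group: str
    workspace_id: str
    actor: str
    payload: dict[str, Any]   # full task payload
    target_cell: str
    failure_category: str
    error: str                # specific error, including stack trace
    enqueued_at: datetime
    attempts: int
    dead_at: datetime         # when the task landed in the DLQ


def parse_dead_task(raw: str) -> DeadTask:
    """Parse one JSON-encoded record into a DeadTask."""
    d = json.loads(raw)
    return DeadTask(
        event_name=d["event_name"],
        task_group=d["task_group"],
        workspace_id=d["workspace_id"],
        actor=d["actor"],
        payload=d["payload"],
        target_cell=d["target_cell"],
        failure_category=d["failure_category"],
        error=d["error"],
        enqueued_at=datetime.fromisoformat(d["enqueued_at"]),
        attempts=int(d["attempts"]),
        dead_at=datetime.fromisoformat(d["dead_at"]),
    )
```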

Building the data layer

The first decision was where to put the query infrastructure. Dead-task records are written to S3 naturally as part of the queue’s existing failure handling. What we needed was a way to run structured queries over that data without moving it or replicating it.

We used Athena with partition projection over the existing S3 buckets. Each cell gets its own external table (dead_tasks_{cellId}) in a per-region Glue database, with partition projection on date and hour columns. This means queries can prune partitions before scanning data, which keeps costs predictable even as the volume of dead tasks grows. This avoided introducing a secondary indexing system or ETL pipeline, which would have added both latency and operational complexity.
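
As a rough illustration of the table setup described above, the sketch below generates Athena DDL for one cell with partition projection enabled. The column list, the JSON SerDe, the bucket layout (date/hour prefixes), and the start date are all assumptions; only the dead_tasks_{cellId} naming and the date and hour partition columns come from the design itself. The projection.* and storage.location.template keys are standard Athena table properties.

```python
def dead_tasks_ddl(cell_id: str, database: str, bucket: str,
                   start_date: str = "2024-01-01") -> str:
    """Build CREATE TABLE DDL for one cell's dead-task table.

    Partition values are computed by Athena from the projection settings,
    so no partitions need to be registered in the Glue catalog.
    """
    table = f"dead_tasks_{cell_id}"
    location = f"s3://{bucket}/"  # assumed layout: <bucket>/<date>/<hour>/...
    return f"""
CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (
  event_name string,
  task_group string,
  workspace_id string,
  actor string,
  payload string,
  target_cell string,
  failure_category string,
  reason string,
  error string,
  enqueued_at string,
  attempts int,
  dead_at string
)
PARTITIONED BY (`date` string, `hour` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '{location}'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.date.type' = 'date',
  'projection.date.format' = 'yyyy-MM-dd',
  'projection.date.range' = '{start_date},NOW',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = '{location}${{date}}/${{hour}}/'
)
""".strip()
```

With a table defined this way, a query that constrains date (and optionally hour) only reads the matching S3 prefixes, which is what keeps scan costs tied to the time window rather than to the total volume of dead tasks.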

The design also had to account for IAM and KMS setup and constraints. Athena needs permissions to read from the source S3 buckets, write query results to a dedicated results bucket, and access the Glue catalog. Those permissions also had to be wired correctly across every environment—staging and production—and across every region we operate in. We provisioned a dedicated Athena workgroup and results bucket per region rather than sharing infrastructure with other Athena users, which makes it easier to enforce workgroup-level query policies.
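
The permission set described above can be sketched as an IAM policy document. This is a starting point under stated assumptions, not the policy Notion uses: the function name, its parameters, and the statement IDs are hypothetical, and a real deployment would likely need further scoping. The IAM action names themselves are standard AWS actions.

```python
import json


def athena_dlq_policy(source_bucket: str, results_bucket: str,
                      kms_key_arn: str, workgroup_arn: str,
                      glue_arns: list[str]) -> dict:
    """Build a scoped policy: read source data, write results, read catalog."""
    def bucket(name: str) -> str:
        return f"arn:aws:s3:::{name}"

    def objects(name: str) -> str:
        return f"arn:aws:s3:::{name}/*"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "RunQueriesInWorkgroup", "Effect": "Allow",
             "Action": ["athena:StartQueryExecution",
                        "athena:GetQueryExecution",
                        "athena:GetQueryResults",
                        "athena:StopQueryExecution"],
             "Resource": [workgroup_arn]},
            {"Sid": "ReadGlueCatalog", "Effect": "Allow",
             "Action": ["glue:GetDatabase", "glue:GetTable",
                        "glue:GetPartitions"],
             "Resource": list(glue_arns)},
            {"Sid": "ReadDeadTasks", "Effect": "Allow",
             "Action": ["s3:GetObject"],
             "Resource": [objects(source_bucket)]},
            {"Sid": "WriteQueryResults", "Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject",
                        "s3:AbortMultipartUpload"],
             "Resource": [objects(results_bucket)]},
            {"Sid": "ListBuckets", "Effect": "Allow",
             "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
             "Resource": [bucket(source_bucket), bucket(results_bucket)]},
            {"Sid": "UseKmsKey", "Effect": "Allow",
             "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
             "Resource": [kms_key_arn]},
        ],
    }
```

Generating the document from one function per region and environment is one way to keep the permissions consistent: the only inputs that vary are the ARNs and bucket names.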

In practice, most of the complexity was not in defining permissions, but in ensuring they behaved consistently across regions and environments without introducing gaps in access or over-permissioning.

A diagram showing the relationship between tasks in S3, tasks in a Glue catalog, and Athena

The query layer

With the data layer in place, the DLQ Explorer lets engineers query dead tasks across cells, environments, and time ranges without touching the CLI.

The search form takes a cell, environment, and date range as required inputs, with optional filters for event name (prefix match), workspace ID (exact match), and free text across the reason, payload, and error fields. Under the hood, the handler resolves the AWS account from the environment and the AWS region from the cell ID, fetches a cached Athena client for that account and region pair, and constructs a SQL query. Partition pruning is applied before any filter predicates, so queries remain bounded even when scanning large time ranges or multiple cells.
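
The query construction step might look roughly like the sketch below. It is an illustration only: the build_query function, the column names, and the default limit are assumptions, and account and region resolution and the cached Athena client are left out. The point it shows is that the date-partition predicate is always present and comes first, with the optional filters appended after it.

```python
import re
from datetime import date
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_query(cell_id: str, start: date, end: date,
                event_prefix: Optional[str] = None,
                workspace_id: Optional[str] = None,
                text: Optional[str] = None,
                limit: int = 500) -> str:
    """Build a dead-task search query for one cell."""
    # The cell ID becomes part of a table name, so restrict its characters.
    if not re.fullmatch(r"[A-Za-z0-9_]+", cell_id):
        raise ValueError(f"invalid cell id: {cell_id!r}")

    # Partition predicate first: every query is bounded by the date range.
    where = [
        f'"date" BETWEEN {sql_literal(start.isoformat())} '
        f"AND {sql_literal(end.isoformat())}"
    ]
    if event_prefix:
        # Prefix match: the needle must start at position 1.
        where.append(f"strpos(event_name, {sql_literal(event_prefix)}) = 1")
    if workspace_id:
        where.append(f"workspace_id = {sql_literal(workspace_id)}")
    if text:
        # Case-insensitive substring match across the free-text fields.
        needle = sql_literal(text.lower())
        fields = ("reason", "payload", "error")
        where.append(
            "(" + " OR ".join(
                f"strpos(lower({f}), {needle}) > 0" for f in fields
            ) + ")"
        )
    return (
        f'SELECT * FROM "dead_tasks_{cell_id}" WHERE '
        + " AND ".join(where)
        + f" LIMIT {int(limit)}"
    )
```

Using strpos instead of LIKE avoids having to escape % and _ in user input; the trade-off is that wildcard searches are not supported in this sketch.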

The interface is designed to support common debugging workflows: querying across multiple cells, quickly identifying failure patterns through aggregation, and returning exact row counts so engineers can tell whether they are looking at the full result set. This mirrors how engineers typically debug incidents: start broad, identify patterns, and then narrow down to specific failures.
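
One way to picture the aggregation and row-count behavior is the small sketch below. The summarize function and its row keys are hypothetical; it assumes the handler has both the rows it is displaying and a separate exact count of all matching rows.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize(rows: Iterable[Mapping[str, str]], total_matching: int,
              top_n: int = 5) -> dict:
    """Group displayed rows by event name and failure category.

    `total_matching` is the exact number of rows the query matched, so the
    caller can tell whether the displayed rows are the full result set.
    """
    rows = list(rows)
    by_event = Counter(r["event_name"] for r in rows)
    by_category = Counter(r["failure_category"] for r in rows)
    return {
        "shown": len(rows),
        "total": total_matching,
        "truncated": len(rows) < total_matching,
        "top_events": by_event.most_common(top_n),
        "top_categories": by_category.most_common(top_n),
    }
```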

The recovery layer

A diagram depicting a re-enqueue workflow, from selecting tasks in a results table to the resulting re-enqueue

Querying dead tasks is only half the workflow. The other half is recovering them safely. In practice, most investigations end with some form of replay, which makes recovery just as important as inspection.

Re-enqueue is a first-class workflow in the Explorer. You select tasks from the results table, click Re-enqueue, and a confirmation modal requires a justification reason before the operation proceeds. The handler calls the existing retryDeadTasks() helper, which streams the selected tasks from S3 and sends them in batches to the cell’s overflow SQS queue. The response returns success and failure counts along with the full context of the operation: who triggered it, which cell, which event types, and the reason provided.
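
The batching and bookkeeping in that flow can be sketched as follows. This is not the retryDeadTasks() helper itself, whose signature the post does not show; the reenqueue function, its parameters, and the result keys are assumptions. The sender is injected as a callable that returns a response shaped like SQS SendMessageBatch (lists under "Successful" and "Failed"), and the batch size of 10 is the SQS limit for that call.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per call


def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def reenqueue(tasks: Iterable[dict],
              send_batch: Callable[[list[dict]], dict],
              *, actor: str, cell: str, reason: str) -> dict:
    """Send tasks in batches and report counts plus operation context."""
    if not reason.strip():
        raise ValueError("a justification reason is required")

    succeeded = failed = 0
    event_names: set[str] = set()
    for batch in batched(tasks, SQS_MAX_BATCH):
        # Entry IDs only need to be unique within a single batch.
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(task)}
            for i, task in enumerate(batch)
        ]
        event_names.update(task["event_name"] for task in batch)
        response = send_batch(entries)
        succeeded += len(response.get("Successful", []))
        failed += len(response.get("Failed", []))

    return {
        "succeeded": succeeded,
        "failed": failed,
        "actor": actor,
        "cell": cell,
        "event_names": sorted(event_names),
        "reason": reason,
    }
```

Because tasks is consumed lazily, the same function works whether the selection is a short list from the results table or a stream read from S3.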

Re-enqueue history is tracked and surfaced in the Explorer. Every re-enqueue operation is recorded, so you can see whether a task has been retried, when, and by whom. Knowing whether a re-enqueued task keeps failing matters for debugging, and the history doubles as a lightweight audit trail, which is important when replaying tasks that may have side effects.
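
A minimal sketch of what such a history lookup could look like is below. The ReenqueueRecord type, the task_id field, and both helper functions are hypothetical; the post does not describe how history is stored or how tasks are identified.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReenqueueRecord:
    """One recorded re-enqueue of a task (hypothetical shape)."""
    task_id: str
    actor: str
    reason: str
    at: datetime


def history_for(task_id: str,
                records: list[ReenqueueRecord]) -> list[ReenqueueRecord]:
    """Return a task's re-enqueue records, oldest first."""
    return sorted((r for r in records if r.task_id == task_id),
                  key=lambda r: r.at)


def failed_after_retry(dead_at: datetime,
                       history: list[ReenqueueRecord]) -> bool:
    """True if the task landed in the DLQ after its latest re-enqueue."""
    return bool(history) and dead_at > history[-1].at
```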

What this changes

Before the DLQ Explorer, investigating a spike in dead tasks required AWS credentials, CLI access, and significant context about the bucket structure. There was no built-in way to query and filter results, and the re-enqueue path had no guardrails.

Now an engineer can use the Explorer to filter dead tasks by workspace, event name, or error message across any cell and environment, understand failures through the summary view, and re-enqueue with a full audit trail, all without leaving the browser.

The DLQ Explorer turned a slow, manual workflow into a fast, repeatable one. Investigations that once took around 20 minutes can now take under a minute, and the tool is already seeing steady adoption across teams.
