At Resend we strive to move fast and ship with quality. However, we also understand that failures are inevitable in complex systems.
What matters to us is how we respond to those failures. We treat every incident as an opportunity to learn and take ownership. Getting to the root cause helps us avoid repeat incidents and continuously raise the bar on product quality.
For every incident, we prepare an incident review (also known as a post-mortem) document that outlines what happened, why it happened, and what we learned. It includes a timeline of events, impact assessment, contributing factors, resolution steps, and follow-up actions.
We use incident.io to standardize this process. It helps us capture key details during the incident and guides us through the post-incident workflow.
Using the same structure across all incidents makes it easier to review patterns over time and spot recurring issues.
Once the incident has been documented, we run a dedicated post-incident meeting. The goal is to reflect as a team on what happened, share context, and align on next steps.
We typically run the meeting within a few days of the incident being resolved, giving everyone time to digest what happened and contribute to the write-up. The meeting is open to any interested team member, even those outside of Engineering. This cross-team collaboration helps foster shared ownership and surfaces perspectives that might otherwise be missed.
We walk through the incident timeline, impact, contributing factors, and the steps that led to resolution. The post-mortem document guides the discussion, which aims to identify systemic improvements, not to assign blame.
Incidents are rarely caused by a single mistake. They’re usually the result of multiple factors coming together in ways we didn’t anticipate. That’s why we take a blameless approach.
Instead of asking "who caused this?", we ask "how did this happen?". The focus is on improving our systems, tooling and assumptions, not assigning fault to individuals. People should feel safe speaking up about what they saw, what they did, and what they think could be improved.
Keeping it blameless doesn’t mean ignoring mistakes. It means acknowledging that errors happen, and using them as a signal that something in the system needs to change.
Every post-incident meeting ends with clearly defined follow-up actions. These tasks are actionable, assigned, and tracked to completion. If something feels unclear or unresolved, we dig deeper. The goal is to learn collectively and prevent similar incidents from happening again.
Follow-ups are based on the lessons learned during and after the incident. They might include engineering fixes, monitoring improvements, documentation updates, or process changes. Each one is tied to a clear outcome and owned by someone on the team.
We create and assign each task directly in incident.io, which syncs with Linear. This workflow ensures nothing falls through the cracks.
We assign a priority level to every task and follow internal SLAs to keep them on track.
Incident follow-ups take priority over regular feature work. We track them actively in Linear, and if a task is at risk of slipping, we surface it early. The goal is not only to fix what failed, but also to strengthen the system for the future.
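To make the idea of surfacing at-risk follow-ups concrete, here is a minimal sketch of how open tasks could be flagged before they miss their SLA. Everything in it is an assumption made for illustration: the priority names (`P0`, `P1`, `P2`), the SLA windows, the `FollowUp` fields, and the 75% warning threshold are hypothetical, and it does not call the incident.io or Linear APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical SLA windows per priority level (illustrative values only).
SLA_BY_PRIORITY = {
    "P0": timedelta(days=2),
    "P1": timedelta(days=7),
    "P2": timedelta(days=30),
}


@dataclass
class FollowUp:
    title: str
    owner: str
    priority: str
    created_at: datetime
    done: bool = False

    @property
    def due_at(self) -> datetime:
        # The SLA deadline is the creation time plus the window for this priority.
        return self.created_at + SLA_BY_PRIORITY[self.priority]


def at_risk(tasks, now, warn_fraction=0.75):
    """Return open tasks that have used at least warn_fraction of their SLA window,
    ordered by deadline (earliest first)."""
    flagged = []
    for task in tasks:
        if task.done:
            continue
        window = SLA_BY_PRIORITY[task.priority]
        elapsed = now - task.created_at
        if elapsed >= window * warn_fraction:
            flagged.append(task)
    return sorted(flagged, key=lambda t: t.due_at)


# Example: three follow-ups created on the same day, checked six days later.
created = datetime(2024, 1, 1)
tasks = [
    FollowUp("Add alert on queue depth", "alice", "P1", created),
    FollowUp("Update runbook", "bob", "P2", created),
    FollowUp("Patch retry logic", "carol", "P0", created, done=True),
]
flagged = at_risk(tasks, now=datetime(2024, 1, 7))
for task in flagged:
    print(f"{task.title} (owner: {task.owner}, due {task.due_at:%Y-%m-%d})")
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test. Flagging at a fraction of the window, instead of at the deadline itself, is what lets a slipping task be raised while there is still time to act on it.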