Hey everyone,
I’ve been thinking a lot about webhook delivery reliability lately. In many projects I’ve worked on, building robust webhook infra turned out to be deceptively complex:
- Retry logic (exponential backoff, timeouts)
- Handling non-2xx responses
- Delivery monitoring and alerting
- Back-pressure or queueing to avoid overwhelming receivers
- Secure signing and validation flows
In one project, a failed webhook delayed payment processing for hours because the retry logic was buggy. Another time, a traffic burst took down the receiver endpoint, and we had no dead-letter queue (DLQ) strategy in place.
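For reference, this is roughly the shape of retry loop I mean: exponential backoff with full jitter, and a dead-letter list once attempts run out. It is a minimal sketch, not what either project actually ran. `send`, `dead_letters`, `event_id`, and the default limits are all hypothetical, and a real system would use a durable queue rather than an in-memory list.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, ceiling)


def deliver_with_retries(send: Callable[[], int],
                         max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep,
                         dead_letters: Optional[List[str]] = None,
                         event_id: str = "evt") -> bool:
    """Call send() until it returns a 2xx status or attempts are exhausted.

    send is a hypothetical callable that performs the HTTP request and
    returns the status code; an exception is treated as a failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            status = send()
        except Exception:
            status = None  # timeout or connection error
        if status is not None and 200 <= status < 300:
            return True
        # Simplification: every non-2xx is retried. Many systems skip
        # retries for most 4xx responses, since they rarely succeed later.
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    # Out of attempts: park the event so it can be inspected or replayed.
    if dead_letters is not None:
        dead_letters.append(event_id)
    return False
```

Passing `sleep` in as a parameter is deliberate: it lets you test the retry path without waiting, which is exactly where our buggy logic would have been caught.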
I’ve been researching different approaches teams here use:
- Do you build your own custom webhook delivery queue and monitoring system?
- Do you use cloud services like AWS EventBridge or Step Functions to orchestrate delivery?
- Or do you integrate third-party tools that handle delivery, retries, and observability for you?
I’m curious about how you ensure production-grade reliability at scale without burning dev hours on plumbing. Recently, I’ve been working on a tool in this space to handle these issues automatically, but would love to hear:
- What architecture have you found most reliable?
- What are the edge cases you’ve encountered (e.g. signature mismatches, downstream outages)?
- Any horror stories or lessons learned from webhook failures in production?
Looking forward to learning from your experiences and best practices around webhook infra!