Best Practices for Happy Webhooks

I love webhooks. I love automating things and minimizing the shit work I and my users have to do, and webhooks have been an invaluable solution to a real problem.

I will also say right up front that I haven’t implemented outbound webhooks all the way into production. I’ve started a few times and never released any of them. But I’ve received more than a few webhooks, and these are the things that, as a receiver, make me happy.

HMAC-based Authentication

I need to know that I can trust the source of the webhook. IP addresses can be spoofed, and DNS can be compromised, but fortunately we have cryptographic tools! And even better, HMAC is a very simple tool to use.

Mandrill sends an HMAC header, as does GitHub. GitHub’s is slightly easier to calculate.

Stripe has an annoying answer: they send the full event data and then suggest you pull it back via their API. Given that actually doing that is optional, I’d rather they either only sent the ID, making it non-optional, or included an HMAC.

Checking the HMAC is also optional, of course, but at least it’s faster than a round trip to the API.
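
To show how little work this is on the receiving end, here’s a minimal sketch in Python. It assumes the sender computes HMAC-SHA256 over the raw request body and sends it hex-encoded in a header. Each provider picks its own hash, encoding, and header name, so treat those details (and the function names sign and is_authentic) as placeholders and check the provider’s docs.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True if the signature from the header matches the body."""
    expected = sign(secret, body)
    # compare_digest takes the same time whether or not the values match,
    # so an attacker can't learn the signature one character at a time.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The one thing that tends to bite: compute the HMAC over the exact bytes you received, not over a re-serialized copy of the parsed JSON.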

MailChimp has basically no support for securing webhook data. You can’t even request the event data from the API to verify it. Their solution: security by obscurity. (Given that it’s the same folks as Mandrill, this really surprises and disappoints me.)

Unique Event IDs

Every webhook that gets fired should contain a unique ID for that event. Even with HMAC-based auth, an attacker can still record the full request and replay it against your application. This is fairly benign in some cases (e.g. a MailChimp unsubscribe event) but can be extremely dangerous when using something like Stripe.

Fortunately, Stripe events all have a unique ID, and they do recommend logging which events you’ve processed. GitHub has X-Github-Delivery but I’m not sure if it’s specific to the event or the attempt to deliver it. (I’ll update this post if they answer me.)

You absolutely cannot rely on the timestamp. At any meaningful scale, it’s entirely possible for two events to have the same timestamp, especially if it’s only second-resolution. A Mandrill click event, for example, could have the same set of (event type, timestamp, message-ID) if a user clicked twice on a link, and that doesn’t even require a large set of event sources.

Generate a UUID or however else your system creates unique IDs. It’s worth it.
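
On the receiving side, a unique ID makes replay protection almost trivial. Here’s a rough sketch using SQLite, where a primary key does the deduplication. The table name, the function names, and the choice of SQLite are all mine for illustration; any store with a uniqueness constraint would do.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table of event IDs we've already handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Record event_id. Return False if we've seen it before (a replay)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
    except sqlite3.IntegrityError:
        # The primary key rejected a duplicate: this event was already handled.
        return False
    return True
```

A handler would call claim_event first and only act on the event if it returns True. A replayed or redelivered event can still be answered with a 2xx so the sender stops retrying.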

One Event at a Time

For origin servers, it can be tempting to batch several events (à la Mandrill) to reduce their own load, but this just creates complexity for receivers. Handling one webhook per request makes keeping track of replay (see above) and retry (see below) much simpler and thus more reliable.

Retry with Back-off

This is probably the most basic of all webhook requirements: if the destination server isn’t available, the origin should retry at increasing intervals. How long the retries should continue is up to what’s reasonable for that use case. Stripe needs to retry harder than MailChimp, for example.

A good solution is “any 2xx status code means it was handled, anything else means retry”. That means a receiver that’s working properly can still opt to drop certain events on the floor: it just returns a 2xx for the ones it doesn’t care about.
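
Since I haven’t shipped the sending side, take this as a sketch of the rule rather than a battle-tested implementation. It assumes send is a function you supply that makes one delivery attempt and returns the HTTP status code. The default numbers (start at a minute, double each time, cap at a day, eight retries) are made up, and a real sender would schedule retries in a job queue instead of sleeping.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(base: float = 60.0, factor: float = 2.0,
                   cap: float = 86400.0, attempts: int = 8) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, ... up to cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def _handled(send: Callable[[], int]) -> bool:
    """One delivery attempt: any 2xx means handled, anything else means retry."""
    try:
        return 200 <= send() < 300
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False


def deliver(send: Callable[[], int],
            delays: Optional[Iterable[float]] = None,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry after each delay until a 2xx or we run out."""
    if _handled(send):
        return True
    for delay in (backoff_delays() if delays is None else delays):
        sleep(delay)
        if _handled(send):
            return True
    return False  # give up; a real sender might alert or disable the hook
```

Passing sleep in as a parameter is only there so the retry loop can be exercised without actually waiting.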

Data Format

Don’t bother wrapping JSON-encoded data in multipart/form-data. Stick to one or the other.

Yeah, this is a small complaint, but it’s just more pleasant to treat the whole request body the same way.