Post Mortem: auth change 9/2020

This is a deep dive into a recent issue on Slipmat. I’ll cover what happened, why it happened, and what is being done to prevent similar issues in the future.

What happened

During week 38/2020 the Slipmat backend was upgraded to handle authentication in a more secure manner. (Specifically, the session cookies were changed to http-only, the cookie domain was changed to point to .slipmat.io, and the new live page and some of the old APIs were decorated with CSRF protection.)
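For readers unfamiliar with these cookie attributes, here is a minimal sketch of what such a session cookie looks like on the wire. It uses only the Python standard library and is not the actual Slipmat backend code: the cookie name `sessionid`, the `Secure` and `SameSite` flags, and the helper name `build_session_cookie` are assumptions made for illustration. Only the http-only flag and the `.slipmat.io` domain come from the description above.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Return a Set-Cookie header value for a hardened session cookie."""
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id      # hypothetical cookie name
    morsel = cookie["sessionid"]
    morsel["domain"] = ".slipmat.io"      # one cookie shared by all subdomains
    morsel["path"] = "/"
    morsel["httponly"] = True             # hidden from frontend JavaScript
    morsel["secure"] = True               # assumption: sent over HTTPS only
    morsel["samesite"] = "Lax"            # assumption: not stated in the post
    return morsel.OutputString()


header = build_session_cookie("abc123")
print(header)
```

A browser that still holds an older cookie with different attributes does not necessarily replace it with this one, which is the kind of mismatch that can leave a session half working.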

Some browsers did not automatically accept the new session cookies. Those browsers were treated as authenticated by some APIs (so the user saw most of the site working as before) but as unauthenticated by others. As an unintended result, the live page chat and some of the DJ functionality were broken for some users.

I wrote a post titled Changes to Slipmat login / authentication to communicate the change, but soon found out that many users were unable to fix their sessions simply by refreshing or by logging out and back in again.

The issue led to about ten days of bug hunting and code fixing, in addition to dozens of users having really bad experiences trying to log in and get the event chat working.

Why

The technical change was needed to allow more secure authentication between the old APIs and the upcoming new frontend services. We also have token-based authentication for the experimental mobile client, but using tokens in a browser environment is inherently insecure. Slipmat has always been an industry leader in both technology and features, so we decided to build the upcoming authentication layer in the most secure way possible.

The problem was not caused by any new technology, though. It was caused by the current challenging state of Slipmat, where we are running both old and new code at the same time. That is easy to see in hindsight, but it wasn’t obvious during development and early testing.

What will be done to prevent this happening again

First, these kinds of changes on a live site are extremely rare. Also, if Slipmat didn’t have a big Beta badge stamped everywhere, this change wouldn’t have happened at all. In beta it’s much easier to move fast and break things. That said, this issue taught us a lot, and those lessons have already been guiding the development of our next-gen code.

Errors should never pass silently

This is one of the core principles in the Zen of Python, and our old frontend JavaScript code breaks this rule in many places. The old plumbing just doesn’t have an easy way to handle errors gracefully, and in normal situations everything Just Works. Until it doesn’t.

While this is really hard to fix in the old code, I pushed back the launch of the new DJ Dashboard to completely rewrite some of the low-level error handling code so that these kinds of issues never happen by accident. If we want, we can silence errors, but it should never happen by default. So all new Slipmat UIs have proper error handling in place that shows the user a proper error message if anything important breaks.

(Some important places, like the live event chat, have also been updated to clearly show an error message when something is wrong.)
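The "loud by default, silent only on request" idea can be sketched in a few lines. This is an illustration of the principle in Python, not the new Slipmat frontend code (which is JavaScript). The names `run_action`, `UserVisibleError`, and `send_chat_message` are hypothetical.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("ui")


class UserVisibleError(Exception):
    """An error whose message is meant to be shown to the user."""


def run_action(action: Callable[[], T], description: str,
               *, silent: bool = False) -> Optional[T]:
    """Run an action; failures surface unless the caller opts out."""
    try:
        return action()
    except Exception as exc:
        if silent:
            # Silencing is an explicit choice and still leaves a log entry.
            logger.warning("%s failed (silenced): %s", description, exc)
            return None
        raise UserVisibleError(f"{description} failed: {exc}") from exc


def send_chat_message() -> None:
    # Stand-in for an API call rejected because the session is not valid.
    raise PermissionError("not authenticated")


try:
    run_action(send_chat_message, "Sending chat message")
except UserVisibleError as err:
    shown = str(err)  # this is what the UI would display
```

The design choice is that `silent` defaults to `False`, so forgetting to think about errors results in a visible message rather than a chat that quietly does nothing.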

Users and DJs should see the important notifications

About 60% of my own time handling this issue was spent posting one and the same link over and over again, as users either did not see the Backstage message or did not understand that they should actually read and react to it. While this is part of the much bigger problem of (even beta) users never reading any instructions, there is a lot to improve in our general communication.

Slipmat staff currently have no means of getting general information out to DJs or listeners other than posting here on Backstage or sending email. This will be fixed in the upcoming new UIs with a notification system that has both normal and non-dismissable (i.e. “must read”) notifications. This will help a lot with everyday communications and will also allow moderators, mentors, and other staff groups to post information to our users.
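As a rough illustration of the two notification kinds, here is a minimal data-model sketch. It is an assumption about how such a system could be shaped, not the planned implementation: the class names `Notification` and `Inbox`, the `posted_by` field, and the rule that a dismiss request on a must-read item is simply refused are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Notification:
    message: str
    posted_by: str = "staff"   # e.g. "staff", "moderators", "mentors"
    must_read: bool = False    # True means the user cannot dismiss it


@dataclass
class Inbox:
    items: List[Notification] = field(default_factory=list)

    def post(self, notification: Notification) -> None:
        self.items.append(notification)

    def dismiss(self, notification: Notification) -> bool:
        """Remove a normal notification; refuse for must-read ones."""
        if notification.must_read:
            return False
        self.items.remove(notification)
        return True


inbox = Inbox()
tip = Notification("New DJ Dashboard tips are available")
auth = Notification("Login changed, please read the instructions",
                    must_read=True)
inbox.post(tip)
inbox.post(auth)

dismissed_tip = inbox.dismiss(tip)    # normal notification goes away
dismissed_auth = inbox.dismiss(auth)  # must-read notification stays
```

With a model like this, an announcement such as the login change would stay in front of every user instead of depending on them finding a Backstage post.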


Cheers for the explanation. It’s fascinating to know how these things go.
