This topic is for announcements and postmortems of unplanned service disruptions on Slipmat.
When: 2020-04-22 01:27 EEST
What happened: The Slipmat main site was offline for about 4 minutes. In addition, our monitoring services were active even though they were supposed to be paused for a maintenance window.
We had a planned maintenance window that needed to be extended because the new code was slow to compile. I managed to push broken code twice, which resulted in approximately 4 minutes of downtime.
To add insult to injury, our new monitoring services were supposed to be in a maintenance window, but due to user error the window that had been set up wasn't actually activated for the specific monitors, so false alerts were sent to the team.
Both the code and the monitoring have now been fixed.