A data loss issue on Monday, 2019-06-03

We have lost some data from the server, probably because of an issue with our service provider. Their support hasn’t been very helpful so far, but some data between 11 AM and 9 PM local server time (GMT+2) is corrupted. Unfortunately this window fell between our daily backups (morning backups had already run, evening backups hadn’t run yet), so the missing data is probably lost.

The server logs haven’t been helpful and application logs from the site haven’t offered any insights either. We’ll keep investigating the issue and will also make a plan for a more robust backup solution to prevent anything like this from happening in the future. (We’ve been running the site for 3 years now and this is the first time we’ve had any data loss, and only a couple of months after migrating to a new, supposedly better server architecture.)

If you encounter any missing data on the site from the past 24 hours, please report it.

Okay, time for a full incident analysis (as far as we can do one).

What happened and why

On Monday, 3rd of June 2019, at approximately 11 am Finnish time (GMT+3), our service provider had an issue with a server that runs the database of Slipmat.io. The downtime was minimal, but at approximately 9 pm the server disk was apparently swapped with an old copy, which effectively deleted all Slipmat.io database data between 11 am and 9 pm. (Note: our service provider wasn’t able to confirm this, but the facts speak for themselves; this is what we were able to figure out based on the evidence.)

It’s difficult to know exactly how much data was lost because, for security reasons, almost all of our data is designed not to use, for example, automatically incremented integers as IDs. But based on the few low-level admin logs we do have, we lost about 4 live events, one user profile and all their related data (like statistics).
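To illustrate why the ID design makes the loss hard to measure, here is a minimal sketch. It assumes random UUIDs as the non-sequential ID type, which is only an example; the actual ID scheme used on the site is not described here, and `count_gaps` is a hypothetical helper.

```python
import uuid


def count_gaps(ids):
    """Count integers missing between the smallest and largest ID seen."""
    present = set(ids)
    return (max(present) - min(present) + 1) - len(present)


# Sequential IDs: rows 4 and 5 vanished, and the gap gives it away.
sequential_ids = [1, 2, 3, 6, 7]
missing = count_gaps(sequential_ids)

# Random IDs (assumed UUIDs here): there is no "next" value,
# so a lost row leaves no visible gap to count.
random_ids = [uuid.uuid4() for _ in range(5)]
```

Even with sequential IDs a gap is only a hint, since ordinary deletions leave gaps too. With random IDs there is nothing to count at all, which is why we had to rely on the admin logs.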

I noticed the problem from admin logs that pointed to events that weren’t in the database. Unfortunately the database backups were not useful for two reasons:

  1. the backups were stored on the same server that runs the database. This was due to configuration work that was mistakenly left unfinished after our recent server move.

  2. our backup cycle was 24 hours and this incident happened between backups.

Lessons learned

After the incident we immediately wrote new backup scripts and a system to monitor the process. Our new backup system saves full snapshots every 6 hours and rotates them daily, keeping a full 7 days of backups ready. Old backups are automatically deleted, so we keep only the minimum necessary user data on disk at any given time.
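The rotation rule can be sketched roughly as follows. This is not our actual backup script, only an illustration of the 6-hour / 7-day policy; `split_by_retention` and `RETENTION` are hypothetical names, and snapshots are represented by their timestamps alone.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # keep a full 7 days of snapshots


def split_by_retention(snapshot_times, now, retention=RETENTION):
    """Return (keep, delete) lists of snapshot timestamps."""
    cutoff = now - retention
    keep = [t for t in snapshot_times if t >= cutoff]
    delete = [t for t in snapshot_times if t < cutoff]
    return keep, delete


now = datetime(2019, 6, 10, 12, 0)
# One snapshot every 6 hours, going back 8 days (33 timestamps).
snapshots = [now - timedelta(hours=6 * i) for i in range(33)]
keep, delete = split_by_retention(snapshots, now)
```

In this example the four snapshots older than 7 days end up in `delete`, and everything else is kept. A real script would then remove the corresponding files from the backup server.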

The new system will also let us delete private user data from the backups. (Full disclosure: this part of the system isn’t written yet.) We now also have systems in place to monitor the status of the backups, which can be further automated to notify the staff if for some reason the latest backup is older than 6 hours. (Again, this is not implemented yet, but we can easily add it when we decide to.)
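Since the notification part is not implemented yet, the following is only a sketch of what the staleness check could look like. The function name `backup_is_stale` and the `MAX_AGE` constant are hypothetical, and how the latest backup time is obtained and how staff are notified is left out.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=6)  # the backup interval described above


def backup_is_stale(latest_backup, now, max_age=MAX_AGE):
    """True when the newest backup is older than max_age, or missing."""
    if latest_backup is None:
        return True
    return now - latest_backup > max_age


now = datetime(2019, 6, 10, 12, 0)
fresh = backup_is_stale(datetime(2019, 6, 10, 7, 0), now)   # 5 hours old
stale = backup_is_stale(datetime(2019, 6, 10, 5, 0), now)   # 7 hours old
```

In practice a check like this would probably want a small grace period on top of the 6 hours, so that a backup that is still running does not trigger a false alarm.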

Slipmat has run for over 3 years without any data loss, so this incident is very unfortunate. We were also unhappy about the way our new service provider handled the issue (we did not get any notification, nor did we get a satisfactory explanation from support when trying to solve the issue), but it’s very unlikely anything like this could happen again, as Slipmat runs on multiple servers and the backups are now stored on a different server than the database itself.

We still have more work to do on building the server infrastructure and making the site more robust, but for now we are already in a much better place than where we started.

Lastly, I want to personally apologize to any user who lost data because of this issue. We are still in private beta, but that is not an excuse to lose your data. We have learned from this and hopefully we won’t have to write any more incident reports like this.