May 13 post-incident report
On May 13 our cron job reported two errors: it could not delete a snapshot, and it could not create a snapshot on a read-only file system. We discussed the issue in our daily stand-up and suspected a problem with our Btrfs filesystem. After reading the error logs and doing some initial research, we decided to restart the server. After the restart, we could no longer mount the data storage.
We realized it was time to apply our disaster recovery plan, something we had never had to do in our six years of non-stop operation. Thanks to Terraform, we quickly spun up a couple of instances to try different recovery approaches in parallel. We tried both the Btrfs recovery tooling (btrfs restore) and the repair command (btrfs check --repair).
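For reference, the kind of recovery tooling we evaluated looks roughly like this. This is a sketch, not our exact command history; the device and mount paths are placeholders:

```shell
# Non-destructive first: inspect the filesystem without writing to it
# (btrfs check is read-only by default).
btrfs check /dev/sdX

# Try a read-only mount using a backup copy of the tree roots
# (the rescue=usebackuproot mount option needs a reasonably recent kernel).
mount -o ro,rescue=usebackuproot /dev/sdX /mnt/data

# Copy files off an unmountable filesystem without modifying it.
btrfs restore -v /dev/sdX /mnt/recovery

# Last resort only, and only on advice from the Btrfs community:
# check --repair rewrites on-disk structures and can make things worse.
btrfs check --repair /dev/sdX
```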
The Btrfs documentation warns about the repair command: only use it if a community member has said it is the only or the best option. On Thursday, May 14, we reached out to the community. After some back and forth we got the critical clue: the filesystem had run out of space for its metadata. There was plenty of free space on disk, but not enough of it was allocated for writing metadata, so we could neither mount nor properly recover the filesystem. Increasing the space available for metadata solved the problem, and everything was restored and working as expected by Friday, May 15. This was the first incident in our history to affect the daily backup. In hindsight, we're happy with our choice of Btrfs; it offers many ways to recover data.
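To illustrate the failure mode: Btrfs allocates disk space in separate data and metadata chunks, so a filesystem can show plenty of free space overall while its metadata chunks are completely full. A hedged sketch of how this is typically diagnosed and relieved, with /mnt/data as a placeholder mount point rather than our actual setup:

```shell
# Show how space is allocated to data vs. metadata chunks.
# A full Metadata line next to lots of unallocated space is the red flag.
btrfs filesystem usage /mnt/data

# Reclaim nearly-empty data chunks (here: those under 10% used)
# so Btrfs can allocate new metadata chunks from the freed space.
btrfs balance start -dusage=10 /mnt/data

# Alternatively, grow the pool so new metadata chunks fit.
btrfs device add /dev/sdY /mnt/data
```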
The solution to our problem came from the community. The most important lesson for us: reach out for help immediately to build a better understanding of the problem, instead of losing time trying to research and recover on our own. We now keep a list of expert contacts we can call on during an incident. Beyond that lesson about communication, we also gained new ideas for making our setup more resilient and for improving our alerting.