Hello all, we’ve been worried since day one about the potential for long-term bit rot on our cluster, so starting May 1st we will be running weekly data scrubs of all of our Ceph pools. We’ve also implemented BlueStore and will be adding extra RAM to each cluster node over the month to accommodate the work.

BlueStore has built-in bit rot protection. It stores a checksum for every block and validates it on each read. If a checksum does not match, BlueStore returns an error rather than known-bad data, and that error triggers the higher-level RADOS recovery mechanisms.
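
For anyone who wants a feel for the checksum-on-read idea, here is a rough toy sketch in Python. It is not how BlueStore or RADOS are actually implemented: the class and function names (ToyBlockStore, ChecksumError, read_with_recovery) are made up for illustration, the "replica" is just a second in-memory store, and it uses zlib.crc32 from the standard library as a stand-in for whatever checksum the cluster is configured to use.

```python
import zlib


class ChecksumError(IOError):
    """Raised when a block's stored checksum does not match its data."""


class ToyBlockStore:
    """Hypothetical in-memory store that keeps one checksum per block."""

    def __init__(self):
        self.blocks = {}     # block_id -> bytearray of block data
        self.checksums = {}  # block_id -> CRC32 recorded at write time

    def write(self, block_id, data):
        # Record the checksum alongside the data at write time.
        self.blocks[block_id] = bytearray(data)
        self.checksums[block_id] = zlib.crc32(data)

    def read(self, block_id):
        # Recompute the checksum on every read and refuse to return bad data.
        data = bytes(self.blocks[block_id])
        if zlib.crc32(data) != self.checksums[block_id]:
            raise ChecksumError(f"checksum mismatch on block {block_id}")
        return data


def read_with_recovery(primary, replica, block_id):
    """Stand-in for higher-level recovery: fall back to a replica on error."""
    try:
        return primary.read(block_id)
    except ChecksumError:
        good = replica.read(block_id)   # the replica copy is verified as well
        primary.write(block_id, good)   # repair the damaged copy
        return good


if __name__ == "__main__":
    primary, replica = ToyBlockStore(), ToyBlockStore()
    primary.write(0, b"important data")
    replica.write(0, b"important data")

    primary.blocks[0][0] ^= 0x01        # simulate a single flipped bit
    print(read_with_recovery(primary, replica, 0))
```

The point of the sketch is only that a bad block surfaces as an error at read time instead of being silently handed back, which is what gives the layer above a chance to repair it from a good copy.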

We’ll provide updates as more information becomes available.

