We’ve decided to move from RAID 10 to RAID 6 for all new deployments. Why? More safety and more space!
With our current setup we could lose any one drive on a Ceph storage node without that node being impacted. However, there is a potential corner case where losing 2 drives within 24 hours could cause the storage node to fail. The node’s activity would be picked up by another Ceph replica, but this is still a situation we’d like to avoid.
With our new RAID 6 setup we can lose any two drives and the node will continue to function.
What’s the downside?
In the RAID 10 config, each side of the mirror sat on its own controller card, so we could lose a controller and still function. With RAID 6 the loss of the controller card would halt the node.
Why move the risk from the drives to the card?
Two reasons. First, we think a controller failure is less likely than a drive failure. Second, a failed controller can be replaced without any data loss on the node, while a near-simultaneous (within about 24 hours) failure of two disks in a RAID 10 could result in data loss.
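To give a feel for how small yet real that 24-hour window risk is, here’s a rough, illustrative estimate. The drive count and replacement window come from this post; the annual failure rate (AFR) is an assumed placeholder, not a measured figure, and the result is an upper bound since in RAID 10 only the failed drive’s mirror partner is actually fatal.

```python
# Rough sketch of the RAID 10 corner case: the chance that, after one drive
# fails, a second drive in the same node fails within the replacement window.
# AFR is an assumption for illustration; 48 spindles and the ~24h window are
# from the post. This counts *any* second drive failing, so it over-counts:
# in RAID 10 only the failed drive's mirror partner would kill the array.

DRIVES_PER_NODE = 48       # spindles per storage node
AFR = 0.03                 # assumed annual failure rate per drive (3%)
WINDOW_HOURS = 24          # replacement window
HOURS_PER_YEAR = 8760

def second_failure_prob(drives=DRIVES_PER_NODE, afr=AFR, window=WINDOW_HOURS):
    """P(at least one of the remaining drives fails during the window)."""
    p_one = afr * (window / HOURS_PER_YEAR)       # per-drive window probability
    return 1 - (1 - p_one) ** (drives - 1)

print(f"{second_failure_prob():.3%}")  # on the order of a few tenths of a percent
```

Small per incident, but multiplied across many nodes and many drive failures per year, it’s a corner we’d rather not have at all.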
Wait a minute are you just trying to get more data on the array and possibly impacting performance?
Yes on the first part, no on the second. Yes, we’d like to increase our storage density. As to performance, we are screaming along at > 3 Gbit/s from most nodes now and only leveraging ~30% of the available drive IO capacity in our RAID 6 tests. When you pack 48 spindles together it gets fast!
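The density win is simple arithmetic. The drive size below is an assumed placeholder (the post doesn’t state one), and a real deployment would likely split 48 spindles into several RAID 6 groups, but the mirror-vs-parity math is the same either way:

```python
# Back-of-envelope usable capacity for 48 spindles. Drive size is an
# assumption for illustration; overhead, hot spares, and array-group
# layout are ignored.

DRIVES = 48
DRIVE_TB = 4                        # assumed drive size in TB

raid10_tb = DRIVES * DRIVE_TB // 2  # mirroring halves usable capacity
raid6_tb = (DRIVES - 2) * DRIVE_TB  # parity costs two drives' worth

print(raid10_tb, raid6_tb)          # 96 vs 184 usable TB
```

Nearly double the usable space from the same spindles, which is the “more space” half of the bargain.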