The event
On Friday, 29.5.2020, at 20:50, the test environment for some of our databases went down. The high-availability (HA) solution for a cluster failed, and we commenced a disaster recovery (DR) plan. Since this is a test environment, the admins took their time, and the full recovery of all affected services was completed around Monday, 1.6.2020, 11:00.
Situation at a glance
The event occurred on a test & development environment, which employs a hyper-converged infrastructure of two mirrored hardware servers. This is sufficient HA resiliency for a test environment. On the DR side, the situation is the following:
- The DEV, PARTNER and EDUCATION instances had regular backup plans, with both local and Azure-synchronized backups.
- The SQL instances used for TEST and STAGING purposes have no backup plans. They are considered disposable in case of an HA failure.
- We recently performed the annual recovery testing exercise, which includes a "test restore from Azure", so we were fairly confident that the backups were OK.
Some quick facts
- The failed cluster consists of two mirrored hardware servers, with 24 SSD disks each.
- The two servers employed Microsoft Storage Spaces Direct (S2D) for a mirrored hyper-converged infrastructure.
- The SSD disks were mostly new - less than 8 months old.
- Wear level was far from exhausted - perhaps less than 5% usage.
- The disks were from a comparatively new Intel series - Intel® SSD D3-S4510 Series 3.84 TB.
So, what happened?
At first glance, it seemed that we had been hit by a very unexpected and unfortunate coincidence, one that should be nearly impossible.
On May 26th, we had a minor failure - the disk at position 8 on one of the servers failed. Since this is a test environment, there is no reserved space (the S2D equivalent of a "hot spare disk") and no spare disks on hand, so we just ordered a replacement disk. But on May 29th, just 3 days later, the disk at the same position 8 (out of 24) on the other server also failed!
Now, this is a coincidence, huh?
Every HA solution employs some kind of data duplication so that a single disk failure does not bring the whole cluster down. S2D mirrored servers with 24 disks each should endure multiple disk failures, provided that no two disks at the same position fail. But that is exactly what happened to us: two disks failing at position 8 within just 3 days.
The investigation
The investigation found that we were hit by a weird Intel bug, present on the high-capacity drives (3.84+ TB) of this series. After 1700 idle power-on hours, the disks can experience a "NAND channel hang". The bug is documented here:
https://downloadcenter.intel.com/download/28673/SSD-S4510-S4610-2-5-non-searchable-firmware-links/
https://support.microsoft.com/en-us/help/4499612/intel-ssd-drives-unresponsive-after-1700-idle-hours
Although the Microsoft article is worded to sound like a mild problem, don't be fooled - it is a total failure of the affected disk. The unfortunate part is that disks at the same position in a mirrored cluster accumulate very similar "idle power-on hours" and will hence fail at almost the same time! So the mirrored HA environment does not save you from the cluster going down.
We were not the only ones to be hit by the problem:
https://www.reddit.com/r/storage/comments/d9ilo2/rant_that_moment_when_you_find_all_the_drives_in/
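A practical takeaway from the investigation is to keep an eye on how the drives' power-on hours cluster together, since mirrored "twins" age in lockstep. Below is a minimal monitoring sketch (not our actual tooling), assuming smartmontools is installed on the hosts and the drives are directly visible to it; note that SMART's Power_On_Hours counter is only a rough proxy for the cumulative idle hours described in the advisories, and the device names are just examples.

```python
# Minimal sketch: flag SSDs whose power-on hours approach the advisory's
# ~1700-hour mark. Assumes smartmontools ("smartctl") is installed and the
# drives are directly visible (adjust device discovery for your controller/OS).
import re
import subprocess

RISK_HOURS = 1700  # threshold mentioned in the Intel/Microsoft advisories


def power_on_hours(device):
    """Return the SMART Power_On_Hours raw value for a device, or None if unreadable."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    match = re.search(r"Power_On_Hours.*?(\d+)\s*$", out, re.MULTILINE)
    return int(match.group(1)) if match else None


def report(devices):
    for dev in sorted(devices):
        hours = power_on_hours(dev)
        if hours is None:
            print(f"{dev}: SMART data not available")
        elif hours >= 0.9 * RISK_HOURS:
            print(f"{dev}: {hours} power-on hours  <-- approaching advisory threshold")
        else:
            print(f"{dev}: {hours} power-on hours")


if __name__ == "__main__":
    # Example inventory of eight SATA devices; replace with your actual drive list.
    report([f"/dev/sd{chr(ord('a') + i)}" for i in range(8)])
```

Drives whose counters sit within a few hours of each other are exactly the ones that can fail together, which is what defeated the mirror in our case.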
Conclusion
In the aftermath of the problem, we are planning how to better avoid such failures in the future, especially for the production clusters. Of course, we promptly updated the firmware and checked the wear levels, but this is not enough for the future:
- Even if we set up 3-way (or even 4-way) mirrored servers, it would not have saved us - all servers would have experienced a failure of disk 8.
- Microsoft recommends that "Storage Spaces Direct works best when every server has exactly the same drives."
https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/drive-symmetry-considerations
Okay, but then we can be hit by the same bug on every server.
- There are some mixed mirror/parity solutions, but they have their performance drawbacks.
Having said that, the production clusters are harder to hit. They have a more complex setup, with reserved space (hot spare disks), and would not have gone down from these two disk failures in three days. But in case of more massive (or very rapid consecutive) disk failures, any HA environment would ultimately fail. So we are still planning how to avoid service interruption under such circumstances in the future.
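Since the immediate remediation was the firmware update, it is also worth verifying that no drive was missed. Here is a minimal sketch along the same lines, again assuming smartmontools; the expected firmware string below is a placeholder and should be replaced with the fixed version from Intel's advisory for this exact drive model.

```python
# Minimal sketch: report each drive's firmware version and flag mismatches.
# EXPECTED_FIRMWARE is a placeholder - substitute the patched version for your model.
import re
import subprocess

EXPECTED_FIRMWARE = "XXXXXXXX"  # placeholder, not the real fixed version string


def firmware_version(device):
    """Read the 'Firmware Version' field from `smartctl -i`."""
    out = subprocess.run(["smartctl", "-i", device],
                         capture_output=True, text=True).stdout
    match = re.search(r"Firmware Version:\s*(\S+)", out)
    return match.group(1) if match else "unknown"


def audit(devices):
    for dev in devices:
        fw = firmware_version(dev)
        status = "OK" if fw == EXPECTED_FIRMWARE else "NEEDS ATTENTION"
        print(f"{dev}: firmware {fw} [{status}]")


if __name__ == "__main__":
    audit([f"/dev/sd{chr(ord('a') + i)}" for i in range(8)])  # adjust to your inventory
```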
The good thing was the live test of our DR strategy, planning and execution - they held up excellently in the face of the emergency.