This post describes an open data research collaboration between the Ceph open source project and the Red Hat AI Center of Excellence. Its goal is to address a long-standing problem: predicting storage device failures ahead of time in order to improve overall system reliability. Although failure prediction capabilities have historically been limited to cloud-scale environments and traditional enterprise storage vendors’ products, our aim is to create an open source and open data solution.
Data centers experience frequent IT equipment failures, and storage devices are among the components that fail most often. Storage systems often implement redundancy (be that HW/SW RAID, mirroring, erasure coding, etc.), but whenever a drive fails, the redundancy factor decreases, which increases the risk of data loss until the failure is handled and the data on the failed drive is restored.
Performance is also affected, since re-replicating the data is usually a high-priority operation that can happen during peak hours. If we could predict that a drive is going to fail, we could preemptively migrate its data while the drive is still online, without lowering the replication factor, and avoid undesired load on the system by scheduling data re-replication/repair for off-peak hours.
Why do Storage Drives Fail?
This question has been the subject of dozens of studies in recent years. Since storage drives differ in their underlying technologies, there are different reasons for them to stop functioning.
For example, a solid-state drive (SSD) can wear out simply because it reached its maximum write endurance limit. A hard disk drive (HDD), on the other hand, contains moving parts and can be damaged by vibrations in the server rack. Temperature, humidity, firmware errors, noise, and sudden power outages might also cause a drive to fail. The list of reasons continues to expand, since researchers keep discovering interactions between components that cause failures.
Drive Failure Prediction
HDDs and SSDs report health metrics to the host system. Self-Monitoring, Analysis and Reporting Technology (SMART), for example, is a standard for health metrics reporting for SATA devices. Device health metrics include drive temperature, reallocated sectors count (in HDDs), wear-leveling (in SSDs), power-on hours, etc.
SMART and related device health reporting standards were developed to evaluate a storage device’s health based on data reported by the drive’s internal sensors and counters, and to raise an alert when a failure is imminent. However, over the years various studies have shown that the device’s built-in health assessment is not accurate enough on its own, and have suggested improved prediction models based on the full range of the drive’s health metrics data (e.g., SMART).
One study on predicting disk failure shows that disk and OS performance and latency counters are highly valuable, and should also be considered when building a more accurate failure prediction model.
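To make the idea of a model built on the full range of health metrics more concrete, here is a minimal sketch of one common preparation step: turning daily health snapshots into labeled training examples. It is an illustration only, not the method used by any of the studies mentioned above. The function name, the field names (serial, day, reallocated_sectors, power_on_hours), and the 14-day horizon are all assumptions made for this example.

```python
from datetime import date, timedelta


def label_snapshots(snapshots, failure_dates, horizon_days=14):
    """Attach a binary 'fails_soon' label to each daily health snapshot.

    snapshots: list of dicts with 'serial', 'day' (datetime.date) and
        any number of metric fields (hypothetical layout).
    failure_dates: dict mapping serial -> date the drive failed;
        drives that never failed are simply absent.
    """
    horizon = timedelta(days=horizon_days)
    labeled = []
    for snap in snapshots:
        failed_on = failure_dates.get(snap["serial"])
        # Positive label only if the failure happens on or after the
        # snapshot day and no more than `horizon_days` later.
        fails_soon = (
            failed_on is not None
            and timedelta(0) <= failed_on - snap["day"] <= horizon
        )
        labeled.append({**snap, "fails_soon": int(fails_soon)})
    return labeled


# Made-up example data: drive A fails on 2020-01-25, drive B never fails.
snapshots = [
    {"serial": "A", "day": date(2020, 1, 1),
     "reallocated_sectors": 0, "power_on_hours": 1000},
    {"serial": "A", "day": date(2020, 1, 20),
     "reallocated_sectors": 8, "power_on_hours": 1456},
    {"serial": "B", "day": date(2020, 1, 20),
     "reallocated_sectors": 0, "power_on_hours": 300},
]
failure_dates = {"A": date(2020, 1, 25)}
labeled = label_snapshots(snapshots, failure_dates)
```

The labeled rows could then be fed to any classifier. Performance and latency counters, if available, would simply be additional metric fields on each snapshot.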
Data Sets for Drive Health
In practice, in order to come up with effective prediction models, a large, diverse, and open drive health data set is needed. Back in 2013, Backblaze, a cloud storage and data backup company, decided to share its drive health data and insights with the rest of the world.
Each quarter they publish updated reports and the corresponding raw data sets. Backblaze looks at the failure rates of different models of hard drives across its data centers. This isn’t a comprehensive look at all devices on the market, but it does provide a snapshot of the failure rates of specific models of hard drives from a number of manufacturers.
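As a rough illustration of how per-model failure rates can be derived from this kind of raw data, the sketch below computes an annualized failure rate (AFR) as failures divided by drive-years of operation, expressed as a percentage. It assumes one record per drive per day with a model name and a failure flag set on the day the drive failed. The function name and the sample rows are made up for this example.

```python
from collections import defaultdict


def annualized_failure_rate(rows):
    """Return {model: AFR in percent} from daily drive records.

    Each row represents one drive on one day and carries a 'model'
    string and a 'failure' flag (1 on the day the drive failed, else 0).
    """
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        drive_days[row["model"]] += 1
        failures[row["model"]] += int(row["failure"])
    # AFR = failures / (drive days / 365), as a percentage.
    return {
        model: 100.0 * failures[model] / (drive_days[model] / 365.0)
        for model in drive_days
    }


# Made-up data: model X has 730 drive days and one failure,
# model Y has 365 drive days and no failures.
rows = (
    [{"model": "X", "failure": 0}] * 729
    + [{"model": "X", "failure": 1}]
    + [{"model": "Y", "failure": 0}] * 365
)
afr = annualized_failure_rate(rows)
```

Counting drive days rather than drives keeps the rate comparable between models that were in service for different lengths of time.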
Thanks to this approach, multiple studies on HDD failures have been conducted, and data scientists looking to build improved models use Backblaze’s data set as their go-to source. However, the data set lacks diversity in HDD and SSD vendors and models.
Other published research is based on private access to proprietary drive health data sets (usually a partnership with a large cloud provider), which cannot be accessed by other data scientists who wish to study the same data or try to build a better failure prediction model.
Ceph Drive Telemetry and Drive Data Reporting
The devicehealth Ceph Manager module, first introduced in the 2019 Nautilus release, allows for user-friendly drive management, including scraping and exposing device health metrics like SMART.
Ceph also introduced the telemetry ceph-mgr module at about the same time. This module allows users to report anonymized, non-identifying data about their cluster’s deployment and configuration and the health metrics of their storage drives. The cluster data helps Ceph developers to better understand how Ceph is being used in the wild, identify issues, and prioritize bugs. The drive health metrics data is aimed at building an open data set to help data scientists create accurate failure prediction models.
Ceph Disk Failure Prediction
Ceph comes with a free and open source HDD failure prediction model developed by the AI Center of Excellence at Red Hat. This AI model can be activated by enabling the diskprediction ceph-mgr module. The model uses the collected health metrics to assess the health of the cluster’s disks and estimate whether a disk will fail soon.
As the size of our public data set expands, we plan to continue to refine and improve the predictive models to improve accuracy and encompass newer devices as they come into service.
All collected data is anonymized on the client side before it is sent to the telemetry server. Ceph does not collect any identifying or sensitive information, and replaces both the host name and the drive serial number with random UUIDs.
To enhance users’ privacy, two separate telemetry reports are generated, one with anonymized cluster data and the other with anonymized drive health metrics, and they are sent to different endpoints.
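The following sketch illustrates the general idea of client-side anonymization with random UUIDs and of keeping cluster data and drive data in separate reports. It is not the code of the Ceph telemetry module. The class, the function, and all field names (hostname, serial, model, metrics) are hypothetical and chosen only for this example.

```python
import uuid


class Anonymizer:
    """Maps each real identifier to a stable, random UUID."""

    def __init__(self):
        self._mapping = {}

    def alias(self, real_id):
        # Reuse the same UUID for repeated identifiers so that reports
        # stay internally consistent without exposing the real value.
        if real_id not in self._mapping:
            self._mapping[real_id] = str(uuid.uuid4())
        return self._mapping[real_id]


def build_reports(cluster_info, devices, anonymizer):
    """Return (cluster_report, device_report) as two separate objects."""
    cluster_report = {"cluster": dict(cluster_info)}
    device_report = [
        {
            "host_id": anonymizer.alias(("host", dev["hostname"])),
            "device_id": anonymizer.alias(("serial", dev["serial"])),
            "model": dev["model"],
            "metrics": dict(dev["metrics"]),
        }
        for dev in devices
    ]
    return cluster_report, device_report


# Made-up example input: two drives on the same host.
devices = [
    {"hostname": "node1", "serial": "ZA1234", "model": "ExampleDisk 4TB",
     "metrics": {"power_on_hours": 1200}},
    {"hostname": "node1", "serial": "ZA5678", "model": "ExampleDisk 4TB",
     "metrics": {"power_on_hours": 900}},
]
anonymizer = Anonymizer()
cluster_report, device_report = build_reports(
    {"version": "14.2.0", "osd_count": 2}, devices, anonymizer
)
```

Because the mapping lives only on the client, the receiving side sees that two drives share a host without learning the host name or either serial number.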
You can preview the cluster and device reports via the Ceph dashboard, or with:
# ceph telemetry show
# ceph telemetry show-device
The Ceph telemetry module does not send any data unless you explicitly allow it via the Ceph dashboard, or with:
# ceph telemetry on
Telemetry is sent daily and does not hinder cluster performance. If new metrics are added to the reports, you must opt in again.
Live dashboards with aggregated statistics from both cluster and device data are available. Users can learn about the distribution of Ceph versions and the average total and used cluster capacity, and see which drive models are most popular.
Public Data Set
We are excited to announce that the drive health open data set is available. Please share with us how you use this data, any interesting findings, or any questions you may have via the address linked on that page.
Looking forward, the goal of the Ceph project is to build a large, dynamic, diverse, free, and open drive health data set to help researchers improve drive failure prediction models. To do so, we plan on expanding the collection of health metrics to include performance and latency counters as well.
In addition, we plan to develop a generic standalone agent that collects anonymized drive health metrics on any Linux host so that we can draw from storage devices outside of the Ceph ecosystem.
In the meantime, we encourage the Ceph community to join us in this important effort to improve drive reliability by opting in to phone home their anonymized telemetry data.