The prospect of true machine learning is a tangible goal for data scientists and researchers. It has long been understood that the platform on which ML applications run has to be fast and highly efficient so that learning can happen that much faster. This is the motivation for Red Hat engineers in the Office of the CTO who are working to optimize one such open source platform: Open Data Hub.
Open Data Hub is built on Red Hat OpenShift Container Platform, Ceph Object Storage, and Apache Kafka/Strimzi, integrated with a collection of other open source projects to enable a machine-learning-as-a-service platform. That is a lot of components to integrate, and to ensure that their contributions to Open Data Hub perform well, Red Hat engineers have taken the step of creating an Internal Data Hub within Red Hat as a proving ground and learning environment.
Alex Corvin and Landon LaSmith, both Senior Software Engineers at Red Hat, explained that by working on the Internal Data Hub, Red Hat associates get a chance to experiment with an active machine-learning environment, while developers and engineers can learn to better optimize data management.
“We become Customer Zero for Open Data Hub,” Corvin added.
Another advantage of an Internal Data Hub, LaSmith said, is that engineers inside Red Hat can play around with it without worrying about being “messy.” Changes can be implemented, tested, and, if they don’t work, easily reverted. So far, the experiment has worked out well: Corvin, LaSmith, and their associates have identified some key areas in which they can improve the efficiency of the Internal Data Hub, and by extension the Open Data Hub.
What’s Been Learned
Monitoring is perhaps the first key element any production data environment should have. The solutions LaSmith and Corvin have identified are Prometheus for metrics collection and Grafana for dashboards. Prometheus has already proven to integrate well with OpenShift and Kubernetes, so it was a natural fit for Open Data Hub.
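To give a concrete sense of how a service plugs into that monitoring stack, here is a minimal Python sketch, using the prometheus_client library, of exposing custom metrics for Prometheus to scrape. The metric names, labels, and port are illustrative assumptions, not the actual Open Data Hub configuration.

```python
# A minimal sketch of exposing custom metrics for Prometheus to scrape,
# using the prometheus_client library. The metric names and port below
# are illustrative assumptions, not the actual Open Data Hub setup.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metrics a data-platform service might expose.
INGESTED_OBJECTS = Counter(
    "datahub_ingested_objects_total",
    "Total number of objects ingested into the platform",
)
BUCKET_OBJECT_COUNT = Gauge(
    "datahub_bucket_object_count",
    "Current number of objects in a bucket",
    ["bucket"],
)

if __name__ == "__main__":
    # Serve metrics on :8000/metrics; a Prometheus scrape job (or an
    # OpenShift ServiceMonitor) would point at this endpoint, and Grafana
    # would chart the resulting time series.
    start_http_server(8000)
    while True:
        INGESTED_OBJECTS.inc()
        BUCKET_OBJECT_COUNT.labels(bucket="example-bucket").set(random.randint(0, 100))
        time.sleep(5)
```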
Scalable processes are another big part of optimizing Open Data Hub, and scaling in this context does not necessarily mean bigger. Scaling can also apply to the use of well-defined, repeatable processes: processes need to be defined first, and then they can be improved. Without that kind of scalability, customer onboarding is inefficient, data can go unused, and security can suffer.
Organizing data is another big lever for maximizing the efficiency of Open Data Hub. Performance can degrade, for example, if data buckets in Ceph S3 grow too large. By limiting buckets to roughly 1.5 million objects each, performance can stay at a good level, so operators need to plan for how much data will be stored and decide how to break it up into more manageable pieces. Similar data management techniques apply to Elasticsearch cluster sizing and sharding.
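As a rough illustration of that kind of bucket planning, the Python sketch below spreads objects across several Ceph S3 buckets by hashing object keys, so no single bucket grows past a chosen ceiling. The endpoint, bucket names, and shard count are hypothetical; boto3 is used simply because Ceph RGW speaks the S3 API.

```python
# A minimal sketch of spreading objects across several Ceph S3 buckets so no
# single bucket grows past a chosen object-count ceiling. The endpoint,
# bucket prefix, and shard count are illustrative assumptions.
import hashlib

import boto3

ENDPOINT = "https://ceph-rgw.example.com"   # hypothetical RGW endpoint
BUCKET_PREFIX = "datahub-shard"
NUM_BUCKETS = 8  # sized so each shard stays well under ~1.5M objects

s3 = boto3.client("s3", endpoint_url=ENDPOINT)

def shard_for(key: str) -> str:
    """Map an object key to one of NUM_BUCKETS buckets deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{BUCKET_PREFIX}-{int(digest, 16) % NUM_BUCKETS}"

def put_object(key: str, body: bytes) -> str:
    """Write an object into its shard bucket and return the bucket used."""
    bucket = shard_for(key)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return bucket

# The same key always maps to the same bucket, so reads can recompute the
# shard instead of maintaining a separate lookup table, e.g.:
# put_object("datasets/2020/01/run-42.parquet", b"...")
```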
Managing the overall data volume is also important, because if you give customers any access to a shared data platform, they are going to send data to it. A lot of data. LaSmith and Corvin recommend having control over how long data is retained on a shared data platform. Ceph S3 lifecycle policies, for example, can expire old objects, and Curator can do the same for aging Elasticsearch indices, which helps keep the volume of data manageable.
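One way to express that retention control on the object-storage side is an S3 lifecycle rule. The Python sketch below applies a hypothetical 90-day expiration to a Ceph RGW bucket through boto3; the endpoint, bucket name, and retention window are assumptions, and Elasticsearch retention would be handled separately with Curator's own action files.

```python
# A minimal sketch of an S3 lifecycle rule that expires objects after a
# retention window, applied through boto3 against a Ceph RGW endpoint.
# The endpoint, bucket name, and 90-day window are illustrative assumptions.
import boto3

ENDPOINT = "https://ceph-rgw.example.com"   # hypothetical RGW endpoint
BUCKET = "datahub-shared"                   # hypothetical shared bucket

s3 = boto3.client("s3", endpoint_url=ENDPOINT)

# Expire any object older than 90 days; Ceph RGW honors S3 lifecycle rules,
# so old data is culled automatically instead of accumulating forever.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-data",
                "Filter": {"Prefix": ""},   # apply to every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)
```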
By experimenting with changes and policies such as these in the Internal Data Hub, engineers at Red Hat are gaining hands-on experience in how to better manage platforms like Open Data Hub, bringing the goal of machine learning that much closer.