Machine Learning with Open Source Infrastructure

by | Aug 9, 2019 | AI

As machine learning becomes more interesting to technology companies, it is hardly surprising that a company like Red Hat is going to approach the challenges of this aspect of artificial intelligence with an open source methodology in mind.

The immediate benefits of open source machine learning tools are plain as day to anyone familiar with how open source works: lower cost, more flexibility, no vendor lock-in… you know, the usual.

But dig a little deeper and it quickly becomes apparent that open source means more for cutting-edge software than just a faster way to get cheaper software. 

Open Data Hub, for instance, is an end-to-end AI/ML platform built entirely on open source tools. Specifically, explained Juana Nakfour, a senior software engineer at Red Hat, Open Data Hub is based on OpenShift, a Kubernetes distribution, and incorporates tools like Apache Kafka and Ceph. Data scientists can create models in Jupyter notebooks and choose from popular tools such as TensorFlow, scikit-learn, or Apache Spark to develop them.

There are a lot of tools in this data modelling kit, and again, the flexibility and cost benefits are very apparent, since pulling all this together as a proprietary software package would be prohibitively expensive, to say the least. Under the surface, however, even cooler things are going on.

For instance, Nakfour related, Open Data Hub uses Grafana as the front end for its monitoring solution, and recently contributors from Open Data Hub submitted fixes for permissions issues that were cropping up around containers in OpenShift. It is important that containers not run as root, and this fix enabled Grafana users to avoid doing so. Since the fix was incorporated into Grafana itself, it is available any time Grafana is used with OpenShift, not just within Open Data Hub.
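To make the non-root point concrete, here is a minimal Python sketch that checks whether a Kubernetes pod manifest leaves any container free to run as root. It is not taken from Open Data Hub or Grafana. The helper name `containers_that_may_run_as_root`, the example pod, and the `example/sidecar` image are hypothetical, made up for illustration. The `securityContext` fields `runAsNonRoot` and `runAsUser` are standard Kubernetes pod-spec fields.

```python
def containers_that_may_run_as_root(pod_manifest):
    """Return the names of containers not clearly prevented from running as root.

    Hypothetical helper for illustration only; it inspects a pod manifest
    represented as a plain Python dict.
    """
    spec = pod_manifest.get("spec", {})
    pod_ctx = spec.get("securityContext") or {}
    risky = []
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        # Container-level settings take precedence over pod-level ones.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        # UID 0 is root. With no UID and no runAsNonRoot guard, the image's
        # default user applies, which may well be root.
        if user == 0 or (not non_root and user is None):
            risky.append(container.get("name", "<unnamed>"))
    return risky


# Hypothetical manifest: the pod asks for non-root, but one container
# explicitly requests UID 0.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "grafana-example"},
    "spec": {
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {"name": "grafana", "image": "grafana/grafana"},
            {
                "name": "sidecar",
                "image": "example/sidecar",
                "securityContext": {"runAsUser": 0},
            },
        ],
    },
}

print(containers_that_may_run_as_root(example_pod))
```

A check like this only reads the manifest. The cluster itself decides what actually runs, so treat it as a quick lint rather than an enforcement mechanism.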

Even better, since OpenShift is itself a distribution of Kubernetes, these fixes can be applied across the whole Kubernetes ecosystem.

This is the kind of ripple effect common in open source projects, where a small change in one place can be picked up by other projects to solve their own problems. Is it any wonder, then, that many of the buzzword-y technological achievements of the past decade–big data, non-relational databases, and cloud computing–have few if any proprietary components?

For open source, the value isn’t just in “being cheap”–it’s about reducing barriers to innovation.