A Hub for Open Data at Mass Open Cloud

Sep 19, 2018 | AI

Open source software is good. Open source plus open data is even better. That makes initiatives such as the Open Data Hub useful both in their own right and as a template for maintaining control over your data.

Access to open source code, and the ability to build on it collaboratively, are genuinely useful. If they weren’t, open source software wouldn’t have become such an important part of how technology has developed over the past couple of decades. There are ideological reasons to prefer open source as well, but its effectiveness as a development model is what has won over the pragmatists.

Figure 1. The Open Data Hub isn’t just about storing data but doing so more securely and putting it to work in useful ways. More broadly, Open Data Hub can be thought of as a meta-project that’s focused on integration and service abstraction. Source: Máirín Duffy et al. (Red Hat), CC-BY-SA 4.0.

Code remains useful and valuable today. However, more and more value is shifting toward data, in addition to the software that works with that data. This isn’t a new trend; open data was a topic at an O’Reilly conference more than a decade ago. But it’s a trend that’s heating up. The recognition that data is becoming such an integral part of modern computing is the impetus behind the Open Data Hub project on the Mass Open Cloud (MOC). (Also related is the Dataverse project, open source research data repository software.)

Launched in 2014, the MOC runs at the Massachusetts Green High Performance Computing Center (MGHPCC) in Holyoke. It’s a collaboration of academia, industry, and the state, with overall project leadership provided by Boston University. Its objective is to provide an Open Cloud eXchange (OCX) that enables cloud-related systems research as well as a production platform that isn’t locked into a single public provider. It’s built on the Red Hat OpenStack Platform and uses Ceph for its storage foundation.

It’s probably no surprise that data science and machine learning workloads have become increasingly popular on the MOC. That popularity, in turn, creates demand both for better tooling and for a platform for the data that tooling uses. Hence, the Open Data Hub project.

Open Data Hub can be thought of as a meta-project that’s focused on integration and service abstraction. The idea is to insulate data science users from the details of the underlying platform. It’s part of a hybrid cloud application platform for data that is being implemented initially on the MOC. The plan is to offer flexible entry points, such as storage or an application platform, and to let related projects pick and choose services based on their needs.

Open Data Hub’s first use case, general data science experimentation, brings together Ceph storage, Apache Spark (a unified analytics engine for big data processing), TensorFlow (an open source software library for numerical computation), and Jupyter notebooks (a collaborative tool for writing and sharing code and text). Open Data Hub complements other data projects at the MOC and associated universities such as Cloud Dataverse, the open source research data repository software project.
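
To make that stack concrete, here is a minimal sketch of what a notebook session might look like: Spark reads a shared dataset out of Ceph through its S3-compatible object gateway and computes a simple aggregate. The endpoint, credentials, bucket, and column names are illustrative assumptions rather than details from the project, and the S3A connector (hadoop-aws) is assumed to be on Spark’s classpath.

```python
# Sketch of a Jupyter/Spark session against Ceph object storage.
# All endpoint, credential, and bucket values below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("odh-example")
    # Point Hadoop's S3A connector at the Ceph RADOS Gateway
    .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.example:8080")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Load a shared dataset from object storage and compute a simple aggregate
df = spark.read.parquet("s3a://shared-datasets/experiment-metadata")
df.groupBy("label").count().show()
```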

Open Data Hub, which runs on top of OpenShift and OpenStack, is a proof of concept that is accepting early adopters; access is intended to expand over time. In addition, although it currently runs on the MOC, it’s designed to be portable so that, like the MOC architecture as a whole, it will be able to run across a variety of on-premises and public clouds. Collaborators and others can view the current project at OpenDataHub.io.

It’s this rise of data science and machine learning—or artificial intelligence more broadly if you like—that’s driving much of the attention to data in the first place. A machine learning model is effectively a creation of its training data set. Absent access to that data, there’s no way to verify the validity of the model or to reproduce the result.

In effect, being open means treating the data as part and parcel of the open code and managing a model’s code and data together throughout their life cycle. For example, you may choose a particular type of deep learning model to discover features of disease in medical images. Deciding whether that model is more or less effective than the alternatives requires evaluating all of the candidates against a common set of data.
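
As a toy illustration of that point (not something from the Open Data Hub itself), the sketch below fixes one shared train/test split and scores two candidate models against it, so any difference in the numbers reflects the models rather than the data they saw:

```python
# Sketch: compare two candidate models on the same held-out data.
# The models and the bundled dataset are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# One fixed split, shared by every candidate model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```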

Opening up datasets has its own set of challenges. For example, sharing medical data for research purposes can run afoul of HIPAA regulations. Differential privacy, which bounds how much any individual’s privacy can be compromised when their information contributes to aggregate statistics, is an area of active research. Nonetheless, it’s clear that we increasingly need to think about openness in the context of code and data that are intertwined rather than just one or the other.
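
For a flavor of how differential privacy works mechanically, here is a minimal sketch of the classic Laplace mechanism: noise scaled to the query’s sensitivity divided by the privacy budget epsilon is added to an aggregate before release. The epsilon values and data are purely illustrative.

```python
# Sketch of the Laplace mechanism for releasing a private count.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(records, epsilon):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    query's sensitivity is 1 and the Laplace noise scale is 1/epsilon.
    """
    return len(records) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

records = list(range(10_000))          # stand-in for private records
print(dp_count(records, epsilon=0.1))  # more noise, stronger privacy
print(dp_count(records, epsilon=1.0))  # less noise, weaker privacy
```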