Seeing the Trees in the Forest: Anomaly Detection with Prometheus

Red Hat’s work in artificial intelligence is currently taking three main directions. First, our engineers see AI as a workload that our platforms must support, and as something that can be applied to Red Hat’s existing core business to make open source development and production more efficient. In short, Red Hat thinks AI can be good for our customers and good for us, too.

Second, Red Hat is collaborating with the Mass Open Cloud project to establish the one thing that all AI tools need the most: data. Our team members are working on the Open Data Hub, a cloud platform that lets data scientists spend less time on dealing with infrastructure administration and more time building and running their data models.

The third aspect of Red Hat’s work in AI right now is at the application level. More to the point, how can developers plug in AI tools to applications so that data from those applications can be gathered for storage and later modeling?

This creation of “intelligent apps” was the focus of Marcel Hild’s discussion at the Open Source Summit EU in Edinburgh, Scotland on October 22. Hild, a principal software engineer within Red Hat’s AI Center of Excellence (AI CoE), highlighted how tools such as Prometheus can be used to monitor data from connected applications as a first step toward detecting anomalies.

Beginning the Path

In his talk, Hild was quick to point out that right now there is no end-to-end solution available for this sort of data gathering. While Prometheus is a good open source tool to start the monitoring process, it is only the first part of a larger toolchain.

Hild laid out a simplified architecture for Prometheus, describing how the tool can be set to watch specific targets within an application, each of which reports what is happening with that part of the application at that moment. Prometheus pulls the target data into its time-series database. Even on its own, this monitored data can be used to set up alerts in real time.
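
As an illustration of getting monitored data back out of Prometheus, here is a minimal Python sketch that builds a range query against the Prometheus HTTP API and parses the result. It is not taken from Hild's talk. The server address, the "up" query, and the helper names are hypothetical; the /api/v1/query_range endpoint and the response layout follow the documented Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_range_query_url(base_url, promql, start, end, step):
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Convert a query_range payload into {labels: [(timestamp, value), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        # Use the sorted label pairs as a hashable key for each series.
        key = tuple(sorted(item["metric"].items()))
        # Prometheus encodes sample values as strings; convert them to floats.
        series[key] = [(float(ts), float(val)) for ts, val in item["values"]]
    return series


def fetch_range(base_url, promql, start, end, step):
    """Run a range query (requires a reachable Prometheus server)."""
    url = build_range_query_url(base_url, promql, start, end, step)
    with urlopen(url) as resp:
        return parse_matrix(json.load(resp))
```

A call such as fetch_range("http://localhost:9090", "up", 1540000000, 1540000030, "15s") would return the samples for each matching series, assuming a Prometheus server is listening at that hypothetical address. Keeping the parsing separate from the network call makes it easy to feed the same records to whatever storage or analysis step comes next.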

But if you want to examine this sort of data for longer-term trends, the inherent nature of a time-series database (TSDB) presents an immediate obstacle. Time-series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. Within Prometheus, such data are kept for a relatively short term, and the database is not well suited to the petabyte-sized scale needed for AI analysis.

One potential solution is Thanos, which is planned to eventually provide the capability to hold Prometheus data on a larger scale and for a longer time. But, unfortunately, Thanos is in its early days of development and “it wasn’t quite ready for our purposes,” Hild told his audience. In an interview after his talk, Hild did add that Thanos’ status “might have changed since then. So be sure to give it a spin.”

Another tool with potential that the AI CoE is working on connecting to Prometheus is InfluxDB. But, while InfluxDB provides some excellent storage capabilities to Prometheus and data scientists really like the tool, Hild told his audience that it currently “eats RAM for breakfast.” InfluxDB’s “open source version does not have clustering features. Users must look to InfluxData’s commercial offerings for clustering features.”

For now, Hild and his team have devised another solution using tools that already work with Open Data Hub: scraping the data into a Ceph storage cluster and then using Apache Spark to analyze the stored data from Prometheus’ monitoring. This solution also provides a good migration path to Thanos, because users are already working against the same Prometheus API.

Where is the Anomaly?

Once the time-series data has been gathered and effectively stored at scale using all open source tools, the work of training the AI tools can really begin.

Specifically, AI tools can look at consistently gathered time-series data and determine:

  • Trends. An increase or decrease in the series over a period of time.
  • Seasonality. A regular pattern of up and down fluctuations. This is a short-term variation occurring due to seasonal factors.
  • Cyclicity. A medium-term variation caused by circumstances that repeat at irregular intervals.
  • Irregularities. Variations that occur due to unpredictable factors and also do not repeat in predictable patterns.
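
The components listed above can be illustrated with a small decomposition sketch in plain Python. This is a classical additive decomposition chosen for illustration, not the method Hild's team uses: a centered moving average estimates the trend, per-phase averages estimate the seasonality, and whatever is left over stands in for the irregularities. It assumes a known, odd seasonal period and does not try to separate cyclicity from the trend.

```python
from statistics import mean


def moving_average(values, window):
    """Centered moving average; None where the window does not fit."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = mean(values[i - half:i + half + 1])
    return out


def decompose(values, period):
    """Split a series into (trend, seasonal, residual) with an additive model."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    trend = moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]

    # Average the detrended values at each position within the cycle.
    phase_means = []
    for phase in range(period):
        vals = [d for i, d in enumerate(detrended)
                if d is not None and i % period == phase]
        phase_means.append(mean(vals) if vals else 0.0)

    # Center the seasonal pattern so it averages to zero over one cycle.
    offset = mean(phase_means)
    seasonal = [phase_means[i % period] - offset for i in range(len(values))]

    # Whatever trend and seasonality do not explain is the irregular part.
    residual = [v - t - s if t is not None else None
                for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual
```

Large residuals are the natural candidates for anomalies, since they are the part of the signal that neither the trend nor the repeating pattern accounts for.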

To find such patterns in the Prometheus data, Hild described Prophet, an open source tool from Facebook that forecasts time-series data “based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.”

With Prophet (though any open source forecaster could work), the stored data can be used to train a model that finds and highlights anomalies generated by an application. One practical use is predicting how much latency is expected from a given containerized application. If predictions match reality for one application, then the data model could be applied to other applications to predict their performance.
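
Once a forecaster has produced an expected range for each point in time, flagging anomalies reduces to comparing observations against that range. The sketch below shows only that comparison step and is an assumption about how it might be done, not the team's implementation. With Prophet, the bounds would typically come from the yhat_lower and yhat_upper columns of the forecast; here they are passed in as plain lists so the example runs without any forecasting library. The function name and the latency figures in the usage note are hypothetical.

```python
def flag_anomalies(timestamps, observed, lower, upper):
    """Return (timestamp, value, excess) for points outside the forecast band.

    `excess` is how far the observation falls outside the nearest bound.
    """
    anomalies = []
    for ts, y, lo, hi in zip(timestamps, observed, lower, upper):
        if y < lo:
            anomalies.append((ts, y, lo - y))
        elif y > hi:
            anomalies.append((ts, y, y - hi))
    return anomalies
```

For example, given made-up latency readings of 0.10, 0.45, 0.12, and 0.02 seconds against a forecast band of 0.05 to 0.20 seconds, the second and fourth readings would be reported, one above the band and one below it.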

While there is no end-to-end open source solution available yet, Hild and his team have put together a simple application to collect data from a Prometheus host and train a model on that data, using the Prometheus-Ceph-Spark-Prophet toolchain Hild described, hosted on OpenShift. There are plans for a pool of predictive data models to be made available so data scientists can share what their tools have learned.

As this work continues, the practical benefits of AI analysis at the application level could enhance and improve the container development and deployment ecosystem to a great degree.
