A well-known tactic for identifying the root cause of a production outage is to look back at what the environment has been doing. Through log analysis, developers and operators alike can glean usage information that ideally reveals what's wrong with a given application or how it can be improved to work better.
In the early days of logging, there wasn't a great deal of activity going on, so it was possible for a human being (or two) to examine such logs and figure out what was up. It didn't hurt that the logs were not only sparse in content, but also not terribly complicated in what they reported. Alerts such as "Help, my processor is melting" really didn't take a lot to decipher. Over time, though, logs became far more voluminous and more detailed in what they reported, and applications became more distributed, which further complicates the situation.
Because of this volume, important information can go unread. The expansion of logs has made it very difficult for any one (or two, or however many) humans to read them manually and determine anything beyond the most obvious error message ("Really, I'm about to crash over here"). And even then, that kind of error can be hard to pick out among lines and lines of reports.
Then there is the problem that even if all errors are noted, they may not give the reason why the error occurred in the first place. Did that processor overheat due to a hardware failure? Or did an app somewhere get stuck in a loop that put far too much demand on one machine’s processor? Did a load balancer fail? Until the root cause is determined, operators could be sentenced to a series of reactionary machine shutdowns.
Since humans aren’t capable of keeping up with these logs, it makes sense to task a machine to try to examine the logs and figure out what might be going on. But while an application can certainly examine millions of lines of text, the original problem still remains: how is the root cause of a problem discovered?
Michael Clifford, a data scientist in Red Hat's Office of the CTO, began working on a prototype to find abnormal behavior in logs when he started as an intern with Red Hat. Zak Hassan, a Senior Software Engineer in Red Hat's Office of the CTO who leads the log anomaly detection project, has continued in this direction, specifically by building an online machine learning system in production that processes application logs and performs natural language processing and unsupervised machine learning. The project is detailed online.
“Finding the root cause of an outage is like searching for a needle in a haystack,” Zak Hassan says. “Searching for logs with the severity tag of ‘Error’ is not enough,” Clifford said in a recent conversation. “It is very complicated to track down the actual cause.”
Supervised learning is when a machine learning algorithm is given a set of data that consists of inputs to and outputs from a system. In this case, the data scientist knows what the answer is. The aim of the machine-learning algorithm is to figure out the rules governing the mapping from inputs to known outputs.
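As a minimal sketch of that idea (the data and threshold rule here are invented for illustration, not taken from the Red Hat project), a supervised learner can be handed labeled (input, output) pairs and asked to recover the rule that maps one to the other:

```python
# Toy supervised learning: labeled examples of CPU temperatures and
# their known outcomes are used to learn a decision rule. All data
# here is made up for illustration.

def fit_threshold(samples):
    """Learn a single cutoff separating 'ok' from 'overheated' readings."""
    ok = [temp for temp, label in samples if label == "ok"]
    hot = [temp for temp, label in samples if label == "overheated"]
    # Place the decision boundary midway between the two classes.
    return (max(ok) + min(hot)) / 2

# Labeled training data: (cpu_temperature_celsius, known outcome)
training = [(55, "ok"), (60, "ok"), (62, "ok"),
            (88, "overheated"), (92, "overheated")]

cutoff = fit_threshold(training)

def predict(temp):
    return "overheated" if temp > cutoff else "ok"

print(cutoff)       # 75.0, midway between 62 and 88
print(predict(80))  # overheated
```

The key point is that the "answers" (the labels) are supplied up front; the algorithm only has to find the mapping.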
Unsupervised learning, by contrast, involves pointing a machine learning tool at a set of data and seeing what patterns, if any, can be inferred from that data, without knowing in advance what those patterns might be.
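A simple unsupervised sketch (again with invented data, not the project's actual method) is outlier detection: no labels are given, and the algorithm infers a "normal" range from the data itself and flags whatever falls outside it:

```python
import statistics

# Unlabeled data: the algorithm is told nothing about which points are
# "normal"; it infers a typical range on its own. Values are invented.
latencies_ms = [101, 98, 103, 99, 102, 100, 97, 540, 101, 99]

mean = statistics.mean(latencies_ms)
stdev = statistics.stdev(latencies_ms)

# Flag points far from the bulk of the data as candidate anomalies.
outliers = [x for x in latencies_ms if abs(x - mean) > 2 * stdev]
print(outliers)  # [540]
```

Here the 540 ms spike is flagged without anyone ever labeling it anomalous; the pattern emerges from the data alone.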
At first glance, this might not sound like a hard problem to solve, and natural language processing (NLP) would be overkill. Error and informational messages aren’t exactly verbose, and to call them formulaic is an understatement, such as these examples:
2015-10-17 15:37:57,634 INFO dfs.DataNode$DataXceiver: Receiving block blk_5102849382819239340 src: /10.0.0.1:57800 dest: /10.0.0.1:50010
2015-10-17 15:37:57,720 INFO dfs.DataNode$DataXceiver: writeBlock blk_5102849382819239340 received exception java.io.IOException: Could not read from stream
On their own, these informational messages might not yield an answer to any problem, unless someone knew exactly where to look. A machine learning application, particularly an unsupervised one, would not. So, Clifford reasoned, NLP would be necessary to analyze even the most formulaic messages to help a machine learning data model find the patterns he was seeking.
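One common NLP preprocessing step for logs (shown here as an illustrative sketch, with regex patterns of my own devising rather than anything from the Red Hat project) is to mask the variable fields in a message, such as timestamps, block IDs, and addresses, so that millions of distinct lines collapse into a handful of reusable templates a model can count and compare:

```python
import re

def to_template(line):
    """Mask variable fields so structurally identical lines match."""
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+", "<TIMESTAMP>", line)
    line = re.sub(r"blk_-?\d+", "<BLOCK>", line)
    line = re.sub(r"/\d+\.\d+\.\d+\.\d+:\d+", "<ADDR>", line)
    return line

log = ("2015-10-17 15:37:57,634 INFO dfs.DataNode$DataXceiver: "
       "Receiving block blk_5102849382819239340 "
       "src: /10.0.0.1:57800 dest: /10.0.0.1:50010")

print(to_template(log))
# <TIMESTAMP> INFO dfs.DataNode$DataXceiver: Receiving block <BLOCK> src: <ADDR> dest: <ADDR>
```

Once lines are reduced to templates, their frequencies and co-occurrences become tractable features for an unsupervised model.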
One analogy to describe the problem is to think about the human body. Our brains are pretty good at identifying symptoms. “I have a runny nose,” or “my back hurts.” But what’s causing that backache? Is it a pulled muscle? Or is there a problem with one of your kidneys? Only further analysis of this and other symptoms would potentially reveal the root cause.
This is exactly the nature of the problem Clifford and his colleague Zak Hassan are trying to solve. The logs show the symptoms, often a great many of them. So a machine-learning tool will need to examine all of those symptoms to ascertain the patterns it can find. Perhaps, in the hours before that processor went into superheated mode, an app was throwing out certain error messages that would be early indicators of the issue. If that pattern can be discovered and, like all good science, repeated, then there is a proactive warning that says, "if you don't fix this app soon, it's about to get very warm inside this server."
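That kind of early warning can be sketched as frequency-based anomaly detection (a hypothetical illustration; the window sizes, threshold, and template names below are invented, not the project's actual design): compare how often each message template appears in the latest time window against its historical baseline, and flag large spikes:

```python
from collections import Counter

def spikes(baseline_windows, current_window, factor=3.0):
    """Flag templates whose count in the current window far exceeds
    their average count across past windows."""
    baseline = Counter()
    for window in baseline_windows:
        baseline.update(window)
    n = len(baseline_windows)
    current = Counter(current_window)
    flagged = []
    for template, count in current.items():
        avg = baseline[template] / n if n else 0.0
        # max(avg, 1.0) keeps rare templates from tripping on tiny counts.
        if count > factor * max(avg, 1.0):
            flagged.append(template)
    return flagged

# Invented template streams, one list per time window.
history = [["heartbeat", "heartbeat", "io_retry"],
           ["heartbeat", "heartbeat"]]
latest = ["heartbeat", "io_retry", "io_retry", "io_retry", "io_retry"]

print(spikes(history, latest))  # ['io_retry']
```

A surge of `io_retry` messages, well above its baseline rate, is exactly the sort of early indicator that could precede a more visible failure.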
The problems won’t have to be this dramatic (and likely won’t be), but the general idea still holds: machine learning holds the key to a potential future of diagnosing problems before they become bad enough to affect efficiency. That kind of research could yield real benefits for IT operators in the near future.