How Computers Became More Specialized Again

Computing styles ebb and flow. The centralized mainframe in the glass room largely ebbed in favor of the PC revolution that itself gave way, at least in part, to the web and the cloud. Today, we have a complex mix of massive datacenters, Internet-of-Things (IoT) devices, and sophisticated computers we can hold in the palm of our hand.

We’re also in the midst of a shift with respect to the processors we’re using. As an industry, we transitioned from an often vendor-specific collection of processor and system architectures to one dominated by the x86 instruction set in all but the lowest power and lowest cost applications. However, we’re now seeing a shift back to a more varied processor landscape.

This shift is old news for some types of computers. x86 CPUs never truly gained a beachhead in mobile beyond conventional laptops. With smartphones, tablets, Chromebooks, and IoT devices occupying an ever-increasing slice of the client computing pie, alternative architectures, especially those based on ARM, continue to expand their relative footprint.

The latest news, though, is what’s happening on the server front. Red Hat’s recent announcement of ARM server support for Red Hat Enterprise Linux is one data point. Another is the increased role of graphics processing units (GPUs) and specialized application-specific integrated circuits (ASICs) such as the tensor processing unit (TPU) designed by Google for machine learning.

Three major trends are playing into this shift: semiconductor technology, computing at massive scale using open source platforms, and machine learning. This convergence was captured by Google datacenter head Urs Hölzle at the recent Structure conference: “Machine learning is the first indication that, with the dying of Moore’s Law, you really have to build specialized architectures for specialized operations.”

(This is the point where I insert the obligatory disclaimer that “Moore’s Law” is a convenient but somewhat inaccurate shorthand for the rate of CPU performance increases. Moore’s Law is really an arguably self-fulfilling historical observation about the rate at which transistor density can be economically increased. But this density has generally correlated closely enough to useful performance that the term has stuck.)

Beyond Moore’s Law

The problem is that we seem to be reaching the physical limits to how small we can shrink transistors.

There are other paths forward to increase performance. It seems as if there’s a consensus developing around 3D stacking and other packaging improvements as good near-ish-term bets. Improved interconnects between chips are likely another area of interest. Though a few years old, this presentation by Robert Colwell, given at Hot Chips in 2013 when he was director of the Microsystems Technology Office at DARPA, is still a good read.

However, Colwell also points out that from 1980 to 2010, clock rate improved 3500X (in large part through process shrinks) while other improvements contributed about another 50X performance boost. In other words, Moore’s Law has overshadowed just about everything else. This is not to belittle in any way all the engineering work that went into enabling CMOS process technology or the other work that has helped to translate transistor count increases to useful performance. But understand that CMOS has been a very special unicorn and an equivalent CMOS 2.0 isn’t likely to pop into existence anytime soon.

When I was an analyst, we took lots of calls from vendors wanting to discuss their products. Some were looking for advice. Others just wanted us to write about them. In any case, we saw a fair number of specialty processors. Some were designed around some sort of massive-number-of-cores concept. Others optimized for performance per watt in different ways.

Almost universally, these startups didn’t make it. Part of it is just that, well, most startups don’t make it and the capital requirements for even fabless custom hardware are relatively high. However, there was also a pattern.

Even in the best case, these companies were fighting a relentless doubling of processor speed every 18 to 24 months driven by companies like Intel and AMD on the back of enormous volume. So these startups didn’t just need to have a more optimized design than x86. They needed to be much better to compete, on much lower volume, against an incumbent improving at a rapid, predictable pace. Furthermore, in many cases, they were fighting against the enormous software inertia of the x86 platform, especially with proprietary software vendors that had spent the past couple of decades reducing the number of platforms they supported.

It was a tough equation.
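To put rough numbers on that equation, here is a minimal sketch of the arithmetic. It assumes an idealized incumbent that doubles performance on a fixed schedule and a startup whose design never improves after launch. Both are simplifications, and the function name and the 10x figure are purely illustrative.

```python
import math


def months_until_parity(advantage, doubling_months):
    """Months for an incumbent that doubles its performance every
    `doubling_months` to erase a fixed head start of `advantage` (10.0 = 10x).

    Assumes smooth exponential improvement by the incumbent and no
    improvement at all by the challenger; both are simplifications.
    """
    if advantage < 1 or doubling_months <= 0:
        raise ValueError("advantage must be >= 1 and doubling_months > 0")
    # The incumbent needs log2(advantage) doublings to catch up.
    return math.log2(advantage) * doubling_months


# A hypothetical 10x head start against the 18- and 24-month doubling rates.
for period in (18, 24):
    months = months_until_parity(10.0, period)
    print(f"doubling every {period} months: parity after {months / 12:.1f} years")
```

Under these assumptions, even a 10x head start is gone in roughly five to six and a half years, which isn’t long once you account for the time it takes to design, ship, and build a software ecosystem around a new chip.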

But take away the predictable metronome of process shrinks and the equation changes. If you increasingly lack the option of just waiting for a process shrink, you have to consider alternative processor architecture and co-processor options even if they increase the amount of work you need to do in software development and operations. For example, as Google’s Hölzle also noted at Structure, your cluster management needs to deal with heterogeneity.

Open Source, Supercomputers, and the Cloud

It’s also just more practical to make use of a variety of processing options than it has been in the past. Many of the processor cycles today are consumed on supercomputers, massive Internet sites, and cloud service providers. They all run Linux and other open source software, which they and others can optimize for a wide range of hardware. In fact, the most recent TOP500 list of the fastest supercomputers contains nothing but systems running Linux.

The organizations running these computers, from national labs to Amazon to Google, operate at large scale and have deep technical expertise. They can therefore amortize hardware optimizations over high volumes and have the capability to support specialized designs. It’s a very different situation from that of typical enterprises, which increasingly standardized on a single general-purpose instruction set over time.

Machine Learning

And nowhere do these optimizations come into play more obviously than in the case of machine learning generally and deep learning specifically, which many of the Internet giants both use internally and offer, in various guises, as a service.

Deep learning works by training a model with a huge number of matrix multiplications over a large data set. It’s a very computationally intensive task in aggregate but the individual operations are fairly simple and can be carried out in parallel to a significant degree.
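To see why these operations parallelize so well, consider a minimal sketch of matrix multiplication in plain Python. Each row of the result depends only on one row of the first matrix and on the second matrix, so rows can be handed to separate workers with no coordination between them. The function names are my own, and the thread pool is only there to show that independence; in CPython it won’t actually make pure-Python arithmetic faster.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, col):
    """Multiply-accumulate: the simple operation repeated many times over."""
    return sum(x * y for x, y in zip(row, col))


def matmul_row(row, b_columns):
    """Compute one output row. It needs no other output row to do so."""
    return [dot(row, col) for col in b_columns]


def parallel_matmul(a, b, workers=4):
    """Multiply matrices a (n x k) and b (k x m), one task per output row."""
    b_columns = list(zip(*b))  # transpose b once so columns are easy to reach
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so rows come back in the right place.
        return list(pool.map(lambda row: matmul_row(row, b_columns), a))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matmul(a, b))  # [[19, 22], [43, 50]]
```

The same independence holds all the way down to individual output elements, which is what lets hardware with thousands of simple arithmetic units keep them all busy.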

It turns out that this is a good match for a GPU, which has thousands of small, efficient cores designed for handling multiple tasks simultaneously. (By contrast, a CPU has relatively few cores, mostly optimized for sequential processing.) Since Nvidia came out with CUDA in 2007, it’s been possible to write programs for GPUs in a high-level language. GPUs can thereby be paired with CPUs to offload compute-intensive tasks. However, as Hölzle also noted, a GPU is still pretty general purpose for machine learning.

A custom processor, such as Google’s TPU, is even more optimized for machine learning. (Google says a CPU plus TPU is 15x to 30x faster than a CPU plus GPU on AI workloads and is even better on a performance per watt basis.) Google describes the TPU, which is built around an architecture called a systolic array, in this post.
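For a feel of what a systolic array does, here is a toy cycle-by-cycle simulation in plain Python. It is a generic textbook arrangement in which each cell accumulates one output value, and it should not be read as a description of how the TPU itself is organized. Every name in it is my own. The idea it illustrates is that operands enter at the edges and are passed from neighbor to neighbor, so each cell does a multiply-accumulate per cycle without going back to memory.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of cells.

    Rows of `a` enter from the left edge and move right one cell per cycle;
    columns of `b` enter from the top edge and move down. Inputs are skewed
    in time so that matching operands meet in the right cell.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]  # operand each cell passes rightward
    b_reg = [[0] * m for _ in range(n)]  # operand each cell passes downward

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        # Walk from the far corner back so that neighbors still hold the
        # values from the previous cycle when we read them.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:               # left edge: feed row i, delayed by i
                    idx = t - i
                    a_in = a[i][idx] if 0 <= idx < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:               # top edge: feed column j, delayed by j
                    idx = t - j
                    b_in = b[idx][j] if 0 <= idx < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in  # one multiply-accumulate per cycle
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each input value is read from outside the grid once and then reused as it ripples across a whole row or column of cells. That reuse, rather than raw clock speed, is where this style of design gets its efficiency.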

Machine learning is a particularly pertinent example of a workload that is computationally intensive, is widely used, benefits disproportionately from optimized hardware, and can often be offered to users as a service in a way that insulates them from the underlying hardware. With a library such as TensorFlow, for example, you don’t need to know what hardware it is running on; you write to the library.

Conclusion

There is, to be sure, still a great deal of inertia behind general-purpose hardware. And, indeed, there are good reasons to focus on designs that work well across most workloads. Gratuitous specialization leads to a lot of duplicated effort, incompatibilities, and fragmented communities.

We’ll likely see some areas of acceleration standardize and possibly even be folded into CPUs. Other types of specialty hardware will be used only when the performance benefits are compelling enough for a given application to be worth the additional effort, as in the case of machine learning. Linux can also help to abstract away changes and specializations in the hardware foundation as it has in the past. And, as I’ve noted, the increased use of open source software more broadly means that even end-user companies have far more options to modify applications and other code to use specialized hardware than when they were limited to proprietary vendors. Services delivered over the network are an alternative way to abstract server hardware from users and applications.

It’s clear, though, that we’re seeing a shift in the patterns that have dominated the computer industry for the past couple of decades or so. Much has already taken place in software with open source, containers, cloud services, and more. But there will now have to be an adjustment to a world where server hardware doesn’t just automagically, from the perspective of a user, get faster at a steady, consistent rate.
