Keylime: Using TPM to Secure Your Slice of the Cloud

Jun 25, 2019 | Trust

As people move workloads to shared and public cloud environments, what methods are available to attest that their environment has not been tampered with? Is there a good way to use a standardized cryptographic module for remote attestation, trusted system boot, and related tasks?

In this post we’ll introduce the Keylime project in some detail, and save a technology demo for a following hands-on article.

Keylime is an open source, community-based project that aims to be the go-to technology for establishing and maintaining trusted infrastructure in distributed system deployments. It builds on two technologies: embedded Trusted Platform Module (TPM) hardware (version 2.0 and later) and the Linux kernel’s Integrity Measurement Architecture (IMA) subsystem.

Keylime’s performance-oriented architecture allows remote attestation and IMA monitoring of thousands of nodes. Through the development of a virtual TPM (vTPM) quote, Keylime can also scale to thousands of virtual machines running on a single host, while reducing the performance penalty of calling the cloud node’s hardware TPM directly to cryptographically sign data.

The TPM is a chip-based root-of-trust facility introduced by the Trusted Computing Group (TCG), standardized in 2009, and updated in 2015 to version 2.0. The TPM standard includes general system trust facilities such as random number generation, secure key generation, data encryption, and remote attestation. Version 2.0 is not backward compatible with previous TPM versions, so Keylime targets only version 2.0 of the standard going forward, although the current Python implementation of Keylime continues to support version 1.2.

Keylime supports a number of use cases covering domains from trusted system boot through trusted workload execution. First, Keylime targets the Trusted Boot use case by providing measured boot functionality and secrets provisioning using encrypted payloads. After a trusted measured boot completes, Keylime enables runtime integrity checks and verification.

Structure of Keylime

Using the components described below, Keylime attests system integrity during node provisioning as part of a Trusted Boot workflow, and continuously attests the trustworthiness of the infrastructure while it is operational. The only external component Keylime needs is a functional TPM on each infrastructure node where attestation is desired.

The three main components of the Keylime system are the Agent, the Registrar, and the Verifier. All three were initially developed in Python. Components with greater performance and security needs, like the Agent, are being ported to Rust for its performance as a low-level systems language and for the strict ownership model enforced by its compiler.

The Keylime Agent must be installed on each node in the infrastructure where attestation is desired. The Agent interacts with the TPM of the system it resides on, and so handles TPM 2.0 functions such as requesting cryptographic quotes. It then communicates the collected information back to the other system components so they can process the trust chain.

The Verifier, also known as the Cloud Verifier, is responsible for bootstrapping a new node into the system and for continuously requesting quotes from each Agent. The Verifier then performs attestation on the returned quotes to determine whether there have been any unauthorized changes to the remote systems.

Rounding out the set of Keylime components is the Registrar. The Registrar maintains the set of known secure (public) key values used during attestation processing. The Agent on each node registers itself with the Registrar at boot, locking in the initial state of the node for later comparison. The Registrar’s key set also includes the public keys of the hardware manufacturer of each node in the system. These manufacturer keys are used to verify that the hardware TPM is valid and can serve as the root of trust for its node.
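The registration step described above can be illustrated with a toy in-memory registry. This is only a sketch: the class name, method names, and the idea of identifying a manufacturer by a plain string are all hypothetical, and a set-membership test stands in for the real cryptographic validation of a TPM against its manufacturer’s public key. It is not Keylime’s actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyRegistrar:
    """In-memory stand-in for a registrar; not Keylime's real data model."""

    # Identifiers of TPM manufacturers whose keys the deployment trusts.
    trusted_manufacturers: Set[str]
    # Agent id -> public AIK bytes recorded at registration time.
    aiks: Dict[str, bytes] = field(default_factory=dict)

    def register(self, agent_id: str, manufacturer: str, aik_public: bytes) -> bool:
        """Record the agent's AIK only if its TPM traces back to a trusted manufacturer."""
        # ASSUMPTION: a real check would validate the TPM's EK against the
        # manufacturer's public key; set membership is a placeholder for that step.
        if manufacturer not in self.trusted_manufacturers:
            return False
        self.aiks[agent_id] = aik_public
        return True

    def lookup(self, agent_id: str) -> bytes:
        """Return the AIK a verifier would use to check this agent's quotes."""
        return self.aiks[agent_id]


registrar = ToyRegistrar(trusted_manufacturers={"vendor-a"})
accepted = registrar.register("node-1", "vendor-a", b"aik-public-bytes")
rejected = registrar.register("node-2", "unknown-vendor", b"other-aik")
print(accepted, rejected)
```

The point of the sketch is the ordering: the manufacturer check gates the registration, so only AIKs from nodes with a recognized hardware root of trust are ever available for later quote verification.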

Also included in the Keylime tooling is a tenant CLI utility (keylime_tenant), which uses the Keylime RESTful interfaces to communicate with the Keylime components. Users can employ the tenant utility, use the Keylime web UI, or integrate a management system with Keylime directly through its REST API.

Keylime also includes a simple Certificate Authority (CA) that can be managed by the tenant utility or through its dedicated Keylime CA utility (keylime_ca). The CA is an integral part of initially establishing trust during the bootstrapping phase of node provisioning and in enforcing the trust relationship of the node thereafter. The CA is initially responsible for signing all boot keys sent to the nodes being provisioned, establishing the initial trust the system relies on. If the Verifier detects a breach of the established trust via broken attestations, the CA is notified and expected to revoke the trust by invalidating the keys associated with the compromised node.

How Keylime Supports Trusted Workflows

To understand how Keylime supports trusted workflows, start with a few key concepts: the Endorsement Key (EK), the Storage Root Key (SRK), and the Attestation Identity Key (AIK).

The EK is the base hardware root-of-trust key, burned into the TPM by its manufacturer. The EK uniquely identifies each TPM, never leaves the TPM, and is never erased. The manufacturer publishes a public key that is used to check that a TPM’s EK is valid during node registration.

The SRK is a key generated by the platform owner during initialization and is erased every time the TPM is reset to the factory defaults. The SRK is used to protect the AIK.

Finally, the AIK is a key that is only accessible to the TPM and is used to sign the attestation quotes returned by the TPM to a user.
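To make the role of the AIK more concrete, here is a minimal sketch of producing and checking a quote. Two things are assumptions that go beyond this article: the quote is bound to a caller-supplied nonce for freshness (TPM quotes carry such qualifying data), and an HMAC with a shared secret stands in for the AIK signature. A real AIK is an asymmetric key whose private half stays inside the TPM, so the function names and the dictionary layout here are purely illustrative.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List

# ASSUMPTION: a shared secret stands in for the AIK. A real AIK is asymmetric;
# the private half never leaves the TPM and verifiers hold only the public half.
AIK_STAND_IN = secrets.token_bytes(32)


def pcr_digest(pcr_values: List[bytes]) -> bytes:
    """Hash the selected PCR values into the single digest a quote covers."""
    return hashlib.sha256(b"".join(pcr_values)).digest()


def make_quote(nonce: bytes, pcr_values: List[bytes]) -> Dict[str, bytes]:
    """Simulate the TPM signing (nonce || PCR digest) with the AIK stand-in."""
    digest = pcr_digest(pcr_values)
    signature = hmac.new(AIK_STAND_IN, nonce + digest, hashlib.sha256).digest()
    return {"nonce": nonce, "pcr_digest": digest, "signature": signature}


def verify_quote(quote: Dict[str, bytes], nonce: bytes, expected_pcrs: List[bytes]) -> bool:
    """Accept only a correctly signed quote that is fresh and matches the expected state."""
    recomputed = hmac.new(
        AIK_STAND_IN, quote["nonce"] + quote["pcr_digest"], hashlib.sha256
    ).digest()
    return (
        hmac.compare_digest(quote["signature"], recomputed)
        and hmac.compare_digest(quote["nonce"], nonce)
        and hmac.compare_digest(quote["pcr_digest"], pcr_digest(expected_pcrs))
    )


pcrs = [bytes(32), hashlib.sha256(b"kernel").digest()]
nonce = secrets.token_bytes(16)
quote = make_quote(nonce, pcrs)
print(verify_quote(quote, nonce, pcrs))                    # fresh quote, expected state
print(verify_quote(quote, secrets.token_bytes(16), pcrs))  # replayed against a new nonce
```

Binding the quote to a fresh nonce is what stops a compromised node from replaying an old quote taken while it was still in a good state.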

To employ Keylime, the systems it is deployed to must be trusted-computing aware: the BIOS of a system that includes a TPM must support the measurement of firmware and bootloaders. There are also TPM-aware bootloaders that can then measure hypervisors and operating systems. Additionally, the OS can support measurement of the applications it launches, which can be accomplished via the Linux Integrity Measurement Architecture (Linux IMA) or Policy-Reduced IMA (though these architectures are unable to measure the runtime state of applications). Red Hat Enterprise Linux, CentOS Linux, and Fedora Linux are examples of operating systems that support IMA and, further, enable it by default.
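The measurement chain described above rests on the TPM’s extend operation: a Platform Configuration Register (PCR) is never written directly, only replaced by the hash of its old value concatenated with the digest of the newly measured component. The sketch below replays that operation in software for a SHA-256 PCR bank. The component names are hypothetical placeholders, since real measurements are hashes of actual firmware, bootloader, and kernel images.

```python
import hashlib
from typing import Iterable

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for a SHA-256 PCR bank


def extend(pcr: bytes, measured_data: bytes) -> bytes:
    """Return the new PCR value: SHA-256(old PCR || SHA-256(measured data))."""
    digest = hashlib.sha256(measured_data).digest()
    return hashlib.sha256(pcr + digest).digest()


def replay(components: Iterable[bytes]) -> bytes:
    """Replay a sequence of measured components starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        pcr = extend(pcr, component)
    return pcr


# Hypothetical boot chains used only for illustration.
good_chain = [b"firmware-v1", b"bootloader-v1", b"kernel-v1"]
tampered_chain = [b"firmware-v1", b"bootloader-EVIL", b"kernel-v1"]

expected = replay(good_chain)
print(replay(good_chain) == expected)      # the same chain reproduces the same value
print(replay(tampered_chain) == expected)  # a changed component yields a different value
```

Because each value depends on everything measured before it, a verifier that knows the expected components can recompute the final PCR value and compare it with the one reported in a quote.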

To support the Keylime workflow, individual components need to be started in a specific order. First, the tenant must deploy the Registrar. The Registrar can be located in the tenant’s own infrastructure or deployed as a physical system in the cloud provider’s infrastructure. The tenant attests the Registrar’s integrity state in order to start trusting it. Once the Registrar is up, it starts accepting registration requests from Agents to store their TPM AIKs.

One of the two main services of Keylime is support for the Trusted Boot workflow. Keylime can be used to verify the expected trusted state of a system throughout its provisioning workflow, so a tenant can be sure the system they receive has not been tampered with before they deploy workloads on it. Bootstrapping the system uses all components of the Keylime system, which need to be deployed in a specific order.

A novel aspect of Keylime is its protocol for attesting that a booted node can be trusted before initializing it, i.e., before configuring it to run a workload. Using this approach, called Three Part Key Derivation (TPKD), the tenant generates a new bootstrap key for the node being provisioned and splits it into two pieces. The tenant keeps one piece for its part of the bootstrapping responsibilities and passes the other piece to the Verifier so it can complete its part of the protocol.

In the TPKD process, the tenant interacts with the Agent to demonstrate the intent to provision the node, then the tenant and the Verifier send separate attestation requests to the Agent and validate the returned quotes using the Registrar. When a protocol participant receives a valid attestation result, that participant sends its piece of the bootstrap key to the node being provisioned. Once both parts of the key have been sent to the node (demonstrating that both attestation paths have completed successfully), the Agent can recombine the key and decrypt the node’s configuration data. This data, which can include private keys, is encrypted by the tenant before the provisioning protocol begins and is delivered to the node via configuration services such as cloud-init.
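The key-splitting idea can be sketched in a few lines. This article only says the bootstrap key is split into two pieces; the sketch below assumes an XOR-based split, in which one piece is random and the other is the key XORed with it, so that neither piece alone reveals anything about the key. The function names and the 32-byte key length are hypothetical choices for illustration.

```python
import secrets
from typing import Tuple


def split_key(key: bytes) -> Tuple[bytes, bytes]:
    """Split a key into two shares; ASSUMPTION: XOR-based secret splitting."""
    tenant_share = secrets.token_bytes(len(key))
    verifier_share = bytes(k ^ t for k, t in zip(key, tenant_share))
    return tenant_share, verifier_share


def combine(tenant_share: bytes, verifier_share: bytes) -> bytes:
    """Recombine the shares on the node once both have been delivered."""
    if len(tenant_share) != len(verifier_share):
        raise ValueError("shares must have the same length")
    return bytes(t ^ v for t, v in zip(tenant_share, verifier_share))


# The tenant generates a fresh bootstrap key and splits it.
bootstrap_key = secrets.token_bytes(32)
tenant_share, verifier_share = split_key(bootstrap_key)

# Each party releases its share only after its own attestation check succeeds;
# the Agent can rebuild the key only when it holds both.
recovered = combine(tenant_share, verifier_share)
print(recovered == bootstrap_key)
```

Under this assumption, a node that fails either attestation path receives at most one share, which is indistinguishable from random bytes and cannot decrypt the provisioning payload.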

The TPKD protocol is used to boot the rest of the nodes in the Keylime system in a trusted manner. The next node booted is the software CA node. The CA is brought up before any of the workload nodes because it is responsible for signing the keys sent to the workload nodes. After keys signed by the CA are reconstructed on the target node, they can be used for cryptographic tools and services like IPsec.

The final step to deploy a Keylime-based system is to boot and provision the workload nodes with the same TPKD protocol using keys signed by the CA. Once a node is fully booted and verified, it has also been enrolled with the Registrar and is available for the next Keylime service to consume.

Once bootstrapping has completed and an Agent is registered, the Verifier commences the other main Keylime service, continuous attestation. The Verifier continuously requests quotes from the Agent. Each request prompts the Agent to retrieve a quote from that node’s TPM (or vTPM). The quote is then returned to the Verifier, which cryptographically verifies that the quote is valid. If the quote is determined to be invalid (denoting that the system state of the reporting node has somehow changed), the Verifier issues a revocation notice to the CA. Once the CA receives the revocation notice, it should invalidate the affected node’s keys, effectively breaking all crypto-related network connections and services for the node.
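The polling-and-revocation loop described above can be sketched as follows. Everything here is a simplified stand-in: the class and method names are hypothetical, quote verification is reduced to comparing a reported digest string with an expected one, and a single call to poll_once represents one round of what a real verifier would do repeatedly on a schedule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class ToyCA:
    """Stand-in CA that only tracks which nodes have had their keys revoked."""

    revoked: Set[str] = field(default_factory=set)

    def revoke(self, node_id: str) -> None:
        self.revoked.add(node_id)


@dataclass
class ToyVerifier:
    ca: ToyCA
    # Node id -> expected measurement digest (hex string) for a healthy node.
    expected: Dict[str, str]

    def poll_once(self, agents: Dict[str, Callable[[], str]]) -> None:
        """Request one quote per agent and notify the CA about any mismatch."""
        for node_id, get_quote in agents.items():
            if node_id in self.ca.revoked:
                continue  # already revoked; nothing more to check
            # ASSUMPTION: string comparison replaces real cryptographic verification.
            if get_quote() != self.expected.get(node_id):
                self.ca.revoke(node_id)


ca = ToyCA()
verifier = ToyVerifier(ca=ca, expected={"node-a": "abc123", "node-b": "def456"})

# Hypothetical agents: node-b reports a state that differs from the expected one.
agents = {"node-a": lambda: "abc123", "node-b": lambda: "tampered"}
verifier.poll_once(agents)
print(sorted(ca.revoked))
```

A production loop would add the pieces this sketch leaves out: a fresh nonce per request, signature checks against the AIK held by the Registrar, and a timer that repeats the round.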

What is Next for Keylime

Improvements to the above system are currently being implemented. Work to support vTPMs in KVM and runC environments is ongoing, as is work to upgrade the vTPM base to interface with hardware based on the TPM 2.0 spec. Xen functionality is considered proof of concept only, with future vTPM-related development focusing on KVM and runC environments. We also expect to see updates to address multi-tenancy and distributed high-availability scenarios for the Verifier and Registrar components.

Keylime was initially created by a team at MIT Lincoln Laboratory. Since its inception, a small but dedicated open source community has rallied behind it. The Keylime community is currently working on packaging the project for different platforms and hardening the system by porting subsystems from Python to Rust, all while servicing bugs reported to the community by users.

In the second part of this series, we’ll dive into specific demonstrations of using Keylime.

Interested in learning more or trying it out for yourself? Come check out the community at https://keylime.dev and give the guide “Get Started With Keylime” a whirl.