Fact and Fiction In Zero Trust Architecture
Zero Trust Architecture (ZTA) is a panacea for organizational security challenges... at least that's what vendors would have you believe. There is so much confusion around what ZTA is and isn't that it's easy to believe that it's all smoke and mirrors; a grift to siphon millions from corporations desperate to shift their liability to someone else.
ZTA is a real thing but it's not a panacea or product. It's a philosophy for designing interactions between clients and services. Implementing ZTA involves tackling a mountain of individualized complexity. In this post I'll discuss some of this complexity and in doing so I hope to separate the sham from the substance and help you decide whether or not ZTA is something worth pursuing within your organization. Along the way I hope to show that implementing a Zero Trust Architecture is not an advanced technology problem, but one of integrating technologies we're already familiar with.
Much like Kubernetes if you don't know exactly why you need Zero Trust Architecture, you probably don't and the process of trying to implement it will be expensive and painful. You should know what you're getting into.
I was on the Google security team during the conception and implementation of BeyondCorp: Google's ZTA solution. I developed components related to machine inventory and device access. I know this flavor of Kool-Aid very well and I've seen things that worked very well and things that went poorly.
Everything I'm discussing here is relative to Google's BeyondCorp project. Other ZTA strategies may exist, I'm not an authority on them. Additionally, I left Google in 2016 and I'm confident that their solution has seen significant change since that time. That said, I'm sure the concepts are still valuable.
It's a common aspect of network architecture that you divide your network into different segments for different purposes. Perhaps you have a common network for clients, a restricted network for internal services, a guest network for untrusted devices, and one or more lab networks for R&D or testing.
The common client network is intended for corporate devices. Those devices need access to services that shouldn't be exposed to the Internet at large and those services are either on the same network as the clients, directly exposed to those clients, or with minimal segregation and filtering. These are company client devices and are assumed trustworthy.
The internal network may have poorly-protected ingress points such as shared WiFi passwords or network jacks with no access control. Arbitrary devices can access these internal networks, regardless of whether or not they are corporate devices.
Even devices on networks with strong mechanisms to keep out non-corporate devices (e.g. 802.1x) aren't strictly trustworthy. Compromised clients can relay traffic from external attackers onto the internal network.
In both cases the network has a hard outer shell and a soft interior. That hard shell is easily bypassed by determined attackers, bringing the bad guys onto the good network. Security practitioners have long understood that trusting the internal network is tenuous at best.
What Is Zero Trust Architecture?
Zero Trust Architecture primarily attempts to address the trust misplaced in corporate networks. The crux of ZTA is that client devices are distinctly identified and that access to the service is granted not just based on the access granted to the user but also based on the security posture of the device. In the following sections we discuss different aspects of device authentication and authorization, each progressing from the status quo to the ZTA ideal.
Distinctly identifying the client device requires each device to have unique credentials that can be used to distinguish between it and others. A simple device secret could suffice, as is the case with classic Kerberos. When these secrets are held in the filesystem, an attacker compromising the device can copy that secret and impersonate the device.
Client x509 certificates are a commonly used mechanism for device authentication with mutual TLS (mTLS). Usually a client certificate's private key is held in the filesystem and can be exported by an attacker like a simple secret. Managing an internal corporate x509 public key infrastructure (PKI) is tricky and requires internal development to manage if it is to operate successfully at a non-trivial scale. There are few standards for device enrollment and little in the way of off-the-shelf solutions.
Ideally, a certificate's private key is held in hardware in a way that it can't be exported, such as in a Trusted Platform Module (TPM). Clients with certificates based on non-exportable keys have the strongest device authentication. Issuing certificates for these devices can be tricky and require custom automation beyond conventional PKI certificate enrollment processes.
Many devices will support mTLS but can't provide certificates based on non-exportable keys. Most commonly this includes user devices such as laptops and desktops with software configurations that support mTLS but lack the crypto hardware for non-exportable keys. These devices can uniquely identify themselves but not in such a way that they can't be impersonated if compromised.
Finally, there are devices that can't authenticate themselves strongly, like simple IoT devices, or at all, like classic network printers.
Device State Based Access
Services normally don't consider the client device at all when making access decisions. In some rare cases services will attempt to identify the client device as "one of ours" with mTLS. It is almost never the case that access is granted or denied based on how secure the device is.
Corporate IT departments maintain a device inventory. If it's up to date devices can be recognized as corporate assets, assuming they can be identified.
IT departments often deploy centralized device management solutions. An agent runs on each device and reports the current state of the device to a central system and potentially corrects deviations from the desired configuration. The central management system knows whether or not a device has the properties of a secure configuration such as full disk encryption, up to date patches, client firewall, etc.
Similarly there are often endpoint protection solutions deployed to each device. These attempt to prevent intrusions and will report attempted intrusions and indications of compromise to a central monitoring system.
Ideally, services that make access decisions would identify the client device and consider its current state in granting or denying access to resources. Basic device inventory can distinguish between "friendly" and "unfriendly" devices. The inventory indicates who the device was issued to and access might only be granted to friendly devices where the authenticated user matches the user the device was issued to.
Device management knows if the device has a suitable security configuration and that the configuration has been checked recently. Similarly endpoint security knows with some certainty if the device has been compromised.
Device State Policy
When we grant users access to protected resources we do so while ensuring that those users are capable of and prepared to protect those resources. We do this through user training and hopefully not putting a user in a role where they can't fulfill their security obligations.
We don't usually take into account similar considerations for devices. A lot of valuable information about device posture is available but not taken into consideration when making an access decision. Should a device that is known to be compromised be allowed to access any resources that aren't part of the remediation process? Should a device be permitted to download intellectual property if it's not known to be able to protect that data when that device is lost or stolen?
Devices accessing the most sensitive resources should provide the best available protections. They should have full disk encryption with the company's configuration to reduce the chances that data can be recovered if the device is stolen. This configuration may be distinct to each organization. There may be multiple acceptable configurations in the device fleet simultaneously, especially when users have varied platforms. When a decision is made to grant or deny access to a resource, the policy system should understand what can constitutes appropriate full disk encryption and know how to verify it by checking with configuration management.
This extends to other security and IT controls like endpoint security, device management, automated backups, etc. These each play a role in protecting data and resources and it makes sense to deny access to devices that can't protect those resources.
It's not feasible, however, to require the strictest device configuration to access all resources. A service providing software and updates may need to provide packages and configuration required for the device to be in the desired configuration. The device can't attest to its endpoint security configuration if it doesn't have the endpoint security software installed.
In a Zero Trust Architecture the current state of the accessing device is taken into consideration as much as possible when making access decisions. Which security properties are required and which are nice to have varies from organization to organization and even from resource to resource. Access policies cannot be delivered "canned" or prescribed by an external group and can't be part of a generic solution.
Knowing and codifying device security requirements for resource access is useless if those policies cannot be enforced. While many services can authenticate a client through mTLS, authorization following from that is rare. Services will sometimes have rich languages for doing access control for users but rarely for devices. Having state data to make access decisions on is a complex problem.
Supposing the service has the device policy to enforce, it needs to know the device's state. Aspects of the device's state could be encoded into the certificate it presents. That state is then fixed for the lifetime of the certificate. If the device's state degrades the service won't know either until the certificate expires or it appears in revocation. Short lived certificates add operational overhead and if certificate enrollment requires user interaction it becomes a burden to the user. If the incident response team wants to keep a device from accessing sensitive resources, should they be content waiting for daily cert expiration? Revocation is an alternative but revocation is rarely used and the state of the device likely changes frequently.
Few services have the capability to enforce these policies. The simplest solution is to put services behind a reverse proxy that can enforce policy. This proxy then needs access to device state information. If it's not delivered as part of mTLS it needs to be available in some other way. The state information either needs to be synchronized into the proxy itself or the proxy needs a way to retrieve state information. If the proxy is retrieving the information it either needs to make queries to the sources of truth at request time which has performance cost and reliability risk or it needs to synchronize the data from those sources locally. This has better performance and reliability but will naturally provide stale results some of the time.
Given a standard way to retrieve state data, does each service have its own integration? This can represent large development or operational overhead. Device state can be centrally collected to a service of its own. Policy decisions can then be made using state data managed by a shared resource. This simplifies the individual services but adds a potential point of failure to the serving infrastructure.
Getting the state information to consult is a challenge on its own. Device management, endpoint security, and related systems often have APIs that automation can interact with but these APIs are usually unique. There are few common APIs or data schemas to allow for interoperability. For each API you wish to retrieve data from you likely need to create a custom integration.
The proxy solution is viable for services you control. Cloud SaaS services are another matter entirely. An organization can try to put a cloud service behind a reverse proxy but this is frequently brittle and users will often find a way around the proxy which will usually be faster.
All of this is for HTTP services. It's possible to create a TCP proxy that can enforce device policy but how does it deny a connection? The general solution is to send a RST. For the user, this is indistinguishable from a non-security-related service failure; they can't tell if they've been denied or if the service is broken, leading to support headaches. This TCP proxy could be aware of the protocol and attempt to send a protocol specific message, such as IMAP AUTHORIZATIONFAILED. Now the proxy needs to be aware of the protocols it's passing.
Implementing a real Zero Trust Architecture presents a lot of problems. The solution for Google is straightforward: leverage an army of world-class software engineers. Google created most of the technology it uses and has the resources to adapt them to new standards. As a web technology company it's not crazy for them to make every service a web service and put it behind a web proxy. They use very little vendor software and have little trouble building integrations between those systems that generate state data, those that house the state data, and those that need the state data.
Chances are, you're not Google or a company with similar resources. Every organization has their own security infrastructure and device configurations. Taking all of those unique properties into account requires custom integrations. The whole point of Zero Trust Architecture is to make security decisions based on how the organization's devices should be configured and how they should behave. That information is available in the disparate device management solutions but the systems that should be making use of it don't have access to it.
There aren't off the shelf solutions to pull this together so if you want it you have to build it. This was a significant undertaking even for the might of Google, but Google has orders of magnitude more resources, infrastructure, and sources of data than most companies. While you may not have the resources to throw at the problem, you also don't have the same scale of problems.
The goal of Zero Trust Architecture is to use all available data when making an access decision. The challenge is that this data isn't very available. For Zero Trust Architecture to be realistic for all but the most well-heeled organizations there needs to be new standards, particularly around interoperability. Systems collecting device state data must make that data available to automation in a common way to centralized collection systems such that integration is a matter of configuration rather than bespoke software development. Similarly, we need standards for retrieving that data by systems that have to make access decisions.
It Means Don't Trust Anything
Zero Trust is a terrible name; Zero Faith would be more accurate. It's really about being very explicit about what does and doesn't make a device trustworthy for a certain action.
It Means All Devices Use mTLS
Yes and no. Client certificates are much stronger device authentication than IP addresses, for sure. In the simple case a client certificate and private key are simply files that can be exported from the device. An attacker with these files can then impersonate a compromised device from other devices they control. Ideally, mTLS is rooted in TPM credentials which cannot be exported from the device.
A compromised device with non-exportable credentials can perform proper authentication under an attacker's control. Strong client authentication only establishes the device's identity, not that the device is trustworthy.
Some devices simply can't do mTLS on their own. A multifunction copier is never going to be subject to strong authentication or state attestation. For those devices, the services they need to access either need to have more lenient security requirements or something else must act on the device's behalf.
It Means Get Rid of Your Firewalls
We use firewalls to segregate "our network" from "not our network"; while not strictly safe the former is safer than the latter. If we don't trust the internal network there's no point in segregating it.
Having devices on an internal network doesn't make them "good", but it can make them "not obviously bad". Your firewall can keep out obviously bad stuff. If you don't run your own mail services, is there any reason to permit mail traffic onto your network? If you never expect to receive traffic from some part of the world, maybe you can afford to preemptively filter it out. Your coarse filtering devices help ensure that your network resources are devoted to your needs.
It Means Get Rid of Your VPN
VPNs are a useful tool. They also allow your users to extend the corporate network into arbitrary other networks. When services can make use of strong device authentication it becomes more reasonable to make them Internet-facing. This provides convenience for the users because it can reduce the time users need to be on the VPN. For some services it will be infeasible for the organization to restrict access with strong device authentication. For these services a VPN might be the right solution.
It's a Product
There's no shortage of vendors who will sell you a Zero Trust Architecture solution. At best they are taking creative liberties with the term. At worst they know they're selling snake oil. In between are a lot of products and services that represent a partial solution.
No vendor is offering a solution that works with all of your services and understands your all your client devices. Perhaps it's a viable offering if they provide 100% outsourced IT. Then the vendor controls all the clients and all the services and can ensure that all aspects of IT infrastructure have Zero Trust solutions.
Zero Trust Architecture is the future of network security. As is often the case the future has not been evenly distributed. If we want this capability to be commonly available we must put pressure on vendors to ensure simple, robust, efficient APIs are available for retrieving data about our devices from their products. For services, we need them to have mechanisms to make use of the data being collected. When vendors are competing for our contracts we can choose interoperability as a key deciding factor.