Thursday, July 14, 2022

How we discovered a 7-year-old performance issue in Elixir (2020)

Background

In this post, we will explore how, in our efforts to optimize an internal service, we stumbled upon a performance problem that had lived in the Elixir code base for seven years. To start, it is useful to have some context on the problem we were trying to solve. Tubi is a video streaming platform with millions of users streaming movies and TV shows for free every day. Every time a user hits the play button, the video player makes an API call to an internal service that fetches the video player manifest from S3, performs some modifications based on the request parameters, and returns the file to the user. One more thing to note: some of these manifests are larger than others and/or may require more processing, and much of that processing happens through regular expressions.

Our internal metrics showed that the p99 latency for this service was hovering around 400ms. Given that the manipulations were relatively simple, we suspected the bottleneck was the S3 network request, a hypothesis we verified by benchmarking locally.

As this API call was on the hot path, we wanted to optimize it further, so we reached for Redis to add a cache layer.

Alas, when we rolled out this cache layer to about 50% of our traffic, p99 didn’t go down as we expected. Instead, it increased by about 100ms. So we started digging deeper.

Investigation

We first speculated that the increase in p99 was caused by an increase in CPU load. An important detail of our cache implementation is that we added an extra compress/decompress step before saving to and loading from Redis. Specifically, we use :erlang.term_to_binary/2 with the :compressed option when we write to Redis and :erlang.binary_to_term/2 when we read. This is a technique we’ve used successfully in the past. Both operations cost CPU power, and CPU utilization did go up from ~30% to ~40% after the cache layer was enabled. But even with higher CPU utilization, we didn’t expect p99 to go up so dramatically (by 100ms), given BEAM’s powerful concurrency model. So we dug deeper.
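A minimal round-trip sketch of that (de)serialization step, assuming made-up manifest content (the real code writes the binary to Redis rather than just round-tripping it in memory):

```elixir
# Round-trip sketch of the cache serialization: the :compressed option
# applies zlib compression to the external term format on write, and
# :erlang.binary_to_term/1 transparently decompresses on read.
manifest = String.duplicate("#EXTINF:6.006,\nsegment-0001.ts\n", 1_000)

plain = :erlang.term_to_binary(manifest)
compressed = :erlang.term_to_binary(manifest, [:compressed])

# Repetitive manifest text compresses well, so the Redis payload shrinks...
IO.puts("plain: #{byte_size(plain)} B, compressed: #{byte_size(compressed)} B")

# ...and decoding recovers the exact original term.
^manifest = :erlang.binary_to_term(compressed)
```

The compression saves Redis memory and network transfer at the price of the extra CPU work described above.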

Then we noticed that although p99 and average latency increased, median (p50) latency actually decreased. This meant that with the newly added cache layer, what used to be a fast request became faster, but what used to be a slow request became slower. This indicated that the real bottleneck for these slow requests might not be the S3 network request, but something else.

To identify this hidden bottleneck, we did another round of profiling. This time, we ran eprof on a real server with real requests coming in, and the profiling report summary pointed straight at the culprit.

The report showed that 89.60% of CPU time had been spent in the Regex.precompile_replacement/1 function. And in each process’s individual profiling report, Regex.precompile_replacement/1 was always the most CPU-time-consuming function, taking 30%–90% of CPU time. By reading Elixir’s source code, we found this function was called from Regex.replace/4, which was the function we used to modify the manifest file.
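For reference, a minimal version of such a profiling session might look like the following sketch; the process selection and duration are illustrative (in practice you would target the real request-handling processes):

```elixir
# Sketch: profile processes on a live node with :eprof for a short
# window, then print an aggregated per-function time breakdown.
:eprof.start()

# Profile every current process for one second (illustrative choice).
:eprof.start_profiling(:erlang.processes())
Process.sleep(1_000)
:eprof.stop_profiling()

# :total aggregates CPU time across all profiled processes, sorted by
# time spent, which is where Regex.precompile_replacement/1 surfaced.
:eprof.analyze(:total)
:eprof.stop()
```

Because eprof adds measurable overhead, a short window on one node is usually enough to spot a hotspot like this.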

So this explained why fast requests became faster while slow requests became slower. The common pattern in all previously fast requests was that they didn’t call Regex.replace/4, so their bottleneck was indeed the S3 network request. But for slow requests, the bottleneck was actually this Regex.replace/4 call. The cache layer decreased the S3 network request time but consumed more CPU power (compress/decompress), which made Regex.replace/4 take longer and thus made slow requests slower.

Debugging

Knowing that Regex.replace/4 was the bottleneck for our slowest requests, we set out to improve this endpoint’s performance again. Note that at this point, we didn’t yet realize or assume that the real performance issue lay inside Regex.replace/4 itself. So we were aiming to rewrite the function that calls Regex.replace/4 to modify the file content.

We started off by setting up a benchmark script with Benchee as the first step toward improving this function’s performance. With this benchmark script, we were able to consistently measure ~100ms of processing time for the largest manifest file in our test suite. By pure chance, the file we chose did not have any lines that needed to be replaced. The expectation was that our processing code would just read through the file very quickly. Instead, it still spent ~100ms inside Regex.replace/4 without replacing anything. So a natural improvement came to mind: adding a Regex.match?/2 check before calling Regex.replace/4. Just by doing so, we improved this function’s performance by almost 100ms for this special file. And for files with 50% of lines needing replacement, performance improved by ~50ms.
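The shape of that precheck is simple. This sketch uses a made-up manifest rewrite; the URI regex and CDN replacement are illustrative, not our real rules:

```elixir
# Only pay for Regex.replace/4 on lines that actually match; untouched
# lines (the vast majority of a manifest) take the fast path.
regex = ~r/URI="([^"]+)"/

rewrite_line = fn line ->
  if Regex.match?(regex, line) do
    # \\1 re-inserts the captured URI after the new prefix.
    Regex.replace(regex, line, ~s(URI="https://cdn.example.com/\\1"))
  else
    line
  end
end

rewrite_line.(~s(#EXT-X-KEY:URI="key.bin"))
# → "#EXT-X-KEY:URI=\"https://cdn.example.com/key.bin\""
rewrite_line.("#EXTINF:6.006,")
# → "#EXTINF:6.006," (returned untouched)
```

Regex.match?/2 stops at the first match and builds nothing, so it is a cheap guard in front of the heavier replace call.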

At this point, our optimization was pretty much done. We had been planning to use something fancier like nimble_parsec to parse and modify the file content, but given that performance after adding this trivial Regex.match?/2 precheck was good enough, that approach seemed like overkill.

But that didn’t stop us from asking further questions: why didn’t Elixir have this trivial optimization built in? Did :re.replace in Erlang’s re module have the same issue as well? We thus added another version to our benchmark script using Erlang’s re module. The result was almost as fast as having a Regex.match?/2 precheck. That meant Elixir’s Regex.replace/4 did have a performance issue when there was nothing to be replaced.
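The shape of that comparison can be sketched with :timer.tc in place of Benchee (the inputs are made up; the real benchmark ran against a production manifest):

```elixir
# Time each strategy over many no-match lines; all three scan the same
# input, but (pre-fix) Regex.replace/4 also precompiled the replacement.
lines = List.duplicate("#EXTINF:6.006,", 10_000)
regex = ~r/URI="([^"]+)"/

bench = fn fun ->
  {micros, _result} = :timer.tc(fn -> Enum.map(lines, fun) end)
  micros
end

plain_replace = bench.(fn line -> Regex.replace(regex, line, "x") end)

precheck = bench.(fn line ->
  if Regex.match?(regex, line), do: Regex.replace(regex, line, "x"), else: line
end)

# regex.re_pattern is the compiled pattern, which :re accepts directly.
erlang_re = bench.(fn line ->
  :re.replace(line, regex.re_pattern, "x", [{:return, :binary}])
end)

IO.puts("replace: #{plain_replace}µs, precheck: #{precheck}µs, :re: #{erlang_re}µs")
```

Benchee adds warmup, statistics, and comparison output on top of this; the raw timings here are only meant to show the structure of the experiment.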

Digging into Elixir’s source code, we found that Elixir did check whether there was a match before replacing. But before this check, Elixir called the Regex.precompile_replacement/1 function, which led to heavy computation even when there was nothing to replace. (This matched our profiling result from eprof.)

Erlang’s re module, on the other hand, doesn’t have this precompile_replacement/1 logic. So when there was no match, :re.replace acted as a no-op.
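To illustrate the difference with a made-up manifest line:

```elixir
# With no match, :re.replace returns the subject essentially untouched.
line = "#EXTINF:6.006,"

^line = :re.replace(line, "URI=\"([^\"]+)\"", "x", [{:return, :binary}])

# After the fix, Regex.replace/4 short-circuits the same way; before the
# fix it still called Regex.precompile_replacement/1 on "x" first.
^line = Regex.replace(~r/URI="([^"]+)"/, line, "x")
```

Both calls return the input unchanged; the difference was purely how much work was wasted before reaching that conclusion.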

After confirming this was indeed an Elixir issue, we submitted a PR to fix it: Speed up Regex.replace/4 when there is no match by dsdshcym · Pull Request #10500 · elixir-lang/elixir.

(Notably, this PR got merged in 10 minutes, which was quite impressive.)

Lessons learned

Although the final fix was trivial, we still learned a lot from this interesting debugging process.

  1. Profile with real data in real conditions
The first lesson is to always profile with real data. We first benchmarked this endpoint with some trivial data on a local machine, which seemed to confirm our assumption that the S3 network request was the bottleneck. But real-world requests were more complicated than that. As we learned above, there were two types of requests: CPU-intensive and I/O-intensive. If we hadn’t run eprof on a real server, we couldn’t have identified the root cause of this performance issue.
  2. Identify the real bottleneck in the worst-case
Similar to the first lesson, when benchmarking, we should identify the worst-case input and benchmark against that. We used some trivial input when benchmarking locally, which led to a biased and misleading conclusion. Switching to the largest manifest file helped us identify the root cause more quickly.
  3. Contribute back to open source
Circling back to the click-bait title: this Regex.replace/4 performance issue had been in Elixir since the function was first introduced, seven years ago. We wondered why this bug was not found earlier by others. Our guess is that nobody else was using Regex.replace/4 at this scale (calling Regex.replace/4 thousands of times on no-match lines within a single request cycle). But even if it’s less of an issue for others, that makes it all the more important to fix this kind of rare issue upstream, so others won’t get hurt by this edge case in the future. Plus, contributing to Elixir has been the best open source experience a developer could ask for: 10 minutes and our code landed in master. Unbelievable!

If working on Elixir at scale excites you, then join us at Tubi, where we use Elixir to power live and on-demand studio content for tens of millions of users.



