Wednesday, November 18, 2020

Indirect Effects of Allocate Direct

In Java, ByteBuffer is the class that lets you allocate and work with chunks of memory. Once you've decided to allocate a buffer, you'll be presented with two methods: allocate() and allocateDirect(). Which one should you use? The answer is, as always, it depends. If there were no trade-offs, there wouldn't be two methods. In this article, I'll explore some of these trade-offs, demystify the behaviour, and hopefully make the answer clear for you by the end.
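Both methods return a ByteBuffer, but the backing storage differs: allocate() wraps an on-heap byte array that the garbage collector manages like any other object, while allocateDirect() reserves native memory outside the heap. The snippet below is a minimal sketch of the two calls; the 1024-byte capacity is an arbitrary value picked for the example.

import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] on the Java heap.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: backed by native memory outside the heap,
        // which is only freed after the buffer object becomes unreachable.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}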

The story

Every story needs a protagonist. In this case, the protagonist was a Java application built on top of RSocket, a modern application protocol. Here's an oversimplified version of the app, which you can find on GitHub. Let's call it the echo app. The code isn't trying to do anything complicated: it's a simple echo service built with the awesome rsocket-java library. All it does is spin up a client and a server, where the client sends messages and the server echoes them back.

var server = RSocketServer.create(echo) //...
var client = RSocketConnector.create() //...
while (true) {
  assert Objects.equals(client.send(), client.receive())
}

Note

The supporting code for this article is available on GitHub. You can, and I highly encourage you to, go through each step yourself by cloning the code and running each example with a simple bash script. All measurements were taken on an AWS EC2 m5.large instance. Unless specified otherwise, Java 13 is used. The point of this article is not to show the numbers, but rather to demonstrate the techniques that you can use to debug your own application.

The code is useless if it’s just sitting in the repo and doing nothing, so let’s clone the repo and start the app.

git clone https://github.com/SerCeMan/allocatedirect.git
cd allocatedirect && ./start.sh

The app starts, and you should see logs flowing: as expected, it's processing a large number of messages. However, if the echo app were exposed to users, they would start noticing significant pauses every now and then. Every Java developer knows that the first thing to look at in the case of spurious pauses is GC.

Note

The GC logs of the app are stored in /tmp/${gcname}. The example logs for each run are also available in the repo. In this article, gceasy.io was used for visualisation. It's a great free online tool that supports the log formats of multiple garbage collectors. Even though you can always visualise GC logs with a tool like gceasy, as we'll see later, the raw logs often contain a lot more information than most tools can display.
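If you want to collect similar logs for your own application, unified JVM logging (available since JDK 9) can write detailed GC logs to a file. The line below is a minimal sketch, not the exact options used by start.sh; the log path and jar name are placeholders.

java -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags -jar your-app.jar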

Indeed, the GC logs show that GC is to blame here. The application is running under G1, the default collector since JDK 9. The graph shows multiple young GC pauses. A young GC is a stop-the-world pause in G1: the application stops completely while the collector performs a cleanup. For the echo server, these pauses last 100-130ms and occur roughly every 10 seconds.

[Figure: G1 GC pauses (g1before)]

Luckily for us, in the last few years, there has been an amazing development in the GC space. There are not just one but two new fully concurrent garbage collectors, ZGC and Shenandoah.

While I've had great success with ZGC before, Shenandoah has the great advantage of being much friendlier to the application's memory consumption. Many applications, especially simple JSON-in, JSON-out stateless services, are not memory-constrained. Some applications, on the other hand, especially the ones that handle a large number of connections, can be very sensitive to memory usage.

Even though the echo app only has a single client and a server in its current state, it could just as well handle tens of thousands of connections. It's time to enable Shenandoah and run the echo app again.

./start.sh shen # starts with -XX:+UseShenandoahGC

After enabling Shenandoah, the GC logs start showing an interesting picture. There is definitely a huge improvement in pause frequency: the pauses now occur only every minute or so. However, they are still around 90ms long, which is far from the desired sub-millisecond pauses.

[Figure: Shenandoah GC pauses (shenandoah)]

Now that the symptoms are clear and the problem is reproducible, it's time to look for the cause. The GC graphs don't show much more information. Looking at the raw logs directly, on the other hand, reveals the cause, which is stated right on the pause line.

...
[info][gc] GC(15) Pause Final Mark (process weakrefs) 86.167ms
...

Turns out, weak references are to blame. Put simply, a weak reference is a way to keep an object in memory only until there is demand for that memory. Large in-memory caches are a common use case for weak references: if there is enough free heap, a weakly referenced cache entry can stay there, but as soon as the GC figures out that memory is running low, it will clear the weak references and reclaim the memory. In most cases, this is a much better outcome than the application failing with an out-of-memory error because of a cache.
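To make this behaviour concrete, here's a minimal sketch (not code from the echo app) showing a value held only through a WeakReference: it stays retrievable until the GC decides to reclaim it.

import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        byte[] value = new byte[1024 * 1024];
        WeakReference<byte[]> entry = new WeakReference<>(value);

        System.out.println(entry.get() != null); // true: still strongly reachable

        // Drop the strong reference; the GC is now free to clear the weak one.
        value = null;
        System.gc(); // only a hint, but usually enough for a demo

        System.out.println(entry.get()); // most likely null once the GC has run
    }
}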

A frantic search across the repository doesn't show any usages of weak, soft or phantom references. Searching through the third-party libraries doesn't reveal anything either. After staring at the metrics for a while, though, one of the graphs gives a clue: the long GC pauses correlate with a sudden drop in the number of direct byte buffers.

[Figure: GC pauses vs DirectByteBuffer count (JMC)]

Note

You can get a similar graph by running the echo app and connecting to its JMX port. For this screenshot, I used Java Mission Control (JMC). The start.sh script contains the options you can enable to connect to the app over JMX remotely.
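The exact options live in start.sh, but a typical set of JMX flags for insecure local testing looks like the sketch below; the port number and jar name are placeholders.

java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar your-app.jar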

At first, the correlation might not make any sense. Byte buffers are not weak references, are they? Not by themselves, no. However, creating a new direct byte buffer gives you back a plain ByteBuffer, which has no close method or any other explicit way of deallocating the buffer.

ByteBuffer buf = ByteBuffer.allocateDirect(42);

The underlying memory needs to go away once the last reference to the buffer goes away. The modern API for this in Java is java.lang.ref.Cleaner, and a cleaner is exactly what the DirectByteBuffer class uses to determine when the underlying memory should be deallocated.

DirectByteBuffer(int cap) {
    // ...
    base = UNSAFE.allocateMemory(size); // malloc call
    UNSAFE.setMemory(base, size, (byte) 0);
    // ...
    cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
}
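For comparison, here's a minimal sketch of the public java.lang.ref.Cleaner API (available since Java 9). The NativeBuffer class and its freeing logic are made up for the example; the point is that the cleanup action only runs after the GC notices that the owning object has become unreachable, which is why deallocation of direct buffers ends up tied to reference processing.

import java.lang.ref.Cleaner;

public class NativeBuffer implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not hold a reference back to the owner,
    // hence a static nested class that only captures the address.
    private static final class Deallocator implements Runnable {
        private final long address;
        Deallocator(long address) { this.address = address; }
        @Override public void run() {
            // a real implementation would free the native memory at 'address' here
        }
    }

    private final Cleaner.Cleanable cleanable;

    public NativeBuffer(long address) {
        this.cleanable = CLEANER.register(this, new Deallocator(address));
    }

    @Override public void close() {
        cleanable.clean(); // explicit release; otherwise the cleaner runs some time after GC
    }
}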

Yet, there are no usages of direct buffers in the code of the echo app either, so how could we find them? One way would be to search through the third-party libraries using IntelliJ. That approach would work well for the echo example, but it would completely fail for any real application of a decent size: there are just way too many places where byte buffers are used. Looking at the graphs, one can see that the number of buffers created per minute is huge, literally millions of them.

Instead of searching through the code for every byte buffer reference, it's easier to find the place at runtime. One way to find out where the majority of the buffers are created is to fire up async-profiler and profile the malloc calls that direct buffers use to allocate memory.

# async profiler can be downloaded from https://github.com/jvm-profiling-tools/async-profiler
./profiler.sh -d 30 -f /tmp/flamegraph.svg $(pgrep -f java) -e malloc

While running, the profiler managed to sample more than 500,000 malloc calls, which unambiguously show where the buffers were being created.

Note

The flame graph above visualises the code paths where most of the captured malloc calls occur. The wider a column is, the more often that code path appeared in the samples. You can read more about how to read flame graphs here.

As it turned out, there was a place in the code that was using direct buffers. With the knowledge of exactly where the direct buffer allocations occur, creating a fix is easy. All that's needed is a one-line change replacing allocateDirect with allocate, and a PR sent upstream.
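The change itself is no bigger than the sketch below; the variable and parameter names are illustrative rather than the actual upstream code.

// before: native memory, released only via a cleaner after the buffer becomes unreachable
ByteBuffer buffer = ByteBuffer.allocateDirect(size);

// after: plain heap memory, reclaimed by the GC like any other object
ByteBuffer buffer = ByteBuffer.allocate(size);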

Running the same app on Shenandoah after applying the one-line change produces a completely different graph, one that pleases the eye with sub-millisecond GC pauses.

[Figure: Shenandoah GC pauses after the fix (shenandoah heap)]

Conclusion

Every non-trivial Java application uses byte buffers, directly or indirectly. On the surface, ByteBuffer is a simple class: it's just a pointer to a chunk of memory. Nevertheless, even by looking at such a simple class, you can discover a deep rabbit hole. Even though we've only looked at the tip of the iceberg, I hope you now have a clear idea of when you would use a heap buffer and when you would choose a direct one.

Modern JVM runtimes are complicated environments. While they provide sane defaults, they also present multiple options. The choice is always there, and it's up to you to make it, but it's crucial to be aware of the consequences. Fortunately, JVM runtimes also come with a whole lot of observability tools: JMX metrics, GC logs, profilers, and if you really want to, it's not even that hard to look at the generated assembly code. Using the techniques shown in this article, you can make that choice not for the workload of some person on the internet, but for your own workload, which can lead to amazing results in production later. We have to care about later sometimes, you know.


