
Intrinsically fast: more JVM performance tinkering

I didn’t expect my last post on JVM perf to be so well received, so I thought I’d carry on digging into why your code does (or doesn’t) run fast! Let’s forget about concurrency for now and instead focus on the executable machine code that the Java Virtual Machine (and particularly HotSpot) generates.

In Java-land it’s pretty common to hear people mention ‘warmup times’, especially in the context of an incendiary microbenchmark that conclusively proves IO framework x’s Hello World is an order of magnitude quicker than framework y’s. You may also have come across tools like JMH for running these things methodically. Or you may just be confused by someone on Stack Overflow wondering why their sin function runs slower than the standard Java one.

The good news is that Java 6+ lets you peek under the hood at the stuff HotSpot is actually emitting so you can get concrete answers to these questions. As a motivating example, let’s take a single function bundled with the JDK and write our own implementation: Long.bitCount():
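The body below mirrors the pure-Java source in OpenJDK’s java.lang.Long; only the name is ours:

    // a renamed copy of java.lang.Long.bitCount
    // ("HD" in the OpenJDK source refers to Hacker's Delight, Figure 5-14)
    public static int myBitCount(long i) {
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int) i & 0x7f;
    }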

If you’re not familiar with bitCount, it’s also called the Hamming weight or population count of a binary value. The Wikipedia entry gives a decent explanation. Actually, the listing above is the standard pure-Java OpenJDK implementation, so unless and until HotSpot decides to turn it into something better suited to your CPU architecture, that’s what you get (there are arguably faster implementations depending on the input).

Let’s run a super-naïve benchmark of myBitCount against plain old Long.bitCount to see if there’s any difference in execution time:
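Something like the sketch below will do; the iteration count and the sink field are purely illustrative, but TinyBenchmark, standardBitCount and myBitCount match the names that show up in the compiler output later:

    public class TinyBenchmark {

        private static final long ITERATIONS = 10000000L;

        // results land in a field so the JIT can't throw the whole loop away
        private int sink;

        private void standardBitCount() {
            for (long i = 0; i < ITERATIONS; i++) {
                sink = Long.bitCount(i);
            }
        }

        private void handRolledBitCount() {
            for (long i = 0; i < ITERATIONS; i++) {
                sink = myBitCount(i);
            }
        }

        // the renamed copy of OpenJDK's implementation from earlier
        private static int myBitCount(long i) {
            i = i - ((i >>> 1) & 0x5555555555555555L);
            i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
            i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
            i = i + (i >>> 8);
            i = i + (i >>> 16);
            i = i + (i >>> 32);
            return (int) i & 0x7f;
        }

        public static void main(String[] args) {
            TinyBenchmark bench = new TinyBenchmark();
            long start = System.nanoTime();
            bench.standardBitCount();   // swap in handRolledBitCount() to compare
            System.out.printf("Took %f seconds%n", (System.nanoTime() - start) / 1e9);
        }
    }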

Wow, the built-in version is nearly three times faster than our implementation! How is this even possible if ‘my’ implementation is taken straight from OpenJDK’s in the first place? It turns out that the JVM can rewrite certain methods if it finds they’re being invoked often enough. On x86-64 architectures the whole method can actually be performed by a single ‘intrinsic’ instruction, POPCNT, which Intel introduced alongside SSE4.2 (most modern Intel Core-based architectures have it).

Intrinsic?

Think of an intrinsic as a single instruction that stands in for a whole bag of CPU instructions. It tends to be massively faster than the ‘long-form’ equivalent because the work happens in one fetch-decode-execute cycle.

POPCNT is one of those cases, but can we actually see this happening in our tiny little benchmark? With a little disassembler called hsdis, we can indeed. I’ve described previously how to set this up so I won’t duplicate it here, but suffice it to say it’s not that hard. Let’s look at the pertinent assembly for the standard bitCount function:
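Assuming hsdis is on your library path, an invocation along these lines will dump the compiled code (the flags are genuine HotSpot diagnostics; the class name comes from our sketch above):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
         -XX:CompileCommand=print,*TinyBenchmark.standardBitCount TinyBenchmark

The interesting lines have roughly this shape (a sketch rather than a verbatim capture; addresses and register choices will vary):

    0x...: mov    0x10(%rsi),%rdx      ; load the value we're counting
    0x...: popcnt %rdx,%r11            ; the whole of Long.bitCount in one instruction
    0x...: mov    %r11d,0x18(%rsi)     ; store the result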

Great: between the load/store operations for our fields, we’re clearly calculating bitCount in one shot! How about our hand-rolled contender?
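Again a sketch of the general shape rather than a verbatim capture:

    0x...: mov    %rdx,%r10
    0x...: shr    $0x1,%r10
    0x...: movabs $0x5555555555555555,%r11
    0x...: and    %r11,%r10
    0x...: sub    %r10,%rdx            ; i = i - ((i >>> 1) & 0x5555...)
    ...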

Ugh, I stopped copying & pasting after a couple of lines, but you can clearly see our program has to do a hell of a lot more work: it’s responsible for all the requisite loads/stores plus all the arithmetic in between.

But wait: both implementations start off with the same Java code, so how come ours gets the rough treatment? When HotSpot kicks in and starts compiling your bytecode, it looks for call sites matching the special methods I mentioned earlier. If you want to know what they are, they live in OpenJDK’s vmSymbols.hpp header file. Look for the do_intrinsic macro and you’ll find that a lot of methods for operating on numbers and memory regions are optimized.
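The bitCount entries, for example, look roughly like this (lightly edited; the exact layout varies between JDK versions):

    do_intrinsic(_bitCount_i, java_lang_Integer, bitCount_name, int_int_signature,  F_S)
    do_intrinsic(_bitCount_l, java_lang_Long,    bitCount_name, int_long_signature, F_S)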

Remember that ‘warmup period’?

Just to highlight that we’re specifically talking about HotSpot optimizations here, let’s see what happens if we drop the number of loop iterations to 1 and run the standard bitCount again:

    CompilerOracle: print *TinyBenchmark.standardBitCount
    Java HotSpot(TM) 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output
    Took 0.055956 seconds

    Process finished with exit code 0

That’s right, we get nothing at all; our loop hasn’t run often enough, so the JVM is just interpreting the Java bytecode on the fly. How about bumping the number slightly, say to 1000?
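This time HotSpot does print compiled code for our method, and its shape is roughly the sketch below: Long.bitCount appears as an ordinary call rather than the popcnt we saw earlier (addresses elided):

    0x...: mov    0x10(%rsi),%rdx
    0x...: callq  0x...               ;*invokestatic bitCount
    0x...: mov    %eax,0x18(%rsi)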

Heh, it’s started generating executable code, but it hasn’t yet compiled the Long.bitCount method, nor has it been inlined into our standardBitCount(). This is actually part of the tiered compilation feature introduced in Java 7 to mitigate longer startup times when the JVM is running in server mode: it generates JIT’ed code earlier but defers the heavier optimization phases. For bonus points, try adding the -XX:-TieredCompilation VM flag to disable it; a thousand iterations should now produce no generated code for you either!
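That is, the same (illustrative) invocation as before with one extra flag:

    java -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
         -XX:CompileCommand=print,*TinyBenchmark.standardBitCount TinyBenchmark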

Enough machine code already!

Right: like false sharing, you probably shouldn’t let tricks like this dictate design decisions in your code, but if you’ve worked with the JVM for a while it’s great to be aware of some of the compiler optimizations going on downstairs. At Logentries we have a critical set of performance challenges where experience with low-level I/O and CPU behavior is extremely important, so being able to see how a language runtime ties into this is massively beneficial.


That’s real full-stack development 😉