In the previous article we’ve implemented Conway’s game of Life
in Java using HashMap
as our main building block. We’ve discovered that the performance depends on the choice of
a hash function; some of them work much better than the others. The reason is that the keys are not truly random and
follow some patterns, which cause hash code collisions. Good functions, even in the presence of patterns, spread values
evenly over hash table slots and reduce chain sizes. Shorter lists must be traversed and fewer calls to equals()
must be made.
However, hash functions also cost CPU cycles, and this cost may kill any benefit from better hashing. We’ve looked at several hash functions and compared them regarding the quality of hashing. Now it’s time to measure their own speed as well and check if it can be improved. For now, I’ll concentrate on the first task (measuring), for it turned to be non-trivial and full of surprises. We’ll look at improvements in the next article.
Although the article is called “measuring hash functions”, most of our findings don’t depend on the nature of the functions, and are generally applicable to any Java code. I should have investigated this topic before trying any optimisations of Java programs, so this should have been the article number one. But when I started, I didn’t posess a skill, which I’m going to practice today: to print the code generated by HotSpot. This turned out to be an exciting adventure. Enjoy the ride!
The classes to test
We want to compare performance of hashCode()
methods of several classes. These classes are described in the previous article;
their code can be found in the repository.
They are used to store two integer values (x
and y
). One class (Point
) stores them individually as two separate variables;
the others pack them into a long
, high 32 bits being used for x
and low ones for y
. Here are the classes and their hashCode()
methods:
Point
java.lang.Long
LongPoint
(with the obvious definitions of hi()
and lo()
)
LongPoint3
LongPoint4
LongPoint5
LongPoint6
LongPoint7
Let’s add another one, NullPoint
:
The first test
It seems easy to compare the performance of several methods. We’ll just run each method some big number of times and measure the time (the code is in the repository):
The choice of the constants for x
and y
is completely arbitrary and has no special meaning. This is what we get when running this test
on Java 7 (1.7.0.40):
Point: 472 426 455
java.lang.Long: 427 390 365
LongPoint: 430 430 429
LongPoint3: 431 428 428
LongPoint4: 400 398 396
LongPoint5: 423 395 395
LongPoint6: 426 426 425
LongPoint7: 9823 9814 9814
NullPoint: 335 335 334
The results look a bit suspicious. Can Point
really be slower than LongPoint5
and LongPoint6
?
This suspicion grows as we look at the results on Java 8 (I’m using 1.8.0.45):
Point: 5 3 0
java.lang.Long: 93 175 141
LongPoint: 458 457 456
LongPoint3: 437 426 425
LongPoint4: 457 395 396
LongPoint5: 457 423 395
LongPoint6: 1644 1646 1645
LongPoint7: 1791 1790 1791
NullPoint: 377 365 365
Now Point
is very fast, faster even than NullPoint
. Could it be because it was the first in the list? Let’s try swapping it
with Long
:
java.lang.Long: 6 2 0
Point: 187 160 143
We see that the extraordinary speed is indeed caused by the first position in the list.
What happens if we repeat the test twice (introduce a loop into main()
)? We get the following for Java 7:
Point: 463 426 456
java.lang.Long: 427 392 365
LongPoint: 429 426 426
LongPoint3: 428 426 426
LongPoint4: 398 396 397
LongPoint5: 396 395 396
LongPoint6: 427 426 425
LongPoint7: 9824 9816 9816
NullPoint: 336 334 334
Point: 425 426 425
java.lang.Long: 364 364 365
LongPoint: 426 426 426
LongPoint3: 426 426 426
LongPoint4: 396 397 396
LongPoint5: 395 395 396
LongPoint6: 425 426 425
LongPoint7: 9816 9816 9816
NullPoint: 334 334 334
The only thing that has changed is Point
. It is now more consistent with other results but still surprisingly slower
than LongPoint4
and LongPoint5
.
Java 8 shows more variety:
Point: 6 3 0
java.lang.Long: 188 158 143
LongPoint: 458 426 426
LongPoint3: 442 426 425
LongPoint4: 457 445 395
LongPoint5: 457 412 395
LongPoint6: 1644 1645 1644
LongPoint7: 1873 1870 1871
NullPoint: 426 414 365
Point: 395 395 396
java.lang.Long: 395 395 395
LongPoint: 429 429 429
LongPoint3: 396 396 397
LongPoint4: 396 395 396
LongPoint5: 395 396 395
LongPoint6: 1627 1627 1628
LongPoint7: 1786 1786 1784
NullPoint: 364 365 364
The first two methods are incredibly fast, but they become normal the second time we run them. To find out why this happens, we’ll have to look at the generated code.
The generated code: Java 7
Java virtual machine has a command line option that causes it to print the names of the methods being compiled: -XX:+PrintCompilation
.
I won’t show the outputs here, for they are long. When using Java 7, relatively few compilations happen. All the hashCode()
methods are compiled when used. The iterate()
method is compiled twice in the beginning and never recompiled
afterwards. Java 8 behaves differently. It compiles iterate()
ten times – five times while running with Point
class,
twice with Long
, twice for LongPoint
, and once for LongPoint4
.
To look at the code, we must install a disassembler plugin (hsdis-amd64.so
,
which can be downloaded from Kenai Project). We want to see the code
for HashTest.iterate()
, so the command line must look like this:
# java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:PrintAssemblyOptions=intel
-XX:CompileCommand=print,*HashTime.iterate HashTime
Note the -XX:PrintAssemblyOptions=intel
part: I prefer traditional Intel notation of assembly listing to that
of the GNU assembler. Luckily, the disassembler has such option.
Java 7 compiles this method twice, but the results only differ in entry and exit sequences and in the registers used, so we’ll only look at one of them – the second one.
Java disassembly outputs are long and difficult to read; that’s why I’ll skip most of them. This one, however,
is interesting because it gives unforeseen insight into the default implementation of hashCode()
method; still,
I’m only showing the main loop here. I’ve also replaced jump targets with labels:
Some comments:
-
The
ebp
is the loop variable (i
), andebx
is the loop boundary (N
) -
The code between
L1
andL2
checks if the method we are going to call isObject.hashCode
. If it is, after some more checks the method is inlined and optimised, and the result of the inlining happens to be empty (the code atL3: jne MAIN_LOOP
). Otherwise, a virtual call is performed (atVIRTUAL_CALL
). This way of implementing virtual calls by speculative specialisation is very well described in Alexey Shipilev’s “The Black Magic of (Java) Method Dispatch”. In this specific case this optimisation happened to be harmful: it specialised the code for the case that never happened – none of our classes employshashCode()
method directly inherited fromObject
. -
The code between
L2
andL3
checks for some additional conditions on the first quadword of the object, which is called a mark word. In particular, it checks if the three least significant bits equal001
and if the bits 8 to 38 contain non-zero value. In pseudo-code this can be represented like this:
These checks are not part of a general-purpose method dispatch; for instance, they are not present in the examples
in Shipilev’s article. They are part of Object.hashCode()
inlining. This method is declared as native
, but obviously
the HotSpot has a special treatment for it (methods that enjoy such a treatment are called intrinsics). Here is what
is actually happening here. The last three bits (x.markWord & 7
) contain the lock status of the object, the value
1
meaning “not locked”. The reason these bits are checked is that they define what’s stored in the higher bits of
the mark word. When the object is locked, these bits keep the reference to the thread that holds the lock; otherwise, they keep
the hash code of the object. This hash code is based on the object’s address; however, the objects can move in memory
when the heap is compacted and must retain their hash codes. That’s why the hash code field is initialised to zero
and filled in the first time hashCode()
is called. This requires thread synchronisation, and is in general non-trivial –
that’s why it is considered too complex to be inlined and is implemented as a runtime call. In our case this is
combined with the speculative optimisation – the case when the object is locked or the hashcode isn’t initialised
is treated the same way as the case when the method isn’t Object.hashCode()
.
This pseudo-code suggests an optimisation: the if
statement can be moved out of the loop. As easy as it looks, it
is not completely trivial to implement: the call to vm.callVirtual
won’t change x.virtualTable.hashCode
, but it can
change the markWord
. Besides, the compilers are usually good at moving calculations out
of the loop, but not conditions. The typical way to move a condition out of the loop involves duplication of the loop
body with subsequent specialisation, and not every compiler writer likes this idea.
-
The main case (where the method isn’t
Object.hashCode
) is covered atVIRTUAL_CALL
by a call to the runtime (call 0x00007f592cc0fd60
). I’m not sure why the virtual call is implemented by a call to the runtime, rather than by the indirect call CPU instruction. My guess is that it may have something to do with statistics update, but I may be wrong. In any case, the CPU instruction would be faster – not only because there are fewer instructions to execute, but also because the processors have an indirect call cache and are capable to make predictions of indirect call targets based on call site addresses. -
Why is the
Object.hashCode
taken as the default case? Probably, the Java 7 VM does not have good type information at this point. It could take the slow route (a virtual call) always; alternatively, it could make an arbitrary choice of a virtual call target and generate a check and a specialised version. If the second approach is used, the natural target is that of the declared type of the object variable. It is bad for our case, but probably beneficial as a general strategy. -
The
markWord
temporary variable in our pseudo-code is important; it is introduced to indicate the fact that the Hotspot-generated code reads the mark word once and atomically, to make sure that different fields in that word are consistent with each other.
The generated code: Java 8
Why are the Java 8 results so different? Let’s look at the code.
The code samples produced by numerous compiler runs during the execution of the test with Point
class differ a lot.
These samples make very interesting reading, but to save space here we’ll only look at the last one of those. So, after
five attempts to compile the HashTime.iterate()
method, the Hotspot compiler generated the following code (this is the entire
code of the method):
This is much smaller than the code in Java 7. The checks of the mark word bits are gone. And, most importantly, the loop is gone (the only thing that’s left is the upper boundary check). I believe that the pseudo-code now looks like this:
Some comments:
-
The main case isn’t the
Object.hashCode()
anymore; it is thePoint.hashCode()
. Either the Java 8 VM collects statistics better, or it trusts its own statistics more than Java 7 did. -
The code doesn’t check if the method is the same as in
Point
; it checks if the class isPoint
. This check is cheaper, but less general: it does not cater for the case where the class extendsPoint
without re-defining thehashCode()
. Probably, the statistic engine has established that there were no such classes. -
It looks as if the optimisation I mentioned earlier has been implemented in Java 8 – the entire loop has disappeared, not just the method call. No wonder the entire loop has finished in zero milliseconds.
-
The optimiser made use of the fact that we call
hashCode()
and ignore the result. It removed the call completely. If we want to measure the performance of methods, we should either use their results or make sure the methods aren’t inlined.
The code produced while executing the test with java.lang.Long
is different. It is much longer, so I won’t quote
it here (I’ve placed it in a repository).
It is also very complex – it’s enough to mention that it contains 49 branch
targets, with a completely psychedelic structure of jumps between them. What makes
it so big is that it checks for two possible classes of x
(Point
and Long
) and unrolls the loop 16 times instead of
removing it. There is no way to write it down exactly as a pseudo-code, the closest I can come with is something like this:
This is why this code is slower than the one specialised for Point
but still faster than the general-purpose code
with a virtual call (140 ms instead of 400).
The code generated by subsequent runs of the compiler does not differ much from the code from Java 7, that’s why the results are similar.
The second test: using the result
One of the reasons why in some cases the entire method was optimised out is that the results of hashCode()
call were
not used. We can modify the code to use the results – for instance, to add them all together:
Obviously, the test ()
must be modified accordingly:
Note that we make sure we use the results of calls to sum()
by adding them together and printing the result. This is
a safety precaution against possible inlining of sum()
or compiling it with the knowledge that its results aren’t ever used –
some compilers are capable of this.
The new code is here.
This is what we get in Java 7:
Point: 491 457 489; sum=1040261632
java.lang.Long: 459 408 402; sum=-1786864128
LongPoint: 491 489 488; sum=1040261632
LongPoint3: 461 458 459; sum=-1308103168
LongPoint4: 460 459 460; sum=1455681024
LongPoint5: 432 430 431; sum=26862080
LongPoint6: 468 465 465; sum=-2103278592
LongPoint7: 9911 9905 9911; sum=1732665600
NullPoint: 374 367 367; sum=0
And this is in Java 8:
Point: 12 7 3; sum=1040261632
java.lang.Long: 213 161 153; sum=-1786864128
LongPoint: 462 458 457; sum=1040261632
LongPoint3: 518 510 457; sum=-1308103168
LongPoint4: 488 429 458; sum=1455681024
LongPoint5: 441 428 427; sum=26862080
LongPoint6: 1669 1650 1649; sum=-2103278592
LongPoint7: 1892 1892 1893; sum=1732665600
NullPoint: 405 374 374; sum=0
The results do not differ much from the previous ones. The unusually slow Point
in Java 7 and unusually
fast Point
and Long
in Java 8 are still there. The only thing that has changed is that LongPoint4
became
slower.
The Java 7 code of sum
isn’t much different from that of iterate
. The identity hash code extracted from
the mark word is now added to s
(I’ll only show the pseudo-code):
The last Java 8 code produced while running with Point
class is quite long and esoteric
(it can be seen here).
I’ll show my reconstruction of its pseudo-code:
We see that the optimisation mentioned earlier (moving an invariant condition out of a loop) has in fact happened.
The loop has been specialised for the case where x
is of type Point
. All other cases cause bailout to interpreting
mode in the loop preheader (a piece of code that is executed exactly once if the loop is executed at least once). The loop has been unrolled 16 times, It is interesting that the compiler figured out that the loop is adding
an invariant value hash
to s
and replaced 16
such additions with one addition of hash * 16
, but didn’t manage
to replace the entire loop with multiplication by hash * N
. The initialisation step (the first loop) also looks funny.
It didn’t have to be compiled as a loop at all! However, as we saw, the method executes fast, so we can ignore these abnormalities.
Finally, we’ll look at the code generated by Java 8 when the test is run with Long
class – as previously,
the compiler allows for the Point
class as well. The code is long, so I’ll only show my reconstruction
(the full code is here). The
inline_hash
method is my own creation; in the HotSpot-generated code this method is inlined each time it is used –
yes, the incoming value x
is tested for being of tyoe Point
or type Long
six times, each time with the inlined
code of the appropriate hashCode()
method.
While this code looks horrible, it doesn’t perform too badly. Even a code like this is better than an indirect call or a switch to interpreting mode.
After this point, Java 8 produces the code similar to that of Java 7 – speculative
inlining of Object.hashCode()
.
The workaround
As we see, the primary reason that the performance varies so much is that HotSpot collects type statistics and employs
speculative specialisation, which is performed especially aggressively in Java 8. The virtual methods are resolved,
their bodies inlined and moved out of loops – all these optimisations are great in general, but harmful for our
benchmarking. There is no point in comparing performance of a method in classes C1
and C2
when the C1
version
is inlined and optimised out while C2
version is not.
One possible solution is not to combine different tests in the same process run. If each VM run tests exactly one class,
all the classes will be tested in identical environment. Unfortunately, this approach is only applicable to full programs
(such as our Life simulation), but not to simplified tests. If the method is inlined and moved out of the loop in
sum()
, the nature of that method becomes irrelevant. All of them will produce the same measurable result (zero time).
A better approach is to wait until the code stabilises at the version that does not depend on the specific class and treats all the classes equally badly – performs virtual call for all of them. As we’ve seen, this happens in both versions of Java VM, very quickly in Java 7, more slowly in Java 8. After this stabilisation the measurements will produce results that we can compare to each other directly.
One way to cause this stabilisation is to repeat the entire test several times. Another way is to “pre-heat” the compiler
by running the test on the mixture of classes. In both cases we hope that the statistics engine saturates,
causing the compiler to avoid making any assumptions about the classes. I tried the second approach using the following code in main()
of the new class HashTime1
:
This is the result for Java 7:
Point: 548 547 547; sum = 1040261632
java.lang.Long: 547 547 547; sum = -1786864128
LongPoint: 548 547 547; sum = 1040261632
LongPoint3: 548 547 547; sum = -1308103168
LongPoint4: 547 547 547; sum = 1455681024
LongPoint5: 547 547 548; sum = 26862080
LongPoint6: 577 577 578; sum = -2103278592
LongPoint7: 10262 10263 10266; sum = 1732665600
NullPoint: 652 517 518; sum = 0
Point: 547 547 548; sum = 1040261632
java.lang.Long: 551 575 564; sum = -1786864128
LongPoint: 564 564 550; sum = 1040261632
LongPoint3: 548 549 575; sum = -1308103168
LongPoint4: 547 548 548; sum = 1455681024
LongPoint5: 548 547 548; sum = 26862080
LongPoint6: 591 578 578; sum = -2103278592
LongPoint7: 10403 10443 10473; sum = 1732665600
NullPoint: 517 518 517; sum = 0
And this is for Java 8:
Point: 489 488 488; sum = 1040261632
java.lang.Long: 488 488 487; sum = -1786864128
LongPoint: 488 488 488; sum = 1040261632
LongPoint3: 488 488 488; sum = -1308103168
LongPoint4: 488 487 488; sum = 1455681024
LongPoint5: 488 488 488; sum = 26862080
LongPoint6: 1660 1661 1661; sum = -2103278592
LongPoint7: 1888 1889 1888; sum = 1732665600
NullPoint: 519 487 458; sum = 0
Point: 488 487 487; sum = 1040261632
java.lang.Long: 488 488 488; sum = -1786864128
LongPoint: 488 488 488; sum = 1040261632
LongPoint3: 488 488 488; sum = -1308103168
LongPoint4: 488 489 488; sum = 1455681024
LongPoint5: 488 488 488; sum = 26862080
LongPoint6: 1664 1663 1663; sum = -2103278592
LongPoint7: 1891 1891 1892; sum = 1732665600
NullPoint: 457 458 458; sum = 0
Observations:
-
Suddenly both Java 7 and Java 8 demonstrate much more stable behaviour. No unusual super-fast or super-slow executions; the results stay roughly the same if the test is repeated twice.
-
The effect of unusually slow first call to
Point
in Java 7 has disappeared. -
One must keep in mind, however, that the trick we used (running the test on a mixture of classes before the main test) isn’t universal; it works on the present versions of Java VM, but may stop working later, if VM becomes more sensitive to changing patterns of virtual calls.
-
With the exception of
LongPoint6
, Java 8 is faster than Java 7 (this is kind of expected; why make a slower version anyway? The exception, however, is intriguing, we’ll look at it later). -
The cost of virtual method calls can be seen as the time of the
NullPoint
(458 ms on Java 8). It is much higher than the costs of the methods themselves (30 – 31 ms on Java 8). -
The times of the first six methods are virtually the same. This may seem strange, since some of them are more complex than the others. However, the difference isn’t big to begin with (multiplication instructions are quite fast on modern processors), and can be hidden by instruction scheduling and re-ordering (for instance, multiplication can be performed in parallel with the return sequence from a virtual call). In addition, the resolution of our timer is quite low (one millisecond), so we need much higher execution count to measure the differences properly.
-
The times reported are in milliseconds for 100M iterations. Dividing the result by 100 gives us nanoseconds per call, and multiplying it afterwards by 2.6 gives us CPU cycles (our processor runs at 2.6 GHz). The results on Java 8 are 11.9 for
NullPoint
, 12.7 for all simple functions, 43 forLongPoint6
and 49 cycles forLongPoint7
. The Java 7 result forLongPoint7
is 267 CPU cycles, which is very slow. -
Some features of the code (manipulations with the mark word) are caused by the fact that the method we are testing is
hashCode()
. Had we created our own method without default intrinsic implementation, the code samples would have been much simpler.
Conclusions
-
We’ve learned how to produce and read outputs of the Hotspot compiler. This is very useful knowledge, and these outputs make very interesting reading
-
We’ve found that Java 7 and Java 8 employ quite different approach to code generation. This means that future versions may change the approach again, so all our findings may end up very short-lived. We must be prepared to re-do all the tests. Such is the nature of program optimisation.
-
We’ve come with the procedure to get reliable performance measurements (pre-heating the compiler). We must be prepared for this procedure to stop working in future versions of Java VM.
-
Now we are ready to start measuring and optimising individual hash functions.
-
Two primary questions so far are why
LongPoint6
(employing a division operation) is so fast on Java 7 and so slow on Java 8, and why it is the other way around withLongPoint7
(employingCRC32
). -
Also interesting is why the performance difference of
LongPoint6
between Java 7 and Java 8 does not show on the full test (Life implementation). where the time was 1,979 ms on Java 7 and 1,553 ms on Java 8. TheLongPoint7
behaves differently: there the full test time did drop (from 10,115 ms to 3,206 ms). -
The previous observation shows that we must be careful with microbenchmarks, and always remember that our real goal is the improvement of the real program performance.
-
We’ve seen that overheads of the infrastructure (such as those of virtual calls) may greatly exceed the costs of the methods themselves. This means that often the costs of the methods are less important than the quality of their results. This also means that those infrastructural overheads must be reduced at all costs. Whenever you can replace a virtual call with a direct one (or help the compiler with it) – do it.
-
Microbenchmarking is the case where measurement is more important than the performance itself. What we’ve done is actually jumping through hoops in order to make the program slower – disable some optimisations. Obviously, for real programs we want the opposite – optimise them to the extreme. That’s why the trick used today mustn’t be used for the full test. On the contrary, we must run each test alone, providing as much help as possible to Java’s type inference engine.
-
We can now re-visit some earlier experiments in optimising Java code. The knowledge acquired today may help improve stability of the results and explain them. In the earlier articles (such as here) we made some guesses about the causes of the performance differences we observed. Now we can verify those guesses.
Coming soon
In the next articles, we’ll try using this measurement technique to optimise our hash functions to the limit, starting with the division-based one, and then check how this optimisation affects the overall performance of the original program.