In “Game of Life, hash tables and hash codes” we implemented Conway’s Game of Life using Java’s HashMap. We discovered that performance depends on the choice of hash function. In “Hash functions: disassembling and measuring performance” we developed a method to measure the speed of the hash functions in isolation from the rest of the code (as a micro-benchmark). We found that the function based on division (or, more accurately, on the calculation of a remainder) worked much faster on Java 7 than on Java 8. In “Optimising division-based hash functions” we found out why: Java 7 optimised division by a constant, replacing it with a multiplication by the reciprocal, while Java 8 didn’t. In pseudo-code this replacement looked like this.
The original hash function:
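(A sketch of the division-based hash rather than the exact original listing; the divisor is the prime whose reciprocal, 0x2449F0232C624B0B, appears later in the disassembly, and the value field of LongPoint is assumed to be called v.)

```java
// LongPoint6.hashCode(): the remainder of the packed 64-bit cell value divided by a large prime
public int hashCode ()
{
    return (int) (v % 946840871);
}
```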
The optimised hash function:
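(Again a sketch; ** stands for the full 128-bit multiplication, so this is pseudo-code rather than valid Java, and sign handling – which the unsigned and signed versions in the table below treat differently – is ignored.)

```
public int hashCode ()
{
    long q = (v ** 0x2449F0232C624B0BL) >> 91;   // high bits of the 128-bit product: the quotient
    return (int) (v - q * 946840871);            // remainder = dividend - quotient * divisor
}
```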
Java does not support the operation that we indicated with **, which multiplies two 64-bit numbers and produces a 128-bit result. We spent an enormous amount of effort implementing this multiplication in Java and optimising it: two versions were developed in “Optimising division-based hash functions”, and two more in “More optimisations of division-based hash: Karatsuba’s method”. Here is the entire code. The results looked good on a micro-benchmark:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 578 | 1655 |
LongPoint60 | Optimised, unsigned | 640 | 610 |
LongPoint61 | Optimised, signed | 729 | 727 |
LongPoint62 | Optimised and specialised | 595 | 547 |
LongPoint63 | Karatsuba's multiplication | 622 | 565 |
However, the real test still showed the superiority of the original version:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 2032 | 1650 |
LongPoint60 | Optimised, unsigned | 2247 | 1877 |
LongPoint61 | Optimised, signed | 2368 | 1885 |
LongPoint62 | Optimised and specialised | 2191 | 1726 |
LongPoint63 | Karatsuba's multiplication | 2239 | 1776 |
This was a real mystery: why does the function perform so badly in a micro-benchmark, where it can be improved a lot, while in the real test it performs well and all the improvements fail?
Analysing the disassembly
The only way to answer this question is to run the full test on Java 8 and dump the disassembly of the code generated by HotSpot (the JVM options used to produce such a log are sketched right after the listing). When we look at the list of actions taken on our classes, we see quite a lot of activity:
CompilerOracle: print *Hash_LongPoint.inc
102 27 3 LongPoint6::hashCode (10 bytes)
104 31 3 LongPoint::equals (21 bytes)
105 36 3 LongPoint6$1::create (6 bytes)
105 37 3 LongPoint6$1::create (10 bytes)
105 38 3 LongPoint6::<init> (6 bytes)
105 39 3 LongPoint6::<init> (6 bytes)
106 40 3 LongPoint::<init> (10 bytes)
110 47 3 Hash_LongPoint::inc (51 bytes)
129 49 1 LongPoint6::hashCode (10 bytes)
129 27 3 LongPoint6::hashCode (10 bytes) made not entrant
131 54 3 Hash_LongPoint::dec (59 bytes)
134 68 3 LongPoint::toPoint (8 bytes)
137 82 4 LongPoint::equals (21 bytes)
139 84 3 Hash_LongPoint::set (91 bytes)
139 31 3 LongPoint::equals (21 bytes) made not entrant
140 86 3 Hash_LongPoint::reset (91 bytes)
142 88 4 LongPoint6$1::create (6 bytes)
143 36 3 LongPoint6$1::create (6 bytes) made not entrant
147 96 4 Hash_LongPoint::inc (51 bytes)
148 101 4 Hash_LongPoint::dec (59 bytes)
166 54 3 Hash_LongPoint::dec (59 bytes) made not entrant
197 47 3 Hash_LongPoint::inc (51 bytes) made not entrant
206 121 4 LongPoint::toPoint (8 bytes)
207 68 3 LongPoint::toPoint (8 bytes) made not entrant
216 126 3 Hash_LongPoint::step (234 bytes)
225 128 4 Hash_LongPoint::set (91 bytes)
225 129 4 Hash_LongPoint::reset (91 bytes)
226 130 % 4 Hash_LongPoint::step @ 104 (234 bytes)
228 86 3 Hash_LongPoint::reset (91 bytes) made not entrant
236 84 3 Hash_LongPoint::set (91 bytes) made not entrant
280 132 4 Hash_LongPoint::step (234 bytes)
360 126 3 Hash_LongPoint::step (234 bytes) made not entrant
691 96 4 Hash_LongPoint::inc (51 bytes) made not entrant
691 135 3 Hash_LongPoint::inc (51 bytes)
703 101 4 Hash_LongPoint::dec (59 bytes) made not entrant
704 139 3 Hash_LongPoint::dec (59 bytes)
704 132 4 Hash_LongPoint::step (234 bytes) made not entrant
705 130 % 4 Hash_LongPoint::step @ -2 (234 bytes) made not entrant
705 140 3 Hash_LongPoint::step (234 bytes)
710 145 4 Hash_LongPoint::inc (51 bytes)
711 146 4 Hash_LongPoint::dec (59 bytes)
719 148 % 4 Hash_LongPoint::step @ 104 (234 bytes)
759 135 3 Hash_LongPoint::inc (51 bytes) made not entrant
761 139 3 Hash_LongPoint::dec (59 bytes) made not entrant
1131 161 4 Hash_LongPoint::step (234 bytes)
1237 140 3 Hash_LongPoint::step (234 bytes) made not entrant
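(A log like the one above, together with the disassembly of the selected methods, can be obtained with HotSpot’s diagnostic options; a minimal sketch, assuming the hsdis disassembler plugin is installed and with the main class name used as a placeholder:)

```
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation \
     -XX:CompileCommand=print,*Hash_LongPoint.inc \
     Life
```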
Methods are compiled, then discarded, then compiled again, and all the methods of interest were compiled at least once. We’ll look at the last compilation of each method.
Let’s start with LongPoint6.hashCode().
The main part of the code looks familiar – we saw it when first analysing the division-based hash in Java 8. The remainder is calculated by calling a runtime routine, and the divisor is first compared with zero, which is completely unnecessary, as it is a freshly loaded known constant. However, this method isn’t really called – or, rather, it stops being called as soon as methods such as Hash_LongPoint.inc() get compiled.
Let’s look at this method. Here is the Java code:
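(The method itself is short; a reconstruction, based on the operations described below – counts is the HashMap of neighbour counts and factory creates the key object:)

```java
// increment the neighbour count of the cell with packed co-ordinates w
private void inc (long w)
{
    LongPoint key = factory.create (w);      // speculatively inlined
    Integer c = counts.get (key);            // first HashMap operation
    counts.put (key, c == null ? 1 : c+1);   // second HashMap operation; c+1 requires unboxing and boxing
}
```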
The assembly code is quite long (1250 lines, 3487 bytes), and it contains a piece that also looks familiar: the multiplication by the reciprocal constant, followed by a shift and a multiply-and-subtract. This is the optimised version of calculating a remainder, which until now we had only seen in the code produced by Java 7. The conclusion we made previously, that this optimisation had been removed from Java 8, turned out to be incorrect. Java 8 is capable of this optimisation; it just does not perform it in the micro-benchmark test. This gives us the answer to our mystery, but let’s carry on looking at the code.
This remainder-calculating code is present twice in inc(), which corresponds to the two HashMap operations in this method (get() and put()). It’s a pity the compiler hasn’t figured out that both operate on the same object (key), and that the first result could have been used in the second case as well.
Apart from that, the code of this method is quite impressive. The compiler has done a lot of work to produce it. It performed speculative inlining of factory.create(), HashMap.get() and HashMap.put() (speculative inlining means that, before performing true inlining, the compiler emits code that checks whether the object is of the expected class and falls back to the interpreter if it isn’t). Then it inlined all the relevant methods from HashMap: get(), put(), hash(), getNode() and others. This in turn allowed hashCode() and equals() to be inlined unconditionally rather than speculatively, because the compiler knows the exact type of the key object: it is LongPoint6. That’s why the entire remainder-calculating sequence mentioned above is placed there. That’s also why there is no virtual call to equals() – this operation is fully inlined and replaced by a comparison of two numbers. The designers of HashMap helped here, too. The main loop of HashMap.getNode() looks like this:
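(An excerpt from the OpenJDK 8 source of HashMap.getNode(); key is the method parameter and e runs along the collision chain:)

```java
do {
    if (e.hash == hash &&
        ((k = e.key) == key || (key != null && key.equals(k))))
        return e;
} while ((e = e.next) != null);
```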
The type detection, call de-virtualisation and inlining wouldn’t have happened if the equals() call looked the other way around:
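(A hypothetical variant, shown only for comparison:)

```java
// the receiver k has just been read from the table, so its exact type is unknown to the compiler
if (e.hash == hash && ((k = e.key) == key || k.equals(key)))
    return e;
```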
because we can’t make any assumptions about the type of the object that’s just been extracted from a data structure. The missing null check is hardly a compensation for that. This is something to keep in mind when writing code with virtual calls.
Some exotic methods, such as TreeNode.getTreeNode(), treeifyBin() and resize(), were not inlined – probably, they were considered cold code.
Similar work is done for counts.put() – all the relevant methods are inlined. By the way, a rather long piece of code is generated for c+1, because both c and the result are Integer, not int: one instruction performs the unboxing (a field is read from the Integer object), and the rest of the sequence performs the boxing (or, rather, the main path of boxing).
The reason the code is so long is that boxing isn’t implemented as just new Integer(i). Instead, Integer.valueOf(i) is called, which is optimised by using a cache:
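(From the JDK 8 source of java.lang.Integer:)

```java
public static Integer valueOf(int i) {
    if (i >= IntegerCache.low && i <= IntegerCache.high)
        return IntegerCache.cache[i + (-IntegerCache.low)];
    return new Integer(i);
}
```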
Usually IntegerCache.low is −128 and IntegerCache.high is 127 (there is a way to change the high value, but not the low one). This is where the comparisons with 0x7f and 0x80 come from. The next four lines perform the array index check. Note how clever the compiler is to check both array boundaries using just one comparison – the trick is to use the unsigned branch instruction jae, which fires for both negative and large positive values. Unfortunately, the compiler isn’t clever enough to figure out that this index check isn’t necessary at all – the two previous checks have already done the job. Even adding 0x80 is unnecessary, since this operation has been incorporated into the addressing mode when accessing the array ([r11+r10*4+0x210]).
Hash_LongPoint.dec() looks very similar to inc() in Java and is compiled in a similar way, too:
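(A reconstruction; the exact handling of a count that drops to zero is an assumption, chosen to match the three hash map operations mentioned below:)

```java
// decrement the neighbour count of the cell with packed co-ordinates w
private void dec (long w)
{
    LongPoint key = factory.create (w);
    int c = counts.get (key) - 1;        // get()
    if (c != 0) counts.put (key, c);     // put()
    else        counts.remove (key);     // remove()
}
```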
The assembly contains three occurrences of multiplication by 0x2449f0232c624b0b – one for each HashMap operation.
All important hash map operations have been inlined (speculatively). Both LongPoint6.hashCode() and LongPoint.equals() have been fully inlined. Again, boxing required some extra code.
The Hash_LongPoint.set() and Hash_LongPoint.reset() methods also demonstrate some degree of inlining. This is what they look like in Java:
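(A reconstruction; DX and DY are assumed to be the neighbour offsets of the packed 64-bit representation, and field is assumed to be the HashSet of live cells. reset() is identical except that it calls dec() and field.remove():)

```java
// mark a cell as alive: update the eight neighbour counts, then add the cell to the live set
void set (LongPoint k)
{
    long w = k.v;
    inc (w-DX-DY); inc (w-DX); inc (w-DX+DY);
    inc (w-DY);                inc (w+DY);
    inc (w+DX-DY); inc (w+DX); inc (w+DX+DY);
    field.add (k);             // the trailing hash map operation (HashSet is backed by a HashMap)
}
```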
They look very similar to each other in assembly, too (see here and here). Eight calls to inc() and dec() have not been inlined (however, the calls are made direct, not virtual), while the trailing hash map operation is inlined and produces one multiplication by 0x2449f0232c624b0b.
Finally, there is Hash_LongPoint.step():
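(A reconstruction that matches the description below: it iterates over field and over counts.keySet(), collects the cells to create and to destroy into ArrayLists – the names toSet and toReset are placeholders – and then applies the changes:)

```java
// one generation of Life
public void step ()
{
    ArrayList<LongPoint> toReset = new ArrayList<LongPoint> ();
    ArrayList<LongPoint> toSet   = new ArrayList<LongPoint> ();

    for (LongPoint w : field) {                  // live cells with fewer than 2 or more than 3 neighbours die
        Integer c = counts.get (w);
        if (c == null || c < 2 || c > 3) toReset.add (w);
    }
    for (LongPoint w : counts.keySet ()) {       // dead cells with exactly 3 neighbours are born
        if (counts.get (w) == 3 && ! field.contains (w)) toSet.add (w);
    }
    for (LongPoint w : toSet)   set (w);
    for (LongPoint w : toReset) reset (w);
}
```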
The assembly output for this method is very long (more than 5000 assembly lines). It contains three inlined hash map operations, each with an inlined and optimised hashCode(). It also contains one non-virtual call to set() and eight calls to dec(), also non-virtual. This means that reset() was inlined while set() was not.
The code also contains a lot of artefacts of ArrayList, KeySet, HashMap.HashIterator, ArrayList.Itr and other classes. While the ArrayList operations, such as array resizing, are expected (I can’t think of an optimisation that would manage to optimise them out), it should be possible to remove the iterators, at least in simple cases such as iterating over ArrayLists. The compiler doesn’t do it – the iterator objects are there, and all the required virtual calls are made. Perhaps we should consider replacing those ArrayLists with plain arrays, although the best solution would be to get rid of them altogether. Some better way to iterate over our sets might also help, but this is a topic for future consideration. For now, it’s enough to know that Java 8 is in fact perfectly capable of optimising division, and all our previous attempts to do it manually were unnecessary.
Conclusions
- HotSpot demonstrates impressive optimisation capabilities. It is capable of deep inlining, which allows de-virtualisation and inlining of callback methods.
- It still misses some small optimisation opportunities here and there.
- Contrary to our previous belief, Java 8 is capable of replacing division by a constant with multiplication by the reciprocal.
- This optimisation, however, isn’t always performed; it requires some additional conditions, such as very hot code, massive inlining, or something else of this kind.
- Some other optimisations may follow this pattern.
- In our case this caused the optimiser to behave differently in the micro-benchmark and in the real program – surprisingly, in favour of the real program. We are so used to benchmarks cheating – performing much better than real products – that it is very pleasant to discover an example of the opposite.
- Still, it means that micro-benchmarks can be misleading and must be used with caution, if at all. Nothing replaces performance testing of the full program.
- Don’t optimise what’s already optimised.
Coming soon
Another candidate for optimisation is the hash function based on CRC32. We’ll see if we can avoid the same pitfall.