In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life using Java’s HashMap
.
We discovered that performance depends on a choice of a hash function. In “Hash functions: disassembling and measuring performance” we developed a method
to measure the speed of the hash functions in separation from the rest of the code (as a micro-benchmark). We found that the function based on division (or, more
accurately, calculation of a remainder) worked much faster on Java 7 than on Java 8. In “Optimising division-based hash functions”
we found why. Java 7 optimised a constant division by replacing it with a multiplication by a reciprocal, while Java 8 didn’t. In pseudo-code this replacement looked like this.
The original hash function:
public int hashCode ()
{
return (int) (v % 946840871);
}
The optimised hash function:
public int hashCode ()
{
long sign = v >> 63;
long div = ((v ** 2614885092524444427L) >> 91) - sign;
return (int) (v - div * 946840871);
}
Java does not support the operation that we indicated with **
, which multiplies two 64-bit numbers and produces a 128-bit result.
We’ve spent enormous effort to implement this multiplication in Java and optimise it. Two versions were developed in
“Optimising division-based hash functions”, two more in “More optimisations of division-based hash: Karatsuba’s method”.
Here is the entire code. The results looked good
on a micro-benchmark:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 578 | 1655 |
LongPoint60 | Optimised, unsigned | 640 | 610 |
LongPoint61 | Optimised, signed | 729 | 727 |
LongPoint62 | Optimised and specialised | 595 | 547 |
LongPoint63 | Karatsuba's multiplication | 622 | 565 |
However, the real test still showed superiority of the original version:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 2032 | 1650 |
LongPoint60 | Optimised, unsigned | 2247 | 1877 |
LongPoint61 | Optimised, signed | 2368 | 1885 |
LongPoint62 | Optimised and specialised | 2191 | 1726 |
LongPoint63 | Karatsuba's multiplication | 2239 | 1776 |
This was a real mystery: why the function performs so badly on a micro-benchmark, and can be improved a lot, while on a real test it performs well and all the improvements fail?
Analysing the disassembly
The only way to answer this question is to run the full test on Java 8 and dump the disassembly output of the code generated by HotSpot. When we look at the list of actions taken on our classes, we see quite an activity:
CompilerOracle: print *Hash_LongPoint.inc
102 27 3 LongPoint6::hashCode (10 bytes)
104 31 3 LongPoint::equals (21 bytes)
105 36 3 LongPoint6$1::create (6 bytes)
105 37 3 LongPoint6$1::create (10 bytes)
105 38 3 LongPoint6::<init> (6 bytes)
105 39 3 LongPoint6::<init> (6 bytes)
106 40 3 LongPoint::<init> (10 bytes)
110 47 3 Hash_LongPoint::inc (51 bytes)
129 49 1 LongPoint6::hashCode (10 bytes)
129 27 3 LongPoint6::hashCode (10 bytes) made not entrant
131 54 3 Hash_LongPoint::dec (59 bytes)
134 68 3 LongPoint::toPoint (8 bytes)
137 82 4 LongPoint::equals (21 bytes)
139 84 3 Hash_LongPoint::set (91 bytes)
139 31 3 LongPoint::equals (21 bytes) made not entrant
140 86 3 Hash_LongPoint::reset (91 bytes)
142 88 4 LongPoint6$1::create (6 bytes)
143 36 3 LongPoint6$1::create (6 bytes) made not entrant
147 96 4 Hash_LongPoint::inc (51 bytes)
148 101 4 Hash_LongPoint::dec (59 bytes)
166 54 3 Hash_LongPoint::dec (59 bytes) made not entrant
197 47 3 Hash_LongPoint::inc (51 bytes) made not entrant
206 121 4 LongPoint::toPoint (8 bytes)
207 68 3 LongPoint::toPoint (8 bytes) made not entrant
216 126 3 Hash_LongPoint::step (234 bytes)
225 128 4 Hash_LongPoint::set (91 bytes)
225 129 4 Hash_LongPoint::reset (91 bytes)
226 130 % 4 Hash_LongPoint::step @ 104 (234 bytes)
228 86 3 Hash_LongPoint::reset (91 bytes) made not entrant
236 84 3 Hash_LongPoint::set (91 bytes) made not entrant
280 132 4 Hash_LongPoint::step (234 bytes)
360 126 3 Hash_LongPoint::step (234 bytes) made not entrant
691 96 4 Hash_LongPoint::inc (51 bytes) made not entrant
691 135 3 Hash_LongPoint::inc (51 bytes)
703 101 4 Hash_LongPoint::dec (59 bytes) made not entrant
704 139 3 Hash_LongPoint::dec (59 bytes)
704 132 4 Hash_LongPoint::step (234 bytes) made not entrant
705 130 % 4 Hash_LongPoint::step @ -2 (234 bytes) made not entrant
705 140 3 Hash_LongPoint::step (234 bytes)
710 145 4 Hash_LongPoint::inc (51 bytes)
711 146 4 Hash_LongPoint::dec (59 bytes)
719 148 % 4 Hash_LongPoint::step @ 104 (234 bytes)
759 135 3 Hash_LongPoint::inc (51 bytes) made not entrant
761 139 3 Hash_LongPoint::dec (59 bytes) made not entrant
1131 161 4 Hash_LongPoint::step (234 bytes)
1237 140 3 Hash_LongPoint::step (234 bytes) made not entrant
Methods are compiled, then discarded, then compiled again, and all the methods of interest were compiled at least once. We’ll look at the last compilation of each method.
Let’s start with LongPoint6.hashCode()
.
The main part of the code looks familiar – we saw it when first analysing division-based hash in Java 8:
0x00007f49692c150c: mov rdi,QWORD PTR [rsi+0x10] ;*getfield v
; - LongPoint6::hashCode@1 (line 21)
0x00007f49692c1510: mov rsi,rdi
0x00007f49692c1513: mov rdi,0x386fa527
0x00007f49692c151d: mov rbx,rdi
0x00007f49692c1520: mov rdi,rbx
0x00007f49692c1523: cmp rbx,0x0
0x00007f49692c1527: je 0x00007f49692c1540
0x00007f49692c152d: call 0x00007f497e8e9960 ;*lrem
; - LongPoint6::hashCode@7 (line 21)
; {runtime_call}
The remainder is calculated by calling a runtime routine, and the divisor is first compared with zero, which is completely unnecessary, as it is a freshly loaded known constant.
However, this method isn’t really called – or, rather, it stops being called as soon as methods such as Hash_LongPoint.inc()
get compiled.
Let’s look at this method. Here is the Java code:
private void inc (long w)
{
LongPoint key = factory.create (w);
Integer c = counts.get (key);
counts.put (key, c == null ? 1 : c+1);
}
The assembly code is quite long (1250 lines, 3487 bytes), and it contains a piece that also looks familiar:
0x00007f9d85334035: mov r11,r9
0x00007f9d85334038: sar r11,0x3f
0x00007f9d8533403c: mov rax,0x2449f0232c624b0b
0x00007f9d85334046: imul r9
0x00007f9d85334049: sar rdx,0x1b
0x00007f9d8533404d: sub rdx,r11
0x00007f9d85334050: imul r11,rdx,0x386fa527
0x00007f9d85334057: mov r8,r9
0x00007f9d8533405a: sub r8,r11
This is the optimised version of calculating a remainder, which we until now only saw in the code produced by Java 7. The conclusion we made previously that this optimisation had been removed from Java 8 turned out to be incorrect. Java 8 is capable of this optimisation; it just does not perform it in the micro-benchmark test. This gives us the answer to our mystery, but let’s carry on looking at the code.
The code shown above is present twice in inc()
, which corresponds to two HashMap
operations in this method (get()
and put()
). It’s a pity the compiler hasn’t figured out
that both operate on the same object (key
), and the first result could have been used in the second case as well.
Apart from that, the code of this method is quite impressive. The compiler have done a lot of work to produce this code. It performed speculative inlining of factory.create()
, HashMap.get()
and HashMap.put()
(speculative inlining means that the compiler, before performing true inlining, emits the code that checks if the object is of the expected class and falls back to the interpreting mode if not). Then it inlined all relevant methods from
HashMap
: get()
, put()
, hash()
, getNode()
and others. This in turn allowed inlining hashCode()
and equals()
, this time in absolute, not speculative, way, because
the compiler knows the exact type of a key object: it is LongPoint6
. That’s why the entire remainder-calculating sequence shown above is placed there. That’s also why there
is no virtual call to equals()
– this operation is fully inlined and replaced by a comparison of two numbers. The designers of
HashMap
helped there, too. The main loop of HashMap.getNode()
looks like this:
do {
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k))))
return e;
} while ((e = e.next) != null);
The type detection and call de-virtualisation and inlining wouldn’t have happened if the equals()
call looked the other way around:
((k = e.key) == key || && k.equals(key)))
because we can’t make any assumptions about the type of the object that’s just been extracted from a data structure. The missing null check is hardly a compensation for that. This is something to keep in mind when writing code with virtual calls.
Some exotic methods, such as
TreeNode.getTreeNode()
, treeifyBin()
, resize()
were not inlined – probably, they were considered cold code.
Similar work is done for counts.put()
– all relevant methods are inlined.
By the way, a rather long piece of code is generated for c+1
, because both c
and the result are Integer
, not int
:
0x00007f9d85334105: mov r10d,DWORD PTR [r10+0xc]
0x00007f9d85334109: inc r10d ;*invokestatic valueOf
; - Hash_LongPoint::inc@43 (line 49)
0x00007f9d8533410c: cmp r10d,0xffffffffffffff80
0x00007f9d85334110: jl 0x00007f9d8533481d ;*if_icmplt
; - java.lang.Integer::valueOf@3 (line 830)
; - Hash_LongPoint::inc@43 (line 49)
0x00007f9d85334116: cmp r10d,0x7f
0x00007f9d8533411a: jg 0x00007f9d853346a9 ;*if_icmpgt
; - java.lang.Integer::valueOf@10 (line 830)
; - Hash_LongPoint::inc@43 (line 49)
0x00007f9d85334120: mov r8d,r10d
0x00007f9d85334123: add r8d,0x80 ;*iadd
; - java.lang.Integer::valueOf@20 (line 831)
; - Hash_LongPoint::inc@43 (line 49)
0x00007f9d8533412a: cmp r8d,0x100
0x00007f9d85334131: jae 0x00007f9d853347c5
0x00007f9d85334137: movsxd r10,r10d
0x00007f9d8533413a: mov r11,0x67092cd58 ; {oop(a 'java/lang/Integer'[256] )}
0x00007f9d85334144: mov r10d,DWORD PTR [r11+r10*4+0x210]
The first instruction in this listing is unboxing (a field is read from the Integer
object), the rest is boxing (or, rather, the main path of boxing).
The reason the code is so long is that boxing isn’t implemented as just new Integer(i)
. Instead, Integer.valueOf(i)
is called, which is optimised by using a cache:
public static Integer valueOf(int i) {
if (i >= IntegerCache.low && i <= IntegerCache.high)
return IntegerCache.cache[i + (-IntegerCache.low)];
return new Integer(i);
}
Usually IntegerCache.low
is −128 and IntegerCache.high
is 127 (there is a way to change the high value, but not the low one). This is where comparisons with 0x7f and 0x80
come from. The next four lines perform the array index check:
0x00007f9d85334120: mov r8d,r10d
0x00007f9d85334123: add r8d,0x80 ;*iadd
; - java.lang.Integer::valueOf@20 (line 831)
; - Hash_LongPoint::inc@43 (line 49)
0x00007f9d8533412a: cmp r8d,0x100
0x00007f9d85334131: jae 0x00007f9d853347c5
Note how clever the compiler is to check both array boundaries using just one comparison – the trick is to use the unsigned branch instruction jae
, which
fires for both negative and big positive values. Unfortunately, the compiler isn’t clever enough to figure out that this index checking isn’t necessary at all – two previous
checks have already done the job. Even adding 0x80 is unnecessary, since this operation has been incorporated into the addressing mode when accessing the array ([r11+r10*4+0x210]
).
The Hash_LongPoint.dec()
looks very similar to inc()
in Java:
private void dec (long w)
{
LongPoint key = factory.create (w);
int c = counts.get (key)-1;
if (c != 0) {
counts.put (key, c);
} else {
counts.remove (key);
}
}
and is compiled in a similar way, too. The assembly contains three occurrences of multiplication by 0x2449f0232c624b0b – one for each HashMap
operation.
All important hash map operations have been inlined (speculatively). Both LongPoint6.hashCode()
and LongPoint.equals()
have been fully inlined.
Again, boxing required some extra code.
The Hash_LongPoint.set()
and Hash_LongPoint.reset()
also demonstrate some degree of inlining. This is what they look like in Java:
void set (LongPoint k)
{
long w = k.v;
inc (w-DX-DY);
inc (w-DX);
inc (w-DX+DY);
inc (w-DY);
inc (w+DY);
inc (w+DX-DY);
inc (w+DX);
inc (w+DX+DY);
field.add (k);
}
void reset (LongPoint k)
{
long w = k.v;
dec (w-DX-DY);
dec (w-DX);
dec (w-DX+DY);
dec (w-DY);
dec (w+DY);
dec (w+DX-DY);
dec (w+DX);
dec (w+DX+DY);
field.remove (k);
}
They look very similar to each other in assembly, too (see here
and here). Eight calls to inc()
and dec()
have not been inlined
(however, the calls are made direct, not virtual), while the trailing
hash map operation is inlined and produces one multiplication by 0x2449f0232c624b0b.
Finally, there is Hash_LongPoint.step()
:
@Override
public void step ()
{
ArrayList<LongPoint> toReset = new ArrayList<LongPoint> ();
ArrayList<LongPoint> toSet = new ArrayList<LongPoint> ();
for (LongPoint w : field) {
Integer c = counts.get (w);
if (c == null || c < 2 || c > 3) toReset.add (w);
}
for (LongPoint w : counts.keySet ()) {
if (counts.get (w) == 3 && ! field.contains (w)) toSet.add (w);
}
for (LongPoint w : toSet) {
set (w);
}
for (LongPoint w : toReset) {
reset (w);
}
}
The assembly output for this method is very long (more than 5000 assembly lines). It contains three inlined hash map operations, each with inlined and optimised hashCode()
.
It also contains one non-virtual call to set()
and eight calls to dec()
, also non-virtual. This means that reset()
was inlined while set()
was not.
The code also contains a lot of artefacts of ArrayList
, KeySet
, HashMap.HashIterator
, ArrayList.Itr
and other classes. While ArrayList
operations, such as array
resizing, are expected (I can’t think of an optimisation that would manage to optimise them out), it should be possible to remove iterators, at least in simple cases,
such as iterating ArrayList
s. The compiler doesn’t do it – the iterator objects are there, and all required virtual calls are made. Perhaps, we should consider replacing those ArrayList
s with plain arrays, although the best solution would
be to get rid of them altogether. Some better way to iterate over our sets might also help, but this is a topic for future consideration. For now, it’s enough to know
that Java 8 is in fact perfectly capable of optimising division, and all our previous attempts to do it manually were unnecessary.
Conclusions
-
HotSpot demonstrates impressive optimisation capabilities. It is capable of deep inlining, which allows de-virtualisation and inlining of callback methods.
-
It still misses some little optimisation opportunities here and there.
-
Contrary to our previous belief, Java 8 is capable of replacing division by a constant with multiplication by a reciprocal.
-
This optimisation, however, isn’t performed always; it requires some additional conditions, such as very hot code, massive inlining, or something else of this kind.
-
Some other optimisations may follow this pattern.
-
In our case this optimisation pattern caused different behaviour of the optimiser in the case of a micro-benchmark and in the case of a real program – surprisingly, in favour of the real program. We are so used to a cheating behaviour when benchmarks perform much better than real products, that it is very pleasant to discover an example of the opposite.
-
Still, it means that micro-benchmarks are misleading and must be used with caution, if at all. Nothing replaces a full program performance testing.
-
Don’t optimise what’s already optimised.
Coming soon
Another candidate for optimisation is the hash function based on CRC32
. We’ll see if we can avoid the same pitfall.