Don't trust a micro-benchmark

In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life using Java’s HashMap. We discovered that performance depends on a choice of a hash function. In “Hash functions: disassembling and measuring performance” we developed a method to measure the speed of the hash functions in separation from the rest of the code (as a micro-benchmark). We found that the function based on division (or, more accurately, calculation of a remainder) worked much faster on Java 7 than on Java 8. In “Optimising division-based hash functions” we found why. Java 7 optimised a constant division by replacing it with a multiplication by a reciprocal, while Java 8 didn’t. In pseudo-code this replacement looked like this. The original hash function:

public int hashCode ()
{
    return (int) (v % 946840871);
}

The optimised hash function:

public int hashCode ()
{
    long sign = v >> 63;
    long div = ((v ** 2614885092524444427L) >> 91) - sign;
    return (int) (v - div * 946840871);
}

Java does not support the operation that we indicated with **, which multiplies two 64-bit numbers and produces a 128-bit result. We’ve spent enormous effort to implement this multiplication in Java and optimise it. Two versions were developed in “Optimising division-based hash functions”, two more in “More optimisations of division-based hash: Karatsuba’s method”. Here is the entire code. The results looked good on a micro-benchmark:

Class name	Comment	Time, Java 7	Time, Java 8
LongPoint6	Original version	578	1655
LongPoint60	Optimised, unsigned	640	610
LongPoint61	Optimised, signed	729	727
LongPoint62	Optimised and specialised	595	547
LongPoint63	Karatsuba's multiplication	622	565

However, the real test still showed superiority of the original version:

Class name	Comment	Time, Java 7	Time, Java 8
LongPoint6	Original version	2032	1650
LongPoint60	Optimised, unsigned	2247	1877
LongPoint61	Optimised, signed	2368	1885
LongPoint62	Optimised and specialised	2191	1726
LongPoint63	Karatsuba's multiplication	2239	1776

This was a real mystery: why the function performs so badly on a micro-benchmark, and can be improved a lot, while on a real test it performs well and all the improvements fail?

Analysing the disassembly

The only way to answer this question is to run the full test on Java 8 and dump the disassembly output of the code generated by HotSpot. When we look at the list of actions taken on our classes, we see quite an activity:

CompilerOracle: print *Hash_LongPoint.inc
 27       3       LongPoint6::hashCode (10 bytes)
 31       3       LongPoint::equals (21 bytes)
 36       3       LongPoint6$1::create (6 bytes)
 37       3       LongPoint6$1::create (10 bytes)
 38       3       LongPoint6::<init> (6 bytes)
 39       3       LongPoint6::<init> (6 bytes)
 40       3       LongPoint::<init> (10 bytes)
 47       3       Hash_LongPoint::inc (51 bytes)
 49       1       LongPoint6::hashCode (10 bytes)
 27       3       LongPoint6::hashCode (10 bytes)   made not entrant
 54       3       Hash_LongPoint::dec (59 bytes)
 68       3       LongPoint::toPoint (8 bytes)
 82       4       LongPoint::equals (21 bytes)
 84       3       Hash_LongPoint::set (91 bytes)
 31       3       LongPoint::equals (21 bytes)   made not entrant
 86       3       Hash_LongPoint::reset (91 bytes)
 88       4       LongPoint6$1::create (6 bytes)
 36       3       LongPoint6$1::create (6 bytes)   made not entrant
 96       4       Hash_LongPoint::inc (51 bytes)
101       4       Hash_LongPoint::dec (59 bytes)
 54       3       Hash_LongPoint::dec (59 bytes)   made not entrant
 47       3       Hash_LongPoint::inc (51 bytes)   made not entrant
121       4       LongPoint::toPoint (8 bytes)
 68       3       LongPoint::toPoint (8 bytes)   made not entrant
126       3       Hash_LongPoint::step (234 bytes)
128       4       Hash_LongPoint::set (91 bytes)
129       4       Hash_LongPoint::reset (91 bytes)
130 %     4       Hash_LongPoint::step @ 104 (234 bytes)
 86       3       Hash_LongPoint::reset (91 bytes)   made not entrant
 84       3       Hash_LongPoint::set (91 bytes)   made not entrant
132       4       Hash_LongPoint::step (234 bytes)
126       3       Hash_LongPoint::step (234 bytes)   made not entrant
 96       4       Hash_LongPoint::inc (51 bytes)   made not entrant
135       3       Hash_LongPoint::inc (51 bytes)
101       4       Hash_LongPoint::dec (59 bytes)   made not entrant
139       3       Hash_LongPoint::dec (59 bytes)
132       4       Hash_LongPoint::step (234 bytes)   made not entrant
130 %     4       Hash_LongPoint::step @ -2 (234 bytes)  made not entrant
140       3       Hash_LongPoint::step (234 bytes)
145       4       Hash_LongPoint::inc (51 bytes)
146       4       Hash_LongPoint::dec (59 bytes)
148 %     4       Hash_LongPoint::step @ 104 (234 bytes)
135       3       Hash_LongPoint::inc (51 bytes)   made not entrant
139       3       Hash_LongPoint::dec (59 bytes)   made not entrant
161       4       Hash_LongPoint::step (234 bytes)
140       3       Hash_LongPoint::step (234 bytes)   made not entrant

Methods are compiled, then discarded, then compiled again, and all the methods of interest were compiled at least once. We’ll look at the last compilation of each method.

Let’s start with LongPoint6.hashCode(). The main part of the code looks familiar – we saw it when first analysing division-based hash in Java 8:

0x00007f49692c150c: mov    rdi,QWORD PTR [rsi+0x10]  ;*getfield v
                                              ; - LongPoint6::hashCode@1 (line 21)
0x00007f49692c1510: mov    rsi,rdi
0x00007f49692c1513: mov    rdi,0x386fa527
0x00007f49692c151d: mov    rbx,rdi
0x00007f49692c1520: mov    rdi,rbx
0x00007f49692c1523: cmp    rbx,0x0
0x00007f49692c1527: je     0x00007f49692c1540
0x00007f49692c152d: call   0x00007f497e8e9960  ;*lrem
                                              ; - LongPoint6::hashCode@7 (line 21)
                                              ;   {runtime_call}

The remainder is calculated by calling a runtime routine, and the divisor is first compared with zero, which is completely unnecessary, as it is a freshly loaded known constant. However, this method isn’t really called – or, rather, it stops being called as soon as methods such as Hash_LongPoint.inc() get compiled.

Let’s look at this method. Here is the Java code:

private void inc (long w)
{
    LongPoint key = factory.create (w);
    Integer c = counts.get (key);
    counts.put (key, c == null ? 1 : c+1);
}

The assembly code is quite long (1250 lines, 3487 bytes), and it contains a piece that also looks familiar:

0x00007f9d85334035: mov    r11,r9
0x00007f9d85334038: sar    r11,0x3f
0x00007f9d8533403c: mov    rax,0x2449f0232c624b0b
0x00007f9d85334046: imul   r9
0x00007f9d85334049: sar    rdx,0x1b
0x00007f9d8533404d: sub    rdx,r11
0x00007f9d85334050: imul   r11,rdx,0x386fa527
0x00007f9d85334057: mov    r8,r9
0x00007f9d8533405a: sub    r8,r11

This is the optimised version of calculating a remainder, which we until now only saw in the code produced by Java 7. The conclusion we made previously that this optimisation had been removed from Java 8 turned out to be incorrect. Java 8 is capable of this optimisation; it just does not perform it in the micro-benchmark test. This gives us the answer to our mystery, but let’s carry on looking at the code.

The code shown above is present twice in inc(), which corresponds to two HashMap operations in this method (get() and put()). It’s a pity the compiler hasn’t figured out that both operate on the same object (key), and the first result could have been used in the second case as well.

Apart from that, the code of this method is quite impressive. The compiler have done a lot of work to produce this code. It performed speculative inlining of factory.create(), HashMap.get() and HashMap.put() (speculative inlining means that the compiler, before performing true inlining, emits the code that checks if the object is of the expected class and falls back to the interpreting mode if not). Then it inlined all relevant methods from HashMap: get(), put(), hash(), getNode() and others. This in turn allowed inlining hashCode() and equals(), this time in absolute, not speculative, way, because the compiler knows the exact type of a key object: it is LongPoint6. That’s why the entire remainder-calculating sequence shown above is placed there. That’s also why there is no virtual call to equals() – this operation is fully inlined and replaced by a comparison of two numbers. The designers of HashMap helped there, too. The main loop of HashMap.getNode() looks like this:

do {
    if (e.hash == hash &&
        ((k = e.key) == key || (key != null && key.equals(k))))
        return e;
} while ((e = e.next) != null);

The type detection and call de-virtualisation and inlining wouldn’t have happened if the equals() call looked the other way around:

((k = e.key) == key || && k.equals(key)))

because we can’t make any assumptions about the type of the object that’s just been extracted from a data structure. The missing null check is hardly a compensation for that. This is something to keep in mind when writing code with virtual calls.

Some exotic methods, such as TreeNode.getTreeNode(), treeifyBin(), resize() were not inlined – probably, they were considered cold code.

Similar work is done for counts.put() – all relevant methods are inlined.

By the way, a rather long piece of code is generated for c+1, because both c and the result are Integer, not int:

0x00007f9d85334105: mov    r10d,DWORD PTR [r10+0xc]
0x00007f9d85334109: inc    r10d               ;*invokestatic valueOf
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533410c: cmp    r10d,0xffffffffffffff80
0x00007f9d85334110: jl     0x00007f9d8533481d  ;*if_icmplt
                                              ; - java.lang.Integer::valueOf@3 (line 830)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d85334116: cmp    r10d,0x7f
0x00007f9d8533411a: jg     0x00007f9d853346a9  ;*if_icmpgt
                                              ; - java.lang.Integer::valueOf@10 (line 830)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d85334120: mov    r8d,r10d
0x00007f9d85334123: add    r8d,0x80           ;*iadd
                                              ; - java.lang.Integer::valueOf@20 (line 831)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533412a: cmp    r8d,0x100
0x00007f9d85334131: jae    0x00007f9d853347c5
0x00007f9d85334137: movsxd r10,r10d
0x00007f9d8533413a: mov    r11,0x67092cd58    ;   {oop(a 'java/lang/Integer'[256] )}
0x00007f9d85334144: mov    r10d,DWORD PTR [r11+r10*4+0x210]

The first instruction in this listing is unboxing (a field is read from the Integer object), the rest is boxing (or, rather, the main path of boxing). The reason the code is so long is that boxing isn’t implemented as just new Integer(i). Instead, Integer.valueOf(i) is called, which is optimised by using a cache:

public static Integer valueOf(int i) {
    if (i >= IntegerCache.low && i <= IntegerCache.high)
        return IntegerCache.cache[i + (-IntegerCache.low)];
    return new Integer(i);
}

Usually IntegerCache.low is −128 and IntegerCache.high is 127 (there is a way to change the high value, but not the low one). This is where comparisons with 0x7f and 0x80 come from. The next four lines perform the array index check:

0x00007f9d85334120: mov    r8d,r10d
0x00007f9d85334123: add    r8d,0x80           ;*iadd
                                              ; - java.lang.Integer::valueOf@20 (line 831)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533412a: cmp    r8d,0x100
0x00007f9d85334131: jae    0x00007f9d853347c5

Note how clever the compiler is to check both array boundaries using just one comparison – the trick is to use the unsigned branch instruction jae, which fires for both negative and big positive values. Unfortunately, the compiler isn’t clever enough to figure out that this index checking isn’t necessary at all – two previous checks have already done the job. Even adding 0x80 is unnecessary, since this operation has been incorporated into the addressing mode when accessing the array ([r11+r10*4+0x210]).

The Hash_LongPoint.dec() looks very similar to inc() in Java:

private void dec (long w)
{
    LongPoint key = factory.create (w);
    int c = counts.get (key)-1;
    if (c != 0) {
        counts.put (key, c);
    } else {
        counts.remove (key);
    }
}

and is compiled in a similar way, too. The assembly contains three occurrences of multiplication by 0x2449f0232c624b0b – one for each HashMap operation. All important hash map operations have been inlined (speculatively). Both LongPoint6.hashCode() and LongPoint.equals() have been fully inlined. Again, boxing required some extra code.

The Hash_LongPoint.set() and Hash_LongPoint.reset() also demonstrate some degree of inlining. This is what they look like in Java:

void set (LongPoint k)
{
    long w = k.v;
    inc (w-DX-DY);
    inc (w-DX);
    inc (w-DX+DY);
    inc (w-DY);
    inc (w+DY);
    inc (w+DX-DY);
    inc (w+DX);
    inc (w+DX+DY);
    field.add (k);
}
    
void reset (LongPoint k)
{
    long w = k.v;
    dec (w-DX-DY);
    dec (w-DX);
    dec (w-DX+DY);
    dec (w-DY);
    dec (w+DY);
    dec (w+DX-DY);
    dec (w+DX);
    dec (w+DX+DY);
    field.remove (k);
}

They look very similar to each other in assembly, too (see here and here). Eight calls to inc() and dec() have not been inlined (however, the calls are made direct, not virtual), while the trailing hash map operation is inlined and produces one multiplication by 0x2449f0232c624b0b.

Finally, there is Hash_LongPoint.step():

@Override
public void step ()
{
    ArrayList<LongPoint> toReset = new ArrayList<LongPoint> ();
    ArrayList<LongPoint> toSet = new ArrayList<LongPoint> ();
    for (LongPoint w : field) {
        Integer c = counts.get (w);
        if (c == null || c < 2 || c > 3) toReset.add (w);
    }
    for (LongPoint w : counts.keySet ()) {
        if (counts.get (w) == 3 && ! field.contains (w)) toSet.add (w);
    }
    for (LongPoint w : toSet) {
        set (w);
    }
    for (LongPoint w : toReset) {
        reset (w);
    }
}

The assembly output for this method is very long (more than 5000 assembly lines). It contains three inlined hash map operations, each with inlined and optimised hashCode(). It also contains one non-virtual call to set() and eight calls to dec(), also non-virtual. This means that reset() was inlined while set() was not.

The code also contains a lot of artefacts of ArrayList, KeySet, HashMap.HashIterator, ArrayList.Itr and other classes. While ArrayList operations, such as array resizing, are expected (I can’t think of an optimisation that would manage to optimise them out), it should be possible to remove iterators, at least in simple cases, such as iterating ArrayLists. The compiler doesn’t do it – the iterator objects are there, and all required virtual calls are made. Perhaps, we should consider replacing those ArrayLists with plain arrays, although the best solution would be to get rid of them altogether. Some better way to iterate over our sets might also help, but this is a topic for future consideration. For now, it’s enough to know that Java 8 is in fact perfectly capable of optimising division, and all our previous attempts to do it manually were unnecessary.

Conclusions

HotSpot demonstrates impressive optimisation capabilities. It is capable of deep inlining, which allows de-virtualisation and inlining of callback methods.
It still misses some little optimisation opportunities here and there.
Contrary to our previous belief, Java 8 is capable of replacing division by a constant with multiplication by a reciprocal.
This optimisation, however, isn’t performed always; it requires some additional conditions, such as very hot code, massive inlining, or something else of this kind.
Some other optimisations may follow this pattern.
In our case this optimisation pattern caused different behaviour of the optimiser in the case of a micro-benchmark and in the case of a real program – surprisingly, in favour of the real program. We are so used to a cheating behaviour when benchmarks perform much better than real products, that it is very pleasant to discover an example of the opposite.
Still, it means that micro-benchmarks are misleading and must be used with caution, if at all. Nothing replaces a full program performance testing.
Don’t optimise what’s already optimised.

Coming soon

Another candidate for optimisation is the hash function based on CRC32. We’ll see if we can avoid the same pitfall.

Experiments in program optimisation

Don't trust a micro-benchmark

Analysing the disassembly

Conclusions

Coming soon