Experiments in program optimisation

Don't trust a micro-benchmark

Story: Conway's Game of Life

Tags: Java optimisation hashmap life

28 Oct 2015

In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life using Java’s HashMap. We discovered that performance depends on a choice of a hash function. In “Hash functions: disassembling and measuring performance” we developed a method to measure the speed of the hash functions in separation from the rest of the code (as a micro-benchmark). We found that the function based on division (or, more accurately, calculation of a remainder) worked much faster on Java 7 than on Java 8. In “Optimising division-based hash functions” we found why. Java 7 optimised a constant division by replacing it with a multiplication by a reciprocal, while Java 8 didn’t. In pseudo-code this replacement looked like this. The original hash function:

public int hashCode ()
{
    return (int) (v % 946840871);
}

The optimised hash function:

public int hashCode ()
{
    long sign = v >> 63;
    long div = ((v ** 2614885092524444427L) >> 91) - sign;
    return (int) (v - div * 946840871);
}

Java does not support the operation that we indicated with **, which multiplies two 64-bit numbers and produces a 128-bit result. We’ve spent enormous effort to implement this multiplication in Java and optimise it. Two versions were developed in “Optimising division-based hash functions”, two more in “More optimisations of division-based hash: Karatsuba’s method”. Here is the entire code. The results looked good on a micro-benchmark:

Class name Comment Time, Java 7Time, Java 8
LongPoint6 Original version 578 1655
LongPoint60 Optimised, unsigned 640 610
LongPoint61 Optimised, signed 729 727
LongPoint62 Optimised and specialised 595 547
LongPoint63 Karatsuba's multiplication 622 565

However, the real test still showed superiority of the original version:

Class name Comment Time, Java 7Time, Java 8
LongPoint6 Original version 2032 1650
LongPoint60 Optimised, unsigned 2247 1877
LongPoint61 Optimised, signed 2368 1885
LongPoint62 Optimised and specialised 2191 1726
LongPoint63 Karatsuba's multiplication 2239 1776

This was a real mystery: why the function performs so badly on a micro-benchmark, and can be improved a lot, while on a real test it performs well and all the improvements fail?

Analysing the disassembly

The only way to answer this question is to run the full test on Java 8 and dump the disassembly output of the code generated by HotSpot. When we look at the list of actions taken on our classes, we see quite an activity:

CompilerOracle: print *Hash_LongPoint.inc
 102   27       3       LongPoint6::hashCode (10 bytes)
 104   31       3       LongPoint::equals (21 bytes)
 105   36       3       LongPoint6$1::create (6 bytes)
 105   37       3       LongPoint6$1::create (10 bytes)
 105   38       3       LongPoint6::<init> (6 bytes)
 105   39       3       LongPoint6::<init> (6 bytes)
 106   40       3       LongPoint::<init> (10 bytes)
 110   47       3       Hash_LongPoint::inc (51 bytes)
 129   49       1       LongPoint6::hashCode (10 bytes)
 129   27       3       LongPoint6::hashCode (10 bytes)   made not entrant
 131   54       3       Hash_LongPoint::dec (59 bytes)
 134   68       3       LongPoint::toPoint (8 bytes)
 137   82       4       LongPoint::equals (21 bytes)
 139   84       3       Hash_LongPoint::set (91 bytes)
 139   31       3       LongPoint::equals (21 bytes)   made not entrant
 140   86       3       Hash_LongPoint::reset (91 bytes)
 142   88       4       LongPoint6$1::create (6 bytes)
 143   36       3       LongPoint6$1::create (6 bytes)   made not entrant
 147   96       4       Hash_LongPoint::inc (51 bytes)
 148  101       4       Hash_LongPoint::dec (59 bytes)
 166   54       3       Hash_LongPoint::dec (59 bytes)   made not entrant
 197   47       3       Hash_LongPoint::inc (51 bytes)   made not entrant
 206  121       4       LongPoint::toPoint (8 bytes)
 207   68       3       LongPoint::toPoint (8 bytes)   made not entrant
 216  126       3       Hash_LongPoint::step (234 bytes)
 225  128       4       Hash_LongPoint::set (91 bytes)
 225  129       4       Hash_LongPoint::reset (91 bytes)
 226  130 %     4       Hash_LongPoint::step @ 104 (234 bytes)
 228   86       3       Hash_LongPoint::reset (91 bytes)   made not entrant
 236   84       3       Hash_LongPoint::set (91 bytes)   made not entrant
 280  132       4       Hash_LongPoint::step (234 bytes)
 360  126       3       Hash_LongPoint::step (234 bytes)   made not entrant
 691   96       4       Hash_LongPoint::inc (51 bytes)   made not entrant
 691  135       3       Hash_LongPoint::inc (51 bytes)
 703  101       4       Hash_LongPoint::dec (59 bytes)   made not entrant
 704  139       3       Hash_LongPoint::dec (59 bytes)
 704  132       4       Hash_LongPoint::step (234 bytes)   made not entrant
 705  130 %     4       Hash_LongPoint::step @ -2 (234 bytes)  made not entrant
 705  140       3       Hash_LongPoint::step (234 bytes)
 710  145       4       Hash_LongPoint::inc (51 bytes)
 711  146       4       Hash_LongPoint::dec (59 bytes)
 719  148 %     4       Hash_LongPoint::step @ 104 (234 bytes)
 759  135       3       Hash_LongPoint::inc (51 bytes)   made not entrant
 761  139       3       Hash_LongPoint::dec (59 bytes)   made not entrant
1131  161       4       Hash_LongPoint::step (234 bytes)
1237  140       3       Hash_LongPoint::step (234 bytes)   made not entrant

Methods are compiled, then discarded, then compiled again, and all the methods of interest were compiled at least once. We’ll look at the last compilation of each method.

Let’s start with LongPoint6.hashCode(). The main part of the code looks familiar – we saw it when first analysing division-based hash in Java 8:

0x00007f49692c150c: mov    rdi,QWORD PTR [rsi+0x10]  ;*getfield v
                                              ; - LongPoint6::hashCode@1 (line 21)
0x00007f49692c1510: mov    rsi,rdi
0x00007f49692c1513: mov    rdi,0x386fa527
0x00007f49692c151d: mov    rbx,rdi
0x00007f49692c1520: mov    rdi,rbx
0x00007f49692c1523: cmp    rbx,0x0
0x00007f49692c1527: je     0x00007f49692c1540
0x00007f49692c152d: call   0x00007f497e8e9960  ;*lrem
                                              ; - LongPoint6::hashCode@7 (line 21)
                                              ;   {runtime_call}

The remainder is calculated by calling a runtime routine, and the divisor is first compared with zero, which is completely unnecessary, as it is a freshly loaded known constant. However, this method isn’t really called – or, rather, it stops being called as soon as methods such as Hash_LongPoint.inc() get compiled.

Let’s look at this method. Here is the Java code:

private void inc (long w)
{
    LongPoint key = factory.create (w);
    Integer c = counts.get (key);
    counts.put (key, c == null ? 1 : c+1);
}

The assembly code is quite long (1250 lines, 3487 bytes), and it contains a piece that also looks familiar:

0x00007f9d85334035: mov    r11,r9
0x00007f9d85334038: sar    r11,0x3f
0x00007f9d8533403c: mov    rax,0x2449f0232c624b0b
0x00007f9d85334046: imul   r9
0x00007f9d85334049: sar    rdx,0x1b
0x00007f9d8533404d: sub    rdx,r11
0x00007f9d85334050: imul   r11,rdx,0x386fa527
0x00007f9d85334057: mov    r8,r9
0x00007f9d8533405a: sub    r8,r11

This is the optimised version of calculating a remainder, which we until now only saw in the code produced by Java 7. The conclusion we made previously that this optimisation had been removed from Java 8 turned out to be incorrect. Java 8 is capable of this optimisation; it just does not perform it in the micro-benchmark test. This gives us the answer to our mystery, but let’s carry on looking at the code.

The code shown above is present twice in inc(), which corresponds to two HashMap operations in this method (get() and put()). It’s a pity the compiler hasn’t figured out that both operate on the same object (key), and the first result could have been used in the second case as well.

Apart from that, the code of this method is quite impressive. The compiler have done a lot of work to produce this code. It performed speculative inlining of factory.create(), HashMap.get() and HashMap.put() (speculative inlining means that the compiler, before performing true inlining, emits the code that checks if the object is of the expected class and falls back to the interpreting mode if not). Then it inlined all relevant methods from HashMap: get(), put(), hash(), getNode() and others. This in turn allowed inlining hashCode() and equals(), this time in absolute, not speculative, way, because the compiler knows the exact type of a key object: it is LongPoint6. That’s why the entire remainder-calculating sequence shown above is placed there. That’s also why there is no virtual call to equals() – this operation is fully inlined and replaced by a comparison of two numbers. The designers of HashMap helped there, too. The main loop of HashMap.getNode() looks like this:

do {
    if (e.hash == hash &&
        ((k = e.key) == key || (key != null && key.equals(k))))
        return e;
} while ((e = e.next) != null);

The type detection and call de-virtualisation and inlining wouldn’t have happened if the equals() call looked the other way around:

((k = e.key) == key || && k.equals(key)))

because we can’t make any assumptions about the type of the object that’s just been extracted from a data structure. The missing null check is hardly a compensation for that. This is something to keep in mind when writing code with virtual calls.

Some exotic methods, such as TreeNode.getTreeNode(), treeifyBin(), resize() were not inlined – probably, they were considered cold code.

Similar work is done for counts.put() – all relevant methods are inlined.

By the way, a rather long piece of code is generated for c+1, because both c and the result are Integer, not int:

0x00007f9d85334105: mov    r10d,DWORD PTR [r10+0xc]
0x00007f9d85334109: inc    r10d               ;*invokestatic valueOf
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533410c: cmp    r10d,0xffffffffffffff80
0x00007f9d85334110: jl     0x00007f9d8533481d  ;*if_icmplt
                                              ; - java.lang.Integer::valueOf@3 (line 830)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d85334116: cmp    r10d,0x7f
0x00007f9d8533411a: jg     0x00007f9d853346a9  ;*if_icmpgt
                                              ; - java.lang.Integer::valueOf@10 (line 830)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d85334120: mov    r8d,r10d
0x00007f9d85334123: add    r8d,0x80           ;*iadd
                                              ; - java.lang.Integer::valueOf@20 (line 831)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533412a: cmp    r8d,0x100
0x00007f9d85334131: jae    0x00007f9d853347c5
0x00007f9d85334137: movsxd r10,r10d
0x00007f9d8533413a: mov    r11,0x67092cd58    ;   {oop(a 'java/lang/Integer'[256] )}
0x00007f9d85334144: mov    r10d,DWORD PTR [r11+r10*4+0x210]

The first instruction in this listing is unboxing (a field is read from the Integer object), the rest is boxing (or, rather, the main path of boxing). The reason the code is so long is that boxing isn’t implemented as just new Integer(i). Instead, Integer.valueOf(i) is called, which is optimised by using a cache:

public static Integer valueOf(int i) {
    if (i >= IntegerCache.low && i <= IntegerCache.high)
        return IntegerCache.cache[i + (-IntegerCache.low)];
    return new Integer(i);
}

Usually IntegerCache.low is −128 and IntegerCache.high is 127 (there is a way to change the high value, but not the low one). This is where comparisons with 0x7f and 0x80 come from. The next four lines perform the array index check:

0x00007f9d85334120: mov    r8d,r10d
0x00007f9d85334123: add    r8d,0x80           ;*iadd
                                              ; - java.lang.Integer::valueOf@20 (line 831)
                                              ; - Hash_LongPoint::inc@43 (line 49)

0x00007f9d8533412a: cmp    r8d,0x100
0x00007f9d85334131: jae    0x00007f9d853347c5

Note how clever the compiler is to check both array boundaries using just one comparison – the trick is to use the unsigned branch instruction jae, which fires for both negative and big positive values. Unfortunately, the compiler isn’t clever enough to figure out that this index checking isn’t necessary at all – two previous checks have already done the job. Even adding 0x80 is unnecessary, since this operation has been incorporated into the addressing mode when accessing the array ([r11+r10*4+0x210]).

The Hash_LongPoint.dec() looks very similar to inc() in Java:

private void dec (long w)
{
    LongPoint key = factory.create (w);
    int c = counts.get (key)-1;
    if (c != 0) {
        counts.put (key, c);
    } else {
        counts.remove (key);
    }
}

and is compiled in a similar way, too. The assembly contains three occurrences of multiplication by 0x2449f0232c624b0b – one for each HashMap operation. All important hash map operations have been inlined (speculatively). Both LongPoint6.hashCode() and LongPoint.equals() have been fully inlined. Again, boxing required some extra code.

The Hash_LongPoint.set() and Hash_LongPoint.reset() also demonstrate some degree of inlining. This is what they look like in Java:

void set (LongPoint k)
{
    long w = k.v;
    inc (w-DX-DY);
    inc (w-DX);
    inc (w-DX+DY);
    inc (w-DY);
    inc (w+DY);
    inc (w+DX-DY);
    inc (w+DX);
    inc (w+DX+DY);
    field.add (k);
}
    
void reset (LongPoint k)
{
    long w = k.v;
    dec (w-DX-DY);
    dec (w-DX);
    dec (w-DX+DY);
    dec (w-DY);
    dec (w+DY);
    dec (w+DX-DY);
    dec (w+DX);
    dec (w+DX+DY);
    field.remove (k);
}

They look very similar to each other in assembly, too (see here and here). Eight calls to inc() and dec() have not been inlined (however, the calls are made direct, not virtual), while the trailing hash map operation is inlined and produces one multiplication by 0x2449f0232c624b0b.

Finally, there is Hash_LongPoint.step():

@Override
public void step ()
{
    ArrayList<LongPoint> toReset = new ArrayList<LongPoint> ();
    ArrayList<LongPoint> toSet = new ArrayList<LongPoint> ();
    for (LongPoint w : field) {
        Integer c = counts.get (w);
        if (c == null || c < 2 || c > 3) toReset.add (w);
    }
    for (LongPoint w : counts.keySet ()) {
        if (counts.get (w) == 3 && ! field.contains (w)) toSet.add (w);
    }
    for (LongPoint w : toSet) {
        set (w);
    }
    for (LongPoint w : toReset) {
        reset (w);
    }
}

The assembly output for this method is very long (more than 5000 assembly lines). It contains three inlined hash map operations, each with inlined and optimised hashCode(). It also contains one non-virtual call to set() and eight calls to dec(), also non-virtual. This means that reset() was inlined while set() was not.

The code also contains a lot of artefacts of ArrayList, KeySet, HashMap.HashIterator, ArrayList.Itr and other classes. While ArrayList operations, such as array resizing, are expected (I can’t think of an optimisation that would manage to optimise them out), it should be possible to remove iterators, at least in simple cases, such as iterating ArrayLists. The compiler doesn’t do it – the iterator objects are there, and all required virtual calls are made. Perhaps, we should consider replacing those ArrayLists with plain arrays, although the best solution would be to get rid of them altogether. Some better way to iterate over our sets might also help, but this is a topic for future consideration. For now, it’s enough to know that Java 8 is in fact perfectly capable of optimising division, and all our previous attempts to do it manually were unnecessary.

Conclusions

Coming soon

Another candidate for optimisation is the hash function based on CRC32. We’ll see if we can avoid the same pitfall.

comments powered by Disqus