Hash functions: disassembling and measuring performance

In the previous article we’ve implemented Conway’s game of Life in Java using HashMap as our main building block. We’ve discovered that the performance depends on the choice of a hash function; some of them work much better than the others. The reason is that the keys are not truly random and follow some patterns, which cause hash code collisions. Good functions, even in the presence of patterns, spread values evenly over hash table slots and reduce chain sizes. Shorter lists must be traversed and fewer calls to equals() must be made.

However, hash functions also cost CPU cycles, and this cost may kill any benefit from better hashing. We’ve looked at several hash functions and compared them regarding the quality of hashing. Now it’s time to measure their own speed as well and check if it can be improved. For now, I’ll concentrate on the first task (measuring), for it turned to be non-trivial and full of surprises. We’ll look at improvements in the next article.

Although the article is called “measuring hash functions”, most of our findings don’t depend on the nature of the functions, and are generally applicable to any Java code. I should have investigated this topic before trying any optimisations of Java programs, so this should have been the article number one. But when I started, I didn’t posess a skill, which I’m going to practice today: to print the code generated by HotSpot. This turned out to be an exciting adventure. Enjoy the ride!

The classes to test

We want to compare performance of hashCode() methods of several classes. These classes are described in the previous article; their code can be found in the repository. They are used to store two integer values (x and y). One class (Point) stores them individually as two separate variables; the others pack them into a long, high 32 bits being used for x and low ones for y. Here are the classes and their hashCode() methods:

Point

public final int hashCode ()
{
    return x * 3 + y * 5;
}

java.lang.Long

public int hashCode() {
    return (int)(value ^ (value >>> 32));
}

LongPoint

public int hashCode ()
{
    return hi(v) * 3 + lo(v) * 5;
}

(with the obvious definitions of hi() and lo())

LongPoint3

public int hashCode ()
{
    return hi(v) * 11 + lo(v) * 17;
}

LongPoint4

public int hashCode ()
{
    return hi(v) * 1735499 + lo(v) * 7436369;
}

LongPoint5

public int hashCode ()
{
    long x = v * 541725397157L;
    return lo(x) ^ hi(x);
}

LongPoint6

public int hashCode ()
{
    return (int) (v % 946840871);
}

LongPoint7

public int hashCode ()
{
    java.util.zip.CRC32 crc = new java.util.zip.CRC32 ();
    crc.update ((int)(v>>> 0) & 0xFF);
    crc.update ((int)(v >>> 8) & 0xFF);
    crc.update ((int)(v >>> 16) & 0xFF);
    crc.update ((int)(v >>> 24) & 0xFF);
    crc.update ((int)(v >>> 32) & 0xFF);
    crc.update ((int)(v >>> 40) & 0xFF);
    crc.update ((int)(v >>> 48) & 0xFF);
    crc.update ((int)(v >>> 56) & 0xFF);
    return (int) crc.getValue ();
}

Let’s add another one, NullPoint:

public int hashCode ()
{
    return 0;
}

The first test

It seems easy to compare the performance of several methods. We’ll just run each method some big number of times and measure the time (the code is in the repository):

public class HashTime
{
    static void iterate (int N, Object x)
    {
        for (int i = 0; i < N; i++)
            x.hashCode ();
    }
 
    static void test (Object x)
    {
        System.out.printf ("%20s: ", x.getClass ().getName ());
        int N = 100000000;
        
        for (int iter = 0; iter < 3; iter ++) {
            long t1 = System.currentTimeMillis ();
            iterate (N, x);
            long t2 = System.currentTimeMillis ();
            long t = t2-t1;
            System.out.printf (" %5d", t);
        }
        System.out.println ();
    }

    public static void main (String argv[])
    {
        int x = 531;
        int y = -295;

        test (new Point (x, y));
        test (LongUtil.fromPoint (x, y));
        test (new LongPoint (x, y));
        test (new LongPoint3 (x, y));
        test (new LongPoint4 (x, y));
        test (new LongPoint5 (x, y));
        test (new LongPoint6 (x, y));
        test (new LongPoint7 (x, y));
    }
}

The choice of the constants for x and y is completely arbitrary and has no special meaning. This is what we get when running this test on Java 7 (1.7.0.40):

           Point:    472   426   455
  java.lang.Long:    427   390   365
       LongPoint:    430   430   429
      LongPoint3:    431   428   428
      LongPoint4:    400   398   396
      LongPoint5:    423   395   395
      LongPoint6:    426   426   425
      LongPoint7:   9823  9814  9814
       NullPoint:    335   335   334

The results look a bit suspicious. Can Point really be slower than LongPoint5 and LongPoint6? This suspicion grows as we look at the results on Java 8 (I’m using 1.8.0.45):

           Point:      5     3     0
  java.lang.Long:     93   175   141
       LongPoint:    458   457   456
      LongPoint3:    437   426   425
      LongPoint4:    457   395   396
      LongPoint5:    457   423   395
      LongPoint6:   1644  1646  1645
      LongPoint7:   1791  1790  1791
       NullPoint:    377   365   365

Now Point is very fast, faster even than NullPoint. Could it be because it was the first in the list? Let’s try swapping it with Long:

  java.lang.Long:      6     2     0
           Point:    187   160   143

We see that the extraordinary speed is indeed caused by the first position in the list.

What happens if we repeat the test twice (introduce a loop into main())? We get the following for Java 7:

           Point:    463   426   456
  java.lang.Long:    427   392   365
       LongPoint:    429   426   426
      LongPoint3:    428   426   426
      LongPoint4:    398   396   397
      LongPoint5:    396   395   396
      LongPoint6:    427   426   425
      LongPoint7:   9824  9816  9816
       NullPoint:    336   334   334
           Point:    425   426   425
  java.lang.Long:    364   364   365
       LongPoint:    426   426   426
      LongPoint3:    426   426   426
      LongPoint4:    396   397   396
      LongPoint5:    395   395   396
      LongPoint6:    425   426   425
      LongPoint7:   9816  9816  9816
       NullPoint:    334   334   334

The only thing that has changed is Point. It is now more consistent with other results but still surprisingly slower than LongPoint4 and LongPoint5.

Java 8 shows more variety:

           Point:      6     3     0
  java.lang.Long:    188   158   143
       LongPoint:    458   426   426
      LongPoint3:    442   426   425
      LongPoint4:    457   445   395
      LongPoint5:    457   412   395
      LongPoint6:   1644  1645  1644
      LongPoint7:   1873  1870  1871
       NullPoint:    426   414   365
           Point:    395   395   396
  java.lang.Long:    395   395   395
       LongPoint:    429   429   429
      LongPoint3:    396   396   397
      LongPoint4:    396   395   396
      LongPoint5:    395   396   395
      LongPoint6:   1627  1627  1628
      LongPoint7:   1786  1786  1784
       NullPoint:    364   365   364

The first two methods are incredibly fast, but they become normal the second time we run them. To find out why this happens, we’ll have to look at the generated code.

The generated code: Java 7

Java virtual machine has a command line option that causes it to print the names of the methods being compiled: -XX:+PrintCompilation. I won’t show the outputs here, for they are long. When using Java 7, relatively few compilations happen. All the hashCode() methods are compiled when used. The iterate() method is compiled twice in the beginning and never recompiled afterwards. Java 8 behaves differently. It compiles iterate() ten times – five times while running with Point class, twice with Long, twice for LongPoint, and once for LongPoint4.

To look at the code, we must install a disassembler plugin (hsdis-amd64.so, which can be downloaded from Kenai Project). We want to see the code for HashTest.iterate(), so the command line must look like this:

# java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation  -XX:PrintAssemblyOptions=intel
-XX:CompileCommand=print,*HashTime.iterate HashTime

Note the -XX:PrintAssemblyOptions=intel part: I prefer traditional Intel notation of assembly listing to that of the GNU assembler. Luckily, the disassembler has such option.

Java 7 compiles this method twice, but the results only differ in entry and exit sequences and in the registers used, so we’ll only look at one of them – the second one.

Java disassembly outputs are long and difficult to read; that’s why I’ll skip most of them. This one, however, is interesting because it gives unforeseen insight into the default implementation of hashCode() method; still, I’m only showing the main loop here. I’ve also replaced jump targets with labels:

VIRTUAL_CALL:
0x00007f592cc39257: mov    QWORD PTR [rsp+0x8],r13
0x00007f592cc3925c: mov    DWORD PTR [rsp],ebx
0x00007f592cc3925f: mov    rsi,r13
0x00007f592cc39262: xchg   ax,ax
0x00007f592cc39265: mov    rax,0xffffffffffffffff  ;   {oop(NULL)}
0x00007f592cc3926f: call   0x00007f592cc0fd60  ; OopMap{[8]=Oop off=84}
                                              ;*invokevirtual hashCode
                                              ; - HashTime::iterate@8 (line 8)
                                              ;   {virtual_call}
0x00007f592cc39274: mov    ebx,DWORD PTR [rsp]
0x00007f592cc39277: mov    r13,QWORD PTR [rsp+0x8]
0x00007f592cc3927c: nop    DWORD PTR [rax+0x0]  ;*invokevirtual hashCode
                                              ; - HashTime::iterate@8 (line 8)
MAIN_LOOP:
0x00007f592cc39280: inc    ebp                ; OopMap{r13=Oop off=98}
                                              ;*goto
                                              ; - HashTime::iterate@15 (line 7)
0x00007f592cc39282: test   DWORD PTR [rip+0x8b13d78],eax        # 0x00007f593574d000
                                              ;*iload_3
                                              ; - HashTime::iterate@2 (line 7)
                                              ;   {poll}
0x00007f592cc39288: cmp    ebp,ebx
0x00007f592cc3928a: jge    END_OF_LOOP        ;*if_icmpge
                                              ; - HashTime::iterate@4 (line 7)
L1:
0x00007f592cc3928c: mov    r11d,DWORD PTR [r13+0x8]  ; implicit exception: dispatches to 0x00007f592cc392d6
0x00007f592cc39290: mov    r10,QWORD PTR [r12+r11*8+0x208]
0x00007f592cc39298: mov    r11,0x40c802248    ;   {oop({method} 'hashCode' '()I' in 'java/lang/Object')}
0x00007f592cc392a2: cmp    r10,r11
0x00007f592cc392a5: jne    VIRTUAL_CALL
L2:
0x00007f592cc392a7: mov    r10,QWORD PTR [r13+0x0]
0x00007f592cc392ab: mov    r11,r10
0x00007f592cc392ae: and    r11,0x7
0x00007f592cc392b2: cmp    r11,0x1
0x00007f592cc392b6: jne    VIRTUAL_CALL
0x00007f592cc392b8: shr    r10,0x8
0x00007f592cc392bc: mov    r10d,r10d
0x00007f592cc392bf: test   r10d,0x7fffffff
L3:
0x00007f592cc392c6: jne    MAIN_LOOP
0x00007f592cc392c8: jmp    VIRTUAL_CALL
END_OF_LOOP:

Some comments:

The ebp is the loop variable (i), and ebx is the loop boundary (N)
The code between L1 and L2 checks if the method we are going to call is Object.hashCode. If it is, after some more checks the method is inlined and optimised, and the result of the inlining happens to be empty (the code at L3: jne MAIN_LOOP). Otherwise, a virtual call is performed (at VIRTUAL_CALL). This way of implementing virtual calls by speculative specialisation is very well described in Alexey Shipilev’s “The Black Magic of (Java) Method Dispatch”. In this specific case this optimisation happened to be harmful: it specialised the code for the case that never happened – none of our classes employs hashCode() method directly inherited from Object.
The code between L2 and L3 checks for some additional conditions on the first quadword of the object, which is called a mark word. In particular, it checks if the three least significant bits equal 001 and if the bits 8 to 38 contain non-zero value. In pseudo-code this can be represented like this:

static void iterate (int N, Object x)
{
    for (int i = 0; i < N; i++) {
        long markWord;
        if (x.virtualTable.hashCode == Object.hashCode
            && ((markWord = x.markWord) & 7) == 1
            && ((markWord >> 8) & 0x7FFFFFFF) != 0
           )
            ; // do nothing
        else
            vm.callVirtual (x, hashCode);
    }
}

These checks are not part of a general-purpose method dispatch; for instance, they are not present in the examples in Shipilev’s article. They are part of Object.hashCode() inlining. This method is declared as native, but obviously the HotSpot has a special treatment for it (methods that enjoy such a treatment are called intrinsics). Here is what is actually happening here. The last three bits (x.markWord & 7) contain the lock status of the object, the value 1 meaning “not locked”. The reason these bits are checked is that they define what’s stored in the higher bits of the mark word. When the object is locked, these bits keep the reference to the thread that holds the lock; otherwise, they keep the hash code of the object. This hash code is based on the object’s address; however, the objects can move in memory when the heap is compacted and must retain their hash codes. That’s why the hash code field is initialised to zero and filled in the first time hashCode() is called. This requires thread synchronisation, and is in general non-trivial – that’s why it is considered too complex to be inlined and is implemented as a runtime call. In our case this is combined with the speculative optimisation – the case when the object is locked or the hashcode isn’t initialised is treated the same way as the case when the method isn’t Object.hashCode().

This pseudo-code suggests an optimisation: the if statement can be moved out of the loop. As easy as it looks, it is not completely trivial to implement: the call to vm.callVirtual won’t change x.virtualTable.hashCode, but it can change the markWord. Besides, the compilers are usually good at moving calculations out of the loop, but not conditions. The typical way to move a condition out of the loop involves duplication of the loop body with subsequent specialisation, and not every compiler writer likes this idea.

The main case (where the method isn’t Object.hashCode) is covered at VIRTUAL_CALL by a call to the runtime (call 0x00007f592cc0fd60). I’m not sure why the virtual call is implemented by a call to the runtime, rather than by the indirect call CPU instruction. My guess is that it may have something to do with statistics update, but I may be wrong. In any case, the CPU instruction would be faster – not only because there are fewer instructions to execute, but also because the processors have an indirect call cache and are capable to make predictions of indirect call targets based on call site addresses.
Why is the Object.hashCode taken as the default case? Probably, the Java 7 VM does not have good type information at this point. It could take the slow route (a virtual call) always; alternatively, it could make an arbitrary choice of a virtual call target and generate a check and a specialised version. If the second approach is used, the natural target is that of the declared type of the object variable. It is bad for our case, but probably beneficial as a general strategy.
The markWord temporary variable in our pseudo-code is important; it is introduced to indicate the fact that the Hotspot-generated code reads the mark word once and atomically, to make sure that different fields in that word are consistent with each other.

The generated code: Java 8

Why are the Java 8 results so different? Let’s look at the code.

The code samples produced by numerous compiler runs during the execution of the test with Point class differ a lot. These samples make very interesting reading, but to save space here we’ll only look at the last one of those. So, after five attempts to compile the HashTime.iterate() method, the Hotspot compiler generated the following code (this is the entire code of the method):

0x00007fbff92b7e00: mov    DWORD PTR [rsp-0x14000],eax
0x00007fbff92b7e07: push   rbp
0x00007fbff92b7e08: sub    rsp,0x20           ;*synchronization entry
                                              ; - HashTime::iterate@-1 (line 7)

0x00007fbff92b7e0c: test   edx,edx
0x00007fbff92b7e0e: jle    RETURN             ;*if_icmpge
                                              ; - HashTime::iterate@4 (line 7)

0x00007fbff92b7e10: mov    r10d,DWORD PTR [rcx+0x8]  ; implicit exception: dispatches to 0x00007fbff92b7e29
0x00007fbff92b7e14: cmp    r10d,0xf800c105    ;   {metadata('Point')}
0x00007fbff92b7e1b: jne    BAIL_OUT           ;*return
                                              ; - HashTime::iterate@18 (line 9)

RETURN:
0x00007fbff92b7e1d: add    rsp,0x20
0x00007fbff92b7e21: pop    rbp
0x00007fbff92b7e22: test   DWORD PTR [rip+0x16c151d8],eax        # 0x00007fc00fecd000
                                              ;   {poll_return}
0x00007fbff92b7e28: ret    
BAIL_OUT:
0x00007fbff92b7e29: mov    esi,0xffffff86
0x00007fbff92b7e2e: mov    ebp,edx
0x00007fbff92b7e30: mov    QWORD PTR [rsp],rcx
0x00007fbff92b7e34: xchg   ax,ax
0x00007fbff92b7e37: call   0x00007fbff90051a0  ; OopMap{[0]=Oop off=92}
                                              ;*aload_2
                                              ; - HashTime::iterate@7 (line 8)
                                              ;   {runtime_call}
0x00007fbff92b7e3c: call   0x00007fc00f5d6320  ;*aload_2
                                              ; - HashTime::iterate@7 (line 8)
                                              ;   {runtime_call}

This is much smaller than the code in Java 7. The checks of the mark word bits are gone. And, most importantly, the loop is gone (the only thing that’s left is the upper boundary check). I believe that the pseudo-code now looks like this:

static void iterate (int N, Object x)
{
    if (N > 0) {
        if (x.getClass() == Point.class)
            ; // do nothing
        else
            vm.returnToInterpretingMode();
    }
}

Some comments:

The main case isn’t the Object.hashCode() anymore; it is the Point.hashCode(). Either the Java 8 VM collects statistics better, or it trusts its own statistics more than Java 7 did.
The code doesn’t check if the method is the same as in Point; it checks if the class is Point. This check is cheaper, but less general: it does not cater for the case where the class extends Point without re-defining the hashCode(). Probably, the statistic engine has established that there were no such classes.
It looks as if the optimisation I mentioned earlier has been implemented in Java 8 – the entire loop has disappeared, not just the method call. No wonder the entire loop has finished in zero milliseconds.
The optimiser made use of the fact that we call hashCode() and ignore the result. It removed the call completely. If we want to measure the performance of methods, we should either use their results or make sure the methods aren’t inlined.

The code produced while executing the test with java.lang.Long is different. It is much longer, so I won’t quote it here (I’ve placed it in a repository). It is also very complex – it’s enough to mention that it contains 49 branch targets, with a completely psychedelic structure of jumps between them. What makes it so big is that it checks for two possible classes of x (Point and Long) and unrolls the loop 16 times instead of removing it. There is no way to write it down exactly as a pseudo-code, the closest I can come with is something like this:

static void iterate (int N, Object x)
{
    int i;
    for (i = 0; i < N-16; i+=16) {
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) {          vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  1; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  2; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  3; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  4; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  5; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  6; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  7; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  8; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i +=  9; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 10; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 11; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 12; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 13; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 14; vm.returnToInterpretingMode (); }
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) { i += 15; vm.returnToInterpretingMode (); }
    }
    for (; i < N; i++) {
        if (x.getClass() != Point.class && x.getClass() != Long.getClass()) vm.returnToInterpretingMode ();
    }
}

This is why this code is slower than the one specialised for Point but still faster than the general-purpose code with a virtual call (140 ms instead of 400).

The code generated by subsequent runs of the compiler does not differ much from the code from Java 7, that’s why the results are similar.

The second test: using the result

One of the reasons why in some cases the entire method was optimised out is that the results of hashCode() call were not used. We can modify the code to use the results – for instance, to add them all together:

int sum (int N, Object x)
{
    int s = 0;
    for (int i = 0; i < N; i++)
        s += x.hashCode ();
    return s;
}

Obviously, the test () must be modified accordingly:

void test (Object x)
{
    System.out.printf ("%20s: ", x.getClass ().getName ());
    int N = 100000000;
    int s = 0;
        
    for (int iter = 0; iter < 3; iter ++) {
        long t1 = System.currentTimeMillis ();
        s += sum (N, x);
        long t2 = System.currentTimeMillis ();
        long t = t2-t1;
        System.out.printf (" %5d", t);
    }
    System.out.println ("; sum=" + s);
}

Note that we make sure we use the results of calls to sum() by adding them together and printing the result. This is a safety precaution against possible inlining of sum() or compiling it with the knowledge that its results aren’t ever used – some compilers are capable of this.

The new code is here.

This is what we get in Java 7:

           Point:    491   457   489; sum=1040261632
  java.lang.Long:    459   408   402; sum=-1786864128
       LongPoint:    491   489   488; sum=1040261632
      LongPoint3:    461   458   459; sum=-1308103168
      LongPoint4:    460   459   460; sum=1455681024
      LongPoint5:    432   430   431; sum=26862080
      LongPoint6:    468   465   465; sum=-2103278592
      LongPoint7:   9911  9905  9911; sum=1732665600
       NullPoint:    374   367   367; sum=0

And this is in Java 8:

           Point:     12     7     3; sum=1040261632
  java.lang.Long:    213   161   153; sum=-1786864128
       LongPoint:    462   458   457; sum=1040261632
      LongPoint3:    518   510   457; sum=-1308103168
      LongPoint4:    488   429   458; sum=1455681024
      LongPoint5:    441   428   427; sum=26862080
      LongPoint6:   1669  1650  1649; sum=-2103278592
      LongPoint7:   1892  1892  1893; sum=1732665600
       NullPoint:    405   374   374; sum=0

The results do not differ much from the previous ones. The unusually slow Point in Java 7 and unusually fast Point and Long in Java 8 are still there. The only thing that has changed is that LongPoint4 became slower.

The Java 7 code of sum isn’t much different from that of iterate. The identity hash code extracted from the mark word is now added to s (I’ll only show the pseudo-code):

static int sum (int N, Object x)
{
    int s = 0;
    int hash;
    long markword;
    for (int i = 0; i < N; i++) {
        if (x.virtualTable.hashCode == Object.hashCode
            && ((markWord = x.markWord) & 7) == 1
            && (hash = (markWord >> 8) & 0x7FFFFFFF) != 0
           )
            s += hash;
        else
            s += vm.callVirtual (x, x.hashCode);
    }
    return s;
}

The last Java 8 code produced while running with Point class is quite long and esoteric (it can be seen here). I’ll show my reconstruction of its pseudo-code:

static int sum (int N, Object x)
{
    int s = 0;
    int i = 0;
    int hash;
    
    if (N > i) {
        for (int limit = i + 1;  i < limit; i ++) {
            if (x.getClass() != Point.class) {
                vm.returnToInterpretingMode();
            }
            Point p = (Point) x;
            hash = p.y * 5 + p.x * 3;
            s += hash;
        }
    
        int N1 = N - 15;
        if (N1 > N) N1 = Integer.MIN_VALUE;
        
        if (i < N1) {
            int hash16 = hash * 16;
            do {
               s += hash16;
               i += 16;
            } while (i < N1);
        }
        if (i < N) {
            do {
                s += hash;
                ++ i;
            } while (i < N);
        }
    }
    return s;
}

We see that the optimisation mentioned earlier (moving an invariant condition out of a loop) has in fact happened. The loop has been specialised for the case where x is of type Point. All other cases cause bailout to interpreting mode in the loop preheader (a piece of code that is executed exactly once if the loop is executed at least once). The loop has been unrolled 16 times, It is interesting that the compiler figured out that the loop is adding an invariant value hash to s and replaced 16 such additions with one addition of hash * 16, but didn’t manage to replace the entire loop with multiplication by hash * N. The initialisation step (the first loop) also looks funny. It didn’t have to be compiled as a loop at all! However, as we saw, the method executes fast, so we can ignore these abnormalities.

Finally, we’ll look at the code generated by Java 8 when the test is run with Long class – as previously, the compiler allows for the Point class as well. The code is long, so I’ll only show my reconstruction (the full code is here). The inline_hash method is my own creation; in the HotSpot-generated code this method is inlined each time it is used – yes, the incoming value x is tested for being of tyoe Point or type Long six times, each time with the inlined code of the appropriate hashCode() method.

static int inline_hash (Object x)
{
    if (x.getClass() == Point.class) {
       return ((Point) x).y * 5 + ((Point) x).x * 3;
    }
    if (x.getClass() == Long.class) {
       long v = ((Long) x).value;
       return  (int) (v ^ (v >>> 32));
    }
    vm.returnToInterpretingMode();
}    
    
static int sum (int N, Object x)
{
    int s = 0;
    
    if (N > 0) {
        s = inline_hash (x);

        int N1 = N - 3;
        if (N1 > N) N1 = Integer.MIN_VALUE;
        
        int i = 1;
        if (i < N1) {
            do {
               s += inline_hash (x);
               s += inline_hash (x);
               s += inline_hash (x);
               s += inline_hash (x);
               i += 4;
            } while (i < N1);
        }
        if (i < N) {
            do {
                s += inline_hash (x);
                ++ i;
            } while (i < N);
        }
    }
    return s;
}

While this code looks horrible, it doesn’t perform too badly. Even a code like this is better than an indirect call or a switch to interpreting mode.

After this point, Java 8 produces the code similar to that of Java 7 – speculative inlining of Object.hashCode().

The workaround

As we see, the primary reason that the performance varies so much is that HotSpot collects type statistics and employs speculative specialisation, which is performed especially aggressively in Java 8. The virtual methods are resolved, their bodies inlined and moved out of loops – all these optimisations are great in general, but harmful for our benchmarking. There is no point in comparing performance of a method in classes C1 and C2 when the C1 version is inlined and optimised out while C2 version is not.

One possible solution is not to combine different tests in the same process run. If each VM run tests exactly one class, all the classes will be tested in identical environment. Unfortunately, this approach is only applicable to full programs (such as our Life simulation), but not to simplified tests. If the method is inlined and moved out of the loop in sum(), the nature of that method becomes irrelevant. All of them will produce the same measurable result (zero time).

A better approach is to wait until the code stabilises at the version that does not depend on the specific class and treats all the classes equally badly – performs virtual call for all of them. As we’ve seen, this happens in both versions of Java VM, very quickly in Java 7, more slowly in Java 8. After this stabilisation the measurements will produce results that we can compare to each other directly.

One way to cause this stabilisation is to repeat the entire test several times. Another way is to “pre-heat” the compiler by running the test on the mixture of classes. In both cases we hope that the statistics engine saturates, causing the compiler to avoid making any assumptions about the classes. I tried the second approach using the following code in main() of the new class HashTime1:

static int sumArray (ArrayList<Object> a)
{
    int sum = 0;
    for (int i = 0; i < a.size (); i++) {
        sum += sum (1, a.get(i));
    }
    return sum;
}
 
public static void main (String argv[])
{
    int x = 531;
    int y = -295;
    ArrayList<Object> a = new ArrayList<Object> ();
    int sum = 0;

    for (int i = 0; i < 100000; i++) {
        a.add (new Point (x, y));
        a.add (LongUtil.fromPoint (x, y));
        a.add (new LongPoint (x, y));
        a.add (new LongPoint3 (x, y));
        a.add (new LongPoint4 (x, y));
        a.add (new LongPoint5 (x, y));
        a.add (new LongPoint6 (x, y));
        a.add (new LongPoint7 (x, y));
        a.add (new NullPoint (x, y));
    }

    for (int j = 0; j < 100; j++) {
        sum += sumArray (a);
    }
    System.out.println (" Sum = " + sum);

    for (int i = 0; i < 2; i++) {
        test (new Point (x, y));
        test (LongUtil.fromPoint (x, y));
        test (new LongPoint (x, y));
        test (new LongPoint3 (x, y));
        test (new LongPoint4 (x, y));
        test (new LongPoint5 (x, y));
        test (new LongPoint6 (x, y));
        test (new LongPoint7 (x, y));
        test (new NullPoint (x, y));
    }
}

This is the result for Java 7:

            Point:    548   547   547; sum = 1040261632
   java.lang.Long:    547   547   547; sum = -1786864128
        LongPoint:    548   547   547; sum = 1040261632
       LongPoint3:    548   547   547; sum = -1308103168
       LongPoint4:    547   547   547; sum = 1455681024
       LongPoint5:    547   547   548; sum = 26862080
       LongPoint6:    577   577   578; sum = -2103278592
       LongPoint7:  10262 10263 10266; sum = 1732665600
        NullPoint:    652   517   518; sum = 0
            Point:    547   547   548; sum = 1040261632
   java.lang.Long:    551   575   564; sum = -1786864128
        LongPoint:    564   564   550; sum = 1040261632
       LongPoint3:    548   549   575; sum = -1308103168
       LongPoint4:    547   548   548; sum = 1455681024
       LongPoint5:    548   547   548; sum = 26862080
       LongPoint6:    591   578   578; sum = -2103278592
       LongPoint7:  10403 10443 10473; sum = 1732665600
        NullPoint:    517   518   517; sum = 0

And this is for Java 8:

            Point:    489   488   488; sum = 1040261632
   java.lang.Long:    488   488   487; sum = -1786864128
        LongPoint:    488   488   488; sum = 1040261632
       LongPoint3:    488   488   488; sum = -1308103168
       LongPoint4:    488   487   488; sum = 1455681024
       LongPoint5:    488   488   488; sum = 26862080
       LongPoint6:   1660  1661  1661; sum = -2103278592
       LongPoint7:   1888  1889  1888; sum = 1732665600
        NullPoint:    519   487   458; sum = 0
            Point:    488   487   487; sum = 1040261632
   java.lang.Long:    488   488   488; sum = -1786864128
        LongPoint:    488   488   488; sum = 1040261632
       LongPoint3:    488   488   488; sum = -1308103168
       LongPoint4:    488   489   488; sum = 1455681024
       LongPoint5:    488   488   488; sum = 26862080
       LongPoint6:   1664  1663  1663; sum = -2103278592
       LongPoint7:   1891  1891  1892; sum = 1732665600
        NullPoint:    457   458   458; sum = 0

Observations:

Suddenly both Java 7 and Java 8 demonstrate much more stable behaviour. No unusual super-fast or super-slow executions; the results stay roughly the same if the test is repeated twice.
The effect of unusually slow first call to Point in Java 7 has disappeared.
One must keep in mind, however, that the trick we used (running the test on a mixture of classes before the main test) isn’t universal; it works on the present versions of Java VM, but may stop working later, if VM becomes more sensitive to changing patterns of virtual calls.
With the exception of LongPoint6, Java 8 is faster than Java 7 (this is kind of expected; why make a slower version anyway? The exception, however, is intriguing, we’ll look at it later).
The cost of virtual method calls can be seen as the time of the NullPoint (458 ms on Java 8). It is much higher than the costs of the methods themselves (30 – 31 ms on Java 8).
The times of the first six methods are virtually the same. This may seem strange, since some of them are more complex than the others. However, the difference isn’t big to begin with (multiplication instructions are quite fast on modern processors), and can be hidden by instruction scheduling and re-ordering (for instance, multiplication can be performed in parallel with the return sequence from a virtual call). In addition, the resolution of our timer is quite low (one millisecond), so we need much higher execution count to measure the differences properly.
The times reported are in milliseconds for 100M iterations. Dividing the result by 100 gives us nanoseconds per call, and multiplying it afterwards by 2.6 gives us CPU cycles (our processor runs at 2.6 GHz). The results on Java 8 are 11.9 for NullPoint, 12.7 for all simple functions, 43 for LongPoint6 and 49 cycles for LongPoint7. The Java 7 result for LongPoint7 is 267 CPU cycles, which is very slow.
Some features of the code (manipulations with the mark word) are caused by the fact that the method we are testing is hashCode(). Had we created our own method without default intrinsic implementation, the code samples would have been much simpler.

Conclusions

We’ve learned how to produce and read outputs of the Hotspot compiler. This is very useful knowledge, and these outputs make very interesting reading
We’ve found that Java 7 and Java 8 employ quite different approach to code generation. This means that future versions may change the approach again, so all our findings may end up very short-lived. We must be prepared to re-do all the tests. Such is the nature of program optimisation.
We’ve come with the procedure to get reliable performance measurements (pre-heating the compiler). We must be prepared for this procedure to stop working in future versions of Java VM.
Now we are ready to start measuring and optimising individual hash functions.
Two primary questions so far are why LongPoint6 (employing a division operation) is so fast on Java 7 and so slow on Java 8, and why it is the other way around with LongPoint7 (employing CRC32).
Also interesting is why the performance difference of LongPoint6 between Java 7 and Java 8 does not show on the full test (Life implementation). where the time was 1,979 ms on Java 7 and 1,553 ms on Java 8. The LongPoint7 behaves differently: there the full test time did drop (from 10,115 ms to 3,206 ms).
The previous observation shows that we must be careful with microbenchmarks, and always remember that our real goal is the improvement of the real program performance.
We’ve seen that overheads of the infrastructure (such as those of virtual calls) may greatly exceed the costs of the methods themselves. This means that often the costs of the methods are less important than the quality of their results. This also means that those infrastructural overheads must be reduced at all costs. Whenever you can replace a virtual call with a direct one (or help the compiler with it) – do it.
Microbenchmarking is the case where measurement is more important than the performance itself. What we’ve done is actually jumping through hoops in order to make the program slower – disable some optimisations. Obviously, for real programs we want the opposite – optimise them to the extreme. That’s why the trick used today mustn’t be used for the full test. On the contrary, we must run each test alone, providing as much help as possible to Java’s type inference engine.
We can now re-visit some earlier experiments in optimising Java code. The knowledge acquired today may help improve stability of the results and explain them. In the earlier articles (such as here) we made some guesses about the causes of the performance differences we observed. Now we can verify those guesses.

Coming soon

In the next articles, we’ll try using this measurement technique to optimise our hash functions to the limit, starting with the division-based one, and then check how this optimisation affects the overall performance of the original program.

Experiments in program optimisation