In “Game of Life, hash tables and hash codes” we implemented Conway’s Game of Life using Java’s HashMap and discovered that some hash functions make the program run faster than others. We introduced several hash functions and measured the distribution of their hash codes. In the next article (“Hash functions: disassembling and measuring performance”) we learned how to achieve stable performance of microbenchmarks (those that measure the execution speed of individual hash functions) and measured the speed of all those functions. We found several abnormalities, one of which was a big difference in the performance of the division-based hash function between Java 7 and Java 8. Let’s look at this case.
The function in question
Let’s repeat the source code of LongPoint6.hashCode():
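The original listing is not reproduced in this text; here is a minimal sketch of what it looks like (the field name v, holding the packed 64-bit point value, is an assumption, but the operation is the one the article describes: the remainder of a division by a large prime, truncated to int):

```java
// Hypothetical sketch of LongPoint6; the field name `v` is an assumption.
public class LongPoint6 {
    final long v;                       // the packed (x, y) point

    LongPoint6(long v) { this.v = v; }

    @Override
    public int hashCode() {
        return (int) (v % 946840871);   // remainder of division by a big prime
    }
}
```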
The constant 946840871 was a completely arbitrary choice of a divisor. It is a prime number, and it is relatively big (close to one billion), bigger than any realistic hash table size. There is nothing special about this number; some other number may work better, and it is also possible that some other number will allow faster calculations. We don’t use any specific properties of this number here; all the following arguments are equally applicable to any other divisor.
The test involving this method ran for 578 ms on Java 7 and 1661 ms on Java 8. The only way to find out why is to look at the generated code (see the previous article for instructions on how to do it). Let’s start with Java 7:
There aren’t any division instructions in the code. The pseudocode looks like this:
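The listing itself is not reproduced in this text; schematically (the reciprocal constant M is the JIT’s choice and is not shown here), the computation has this shape:

```
q    = (v ** M) >> 91          // ** : 64×64 -> 128-bit signed multiplication
hash = (int) (v - q * 946840871)
// plus a couple of sign-correction instructions
```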
I call this pseudocode rather than just code because it contains an operation that is not accessible in Java: a 128-bit multiplication, which takes two 64-bit numbers and produces a 128-bit result (I wrote it as **). Such an operation exists in the instruction set – both MUL and IMUL have 128-bit versions. The result is placed into rdx and rax; since we are using rdx, the total shift is 0x1B + 64 = 91.
What we see is the result of a well-known optimisation: replacing division by a constant divisor with a multiplication and some shifts. It is well described in Agner Fog, “Optimizing subroutines in assembly language: An optimization guide for x86 platforms”. HotSpot uses a slightly different algorithm (it uses two bits less and compensates for signed values). If the code were written for unsigned numbers, it could save three instructions here.
The Java 8 code looks like this:
This code does not employ any optimisations. It does not even compile the remainder operation into a DIV or IDIV instruction: it calls a runtime routine instead. It even goes as far as checking that the freshly loaded constant isn’t zero. In short, this is really awful code; no wonder it runs slowly. Perhaps the HotSpot developers considered division by a constant a rare and unimportant case and removed the corresponding optimisation to keep their code short.
Optimising division: unsigned case
The code Java 7 generates is perfect. The only way it could be improved is by using unsigned division instead of a signed one, but this can’t be done in Java.
Is there any way to improve the Java 8 case? If our numbers were 32-bit, there would be an obvious way – to write code similar to the pseudocode above; instead of a 128-bit multiplication, we would need a 64-bit one, and that is a normal long operation. Unfortunately, we can’t do it that easily in the 64-bit case, because, as I said, there is no ** operation in Java.
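To illustrate the 32-bit case just mentioned, here is a sketch. The shift of 61 and the reciprocal below are my own choices, made to cover non-negative int inputs; they are not the constants a JIT would pick:

```java
// Division of a non-negative 32-bit value by 946840871 via a 64-bit
// multiplication by a scaled reciprocal (a sketch; constants are mine).
public class Div32Demo {
    static final long D = 946840871L;
    // Reciprocal of D scaled by 2^61: floor (x * M / 2^61) equals x / D
    // for all non-negative int x (the usual multiply-by-reciprocal bound).
    static final long M = (1L << 61) / D + 1;

    static int remainder(int x) {        // x must be non-negative
        long q = (x * M) >>> 61;         // the quotient x / D
        return (int) (x - q * D);        // the remainder x % D
    }
}
```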
This doesn’t mean we must stop here: after all, we have a general-purpose computer, and there is nothing mystical about 128-bit multiplication. It can be programmed in Java using the available arithmetic.
Let’s first simplify the problem by considering our numbers unsigned. Let x and y be unsigned 64-bit values, and A, B, C and D be unsigned 32-bit values, where

x = 2^{32}A + B
y = 2^{32}C + D

Then

xy = (2^{32}A + B) (2^{32}C + D) = 2^{64}AC + 2^{32}(AD + BC) + BD

or, graphically,
+---------------------+---------------------+
| high 64 bits of xy  | low 64 bits of xy   |
+----------+----------+----------+----------+
| hi (A*C) | lo (A*C) |          |          |
|          | hi (A*D) | lo (A*D) |          |
|          | hi (B*C) | lo (B*C) |          |
|          |          | hi (B*D) | lo (B*D) |
+----------+----------+----------+----------+
This is a 128-bit unsigned number, which we will represent as two longs, one for the high and one for the low part of the result. Both will be interpreted as unsigned 64-bit values. Each of the terms AC, AD, BC, BD is a 64-bit number.
The low part of the result can be calculated by the obvious code:
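The listing is omitted in this text; here is a sketch, assuming the helper names used in the article:

```java
// Low 64 bits of a 128-bit unsigned product (a sketch; names as in the article).
public class Mult128 {
    // Unsigned halves: no sign extension, the result stays in the low 32 bits
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    static long mult_unsigned_lopart(long x, long y) {
        long A = uhi(x), B = ulo(x);
        long C = uhi(y), D = ulo(y);
        return B * D + ((A * D + B * C) << 32);
    }
}
```

The low 64 bits of an unsigned product coincide with Java’s ordinary wrapping x * y, which makes this function easy to check.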
Note the uhi and ulo functions (“unsigned hi” and “unsigned lo”) and their difference from the hi and lo that we used to extract the integer parts of a value stored as a long:
Those functions return int, which, if directly converted to long, will be sign-extended, and that would corrupt the results.
Here is a simple example. Let x be 0x0000000000000001 and y be 0xFFFFFFFFFFFFFFFF. Obviously, the 128-bit result should be 0x0000000000000000FFFFFFFFFFFFFFFF, and mult_unsigned_lopart() must return 0xFFFFFFFFFFFFFFFF (which will be printed as −1). However, with the sign-extended versions of hi and lo we would have

B = 1
C = −1
D = −1
B*D + ((A*D + B*C) << 32)
    = −1 + ((0 + (−1)) << 32)
    = 0xFFFFFFFFFFFFFFFF + 0xFFFFFFFF00000000
    = 0xFFFFFFFEFFFFFFFF

The code using uhi and ulo works correctly:

B = 1
C = 0xFFFFFFFF
D = 0xFFFFFFFF
B*D + ((A*D + B*C) << 32)
    = 0xFFFFFFFF + ((0 + 0xFFFFFFFF) << 32)
    = 0xFFFFFFFF + 0xFFFFFFFF00000000
    = 0xFFFFFFFFFFFFFFFF
The mult_unsigned_lopart was shown here for completeness; we don’t need it for our remainder problem. The pseudocode of hashCode() above (the one using the ** operation) shifts the multiplication result right by 91, which totally ignores the low part of the product.
It may seem that the following code will calculate the high part:
This, however, does not work – for two reasons. One is that A*D + B*C can occupy 65 bits (a carry bit may be produced during the addition), and the other is that, although the B*D term is not included directly in the result, it can produce a carry when its high part is added to the low part of A*D + B*C. We have to perform the addition in 32-bit blocks and add all the carry bits manually, according to the picture above:
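A sketch of this carry-aware version (the decomposition into 32-bit blocks follows the picture; helper names are the ones used above):

```java
// High 64 bits of a 128-bit unsigned product (a sketch).
public class Mult128Hi {
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    static long mult_unsigned_hipart(long x, long y) {
        long A = uhi(x), B = ulo(x);
        long C = uhi(y), D = ulo(y);
        long AC = A * C, AD = A * D, BC = B * C, BD = B * D;
        // The middle 32-bit column is summed separately (at most 34 bits,
        // so it cannot overflow); its top bits are the carry into the high part.
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return AC + uhi(AD) + uhi(BC) + uhi(mid);
    }
}
```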
Now it is very easy to program the hash function:
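As a sketch only: the reciprocal constant and the shift below are my own derivation (the reciprocal is scaled by 2^93, i.e. a shift of 29 after taking the high word, rather than the JIT’s 91), valid for non-negative arguments; the full unsigned-range version needs a slightly larger apparatus. The 128-bit helper is repeated so that the fragment compiles on its own:

```java
import java.math.BigInteger;

// Hypothetical sketch of the unsigned remainder-based hashCode().
public class LongPoint60Sketch {
    static final long D = 946840871L;
    // Reciprocal of D scaled by 2^93; its true value does not fit a signed
    // long, but its 64-bit pattern is all the unsigned multiplication needs.
    static final long M = BigInteger.ONE.shiftLeft(93)
            .divide(BigInteger.valueOf(D)).add(BigInteger.ONE).longValue();

    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    static long mult_unsigned_hipart(long x, long y) {
        long A = uhi(x), B = ulo(x), C = uhi(y), D2 = ulo(y);
        long AC = A * C, AD = A * D2, BC = B * C, BD = B * D2;
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return AC + uhi(AD) + uhi(BC) + uhi(mid);
    }

    final long v;
    LongPoint60Sketch(long v) { this.v = v; }

    @Override
    public int hashCode() {                          // valid for non-negative v
        long q = mult_unsigned_hipart(v, M) >>> 29;  // quotient v / D
        return (int) (v - q * D);                    // remainder v % D
    }
}
```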
It seems very unlikely that this monster will work faster than the original code. It contains five multiplications, one shift and six additions. This makes it a very interesting test: how fast can it possibly run?
Signed or unsigned?
We have just calculated an unsigned version of the remainder-based hash function. However, the code of LongPoint6.hashCode() uses regular Java arithmetic, which is signed. How important is that? What impact does it have on the quality of hashing?
In Java, dividing a negative number by a positive one produces a negative remainder:

−25 % 3 = −1

Strictly speaking, this isn’t what is usually called “a remainder” in mathematics. A proper mathematical remainder is never negative (for a positive divisor d it satisfies 0 ≤ r < d) and should be 2 for −25 % 3.
To check the quality of unsigned hashing, we must emulate unsigned calculations in Java. A positive signed 64-bit number keeps its value when interpreted as unsigned. A negative number becomes a big positive one, with 2^{64} added to its value, so we must add 2^{64} mod 946840871 = 518863773 to the remainder. Since the remainder was zero or negative before, the sum won’t exceed the divisor, but it may stay negative, in which case we need to add the divisor again:
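A sketch of this adjustment (the method name is mine; the constants are the ones just derived):

```java
// Emulating an unsigned 64-bit remainder with Java's signed arithmetic.
public class UnsignedRemainder {
    static final int D = 946840871;
    static final int POW64_MOD_D = 518863773;   // 2^64 mod 946840871

    static int remainder(long v) {
        int r = (int) (v % D);          // in (-D, 0] when v is negative
        if (v < 0) {
            r += POW64_MOD_D;           // account for the extra 2^64
            if (r < 0) r += D;          // may still be negative
        }
        return r;
    }
}
```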
Here are the slot counts we got with signed division (see here):
Field size: 1034; slots used: 982; avg= 1.05
Count size: 3938; slots used: 3236; avg= 1.22
And here are the results for unsigned division:
Field size: 1034; slots used: 968; avg= 1.07
Count size: 3938; slots used: 3133; avg= 1.26
The results haven’t changed much for field but have changed for count: they became worse. This does not mean they became very bad. Previously, the result was equal to E + 5.28σ, where E is the expected value 3126.69 and σ is the standard deviation 20.68. Now the result is E + 0.3σ, which still falls well within good statistical limits and performs better than some other hash functions. It just does not perform as well as the signed division, which, as I remarked back then, demonstrated unusually good behaviour.
This means that if we manage to write an incredibly fast version of the hash function based on unsigned division, we have a chance to improve the overall speed; otherwise, the signed version is a better choice.
Another approach is to change our OFFSET, which is now 0x80000000, to some positive value. This will leave fewer bits for the x and y parts, but the long values will always be positive, and the unsigned version will produce the same result as the signed one. Some of these values perform better than others: for instance, 0x40000000 gives 3118 (E − 0.42σ) as the count size, while 0x8000000 gives 3234 (E + 5.18σ). We won’t do this, to keep the story short.
Optimising division: signed case
We’ve seen that the signed version has some potential advantages. Can we modify the code to work in the signed case?
The formulae we used above:

x = 2^{32}A + B
y = 2^{32}C + D
xy = (2^{32}A + B) (2^{32}C + D)
   = 2^{64}AC + 2^{32}(AD + BC) + BD

are still applicable if x and y are signed 64-bit numbers. In this case A and C are signed 32-bit values, while B and D are unsigned. The picture stays the same, except for an important change: AD and BC can now be negative, so these values must be sign-extended before the addition:
+-----------------------+-----------------------+
|  high 64 bits of xy   |   low 64 bits of xy   |
+-----------+-----------+-----------+-----------+
| hi (A*C)  | lo (A*C)  |           |           |
| sign(A*D) | hi (A*D)  | lo (A*D)  |           |
| sign(B*C) | hi (B*C)  | lo (B*C)  |           |
|           |           | hi (B*D)  | lo (B*D)  |
+-----------+-----------+-----------+-----------+
Here sign(v) is a 32-bit word filled with the sign bit of v. Note that BD must not be sign-extended, since it is a positive value. These sign manipulations can be implemented very easily by using hi and lo instead of uhi and ulo where appropriate:
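A sketch, with the signed halves extracted by an arithmetic shift (here hi returns a long rather than the article’s int, to keep the fragment short):

```java
// High 64 bits of a 128-bit signed product (a sketch).
public class Mult128Signed {
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }
    static long hi(long x)  { return x >> 32; }     // sign-extending

    static long mult_signed_hipart(long x, long y) {
        long A = hi(x), B = ulo(x);                 // A signed, B unsigned
        long C = hi(y), D = ulo(y);                 // C signed, D unsigned
        long AC = A * C, AD = A * D, BC = B * C, BD = B * D;
        long mid = ulo(AD) + ulo(BC) + uhi(BD);     // carries from the middle column
        // hi() sign-extends the AD and BC terms, as the picture requires
        return AC + hi(AD) + hi(BC) + uhi(mid);
    }
}
```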
Finally, the signed version of the hash function:
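Again only a sketch, and only one possible construction: I use the standard signed multiply-shift with a final sign correction (the reciprocal is scaled by 2^93 and computed at class-initialisation time; the real JIT constants differ). The 128-bit helper is repeated so that the fragment compiles on its own:

```java
import java.math.BigInteger;

// Hypothetical sketch of the signed remainder-based hashCode().
public class LongPoint61Sketch {
    static final long DIV = 946840871L;
    // Reciprocal of DIV scaled by 2^93. Its true value needs 65 bits; MAGIC
    // holds the low 64, and the missing 2^64 is restored by adding v after
    // the high multiplication (the classic trick for overflowing magic numbers).
    static final long MAGIC = BigInteger.ONE.shiftLeft(93)
            .divide(BigInteger.valueOf(DIV)).add(BigInteger.ONE).longValue();

    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }
    static long hi(long x)  { return x >> 32; }

    static long mult_signed_hipart(long x, long y) {
        long A = hi(x), B = ulo(x), C = hi(y), D = ulo(y);
        long AC = A * C, AD = A * D, BC = B * C, BD = B * D;
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return AC + hi(AD) + hi(BC) + uhi(mid);
    }

    final long v;
    LongPoint61Sketch(long v) { this.v = v; }

    @Override
    public int hashCode() {
        long q = mult_signed_hipart(v, MAGIC) + v;  // high word of the 65-bit MAGIC times v
        q = (q >> 29) + (v >>> 63);                 // shift, then round toward zero for negative v
        return (int) (v - q * DIV);                 // Java-style signed remainder
    }
}
```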
Performance measurements
Let’s add the new unsigned and signed versions to our hash function test suite (they’ll be called LongPoint60 and LongPoint61 to indicate that they are modified versions of LongPoint6). The full new test suite is here. The class to run is HashTime1.
Here are the results (in milliseconds for 100M iterations):
Class name   Comment                        Time, Java 7   Time, Java 8
-----------  -----------------------------  -------------  -------------
Point        Multiply by 3, 5                        548            485
Long         Long default                            547            486
LongPoint    Multiply by 3, 5                        547            486
LongPoint3   Multiply by 11, 17                      547            485
LongPoint4   Multiply by two big primes              547            486
LongPoint5   Multiply by one big prime               547            486
LongPoint6   Modulo big prime                        578           1655
LongPoint60  Modulo optimised, unsigned              640            610
LongPoint61  Modulo optimised, signed                729            727
LongPoint7   java.util.zip.CRC32                   10109           1882
NullPoint    Null function (does nothing)            532            455
The results are surprising. Our optimisation did work! The code above, rather long and complex as it is, still runs faster than the modulo operation as implemented by Java 8, and quite a bit faster at that: if we subtract the calling overhead (measured as the time of NullPoint), we get:
Class name   Comment                     Time, Java 7   Time, Java 8
-----------  --------------------------  -------------  -------------
LongPoint6   Modulo big prime                      46           1200
LongPoint60  Modulo optimised, unsigned           108            155
LongPoint61  Modulo optimised, signed             197            272
Some observations:

- On Java 8, the optimised code, with its five multiplications, is still 7.4 times faster than the original code in the unsigned case and 4.4 times faster in the signed case.
- The signed case is quite a bit slower than the unsigned case (I’m not sure why).
- Java 8 is quite a bit slower than Java 7 in all cases; it just happens that the optimised versions in Java 8 are faster than the original one. Nothing beats the unoptimised Java 7 version.
- In CPU cycles, the times are:
Class name   Comment                     Cycles, Java 7   Cycles, Java 8
-----------  --------------------------  ---------------  ---------------
LongPoint6   Modulo big prime                       1.2             31.2
LongPoint60  Modulo optimised, unsigned             2.8              4.0
LongPoint61  Modulo optimised, signed               5.1              7.0
Even though the Java 8 figures are bigger than those of Java 7, they are still surprisingly small. The processor must be very advanced to calculate our unsigned version in 4 cycles.
The disassembly
Just out of curiosity, let’s look at the code generated for our functions. Each of them is compiled twice by Java 8; we’ll look at the second output. This is the code for the unsigned version (LongPoint60):
This is the code for the signed version (LongPoint61):
Both of these pieces look like very good code to me. It is a mystery why the second one runs so much slower than the first one. It is still amazing that both run much faster than a runtime call, which performs a DIV instruction.
A full test
How do the optimised versions perform in the full test (a Life application)? Let’s try them. Here are the times (in milliseconds for 10,000 iterations); the times for the other versions are taken from the original article:
Class name   Comment                     Time, Java 7   Time, Java 8
-----------  --------------------------  -------------  -------------
Point        Multiply by 3, 5                    2743           4961
Long         Long default                        4273           6755
LongPoint    Multiply by 3, 5                    2602           4836
LongPoint3   Multiply by 11, 17                  2074           2028
LongPoint4   Multiply by two big primes          2000           1585
LongPoint5   Multiply by one big prime           1979           1553
LongPoint6   Modulo big prime                    1890           1550
LongPoint60  Modulo optimised, unsigned          2124           1608
LongPoint61  Modulo optimised, signed            2198           1689
LongPoint7   java.util.zip.CRC32                10115           3206
The new versions do not perform badly compared to other hash functions, but they are not nearly as fast as the microbenchmarks suggest. In particular, they don’t outperform the original division-based version. Why this happened is a mystery; perhaps Java 8 is capable of some magic we don’t know about yet.
Conclusions

- A division operation is very slow compared to other arithmetic; it should be avoided in performance-critical code unless there is evidence that the Java VM can optimise it.
- On the contrary, a multiplication operation is relatively fast; it still makes sense to replace multiplication by small constants with shifts, additions or LEA instructions, but compilers are usually good at that.
- Replacing division by a constant with multiplication by a reciprocal is a very useful optimisation; it’s a pity it has been removed from the Java VM in version 8.
- This optimisation requires a 128-bit multiplication, which isn’t available in Java directly. It can be implemented from scratch, albeit in a somewhat ugly way.
- The code we produced was very long. It contained five multiplications and several shifts and additions; still, it was faster than the original code using division (on Java 8) – although only in a microbenchmark test.
- The optimisation didn’t increase the speed of the full test; simply speaking, it failed.
Coming soon
Is there any way to improve division even more? Can we outperform the original division-based version on the full test this way, and if not, why? We’ll see soon.