In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life
using Java’s HashMap
and discovered that some hash functions make the program run faster than others.
We introduced several hash functions and measured their distribution of hash codes. In the next article
(“Hash functions: disassembling and measuring performance”) we learned how to achieve stable performance of
micro-benchmarks (those that measure execution speed of individual hash functions), and measured the speed of all those
functions. We found several anomalies, one of which was a big difference in the performance of the division-based
hash function between Java 7 and Java 8. Let's look at this case.
The function in question
Let's repeat the source code of LongPoint6.hashCode():
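The original listing is not reproduced here; a minimal sketch of what the method might look like follows (the field name v and the exact packing are assumptions; the essential part is the remainder by the big prime):

```java
// Hypothetical sketch of LongPoint6.hashCode(); the field name v and the
// packing of the point into it are assumptions - the essence is the
// remainder by the arbitrary big prime 946840871.
public class LongPoint6Sketch {
    final long v;                        // the packed (x, y) position

    LongPoint6Sketch(long v) { this.v = v; }

    @Override
    public int hashCode() {
        // Java's % is a signed remainder: it may return a negative value
        return (int) (v % 946840871);
    }
}
```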
The constant 946840871 was a completely arbitrary choice of divisor. It is a prime number, and it is relatively big (close to one billion), bigger than any realistic hash table size. There is nothing special about this number; some other number may work better. It is also possible that some other number would allow faster calculations. We don't use any specific properties of this number here; all the following arguments are equally applicable to any number.
The test involving this method ran for 578 ms on Java 7 and 1661 ms on Java 8. The only way to find out why is to look at the generated code (see the previous article for instructions on how to do it).
Let's start with Java 7:
There aren’t any division instructions in the code. The pseudo-code looks like this:
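The pseudo-code listing is omitted here; reconstructed from the description below, with ** standing for the 128-bit multiplication and M for the magic reciprocal constant (approximately 2⁹¹ / 946840871) hard-coded in the generated code, the idea is the following sketch (the exact signed-value compensation HotSpot emits may differ slightly):

```
hashCode (v):
    q = (v ** M) >> 91        // high 64 bits of the product, shifted by 0x1B more
    q = q + (v >>> 63)        // compensate for negative v: % truncates towards zero
    return v - q * 946840871  // remainder = dividend - quotient * divisor
```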
I call this pseudo-code rather than just code because it contains an operation that is not accessible in Java: a 128-bit multiplication, which takes two 64-bit numbers and produces a 128-bit result (I wrote it as **). There is such an operation in the instruction set – both MUL and IMUL have 128-bit versions. The result is placed into rdx and rax; we are using rdx, that's why the total shift is 0x1B + 64 = 91.
What we see is the result of a well-known optimisation: replacing division by a constant divisor with a multiplication and some shifts. It is well described in Agner Fog's Optimizing subroutines in assembly language: An optimization guide for x86 platforms. HotSpot uses a slightly different algorithm (it uses two fewer bits and compensates for signed values). If the code were written for unsigned numbers, it could save three instructions here.
The Java 8 code looks like this:
This code does not employ any optimisations. It does not even compile the remainder operation into a DIV or IDIV instruction. It calls a runtime routine. It even goes as far as checking that the freshly loaded constant isn't zero.
In short, this is really awful code; no wonder it runs slowly. Perhaps the HotSpot developers considered division by a constant a rare and unimportant case and removed the corresponding optimisation to keep their code short.
Optimising division: unsigned case
The code Java 7 generates is perfect. The only way it could be improved is by using unsigned division instead of a signed one, but this can't be done in Java.
Is there any way to improve the Java 8 case? If our numbers were 32-bit, there would be an obvious way – to write code similar to the pseudo-code above; instead of a 128-bit multiplication, we would need a 64-bit one, which is a normal long operation. Unfortunately, we can't do it that easily in the 64-bit case because, as I said, there is no ** operation in Java.
This doesn’t mean we must stop here: after all, we have a general-purpose computer, and there is nothing mystical about 128-bit multiplication. It can be programmed in Java using available arithmetic.
Let’s first simplify the problem by considering our numbers unsigned.
Let x and y be unsigned 64-bit values, and A, B, C and D be unsigned 32-bit values, where

x = 2³²A + B
y = 2³²C + D

Then

xy = (2³²A + B)(2³²C + D) = 2⁶⁴AC + 2³²(AD + BC) + BD

or, graphically,
| high 64 bits of xy | | low 64 bits of xy | |
|---|---|---|---|
| hi (A*C) | lo (A*C) | | |
| | hi (A*D) | lo (A*D) | |
| | hi (B*C) | lo (B*C) | |
| | | hi (B*D) | lo (B*D) |
This is a 128-bit unsigned number, which we will represent as two longs, one for the high and one for the low part of the result. Both will be interpreted as unsigned 64-bit values. Each of the terms AC, AD, BC, BD is a 64-bit number.
The low part of the result can be calculated by the obvious code:
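The original listing is not reproduced here; a minimal sketch, assuming helper functions named uhi and ulo as described next, might be:

```java
// A sketch of the low-part multiplication; uhi/ulo extract the unsigned
// 32-bit halves of a long.
public class UnsignedLoPart {
    static long uhi(long x) { return x >>> 32; }          // unsigned high half
    static long ulo(long x) { return x & 0xFFFFFFFFL; }   // unsigned low half

    // low 64 bits of the unsigned 128-bit product x*y
    static long mult_unsigned_lopart(long x, long y) {
        long A = uhi(x), B = ulo(x);
        long C = uhi(y), D = ulo(y);
        return B * D + ((A * D + B * C) << 32);
    }
}
```

Incidentally, the low 64 bits are exactly what the ordinary Java multiplication x * y returns; the expanded form just follows the picture.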
Note the uhi and ulo functions (“unsigned hi” and “unsigned lo”) and their difference from the hi and lo that we used to extract the integer parts of a value stored as a long:
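The original definitions are not shown; they presumably look something like this sketch:

```java
// The two pairs of extraction helpers: hi/lo return int (sign-extended
// when widened back to long), while uhi/ulo stay in long (zero-extended).
public class Parts {
    static int  hi(long x)  { return (int) (x >>> 32); }
    static int  lo(long x)  { return (int) x; }
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }
}
```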
The hi and lo functions return int, which, if directly converted to long, will be sign-extended, and that will corrupt the results.
Here is a simple example. Let x be 0x0000000000000001 and y be 0xFFFFFFFFFFFFFFFF. Obviously, the 128-bit result should be 0x0000000000000000FFFFFFFFFFFFFFFF, and mult_unsigned_lopart() must return 0xFFFFFFFFFFFFFFFF (which will be printed as −1).
However, with the sign-extended versions of hi and lo we would have

A = 0
B = 1
C = −1
D = −1

BD + ((AD + BC) << 32)
  = −1 + ((0 + −1) << 32)
  = 0xFFFFFFFFFFFFFFFF + 0xFFFFFFFF00000000
  = 0xFFFFFFFEFFFFFFFF
The code using uhi and ulo works correctly:

A = 0
B = 1
C = 0xFFFFFFFF
D = 0xFFFFFFFF

BD + ((AD + BC) << 32)
  = 0xFFFFFFFF + ((0 + 0xFFFFFFFF) << 32)
  = 0xFFFFFFFF + 0xFFFFFFFF00000000
  = 0xFFFFFFFFFFFFFFFF
The mult_unsigned_lopart was shown here for completeness; we don't need it for our remainder problem. The pseudo-code of hashCode() above (the one using the ** operation) shifts the multiplication result by 91, which totally ignores the low part of the product.
It may seem that the following code will calculate the high part:
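A sketch of that naive attempt (the helpers follow the uhi/ulo convention above; the function name is mine):

```java
// A sketch of the naive, INCORRECT high-part calculation.
public class NaiveHiPart {
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    // WRONG: the sum A*D + B*C may not fit in 64 bits, and the carry from
    // adding the high part of B*D to the low parts of the middle terms
    // is ignored
    static long mult_unsigned_hipart_naive(long x, long y) {
        long A = uhi(x), B = ulo(x);
        long C = uhi(y), D = ulo(y);
        return A * C + uhi(A * D + B * C);
    }
}
```

For example, for x = y = 0xFFFFFFFFFFFFFFFF the true high part is 0xFFFFFFFFFFFFFFFE, while this function returns 0xFFFFFFFEFFFFFFFD.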
This, however, does not work – for two reasons. One is that A*D + B*C can occupy 65 bits (a carry bit may be produced during the addition), and the other is that, although the B*D term is not included directly in the result, it can produce a carry when its high part is added to the low parts of A*D + B*C. We have to perform the addition in 32-bit blocks and add all the carry bits manually, according to the picture above:
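The article's listing is not reproduced; a sketch performing the column-wise addition might look like this (the variable names are mine):

```java
// A carry-aware high-part multiplication, added column by column so that
// no carry bit is lost.
public class UnsignedHiPart {
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    // high 64 bits of the unsigned 128-bit product x*y
    static long mult_unsigned_hipart(long x, long y) {
        long A = uhi(x), B = ulo(x);
        long C = uhi(y), D = ulo(y);
        long AD = A * D;
        long BC = B * C;
        long BD = B * D;
        // the 32..63-bit column: low halves of A*D and B*C plus the high
        // half of B*D; at most 3 * (2^32 - 1), so it fits in a long
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        // the high column: A*C, the high halves of A*D and B*C, plus the
        // carry out of the middle column
        return A * C + uhi(AD) + uhi(BC) + uhi(mid);
    }
}
```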
Now it is very easy to program the hash function:
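The article's listing (with its hard-coded reciprocal) is not reproduced here. The sketch below shows the shape of such a function; rather than quoting the magic constant, it derives a reciprocal for shift 93 at class-load time with BigInteger (shift 93 is my choice: it makes the rounding error provably small enough for the whole unsigned 64-bit range; the real code would hard-code both numbers):

```java
import java.math.BigInteger;

// A sketch of the unsigned remainder by 946840871 via multiply-and-shift.
public class UnsignedDivHash {
    static final long DIVISOR = 946840871;
    // reciprocal: ceil(2^93 / DIVISOR), computed instead of hard-coded;
    // as a Java long it looks negative, but it is used as unsigned
    static final long M = BigInteger.ONE.shiftLeft(93)
            .divide(BigInteger.valueOf(DIVISOR)).add(BigInteger.ONE).longValue();

    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    static long mult_unsigned_hipart(long x, long y) {
        long A = uhi(x), B = ulo(x), C = uhi(y), D = ulo(y);
        long AD = A * D, BC = B * C, BD = B * D;
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return A * C + uhi(AD) + uhi(BC) + uhi(mid);
    }

    // unsigned remainder of v modulo DIVISOR, without a division instruction
    static int hashCode60(long v) {
        long q = mult_unsigned_hipart(v, M) >>> (93 - 64);  // quotient v / DIVISOR
        return (int) (v - q * DIVISOR);
    }
}
```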
It seems very unlikely that this monster will work faster than the original code. It contains five multiplications, one shift and six additions. This makes it a very interesting test: how fast can it possibly run?
Signed or unsigned?
We have just calculated an unsigned version of the remainder-based hash function. However, the code of LongPoint6.hashCode() uses regular Java arithmetic, which is signed. How important is that? What impact does it have on the quality of hashing?
In Java, dividing a negative number by a positive one produces a negative remainder:

−25 % 3 = −1

Strictly speaking, this isn't what is usually called “a remainder” in mathematics. A proper mathematical remainder is never negative (it satisfies 0 ≤ r < d) and would be 2 for −25 % 3.
To check the quality of unsigned hashing, we must emulate unsigned calculations in Java. A positive signed 64-bit number keeps its value when interpreted as unsigned. A negative number becomes a big positive one, with 2⁶⁴ added to its value, so we must add (2⁶⁴ mod 946840871) = 518863773 to the remainder. Since the remainder was zero or negative before, the sum won't exceed the divisor, but it may stay negative, in which case we need to add the divisor again:
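A sketch of this adjustment (the method name is mine; on Java 8 the same thing is available directly as Long.remainderUnsigned, but spelling it out shows where the 518863773 comes from):

```java
// Emulating the unsigned remainder by 946840871 using signed % only.
public class UnsignedRemainder {
    static final long DIVISOR   = 946840871;
    static final int  POW64_MOD = 518863773;   // 2^64 mod 946840871

    static int unsignedRemainder(long v) {
        int r = (int) (v % DIVISOR);           // in (-DIVISOR, 0] for negative v
        if (v < 0) {
            r += POW64_MOD;                    // account for the implicit 2^64
            if (r < 0) r += DIVISOR;           // may still be negative: fix up
        }
        return r;
    }
}
```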
Here are the slot counts we got with signed division (see here):
Field size: 1034; slots used: 982; avg= 1.05
Count size: 3938; slots used: 3236; avg= 1.22
And here are the results for unsigned division:
Field size: 1034; slots used: 968; avg= 1.07
Count size: 3938; slots used: 3133; avg= 1.26
The results haven't changed much for field but have changed for count: they became worse. This does not mean they became very bad. Previously, the result was equal to E + 5.28σ, where E is the expected value 3126.69 and σ is the standard deviation 20.68. Now the result is E + 0.3σ, which still falls well within good statistical limits and is better than what some other hash functions achieve. It just does not perform as well as the signed division, which, as I remarked back then, demonstrated unusually good behaviour.
This means that if we manage to write an incredibly fast version of the hash function based on unsigned division, we have a chance of improving the overall speed; otherwise, the signed version is a better choice.
Another approach is to change our OFFSET, which is now 0x80000000, to some positive value. This will leave fewer bits for the x and y parts, but the long values will always be positive, and the unsigned version will produce the same result as the signed one. Some of these values perform better than others: for instance, 0x40000000 gives 3118 (E − 0.42σ) as the count size, while 0x8000000 gives 3234 (E + 5.18σ). We won't do this, to keep the story short.
Optimising division: signed case
We've seen that the signed version has some potential advantages. Can we modify the code to work in the signed case?
The formulae we used above,

x = 2³²A + B
y = 2³²C + D
xy = (2³²A + B)(2³²C + D) = 2⁶⁴AC + 2³²(AD + BC) + BD

are still applicable if x and y are signed 64-bit numbers. In this case A and C are signed 32-bit values, while B and D are unsigned. The picture stays the same, except for one important change: AD and BC can now be negative, so these values must be sign-extended before addition:
| high 64 bits of xy | | low 64 bits of xy | |
|---|---|---|---|
| hi (A*C) | lo (A*C) | | |
| sign (A*D) | hi (A*D) | lo (A*D) | |
| sign (B*C) | hi (B*C) | lo (B*C) | |
| | | hi (B*D) | lo (B*D) |
Here sign(v) is a 32-bit sign extension of v. Note that BD must not be sign-extended, since it is a positive value.
These sign manipulations can be implemented very easily by using hi and lo instead of uhi and ulo where appropriate:
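The listing is omitted here; a sketch might be:

```java
// A signed high-part multiplication: A and C are extracted with sign
// extension, B and D without; A*D and B*C are shifted arithmetically so
// that their signs are preserved, while B*D is treated as unsigned.
public class SignedHiPart {
    static int  hi(long x)  { return (int) (x >>> 32); }
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    // high 64 bits of the signed 128-bit product x*y
    static long mult_signed_hipart(long x, long y) {
        long A = hi(x), B = ulo(x);      // A sign-extended, B zero-extended
        long C = hi(y), D = ulo(y);
        long AD = A * D;                 // may be negative
        long BC = B * C;                 // may be negative
        long BD = B * D;                 // treated as unsigned
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return A * C + (AD >> 32) + (BC >> 32) + uhi(mid);
    }
}
```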
Finally, the signed version of the hash function:
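Again, the article's listing and constant are not quoted; this sketch derives the reciprocal ⌈2⁹¹ / 946840871⌉ with BigInteger (the shift of 91 matches the Java 7 disassembly) and compensates for negative dividends by adding the sign bit, which reproduces Java's truncating % semantics:

```java
import java.math.BigInteger;

// A sketch of the signed remainder by 946840871 via multiply-and-shift.
public class SignedDivHash {
    static final long DIVISOR = 946840871;
    // reciprocal: ceil(2^91 / DIVISOR), computed instead of hard-coded
    static final long M = BigInteger.ONE.shiftLeft(91)
            .divide(BigInteger.valueOf(DIVISOR)).add(BigInteger.ONE).longValue();

    static int  hi(long x)  { return (int) (x >>> 32); }
    static long uhi(long x) { return x >>> 32; }
    static long ulo(long x) { return x & 0xFFFFFFFFL; }

    static long mult_signed_hipart(long x, long y) {
        long A = hi(x), B = ulo(x), C = hi(y), D = ulo(y);
        long AD = A * D, BC = B * C, BD = B * D;
        long mid = ulo(AD) + ulo(BC) + uhi(BD);
        return A * C + (AD >> 32) + (BC >> 32) + uhi(mid);
    }

    // signed remainder v % DIVISOR, without a division instruction
    static int hashCode61(long v) {
        // arithmetic shift keeps the sign; (v >>> 63) adds 1 for negative v,
        // turning the floor quotient into Java's truncated quotient
        long q = (mult_signed_hipart(v, M) >> (91 - 64)) + (v >>> 63);
        return (int) (v - q * DIVISOR);
    }
}
```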
Performance measurements
Let's add the new unsigned and signed versions to our hash function test suite (they'll be called LongPoint60 and LongPoint61 to indicate that they are modified versions of LongPoint6).
The full new test suite is here. The class to run is HashTime1.
Here are the results (in milliseconds for 100M iterations):
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
Point | Multiply by 3, 5 | 548 | 485 |
Long | Long default | 547 | 486 |
LongPoint | Multiply by 3, 5 | 547 | 486 |
LongPoint3 | Multiply by 11, 17 | 547 | 485 |
LongPoint4 | Multiply by two big primes | 547 | 486 |
LongPoint5 | Multiply by one big prime | 547 | 486 |
LongPoint6 | Modulo big prime | 578 | 1655 |
LongPoint60 | Modulo optimised, unsigned | 640 | 610 |
LongPoint61 | Modulo optimised, signed | 729 | 727 |
LongPoint7 | java.util.zip.CRC32 | 10109 | 1882 |
NullPoint | Null function (does nothing) | 532 | 455 |
The results are surprising. Our optimisation did work! The code above, although rather long and complex, still runs faster than the modulo operation as implemented by Java 8 – and quite a bit faster. If we subtract the calling overhead (measured as the time of NullPoint), we get:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Modulo big prime | 46 | 1200 |
LongPoint60 | Modulo optimised, unsigned | 108 | 155 |
LongPoint61 | Modulo optimised, signed | 197 | 272 |
Some observations:
- On Java 8, the optimised code, with its five multiplications, is still 7.7 times faster than the original code in the unsigned case and 4.4 times faster in the signed case.
- The signed case is quite a bit slower than the unsigned case (I'm not sure why).
- Java 8 is quite a bit slower than Java 7 in all cases; it just happens that the optimised versions in Java 8 are faster than the original one. Nothing beats the unoptimised Java 7 version.
- In CPU cycles the times are:
Class name | Comment | Cycles, Java 7 | Cycles, Java 8 |
---|---|---|---|
LongPoint6 | Modulo big prime | 1.2 | 31.2 |
LongPoint60 | Modulo optimised, unsigned | 2.8 | 4.0 |
LongPoint61 | Modulo optimised, signed | 5.1 | 7.0 |
Even though Java 8 figures are bigger than those of Java 7, they are still surprisingly small. The processor must be very advanced to calculate our unsigned version in 4 cycles.
The disassembly
Just out of curiosity, let's look at the code generated for our functions. Each of them is compiled twice by Java 8; we'll look at the second output. This is the code for the unsigned version (LongPoint60):

This is the code for the signed version (LongPoint61):
Both of these pieces look like very good code to me. It is a mystery why the second one runs so much slower than the first one. It is still amazing that both run much faster than a runtime call, which performs a DIV instruction.
A full test
How do the optimised versions perform in the full test (a Life application)? Let’s try them. Here are the times (in milliseconds for 10,000 iterations). The times for other versions are taken from the original article:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
Point | Multiply by 3, 5 | 2743 | 4961 |
Long | Long default | 4273 | 6755 |
LongPoint | Multiply by 3, 5 | 2602 | 4836 |
LongPoint3 | Multiply by 11, 17 | 2074 | 2028 |
LongPoint4 | Multiply by two big primes | 2000 | 1585 |
LongPoint5 | Multiply by one big prime | 1979 | 1553 |
LongPoint6 | Modulo big prime | 1890 | 1550 |
LongPoint60 | Modulo optimised, unsigned | 2124 | 1608 |
LongPoint61 | Modulo optimised, signed | 2198 | 1689 |
LongPoint7 | java.util.zip.CRC32 | 10115 | 3206 |
The new versions do not perform badly compared to other hash functions, but they are not nearly as fast as the microbenchmarks suggest. In particular, they don't outperform the original division-based version. Why this happened is a mystery; perhaps Java 8 is capable of some magic we don't know about yet.
Conclusions
- A division operation is very slow compared to other arithmetic; it should be avoided in performance-critical code unless there is evidence that the Java VM can optimise it.
- On the contrary, a multiplication operation is relatively fast; it still makes sense to replace multiplication by small constants with shifts, additions or LEA instructions, but compilers are usually good at that.
- Replacing division by a constant with a multiplication by a reciprocal is a very useful optimisation; it's a pity it has been removed from the Java VM in version 8.
- This optimisation requires a 128-bit multiplication, which isn't available in Java directly. It can be implemented from scratch, although in a somewhat ugly way.
- The code we produced was very long. It contained five multiplications and several shifts and additions; still, it was faster than the original code using division (on Java 8) – although only in a microbenchmark test.
- The optimisation didn't increase the speed of the full test – simply speaking, it failed.
Coming soon
Is there any way to improve division even more? Can we outperform the original division-based version on the full test this way, and if not, why? We’ll see soon.