Today we are going to try to optimise a hash-function based on CRC32 calculation. Here is some background.
In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life using Java’s HashMap
.
We discovered that performance depends on a choice of a hash function and identified some of them that work well (spread hash values evenly). Those included a function based
on calculating a remainder and another one that used CRC32.
Then we measured the speed of the hash functions and tried optimising
the remainder-based one (in “Optimising division-based hash functions” and “More optimisations of division-based hash: Karatsuba’s method”).
These attempts failed, because the optimisation performed by the HotSpot compiler was better than anything we could do by hand. However, I don’t regret that effort –
we’ve learned something interesting in the process. Among other things, we discovered (see
“Don’t trust a micro-benchmark”) that the results of micro-benchmark tests do not always correlate with the results
of full program tests, so micro-benchmarks must be used with care.
Nevertheless, today we are going to run micro-benchmarks again, this time on a CRC32-based hash function. Having learned the lesson, we’ll later verify the results on the full program test. Obviously, the same may happen that happened with the remainder-based function: the compiler may already be very good in compiling it. We can only hope that even in this case we’ll learn something interesting. Let’s go.
The method in question
The full code of the project is here. This is the function we are talking about – LongPoint7.hashCode()
:
Here are the execution times, in milliseconds for 100M iterations (for comparison, I added times for one other hash function and for the null function, which does nothing):
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint5 | Multiply by one big prime | 547 | 486 |
LongPoint7 | java.util.zip.CRC32 | 10109 | 1882 |
NullPoint | Null function (does nothing) | 532 | 455 |
Our hash function is much slower than the others, especially on Java 7. Let’s see what can be done here.
The first ideas
The CRC32
class we are using comes from java.util.zip
. Let’s look inside this class.
This is the code of update()
in both Java 7 and Java 8:
A native call is made for every input byte, and we know that native calls are expensive.
This suggests our first idea for improvement: to make just one native call for the entire long
.
The CRC32
class contains a method we can use for that (at a cost of an array allocation):
Here is the new code (class LongPoint71
):
One immediate observation about this code is that we don’t have to allocate a new byte array each time and can re-use an existing one. This may seem obvious that this
way it will be faster but I won’t be convinced until I try. Here is the new class LongPoint72
:
Another improvement that seems obvious is to re-use the CRC32
object as well. We’ll try this, too, but there is one important point to note: after this the hashCode()
function
stops being thread-safe. Two threads using the same CRC32
object may corrupt each other’s calculations. Sharing the byte array does not cause this, because the functions write
the same values into the elements of this array. Our program is single-threaded, but it is preferable to write more generally applicable code. Still, we’ll try this option, too
(see LongPoint73
):
Now it’s time to start running the programs (the main class is HashTime
):
Here are the results:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint7 | CRC32, 8 calls one byte each | 10109 | 1882 |
LongPoint71 | CRC32, one call with an array | 6730 | 2056 |
LongPoint72 | CRC32, one call with re-used array | 6422 | 1901 |
LongPoint73 | CRC32, one call with re-used array and CRC | 6566 | 2009 |
-
The optimisation helped on Java 7 (we’ve got 33% improvement), but not on Java 8, where the time got worse (it resembles the story with the remainder-based hash, doesn’t it?)
-
Re-using the array have helped a bit
-
Not only is re-using of CRC object thread-unsafe, it is also slower than allocating new one each time. Object allocation isn’t always slow!
From arrays to buffers
Our next step is replacing an array with a direct byte buffer. This offers two potential advantages:
-
Direct byte buffers support fast decomposition of primitive values into bytes. The entire
long
can be written into a buffer using just one CPU instruction -
JNI has built-in support for direct buffers, and passing them to the native code is faster than passing arrays.
Unfortunately, only Java 8 contains a method that applies CRC32 to a byte buffer. Here it is:
This method doesn’t even use the JNI support for direct buffers. Instead, it takes the address of the underlying memory and passes it to the native code as a number. This trick promises very high performance.
Let’s write a version that uses this method and run it on Java 8.
Here is the code (LongPoint74
):
Note that this code went one step further in re-using the same object than LongPoint73
: it made the buffer a static object. There is a reason for it: direct buffers are allocated
in separate (native) memory, and the minimal quantum of allocation is 4096 bytes. They are freed when their Java objects get garbage-collected, over which we have
no control. If Java heap is big enough, too many Java objects may be allocated before the garbage collector is called, and we may run out of native memory.
Obviously, the static buffer causes this method to be thread-unsafe.
The results, however, are not impressive:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint7 | CRC32, 8 calls one byte each | 10109 | 1882 |
LongPoint72 | CRC32, one call with re-used array | 6422 | 1901 |
LongPoint74 | CRC32, one call with a direct byte buffer | 1903 |
We won’t even try re-using the CRC object.
Writing in Java
What else can we do? It seems like all our attempts have failed. No, there are other options to try. The first one is to write the entire CRC32 calculation in Java. It is unlikely to
be fast (if it was fast enough, the JDK designers would have used this approach), but completeness requires us to test this anyway. Fortunately, we don’t need to implement
CRC32 according to its strict definition (division of polynoms). There is a fast table-based implementation
(see TableCRC32.java
). This is what the application of CRC to a byte
and to a long
look like:
And this is our new hash function:
The results look much better:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint7 | CRC32, 8 calls one byte each | 10109 | 1882 |
LongPoint72 | CRC32, one call with re-used array | 6422 | 1901 |
LongPoint75 | CRC32, written in Java | 1505 | 1420 |
Our prediction was wrong – this is definitely an improvement. The time on Java 7 has improved dramatically, while on Java 8 we’ve improved over the original time for the first time. It is unclear why JDK does not use Java implementation of CRC32.
Writing in C
The next thing we can try is rewrite the same code in C. Most probably, the original native code of CRC32 is very similar to what we are going to write,
but one trick may give us an advantage: we are going to introduce a JNI call that operates on a long
. No arrays or byte buffers necessary. For the purpose of this article
we’ll create a simple class with a single native method:
Obviously, the proper implementation must have many other useful methods, but we’ll implement only this one, to keep things short.
We’ll need a C file, called NativeCRC32.c
,
with the table-based CRC implementation and a JNI wrapper:
To build it, we use a rather sophisticated command line:
cc -m64 -Wall -O3 -fPIC -I${JAVA_HOME}/include -I${JAVA_HOME}/include/linux
-shared -o libNativeCRC32.so NativeCRC32.c
The compiler produces library libNativeCRC32.so
, which will be loaded when the class NativeCRC32
is initialised.
Here are the results:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint72 | CRC32, one call with re-used array | 6422 | 1901 |
LongPoint75 | CRC32, written in Java | 1505 | 1420 |
LongPoint76 | CRC32, written in C, one JNI call | 1890 | 1898 |
The results are not great, in the case of Java 8 they don’t differ much from those of our version using an array. In both cases they are worse than our results in Java. This disproves a popular belief that C is always faster than Java – at least, when JNI is involved.
Writing in assembly
It looks like we’ve reached a limit. What more can one do when we already tried to write everything in C? Only one thing: write in assembly. OK, it won’t be a true assembly; C with SSE intrinsics will be sufficient for our needs.
Intel processors (starting at SSE 4.2) have an instruction calculating CRC32. This, however, is a different version of CRC32, called CRC32C (Castagnolli). It differs
from our CRC32 in the polynom it uses, but the general principles are the same, and it can be calculated by the same table-based code (except the table is different).
We didn’t measure hash values distribution when using CRC32C, but let’s hope it is good. We’ll add another function to our NativeCRC32
class:
We’ll also need another class, LongPoint77
, to use this function:
Finally, we need to add -msse4.2
to our build command, and we are ready to run:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint75 | CRC32, written in Java | 1505 | 1420 |
LongPoint76 | CRC32, written in C, one JNI call | 1890 | 1898 |
LongPoint77 | CRC32C, written in C using SSE, one JNI call | 1257 | 1214 |
This is much better than anything we’ve seen before. On Java 7 this version runs faster by 87% (that is, eight times faster) than the original one. On Java 8 the speedup is smaller, only by 35%. Unfortunately, even such a speedup can’t be considered a success: other hash functions take 400-600 ms with similar quality of hashing, and we’ve exhausted our list of ideas for CRC32 optimisation.
The cost of JNI
It still looks strange that we’ve achieved such a low speed for a hash function consisting of just one CPU instruction (CRC32
). Maybe, this is a very slow instruction?
No, it has the latency of 3 cycles (according to Agner Fog). The only possible explanation can be the cost of the JNI
call. Let’s check this. We’ll add another function to our native code:
Here are the results:
Name | Description | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint75 | CRC32, written in Java | 1505 | 1420 |
LongPoint76 | CRC32, written in C, one JNI call | 1890 | 1898 |
LongPoint77 | CRC32C, written in C using SSE, one JNI call | 1257 | 1214 |
LongPoint78 | JNI call to an empty function | 1256 | 1214 |
This looks shocking at first: the results look as if the CRC calculation didn’t require any CPU cycles. This can be explained by the parallel structure of our processor: while CRC32 is busy executing and as long as its result isn’t used, the rest of the program can carry on. The JNI exit sequence may take longer than 3 cycles, so the CRC32 instruction can totally disappear from our timing.
Our measurements also show that JNI isn’t cheap. Let’s look at Java 8. The test with empty JNI call (LongPoint78
) executed for 1214 ms. The test with the empty function
(NullPoint
) took 455 ms (these are the expenses of the virtual call plus the test overheads). This means that the JNI call to the static method taking a long
as a parameter
and returning an int
, executed 100 million times, took 759 ms, or about 7.6 ns per execution. On this processor it corresponds to 20 CPU cycles. This is relatively small overhead for
methods that take thousands of cycles to execute, but it is too big for short methods. In the case of CRC32C the overheads of JNI were much bigger than the cost of the method
itself, while in the case of the CRC32 written in C they were roughly the same (the calculation took about 18 cycles). In short, nothing based on JNI is a good solution for a
hash function – unless we need to hash objects of kilobyte or megabyte sizes.
We can now see why the original version took so long in Java 7. Eight JNI calls alone take 5800 ms. Why is then Java 8’s version faster than that?
The mystery of fast CRC32 on Java 8
To answer this question, we’ll have to look at the assembly listings of LongPoint7.hashCode
. In Java 7 the code of this method contains eight calls to CRC32.update
:
This fragment shows only two calls to update()
. The entire code is here.
The code for CRC32.update
contains calls to the runtime, which probably represent JNI call.
The code on Java 8 looks completely different. It is much longer and more complex. The entire code is here, I’ll show the Java reconstruction:
Here table
is some statically defined array of type int[]
. This code looks odd at first, but if we simplify it, it will look very familiar:
This is exactly the code of crc32 (long)
in TableCRC32
. It starts at crc = -1
and applies the function
to all eight bytes of v
.
What we have just found out if that in Java 8 the method CRC32.update (int)
is intrinsic. It is defined as native
but is not compiled as a native call. Instead,
its code is placed directly at the call site. All the usual optimisations are performed after such substitution. For instance, in our case the constant propagation has happened –
the first crc >>> 8
was replaced with 0xFFFFFF
. Instruction re-ordering has also been performed – extraction of bytes from v
happens before the table access.
No wonder we haven’t achieved any speed improvements when calling versions of update
with arrays or byte buffers – we eventually executed the same code, but carried the
additional costs of manipulating those arrays or buffers.
What is more surprising is that the pure Java version is performing faster than the one using intrinsic code (1420 ms instead of 1802). The detailed study of this phenomenon
falls outside the scope of this article, but a brief examination shows that the Java
version is in fact compiled better.
Here is the code. It is shorter: 71 instruction instead
of 87. For instance, a single table access takes four instructions:
while in the intrinsic version it took six:
Interestingly, the access to the array in both cases is done using the static address rather than indirectly using a reference. While in the intrinsic case this may be
explained by the code not being written in Java, in Java case the compiler clearly made use of the fact that the
array is static final
, so its address is known at the time of compilation. The same applies to its size. The compiler knew the array size was 256 and it was accessed
using byte-sized indices, which allowed to optimise out the index checking. This resulted in a nearly optimal code.
The full test
Let’s now run the full test (a Life simulation). The entire code for the test is here. This table shows both micro-benchmark and full-test times for Java 7 and Java 8. Two other hash functions added for comparison:
Class name | Comment | Time, micro-benchmark | Time, full test | ||
---|---|---|---|---|---|
Java 7 | Java 8 | Java 7 | Java 8 | ||
LongPoint5 | Multiply by one big prime | 547 | 486 | 1979 | 1553 |
LongPoint6 | Modulo big prime | 578 | 1655 | 1890 | 1550 |
LongPoint7 | java.util.zip.CRC32 | 10109 | 1882 | 10115 | 3206 |
LongPoint71 | CRC32, one call with an array | 6730 | 2056 | 7416 | 3515 |
LongPoint72 | CRC32, one call with re-used array | 6422 | 1901 | 7196 | 3235 |
LongPoint73 | CRC32, one call with re-used array and CRC | 6566 | 2009 | 7771 | 3603 |
LongPoint74 | CRC32, one call with a direct byte buffer | 1903 | 3310 | ||
LongPoint75 | CRC32, written in Java | 1505 | 1420 | 3522 | 2664 |
LongPoint76 | CRC32, written in C, one JNI call | 1890 | 1898 | 3584 | 2915 |
LongPoint77 | CRC32C, written in C using SSE, one JNI call | 1257 | 1214 | 2409 | 2045 |
LongPoint78 | JNI call to an empty function | 1256 | 1214 | ||
NullPoint | Null function (does nothing) | 532 | 455 |
This time the results are predictable: what is faster in the micro-benchmark, is also faster in the full test.
Conclusions
-
We managed to optimise CRC32 calculation quite a bit, especially on Java 7
-
java.util.zip.CRC32.update
is an intrinsic on Java 8, so there is no point reducing the JNI call overheads -
Surprisingly, the version in pure Java works faster than that intrinsic. It is also faster than the version in C called via JNI
-
The only way to perform calculations faster in C is by using the new CRC32 instruction available via SSE intrinsic set. One must remember that this is a different version of CRC, although it is also well suitable as a hash function.
-
With all our optimisations, the hash function based on CRC32 still works slower than one based on a single multiplication or division. While CRC may be useful for hashing longer byte sequences, or for data distributed in some special way, it isn’t the best hash function for our problem. We’ll use the division-based function from now on.
Coming soon
We’ve paid enough attention to the hash functions. Now it’s time to optimise the algorithm itself. That’s what we’ll be doing in the next article.