This is the third article in a row that has a title starting with “Home-made”. While for the first one (on binary numbers) this is just a coincidence, for the other two (on hash tables) this illustrates specialisation as an optimisation idea. The code specially written for the task may perform better than the generic code. I say “may”, because this isn’t an absolute rule. The cases when this rule does not apply include:
- very complex algorithms, where re-implementing them is likely to cause plain bugs (it is surprising how many people make mistakes even writing a binary search from scratch);
- algorithms that contain special performance cases, for which the developers have already provided special treatment;
- cases where the language is powerful enough to provide ways to implement generic code that performs any specific task efficiently.
I’m assuming here that we need optimisation in the first place; that’s why I don’t mention the obvious cases where this particular piece of code isn’t a bottleneck, or where performance isn’t critical at all.
I always believed that the last of the three points above is applicable to C++. Let’s verify this using the problem that we have just solved in Java: a Game of Life simulation using hash maps.
The first implementation was done here: “Game of Life, hash tables and hash codes”. We studied the effect of the hash functions and found some that worked well.
Then we applied some improvements to the program: “Optimising the hash-based Life”. Most of the improvements dealt with replacing Java’s standard classes with our own, which helped improve performance. We didn’t, however, replace the standard HashMap class.
Finally, in “Home-made hash tables in Java” we implemented our own version of the HashMap class, which gave an additional performance boost.
The overall increase from when we started was about 15 times; the last 80% came from this rewriting. The article contains a full list of applied improvements and
their effect.
Now I’d like to apply a similar approach to C++. The two major questions for today’s study are:
- Does C++ templating make custom-made implementations of the standard classes completely unnecessary, or even harmful?
- How does the performance of C++ code compare with that of similar Java code? Most people (me included) expect much higher performance from C++. How correct is this expectation?
We’ll be comparing the results achieved on JRE 1.8.0.45 with those of GCC 4.5.4.
The latest version of Java code used as the basis of this development is here. The new C++ code is here.
The test
The articles mentioned above give full explanation of the hash-based approach to the Life simulation. The last one (“Home-made hash tables in Java”) contains the full list of performance figures.
We’re using the ACORN
pattern (see the previous article for the demonstration) and run its simulation
for 10,000 steps. The original time for this simulation in Java was 4961 ms. The improvements reduced it to 327 ms, so the simulation rate went up from 2,015 to 30,500 steps
per second.
We’ll do the same for C++. We’ll make a corresponding C++ version for each Java one, except for the cases where it is not applicable. This becomes an exciting race. Who comes first, Java or C++? And finally, after the finish of this race, we’ll have another one, the super-race: we’ll run the program with the Gosper Glider Gun as our original pattern. What makes this one special is that it produces more and more gliders and thus makes the colony grow indefinitely.
Let’s go.
The reference implementation
The general organisation of the new C++ program follows that of the Java program. We have a class Point that represents co-ordinates, and an interface Worker for Life implementations. All the implementations will be checked against the default one, called Hash_Reference. This implementation works directly on Point values and uses the direct C++ equivalents of the Java classes HashMap and HashSet:
The conversion of Java code is straightforward. This is what setting and resetting of cells looks like:
This relies on the Point class implementing an equality operator and a hash function that the unordered containers can use.
Here is the result:
Class name | Comment | Time, Java 8 | Time, C++ |
---|---|---|---|
Hash_Reference | Initial version with a trivial hash function | 4961 | 1372 |
C++ won the start of the race with a big advantage.
Hash_LongPoint: storing positions as 64-bit numbers
Just like in Java, we switch from using a class (Point) to using a 64-bit number; only in the C++ case it will be an unsigned number:
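A sketch of this representation (the packing and the offset constant are assumptions of this sketch):

```cpp
#include <cstdint>

// Positions packed into one unsigned 64-bit number: x in the high
// 32 bits, y in the low 32 bits; an offset keeps the stored halves
// non-negative (the exact packing is an assumption of this sketch).
typedef uint64_t LongPoint;

const int64_t OFFSET = 0x80000000LL;

inline LongPoint make_longpoint(int x, int y) {
    return ((uint64_t)(uint32_t)(x + OFFSET) << 32)
         | (uint32_t)(y + OFFSET);
}

inline int longpoint_x(LongPoint p) {
    return (int)((int64_t)(p >> 32) - OFFSET);
}

inline int longpoint_y(LongPoint p) {
    return (int)((int64_t)(p & 0xFFFFFFFFULL) - OFFSET);
}
```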
In Java we had to wrap this value in a special class, so that HashMap could call hashCode() on the instances of that class. In C++ this isn’t necessary, because we can specify a hash function as a template parameter of unordered_map. We can also make this function a template parameter of our Life implementation, which makes a factory class unnecessary:
We can now test Hash_LongPoint on the same set of hash functions as we did in Java (this set includes the standard hash function for a 64-bit number).
Here are the results:
Class name | Comment | Time, Java 8 | Time, C++ |
---|---|---|---|
Long | Long default | 5486 | 1043 |
LongPoint | Multiply by 3, 5 | 4809 | 1355 |
LongPoint3 | Multiply by 11, 17 | 1928 | 1183 |
LongPoint4 | Multiply by two big primes | 1611 | 1302 |
LongPoint5 | Multiply by one big prime | 1744 | 1256 |
LongPoint6 | Modulo big prime | 1650 | 1201 |
LongPoint7 | CRC32 | 3254 | 1244 |
The C++ equivalent of the Long default hash function is the default one for LongPoint, which is a uint64_t. It is available via the standard class std::hash<LongPoint>. This function performs much better than its Java counterpart. It is the best of the lot, which is impressive.
Our best option in Java (LongPoint6, which divides the number by a big prime and returns the remainder) also performs reasonably well, but, surprisingly, LongPoint3 performs better.
The CRC-based function isn’t as bad as in Java’s case, because this time we use the hardware-based implementation:
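A sketch of a hardware CRC32C-based hasher (whether this matches the article's exact code is an assumption; it needs SSE4.2, which the target attribute requests for GCC/Clang without a global compiler flag):

```cpp
#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>  // SSE4.2 CRC32 intrinsics

// The hash is a single hardware instruction instead of a table-driven
// software CRC, which is why it is much cheaper here than in Java.
struct HashCRC {
    __attribute__((target("sse4.2")))
    size_t operator()(uint64_t p) const {
        return (size_t)_mm_crc32_u64(0, p);
    }
};
```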
We’ll test all our implementations with the three best hash functions: the default one for uint64_t, LongPoint3 (multiplying by 11 and 17), and our Java favourite, the division-based LongPoint6.
Hash1: using mutable objects as hash table values
In Java we used values of class Integer as our initial values, because this class is standard in Java and allows boxing and unboxing, which makes the code clearer. In C++ we can make hash maps of anything, including type int. No boxing is needed, and the values are immediately mutable, which makes the code very nice:
(Indexation of the hash map creates an empty element if the index is not found; in the case of int it is zero, which is exactly what we need.)
This means that the C++ version of Hash_LongPoint already employs the Hash1 improvement, so there is no equivalent of Hash1 in C++.
In Java this change improved time from 1650 to 1437 ms; all our C++ versions are still faster.
Hash2: a unified hash map
We merge the two hash maps (a real hash map and a hash set) into one. For that, we create the class Value, exactly as we did in Java. All the other changes are also straightforward.
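A sketch of what the Value class might contain (field names are assumptions): it merges the neighbour count (formerly the map's value) with the "alive" flag (formerly the set membership):

```cpp
// One entry of the unified hash map: the neighbour count and the
// "alive" flag together (field names are assumptions of this sketch).
struct Value {
    int count;    // number of live neighbours
    bool live;    // whether this cell itself is alive
    Value() : count(0), live(false) {}
};
```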
Description | Time, Java 8 | Time, C++ |
---|---|---|
A unified hash map, default hash | | 845 |
A unified hash map, multiply by 11, 17 | | 961 |
A unified hash map, modulo big prime | 1163 | 1003 |
The standard hash function is still the winner, and C++ is still ahead (although Java is catching up).
Hash3: a check in set() removed
In Java, the code of set() retrieved the current neighbour count as an Integer and checked it against null before storing the incremented value. The improvement made use of the fact that the c == null condition is never true.
condition is never true. The corresponding fragment in C++ just looks like this:
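A sketch of that fragment (the neighbours() helper is a hypothetical stand-in for the real neighbour enumeration): operator[] performs the insert-if-absent itself, so there is nothing to check:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

typedef uint64_t LongPoint;
struct Value { int count = 0; bool live = false; };

std::unordered_map<LongPoint, Value> field;

// A hypothetical neighbour enumeration for the packed representation
// (boundary wrap-around is ignored in this sketch).
std::vector<LongPoint> neighbours(LongPoint w) {
    std::vector<LongPoint> r;
    for (int64_t dx = -1; dx <= 1; dx++)
        for (int64_t dy = -1; dy <= 1; dy++)
            if (dx != 0 || dy != 0)
                r.push_back(w + (uint64_t)dx * 0x100000000ULL + (uint64_t)dy);
    return r;
}

// operator[] inserts a default-constructed Value when the key is
// absent, so there is no c == null check to remove.
void set(LongPoint w) {
    for (LongPoint p : neighbours(w))
        ++field[p].count;
    field[w].live = true;
}
```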
The check happens inside the implementation of unordered_map, so we can’t remove it. As a result, the Hash3 improvement is not applicable. In Java it reduced the time from 1163 ms to 1151 ms; C++ is still ahead.
Hash4: Entry iterator in the loop
This improvement was very specific to Java. The Java hash map has the concepts of a key set and an entry set, so there are two ways of iterating the map: over the keys, looking each value up, or directly over the entries. Replacing the first with the second produced a big effect: the time went down from 1151 ms to 976 ms.
In C++ there is only one way to iterate the map, and that is the one we’ve already been using:
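A sketch of that iteration, with counting live cells as an example body:

```cpp
#include <cstdint>
#include <unordered_map>

typedef uint64_t LongPoint;
struct Value { int count = 0; bool live = false; };

std::unordered_map<LongPoint, Value> field;

// The single C++ iteration style: a range-based for over the map,
// visiting keys and values together (auto& avoids copying entries).
int count_live() {
    int n = 0;
    for (auto& w : field)
        if (w.second.live)     // w.first is the key, w.second the value
            ++n;
    return n;
}
```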
where auto stands for the map’s entry type, std::pair<const LongPoint, Value>.
This means that we are already iterating over hash map entries, so this improvement is not applicable. For the first time, Java runs faster than C++ (the C++ time is 1003 ms). However, this applies only to the division-based hash function, which was the best in Java. C++ still has two reserves: the other two hash functions, which produced times of 845 ms and 961 ms.
Hash5: ArrayList<Entry> introduced
This has a direct equivalent in C++. Instead of collecting LongPoint values into a std::vector, we can collect the values produced by the map iterator, which are std::pair<LongPoint, Value&>. The rest of the program must be modified exactly as in Java.
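A sketch of the collection step under these assumptions (a pointer stands in for the reference, and the Life rules pick the cells to change):

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

typedef uint64_t LongPoint;
struct Value { int count = 0; bool live = false; };

// The key collected together with access to its value, so no second
// lookup is needed when applying the changes.
typedef std::pair<LongPoint, Value*> Entry;

void collect(std::unordered_map<LongPoint, Value>& field,
             std::vector<Entry>& toSet, std::vector<Entry>& toReset) {
    for (auto& w : field) {
        if (!w.second.live && w.second.count == 3)
            toSet.push_back(Entry(w.first, &w.second));        // birth
        else if (w.second.live && (w.second.count < 2 || w.second.count > 3))
            toReset.push_back(Entry(w.first, &w.second));      // death
    }
}
```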
Description | Time, Java 8 | Time, C++ |
---|---|---|
ArrayList<Entry>, default hash | | 812 |
ArrayList<Entry>, multiply by 11, 17 | | 909 |
ArrayList<Entry>, modulo big prime | 975 | 923 |
C++ has regained leadership. The race becomes more and more interesting.
Hash6: re-using ArrayList objects
In C++ the equivalent of ArrayList is std::vector, and it benefits from the same modification as the ArrayList: re-using the object may reduce the number of memory allocations. The change is very similar to that in Java: toSet and toReset become fields, and clear() is called on them in step().
Description | Time, Java 8 | Time, C++ |
---|---|---|
Re-used ArrayList<Entry>, default hash | | 726 |
Re-used ArrayList<Entry>, multiply by 11, 17 | | 842 |
Re-used ArrayList<Entry>, modulo big prime | 947 | 857 |
This change really helped in C++ – even more than it did in Java. C++ is gaining an advantage in this race.
Hash7: replacing ArrayList with a plain array
This change helped in Java; let’s see if the equivalent change helps in C++. However, when trying to implement a similar scheme, we immediately discover that there is no fully equivalent change. The C++ version of the Entry class is std::pair<LongPoint, Value&>, which does not have a default constructor, so a declaration such as new std::pair<LongPoint, Value&>[CAPACITY] just won’t compile. We could resolve this by using low-level memory manipulation functions in a C-like manner, but that would be too much for our task. Besides, this is what the standard library already does inside its implementation of std::vector.
We’ll have to create another Entry class of our own, and convert map entries into it:
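A sketch of a default-constructible Entry and of the conversion loop (the field names and the collection condition are assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

typedef uint64_t LongPoint;
struct Value { int count = 0; bool live = false; };

// A pointer instead of a reference makes the class default-
// constructible, so plain arrays of it can be allocated.
struct Entry {
    LongPoint key;
    Value* value;
    Entry() : key(0), value(nullptr) {}
    Entry(LongPoint k, Value* v) : key(k), value(v) {}
};

// Converting map entries into our Entry while filling a plain array:
void collect(std::unordered_map<LongPoint, Value>& field,
             Entry* toSet, size_t& nSet) {
    nSet = 0;
    for (auto& w : field)
        if (!w.second.live && w.second.count == 3)
            toSet[nSet++] = Entry(w.first, &w.second);
}
```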
Here is the result:
Description | Time, Java 8 | Time, C++ |
---|---|---|
ArrayList replaced with an array, default hash | | 753 |
ArrayList replaced with an array, multiply by 11, 17 | | 869 |
ArrayList replaced with an array, modulo big prime | 904 | 891 |
This is a disappointment. The times got worse, although still better than those in Java. This is an example of the opposite of the idea we started with: in C++, using the standard classes may be more beneficial than re-implementing them. I wish it would stay like that.
Hash8: using the LinkedHashMap and increasing the hash map capacity
The standard C++ library does not have an equivalent of the LinkedHashMap class. Probably, the library designers decided that whoever needs to arrange data in some other container, such as a linked list, can do so easily. We can still test what happens when we increase the hash map capacity of our existing solution (we’ll use Hash6 as the best so far):
Description | Time, Java 8 | Time, C++ |
---|---|---|
Capacity made bigger, default hash | | 2075 |
Capacity made bigger, multiply by 11, 17 | | 1847 |
Capacity made bigger, modulo big prime | 591 | 2090 |
This is the first time where Java performance is much higher than that of C++. Of course, the comparison isn’t completely fair, since we compare different algorithms, but still: in Java the improved algorithm is available as part of the standard library, while in C++ it is not.
The first home-made implementation
Now we are going to replace the standard hash map implementation with our own. We’ll copy the Java implementation almost exactly. The class Field implements the hash table functionality, while Hash_HomeMade uses it to implement Life.
The most notable differences are:
- in Java we had to pass a hasher object into a constructor, while in C++ it became a template parameter, removing any overhead of using it;
- in Java we passed the hash map capacity to the constructor, while in C++ we also made it a template parameter – a constant is always more efficient to use than a variable;
- the Java version employs the optimisation of Hash7 (replacing ArrayList collections with plain arrays), while the C++ version does not – this change didn’t improve performance in C++;
- C++ does not have garbage collection, so we must explicitly delete all objects that become unused; we must also delete the main array in the destructor.
Since we were testing on both Java 7 and Java 8, we weren’t using the lambda syntax in Java; as a result, the main method (step) looked a bit ugly. In C++ we can use this syntax. Here is what the step method looks like:
Here Cell is the class that represents the value stored in the hash table, together with the key and the hash table artefacts (the index value and the table_next pointer that arranges values in a linked list). Our vectors are defined as follows:
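A sketch of the Cell class and the vectors (field names are assumptions based on the description above):

```cpp
#include <cstdint>
#include <vector>

typedef uint64_t LongPoint;

// The value stored in the home-made hash table: the key, the
// neighbour count, and the hash table artefacts.
struct Cell {
    LongPoint position;  // the key
    int index;           // cached hash table index
    int neighbours;      // live-neighbour count
    bool live;
    Cell* table_next;    // next cell in the same hash chain
    Cell(LongPoint pos, int hash_index)
        : position(pos), index(hash_index), neighbours(0),
          live(false), table_next(nullptr) {}
};

// The vectors now hold pointers to cells rather than map entries:
std::vector<Cell*> toSet, toReset;
```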
As in Java, we first test it on a small table capacity (8192). Here are the results:
Description | Time, Java 8 | Time, C++ |
---|---|---|
The first home-made hash map implementation (small capacity), default hash | | 1143 |
The first home-made hash map implementation (small capacity), multiply by 11, 17 | | 629 |
The first home-made hash map implementation (small capacity), modulo big prime | 465 | 611 |
This result is sensational: Java runs faster than C++ with exactly the same algorithm. The only explanation I can think of is the garbage collector. While we usually tend to think of it as a performance-hindering feature, here it can actually help, provided that the heap size is big enough. Instead of deallocating every single unused object, we get rid of all of them at once, also at some cost, but rather infrequently.
Another interesting feature of this result is that for the first time the hash function that was the best in Java also became the best in C++. Moreover, the previous C++ best (the default 64-bit integer hash) is now performing worse than before. This is something to investigate later.
Hash_HomeMade2: the hash map code is built into the main code
In Java this helped because it removed callback-based iteration. Virtual calls in Java are fast but directly-inserted code is even faster. Does it work the same way in C++?
Mostly, the code transformation consists of merging the two classes (Field and Hash_HomeMade) and moving all the relevant methods into one combined class. We can then provide some degree of encapsulation by collecting all the Field methods together and visually separating them from the code that uses them. This works with all the methods except iteration (and more efficient iteration was the point of this change). In Java we had no choice but to put the iteration code directly where it was used.
In C++ we can define the iterator code as a macro:
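A sketch of such macros (the table layout and names are assumptions): the loop header and footer are hidden, and the body goes in between:

```cpp
#include <cstdint>

struct Cell {
    uint64_t position = 0;
    Cell* table_next = nullptr;
};

const int CAPACITY = 4;          // tiny, for demonstration only
Cell* table[CAPACITY] = {};

// The client writes the loop body between ITERATE and END_ITERATE
// (the article's exact macros may differ).
#define ITERATE(cell) \
    for (int i_ = 0; i_ < CAPACITY; i_++) \
        for (Cell* cell = table[i_]; cell != nullptr; cell = cell->table_next) {
#define END_ITERATE }

int count_cells() {
    int n = 0;
    ITERATE(c)
        ++n;
        (void)c;  // a real body would use the cell here
    END_ITERATE
    return n;
}

Cell a, b;
void setup() {   // link two cells into one chain, for demonstration
    a.table_next = &b;
    table[1] = &a;
}
```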
There are different opinions on whether this macro trick improves readability or makes it worse. It helps keep all the relevant code in one place without sacrificing performance, but to many eyes it looks ugly.
Here are the times:
Comment | Time, Java 8 | Time, C++ |
---|---|---|
The hash map code is built into the main code, default hash | | 1107 |
The hash map code is built into the main code, multiply by 11, 17 | | 622 |
The hash map code is built into the main code, modulo big prime | 449 | 576 |
The speed has improved, but the Java version is still faster.
Hash_HomeMade3: collecting cells in a double-linked list
In Java this imitated the activity of LinkedHashMap. We can translate this solution to C++ directly. Instead of class Cell, we use class ListedCell, with prev and next pointers. Obviously, we must add the appropriate list modifications where the cells are inserted into the map or removed from it.
The macros ITERATE and END_ITERATE, introduced in the previous section, can help again: we simply rewrite them to traverse the linked list, and the client code (in our case, that of step()) does not have to change at all.
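A sketch of the rewritten macros over a double-linked list with a sentinel head (the sentinel and all names are assumptions); any client code between ITERATE and END_ITERATE stays untouched:

```cpp
#include <cstdint>

struct ListedCell {
    uint64_t position = 0;
    ListedCell* prev = nullptr;
    ListedCell* next = nullptr;
};

// A sentinel heads the circular double-linked list of all cells.
ListedCell full_list;

void init_list() {
    full_list.next = full_list.prev = &full_list;
}

void insert_cell(ListedCell* c) {
    c->next = full_list.next;
    c->prev = &full_list;
    full_list.next->prev = c;
    full_list.next = c;
}

// The same macro names, now walking the list instead of the table:
#define ITERATE(cell) \
    for (ListedCell* cell = full_list.next; cell != &full_list; \
         cell = cell->next) {
#define END_ITERATE }

int count_cells() {
    int n = 0;
    ITERATE(c) ++n; (void)c; END_ITERATE
    return n;
}

ListedCell a, b;   // two cells, for demonstration
```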
It would have been nice if macros were language objects with the usual visibility rules, rather than a pure text pre-processing feature. Then we could place all the hash map functionality in a separate class and export the iteration macros from there.
Here are the times:
Description | Time, Java 8 | Time, C++ |
---|---|---|
Cells collected in a list, default hash | | 1286 |
Cells collected in a list, multiply by 11, 17 | | 567 |
Cells collected in a list, modulo big prime | 433 | 552 |
The default hash result became worse; the rest improved. C++ is still behind, but the improvement is bigger than in Java. C++ is slowly catching up.
Hash_HomeMade4: the hash map capacity is set to 256K
In Java the hash map capacity was defined as a constant to improve performance, so we had to make another class to change this value. In C++ it is a template parameter, so the C++ version of Java’s Hash_HomeMade4 is simply Hash_HomeMade3<256*1024>.
Here are the times:
Description | Time, Java 8 | Time, C++ |
---|---|---|
Big hash map capacity, default hash | | 1184 |
Big hash map capacity, multiply by 11, 17 | | 500 |
Big hash map capacity, modulo big prime | 353 | 424 |
C++ keeps catching up (using the fastest hash function), but is still behind.
Hash_HomeMade5: the action lists were introduced
Here we replace our vector containers with plain linked lists. For that, we move on from the ListedCell class to the ActionCell class, which has one extra pointer field (next_action), and introduce the obvious transformations of the code. Again, the garbage collection issue becomes important; this time the code change is in favour of C++. In Java, when iterating over the action lists, it was important to set c.next_action to null, to remove an extra reference to another Cell object. If we didn’t do that, a lot of unwanted objects would be considered live and escape garbage collection.
In C++ this is unnecessary, but there is another consideration:
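A sketch of that consideration in code (reset() here always deletes, for the sake of the example; names are assumptions):

```cpp
#include <cstdint>

struct ActionCell {
    uint64_t position = 0;
    ActionCell* next_action = nullptr;
};

int deleted = 0;   // counts deletions, for demonstration

// reset() may delete the cell; in this sketch it always does.
void reset(ActionCell* c) {
    ++deleted;
    delete c;
}

// Walking the action list: next_action must be saved before reset(),
// because the cell may no longer exist after the call.
void apply_reset(ActionCell* toReset) {
    for (ActionCell* c = toReset; c != nullptr; ) {
        ActionCell* next = c->next_action;  // save before possible delete
        reset(c);
        c = next;
    }
}
```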
Since reset() can delete the object, we have to make sure to extract the next_action field before this call. There is no such problem for set().
Here are the results:
Description | Time, Java 8 | Time, C++ |
---|---|---|
Action lists, default hash | | 1162 |
Action lists, multiply by 11, 17 | | 478 |
Action lists, modulo big prime | 354 | 425 |
It improved all the cases except for the fastest one. It did the same in Java, though.
Hash_HomeMade6: hard-coded hash function
Since our hash function is a template parameter, it is as good as already hard-coded. We can skip this version. It didn’t help much in Java either: the time went down from 354 ms to 353 ms (recovering from the action lists damage).
Hash_HomeMade7: the memory allocation optimisation
This is the first time we’ll be doing something in C++ that isn’t applicable in Java. We allocate and deallocate memory every time we insert a new cell into the hash table or remove an existing one. We could keep the objects and re-use them. We could do that in Java, too, but in Java memory allocation is cheap, so it didn’t promise any improvements.
We’ll use the same ActionCell as before and collect all unallocated elements in a linked list, using the next pointer. The calls to new and delete must be modified in a straightforward way:
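A sketch of the free-list technique (names assumed): deallocated cells are chained through their next pointer and handed out again before new is ever called:

```cpp
#include <cstdint>

struct ActionCell {
    uint64_t position = 0;
    ActionCell* next = nullptr;  // chains the free list when unallocated
};

ActionCell* free_list = nullptr;

// Replacement for new: take a cell from the free list if possible.
ActionCell* allocate() {
    if (free_list == nullptr)
        return new ActionCell();
    ActionCell* c = free_list;
    free_list = c->next;
    return c;
}

// Replacement for delete: push the cell onto the free list.
void deallocate(ActionCell* c) {
    c->next = free_list;
    free_list = c;
}
```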
Since this is the last of our versions that is applicable to various hash functions, we’ll run it with all of them:
Hash function | Time, Java 8 | Time, C++ |
---|---|---|
Default hash | | 1065 |
Multiply by 3, 5 | | 542 |
Multiply by 11, 17 | | 411 |
Multiply by two big primes | | 441 |
Multiply by one big prime | | 437 |
Modulo big prime | 353 | 359 |
CRC32 | | 438 |
The division-based hash function is the best. It almost caught up with Java, only a little bit is left to go.
Hash_Additive: the additive division-based hash function
We’ve finished the generic algorithms (suitable for any hash function), and the next thing to try is the set of additive algorithms, which compute each hash value from the previous one. This worked in Java (at least, for one version); we’ll try it in C++.
The first version operates on the division-based hash function and makes use of the fact that, when a constant is added to the dividend, the remainder changes in a predictable way. The rest stays the same as in Hash_HomeMade7.
Description | Time, Java 8 | Time, C++ |
---|---|---|
The additive implementation of "Modulo big prime" hash | 327 | 312 |
Finally, we’ve made it: the C++ version runs faster than the Java one.
Hash_Additive2: the additive multiplication-based hash function
The next version is based on the LongPointHash4 hash function, which multiplies the two parts of a 64-bit number by two big prime numbers and adds the results. The transition from one hash value to another when a constant is added to the key is trivial. However, the quality of hashing of this function is somewhat lower than that of the division-based one, so we can’t predict how this version will perform.
Description | Time, Java 8 | Time, C++ |
---|---|---|
The additive implementation of the "Multiply by two big primes" hash | 351 | 360 |
C++ is behind again, but, anyway, this version isn’t the fastest.
Hash_Additive3: the ultimate additive multiplication-based hash function
This version does not keep the original co-ordinates at all, only their values multiplied by the prime numbers. This makes calculating hash values very fast, but requires a special procedure to recover the co-ordinates when reporting the simulation results.
Description | Time, Java 8 | Time, C++ |
---|---|---|
Improved additive implementation of the "Multiply by two big primes" hash | 361 | 348 |
Java became slower, C++ became faster and caught up, but this is still not the fastest version.
The first race: The summary
This is the end of the first race. Finally, C++ has won, but only marginally (312 ms against 327 ms in Java). Here is the full scorecard (we show the best results in each row):
Java class name | Comment | Time, Java 8 | Time, C++ |
---|---|---|---|
Hash_Reference | Initial version with a trivial hash function | 4961 | 1372 |
Hash_LongPoint | Storing position in a 64-bit value and using a good hash function | 1650 | 1043 |
Hash1 | Mutable Count | 1437 | |
Hash2 | A unified hash map | 1163 | 845 |
Hash3 | Check in set removed | 1151 | |
Hash4 | Entry iterator in the loop | 976 | |
Hash5 | ArrayList<Entry> introduced | 975 | 812 |
Hash6 | Re-used ArrayList<Entry> | 947 | 726 |
Hash7 | ArrayList replaced with an array | 904 | 753 |
Hash8 | LinkedHashMap used, capacity made bigger | 591 | 1847 |
Hash_HomeMade | The first home-made hash map implementation (small capacity) | 465 | 611 |
Hash_HomeMade2 | The hash map code is built into the main code | 449 | 576 |
Hash_HomeMade3 | The cells are collected into a double-linked list | 433 | 552 |
Hash_HomeMade4 | The hash map capacity is set to 256K | 353 | 424 |
Hash_HomeMade5 | The action lists were introduced | 354 | 425 |
Hash_HomeMade6 | The division-based hash function, hard coded | 353 | |
Hash_Additive | The division-based hash function, additive | 327 | 312 |
Hash_Additive2 | Multiplication by two big numbers, additive | 351 | 360 |
Hash_Additive3 | Multiplication by two big numbers, ultimate additive | 361 | 348 |
The big race: the glider gun
We’ve finished the previous article with the simulation of the Gosper Glider Gun. We’ll do the same in C++ now. This is the ultimate race. If the Acorn simulations can be compared with separate Formula-1 laps, the Glider Gun is more like Le Mans 24 hours – literally: it really runs for 24 hours and more (2,500,000 iterations ran for 33 hours in Java).
We’ll be running the Hash_Additive version, as this is the best one we’ve produced so far. Here are the results (I didn’t have the patience to finish the 2.5M steps in C++):
Steps | Time for 10,000, sec | | Total time, sec | |
---|---|---|---|---|
| Java | C++ | Java | C++ |
10,000 | 0.75 | 0.77 | 0.75 | 0.77 |
20,000 | 1.6 | 2.7 | 2.3 | 3.5 |
30,000 | 2.7 | 4.9 | 5.0 | 8.4 |
500,000 | 125 | 317 | 2,206 | 6,301 |
1,000,000 | 324 | 980 | 13,577 | 37,160 |
1,500,000 | 553 | 1,670 | 35,203 | 101,968 |
2,000,000 | 834 | 2,436 | 70,420 | 187,174 |
2,500,000 | 1,126 | | 119,481 | |
This result is sensational. C++ is losing dramatically: as the structure grows, the execution speed in C++ becomes three times lower than in Java. Why this happens is a topic for another big investigation. We can straight away come up with some theories:
- one possible explanation involves the different strategies of memory allocation (direct allocate/deallocate vs garbage collection);
- different memory layout patterns could have caused better utilisation of the RAM cache in Java’s case;
- Java could have generated much better code;
- this could be a result of Java using compressed pointers (objects become shorter, and more of them fit into the cache).
We won’t investigate it here, however, as this article is already too long. We’ll return to this issue later.
Conclusions
- The general myth “C++ is always faster than Java” is busted.
- The default hash map implementation in C++ is better than the default one in Java.
- However, Java’s hash map performance can be improved by using a higher capacity and the LinkedHashMap, while the C++ one can’t (or, rather, it can, but only by custom coding).
- Most of the optimisations applicable in Java are also applicable in C++. This specifically applies to implementing a custom-made hash map. In our case it performed three times faster than the standard one (although the C++ standard one worked three times faster than Java’s standard one).
- The custom implementation of hash maps makes both Java and C++ faster, but helps Java more (in fact, it helps Java so much that it outperforms C++).
- The C++ default hash function for integer numbers (uint64_t in our test) works much better than the one in Java. It performed better than any of the custom ones until we implemented a custom hash map. Why this happened is still a mystery.
- While C++ caught up with Java performance on the small test, it failed miserably on the big one.
Coming soon
The C++ program being three times slower than the Java version is an abnormal situation and a mystery. It must be investigated. I’ll do that in one of the next articles.