Home-made hash tables in C++: a race against Java

This is the third article in a row that has a title starting with “Home-made”. While for the first one (on binary numbers) this is just a coincidence, for the other two (on hash tables) this illustrates specialisation as an optimisation idea. The code specially written for the task may perform better than the generic code. I say “may”, because this isn’t an absolute rule. The cases when this rule does not apply include:

very complex algorithms where re-implementing them is likely to cause plain bugs (it is surprising how many people make mistakes even writing a binary search from scratch);
algorithms that contain special performance cases, for which the developers have already provided a special treatment;
cases where the language is powerful enough to provide ways to implement generic code that efficiently performs any specific tasks.

I’m assuming here that we need optimisation in the first place, that’s why I don’t mention obvious cases where this particular piece of code isn’t a bottleneck, or the performance isn’t critical at all.

I always believed that the last of the three points above is applicable to C++. Let’s verify this using the problem that we have just solved in Java: a Game of Life simulation using hash maps.

The first implementation was done here: “Game of Life, hash tables and hash codes”. We studied the effect of the hash functions and found some that worked well.

Then we applied some improvements to the program: “Optimising the hash-based Life”. Most of the improvements dealt with replacing of Java standard classes with the classes of our own, which helped improve performance. We didn’t, however, replace the standard HashMap class.

Finally, in “Home-made hash tables in Java” we implemented our own version of the HashMap class, which gave additional boost of performance. The overall increase from when we started was about 15 times; the last 80% came from this rewriting. The article contains a full list of applied improvements and their effect.

Now I’d like to apply similar approach to C++. Two major questions for today’s study are:

Does C++ templating make custom-made implementation of the standard classes completely unnecessary or even harmful?
How does performance of C++ code compare with that of the similar Java code? Most people (me included) expect much higher performance from C++. How correct is this expectation?

We’ll be comparing the results achieved on JRE 1.8.0.45 with those of GCC 4.5.4.

The latest version of Java code used as the basis of this development is here. The new C++ code is here.

The test

The articles mentioned above give full explanation of the hash-based approach to the Life simulation. The last one (“Home-made hash tables in Java”) contains the full list of performance figures.

We’re using the ACORN pattern (see the previous article for the demonstration) and run its simulation for 10,000 steps. The original time for this simulation in Java was 4961 ms. The improvements reduced it to 327 ms, so the simulation rate went up from 2,015 to 30,500 steps per second.

We’ll do the same for C++. We’ll make a corresponding C++ version for each Java one, except for the cases where it is not applicable. This becomes an exciting race. Who comes first, Java or C++? And finally, after the finish of this race, we’ll have another one, the super-race: we’ll run the program with the Gosper Glider Gun as our original pattern. What makes this one special is that it produces more and more gliders and thus makes the colony grow indefinitely.

Let’s go.

The reference implementation

The general organisation of the new C++ program follows that of the Java program. We have a class Point that represents co-ordinates, and an interface Worker for Life implementations. All the implementations will be checked against the default one, called Hash_Reference. This implementation works directly on Point values and uses the direct C++ equivalents of Java classes HashMap and HashSet:

std::unordered_set<Point, PointHash> field;
std::unordered_map<Point, int, PointHash> counts;

typedef uint32_t HashType;

class PointHash
{
public:
    inline HashType operator() (Point p) const
    {
        return (HashType) (p.x * 3 + p.y * 5);
    }
};

The conversion of Java code is straightforward. This is what setting and resetting of cells looks like:

void inc(Point w)
{
    ++ counts[w];
}

void dec(Point w)
{
    if (!--counts[w]) {
        counts.erase(w);
    }
}

void set(Point w)
{
    for (Point p : w.neighbours()) {
        inc(p);
    }
    field.insert(w);
}

void reset(Point w)
{
    for (Point p : w.neighbours()) {
        dec(p);
    }
    field.erase(w);
}

This relies on the Point class to implement method

std::array<Point, 8> neighbours() const

Here is the result:

Class name	Comment	Time, Java 8	Time, C++
Hash_Reference	Initial version with a trivial hash function	4961	1372

C++ won the start of the race with a big advantage.

`Hash_LongPoint`: Storing positions as 64-bit numbers

Just like in Java, we switch from using a class (Point) to using a 64-bit number, only in C++ case it will be an unsigned number:

typedef uint64_t LongPoint;

In Java we had to wrap this value in a special class, so that HashMap could call hashCode() on the instances of that class. In C++ this isn’t necessary, because we can specify a hash function as a template parameter of unordered_map. We can also make this function a template parameter of our Life implementation, which makes a factory class unnecessary:

template<class HASH = std::hash<LongPoint> > class Hash_LongPoint
                                                   : public Worker
{
private:
    std::unordered_set<LongPoint, HASH> field;
    std::unordered_map<LongPoint, int, HASH> counts;

We can now test Hash_LongPoint on the same set of hash functions as we did in Java (this set includes the standard hash function for a 64-bit number).

Here are the results:

Class name	Comment	Time, Java 8	Time, C++
Long	`Long` default	5486	1043
LongPoint	Multiply by 3, 5	4809	1355
LongPoint3	Multiply by 11, 17	1928	1183
LongPoint4	Multiply by two big primes	1611	1302
LongPoint5	Multiply by one big prime	1744	1256
LongPoint6	Modulo big prime	1650	1201
LongPoint7	CRC32	3254	1244

The C++ equivalent of the Long default hash function is the default one for LongPoint, which is uint64_t. It is available via the standard class std::hash<LongPoint>. This function performs much better than its Java counterpart. It is the best of the lot, which is impressive.

Our best option in Java (LongPoint6, which divides the number by a big prime number and returns the reminder) also performs reasonably well, but, surprisingly, LongPoint3 performs better.

The CRC-based function isn’t as bad as in Java’s case, because this time we use the hardware-based implementation:

#include <nmmintrin.h>

class LongPointHash7
{
public:
    HashType operator() (LongPoint p) const
    {
        return (HashType)~_mm_crc32_u64(~(uint64_t)0, p);
    }
};

We’ll test all our implementations with the three best hash functions: the default one for int64_t, the LongPoint3 (multiplying by 11 and 17), and our Java favourite, the division-based LongPoint6.

`Hash1`: using mutable objects as hash table values

In Java we used values of class Integer as our initial values, because this class is standard in Java and allows boxing and unboxing, which makes the code clearer. In C++ we can make hash maps of anything, including type int. No boxing is needed, and the values are immediately mutable, which makes the code very nice:

void inc(Point w)
{
    ++ counts[w];
}

(indexation of the hash map creates an empty element if the index is not found. In the case of int it is zero, which is exactly what we need).

This means that the C++ version of Hash_LongPoint already employs the Hash1 improvement, so there is no equivalent of Hash1 in C++. In Java this change improved time from 1650 to 1437 ms; all our C++ versions are still faster.

`Hash2`: a unified hash map

We merge two hash maps (a real hash map and a hash set) into one. For that, we create the class Value, exactly like we did in Java. All the other changes are also straightforward.

Description	Time, Java 8	Time, C++
A unified hash map, default hash		845
A unified hash map, multiply by 11, 17		961
A unified hash map, modulo big prime	1163	1003

The standard hash function is still the winner, and C++ is still ahead (although Java is catching up).

`Hash3`: a check in `set()` removed

In Java, the code of set() looked like this:

void set (LongPoint k)
{
    long w = k.v;
    inc (w-DX-DY);
    inc (w-DX);
    inc (w-DX+DY);
    inc (w-DY);
    inc (w+DY);
    inc (w+DX-DY);
    inc (w+DX);
    inc (w+DX+DY);
    Value c = field.get (k);
    if (c == null) {
        field.put (k,  new Value (0, true));
    } else {
        c.live = true;
    }
}

The improvement made use of the fact that c == null condition is never true. The corresponding fragment in C++ just looks like this:

    field[w].set();

The check happens inside the implementation of unordered_map, so we can’t remove it. As a result, the Hash3 improvement is not applicable. In Java it reduced the time from 1163 ms to 1151 ms; C++ is still ahead.

`Hash4`: `Entry` iterator in the loop

This improvement was very specific for Java. The Java’s hash map has a concept of a key set and that of an entry set, so there are two ways of iterating the map:

for (LongPoint w : field.keySet ()) {
    Value c = field.get (w);

and

for (Entry<LongPoint, Value> w : field.entrySet ()) {
    Value c = w.getValue ();

Replacing the first one with the second one produced big effect: the time went down from 1151 ms to 976 ms.

In C++ there is only one way to iterate the map, and that is the one we’ve already been using:

for (auto w : field) {
    Value& v = w.second;

where auto actually stands for std::pair<LongPoint, Value&>.

This means that we are already iterating over hash map entries, so this improvement is not applicable. For the first time, Java runs faster than C++ (the C++ time is 1003 ms). However, this applies to the division-based hash function, which was the best in Java. The C++ still has two reserves: two other hash functions that made the time 845 ms and 961 ms.

`Hash5`: `ArrayList<Entry>` introduced

This has a direct equivalent in C++. Instead of collecting LongPoint values into a std::vector, we can collect values produced by the map iterator, which are std::pair<LongPoint, Value&>. The rest of the program must be modified exactly as in Java.

Description	Time, Java 8	Time, C++
`ArrayList<Entry>`, default hash		812
`ArrayList<Entry>`, multiply by 11, 17		909
`ArrayList<Entry>`, modulo big prime	975	923

C++ has regained leadership. The race becomes more and more interesting.

`Hash6`: re-using `ArrayList` objects

In C++ the equivalent of ArrayList is std::vector, and it benefits from the same modification as the ArrayList: re-using the object may reduce the amount of memory allocations. The change is very similar to that in Java: toSet and toReset become fields, and clear() is called on them in step().

Description	Time, Java 8	Time, C++
Re-used `ArrayList<Entry>`, default hash		726
Re-used `ArrayList<Entry>`, multiply by 11, 17		842
Re-used `ArrayList<Entry>`, modulo big prime	947	857

This change really helped in C++ – even more than it did in Java. The C++ is gaining advantage in this race.

`Hash7`: replacing `ArrayList` with a plain array

This change helped in Java; let’s see if the equivalent change helps in C++. However, when trying to implement a similar scheme, we immediately discover that there isn’t a fully equivalent change. The C++’s version of the Entry class is std::pair<LongPoint, Value&>, which does not have a default constructor, so we can’t allocate arrays of this type:

toReset = new Entry[INITIAL_ARRAY_SIZE];

just won’t compile. We could resolve this by using low-level memory manipulation functions in a C-like manner, but that would be too much for our task. Besides, this is what the standard library already does inside its implementation of std::vector.

We’ll have to create another Entry class:

typedef std::pair<LongPoint, Value*> Entry;

and then convert from one to another:

for (auto it = field.begin(); it != field.end(); it++) {
    Value& v = it->second;
    if (v.live) {
        if (v.count < 2 || v.count > 3) toReset[reset_count++] =
                                          Entry(it->first, &v);
    }
    else {
        if (v.count == 3) toSet[set_count++] = Entry(it->first, &v);
    }
}

Here is the result:

Description	Time, Java 8	Time, C++
`ArrayList` replaced with an array, default hash		753
`ArrayList` replaced with an array, multiply by 11, 17		869
`ArrayList` replaced with an array, modulo big prime	904	891

This is a disappointment. The times got worse, although still better than those in Java. This is the example of the idea opposite to the one we started with: in C++ using the standard classes may be more beneficial than re-implementing them. I wish it stays like that.

`Hash8`: using the `LinkedHashMap` and increasing the hash map capacity

The standard C++ library does not have an equivalent of the LinkedHashMap class. Probably, the library designers decided that whoever needs to arrange data in some other container, such as a linked list, can do so easily. We can test what happens when we increase the hash map capacity of our existing solution (we’ll use the Hash6 as the best so far):

Description	Time, Java 8	Time, C++
Capacity made bigger, default hash		2075
Capacity made bigger, multiply by 11, 17		1847
Capacity made bigger, modulo big prime	591	2090

This is the first time where Java performance is much higher than that of the C++. Of course, the comparison isn’t completely fair, since we compare different algorithms, but still in Java the improved algorithm is available as a part of the standard library, while in C++ it is not.

The first home-made implementation

Now we are going to replace the standard hash map implementation with our own one. We’ll copy the Java implementation almost exactly. The class Field implements the hash table functionality, while Hash_MomeMade uses it to implement Life. The most notable differences are:

in Java we had to pass a hasher object into a constructor, while in C++ it became a template parameter, removing any overheads when using it;
in Java we passed the hash map capacity to the constructor, while in C++ we also made it a template parameter – a constant is always more efficient to use than a variable;
Java version employs the optimisation of Hash7 (replacing of ArrayList collections with plain arrays), while the C++ version does not – this change didn’t improve performance in C++;
C++ does not have a garbage collection, so we must explicitly delete all objects that become unused. We must also delete the main array in the destructor.

Since we were testing on both Java 7 and Java 8, we weren’t using the lambda syntax in Java. As a result the main method (step) looked a bit ugly, In C++ we can use this syntax. Here is what the step method looks like:

void step()
{
    toSet.clear();
    toReset.clear();

    field.iterate([this](Cell * cell) {
        if (cell->live) {
            if (cell->neighbours < 2 || cell->neighbours > 3)
               toReset.push_back(cell);
        }
        else {
            if (cell->neighbours == 3) toSet.push_back(cell);
        }
    });
    for (Cell * c : toSet) {
        set(c);
    }
    for (Cell * c : toReset) {
        reset(c);
    }
}

Here Cell is the class that represents the value stored in the hash table, together with the key and hash table artefacts (the index value and the table_next pointer to arrange values in a linked list). Our vectors are defined as

std::vector<Cell*> toReset;
std::vector<Cell*> toSet;

As in Java, we first test it on a small table capacity (8192). Here are the results:

Description	Time, Java 8	Time, C++
The first home-made hash map implementation (small capacity), default hash		1143
The first home-made hash map implementation (small capacity), multiply by 11, 17		629
The first home-made hash map implementation (small capacity), modulo big prime	465	611

This result is sensational: Java runs faster than C++ with exactly the same algorithm. The only explanation I can think of is the garbage collector. While usually we tend to think about it as a performance-hindering feature, here it can actually help, provided that the heap size is big enough. Instead of deallocating every single unused object, we get rid of all of them, also at some cost, but rather infrequently.

Another interesting feature of this result is that for the first time the hash function that was the best in Java also became the best in C++. Moreover, the previous C++’s best (the default 64-bit integer hash) is now performing worse than before. This is something to investigate later.

`Hash_HomeMade2`: the hash map code is built into the main code

In Java this helped because it removed callback-based iteration. Virtual calls in Java are fast but directly-inserted code is even faster. Does it work the same way in C++?

Mostly, the code transformation is about merging the two classes (Field and Hash_HomeMade), and moving all the relevant methods into one combined class. We can then provide some degree of encapsulation by collecting all the Field methods together and visually separating them from the code using them. It works with all the methods except for iteration (and more efficient iteration was the point of this change). In Java we didn’t have a choice but to put the iteration code directly where it was used:

for (Cell cell : table) {
    for (Cell c = cell; c != null; c = c.table_next) {
        if (c.live) {
            if (c.neighbours < 2 || c.neighbours > 3) toReset[resetPtr ++] = c;
        } else {
            if (c.neighbours == 3) toSet[setPtr ++] = c;
        }
    }
}

In C++ we can define the iterator code as a macro:

#define ITERATE(cell) \
    for (size_t i = 0; i < CAPACITY; i++) {\
        for (Cell* cell = table[i]; cell; cell = cell->table_next)

#define END_ITERATE }

ITERATE (cell) {
    if (cell->live) {
        if (cell->neighbours < 2 || cell->neighbours > 3)
            toReset.push_back(cell);
    }
    else {
        if (cell->neighbours == 3) toSet.push_back(cell);
    }
} END_ITERATE

There are different opinions on if this macro trick really improves readability or makes it worse. It helps keep all the relevant code in one place without sacrificing performance, but for many eyes it looks ugly.

Here are the times:

Comment	Time, Java 8	Time, C++
The hash map code is built into the main code, default hash		1107
The hash map code is built into the main code, multiply by 11, 17		622
The hash map code is built into the main code, modulo big prime	449	576

The speed has improved, but Java version is still faster.

`Hash_HomeMade3`: collecting cells in a double-linked list

In Java this imitated the activity of LinkedHashMap. We can translate this solution to C++ directly. Instead of class Cell, we use class ListedCell with prev and next pointers. Obviously, we must add appropriate list modifications where the cells are inserted into the map or removed from it.

The macros ITERATE and END_ITERATE introduced in the previous section, can help again. We simply rewrite this macros as

#define ITERATE(c) \
    for (ListedCell* c = full_list.next; c != &full_list; c = c->next)

#define END_ITERATE

and the client code (in our case, that of step()) does not have to change at all.

It would have been nice to have macros as language objects with usual visibility rules, rather than a pure text pre-processing feature. Then we would place all the hash map functionality in a separate class and export the iteration macros from there.

Here are the times:

Description	Time, Java 8	Time, C++
Cells collected in a list, default hash		1286
Cells collected in a list, multiply by 11, 17		567
Cells collected in a list, modulo big prime	433	552

The default hash result became worse, the rest improved. C++ is still behind, but the improvement is better than in Java. C++ is slowly catching up.

`Hash_HomeMade4`: The hash map capacity is set to 256K

In Java the hash map capacity was defined as a constant to improve performance, so we had to make anothe class to change this value. In C++ it is a template parameter, so the C++’s version of the Java’s Hash_HomeMade4 is Hash_HomeMade3<256*1024>.

Here are the times:

Description	Time, Java 8	Time, C++
Big hashmap capacity, default hash		1184
Big hashmap capacity, multiply by 11, 17		500
Big hashmap capacity, modulo big prime	353	424

C++ keeps catching up (using the fastest hash function), but is still behind.

`Hash_HomeMade5`: the action lists were introduced

Here we replace our vector containers with usual linked lists. For that, we move on from the ListedCell to the ActilnCell class, which has one extra reference field (next_action), and introduce the obvious transformations of the code. Again, the garbage collection issue becomes important. This time the code change is in favour of C++. In Java, the iteration over the action lists looked like this:

for (ActionCell c = toSet; c != null; c = next_action) {
    set (c);
    next_action = c.next_action;
    c.next_action = null;
}
for (ActionCell c = toReset; c != null; c = next_action) {
    reset (c);
    next_action = c.next_action;
    c.next_action = null;
}

It was important to set c.next_action to null, to remove an extra reference to another Cell object. If we didn’t do that, a lot of unwanted objects would be considered live and escape garbage collection.

In C++ this is unnecessary, but there is another consideration:

for (ActionCell* c = toSet; c; c = c->next_action) {
    set(c);
}
for (ActionCell* c = toReset; c;) {
    ActionCell* next_action = c->next_action;
    reset(c);
    c = next_action;
}

Since reset() can delete the object, we have to make sure to extract the next_action field before this call. There is no such problem for set().

Here are the results:

Description	Time, Java 8	Time, C++
Action lists, default hash		1162
Action lists, multiply by 11, 17		478
Action lists, modulo big prime	354	425

It improved all the cases except for the fastest one. It did the same in Java, though.

`Hash_HomeMade6`: hard-coded hash function

Since our hash function is a template parameter, it is as good as already hard-coded. We can skip this version. It didn’t help much in Java either: the time went down from 354 ms to 353 ms (recovering from the action lists damage).

`Hash_HomeMade7`: the memory allocation optimisation

This is the first time we’ll be doing something in C++ that isn’t applicable in Java. We allocate and deallocate memory every time we insert a new cell into the hash table or remove existing one. We could keep the objects ane re-use them. We could do that in Java, too, but in Java memory allocation is cheap, so it didn’t promise any improvements.

We’ll use the same ActionCell as before and collect all unallocated elements in a linked list, using the next pointer. Calls to new and delete must be modified in a straightforward way:

ActionCell * allocate(LongPoint position, int neighbours, bool live = false)
{
    if (!free_list) {
        return new ActionCell(position, neighbours, live);
    }
    ActionCell * result = free_list;
    free_list = result->next;
    result->set(position, neighbours, live);
    return result;
}

void deallocate(ActionCell * cell)
{
    cell->next = free_list;
    free_list = cell;
}

Since this is the last of our versions that is applicable to various hash functions, we’ll run it with all of them:

Hash function	Time, Java 8	Time, C++
Default hash		1065
Multiply by 3, 5		542
Multiply by 11, 17		411
Multiply by two big primes		441
Multiply by one big prime		437
Modulo big prime	353	359
CRC32		438

The division-based hash function is the best. It almost caught up with Java, only a little bit is left to go.

`Hash_Additive`: The additive division-based hash function

We’ve finished the generic algorithms (suitable for any hash functions), and the next thing to try is the set of additive algorithms, based on the calculation of the hash value based on the previous hash value. It worked in Java (at least, for one version), we’ll try this in C++.

The first version operates using the division-based hash function and makes use of the fact that when a constant is added to the divisible, the remainder changes in a predictable way. The rest stays the same as in Hash_HomeMade7.

Description	Time, Java 8	Time, C++
The additive implementation of "Modulo big prime" hash	327	312

Finally, we’ve made it: the C++ version runs faster than Java’s one.

`Hash_Additive2`: The additive multiplication-based hash function

The next version is based on the LongPointHash4 hash function, which multiplies two parts of a 64-bit number by two big prime numbers and adds the results. The transition from one hash value to another one when a constant is added to the key is trivial. However, the quality of hashing of this function is somewhat lower than that of the division-based one, so we can’t predict how this version will perform.

Description	Time, Java 8	Time, C++
The additive implementation of the "Multiply by two big primes" hash	351	360

C++ is behind again, but, anyway, this version isn’t the fastest.

`Hash_Additive3`: The ultimate additive multiplication-based hash function

This version does not keep original co-ordinates at all, only their values multiplied by prime numbers. This makes calculating hash values very fast but requires special procedure to recover the co-ordinates when reporting the simulation results.

Description	Time, Java 8	Time, C++
Improved additive implementation of the "Multiply by two big primes" hash	361	348

Java became slower, C++ became faster and caught up, but this is still not the fastest version.

The first race: The summary

This is the end of the first race. Finally, the C++ has won, but just marginally (312 ms against 327 in Java). Here is the full scorecard (we show the best results in each row):

Java class name	Comment	Time, Java 8	Time, C++
Hash_Reference	Initial version with a trivial hash function	4961	1372
Hash_LongPoint	Storing position in a 64-bit value and using a good hash function	1650	1043
Hash1	Mutable Count	1437
Hash2	A unified hash map	1163	845
Hash3	Check in `set` removed	1151
Hash4	`Entry` iterator in the loop	976
Hash5	`ArrayList<Entry>` introduced	975	812
Hash6	Re-used `ArrayList<Entry>`	947	726
Hash7	`ArrayList` replaced with an array	904	753
Hash8	`LinkedHashMap` used, capacity made bigger	591	1847
Hash_HomeMade	The first home-made hash map implementation (small capacity)	465	611
Hash_HomeMade2	The hash map code is built into the main code	449	576
Hash_HomeMade3	The cells are collected into a double-linked list	433	552
Hash_HomeMade4	The hash map capacity is set to 256K	353	424
Hash_HomeMade5	The action lists were introduced	354	425
Hash_HomeMade6	The division-based hash function, hard coded	353
Hash_Additive	The division-based hash function, additive	327	312
Hash_Additive2	Multiplication by two big numbers, additive	351	360
Hash_Additive3	Multiplication by two big numbers, ultimate additive	361	348

The big race: the glider gun

We’ve finished the previous article with the simulation of the Gosper Glider Gun. We’ll do the same in C++ now. This is the ultimate race. If the Acorn simulations can be compared with separate Formula-1 laps, the Glider Gun is more like Le Mans 24 hours – literally: it really runs for 24 hours and more (2,500,000 iterations ran for 33 hours in Java).

We’ll be running the Hash_Additive version, as this is the best one we’ve produced so far.

Here are the results (I didn’t have patience to finish 2.5M steps):

Steps	Time for 10,000, sec		Total time, sec
Steps	Java	C++	Java	C++
10,000	0.75	0.77	0.75	0.77
20,000	1.6	2.7	2.3	3.5
30,000	2.7	4.9	5.0	8.4
500,000	125	317	2,206	6,301
1,000,000	324	980	13,577	37,160
1,500,000	553	1,670	35,203	101,968
2,000,000	834	2,436	70,420	187,174
2,500,000	1,126		119,481

This result is sensational. The C++ is losing dramatically. The execution speed in C++ becomes three times lower than in Java as the structure grows. Why this happens is a topic of another big investigation. We can straight away come with some theories:

One possible explanations involves a different strategy of memory allocation (direct allocate/deallocate vs garbage collection);
Different memory layout patterns could have caused better utilisation of the RAM cache in Java’s case;
Java could have generated much better code;
This could be a result of Java using compressed pointers (objects become shorter and more of them fit into the cache).

We won’t investigate it here, however, as this article is already too long. We’ll return to this issue later.

Conclusions

The general myth “C++ is always faster than Java” is busted.
The default hash map implementation in C++ is better than the default one in Java.
However, Java’s hash map performance can be improved by using higher capacity and the LinkedHashMap, while the C++’s can’t (or, rather, it can, but by custom coding).
Most of the optimisations applicable in Java are also applicable in C++. This specifically applies to implementing a custom-made hash map. In our case it performed three times faster than the standard one (although the standard one worked three times faster than the Java’s standard one).
The custom implementation of hash maps makes both Java and C++ faster, but helps Java more (in fact, it helps it so much that it outperforms C++).
The C++’s default hash function for integer numbers (uint32_t in our test) works much better than the one in Java. It performed better than any of the custom ones until we implemented a custom hash map. Why this happened is still a mystery.
While C++ caught up with Java performance on the small test, it failed miserably on the big one.

Coming soon

The C++ program being three times slower than the Java version is an abnormal situation and a mystery. It must be investigated. I’ll do that in one of the next articles.

Experiments in program optimisation

Home-made hash tables in C++: a race against Java

The test

The reference implementation

Hash_LongPoint: Storing positions as 64-bit numbers

Hash1: using mutable objects as hash table values

Hash2: a unified hash map

Hash3: a check in set() removed

Hash4: Entry iterator in the loop

Hash5: ArrayList<Entry> introduced

Hash6: re-using ArrayList objects

Hash7: replacing ArrayList with a plain array

Hash8: using the LinkedHashMap and increasing the hash map capacity

The first home-made implementation

Hash_HomeMade2: the hash map code is built into the main code

Hash_HomeMade3: collecting cells in a double-linked list

Hash_HomeMade4: The hash map capacity is set to 256K

Hash_HomeMade5: the action lists were introduced

Hash_HomeMade6: hard-coded hash function

Hash_HomeMade7: the memory allocation optimisation

Hash_Additive: The additive division-based hash function

Hash_Additive2: The additive multiplication-based hash function

Hash_Additive3: The ultimate additive multiplication-based hash function

The first race: The summary

The big race: the glider gun

Conclusions

Coming soon

`Hash_LongPoint`: Storing positions as 64-bit numbers

`Hash1`: using mutable objects as hash table values

`Hash2`: a unified hash map

`Hash3`: a check in `set()` removed

`Hash4`: `Entry` iterator in the loop

`Hash5`: `ArrayList<Entry>` introduced

`Hash6`: re-using `ArrayList` objects

`Hash7`: replacing `ArrayList` with a plain array

`Hash8`: using the `LinkedHashMap` and increasing the hash map capacity

`Hash_HomeMade2`: the hash map code is built into the main code

`Hash_HomeMade3`: collecting cells in a double-linked list

`Hash_HomeMade4`: The hash map capacity is set to 256K

`Hash_HomeMade5`: the action lists were introduced

`Hash_HomeMade6`: hard-coded hash function

`Hash_HomeMade7`: the memory allocation optimisation

`Hash_Additive`: The additive division-based hash function

`Hash_Additive2`: The additive multiplication-based hash function

`Hash_Additive3`: The ultimate additive multiplication-based hash function