Experiments in program optimisation

Home-made hash tables in C++: a race against Java

Story: Conway's Game of Life

Tags: Java optimisation hashmap life

21 Mar 2016

This is the third article in a row that has a title starting with “Home-made”. While for the first one (on binary numbers) this is just a coincidence, for the other two (on hash tables) this illustrates specialisation as an optimisation idea. The code specially written for the task may perform better than the generic code. I say “may”, because this isn’t an absolute rule. The cases when this rule does not apply include:

I’m assuming here that we need optimisation in the first place, that’s why I don’t mention obvious cases where this particular piece of code isn’t a bottleneck, or the performance isn’t critical at all.

I always believed that the last of the three points above is applicable to C++. Let’s verify this using the problem that we have just solved in Java: a Game of Life simulation using hash maps.

The first implementation was done here: “Game of Life, hash tables and hash codes”. We studied the effect of the hash functions and found some that worked well.

Then we applied some improvements to the program: “Optimising the hash-based Life”. Most of the improvements dealt with replacing of Java standard classes with the classes of our own, which helped improve performance. We didn’t, however, replace the standard HashMap class.

Finally, in “Home-made hash tables in Java” we implemented our own version of the HashMap class, which gave additional boost of performance. The overall increase from when we started was about 15 times; the last 80% came from this rewriting. The article contains a full list of applied improvements and their effect.

Now I’d like to apply similar approach to C++. Two major questions for today’s study are:

We’ll be comparing the results achieved on JRE 1.8.0.45 with those of GCC 4.5.4.

The latest version of Java code used as the basis of this development is here. The new C++ code is here.

The test

The articles mentioned above give full explanation of the hash-based approach to the Life simulation. The last one (“Home-made hash tables in Java”) contains the full list of performance figures.

We’re using the ACORN pattern (see the previous article for the demonstration) and run its simulation for 10,000 steps. The original time for this simulation in Java was 4961 ms. The improvements reduced it to 327 ms, so the simulation rate went up from 2,015 to 30,500 steps per second.

We’ll do the same for C++. We’ll make a corresponding C++ version for each Java one, except for the cases where it is not applicable. This becomes an exciting race. Who comes first, Java or C++? And finally, after the finish of this race, we’ll have another one, the super-race: we’ll run the program with the Gosper Glider Gun as our original pattern. What makes this one special is that it produces more and more gliders and thus makes the colony grow indefinitely.

Let’s go.

The reference implementation

The general organisation of the new C++ program follows that of the Java program. We have a class Point that represents co-ordinates, and an interface Worker for Life implementations. All the implementations will be checked against the default one, called Hash_Reference. This implementation works directly on Point values and uses the direct C++ equivalents of Java classes HashMap and HashSet:

std::unordered_set<Point, PointHash> field;
std::unordered_map<Point, int, PointHash> counts;

typedef uint32_t HashType;

class PointHash
{
public:
    inline HashType operator() (Point p) const
    {
        return (HashType) (p.x * 3 + p.y * 5);
    }
};

The conversion of Java code is straightforward. This is what setting and resetting of cells looks like:

void inc(Point w)
{
    ++ counts[w];
}

void dec(Point w)
{
    if (!--counts[w]) {
        counts.erase(w);
    }
}

void set(Point w)
{
    for (Point p : w.neighbours()) {
        inc(p);
    }
    field.insert(w);
}

void reset(Point w)
{
    for (Point p : w.neighbours()) {
        dec(p);
    }
    field.erase(w);
}

This relies on the Point class to implement method

std::array<Point, 8> neighbours() const

Here is the result:

Class name CommentTime, Java 8Time, C++
Hash_Reference Initial version with a trivial hash function 49611372

C++ won the start of the race with a big advantage.

Hash_LongPoint: Storing positions as 64-bit numbers

Just like in Java, we switch from using a class (Point) to using a 64-bit number, only in C++ case it will be an unsigned number:

typedef uint64_t LongPoint;

In Java we had to wrap this value in a special class, so that HashMap could call hashCode() on the instances of that class. In C++ this isn’t necessary, because we can specify a hash function as a template parameter of unordered_map. We can also make this function a template parameter of our Life implementation, which makes a factory class unnecessary:

template<class HASH = std::hash<LongPoint> > class Hash_LongPoint
                                                   : public Worker
{
private:
    std::unordered_set<LongPoint, HASH> field;
    std::unordered_map<LongPoint, int, HASH> counts;

We can now test Hash_LongPoint on the same set of hash functions as we did in Java (this set includes the standard hash function for a 64-bit number).

Here are the results:

Class name Comment Time, Java 8Time, C++
Long Long default 5486 1043
LongPoint Multiply by 3, 5 4809 1355
LongPoint3 Multiply by 11, 17 1928 1183
LongPoint4 Multiply by two big primes 1611 1302
LongPoint5 Multiply by one big prime 1744 1256
LongPoint6 Modulo big prime 1650 1201
LongPoint7 CRC32 3254 1244

The C++ equivalent of the Long default hash function is the default one for LongPoint, which is uint64_t. It is available via the standard class std::hash<LongPoint>. This function performs much better than its Java counterpart. It is the best of the lot, which is impressive.

Our best option in Java (LongPoint6, which divides the number by a big prime number and returns the reminder) also performs reasonably well, but, surprisingly, LongPoint3 performs better.

The CRC-based function isn’t as bad as in Java’s case, because this time we use the hardware-based implementation:

#include <nmmintrin.h>

class LongPointHash7
{
public:
    HashType operator() (LongPoint p) const
    {
        return (HashType)~_mm_crc32_u64(~(uint64_t)0, p);
    }
};

We’ll test all our implementations with the three best hash functions: the default one for int64_t, the LongPoint3 (multiplying by 11 and 17), and our Java favourite, the division-based LongPoint6.

Hash1: using mutable objects as hash table values

In Java we used values of class Integer as our initial values, because this class is standard in Java and allows boxing and unboxing, which makes the code clearer. In C++ we can make hash maps of anything, including type int. No boxing is needed, and the values are immediately mutable, which makes the code very nice:

void inc(Point w)
{
    ++ counts[w];
}

(indexation of the hash map creates an empty element if the index is not found. In the case of int it is zero, which is exactly what we need).

This means that the C++ version of Hash_LongPoint already employs the Hash1 improvement, so there is no equivalent of Hash1 in C++. In Java this change improved time from 1650 to 1437 ms; all our C++ versions are still faster.

Hash2: a unified hash map

We merge two hash maps (a real hash map and a hash set) into one. For that, we create the class Value, exactly like we did in Java. All the other changes are also straightforward.

DescriptionTime, Java 8Time, C++
A unified hash map, default hash 845
A unified hash map, multiply by 11, 17 961
A unified hash map, modulo big prime 1163 1003

The standard hash function is still the winner, and C++ is still ahead (although Java is catching up).

Hash3: a check in set() removed

In Java, the code of set() looked like this:

void set (LongPoint k)
{
    long w = k.v;
    inc (w-DX-DY);
    inc (w-DX);
    inc (w-DX+DY);
    inc (w-DY);
    inc (w+DY);
    inc (w+DX-DY);
    inc (w+DX);
    inc (w+DX+DY);
    Value c = field.get (k);
    if (c == null) {
        field.put (k,  new Value (0, true));
    } else {
        c.live = true;
    }
}

The improvement made use of the fact that c == null condition is never true. The corresponding fragment in C++ just looks like this:

    field[w].set();

The check happens inside the implementation of unordered_map, so we can’t remove it. As a result, the Hash3 improvement is not applicable. In Java it reduced the time from 1163 ms to 1151 ms; C++ is still ahead.

Hash4: Entry iterator in the loop

This improvement was very specific for Java. The Java’s hash map has a concept of a key set and that of an entry set, so there are two ways of iterating the map:

for (LongPoint w : field.keySet ()) {
    Value c = field.get (w);

and

for (Entry<LongPoint, Value> w : field.entrySet ()) {
    Value c = w.getValue ();

Replacing the first one with the second one produced big effect: the time went down from 1151 ms to 976 ms.

In C++ there is only one way to iterate the map, and that is the one we’ve already been using:

for (auto w : field) {
    Value& v = w.second;

where auto actually stands for std::pair<LongPoint, Value&>.

This means that we are already iterating over hash map entries, so this improvement is not applicable. For the first time, Java runs faster than C++ (the C++ time is 1003 ms). However, this applies to the division-based hash function, which was the best in Java. The C++ still has two reserves: two other hash functions that made the time 845 ms and 961 ms.

Hash5: ArrayList<Entry> introduced

This has a direct equivalent in C++. Instead of collecting LongPoint values into a std::vector, we can collect values produced by the map iterator, which are std::pair<LongPoint, Value&>. The rest of the program must be modified exactly as in Java.

DescriptionTime, Java 8Time, C++
ArrayList<Entry>, default hash 812
ArrayList<Entry>, multiply by 11, 17 909
ArrayList<Entry>, modulo big prime 975 923

C++ has regained leadership. The race becomes more and more interesting.

Hash6: re-using ArrayList objects

In C++ the equivalent of ArrayList is std::vector, and it benefits from the same modification as the ArrayList: re-using the object may reduce the amount of memory allocations. The change is very similar to that in Java: toSet and toReset become fields, and clear() is called on them in step().

DescriptionTime, Java 8Time, C++
Re-used ArrayList<Entry>, default hash 726
Re-used ArrayList<Entry>, multiply by 11, 17 842
Re-used ArrayList<Entry>, modulo big prime 947 857

This change really helped in C++ – even more than it did in Java. The C++ is gaining advantage in this race.

Hash7: replacing ArrayList with a plain array

This change helped in Java; let’s see if the equivalent change helps in C++. However, when trying to implement a similar scheme, we immediately discover that there isn’t a fully equivalent change. The C++’s version of the Entry class is std::pair<LongPoint, Value&>, which does not have a default constructor, so we can’t allocate arrays of this type:

toReset = new Entry[INITIAL_ARRAY_SIZE];

just won’t compile. We could resolve this by using low-level memory manipulation functions in a C-like manner, but that would be too much for our task. Besides, this is what the standard library already does inside its implementation of std::vector.

We’ll have to create another Entry class:

typedef std::pair<LongPoint, Value*> Entry;

and then convert from one to another:

for (auto it = field.begin(); it != field.end(); it++) {
    Value& v = it->second;
    if (v.live) {
        if (v.count < 2 || v.count > 3) toReset[reset_count++] =
                                          Entry(it->first, &v);
    }
    else {
        if (v.count == 3) toSet[set_count++] = Entry(it->first, &v);
    }
}

Here is the result:

DescriptionTime, Java 8Time, C++
ArrayList replaced with an array, default hash 753
ArrayList replaced with an array, multiply by 11, 17 869
ArrayList replaced with an array, modulo big prime 904 891

This is a disappointment. The times got worse, although still better than those in Java. This is the example of the idea opposite to the one we started with: in C++ using the standard classes may be more beneficial than re-implementing them. I wish it stays like that.

Hash8: using the LinkedHashMap and increasing the hash map capacity

The standard C++ library does not have an equivalent of the LinkedHashMap class. Probably, the library designers decided that whoever needs to arrange data in some other container, such as a linked list, can do so easily. We can test what happens when we increase the hash map capacity of our existing solution (we’ll use the Hash6 as the best so far):

DescriptionTime, Java 8Time, C++
Capacity made bigger, default hash 2075
Capacity made bigger, multiply by 11, 17 1847
Capacity made bigger, modulo big prime 591 2090

This is the first time where Java performance is much higher than that of the C++. Of course, the comparison isn’t completely fair, since we compare different algorithms, but still in Java the improved algorithm is available as a part of the standard library, while in C++ it is not.

The first home-made implementation

Now we are going to replace the standard hash map implementation with our own one. We’ll copy the Java implementation almost exactly. The class Field implements the hash table functionality, while Hash_MomeMade uses it to implement Life. The most notable differences are:

Since we were testing on both Java 7 and Java 8, we weren’t using the lambda syntax in Java. As a result the main method (step) looked a bit ugly, In C++ we can use this syntax. Here is what the step method looks like:

void step()
{
    toSet.clear();
    toReset.clear();

    field.iterate([this](Cell * cell) {
        if (cell->live) {
            if (cell->neighbours < 2 || cell->neighbours > 3)
               toReset.push_back(cell);
        }
        else {
            if (cell->neighbours == 3) toSet.push_back(cell);
        }
    });
    for (Cell * c : toSet) {
        set(c);
    }
    for (Cell * c : toReset) {
        reset(c);
    }
}

Here Cell is the class that represents the value stored in the hash table, together with the key and hash table artefacts (the index value and the table_next pointer to arrange values in a linked list). Our vectors are defined as

std::vector<Cell*> toReset;
std::vector<Cell*> toSet;

As in Java, we first test it on a small table capacity (8192). Here are the results:

DescriptionTime, Java 8Time, C++
The first home-made hash map implementation (small capacity), default hash 1143
The first home-made hash map implementation (small capacity), multiply by 11, 17 629
The first home-made hash map implementation (small capacity), modulo big prime 465 611

This result is sensational: Java runs faster than C++ with exactly the same algorithm. The only explanation I can think of is the garbage collector. While usually we tend to think about it as a performance-hindering feature, here it can actually help, provided that the heap size is big enough. Instead of deallocating every single unused object, we get rid of all of them, also at some cost, but rather infrequently.

Another interesting feature of this result is that for the first time the hash function that was the best in Java also became the best in C++. Moreover, the previous C++’s best (the default 64-bit integer hash) is now performing worse than before. This is something to investigate later.

Hash_HomeMade2: the hash map code is built into the main code

In Java this helped because it removed callback-based iteration. Virtual calls in Java are fast but directly-inserted code is even faster. Does it work the same way in C++?

Mostly, the code transformation is about merging the two classes (Field and Hash_HomeMade), and moving all the relevant methods into one combined class. We can then provide some degree of encapsulation by collecting all the Field methods together and visually separating them from the code using them. It works with all the methods except for iteration (and more efficient iteration was the point of this change). In Java we didn’t have a choice but to put the iteration code directly where it was used:

for (Cell cell : table) {
    for (Cell c = cell; c != null; c = c.table_next) {
        if (c.live) {
            if (c.neighbours < 2 || c.neighbours > 3) toReset[resetPtr ++] = c;
        } else {
            if (c.neighbours == 3) toSet[setPtr ++] = c;
        }
    }
}

In C++ we can define the iterator code as a macro:

#define ITERATE(cell) \
    for (size_t i = 0; i < CAPACITY; i++) {\
        for (Cell* cell = table[i]; cell; cell = cell->table_next)

#define END_ITERATE }

ITERATE (cell) {
    if (cell->live) {
        if (cell->neighbours < 2 || cell->neighbours > 3)
            toReset.push_back(cell);
    }
    else {
        if (cell->neighbours == 3) toSet.push_back(cell);
    }
} END_ITERATE

There are different opinions on if this macro trick really improves readability or makes it worse. It helps keep all the relevant code in one place without sacrificing performance, but for many eyes it looks ugly.

Here are the times:

CommentTime, Java 8Time, C++
The hash map code is built into the main code, default hash 1107
The hash map code is built into the main code, multiply by 11, 17 622
The hash map code is built into the main code, modulo big prime 449 576

The speed has improved, but Java version is still faster.

Hash_HomeMade3: collecting cells in a double-linked list

In Java this imitated the activity of LinkedHashMap. We can translate this solution to C++ directly. Instead of class Cell, we use class ListedCell with prev and next pointers. Obviously, we must add appropriate list modifications where the cells are inserted into the map or removed from it.

The macros ITERATE and END_ITERATE introduced in the previous section, can help again. We simply rewrite this macros as

#define ITERATE(c) \
    for (ListedCell* c = full_list.next; c != &full_list; c = c->next)

#define END_ITERATE

and the client code (in our case, that of step()) does not have to change at all.

It would have been nice to have macros as language objects with usual visibility rules, rather than a pure text pre-processing feature. Then we would place all the hash map functionality in a separate class and export the iteration macros from there.

Here are the times:

DescriptionTime, Java 8Time, C++
Cells collected in a list, default hash 1286
Cells collected in a list, multiply by 11, 17 567
Cells collected in a list, modulo big prime 433 552

The default hash result became worse, the rest improved. C++ is still behind, but the improvement is better than in Java. C++ is slowly catching up.

Hash_HomeMade4: The hash map capacity is set to 256K

In Java the hash map capacity was defined as a constant to improve performance, so we had to make anothe class to change this value. In C++ it is a template parameter, so the C++’s version of the Java’s Hash_HomeMade4 is Hash_HomeMade3<256*1024>.

Here are the times:

DescriptionTime, Java 8Time, C++
Big hashmap capacity, default hash 1184
Big hashmap capacity, multiply by 11, 17 500
Big hashmap capacity, modulo big prime 353 424

C++ keeps catching up (using the fastest hash function), but is still behind.

Hash_HomeMade5: the action lists were introduced

Here we replace our vector containers with usual linked lists. For that, we move on from the ListedCell to the ActilnCell class, which has one extra reference field (next_action), and introduce the obvious transformations of the code. Again, the garbage collection issue becomes important. This time the code change is in favour of C++. In Java, the iteration over the action lists looked like this:

for (ActionCell c = toSet; c != null; c = next_action) {
    set (c);
    next_action = c.next_action;
    c.next_action = null;
}
for (ActionCell c = toReset; c != null; c = next_action) {
    reset (c);
    next_action = c.next_action;
    c.next_action = null;
}

It was important to set c.next_action to null, to remove an extra reference to another Cell object. If we didn’t do that, a lot of unwanted objects would be considered live and escape garbage collection.

In C++ this is unnecessary, but there is another consideration:

for (ActionCell* c = toSet; c; c = c->next_action) {
    set(c);
}
for (ActionCell* c = toReset; c;) {
    ActionCell* next_action = c->next_action;
    reset(c);
    c = next_action;
}

Since reset() can delete the object, we have to make sure to extract the next_action field before this call. There is no such problem for set().

Here are the results:

DescriptionTime, Java 8Time, C++
Action lists, default hash 1162
Action lists, multiply by 11, 17 478
Action lists, modulo big prime 354 425

It improved all the cases except for the fastest one. It did the same in Java, though.

Hash_HomeMade6: hard-coded hash function

Since our hash function is a template parameter, it is as good as already hard-coded. We can skip this version. It didn’t help much in Java either: the time went down from 354 ms to 353 ms (recovering from the action lists damage).

Hash_HomeMade7: the memory allocation optimisation

This is the first time we’ll be doing something in C++ that isn’t applicable in Java. We allocate and deallocate memory every time we insert a new cell into the hash table or remove existing one. We could keep the objects ane re-use them. We could do that in Java, too, but in Java memory allocation is cheap, so it didn’t promise any improvements.

We’ll use the same ActionCell as before and collect all unallocated elements in a linked list, using the next pointer. Calls to new and delete must be modified in a straightforward way:

ActionCell * allocate(LongPoint position, int neighbours, bool live = false)
{
    if (!free_list) {
        return new ActionCell(position, neighbours, live);
    }
    ActionCell * result = free_list;
    free_list = result->next;
    result->set(position, neighbours, live);
    return result;
}

void deallocate(ActionCell * cell)
{
    cell->next = free_list;
    free_list = cell;
}

Since this is the last of our versions that is applicable to various hash functions, we’ll run it with all of them:

Hash functionTime, Java 8Time, C++
Default hash 1065
Multiply by 3, 5 542
Multiply by 11, 17 411
Multiply by two big primes 441
Multiply by one big prime 437
Modulo big prime 353 359
CRC32 438

The division-based hash function is the best. It almost caught up with Java, only a little bit is left to go.

Hash_Additive: The additive division-based hash function

We’ve finished the generic algorithms (suitable for any hash functions), and the next thing to try is the set of additive algorithms, based on the calculation of the hash value based on the previous hash value. It worked in Java (at least, for one version), we’ll try this in C++.

The first version operates using the division-based hash function and makes use of the fact that when a constant is added to the divisible, the remainder changes in a predictable way. The rest stays the same as in Hash_HomeMade7.

DescriptionTime, Java 8Time, C++
The additive implementation of "Modulo big prime" hash 327 312

Finally, we’ve made it: the C++ version runs faster than Java’s one.

Hash_Additive2: The additive multiplication-based hash function

The next version is based on the LongPointHash4 hash function, which multiplies two parts of a 64-bit number by two big prime numbers and adds the results. The transition from one hash value to another one when a constant is added to the key is trivial. However, the quality of hashing of this function is somewhat lower than that of the division-based one, so we can’t predict how this version will perform.

DescriptionTime, Java 8Time, C++
The additive implementation of the "Multiply by two big primes" hash 351 360

C++ is behind again, but, anyway, this version isn’t the fastest.

Hash_Additive3: The ultimate additive multiplication-based hash function

This version does not keep original co-ordinates at all, only their values multiplied by prime numbers. This makes calculating hash values very fast but requires special procedure to recover the co-ordinates when reporting the simulation results.

DescriptionTime, Java 8Time, C++
Improved additive implementation of the "Multiply by two big primes" hash 361 348

Java became slower, C++ became faster and caught up, but this is still not the fastest version.

The first race: The summary

This is the end of the first race. Finally, the C++ has won, but just marginally (312 ms against 327 in Java). Here is the full scorecard (we show the best results in each row):

Java class name CommentTime, Java 8Time, C++
Hash_Reference Initial version with a trivial hash function 4961 1372
Hash_LongPoint Storing position in a 64-bit value and using a good hash function 1650 1043
Hash1 Mutable Count 1437
Hash2 A unified hash map 1163 845
Hash3 Check in set removed 1151
Hash4 Entry iterator in the loop 976
Hash5 ArrayList<Entry> introduced 975 812
Hash6 Re-used ArrayList<Entry> 947 726
Hash7 ArrayList replaced with an array 904 753
Hash8 LinkedHashMap used, capacity made bigger 591 1847
Hash_HomeMade The first home-made hash map implementation (small capacity) 465 611
Hash_HomeMade2 The hash map code is built into the main code 449 576
Hash_HomeMade3 The cells are collected into a double-linked list 433 552
Hash_HomeMade4 The hash map capacity is set to 256K 353 424
Hash_HomeMade5 The action lists were introduced 354 425
Hash_HomeMade6 The division-based hash function, hard coded 353
Hash_Additive The division-based hash function, additive 327 312
Hash_Additive2 Multiplication by two big numbers, additive 351 360
Hash_Additive3 Multiplication by two big numbers, ultimate additive 361 348

The big race: the glider gun

We’ve finished the previous article with the simulation of the Gosper Glider Gun. We’ll do the same in C++ now. This is the ultimate race. If the Acorn simulations can be compared with separate Formula-1 laps, the Glider Gun is more like Le Mans 24 hours – literally: it really runs for 24 hours and more (2,500,000 iterations ran for 33 hours in Java).

We’ll be running the Hash_Additive version, as this is the best one we’ve produced so far.

Here are the results (I didn’t have patience to finish 2.5M steps):

StepsTime for 10,000, secTotal time, sec
JavaC++JavaC++
10,000 0.75 0.77 0.75 0.77
20,000 1.6 2.7 2.3 3.5
30,000 2.7 4.9 5.0 8.4
500,000 125 317 2,206 6,301
1,000,000 324 980 13,577 37,160
1,500,000 5531,670 35,203101,968
2,000,000 8342,436 70,420187,174
2,500,0001,126 119,481

This result is sensational. The C++ is losing dramatically. The execution speed in C++ becomes three times lower than in Java as the structure grows. Why this happens is a topic of another big investigation. We can straight away come with some theories:

We won’t investigate it here, however, as this article is already too long. We’ll return to this issue later.

Conclusions

Coming soon

The C++ program being three times slower than the Java version is an abnormal situation and a mystery. It must be investigated. I’ll do that in one of the next articles.

comments powered by Disqus