In the previous article we were de-multiplexing an E1 stream in Java. Now I want to try C/C++. The general expectation is that it will be faster than Java, possibly much faster; we'll check if this is true.
By “C/C++” I mean “C with a little bit of C++”. There will be just enough C++ to make the program look a bit nicer and add a bit of safety. I mean small features, such as declaring a variable where it is first used rather than at the top, and perhaps an occasional class here and there. There won't be any multiple inheritance or curiously recurring template pattern, and the reader won't be forced to google the difference between static and dynamic cast.
It seems natural to convert the Java code to C++ directly first, to compare the compilers and execution environments. This is what I'm going to do today; I'll leave C-specific low-level features for later articles. As a result, today's discussion will be more about the syntax of C/C++ than about optimisation. In particular, you will see lots of macro definitions and invocations. If you are not interested in that, or if you are already a C expert, you are welcome to jump straight to the “Running it” section.
Conversion of non-unrolled versions
The syntax of Java is close enough to that of C++ that we can almost convert one to the other using find-and-replace. I will only give one example; the rest can be seen in the repository.
Since we're not using class libraries, the buffers are represented as byte pointers, which, unlike Java's arrays, do not keep the data size and do not provide index checking. That's why the input length must be passed as an additional parameter. It is the caller's responsibility to ensure that memory is correctly allocated for the output buffers.
I use unsigned, rather than int, for lengths and loop variables out of habit: in C++, sizes are traditionally stored in variables of type size_t, which is unsigned.
Another habit explains the declaration of the type byte: I prefer to reserve the name char for character data, and here we are dealing with bytes of a general nature.
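A sketch of what the converted function might look like under these conventions (the function and constant names here are illustrative, not necessarily those used in the repository):

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned char byte;   // 'char' stays reserved for character data

const size_t NUM_TIMESLOTS = 32;   // an E1 stream carries 32 timeslots

// Unlike Java's arrays, byte pointers carry no length and perform no
// index checking, so the input length is an explicit parameter; the
// caller must have allocated all NUM_TIMESLOTS output buffers
void demux (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length % NUM_TIMESLOTS == 0);
    for (size_t pos = 0; pos < src_length / NUM_TIMESLOTS; pos++)
        for (size_t slot = 0; slot < NUM_TIMESLOTS; slot++)
            dst [slot][pos] = src [pos * NUM_TIMESLOTS + slot];
}
```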
We also need a timer (a replacement for Java's System.currentTimeMillis()); I've moved it to a separate header file, "timer.h". We'll copy the measurement and correctness-test code from Java with only minor changes.
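A possible sketch of "timer.h", assuming it is built on clock_gettime (which would explain the -lrt linker flag used later; the exact implementation in the repository may differ):

```cpp
// timer.h (sketch): a millisecond timer replacing Java's
// System.currentTimeMillis(). clock_gettime with CLOCK_MONOTONIC is
// assumed; on older glibc it lives in librt, hence -lrt at link time
#include <cassert>
#include <ctime>

typedef long long int64;

int64 currentTimeMillis ()
{
    timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return (int64) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
```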
We don't need to run the test five times, since C++ doesn't have any warm-up effect. We'll still run it twice, just to be safe. Another change is triggered by the absence of garbage collection in C++: all the memory we allocate must be freed. Finally, we'll run all our versions at once, so, to improve the stability of the results, we'll run them in the same memory, re-using the buffers.
We are ready to run the tests, but let us convert the unrolled versions first.
Conversion of unrolled versions: macro magic
We could convert the unrolled versions just as we did all the others, by find and replace. However, in C/C++ we can do better, since C/C++ has a rudimentary meta-programming feature: macros. See the full text in the repository.
This is how Unrolled_1 can be written down using macros:
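A sketch of the idea (for readability, DST_SIZE is cut to 8 here, so MOVE_BYTES contains 8 invocations instead of the 64 in the real code; the shape is the same):

```cpp
#include <cassert>

typedef unsigned char byte;
const unsigned NUM_TIMESLOTS = 32;
const unsigned DST_SIZE = 8;   // the real code uses 64

// every argument is wrapped in brackets -- see the pitfalls below
#define MOVE_BYTE(i,j) d[i] = src [(j) + (i) * NUM_TIMESLOTS]

// the real MOVE_BYTES lists all 64 moves, eight per line
#define MOVE_BYTES(j) do {\
        MOVE_BYTE (0, j); MOVE_BYTE (1, j); MOVE_BYTE (2, j); MOVE_BYTE (3, j);\
        MOVE_BYTE (4, j); MOVE_BYTE (5, j); MOVE_BYTE (6, j); MOVE_BYTE (7, j);\
    } while (0)

void unrolled_1 (const byte * src, byte ** dst)
{
    for (unsigned j = 0; j < NUM_TIMESLOTS; j++) {
        byte * d = dst [j];
        MOVE_BYTES (j);
    }
}
```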
Here I feel it necessary to make some practical points on macro usage. Readers with good C experience can skip the next section:
- If a macro has a limited lifetime, you should #undef it as soon as it becomes unnecessary; this can save you from accidentally declaring a macro with the same name later. This is especially applicable to temporary macros in include files. In this example I didn't #undef the macros, because they will be used in other functions.
- If a macro is only used inside one function, it is a good idea to #define it right inside that function and #undef it after use.
- Don't forget that no white space is allowed between the macro name and the open parenthesis. If there is any whitespace there, the parenthesis, together with the argument list, will be considered part of the macro body.
- The backslash ('\') at the end of a line indicates that the macro continues on the next line. But for that it must really be at the end of the line: no white space is allowed after it (although some compilers allow that).
- It is always a good idea to wrap the arguments in round brackets inside a macro definition (see the MOVE_BYTE definition): you never know what will come in as parameters. Without the brackets, an invocation with an expression as an argument expands into code with surprising operator precedence, and that is not quite what was intended.
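A small demonstration of the pitfall (the macro names here are mine): the "unsafe" variant omits the brackets, and an argument such as i + 1 is then torn apart by operator precedence when it meets the multiplication inside the macro body:

```cpp
#include <cassert>

const int NUM_TIMESLOTS = 32;

// same index arithmetic as MOVE_BYTE, with and without brackets
#define INDEX_SAFE(i,j)   ((j) + (i) * NUM_TIMESLOTS)
#define INDEX_UNSAFE(i,j) (j + i * NUM_TIMESLOTS)

int safe_index (int i, int j)   { return INDEX_SAFE (i + 1, j); }

// expands to (j + i + 1 * NUM_TIMESLOTS): '*' binds tighter than '+',
// so only the literal 1 gets multiplied -- not what was intended
int unsafe_index (int i, int j) { return INDEX_UNSAFE (i + 1, j); }
```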
- For those who do not know the significance of a do statement in macro definitions, I'll explain it in one of the following articles.
Here the usage of macros does not save much typing – you must still invoke MOVE_BYTE manually 64 times. But at least the actual code of MOVE_BYTE is in one place only and can be modified easily.
By the way, you can avoid writing down all 64 invocations of MOVE_BYTE by creating multi-layered macros:
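Something along these lines (the names and the exact layer split are my reconstruction): three layers of four invocations each give 12 macro invocations instead of 64:

```cpp
#include <cassert>

typedef unsigned char byte;
const unsigned NUM_TIMESLOTS = 32;

#define MOVE_BYTE(i,j) d[i] = src [(j) + (i) * NUM_TIMESLOTS]

// layer 1: four moves
#define MOVE_BYTES_4(i,j)\
    MOVE_BYTE (i,   j); MOVE_BYTE (i+1, j);\
    MOVE_BYTE (i+2, j); MOVE_BYTE (i+3, j);

// layer 2: four layer-1 invocations = 16 moves
#define MOVE_BYTES_16(i,j)\
    MOVE_BYTES_4 (i,    j) MOVE_BYTES_4 (i+4,  j)\
    MOVE_BYTES_4 (i+8,  j) MOVE_BYTES_4 (i+12, j)

// layer 3: four layer-2 invocations = all 64 moves
#define MOVE_BYTES_64(j)\
    MOVE_BYTES_16 (0,  j) MOVE_BYTES_16 (16, j)\
    MOVE_BYTES_16 (32, j) MOVE_BYTES_16 (48, j)

// demultiplex a single timeslot j, all 64 moves expanded inline
void demux_timeslot (const byte * src, byte * d, unsigned j)
{
    MOVE_BYTES_64 (j)
}
```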
This way you have to write 12 macro invocations rather than 64. The same number results from a decomposition into six layers of two invocations each. A general formula for the minimal number of invocations for a given total doesn't seem to be simple; perhaps some mathematically inclined reader can suggest one.
The macros we've just created will be useful for other versions of our method, too. For instance, a four-times unrolled version looks like this:
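My guess at the shape of such a version (the exact split in the repository may differ): the inner loop advances four bytes per iteration, with the four moves spelled out via MOVE_BYTE:

```cpp
#include <cassert>

typedef unsigned char byte;
const unsigned NUM_TIMESLOTS = 32;
const unsigned DST_SIZE = 64;

#define MOVE_BYTE(i,j) d[i] = src [(j) + (i) * NUM_TIMESLOTS]

// partially unrolled: four moves per inner-loop iteration
void unrolled_1_4 (const byte * src, byte ** dst)
{
    for (unsigned j = 0; j < NUM_TIMESLOTS; j++) {
        byte * d = dst [j];
        for (unsigned i = 0; i < DST_SIZE; i += 4) {
            MOVE_BYTE (i,   j); MOVE_BYTE (i+1, j);
            MOVE_BYTE (i+2, j); MOVE_BYTE (i+3, j);
        }
    }
}
```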
And the fully unrolled version looks like this:
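A sketch of the fully unrolled version, with the layered macros repeated so the fragment is self-contained (DEMUX is my name, not necessarily the repository's): both loops are gone, leaving 32 fully expanded timeslots:

```cpp
#include <cassert>

typedef unsigned char byte;
const unsigned NUM_TIMESLOTS = 32;

#define MOVE_BYTE(i,j)     d[i] = src [(j) + (i) * NUM_TIMESLOTS]
#define MOVE_BYTES_4(i,j)  MOVE_BYTE (i,j); MOVE_BYTE (i+1,j); MOVE_BYTE (i+2,j); MOVE_BYTE (i+3,j);
#define MOVE_BYTES_16(i,j) MOVE_BYTES_4 (i,j) MOVE_BYTES_4 (i+4,j) MOVE_BYTES_4 (i+8,j) MOVE_BYTES_4 (i+12,j)
#define MOVE_BYTES_64(j)   MOVE_BYTES_16 (0,j) MOVE_BYTES_16 (16,j) MOVE_BYTES_16 (32,j) MOVE_BYTES_16 (48,j)

// one fully inlined timeslot; do/while makes the expansion a statement
#define DEMUX(j) do { byte * d = dst [j]; MOVE_BYTES_64 (j) } while (0)

void unrolled_2_full (const byte * src, byte ** dst)
{
    DEMUX (0);  DEMUX (1);  DEMUX (2);  DEMUX (3);
    DEMUX (4);  DEMUX (5);  DEMUX (6);  DEMUX (7);
    DEMUX (8);  DEMUX (9);  DEMUX (10); DEMUX (11);
    DEMUX (12); DEMUX (13); DEMUX (14); DEMUX (15);
    DEMUX (16); DEMUX (17); DEMUX (18); DEMUX (19);
    DEMUX (20); DEMUX (21); DEMUX (22); DEMUX (23);
    DEMUX (24); DEMUX (25); DEMUX (26); DEMUX (27);
    DEMUX (28); DEMUX (29); DEMUX (30); DEMUX (31);
}
```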
This is much smaller in source size than the version in Java.
A sad comment on meta-programming
Meta-programming is the automatic generation of a program or some part of it. This is what I was doing in Java (all the unrolled versions were generated). There are lots of external tools that perform some form of meta-programming (pre-processors, language converters, etc.), but it is most convenient when meta-programming is supported by the language itself. This means that some part of a program is executed at compile time, producing the real program, which is then compiled into executable code. As you could see in the previous section, it is a very useful feature for high-performance programming, which involves lots of inlining and unrolling.

Unfortunately, such a feature is a rare commodity in modern programming languages. Java does not have it (generics do not count, and dynamic code generation is quite troublesome and heavy-weight); neither do Go or Python. C has a rudimentary version of it in the form of macro expansion. In addition, C++ employs templates, which are very powerful but not easy to use and a bit ugly. They are not really designed for full-scale meta-programming but rather for generic, type-independent programming, which is why their primary working mechanism is pattern matching – for me, not the programming tool of choice. It is not impossible to implement the above examples using templates, but the code won't look nice, and it is yet to be seen how fast it would run.
In the old days meta-programming was more common. It was widely available in macro-assemblers, where the main language (assembler) was enriched with macro features, which included variables, loops, branches and other high-level constructs. But the best by far in that respect was the approach of PL/I, where the macro-language was PL/I itself; the lines were just marked to be run at compile time. Were such a feature available in C++, our program would have looked much nicer:
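A sketch of how that might look (this is invented illustration, not compilable C++; the compile-time loops would emit the inner statements with the loop variables substituted as constants):

```
void demux (const byte * src, byte ** dst)
{
  % for (int j = 0; j < NUM_TIMESLOTS; j++) {   // runs at compile time
        byte * d = dst [%j];                    // %j expands to a constant
  %     for (int i = 0; i < DST_SIZE; i++) {    // also compile time
            d [%i] = src [%j + %i * NUM_TIMESLOTS];
  %     }
  % }
}
```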
I assume an imaginary version of C++ where a percentage mark indicates code that runs at compile time. As you can see, the code looks very close to Dst_First_1. Unfortunately, such a version does not exist, and we have to write ugly macro definitions. What a pity.
More macro magic: parameter concatenation
In our Java code there is a version that involves 32 small functions (Unrolled_3). Quite frankly, I don't expect its performance to be any better than that of the other unrolled versions in C++; besides, you may recall that the very reason for its existence was HotSpot's limitation on method size.
However, for the sake of completeness I'll convert this function as well, using another handy feature of C's macro language: token concatenation. A combination of two hash characters (##), used in a macro, causes concatenation of the text to the left and to the right of it without any whitespace, allowing the creation of new tokens such as function names.
You can see how using ## helps create the necessary function names. Note the absence of semicolons between the calls to DEF_DEMUX: in C/C++, function definitions do not end with semicolons (but class definitions do!)
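A minimal demonstration of the technique (the macro and function names here are illustrative, simpler than the article's DEF_DEMUX): ## glues a fixed prefix to the macro argument, producing a family of function names:

```cpp
#include <cassert>

// DEF_FUNC(3) expands to a definition of a function named f_3:
// '##' glues the token 'f_' to whatever the argument is
#define DEF_FUNC(n) int f_##n (int x) { return x + n; }

DEF_FUNC (0)    // note: no semicolons between function definitions
DEF_FUNC (1)
DEF_FUNC (2)

// CALL_FUNC builds the matching call; the trailing '+' chains them
#define CALL_FUNC(n) f_##n (10) +

int call_all ()
{
    return CALL_FUNC (0) CALL_FUNC (1) CALL_FUNC (2) 0;  // '0' absorbs the last '+'
}
```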
Making things shorter: macros as macro parameters
Our code contains a manually written sequence of macro calls of size 64 (the MOVE_BYTES definition), one each of sizes 2, 4, 8 and 16 (the various Unrolled_1_x versions) and four of size 32 (all the fully unrolled versions).
The code can be made a bit shorter by introducing a generic macro duplicator – a poor man's meta-programming loop. Such a duplicator (also a macro) can take a macro as a parameter and call it a given (constant) number of times. This is based on a very handy feature of macros: although a macro with parameters can't be invoked with a different number of parameters, using its name without any parameters at all is allowed; the macro just won't be expanded then. This allows passing the name of one macro as a parameter to another macro.
There are three ways to write such a duplicator. The simplest (but the longest) involves a lot of typing:
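The first way, written out in full (only the smallest sizes are shown here; m is the macro being duplicated, passed by name without parentheses):

```cpp
#include <cassert>

// DUP_N(m) expands to m(0) m(1) ... m(N-1), every index typed by hand
#define DUP_2(m) m(0) m(1)
#define DUP_4(m) m(0) m(1) m(2) m(3)
#define DUP_8(m) m(0) m(1) m(2) m(3) m(4) m(5) m(6) m(7)
// ... DUP_16, DUP_32 and DUP_64 continue the same way

// usage sketch: the argument macro is defined right before use
int sum_of_indices ()
{
    int total = 0;
    #define ADD(i) total += (i);
    DUP_8 (ADD)    // expands to total += 0; total += 1; ... total += 7;
    #undef ADD
    return total;
}
```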
And so on – the definition of DUP_64 will contain 64 calls to m.
We can save a bit of typing by basing each higher-order duplicator on a lower-order one. A partial solution looks like this:
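A sketch of that partial solution (my reconstruction): each duplicator re-uses the previous one and only spells out its upper half:

```cpp
#include <cassert>

#define DUP_2(m)  m(0) m(1)
#define DUP_4(m)  DUP_2 (m) m(2) m(3)
#define DUP_8(m)  DUP_4 (m) m(4) m(5) m(6) m(7)
#define DUP_16(m) DUP_8 (m) m(8) m(9) m(10) m(11) m(12) m(13) m(14) m(15)
// ... and so on up to DUP_64

int sum_of_indices_16 ()
{
    int total = 0;
    #define ADD(i) total += (i);
    DUP_16 (ADD)   // expands to total += 0; ... total += 15;
    #undef ADD
    return total;
}
```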
Here the lists of m() invocations are exactly half the size.
Another approach is to call each previous macro twice:
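A sketch of the doubling approach (again my reconstruction): every duplicator invokes the previous one twice, the second time with a shifted base index:

```cpp
#include <cassert>

// each definition contains exactly two invocations of the previous one
#define DUP_2_(m, base) m (base) m (base + 1)
#define DUP_4_(m, base) DUP_2_ (m, base) DUP_2_ (m, base + 2)
#define DUP_8_(m, base) DUP_4_ (m, base) DUP_4_ (m, base + 4)
#define DUP_8(m)        DUP_8_ (m, 0)
// by-product: DUP_4_(m, 8) is a "loop" of four starting at 8.
// Note the indices expand to expressions such as 0 + 4 + 2 + 1, which
// are fine in arithmetic but useless with the ## operator

int sum_doubling ()
{
    int total = 0;
    #define ADD(i) total += (i);
    DUP_8 (ADD)        // indices 0..7
    DUP_4_ (ADD, 8)    // indices 8..11
    #undef ADD
    return total;
}
```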
This approach has two significant advantages. It allows writing down loops of arbitrary sizes without much typing (a macro definition for any loop whose size is a power of two contains exactly two macro invocations; for other loop sizes the number may be bigger, but it is always limited by the number of bits in the loop size). It also provides, as a by-product, a set of loops that start not at zero but at some given number, and we need these for our partially unrolled loop solutions. It has, however, one disadvantage: the indices in the macro expansions appear as arithmetic expressions rather than as plain numbers, and this makes it impossible to use this macro with ##, since the concatenation operation does not evaluate expressions before concatenating. That's why we will use the partial solution. Versions of the code with both solutions are in the repository:
this is the solution we discard
and this is the solution we’ll use.
Neither of these approaches works for DEF_DEMUX, which does not allow semicolons between calls.
The macros above only work with argument macros that take one parameter; for macros with more parameters, separate series of duplicators must be created. Using these, the whole sequence of CALL_DEMUX invocations in the previous example (Unrolled_3) becomes just a single duplicator call.
The partially unrolled methods require a service macro that relies on the name of the loop variable (j) and adds it to the generated index.
What is nice is that the DUP_xx macros are now generic: they can be moved to a separate header file and later re-used in other projects, which is what I've done.
As a result of applying the macros and duplicators, the entire test program is now 383 lines long, while in Java Unrolled_2_Full alone was 553 lines.
Running it

We are ready to run it all. We'll be using the same system as for the Java testing: Linux, running on a Dell blade with an Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz. We'll use gcc 4.6.
# c++ -O3 -o e1 e1.cpp -lrt
(-O3 indicates the highest level of optimisation, and -lrt links the relevant run-time library.)
# ./e1
 9Reference:      1937 1930
11Src_First_1:    1890 1878
11Src_First_2:    1920 1916
11Src_First_3:    1890 1891
11Dst_First_1:    1465 1465
11Dst_First_2:    1443 1444
11Dst_First_3:    1758 1758
10Unrolled_1:      632  633
12Unrolled_1_2:    633  633
12Unrolled_1_4:    655  655
12Unrolled_1_8:    648  649
13Unrolled_1_16:   635  633
15Unrolled_2_Full: 635  634
10Unrolled_3:      634  634
10Unrolled_4:      653  653
You might be surprised by the names in the first column; so am I. We are lucky that we can read them at all: the C++ standard does not require any particular format for a type name returned by a typeid call, so the class name prefixed with its character count isn't the worst possible case.
It is clearly visible that the results of the two runs are quite stable. This means that running the test twice is unnecessary, and we can remove the corresponding code.
# ./e1
 9Reference:      1939
11Src_First_1:    1885
11Src_First_2:    1924
11Src_First_3:    1892
11Dst_First_1:    1467
11Dst_First_2:    1445
11Dst_First_3:    1761
10Unrolled_1:      633
12Unrolled_1_2:    634
12Unrolled_1_4:    654
12Unrolled_1_8:    650
13Unrolled_1_16:   635
15Unrolled_2_Full: 635
10Unrolled_3:      635
10Unrolled_4:      655
Let’s recall the results for Java version and put them together in the same table:
| Version | Time in Java | Time in C |
Here comes our first sensation: C++ is not always faster than Java. Two of the versions are quite a bit faster in Java than in C++. I'll come back to this phenomenon in a later article. Let's look at the other observations:
- The compiler does not panic when compiling big methods; there are no cases where something ran unusually slowly.
- In most cases C++ is still a bit faster than Java, with typical execution time being 70–80% of the Java time, sometimes less (43% in Src_First_3 and 68% in Reference). Some of the speed difference can be attributed to the absence of index checking in C++.
- All the fully unrolled versions (Unrolled_2_Full to Unrolled_4) run at the same speed, so there is no need to create complex solutions such as Unrolled_3.
- The speed of the unrolled solutions is virtually identical in C++ and Java. Sometimes Java is a bit faster, sometimes C++ is.
- The relative speeds of the different solutions aren't the same as in Java. In Java, manual optimisation of Dst_First_1 produced a very slow result (Dst_First_2), while in C++ it even improved speed a bit, which raises the suspicion that the optimiser in GNU C++ isn't as good as the one in HotSpot. Src_First_3 didn't experience a big slowdown either – in Java the slowdown could have been due to array index checking.
- The difference between Dst_First_3 and Dst_First_1 is that the input size is hard-coded in the former, which should improve performance. However, the opposite has happened. This is something to investigate.
Notes for myself
Some points to investigate:
- Why do some C++ versions run slower than their Java counterparts?
- How well does the C++ compiler optimise programs? Is manual optimisation good or bad there?
- What happens if we use vectors with index checking instead of byte pointers? How much will it affect performance?
- Can the C++ code be made faster without rewriting – perhaps there are some secret compiler keys? Can the speed be improved further?

Answers will follow soon.