In one of the previous articles I was using C macros to implement deep loop unrolling. I also mentioned C++ templates and stated that it was possible to use those for the same purpose, but that the code would look complex and ugly. Now I want to test this statement.
Simple unrolled methods
Let’s look at the first piece of unrolled code using macros, namely
(you can see it in a repository):
Using templates, we can write it like this:
Some comments on this:
move_bytes<N>is a separate function, with its own generated code; each function gets instantiated when used somewhere at compile time. In our case each
move_bytes<N>causes instantiation of
move_bytes<N-1>. Here the parameter
Nindicated the number of bytes to be moved.
This chain of instantiations can go for ever, eventually causing stack overflow in the template processor, so a terminating definition is required. The syntax
template<> inline void move_bytes<0>means that the definition of
move_bytes<0>does not follow the general pattern. It is implemented as an empty function, which is reasonable for a function that moves zero bytes.
There is no need for separate
MOVE_BYTEfunction or macro any more: its code can be easily placed inside
Hopefully the functions will be inlined into each other (I asked for that by using the
inlinekeyword, but the compiler can do some inlining even without such a request), and in the end we’ll get an unrolled loop.
Macros can access any variables from the context they are invoked; functions can’t. That’s why the entire context these functions operate one must be passed to them as parameters
Unlike macros, the template functions are truly generic and do not depend on exact values of variables. In our case we could even use the
DST_SIZE(our tuning parameter) as a byte count. With macros we could construct macro name dynamically (using the
##operator), but all the required macros must exist already.
The templated version works with any values of
NUM_TIMESLOTS, so two of the three asserts can be removed.
The functional languages enthusiasts will probably find it exciting. I still prefer the macro solution as the more straightforward one.
However I must agree that it does not look as ugly as I thought it would, so my original statement is probably not correct.
This is what fully unrolled version looked like using macros:
move_timeslot function already, now we must provide a way to call it repeatedly. We’ll do it in exactly
the same way we did
j defines the first timeslot to be moved. It’s been added to make
move_timeslots<N>() suitable for
partially unrolled versions, too:
We can go further if we observe that the partially unrolled versions are now using the unroll factor (which is 4
Unroll_1_4 directly rather that as part of a macro name, and this factor is the only difference between them.
It makes it possible to template the whole function:
Later, instead of using classes
Unrolled_1_4, etc., we can instantiate templates:
Unrolled_1_F<4>, etc. Just be prepared to see funny class names as results of
Unrolled_2_Full can be constructed as
Unrolled_1_F<32>, but it looks clearer as it is.
The remaining two versions (
Unrolled_4) both involve repeatitive calling of functions from
demux(). The difference was that in
Unrolled_4 it was the same function called 32 times, while in
Unrolled_3 there were
32 different specialised functions. Using templates, we can do both: we can create service templated functions
to implement loops, and we can create 32 instantiations of templated function. We can make all the service templates
private members of a class, so there won’t be any need to
#undef them after use. But it is easy to show that in
our specific case all of that is unnecessary.
Here is the macro version of
demux_0() here contains a fully unrolled loop that copies 64 bytes. The way to write such a loop in the templated
implementation will be to call
move_bytes<64>(). But there is no point creating and calling a special function
demux_0(), whose sole purpose is calling another function. Instead, we can call that other function directly,
and that is exactly what
Similar observation is applicable to
Unrolled_3, except there we have 32 functions, each of which just calls
In short, both
Unrolled_4 are unnecessary and can be removed from our C++ code. In fact,
they can be removed from the macro-based code as well. The only reason they were created in the first place was that
in Java the compiler panicked at long functions.
I’ve put the templated code into separate file,
The code is available in the repository.
Now we are ready to compile and run our templated code. But first let’s recall the results for the macro-based code:
# ./e1 9Reference: 1939 11Src_First_1: 1885 11Src_First_2: 1924 11Src_First_3: 1892 11Dst_First_1: 1467 11Dst_First_2: 1445 11Dst_First_3: 1761 10Unrolled_1: 633 12Unrolled_1_2: 634 12Unrolled_1_4: 654 12Unrolled_1_8: 650 13Unrolled_1_16: 635 15Unrolled_2_Full: 635 10Unrolled_3: 635 10Unrolled_4: 655 # c++ -O3 -o e1-template e1-template.cpp -lrt # ./e1-template 9Reference: 1940 10Unrolled_1: 634 12Unrolled_1_FILj2EE: 633 12Unrolled_1_FILj4EE: 653 12Unrolled_1_FILj8EE: 650 12Unrolled_1_FILj16EE: 646 15Unrolled_2_Full: 655
We see is that the templated versions are fast.
Unrolled_8 execute in exactly the same time as the macro-based versions. The
slow down a bit. This is most likely caused by the function inline limit and can possibly be improved by using
appropriate compiler options. We won’t do it here, because
Unrolled_1 is already fast enough.
The assembly output (which one can get by invoking the compiler with
-S option) for
Unrolled_1 is nearly identical
for macro and templated versions. Here it is for curious readers (without the assertion failure handler):
Using templates for meta-programming is not that bad. Effectively we are replacing loops with recursion, which is not the most intuitive thing to do. However, resulting code is quite small, more flexible and, most important, it is efficient. A powerful function inlining engine that is built into the C++ compiler makes the generated machine code identical to that of macro version.