In one of the previous articles I was using C macros to implement deep loop unrolling. I also mentioned C++ templates and stated that it was possible to use those for the same purpose, but that the code would look complex and ugly. Now I want to test this statement.
Simple unrolled methods
Let’s look at the first piece of unrolled code using macros, namely Unrolled_1
(you can see it in a repository):
Using templates, we can write it like this:
Some comments on this:
-
Each
move_bytes<N>
is a separate function, with its own generated code; each function gets instantiated when used somewhere at compile time. In our case eachmove_bytes<N>
causes instantiation ofmove_bytes<N-1>
. Here the parameterN
indicated the number of bytes to be moved. -
This chain of instantiations can go forever, eventually causing stack overflow in the template processor, so a terminating definition is required. The syntax
template<> inline void move_bytes<0>
means that the definition ofmove_bytes<0>
does not follow the general pattern. It is implemented as an empty function, which is reasonable for a function that moves zero bytes. -
There is no need for separate
MOVE_BYTE
function or macro anymore: its code can be easily placed insidemove_bytes<N>
. -
Hopefully the functions will be inlined into each other (I asked for that by using the
inline
keyword, but the compiler can do some inlining even without such a request), and in the end we’ll get an unrolled loop. -
Macros can access any variables from the context they are invoked; functions can’t. That’s why the entire context these functions operate on must be passed to them as parameters
-
Unlike macros, the template functions are truly generic and do not depend on the exact values of variables. In our case we could even use the
DST_SIZE
(our tuning parameter) as a byte count. With macros we could construct macro name dynamically (using the##
operator), but all the required macros must exist already. -
The templated version works with any values of
DST_SIZE
andNUM_TIMESLOTS
, so two of the three asserts can be removed. -
The functional languages enthusiasts will probably find it exciting. I still prefer the macro solution as the more straightforward one.
-
However I must agree that it does not look as ugly as I thought it would, so my original statement is probably not correct.
Deeper unrolling
This is what fully unrolled version looked like using macros:
We have move_timeslot
function already, now we must provide a way to call it repeatedly. We’ll do it in exactly
the same way we did move_bytes
:
Parameter j
defines the first timeslot to be moved. It’s been added to make move_timeslots<N>()
suitable for
partially unrolled versions, too:
We can go further if we observe that the partially unrolled versions are now using the unroll factor (which is 4
for Unroll_1_4
) directly rather that as part of a macro name, and this factor is the only difference between them.
It makes it possible to template the whole function:
Later, instead of using classes Unrolled_1_2
, Unrolled_1_4
, etc., we can instantiate templates: Unrolled_1_F<2>
,
Unrolled_1_F<4>
, etc. Just be prepared to see funny class names as results of typeid()
call.
Class Unrolled_2_Full
can be constructed as Unrolled_1_F<32>
, but it looks clearer as it is.
Other versions
The remaining two versions (Unrolled_3
and Unrolled_4
) both involve repeatitive calling of functions from
demux()
. The difference was that in Unrolled_4
it was the same function called 32 times, while in Unrolled_3
there were
32 different specialised functions. Using templates, we can do both: we can create service templated functions
to implement loops, and we can create 32 instantiations of templated function. We can make all the service templates
private members of a class, so there won’t be any need to #undef
them after use. But it is easy to show that in
our specific case all of that is unnecessary.
Here is the macro version of Unrolled_4
:
The demux_0()
here contains a fully unrolled loop that copies 64 bytes. The way to write such a loop in the templated
implementation will be to call move_bytes<64>()
. But there is no point creating and calling a special function
demux_0()
, whose sole purpose is calling another function. Instead, we can call that other function directly,
and that is exactly what Unrolled_2_Full
does.
Similar observation is applicable to Unrolled_3
, except there we have 32 functions, each of which just calls
move_bytes
.
In short, both Unrolled_3
and Unrolled_4
are unnecessary and can be removed from our C++ code. In fact,
they can be removed from the macro-based code as well. The only reason they were created in the first place was that
in Java the compiler panicked at long functions.
I’ve put the templated code into separate file, e1-template.cpp
.
The code is available in the repository.
Running it
Now we are ready to compile and run our templated code. But first let’s recall the results for the macro-based code:
# ./e1
9Reference: 1939
11Src_First_1: 1885
11Src_First_2: 1924
11Src_First_3: 1892
11Dst_First_1: 1467
11Dst_First_2: 1445
11Dst_First_3: 1761
10Unrolled_1: 633
12Unrolled_1_2: 634
12Unrolled_1_4: 654
12Unrolled_1_8: 650
13Unrolled_1_16: 635
15Unrolled_2_Full: 635
10Unrolled_3: 635
10Unrolled_4: 655
# c++ -O3 -o e1-template e1-template.cpp -lrt
# ./e1-template
9Reference: 1940
10Unrolled_1: 634
12Unrolled_1_FILj2EE: 633
12Unrolled_1_FILj4EE: 653
12Unrolled_1_FILj8EE: 650
12Unrolled_1_FILj16EE: 646
15Unrolled_2_Full: 655
We see is that the templated versions are fast. Unrolled_1
, Unrolled_2
, Unrolled_4
and
Unrolled_8
execute in exactly the same time as the macro-based versions. The Unrolled_1_16
and Unrolled_2_Full
slow down a bit. This is most likely caused by the function inline limit and can possibly be improved by using
appropriate compiler options. We won’t do it here, because Unrolled_1
is already fast enough.
The assembly output (which one can get by invoking the compiler with -S
option) for Unrolled_1
is nearly identical
for macro and templated versions. Here it is for curious readers (without the assertion failure handler):
Conclusions
Using templates for meta-programming isn’t that bad. Effectively we are replacing loops with recursion, which is not the most intuitive thing to do. However, the resulting code is quite small, more flexible and, most important, it is efficient. A powerful function inlining engine that is built into the C++ compiler makes the generated machine code identical to that of macro version.