No, there isn’t a mistake in the title. Everyone expects that C programs always run faster than Java programs, but in the recent articles (“De-multiplexing of E1 stream: converting to C” and “The mystery of an unstable performance”) we came across an example of the opposite.
These are the versions in question and the performance data, as published at the end of the last article:
| Method | Results in Java | Results in C |
We can see that Dst_First_3 is faster in Java than in C. There are other irregularities as well: in Java, Dst_First_3 is quite a bit faster than Dst_First_1, while in C it is a bit slower. In Java, Dst_First_2 is much slower, while in C it runs at the same speed as Dst_First_1. Why is this?
Looking at the code: pointer aliasing
Let’s recall what these three versions of E1 de-multiplexors are:

Dst_First_1: the simplest, unoptimised version;
Dst_First_2: the result of manual optimisation of Dst_First_1; specifically, loop invariant code motion and operation strength reduction were performed;
Dst_First_3: the version of Dst_First_1 with a hard-coded size of input and output buffers.
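The real code lives in the repository; as a reference point, here is a minimal sketch of the Dst_First_1 idea reconstructed from the description above (the function and constant names are my assumptions, not necessarily the article’s):

```cpp
#include <cstddef>

// Number of timeslots in an E1 frame (assumed constant name).
const size_t NUM_TIMESLOTS = 32;

// Dst_First_1 style: the outer loop walks the destination buffers, the
// inner loop fills each buffer byte by byte, reading the source with a
// stride of 32.
void demux (const char * src, size_t src_length, char ** dst)
{
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num)
        for (size_t dst_pos = 0; dst_pos < src_length / NUM_TIMESLOTS; ++ dst_pos)
            dst [dst_num][dst_pos] = src [dst_pos * NUM_TIMESLOTS + dst_num];
}
```

Note the access `dst [dst_num]` inside the inner loop; this is the expression the rest of the article is concerned with.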
Let’s compile the program with all necessary flags
c++ -S -O3 -falign-functions=32 -falign-loops=32 -o e1.asm e1.cpp
and look at the assembly:
A quick look at the inner loop (the code ending at label .L50) reveals that there is something that can be optimised there:
There are three memory access instructions here. The first one reads a byte from the address %rsi at the offset %r8. %r8 is a copy of a loop variable (%edi), which is incremented by 32 for each loop iteration. Clearly, this is reading from src. The second one reads a quadword (8 bytes) from the address %rcx at the offset %r10. %r10 is the outer loop variable (it is incremented by one each iteration). This is clearly reading the pointer dst[dst_num]. And the third one writes the byte read by the first instruction to the address just read from dst[dst_num], at the offset %rax. %rax is another inner loop variable. This is writing to dst[dst_num][dst_pos].
This means that dst[dst_num] was not identified as an expression which is invariant in the inner loop, and it wasn’t moved out of the loop. The reason is potential pointer aliasing.
Unlike Java, C and C++ are quite tolerant towards pointer behaviour. The original version of C was absolutely liberal: pointers were allowed to point anywhere, and the same area of memory could be accessed via multiple pointers of arbitrary types. This reduced the optimisation opportunity, since for any two pointers of unknown nature the compiler had to assume that they could point to the same memory. This is what GCC still does when strict pointer aliasing is switched off. Later (in the first C standard, ‘ANSI/ISO 9899:1990’, commonly referred to as ‘ANSI C’) the aliasing rules became stricter. An object may now only be accessed via pointers of “similar” types (for instance, int and unsigned are considered similar), or via a char pointer. This allows a compiler to optimise pointer dereferences where the types are different and neither of the pointers is a char*. In GNU C this is called the strict aliasing rule; it is switched on by the -fstrict-aliasing switch, which is part of the -O2 optimisation pack.
The impact of the strict aliasing rule can be seen in the following examples, two small functions a() and b(). This is how they are compiled:
We can see that the code for b() contains an extra instruction: a repeated memory read. The reason is that the compiler assumed aliasing of p (of type char*) and *q (also of type char*, q being a char**), and did not optimise the second read of p. In a() this did not happen, because int* does not alias char*.
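The example pair itself did not survive in this copy of the article; a minimal reconstruction consistent with the discussion (a global char* written through two differently-typed pointers; the exact signatures are my assumption) could look like this:

```cpp
// Global pointer of type char*.
char * p;

// a(): the store *q writes through an int* lvalue; under strict aliasing
// an int* access cannot modify the char* object p, so p may be loaded
// once and reused for the second assignment.
void a (int ** q)
{
    *q = (int *) p;
    *q = (int *) p;
}

// b(): the store *q writes through a char* lvalue, which may alias the
// char* object p itself (q could be &p); p must be re-loaded before the
// second store -- the extra instruction seen in the assembly.
void b (char ** q)
{
    *q = p;
    *q = p;
}
```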
The case of Dst_First_1::demux is similar to that of b(): dst + dst_num is of type char**, dst[dst_num] + dst_pos is a char*, and they are considered aliased. As a result, the compiler did not optimise the repetitive access to dst[dst_num].
The analysis of the code for
Dst_First_3 shows the same problem.
The only way I know to improve the code is to optimise the program manually. This is what we’ve done in Dst_First_2, but there we also performed strength reduction, which could affect the code somehow. Let’s write versions (Dst_First_1a and Dst_First_3a) with only the loop invariant moved out of the loop but no strength reduction (the code is here):
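The actual ‘1a’ code is in the repository; as a stand-in, this is a hedged sketch of the transformation it performs (names are assumptions): the loop-invariant pointer dst[dst_num] is loaded into a local variable before the inner loop, doing by hand the code motion that the aliasing rules prevent the compiler from doing.

```cpp
#include <cstddef>

const size_t NUM_TIMESLOTS = 32;   // assumed constant name

// Dst_First_1a style: dst[dst_num] is read once per outer iteration and
// kept in a local; the inner loop writes through that local only.
void demux_1a (const char * src, size_t src_length, char ** dst)
{
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num) {
        char * d = dst [dst_num];   // hoisted loop invariant
        for (size_t dst_pos = 0; dst_pos < src_length / NUM_TIMESLOTS; ++ dst_pos)
            d [dst_pos] = src [dst_pos * NUM_TIMESLOTS + dst_num];
    }
}
```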
11Dst_First_1: 1455
11Dst_First_2: 1453
11Dst_First_3: 1463
12Dst_First_1a: 1454
12Dst_First_3a: 1443

The results do not show anything exceptional. The speed of Dst_First_1 did not improve due to the optimisation, while that of Dst_First_3 improved only slightly (1463 ms to 1443 ms).
The code for
Dst_First_1a looks better (I’ll only show the inner loop, the rest you can see
in the repository, file e1.asm):
It is difficult to say why this code isn’t faster than the original one. Perhaps an extra instruction got
absorbed between the latencies of some other instructions, and the extra dependency that it creates was eliminated
by dynamic instruction reordering. It is interesting to check how many CPU cycles it takes to perform the loop.
This loop is executed
32 * 64 * 1000000
times in 1454 ms, which makes it 0.71 ns per loop. The CPU runs at 2.6 GHz, or at
0.38 ns per cycle. This means that these 7 instructions take 1.87 CPU cycles to execute. For this to happen,
a lot of parallel execution must be going on, and who knows what the limit is? Perhaps, an extra instruction can be
squeezed in somewhere as well.
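The arithmetic above can be checked in a couple of lines (with exact numbers the result is about 1.85 cycles; the article’s 1.87 comes from rounding the cycle time to 0.38 ns):

```cpp
// Cycles per loop iteration, given total time in ms, the number of loop
// executions, and the CPU clock in GHz (GHz == cycles per nanosecond).
constexpr double cycles_per_iteration (double time_ms, double iterations, double ghz)
{
    return (time_ms * 1e6 / iterations) * ghz;
}

// cycles_per_iteration (1454, 32.0 * 64 * 1000000, 2.6) is roughly 1.85
```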
The code for Dst_First_2 looks exactly like that of Dst_First_1a (except for one insignificant instruction swap), which proves that GNU C is good enough at performing operation strength reduction, and there is no point helping it. We should rather try writing more easily readable code, and I believe that Dst_First_1a is easier to read than Dst_First_2.
The code for Dst_First_3a looks the best of all; you can see it in e1.asm in the repository.
But the code still runs slowly. The optimisation has not helped – at least, not on its own.
Looking at the code again: loop unrolling
The code for both Dst_First_1a and Dst_First_3a seems almost ideal. I can’t rewrite the assembly version manually and make it much faster; most likely, I can’t make it faster at all.
Unless I unroll the loops.
We’ve been talking a lot about loop unrolling in the original article (“De-multiplexing of E1 streams in Java”). We performed manual loop unrolling, sometimes taking it to extremes. But we know that some compilers are capable of performing some level of loop unrolling automatically. The only way to explain why Java executes this code in 1022 ms while C-compiled code runs for 1443 ms is that Java performs loop unrolling while C++ does not.
Why does GNU C not do it? Could it be that it is just a bad compiler? No. Unrolling all the loops in the program is probably as bad an idea as not unrolling anything at all, if not worse. Loop unrolling increases code size big time, and bigger code often runs slower. In addition, when the exact loop iteration count is unknown, some extra code is required at loop entry or exit to compensate for possible “leftovers”. In short, loop unrolling must be performed where it definitely benefits the program, and the C compiler does not know where this is, unless it has access to profiling data. Java has access to profiling data by design, that’s why it can perform such optimisations more efficiently.
GNU C has several switches that control loop unrolling:

-floop-optimize: Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well. Enabled at levels -O, -O2, -O3, -Os.

-funroll-loops: Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.

-funroll-all-loops: Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly.
There are also some switches that control maximal size of unrolled loop, maximal number of times loops are unrolled and so on. We won’t use them, relying on the default values.
As you can see, -floop-optimize is capable of performing some unrolling, and it is invoked at our optimisation level (-O3). But it didn’t unroll our loops. -funroll-all-loops looks dangerous: it can easily make the program run slower. Let’s try -funroll-loops:
c++ -O3 -falign-functions=32 -falign-loops=32 -funroll-loops -o e1 e1.cpp -lrt
./e1
9Reference: 1604
11Src_First_1: 1119
11Src_First_2: 1101
11Src_First_3: 1458
11Dst_First_1: 1079
11Dst_First_2: 880
11Dst_First_3: 1009
12Dst_First_1a: 882
12Dst_First_3a: 727
10Unrolled_1: 635
12Unrolled_1_2: 633
12Unrolled_1_4: 651
12Unrolled_1_8: 648
13Unrolled_1_16: 635
15Unrolled_2_Full: 635
Let’s put all the results into one table. The old results are taken from “The mystery of an unstable performance”.
| Version | Time in Java | Old time in C | New time in C |
All the versions (excluding the manually unrolled ones) became faster.
All the versions now run faster than in Java.
The Dst_First versions now run faster than the Src_First ones; it was not so in C initially.
While resolving the pointer aliasing and optimising the loop did not help on its own, it helped a lot in combination with the loop unrolling. This is visible by comparing Dst_First_1 (1079 ms) with Dst_First_1a (882 ms).
The same is true for hard-coded loop boundaries: it didn’t help much on its own but made a big difference in combination with the loop unrolling. Compare Dst_First_1a (882 ms) with Dst_First_3a (727 ms).
The Dst_First_3a version’s speed is very close to that of the Unrolled versions, which makes it a good choice where extreme performance is not necessary.
Dst_First_1a is slower, but its speed is still acceptable for a universal solution (one without hard-coded sizes).
This is probably all we can do by applying simple changes and tweaking compiler switches. More effort is required if we need to improve speed even more.
The unrolled code
Let’s have a look at the unrolled code
(I’m excluding the assertion part;
the full text is in the repository).
The inner loop was unrolled 8 times in both cases.
The inner loop of Dst_First_1a was 23 bytes long; it became 140 bytes.
The inner loop of Dst_First_3a was 23 bytes long; it became 142 bytes.
The entire loop of Dst_First_1a was 123 bytes long; it became 451 bytes.
The entire loop of Dst_First_3a was 123 bytes long; it became 246 bytes.
The loop unrolling killed the loop alignment in Dst_First_1a: the loop now starts at 0x410c2f, which is not 32-byte aligned, even though the code still requests explicit alignment. A possible reason is that the loop entry is also a target for a normal branch. I think this is wrong, possibly a deficiency of GNU C.
Both loops are explicitly aligned in Dst_First_3a.
Comparison instructions with jumps to the labels up to .L71 are sorting out the leftovers. They check the number of iterations modulo 8. If it is zero, the code jumps to the loop entry; if it is one, the code processes one byte and then jumps to the loop entry; and so on.
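Expressed in C rather than assembly, the leftover handling follows a familiar pattern (this is an illustration of the pattern, not the generated code itself):

```cpp
#include <cstddef>

// Copy count bytes using an 8x-unrolled loop: the first loop processes
// the count % 8 leftover elements one by one, after which the remaining
// count is a multiple of 8 and the unrolled body can run freely.
void copy_unrolled_8 (unsigned char * d, const unsigned char * s, size_t count)
{
    size_t i = 0;
    for (; i < count % 8; ++ i)      // leftovers: 0..7 elements
        d [i] = s [i];
    for (; i < count; i += 8) {      // unrolled body: 8 elements per pass
        d [i]     = s [i];     d [i + 1] = s [i + 1];
        d [i + 2] = s [i + 2]; d [i + 3] = s [i + 3];
        d [i + 4] = s [i + 4]; d [i + 5] = s [i + 5];
        d [i + 6] = s [i + 6]; d [i + 7] = s [i + 7];
    }
}
```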
The code for Dst_First_3a does not need these tests and looks much neater. Still, the code does not look optimal: the leal instruction could have been merged with the memory load operation. Perhaps it has something to do with the operand size: leal is a 32-bit instruction. This is something to investigate; it might mean that the previous statement (“This is probably all we can do by applying simple changes and tweaking compiler switches”) was not correct.
Pointer aliasing requires manual treatment. Even if the performance does not improve immediately, it may improve in combination with other optimisations.
When applied in the right places, loop unrolling can improve performance a lot. It may require additional command-line arguments or pragmas/attributes.
GCC can do strength reduction; there is no need to perform it manually.
Finally, the C-compiled program does run faster than the one in Java.
There is still something wrong with the code.
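As an illustration of the pragma route mentioned above, GCC (version 8 and later; this is an assumption about the toolchain, the article predates it) allows requesting unrolling for one specific loop instead of the whole program. Names and sizes below are illustrative, not the article’s code:

```cpp
#include <cstddef>

const size_t NUM_TIMESLOTS = 32;   // assumed constants
const size_t DST_SIZE = 64;

// Hard-coded sizes (the Dst_First_3 shape) plus an explicit unroll
// request on the inner loop only.
void demux_pragma (const char * src, char ** dst)
{
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num) {
        char * d = dst [dst_num];
#pragma GCC unroll 8
        for (size_t dst_pos = 0; dst_pos < DST_SIZE; ++ dst_pos)
            d [dst_pos] = src [dst_pos * NUM_TIMESLOTS + dst_num];
    }
}
```

This keeps the rest of the program free of the code-size cost that a global -funroll-loops can impose.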
In the next article I’m going to look at the code generated for
Dst_First_3a closely and check if it can be improved.