No, there isn’t a mistake in the title. Everyone expects C programs to always run faster than Java programs, but in recent articles (“De-multiplexing of E1 stream: converting to C” and “The mystery of an unstable performance”) we came across an example of the opposite.
These are the versions in question and the performance data, as published at the end of the last article:
Method | Results in Java, ms | Results in C, ms |
---|---|---|
Dst_First_1 | 1155 | 1457 |
Dst_First_2 | 2093 | 1454 |
Dst_First_3 | 1022 | 1464 |
We can see that Dst_First_1 and Dst_First_3 are faster in Java than in C. There are other irregularities as well: in Java Dst_First_3 is quite a bit faster than Dst_First_1, while in C it is a bit slower. In Java Dst_First_2 is much slower, while in C it runs at the same speed as Dst_First_1. Why is this?
Looking at the code: pointer aliasing
Let’s recall what these three versions of E1 de-multiplexors are:
- Dst_First_1: the simplest, unoptimised version;
- Dst_First_2: the result of manual optimisation of Dst_First_1; specifically, loop invariant code motion and operation strength reduction were performed;
- Dst_First_3: the version of Dst_First_1 with hard-coded sizes of the input and output buffers.
Here is the code for all three versions:
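The full listings live in the repository; what follows is a minimal sketch of the three versions, reconstructed from the descriptions above rather than copied from the original source. The byte typedef, the NUM_TIMESLOTS and DST_SIZE constants, and the free-function form (the originals are methods of demultiplexer classes) are assumptions.

```cpp
#include <cstddef>
#include <cassert>

typedef unsigned char byte;              // assumption: the byte type used in the article
static const size_t NUM_TIMESLOTS = 32;  // an E1 stream carries 32 timeslots
static const size_t DST_SIZE = 64;       // assumption: the hard-coded buffer size of Dst_First_3

// Dst_First_1: the simplest, unoptimised version
void dst_first_1 (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length % NUM_TIMESLOTS == 0);
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num)
        for (size_t dst_pos = 0; dst_pos < src_length / NUM_TIMESLOTS; ++ dst_pos)
            dst[dst_num][dst_pos] = src[dst_pos * NUM_TIMESLOTS + dst_num];
}

// Dst_First_2: manual loop invariant code motion (dst[dst_num] hoisted)
// and strength reduction (a running source index instead of a multiplication)
void dst_first_2 (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length % NUM_TIMESLOTS == 0);
    size_t dst_size = src_length / NUM_TIMESLOTS;
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num) {
        byte * d = dst[dst_num];
        size_t src_pos = dst_num;
        for (size_t dst_pos = 0; dst_pos < dst_size; ++ dst_pos) {
            d[dst_pos] = src[src_pos];
            src_pos += NUM_TIMESLOTS;
        }
    }
}

// Dst_First_3: Dst_First_1 with hard-coded buffer sizes
void dst_first_3 (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length == NUM_TIMESLOTS * DST_SIZE);
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num)
        for (size_t dst_pos = 0; dst_pos < DST_SIZE; ++ dst_pos)
            dst[dst_num][dst_pos] = src[dst_pos * NUM_TIMESLOTS + dst_num];
}
```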
Let’s compile the program with all necessary flags
c++ -S -O3 -falign-functions=32 -falign-loops=32 -o e1.asm e1.cpp
and look at the assembly:
A quick look at the inner loop (the code between .L52 and .L50) reveals that there is something that can be optimised there:
There are three memory access instructions here. The first one reads a byte from the address %rsi at the offset %r8, where %r8 is a copy of a loop variable (%edi), which is incremented by 32 for each loop iteration. Clearly, this is the read of the source byte (the source is traversed with a stride of NUM_TIMESLOTS = 32).
The second one reads a quadword (8 bytes) from the address %rcx at the offset %r10*8, where %r10 is the outer loop variable (it is incremented by one each iteration). This is clearly dst[dst_num].
And the third one writes the byte read by the first instruction to the address just read from dst[dst_num], at the offset %rax, where %rax is another inner loop variable. This is writing to dst[dst_num][dst_pos].
This means that dst[dst_num] was not identified as an expression which is invariant in the inner loop, and it wasn’t moved out of the loop. The reason is potential pointer aliasing.
Unlike Java, C and C++ are quite tolerant towards pointer behaviour. The original version of C was absolutely liberal – pointers were allowed to point anywhere, and the same area of memory could be accessed via multiple pointers of arbitrary types. This reduced the optimisation opportunity, since for any two pointers of unknown nature the compiler had to assume that they could point to the same memory. This is what GCC still does when strict pointer aliasing is switched off. Later (in the first C standard, ‘ANSI/ISO 9899:1990’, commonly referred to as ‘ANSI C’) the aliasing rules became stricter. An object can now only be accessed via pointers of “similar” types (int and unsigned are considered similar) or via a char pointer. This allows a compiler to optimise pointer dereferences where the types are different and neither of the pointers is a char*. In GNU C this is called the strict aliasing rule and is switched on by the -fstrict-aliasing switch, which is part of the -O2 optimisation pack.
The impact of the strict aliasing rule can be seen in the following examples, where b() models Dst_First_1::demux:
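The original example code is not reproduced here; below is a minimal sketch of what a() and b() might look like, consistent with the discussion that follows (the exact bodies and the parameter name p are assumptions):

```cpp
// Sketch only: two functions that differ solely in the pointer types involved.
// In b(), the store through p[0][0] is a write through a char lvalue, and a char
// access may alias anything, including the pointer object p[0] itself, so the
// compiler must re-read p[0] before the second store.
// In a(), an int store cannot alias an int* object, so p[0] can stay in a register.
void a (int ** p, int x)
{
    p[0][0] = x;
    p[0][1] = x;
}

void b (char ** p, char x)
{
    p[0][0] = x;
    p[0][1] = x;
}
```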
This is how they are compiled:
We can see that the code for b() contains an extra instruction. The reason is that the compiler assumed aliasing of p[0] (of type char*) and p (of type char**), and did not optimise the second access to p[0]. When compiling a() this did not happen, because int* does not alias int**.
Unfortunately, our Dst_First_1::demux is similar to b(): dst + dst_num is of type char**, while dst[dst_num] + dst_pos is a char*, and they are considered aliased. As a result, the compiler did not optimise repetitive access to dst[dst_num].
The analysis of the code for Dst_First_3 shows the same problem.
The only way I know to improve the code is to optimise the program manually. This is what we’ve done in Dst_First_2, but there we also performed strength reduction, which could affect the code somehow. Let’s write versions of Dst_First_1 and Dst_First_3 with only dst[dst_num] optimised, but no strength reduction (the code is here):
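Again, the actual code is in the repository; here is a rough sketch of what Dst_First_1a might look like, reusing the assumed definitions from the sketch above, with the invariant dst[dst_num] simply hoisted into a local pointer:

```cpp
// Sketch only: Dst_First_1a, i.e. Dst_First_1 with dst[dst_num] read once per
// outer iteration instead of being re-loaded from memory inside the inner loop.
void dst_first_1a (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length % NUM_TIMESLOTS == 0);
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num) {
        byte * d = dst[dst_num];   // hoisted loop invariant; no aliasing concern inside the inner loop
        for (size_t dst_pos = 0; dst_pos < src_length / NUM_TIMESLOTS; ++ dst_pos)
            d[dst_pos] = src[dst_pos * NUM_TIMESLOTS + dst_num];
    }
}
```

Dst_First_3a would be the same transformation applied to Dst_First_3, with the hard-coded DST_SIZE as the inner loop bound.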
Results
11Dst_First_1: 1455
11Dst_First_2: 1453
11Dst_First_3: 1463
12Dst_First_1a: 1454
12Dst_First_3a: 1443
do not show anything exceptional. The speed of Dst_First_1 did not improve due to the optimisation, while the speed of Dst_First_3 did.
The code for Dst_First_1a looks better (I’ll only show the inner loop; the rest you can see in the repository, file e1.asm):
It is difficult to say why this code isn’t faster than the original one. Perhaps an extra instruction got
absorbed between the latencies of some other instructions, and the extra dependency that it creates was eliminated
by dynamic instruction reordering. It is interesting to check how many CPU cycles it takes to perform the loop.
This loop is executed 32 * 64 * 1000000
times in 1454 ms, which makes it 0.71 ns per loop. The CPU runs at 2.6 GHz, or at
0.38 ns per cycle. This means that these 7 instructions take 1.87 CPU cycles to execute. For this to happen,
a lot of parallel execution must be going on, and who knows what the limit is? Perhaps, an extra instruction can be
squeezed in somewhere as well.
The code for Dst_First_2 looks exactly like Dst_First_1a (except for one insignificant instruction swap), which proves that GNU C is good enough at performing operation strength reduction, and there is no point in helping it. We should rather try writing more easily readable code, and I believe that Dst_First_1a is easier to read than Dst_First_2.
The code for Dst_First_3a looks the best of all; you can see it in e1.asm in the repository. But the code still runs slowly. The optimisation has not helped – at least, not on its own.
Looking at the code again: loop unrolling
The code for both Dst_First_1a and Dst_First_3a seems almost ideal. I can’t rewrite the assembly version manually and make it much faster; most likely, I can’t make it faster at all.
Unless I unroll the loops.
We’ve been talking a lot about loop unrolling in the original article (“De-multiplexing of E1 streams in Java”). We performed manual loop unrolling, sometimes taking it to extremes. But we know that some compilers are capable of performing some level of loop unrolling automatically. The only way to explain why Java executes this code in 1022 ms while C-compiled code runs for 1443 ms is that Java performs loop unrolling while C++ does not.
Why does GNU C not do it? Could it be that it is just a bad compiler? No. Unrolling all the loops in a program is probably as bad an idea as not unrolling anything at all, if not worse. Loop unrolling increases code size big time, and bigger code often runs slower. In addition, when the exact loop iteration count is unknown, some extra code is required at loop entry or exit to compensate for possible “leftovers”. In short, loop unrolling must be performed where it definitely benefits the program, and the C compiler does not know where that is unless it has access to profiling data. Java has access to profiling data by design, which is why it can perform such optimisations more efficiently.
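To make the “leftovers” point concrete, here is a rough, hypothetical illustration of what a hand-unrolled inner loop with leftover handling could look like (unrolled four times for brevity, reusing the assumed definitions from the earlier sketches; this is not one of the article’s Unrolled_* versions):

```cpp
// Illustration only: a 4-way unrolled inner loop. The second loop is the
// "extra code at loop exit" that handles the leftover iterations when the
// trip count is not a multiple of the unroll factor.
void demux_unrolled_4 (const byte * src, size_t src_length, byte ** dst)
{
    assert (src_length % NUM_TIMESLOTS == 0);
    size_t count = src_length / NUM_TIMESLOTS;
    for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; ++ dst_num) {
        byte * d = dst[dst_num];
        size_t dst_pos = 0;
        for (; dst_pos + 4 <= count; dst_pos += 4) {      // main unrolled body
            d[dst_pos]     = src[ dst_pos      * NUM_TIMESLOTS + dst_num];
            d[dst_pos + 1] = src[(dst_pos + 1) * NUM_TIMESLOTS + dst_num];
            d[dst_pos + 2] = src[(dst_pos + 2) * NUM_TIMESLOTS + dst_num];
            d[dst_pos + 3] = src[(dst_pos + 3) * NUM_TIMESLOTS + dst_num];
        }
        for (; dst_pos < count; ++ dst_pos)               // leftovers: 0 to 3 iterations
            d[dst_pos] = src[dst_pos * NUM_TIMESLOTS + dst_num];
    }
}
```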
GNU C has several switches that control loop unrolling:
Switch | Description |
---|---|
-floop-optimize | Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well. Enabled at levels -O, -O2, -O3, -Os. |
-funroll-loops | Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster. |
-funroll-all-loops | Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly. -funroll-all-loops implies the same options as -funroll-loops. |
There are also some switches that control the maximal size of an unrolled loop, the maximal number of times a loop is unrolled, and so on. We won’t use them, relying on the default values.
As you can see, -floop-optimize is capable of performing some unrolling, and it is invoked by the -O3 option. But it didn’t unroll our loops. The option -funroll-all-loops looks dangerous: it can easily make the program run slower. Let’s try -funroll-loops:
c++ -O3 -falign-functions=32 -falign-loops=32 -funroll-loops -o e1 e1.cpp -lrt
./e1
9Reference: 1604
11Src_First_1: 1119
11Src_First_2: 1101
11Src_First_3: 1458
11Dst_First_1: 1079
11Dst_First_2: 880
11Dst_First_3: 1009
12Dst_First_1a: 882
12Dst_First_3a: 727
10Unrolled_1: 635
12Unrolled_1_2: 633
12Unrolled_1_4: 651
12Unrolled_1_8: 648
13Unrolled_1_16: 635
15Unrolled_2_Full: 635
Let’s put all the results into one table. The old results are taken from “The mystery of an unstable performance”.
Version | Time in Java, ms | Old time in C, ms | New time in C, ms |
---|---|---|---|
Reference | 2860 | 1939 | 1604 |
Src_First_1 | 2481 | 1888 | 1119 |
Src_First_2 | 2284 | 1341 | 1101 |
Src_First_3 | 4360 | 1891 | 1458 |
Dst_First_1 | 1155 | 1457 | 1079 |
Dst_First_1a | | | 882 |
Dst_First_2 | 2093 | 1454 | 880 |
Dst_First_3 | 1022 | 1464 | 1009 |
Dst_First_3a | | | 727 |
Unrolled_1 | 659 | 636 | 635 |
Unrolled_1_2 | 654 | 634 | 633 |
Unrolled_1_4 | 636 | 655 | 651 |
Unrolled_1_8 | 637 | 650 | 648 |
Unrolled_1_16 | 25904 | 635 | 635 |
Unrolled_2_Full | 15630 | 635 | 635 |
Observations:
- All the versions (excluding the manually unrolled ones) became faster.
- All the versions now run faster than in Java.
- Dst_First versions now run faster than Src_First ones; this was not the case in C initially.
- While resolving the pointer aliasing and optimising the loop did not help on its own, it helped a lot in combination with the loop unrolling. This is visible by comparing Dst_First_1 with Dst_First_1a and Dst_First_3 with Dst_First_3a.
- The same is true for hard-coded loop boundaries. It didn’t help much on its own but made a big difference in combination with the loop unrolling. Compare Dst_First_1 with Dst_First_3 and Dst_First_1a with Dst_First_3a.
- The Dst_First_3a version’s speed is very close to that of the Unrolled versions, which makes it a perfect choice where extreme performance is not necessary.
- The Dst_First_1a is slower, but its speed is still acceptable for a universal solution (one without hard-coded sizes).
- This is probably all we can do by applying simple changes and tweaking compiler switches. More effort is required if we need to improve speed even more.
The unrolled code
Let’s have a look at the unrolled code (I’m excluding the assertion part; the full text is in the repository).
This is Dst_First_1a:
This is Dst_First_3a:
Observations:
- The inner loop was unrolled 8 times in both cases.
- The inner loop of Dst_First_1a was 23 bytes long; it became 140 bytes.
- The inner loop of Dst_First_3a was 23 bytes long; it became 142 bytes.
- The entire Dst_First_1a was 123 bytes long; it became 451 bytes.
- The entire Dst_First_3a was 123 bytes long; it became 246 bytes.
- The loop unrolling killed the loop alignment in Dst_First_1a: the loop now starts at 0x410c2f, even though alignment is still explicitly requested. A possible reason is that the loop entry is also a target for a normal branch. I think this is wrong, possibly a deficiency of GNU C.
- Both loops are explicitly aligned in Dst_First_3a.
- Comparison instructions with jumps to the labels from .L76 to .L71 are sorting out leftovers. They check the number of iterations modulo 8 (the unroll factor): if it is zero, the code jumps straight to the loop entry; if it is one, the code processes one byte and then jumps to the loop entry, and so on.
- The code for Dst_First_3a does not need these tests and looks much neater.
- The code still does not look optimal: the leal instruction could have been merged with the memory load operation. Perhaps it has something to do with the operand size: leal is a 32-bit instruction. This is something to investigate. It might mean that the previous statement, “This is probably all we can do by applying simple changes and tweaking compiler switches”, was not correct.
Conclusions
- Pointer aliasing requires manual treatment. Even if the performance does not improve immediately, it may improve in combination with other optimisations.
- When applied in the right places, loop unrolling can improve performance a lot. It may require additional command-line arguments or pragmas/attributes.
- GCC can do strength reduction; there is no need to perform it manually.
- Finally, the C-compiled program does run faster than the one in Java.
- There is still something wrong with the code.
Coming soon
In the next article I’m going to look more closely at the code generated for Dst_First_3a and check if it can be improved.