I finished my previous article with the observation that, despite the optimisation effort, the code generated by the GNU C compiler was still not optimal.
What’s wrong with the code?
Let’s concentrate on the
We compiled it with function alignment and loop alignment set to 32 and loop unrolling switched on. This is the assembly output:
Let’s look at the inner loop (the code between
jne .L29). This loop has been unrolled 8 times,
which corresponds to the following transformation of the original code:
You can see that operation strength reduction has also happened:
dst_pos * 32 became a separate variable,
updated once in the inner loop.
It is easy to see how registers are allocated:
Here is the code for one assignment from the inner loop (after sorting out some instruction re-ordering):
Why is there the
leal instruction? All it does is that it adds 32 to
%rdx and writes the result into
The result of this addition is later used in the effective address of the
movzbl, where another
register is added to it –
src). But wait, in Intel architecture an addressing mode can consist of
two registers (one possibly scaled) and an offset; why is this
32 offset not placed right there, like this?
The hint comes from the name of the instruction (
leal, rather than
leaq), as well as the name of the result
%r10d, rather than
%r10): this is a 32-bit instruction. The
%rdx is specified as an argument, but in actual
fact it only uses its lower 32 bits (
%edx). Only the lower 32 bits of the result are used; they are written
%r10d, which is the low part of the 64-bit register
%r10. The higher 32 bits are zeroed.
In short, the operation performs unsigned 32-bit addition (with possible wrapping in case of an overflow) and
converts the result to 64 bits. The
leal here has nothing to do with “load effective address”, which is what its name stands for;
it is used as a replacement for the addition instruction (
addl), for the only reason that it allows the result to be in
different register than the source, which saves a move instruction in our case. There is no way to combine this operation with a regular 64-bit addition in an
Why is the compiler computing the sum in 32 bits?
The reason is simple:
dst_pos is declared as
unsigned, as well as
int, is 32 bits long in GCC.
%edx register corresponds to the
dst_pos_x_32 variable from the pseudo-code, which equals to the
dst_pos multiplied by
NUM_TIMESLOTS (32), and this is a 32-bit multiplication. Even if the
result is converted to a 64-bit value later, the C++ requires the multiplication to be performed in 32-bit
mode, and the high part of a result to be discarded. Check this:
In our example the multiplication was replaced with the addition (as a result of the strength reduction), but that addition was still performed in 32 bits for numeric safety:
The same applies to adding of the offsets (32, 64, etc) to
%edx: these must also be done in 32 bits.
If current value of
%edx is close to
0xFFFFFFFF, adding 32 to it can cause
overflow and produce some small number as a result; this result would in fact be correct, while 64-bit result
(where no overflow occurs, and the result is big) would be incorrect. This means we can’t embed this addition into the addressing mode
of a load instructions, because address calculations are always 64 bits on this processor.
We know that in this loop
dst_pos is actually small, and so does the compiler; that’s why it uses 64-bit register
%rax) for this variable. It knows that no overflow can happen while adding 8 to it. However, it doesn’t seem to
be clever enough to figure out that no overflow happens while multiplying it by 32. That’s why it keeps
in a 32-bit variable
%edx and performs all the calculations with it in a 32-bit way, which is unsuitable for
effective address expressions.
unsigned 32 bits?
C/C++ is quite relaxed about data type sizes. It requires
int to be the same size as
unsigned, both to
be at least 16 bits, both to be not smaller than
short and not longer than
long. I still remember compilers
int was 16 bits, it was like that on all 8- and 16-bit platforms. 32-bit platforms established a new
standard, and for a long time the most common size of
int has been 32 bits. The standard does not prevent
changing it to 64 bits.
I can think of two reasons why this hasn’t happened in GNU C. One is the legacy code. There is a huge amount of C code around, it is everywhere. Not all of it is well-written. Some of the code may contain dependencies on type sizes, and it is very difficult to find these dependencied in the code.
The other reason is that 64-bit isn’t always better. Unofficially C/C++ declares that
int is the most natural
and the most efficient integer data format on the target architecture. We’ve seen that for the purpose of array indexing
the 64-bit type is indeed the most efficient. However, there are other cases.
First of all, the instruction encoding is implemented in such a way that the shortest instructions are 32-bit instructions
working with traditional registers (
01 C3 addl %eax, %ebx 48 01 C3 addq %rax, %rbx
this means that the 32-bit code is shorter. However, this is only applicable to small functions that can be fully
compiled using only traditional registers. As soon as the new registers (
r15) are used, the instructions
immediately become as long as in the 64-bit case:
49 01 C2 addq %rax, %r10 41 01 C2 addl %eax, %r10d
This means that the negative effect of working with 64-bit registers is relatively small for real programs.
There are 64-bit instructions that execute much longer than their 32-bit versions. Division is one example. Multiplication seems to run at the same speed.
64-bit numbers take twice the amount of space in data structures. It causes programs to use more memory and can reduce speed due to memory bandwith limitations.
32-bit numbers often allow better low-level optimisations than 64-bit numbers. Here is a simple example – a routine that adds two vectors of 256 numbers each. We’ll make two versions: for 32-bit numbers and for 64-bit:
I won’t show the entire code here, it is surprisingly big
(you can see it in the repository).
I’ll only show the main loops. Here is one for
And here is one for
The fragments look almost identical, except for one important difference:
paddd instruction in the first case
paddq in the second. These are the SSE-2 instructions.
Both operate on 128-bit
xmm registers, but
considers these registers consisting of four 32-bit words, while
paddq of two 64-bit words.
The operations are performed simultaneously on all the components.
The code is really amazing. The GNU C compiler has a built-in vectoriser, which tries to use SSE wherever possible. In this case it has split a source vector into sub-vectors of sizes four or two, added those sub-vectors component-wise, and then calculated the sum of the result.
Obviously, the second code requires twice as many iterations, as a result it runs longer.
As you can see, there isn’t a simple answer to the question which integer type is the most natural on the Intel 64-bit architecture, and there are valid arguments both ways. However, it seems that 64 bits is the best size for values used to index arrays.
Going 64 bits
Dst_First_3a and declare
dst_pos a 64-bit variable. On the 64-bit GCC we can use
When we run it, we get the following:
And the previous result we got (as published in “How to make C run as fast as Java”)
We’ve got very significant improvement. It looks attractive to modify all the code in the same way.
unsigned long throughout the program seems a bit unsafe. What if it is bad for some
other platform? If some 32-bit platform defines
unsigned long as 64-bit, such a modification will cause
a disaster. What we need is a special type for variables used as indices. Fortunately for us, such a type
exists, it is the
Strictly speaking, the
size_t isn’t designed for indices, it is there for sizes. A
sizeof operator returns
results of this type. Many library functions have parameters of this type:
qsort() and others. In addition to byte counts, it is used for element counts. It looks
very reasonable to use this type for indexing, as long as a programmer is happy to use an unsigned type. There
is a signed version of the
size_t, too – it is called
Our index variables are unsigned already, so we can make all of them
It makes sense to modify our interface and make
src_length of type
size_t as well. Besides making the code
consistent (we won’t compare
size_t variables with
unsigned), this has another positive effect: all the
versions that do not hard-code the input size will be able to work with any size that the platform allows.
Previously the input size was limited to 4 Gb, which for 64-bit platform is a totally unnecessary restriction.
Remember, when first talking about the type to use, in the article “De-multiplexing of E1 stream: converting to C”,
size_t. I said then that I
size_t was also unsigned. I didn’t, however, use
size_t itself, because I
considered it an unnecessary complication. Now it turned out that it was actually a performance feature.
This is how writing articles can help you learn something new.
This is what happens when we make everything
(see the code in repository):
# c++ -O3 -falign-functions=32 -falign-loops=32 -funroll-loops -o e1 e1.cpp -lrt # ./e1 9Reference: 1367 11Src_First_1: 987 11Src_First_2: 986 11Src_First_3: 1359 11Dst_First_1: 991 11Dst_First_2: 766 11Dst_First_3: 982 12Dst_First_1a: 763 12Dst_First_3a: 652 10Unrolled_1: 636 12Unrolled_1_2: 632 12Unrolled_1_4: 636 12Unrolled_1_8: 635 13Unrolled_1_16: 635 15Unrolled_2_Full: 635
The results look really impressive. Let’s put them into the table. The old results come from the previous article:
|Version||Time in Java||Old time in C||New time in C|
All the versions became faster; some became much faster
All the unrolled versions now run at the same speed;
Dst_First_3aruns just a bit slower
Referenceversion became very fast
Let’s have a look at the code generated for
(the entire assembly output is in
The code looks much better than before. Each line of the inner loop produces two instructions instead of three now:
Previously there were these three lines:
In fact, the code looks close to ideal, To improve the speed even more, something extraordinary must be done.
By the way, the compiler chose different route from what I suggested before (adding an address, an index and an offset in the effective address calculation). It employed different transformation of the source program:
I don’t think this new way is any faster (after all, address modes are free), but it looks neater.
In “De-multiplexing of E1 stream: converting to C” we took a Java program and converted it into C++. Then we’ve done the following:
- configured function and loop alignment (article: The mystery of an unstable performance);
- optimised memory access to eliminate aliasing effects (article: How to make C run as fast as Java);
- configured loop unrolling (the same article)
Here is a graph that shows how these steps affected the speeds of different versions:
The size of the data types used does matter.
size_tis there for a purpose; this purpose isn’t just for programs to look stylish and conform to the conventions – the programs really run faster if they use this type
We have exhausted conventional ways of achieving high performance in C; further improvements require some radical change.
Although most versions now run much faster than their Java originals, the peak performance in C is the same as in Java.
GNU C isn’t a bad compiler after all
Until now all our versions were produced by direct translation form Java. Can we do better in some C-specific way? What if we sacrifice portability and write for the specific platform? Can we improve the speed this way? This will be the topic of my next article.