Experiments in program optimisation

De-multiplexing of E1 stream: converting to C

Story: De-multiplexing of E1 streams

Tags: optimisation performance de-multiplexing C C++ macro meta-programming

02 May 2014

In the previous article we were de-multiplexing the E1 stream in Java. Now I want to try C/C++. General expectation is that it will be faster than in Java, possibly much faster; we’ll check if this is true.

By “C/C++” I mean “C with a little bit of C++”. There will be just enough C++ to make a program look a bit nicer and add a bit of safety. I mean small features, such as declaring of variable where it is first used rather than at the top. Perhaps occasional class here and there. There won’t be any multiple inheritance or curiously recurring template pattern, and the reader won’t be forced to google the difference between static and dynamic cast.

It seems natural to convert the Java code to C++ directly first, to compare the compilers and execution environments. This is what I’m going to do today; I’ll leave using C-specific low-level features for later articles. As a result, today’s discussion will be more about syntax of C/C++ than about optimisation. In particular, you will see lots of macro definitions and invocations. If you are not interested in it or if you are already a C expert, you are welcome to jump straight to the “Running it” section.

The entire project is available on Github in a repository: https://github.com/pzemtsov/article-E1-demux-C

Conversion of non-unrolled versions

The syntax of Java is close enough to that of C++ that we can almost convert one to another using find-and-replace. I will only give one example, the rest can be seen in the repository.

Java code:

interface Demux
{
    public void demux (byte[] src, byte[][] dst);
}

static final class Reference implements Demux
{
    public void demux (byte[] src, byte[][] dst)
    {
        assert src.length % NUM_TIMESLOTS == 0;

        int dst_pos = 0;
        int dst_num = 0;
        for (byte b : src) {
            dst [dst_num][dst_pos] = b;
            if (++ dst_num == NUM_TIMESLOTS) {
                dst_num = 0;
                ++ dst_pos;
            }
        }
    }
};

C++ code:

typedef unsigned char byte;

class Demux
{
public:
    virtual void demux (const byte * src, unsigned src_length, byte ** dst)
            const = 0;
};

class Reference : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (src_length % NUM_TIMESLOTS == 0);

        unsigned dst_pos = 0;
        unsigned dst_num = 0;
        for (unsigned src_pos = 0; src_pos < src_length; src_pos++) {
            dst [dst_num][dst_pos] = src [src_pos];
            if (++ dst_num == NUM_TIMESLOTS) {
                dst_num = 0;
                ++ dst_pos;
            }
        }
    }
};

Since we’re not using class libraries, the buffers are represented as byte pointers, which, unlike Java’s byte arrays, do not keep the data size and do not provide index checking. That’s why the input length must be passed as an additional parameter. It is the caller’s responsibility to ensure that memory is correctly allocated for the output buffers.

I’m using unsigned, rather than int for lengths and loop variables out of habit: in C++ the sizes are traditionally stored in variables of type size_t, which is unsigned.

Another habit explains declaration of type byte. I prefer reserving the name char for character data, and here we are dealing with bytes of general nature.

We also need a timer (replacement for Java’s System.currentTimeMillis ()). I’ve moved it to separate header file "timer.h". We’ll copy the measurement and correctness test code from Java with only minor changes. We don’t need to run tests five times, since C++ doesn’t have any warm-up effect. We’ll still run it twice, just to be safe. Another change is triggered by absence of a garbage collection in C++. All the memory we allocate must be freed. Finally, we’ll run all out versions at once, so to improve stability of results, we’ll run them in the same memory, re-using the buffers.

We are ready to run the tests, but let us convert unrolled versions first.

Conversion of unrolled versions: macro magic

We could convert unrolled versions just as we did all the others, by find and replace. However, in C/C++ we can do better, since C/C++ have a rudimental meta-programming feature: macros. See the full text in the repository.

This is how Unrolled_1 can be written down using macros:

#define MOVE_BYTE(i,j) d[i] = src [(j)+(i)*32]

#define MOVE_BYTES_64(j) do {\
   MOVE_BYTE ( 0, j); MOVE_BYTE ( 1, j); MOVE_BYTE ( 2, j); MOVE_BYTE ( 3, j);\
   MOVE_BYTE ( 4, j); MOVE_BYTE ( 5, j); MOVE_BYTE ( 6, j); MOVE_BYTE ( 7, j);\
   MOVE_BYTE ( 8, j); MOVE_BYTE ( 9, j); MOVE_BYTE (10, j); MOVE_BYTE (11, j);\
   MOVE_BYTE (12, j); MOVE_BYTE (13, j); MOVE_BYTE (14, j); MOVE_BYTE (15, j);\
   MOVE_BYTE (16, j); MOVE_BYTE (17, j); MOVE_BYTE (18, j); MOVE_BYTE (19, j);\
   MOVE_BYTE (20, j); MOVE_BYTE (21, j); MOVE_BYTE (22, j); MOVE_BYTE (23, j);\
   MOVE_BYTE (24, j); MOVE_BYTE (25, j); MOVE_BYTE (26, j); MOVE_BYTE (27, j);\
   MOVE_BYTE (28, j); MOVE_BYTE (29, j); MOVE_BYTE (30, j); MOVE_BYTE (31, j);\
   MOVE_BYTE (32, j); MOVE_BYTE (33, j); MOVE_BYTE (34, j); MOVE_BYTE (35, j);\
   MOVE_BYTE (36, j); MOVE_BYTE (37, j); MOVE_BYTE (38, j); MOVE_BYTE (39, j);\
   MOVE_BYTE (40, j); MOVE_BYTE (41, j); MOVE_BYTE (42, j); MOVE_BYTE (43, j);\
   MOVE_BYTE (44, j); MOVE_BYTE (45, j); MOVE_BYTE (46, j); MOVE_BYTE (47, j);\
   MOVE_BYTE (48, j); MOVE_BYTE (49, j); MOVE_BYTE (50, j); MOVE_BYTE (51, j);\
   MOVE_BYTE (52, j); MOVE_BYTE (53, j); MOVE_BYTE (54, j); MOVE_BYTE (55, j);\
   MOVE_BYTE (56, j); MOVE_BYTE (57, j); MOVE_BYTE (58, j); MOVE_BYTE (59, j);\
   MOVE_BYTE (60, j); MOVE_BYTE (61, j); MOVE_BYTE (62, j); MOVE_BYTE (63, j);\
} while (0)

#define MOVE_TIMESLOT(j) do {\
        byte * const d = dst[j];\
        MOVE_BYTES_64 (j);\
    } while (0)

class Unrolled_1 : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (NUM_TIMESLOTS == 32);
        assert (DST_SIZE == 64);
        assert (src_length == NUM_TIMESLOTS * DST_SIZE);

        for (unsigned j = 0; j < NUM_TIMESLOTS; j++) {
            MOVE_TIMESLOT (j);
        }
    }
};

Here I feel necessary to include some practical points on macro usage. Readers with good C experience can skip the next section:

MOVE_BYTE (x+1, y);

It would have been expanded as

d[x+1] = src[y+x+1*32]

and this is not quite what was intended.

Here the usage of macros does not save you much typing – you must still invoke MOVE_BYTE manually 64 times. But at least the actual code of MOVE_BYTE is in one place only, and can be modified easily.

By the way, you can avoid writing down all 64 invocations of MOVE_BYTE by creating multi-layered macros:

#define MOVE_BYTES_4(i,j) do {\
  MOVE_BYTE(i,  j); MOVE_BYTE(i+1,j);\
  MOVE_BYTE(i+2,j); MOVE_BYTE(i+3,j);\
} while (0)
#define MOVE_BYTES_16(j) do {\
  MOVE_BYTES_4(i,  j); MOVE_BYTES_4(i+4, j);\
  MOVE_BYTES_4(i+8,j); MOVE_BYTES_4(i+12,j);\
} while (0)
#define MOVE_BYTES_64(j) do {\
  MOVE_BYTES_16( 0,j); MOVE_BYTES_16(16,j);\
  MOVE_BYTES_16(32,j); MOVE_BYTES_16(48,j);\
} while (0)

This way you have to write 12 invocations of some macros rather than 64. The same number will result from de-composition into six layers of two invocations each. A general formula for the minimal number of invocations for a given total number doesn’t seem to be simple; perhaps some mathematically inclined reader can suggest one.

The macros we’ve just created will be useful for other versions of our method, too. For instance, a four times unrolled version looks like this:

class Unrolled_1_4 : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (NUM_TIMESLOTS == 32);
        assert (DST_SIZE == 64);
        assert (src_length == NUM_TIMESLOTS * DST_SIZE);

        for (unsigned j = 0; j < NUM_TIMESLOTS; j+=4) {
            MOVE_TIMESLOT (j);
            MOVE_TIMESLOT (j+1);
            MOVE_TIMESLOT (j+2);
            MOVE_TIMESLOT (j+3);
        }
    }
};

And fully unrolled version looks like this:

class Unrolled_2_Full : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (NUM_TIMESLOTS == 32);
        assert (DST_SIZE == 64);
        assert (src_length == NUM_TIMESLOTS * DST_SIZE);

        MOVE_TIMESLOT ( 0); MOVE_TIMESLOT ( 1); MOVE_TIMESLOT ( 2); MOVE_TIMESLOT ( 3);
        MOVE_TIMESLOT ( 4); MOVE_TIMESLOT ( 5); MOVE_TIMESLOT ( 6); MOVE_TIMESLOT ( 7);
        MOVE_TIMESLOT ( 8); MOVE_TIMESLOT ( 9); MOVE_TIMESLOT (10); MOVE_TIMESLOT (11);
        MOVE_TIMESLOT (12); MOVE_TIMESLOT (13); MOVE_TIMESLOT (14); MOVE_TIMESLOT (15);
        MOVE_TIMESLOT (16); MOVE_TIMESLOT (17); MOVE_TIMESLOT (18); MOVE_TIMESLOT (19);
        MOVE_TIMESLOT (20); MOVE_TIMESLOT (21); MOVE_TIMESLOT (22); MOVE_TIMESLOT (23);
        MOVE_TIMESLOT (24); MOVE_TIMESLOT (25); MOVE_TIMESLOT (26); MOVE_TIMESLOT (27);
        MOVE_TIMESLOT (28); MOVE_TIMESLOT (29); MOVE_TIMESLOT (30); MOVE_TIMESLOT (31);
    }
};

This is much smaller in source size than the version in Java.

A sad comment on meta-programming

Meta-programming is an automatic generation of a program or some part of it. This is what I was doing in Java (all unrolled versions were generated). There are lots of external tools that perform some form of meta-programming (pre-processors, language converters, etc.), but the most convenient is when meta-programming is supported by the language itself. This means that some part of a program can be executed at compile-time, producing real program, which is then compiled into the executable code. As you could see in the previous section, it is a very useful feature for high-performance programming, which involves lots of inlining and unrolling. Unfortunately, such a feature is a rare commodity in modern programming languages. Java does not have it (generics do not count and dynamic code generation is quite troublesome and heavy-weight), neither does Go or Python. C has a rudimentary version of it in the form of macro expansion. In addition, C++ employs templates, which are very powerful but not easy to use and a bit ugly. They are not actually designed for full-scale meta-programming, but rather for generic type-independent programming, that’s why the primary working mechanism there is pattern matching, which for me isn’t always the programming tool of choice. It is not impossible to implement the above examples using templates but the code won’t look nice and it is yet to be seen how fast it would run.

In the old days meta-programming was more common. It was widely available in macro-assemblers, where main language (assembler) was enriched with macro-features, which included variables, loops, branches and other high-level features. But the best by far in that respect was the approach of PL/I, where macro-language was PL/I itself, just the lines were marked to be run at compile time. Should such a feature be available in C++, out program would have looked much nicer:

class Unrolled_2_Full : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (src_length % NUM_TIMESLOTS == 0);
        %for (unsigned %dst_num = 0; %dst_num < NUM_TIMESLOTS; ++ %dst_num) {
            %for (unsigned %dst_pos = 0; %dst_pos < DST_SIZE; ++ %dst_pos) {
                dst [%dst_num][%dst_pos] = src [%dst_pos * NUM_TIMESLOTS + %dst_num];
            %}
        %}
    }
};

I assume imaginary version of C++ where percentage mark indicates code that runs at compile time. As you can see, the code looks very close to Dst_First_1. Unfortunately, such a version does not exist, and we have to write ugly macro definitions. What a pity.

More macro magic: parameter concatenation

In our Java code there is a version that involves 32 small functions (Unrolled_3). Quite frankly, I don’t expect its performance to be any better than performance of other unrolled versions in C++; in addition you can recall that the very reason for its existence was HotSpot limitation of method size. However, for the sake of completeness I’ll convert this function as well, using another handy feature of C’s macro language: token concatenation. A combination of two hash characters (##), used in a macro, causes concatenation of text to the left and to the right of it without any whitespaces, allowing creation of new lexems such as function names.

class Unrolled_3 : public Demux
{
public:
    void demux (const byte * src, unsigned src_length, byte ** dst) const
    {
        assert (src_length % NUM_TIMESLOTS == 0);

#define CALL_DEMUX(i) demux_##i (src, dst[i])

        CALL_DEMUX ( 0); CALL_DEMUX ( 1); CALL_DEMUX ( 2); CALL_DEMUX ( 3);
        CALL_DEMUX ( 4); CALL_DEMUX ( 5); CALL_DEMUX ( 6); CALL_DEMUX ( 7);
        CALL_DEMUX ( 8); CALL_DEMUX ( 9); CALL_DEMUX (10); CALL_DEMUX (11);
        CALL_DEMUX (12); CALL_DEMUX (13); CALL_DEMUX (14); CALL_DEMUX (15);
        CALL_DEMUX (16); CALL_DEMUX (17); CALL_DEMUX (18); CALL_DEMUX (19);
        CALL_DEMUX (20); CALL_DEMUX (21); CALL_DEMUX (22); CALL_DEMUX (23);
        CALL_DEMUX (24); CALL_DEMUX (25); CALL_DEMUX (26); CALL_DEMUX (27);
        CALL_DEMUX (28); CALL_DEMUX (29); CALL_DEMUX (30); CALL_DEMUX (31);
#undef CALL_DEMUX
    }

private:

#define DEF_DEMUX(i) \
    inline void demux_##i (const byte * src, byte * d) const\
    {\
        MOVE_BYTES (i);\
    }

    DEF_DEMUX ( 0) DEF_DEMUX ( 1) DEF_DEMUX ( 2) DEF_DEMUX ( 3)
    DEF_DEMUX ( 4) DEF_DEMUX ( 5) DEF_DEMUX ( 6) DEF_DEMUX ( 7)
    DEF_DEMUX ( 8) DEF_DEMUX ( 9) DEF_DEMUX (10) DEF_DEMUX (11)
    DEF_DEMUX (12) DEF_DEMUX (13) DEF_DEMUX (14) DEF_DEMUX (15)
    DEF_DEMUX (16) DEF_DEMUX (17) DEF_DEMUX (18) DEF_DEMUX (19)
    DEF_DEMUX (20) DEF_DEMUX (21) DEF_DEMUX (22) DEF_DEMUX (23)
    DEF_DEMUX (24) DEF_DEMUX (25) DEF_DEMUX (26) DEF_DEMUX (27)
    DEF_DEMUX (28) DEF_DEMUX (29) DEF_DEMUX (30) DEF_DEMUX (31)
#undef DEF_DEMUX
};

You can see how using ## helps create necessary method names. Note absence of semicolons between calls to DEF_DEMUX: in C/C++ function definitions do not end with semicolons (but class definitions do!)

Making things shorter: macros as macro parameters

Our code contains manually written sequences of macro calls of size 64 (as a MOVE_BYTES definition), one each of size 2, 4, 8 and 16 (various Unrolled_1_x versions) and four of size 32 (all fully unrolled versions). A code can be made a bit shorter by introducing a generic macro duplicator – a poor man’s meta-programming loop. Such a duplicator (also a macro) can take a macro as a parameter and call it given (constant) number of times. This is based on a very handy feature of macros: although a macro with parameters can’t be invoked with different number of parameters, using its name without any parameters at all is allowed, the macro won’t be expanded then. This allows passing name of one macro as a parameter to another macro.

There are three ways to write such a multiplier. The simplest (but the longest) involves a lot of typing:

#define DUP_2(m)  do { m(0); m(1); } while (0)
#define DUP_4(m)  do { m(0); m(1); m(2); m(3); } while (0)
#define DUP_8(m)  do { m(0); m(1); m(2); m(3); m(4); m(5); m(6); m(7); }\
                  while (0)

And so on – definition of DUP_64 will contain 64 calls to m().

We can save a bit of typing by basing higher-order duplicator on lower-order one. A partial solution looks like this:

#define DUP_2(m)  do {            m(0); m(1); } while (0)
#define DUP_4(m)  do { DUP_2(m);  m(2); m(3); } while (0)
#define DUP_8(m)  do { DUP_4(m);  m(4); m(5); m(6); m(7); } while (0)

Here the lists of m() being invoked are exactly half the size.

Another approach is to call each previous macro twice:

#define DUP_2_(m, index)  do { m (index); m (index+1); } while (0)
#define DUP_4_(m, index)  do { DUP_2_ (m, index); DUP_2_ (m, index+2); }\
                          while (0)
#define DUP_8_(m, index)  do { DUP_4_ (m, index); DUP_4_ (m, index+4); }\
                          while (0)

#define DUP_2(m)  DUP_2_ (m, 0)
#define DUP_4(m)  DUP_4_ (m, 0)
#define DUP_8(m)  DUP_8_ (m, 0)

This approach has two significant advantages. It allows writing down loops of arbitrary sizes without much typing (a macro definition for any loop of size that is power of two has exactly two macro invocations; for other loop sizes the number may be bigger but is always limited by number of bits in the loop size). It also provides, as a by-product, a set of loops that don’t start at zero but at some given number, and we need these for our partially unrolled loop solutions. It has, however, one disadvantage. The indices in macro expansions appear as arithmetic expressions rather than as direct numbers; and this makes it impossible to use this macro with CALL_DEMUX above: the ## concatenation operation does not evaluate expressions before concatenation. That’s why we will use the partial solution. Versions of code with both solutions are in the repository: this is the solution we discard and this is the solution we’ll use.

Neither of these approaches works for DEF_DEMUX, because DEF_DEMUX does not allow semicolons between calls.

The macros above only work with argument macros that take one parameter. For more parameters, separate series of macros must be created. I called them DUP2_xx.

Using these macros, for instance, the sequence of CALL_DEMUX in the previous example (Unrolled_3) will become just

DUP_32 (CALL_DEMUX)

Partially unrolled methods require a service macro that relies on the name of the loop variable (j) and adds it to the offset parameter:

#define MOVE_TIMESLOT_J(offset) MOVE_TIMESLOT (j + offset)

What is nice is that DUP_xx macros are now generic and can be moved to a separate header file and later re-used in other projects. I’ve moved them to mymacros.h file.

As a result of applying macros and multipliers the entire size of the test program became 383 lines, while in Java just the Unrolled_2_Full was 553 lines long.

Running it

We are ready to run it all. We’ll be using the same system as for the Java testing: Linux, running on a Dell blade with Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz. We’ll use gcc 4.6.

# c++ -O3 -o e1 e1.cpp -lrt

(-O3 indicates the highest level of optimisation, and -lrt is needed to include relevant run-time library).

# ./e1
9Reference: 1937 1930
11Src_First_1: 1890 1878
11Src_First_2: 1920 1916
11Src_First_3: 1890 1891
11Dst_First_1: 1465 1465
11Dst_First_2: 1443 1444
11Dst_First_3: 1758 1758
10Unrolled_1: 632 633
12Unrolled_1_2: 633 633
12Unrolled_1_4: 655 655
12Unrolled_1_8: 648 649
13Unrolled_1_16: 635 633
15Unrolled_2_Full: 635 634
10Unrolled_3: 634 634
10Unrolled_4: 653 653

You might be surprised by the names in the first columns; so am I. We are lucky that we can read it at all. The C++ standard does not require any particular format of a type name as returned by typeid call, so the class name prepended by the number of characters in it isn’t the worst possible case.

It is clearly visible that the results for two runs are quite stable. It means that running the test twice is unnesessary and we can remove appropriate code.

# ./e1
9Reference: 1939
11Src_First_1: 1885
11Src_First_2: 1924
11Src_First_3: 1892
11Dst_First_1: 1467
11Dst_First_2: 1445
11Dst_First_3: 1761
10Unrolled_1: 633
12Unrolled_1_2: 634
12Unrolled_1_4: 654
12Unrolled_1_8: 650
13Unrolled_1_16: 635
15Unrolled_2_Full: 635
10Unrolled_3: 635
10Unrolled_4: 655

Let’s recall the results for Java version and put them together in the same table:

Version Time in Java Time in C
Reference 2860 1939
Src_First_1 2481 1885
Src_First_2 2284 1924
Src_First_3 4360 1892
Dst_First_1 1155 1467
Dst_First_2 2093 1445
Dst_First_3 1022 1761
Unrolled_1 659 633
Unrolled_1_2 654 634
Unrolled_1_4 636 654
Unrolled_1_8 637 650
Unrolled_1_16 25904 635
Unrolled_2_Full 15630 635
Unrolled_3 790 635
Unrolled_4 711 655

Here comes our first sensation: C++ is not always faster than Java. Both Dst_First_1 and Dst_First_3 run quite a bit faster in Java than in C++. I’ll come back to this phenomenon in a later article. Let’s look at other observations:

Notes for myself

Some points to investigate:

Coming soon

Can C++ code be made faster without rewriting – perhaps there are some secret compiler keys? Can the speed be improved any more? Answers will follow soon.

comments powered by Disqus