Experiments in program optimisation

How to transpose a 16×16 byte matrix using SSE

Story: De-multiplexing of E1 streams

Tags: C C++ GCC optimisation SSE AVX

01 Oct 2014


In “De-multiplexing of E1 stream: converting to C” we tried different ways to de-multiplex E1 streams using SSE and AVX. This problem can be described as transposition of a matrix. In our case the dimensions of the source matrix are 32×64 (32 columns and 64 rows, where 32 is the number of timeslots and 64 is the size of each timeslot’s output). The matrix is assumed to be stored in memory row by row.

The usual way to transpose this matrix is to divide it into small blocks that fit into the available registers, and transpose each block separately. We tried this using blocks of size 1×4, 1×8, 4×4, 4×16, 8×16, 4×32 and 8×32. The best results were achieved using 8×16 blocks with SSE (120 ms for 1,000,000 iterations), and 8×32 blocks with AVX (109 ms). The times were measured on an Intel® Xeon® CPU E5-2670 @ 2.60 GHz.

In the same article I briefly mentioned an attempt to use 16×16 blocks. I didn’t go into the details then, to keep the article short, but I have since changed my mind: looking at the code later, I found it rather beautiful. So today we’ll be transposing a 16×16 byte matrix.


We’ll assume that the matrix is represented as 16 SSE variables (of type __m128i), where each variable keeps one row (16 bytes). The result must be stored the same way.

When designing high-performance matrix manipulations using SSE, we developed two utility functions performing two important manipulations. The first one considered a single SSE value as a 4×4 matrix and transposed it:

inline __m128i transpose_4x4 (__m128i m)
{
    return _mm_shuffle_epi8 (m, _mm_setr_epi8 (0, 4, 8, 12,
                                               1, 5, 9, 13,
                                               2, 6, 10, 14,
                                               3, 7, 11, 15));
}

The second one took four SSE variables, interpreted them as a 4×4 matrix of double-words (4-byte entities) and transposed this matrix:

inline void transpose_4x4_dwords (__m128i w0, __m128i w1,
                                  __m128i w2, __m128i w3,
                                  __m128i &r0, __m128i &r1,
                                  __m128i &r2, __m128i &r3)
{
    // 0  1  2  3
    // 4  5  6  7
    // 8  9  10 11
    // 12 13 14 15

    __m128i x0 = _128i_shuffle (w0, w1, 0, 1, 0, 1); // 0 1 4 5
    __m128i x1 = _128i_shuffle (w0, w1, 2, 3, 2, 3); // 2 3 6 7
    __m128i x2 = _128i_shuffle (w2, w3, 0, 1, 0, 1); // 8 9 12 13
    __m128i x3 = _128i_shuffle (w2, w3, 2, 3, 2, 3); // 10 11 14 15

    r0 = _128i_shuffle (x0, x2, 0, 2, 0, 2);
    r1 = _128i_shuffle (x0, x2, 1, 3, 1, 3);
    r2 = _128i_shuffle (x1, x3, 0, 2, 0, 2);
    r3 = _128i_shuffle (x1, x3, 1, 3, 1, 3);
}

_128i_shuffle is our own macro defined in sse.h. It is a more convenient version of _mm_shuffle_ps.

Now here is the plan: We’ll apply the second procedure to four groups of variables, each representing a 16×4 matrix, then apply the first procedure to 16 variables, each representing a 4×4 matrix, and then apply the second procedure again to four groups of four registers along the columns. And that’s it, the job is done.

It is easy to see why it works. Let’s assume that the matrix contains hex values from 0x00 to 0xFF. Look at the first four rows of the matrix, stored in four SSE registers, x0 to x3:

x0: 000102030405060708090A0B0C0D0E0F
x1: 101112131415161718191A1B1C1D1E1F
x2: 202122232425262728292A2B2C2D2E2F
x3: 303132333435363738393A3B3C3D3E3F

We want to collect the leftmost 4×4 submatrix (from 00 to 33) into one SSE variable, the next one (from 04 to 37) into the next, and so on. The first sub-matrix consists of the first double-words of all the source values, the second one of their second double-words, and so on. In short, this procedure is in fact a transposition of a 4×4 double-word matrix, and it can be done like this:

transpose_4x4_dwords (x0, x1, x2, x3, w00, w01, w02, w03);

This is the result of this procedure:

w00: 00010203101112132021222330313233
w01: 04050607141516172425262734353637
w02: 08090A0B18191A1B28292A2B38393A3B
w03: 0C0D0E0F1C1D1E1F2C2D2E2F3C3D3E3F

After we run this procedure for all four 16×4 strips, we’ll have the entire source matrix stored in SSE variables in the following way:

w00 w01 w02 w03
w10 w11 w12 w13
w20 w21 w22 w23
w30 w31 w32 w33
We named the new variables according to their positions in this matrix.

The next step is to transpose each of w00–w33 as a 4×4 matrix. The transpose_4x4 function will do the job.

Now we are ready to collect the results. We want to put the columns of the original 16×16 matrix back into the x0–x15 variables. Before the transposition of w00–w33, the first column of the source matrix consisted of the first columns of w00–w30, the second one of their second columns, and so on. After the transposition we must use the rows instead, which are exactly the double-word components of the variables. This means that we need to transpose a double-word matrix again:

transpose_4x4_dwords (w00, w10, w20, w30, x0, x1, x2, x3);

Here is the content of w00–w30 before this operation:

w00: 00102030011121310212223203132333
w10: 40506070415161714252627243536373
w20: 8090A0B08191A1B18292A2B28393A3B3
w30: C0D0E0F0C1D1E1F1C2D2E2F2C3D3E3F3

And this is the result (stored back into x0–x3):

x0: 00102030405060708090A0B0C0D0E0F0
x1: 01112131415161718191A1B1C1D1E1F1
x2: 02122232425262728292A2B2C2D2E2F2
x3: 03132333435363738393A3B3C3D3E3F3

You can see that the variables contain the first four columns of the original matrix. If we perform this operation on the other groups of four variables, we’ll get the entire matrix.

Here is the code:

inline void transpose_16x16 (
                __m128i&  x0, __m128i&  x1, __m128i&  x2, __m128i&  x3,
                __m128i&  x4, __m128i&  x5, __m128i&  x6, __m128i&  x7,
                __m128i&  x8, __m128i&  x9, __m128i& x10, __m128i& x11,
                __m128i& x12, __m128i& x13, __m128i& x14, __m128i& x15)
{
    __m128i w00, w01, w02, w03;
    __m128i w10, w11, w12, w13;
    __m128i w20, w21, w22, w23;
    __m128i w30, w31, w32, w33;

    transpose_4x4_dwords ( x0,  x1,  x2,  x3, w00, w01, w02, w03);
    transpose_4x4_dwords ( x4,  x5,  x6,  x7, w10, w11, w12, w13);
    transpose_4x4_dwords ( x8,  x9, x10, x11, w20, w21, w22, w23);
    transpose_4x4_dwords (x12, x13, x14, x15, w30, w31, w32, w33);
    w00 = transpose_4x4 (w00);
    w01 = transpose_4x4 (w01);
    w02 = transpose_4x4 (w02);
    w03 = transpose_4x4 (w03);
    w10 = transpose_4x4 (w10);
    w11 = transpose_4x4 (w11);
    w12 = transpose_4x4 (w12);
    w13 = transpose_4x4 (w13);
    w20 = transpose_4x4 (w20);
    w21 = transpose_4x4 (w21);
    w22 = transpose_4x4 (w22);
    w23 = transpose_4x4 (w23);
    w30 = transpose_4x4 (w30);
    w31 = transpose_4x4 (w31);
    w32 = transpose_4x4 (w32);
    w33 = transpose_4x4 (w33);
    transpose_4x4_dwords (w00, w10, w20, w30,  x0,  x1,  x2, x3);
    transpose_4x4_dwords (w01, w11, w21, w31,  x4,  x5,  x6, x7);
    transpose_4x4_dwords (w02, w12, w22, w32,  x8,  x9, x10, x11);
    transpose_4x4_dwords (w03, w13, w23, w33, x12, x13, x14, x15);
}

De-multiplexing of the E1 stream using the 16×16 transposition

This is what the de-multiplexing procedure looks like:

class Read16_Write16_SSE : public Demux
{
public:
    void demux (const byte * src, size_t src_length, byte ** dst) const
    {
        assert (src_length == NUM_TIMESLOTS * DST_SIZE);
        assert (DST_SIZE % 16 == 0);
        assert (NUM_TIMESLOTS % 16 == 0);

        for (size_t dst_num = 0; dst_num < NUM_TIMESLOTS; dst_num += 16) {
            byte * d0 = dst [dst_num + 0];
            byte * d1 = dst [dst_num + 1];
            byte * d2 = dst [dst_num + 2];
            byte * d3 = dst [dst_num + 3];
            byte * d4 = dst [dst_num + 4];
            byte * d5 = dst [dst_num + 5];
            byte * d6 = dst [dst_num + 6];
            byte * d7 = dst [dst_num + 7];
            byte * d8 = dst [dst_num + 8];
            byte * d9 = dst [dst_num + 9];
            byte * d10= dst [dst_num +10];
            byte * d11= dst [dst_num +11];
            byte * d12= dst [dst_num +12];
            byte * d13= dst [dst_num +13];
            byte * d14= dst [dst_num +14];
            byte * d15= dst [dst_num +15];
            for (size_t dst_pos = 0; dst_pos < DST_SIZE; dst_pos += 16) {

#define LOADREG(i) __m128i w##i = \
                _128i_load (&src [(dst_pos + i) * NUM_TIMESLOTS + dst_num])

#define STOREREG(i) _128i_store (&d##i [dst_pos], w##i)

                LOADREG (0);  LOADREG (1);  LOADREG (2);  LOADREG (3);
                LOADREG (4);  LOADREG (5);  LOADREG (6);  LOADREG (7);
                LOADREG (8);  LOADREG (9);  LOADREG (10); LOADREG (11);
                LOADREG (12); LOADREG (13); LOADREG (14); LOADREG (15);
                transpose_16x16 (w0,  w1, w2,  w3,  w4,  w5,  w6,  w7,
                                 w8,  w9, w10, w11, w12, w13, w14, w15);
                STOREREG (0);  STOREREG (1);  STOREREG (2);  STOREREG (3);
                STOREREG (4);  STOREREG (5);  STOREREG (6);  STOREREG (7);
                STOREREG (8);  STOREREG (9);  STOREREG (10); STOREREG (11);
                STOREREG (12); STOREREG (13); STOREREG (14); STOREREG (15);
#undef LOADREG
#undef STOREREG
            }
        }
    }
};

At first this code may look very inefficient. The transpose_16x16 function is called with 16 parameters passed by reference. In the general case this requires all the variables to be stored in memory and their addresses passed into the function. The same applies to the calls to transpose_4x4_dwords, which returns its results via reference parameters. If these calls aren’t inlined, this is exactly what will happen, so for this code to be efficient, inlining is absolutely essential.

If all the calls are inlined, the functions act as safe macros, and all the references are optimised out. There is no need to store all the variables in memory then. The code may end up relatively efficient, although some performance penalties are inevitable.

First of all, no matter how the compiler reorders the instructions, there is always some point in the code with at least 16 live SSE values. In addition, some registers are needed for temporary values. Since there are only 16 SSE registers, there is a definite register shortage here, and some values will end up in memory.

Secondly, the same applies to the pointers d0–d15, which are all live inside the inner loop. The processor has only 16 general-purpose registers, so there is a shortage here as well. The impact of this shortage may depend on the compiler and the processor in use.

Achieved performance

On our test machine this method takes 176 ms, which is a good result, but not the best. The best time (109 ms) was achieved using 8×32 blocks and AVX instructions.

I also tried it on my notebook (Intel(R) Core(TM) i7-4800MQ CPU @ 2.70 GHz), using Microsoft Visual C++ 2013 in 64-bit mode. This is what I got:

class Reference: 1991
class Write4: 578
class Write8: 545
class Read4_Write4: 685
class Read4_Write4_Unroll: 649
class Read4_Write4_SSE: 317
class Read4_Write16_SSE: 235
class Read8_Write16_SSE: 236
class Read8_Write16_SSE_Unroll: 236
class Read16_Write16_SSE: 208
class Read16_Write16_SSE_Unroll: 204
class Read4_Write32_AVX: 275
class Read8_Write32_AVX: 260
class Read8_Write32_AVX_Unroll: 256

We can see that on Windows our code is the fastest. This is why we mustn’t discard it – it can still be useful. Besides, it allows easy adaptation for AVX2 once it becomes available for testing. With AVX2 we can keep two rows of the matrix in one AVX register, so the entire matrix will require only 8 registers. We need AVX2 for that, because AVX does not provide byte-manipulation instructions (such as PSHUFB) for 256-bit registers.


Coming soon

This method is still not unrolled. I’ll do that in the next article.
