Background
In “Demultiplexing of E1 stream: converting to C” we tried different ways to demultiplex E1 streams using SSE and AVX. This problem can be described as transposition of a matrix. In our case the dimensions of the source matrix are 32×64 (32 columns and 64 rows, where 32 is the number of timeslots and 64 is the size of each timeslot’s output). The matrix is assumed stored in memory along the rows.
The usual way to transpose this matrix is to divide it into small blocks that fit into available registers, and transpose each block separately. We tried this using blocks of size 1×4, 1×8, 4×4, 4×16, 8×16, 4×32 and 8×32. The best results were achieved using 8×16 blocks with SSE (120 ms for 1,000,000 iterations), and 8×32 blocks with AVX (109 ms). The times were measured on the Intel® Xeon® CPU E52670 @ 2.60GHz.
In the same article I briefly mentioned an attempt to use 16×16 blocks. I didn’t go into the details then, to keep the article shorter. Now I changed my mind. Looking later at the code, I considered it rather beautiful. So today we’ll be transposing a 16×16 byte matrix.
Transposition
We’ll assume that the matrix is represented as 16 SSE variables (of type __m128i
), where each variable keeps one
row (16 bytes). The result must be stored the same way.
When designing highperformance matrix manipulations using SSE, we developed two utility functions performing two important manipulations. The first one considered a single SSE value as a 4×4 matrix and transposed it:
The second one took four SSE variables, interpreted them as a 4×4 matrix of doublewords (4byte entities) and transposed this matrix:
_128i_shuffle
is our own macro defined in sse.h
. It is a more convenient version of _mm_shuffle_ps
.
Now here is the plan: We’ll apply the second procedure to four groups of variables, each representing a 16×4 matrix, then apply the first procedure to 16 variables, each representing a 4×4 matrix, and then apply the second procedure again to four groups of four registers along the columns. And that’s it, the job is done.
It is easy to see why it works. Let’s assume that the matrix contains hex values from 0x00
to 0xFF
.
Look at the first four rows of the matrix, stored in four SSE registers, x0
to x3
:
x0:  00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F 
x1:  10  11  12  13  14  15  16  17  18  19  1A  1B  1C  1D  1E  1F 
x2:  20  21  22  23  24  25  26  27  28  29  2A  2B  2C  2D  2E  2F 
x3:  30  31  32  33  34  35  36  37  38  39  3A  3B  3C  3D  3E  3F 
We want to collect the leftmost 4×4 submatrix (from 00
to 33
) into one SSE variable, the next one
(from 04
to 37
) into the next one, and so on. The first submatrix consists of first doublewords of
all of the source values, the second one of the second ones, and so on. In short, this procedure is in fact transposition
of a 4×4 doubleword matrix, and it can be done like this:
This is the result of this procedure:
w00:  00  01  02  03  10  11  12  13  20  21  22  23  30  31  32  33 
w01:  04  05  06  07  14  15  16  17  24  25  26  27  34  35  36  37 
w02:  08  09  0A  0B  18  19  19  1A  28  29  2A  2A  38  39  3A  3B 
w03:  0C  0D  0E  0F  1C  1D  1E  1F  2C  2D  2E  2F  3C  3D  3E  3F 
After we run this procedure for all four 16×4 strips, we’ll have the entire source matrix stored in SSE variables in the following way:
w00 
w01 
w02 
w03 

w10 
w11 
w12 
w13 

w20 
w21 
w22 
w23 

w30 
w31 
w32 
w33 
We named the new variables according to their positions in this matrix.
The next step is to transpose each of the w00
… w33
as a 4×4 matrix. The transpose_4x4
will do the job.
Now we are ready to collect the results. We want to put columns of the original 16×16 matrix back into
x0
… x15
variables. Before transposition of w00
… w33
, the first column of the source matrix
consisted of the first columns of w00
… w30
, the second one from the second coluns snd so on.
After the transposition we must use the rows, which are exactly the doubleword components of the variables.
This means that we need to transpose a doubleword matrix again:
Here is the content of w00
… w30
before this operation:
w00:  00  10  20  30  01  11  21  31  02  12  22  32  03  13  23  33 
w10:  40  50  60  70  41  51  61  71  42  52  62  72  43  53  63  73 
w20:  80  90  A0  B0  81  91  A1  B1  82  92  A2  B2  83  93  A3  B3 
w30:  C0  D0  E0  F0  C1  D1  E1  F1  C2  D2  E2  F2  C3  D3  E3  F3 
And this is the result (stored back into x0
… x3
):
x0:  00  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0 
x1:  01  11  21  31  41  51  61  71  81  91  A1  B1  C1  D1  E1  F1 
x2:  02  12  22  32  42  52  62  72  82  92  A2  B2  C2  D2  E2  F2 
x3:  03  13  23  33  43  53  63  73  83  93  A3  B3  C3  D3  E3  F3 
You can see that the variables contain four first columns of the original matrix. If we perform this operation on other groups of four variables, we’ll get the entire matrix.
Here is the code:
Demultiplexing of the E1 stream using the 16×16 transposition
This is what the demultiplexing procedure looks like:
At first this code may look very inefficient. The transpose_16x16
function is called with 16 parameters passed
by reference. In general case this requires all the variables to be stored in memory and their addresses passed
into the function. The same applies to calls to transpose_4x4_dwords
, which returns results via the reference
parameters. If those calls aren’t inlined, this is exactly what will happen, so for this code to be efficient,
the inlining is absolutely essential.
If all the calls are inlined, the functions act as safe macros, and all the references are optimised out. There is no need to store all the variables in memory then. The code may end up relatively efficient, although some performance penalties are inevitable.
First of all, no matter how the compiler reorders the instructions, there is always some place in the code where there are at least 16 live SSE variables. In addition, some registers are needed for temporary values. Since SSE only has 16 registers, there is definite register shortage here, and some values will end up in memory.
Secondly,
the same applies to pointers d0
… d15
, which are all alive inside the inner loop. The processor has 16
generalpurpose registers, so there is a shortage here as well. The impact of this shortage may depend on the
compiler and the processor in use.
Achieved performance
On our test machine this methods takes 176ms, which is good result, but not the best. The best time was achieved using 8×32 blocks and AVX instructions.
I also tried it on my notebook (Intel(R) Core(TM) i74800MQ CPU @2.70 GHz), using Microsoft Visual C++ 2013 in 64bit mode. This is what I got:
>e1new.exe
class Reference: 1991
class Write4: 578
class Write8: 545
class Read4_Write4: 685
class Read4_Write4_Unroll: 649
class Read4_Write4_SSE: 317
class Read4_Write16_SSE: 235
class Read8_Write16_SSE: 236
class Read8_Write16_SSE_Unroll: 236
class Read16_Write16_SSE: 208
class Read16_Write16_SSE_Unroll: 204
class Read4_Write32_AVX: 275
class Read8_Write32_AVX: 260
class Read8_Write32_AVX_Unroll: 256
We can see that on Windows our code is the fastest.
This is why we mustn’t discard it – it can still be useful.
Besides, it allows easy adaptation for AVX2 once it becomes available for testing. On AVX2 we can keep two rows
of the matrix in one AVX register, and the entire matrix will require 8 registers. We need AVX2 for that, because
AVX does not have byte manipulation instruction (PSHUFB
).
Conclusions

The code for this method is quite beautiful (perhaps, the most beautiful of all).

However, the most beautiful isn’t always the fastest. But sometimes it is.

Different methods may be the best for different execution environments. Perhaps, the correct thing to do is to keep several versions and choose the best one at run time – either by means of static configuration, or by automatic discovery.
Coming soon
This method is still not unrolled. I’ll do it in the next article.