Today we are going to look at one of the most often used tools in parallel computing: a queue. There are many types of queues; we’ll now concentrate on one of them: the single-producer, single-consumer queue. Being the simplest of them all, this type still allows many important uses:
- arranging a pipeline: if some job allows decomposition into sequentially performed operations, we can give each operation its own thread. In this case a pipe (a single-producer, single-consumer queue) is placed between each two consecutive stages of the pipeline;
- handling abnormal spikes in the incoming data rate (buffering);
- simplifying the program design by replacing callbacks with direct calls or vice versa. It is usually much easier and less error-prone to iterate a complex data structure using callbacks than iterators (no need to store complex iterator states). On the other hand, the consumers often benefit from direct iterating, for the same reason. Queues can help connect the modules; this seems to be the preferred way of operation in the Go programming language;
- while this is not a typical use of queues, this was, in fact, the main motivation for this research. The idea is to use queues as clock sources. The plan is to build a fast queue with endpoints in C++ and Java. The C++ thread will write tokens to the queue at equal time intervals, while the Java thread will read them and perform some operation. Any delay in this operation will increase the amount of queued data, causing queue overflow and, in extreme cases, data loss. This will allow measuring the effect of garbage collection and other delays in Java.
The safest way to design a multi-threaded program is to assume relatively slow inter-thread communication (coarse parallelism). Does it have to be so? Is there perhaps some implementation of a queue that can be used for fine-grain parallelism?
We have arrived at the topic of this article: what is the speed that can be achieved by a single-producer, single-consumer queue?
I have to warn the reader that this is going to be a very long journey. I recommend a reader who is short of time to scroll down straight to the “Dual-array queue” section.
Blocking vs non-blocking
In most cases, when a consumer encounters an empty queue, it has nothing to do and may as well block. Since the wait-notify mechanism is often integrated with the synchronisation mechanism (like in Java), it makes sense to incorporate blocking of the consumer into the queue mechanism itself.
A bigger problem is what to do when a producer encounters a full queue. There are three options then:
- to drop the data
- to grow the queue
- to block.
There are cases where integrity of the data is so important that nothing may be allowed to drop. Often in these cases we have control over the input data rate (e.g. the data is read from the DB or from the disk). Blocking is a preferred option then. In the list of the use cases above, one case (simplifying iterators) definitely requires blocking.
However, in real-time programming the incoming data rate is often outside of our control. For instance, when analysing the network traffic, demodulating the radio waves or processing the streaming video, we must deal with the data at the rate it comes. The queue can help with sudden short bursts of data, but if the data continuously comes faster than we process it, there is no choice but to drop some portions of it. Dropping the data at the input of a queue is the best and the most commonly used option, and the amount of the data dropped is the measure of how well-dimensioned the system is.
In short, we’ll concentrate on single-producer, single-consumer, producer-dropping, consumer-blocking queues. We’ll study the producer-blocking queues later, too.
We’ll write in C++, with porting the solution to Java in mind.
The test
The test will involve two threads, the producer and the consumer, running on two different cores and connected by a queue. We’ll try to maximise the queue throughput, giving it priority over the overall system performance, meaning that we won’t hesitate to use spin loops where needed, even though they waste CPU cycles and burn extra joules.
We’ll be using very short messages: a single int32_t (four bytes). Each message will contain a sequence number, which will be used to detect packet loss.
The consumer will measure the packet loss and display it after some pre-defined number of messages received (10M).
I found it very difficult to measure the performance of the queue on its own, without any client application. Even for the slow implementations, the variation is big, despite the stabilising effort (see later). The measurement is even more problematic when we approach nanosecond intervals. Often the slight variations in the input data rate cause big differences in the resulting throughput. This means that all the results achieved must be viewed only as approximations. The top performers will probably do better in real life than the outsiders, but finer comparisons will always be imprecise.
This is what we are going to measure now. The producer will work in two modes:
- a performance mode, when it sends messages as fast as it can; we’ll measure how long an average write takes and publish it as the Writing time (in nanoseconds). This, on its own, isn’t a good performance metric: there is no point in writing fast if we can’t read fast. However, it gives some measure of the overhead of the synchronisation mechanism used;
- a clock mode, when it sends messages at a certain frequency, determined by the specified time interval. We’ll adjust this interval manually to find the shortest one at which packets aren’t lost. This interval will be published as the Good interval, and the corresponding packet rate as the Throughput.
In the clock mode we’ll also measure the average queue size, as measured by the consumer at each read operation. Since it slows down the operation, the size measurement will be switched on and off by a conditional compilation variable.
We’ve learned that time reporting is not always fast in Linux, and even when it is, it is not as fast as using the processor’s time stamp counter (TSC).
This is why we’ll be using the RDTSC instruction to set the clock rate of the producer. This limits the code to Intel processors, and this is where we’ll run the test: a dual Xeon® CPU E5-2620 v3 @ 2.40GHz, using Linux with a 3.17.4 kernel, and GCC 4.8.3.
The queue is defined by the following interface:
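A minimal sketch of what such an interface could look like (the names here are illustrative; the exact definitions are in the repository):

```cpp
#include <cstdint>
#include <cstddef>

// A sketch of the queue interface: a fixed capacity, a write that drops the
// value on overflow, and a read that waits for data.
class Queue
{
public:
    explicit Queue(size_t size);   // construct a queue of the given capacity
    bool write(int32_t value);     // producer side: returns false if the value was dropped
    int32_t read();                // consumer side: waits (blocks or spins) for a value
};
```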
The test framework consists of the following components, all running in their own threads:
- DataSource: writes data to the queue at time intervals provided by a timer. The elements written to the queue contain incrementing sequence numbers;
- DataSink: reads elements from the queue as fast as it can and detects data loss using sequence numbers. After each 10M elements it reports the statistics to the Reporter thread (using another queue);
- Reporter: a thread that measures time and reports lost packets.
The timer implements the following interface:
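A sketch of what this interface could be (again, the names are illustrative; see the repository for the real code):

```cpp
// A sketch of the timer interface: the data source calls iteration() once per
// message, and the call returns when the next message is due.
class Timer
{
public:
    virtual ~Timer() {}
    virtual void start() = 0;       // remember the starting point
    virtual void iteration() = 0;   // wait until the next tick is due
};
```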
We’ll be using two implementations: empty_timer, which doesn’t wait at all, and hires_timer, which looks like this:
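A sketch along the lines described below, assuming the __rdtsc() intrinsic and a calibrated clock rate (the parameter names are illustrative):

```cpp
#include <cstdint>
#include <x86intrin.h>   // __rdtsc()

// A sketch of hires_timer: the requested interval is converted from nanoseconds
// into TSC ticks using the calibrated clock rate (freq, in GHz), and iteration()
// spins on RDTSC until the deadline.
class hires_timer : public Timer
{
    const double freq;        // processor clock rate in GHz (TSC ticks per nanosecond)
    const uint64_t interval;  // interval between iterations, in TSC ticks
    uint64_t next = 0;        // TSC value at which the next tick is due

public:
    hires_timer(double freq_ghz, uint64_t interval_ns)
        : freq(freq_ghz), interval((uint64_t) (interval_ns * freq_ghz)) {}

    void start() override { next = __rdtsc() + interval; }

    void iteration() override
    {
        while (__rdtsc() < next);   // spin until the deadline
        next += interval;           // keep the schedule: after a delay, several ticks fire back to back
    }
};
```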
This timer uses the RDTSC instruction and, therefore, depends on the correct value of the processor’s clock rate. This value is calculated upfront using a calibration procedure (see the source), and stored in the freq variable (in GHz).
The timer is configured with the average interval between iterations, in nanoseconds. It can run unevenly (generate a burst of ticks if falling behind).
However, it is worth noticing that it performs at least one RDTSC instruction at every iteration, so its rate is limited by the latency of this instruction (8 ns on this processor).
We’ll use the queue size of 100,000. We’ll check later how different queue types respond to a change of the queue size.
The full source code can be found in the repository.
Test precautions
Tests like this often show very inconsistent behaviour. The results vary from one run to another. That’s why it is important to eliminate as many variables as possible:
- the Turbo Boost must be switched off;
- the system clock source must be set to TSC;
- our system is a multi-processor one. The effects of the multi-processor environment fall outside of the scope of this article, so we’ll limit the execution to just one physical processor. Besides, the results of the RDTSC instruction are not consistent across processor cores. So we’ll bind our data source, data sink and reporter threads to specific cores of the same processor using Linux thread affinity control tools (beware of hyperthreading: it must either be switched off, or taken into account while making thread affinity masks, to avoid threads running on the same physical core);
- we’ll limit the execution to the NUMA node associated with the processor we run on; the numactl utility helps there;
- whenever possible, we will allocate all our data structures at a 64-byte (the cache line size) boundary. For that, we’ll add the following to the declaration of our struct elem:
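A sketch of what this addition could look like, using class-level allocation operators:

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_malloc, _mm_free

// A sketch: class-level allocation operators make every dynamically allocated
// elem (and every other structure we add them to) start at a 64-byte boundary.
struct elem
{
    // ... the data fields are not shown here ...

    void * operator new      (size_t size) { return _mm_malloc(size, 64); }
    void * operator new[]    (size_t size) { return _mm_malloc(size, 64); }
    void   operator delete   (void * p)    { _mm_free(p); }
    void   operator delete[] (void * p)    { _mm_free(p); }
};
```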
Here _mm_malloc and _mm_free come from the SSE library. I would have preferred to use std::aligned_alloc, but it has only appeared in C++17, which I don’t have available yet. Strangely, it works on GCC but not on MSVC;
- obviously, make sure as few processes as possible are running on the host.
With all these precautions, I haven’t managed to achieve true consistency of the tests. Interestingly, the results are usually the same within one program run, but differ from one run to another. I would really appreciate it if someone could explain this behaviour and suggest a way to circumvent it.
I’ll publish the test results as ranges. I won’t bother with averages or standard deviations, because I don’t know what the reason for the variance is and, therefore, how representative the sample is.
Now everything is ready; let’s hit the road.
The standard queue
This is the most obvious of all queue implementations. It is based directly on the synchronisation and container primitives from the standard library:
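A sketch of such a queue, under the assumptions stated earlier (a std::deque guarded by a mutex, a condition variable for the blocking reader, dropping on overflow); the class name is illustrative:

```cpp
#include <cstdint>
#include <cstddef>
#include <deque>
#include <mutex>
#include <condition_variable>

// A sketch of the "standard" queue: a deque protected by a mutex, with a
// condition variable to block the reader on an empty queue.
class StandardQueue
{
    std::deque<int32_t> queue;
    std::mutex mutex;
    std::condition_variable cv;
    const size_t size;

public:
    explicit StandardQueue(size_t size) : size(size) {}

    bool write(int32_t value)
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (queue.size() >= size) return false;   // full: drop the value
        queue.push_back(value);
        cv.notify_one();                          // wake a possibly blocked reader
        return true;
    }

    int32_t read()
    {
        std::unique_lock<std::mutex> lock(mutex);
        while (queue.empty()) cv.wait(lock);      // block on an empty queue
        int32_t value = queue.front();
        queue.pop_front();
        return value;
    }
};
```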
Here are the run results:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Standard | 280-500 | 380-550 | 2.6 | 230 |
(We showed the highest value for the throughput).
This packet rate is somewhere between the rate typical for a coarse parallelism (microseconds and milliseconds), and a fine-grain parallelism (tens to hundreds of nanoseconds). This is not bad, but we must be able to do better than this.
The circular queue
Our next queue also uses standard synchronisation utilities but makes its own implementation of a circular queue inside an array instead of the standard deque class:
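A sketch of this variant (illustrative; the real code is in the repository):

```cpp
#include <cstdint>
#include <cstddef>
#include <mutex>
#include <condition_variable>

// A sketch of the circular queue: the same mutex and condition variable, but
// the elements live in a plain array indexed by two positions that wrap around.
class CircularQueue
{
    int32_t * const queue;
    const size_t size;
    size_t read_ptr = 0, write_ptr = 0;            // indices rather than real pointers
    std::mutex mutex;
    std::condition_variable cv;

public:
    explicit CircularQueue(size_t size) : queue(new int32_t[size]), size(size) {}
    ~CircularQueue() { delete[] queue; }

    bool write(int32_t value)
    {
        std::lock_guard<std::mutex> lock(mutex);
        size_t w = (write_ptr + 1) % size;
        if (w == read_ptr) return false;           // full: drop the value
        queue[write_ptr] = value;
        write_ptr = w;
        cv.notify_one();
        return true;
    }

    int32_t read()
    {
        std::unique_lock<std::mutex> lock(mutex);
        while (read_ptr == write_ptr) cv.wait(lock);   // empty: block
        int32_t value = queue[read_ptr];
        read_ptr = (read_ptr + 1) % size;
        return value;
    }
};
```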
It is unlikely that the std::deque was the bottleneck in the previous implementation, so we shouldn’t expect a big performance gain, and, in fact, it got slower:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Circular | 364-395 | 430 | 2.3 | 290 |
We won’t, however, revert back to the standard library, because we are going to manipulate the code of the circular queue. In particular, we’ll integrate it with the synchronisation code. That’s why we haven’t moved the circular queue logic into a separate piece of code, as the usual good coding practice suggests.
No wait queue
In the above examples, the two sides of a queue used a notification mechanism (condition variables) to notify the reader that there is data. This is completely appropriate when we are short of processor capacity and can’t afford wasting CPU cycles while waiting for data. However, we’ve decided to put the queue throughput above everything else. So let’s try some spin loops.
Let’s start with the simplest of the spin loop solutions, where the Circular code is used as is (including the mutex), but no condition variable is used:
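A sketch of this variant: the same mutex-protected circular buffer, with the reader spinning (calling the yield() shown next) instead of waiting on a condition variable:

```cpp
#include <cstdint>
#include <cstddef>
#include <mutex>

void yield();                                   // the spin hint, defined below

// A sketch of the no-wait queue: the writer is unchanged, the reader polls the
// queue under the mutex and spins when it is empty.
class NoWaitQueue
{
    int32_t * const queue;
    const size_t size;
    size_t read_ptr = 0, write_ptr = 0;
    std::mutex mutex;

public:
    explicit NoWaitQueue(size_t size) : queue(new int32_t[size]), size(size) {}
    ~NoWaitQueue() { delete[] queue; }

    bool write(int32_t value)
    {
        std::lock_guard<std::mutex> lock(mutex);
        size_t w = (write_ptr + 1) % size;
        if (w == read_ptr) return false;        // full: drop
        queue[write_ptr] = value;
        write_ptr = w;
        return true;
    }

    int32_t read()
    {
        while (true) {
            {
                std::lock_guard<std::mutex> lock(mutex);
                if (read_ptr != write_ptr) {    // there is data
                    int32_t value = queue[read_ptr];
                    read_ptr = (read_ptr + 1) % size;
                    return value;
                }
            }
            yield();                            // empty: spin politely
        }
    }
};
```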
Here, when there is nothing to do, the program calls yield, which is defined as follows:
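In this sketch yield() is simply the PAUSE intrinsic:

```cpp
#include <xmmintrin.h>   // _mm_pause

// A sketch of yield(): the PAUSE instruction, a hint to the processor that we
// are in a spin loop.
inline void yield()
{
    _mm_pause();
}
```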
This is a CPU instruction that hints to the CPU that it is in a spin loop and must slow down a little. It is supposed to increase the performance of memory access. At least, this is what the Intel manual says; I haven’t actually seen this happening. In addition, there are reports that this instruction has become very slow on the Skylake family of processors. Probably, we’ll have to do without it on those processors.
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
No wait | 310 | 340 | 2.9 | 1030 |
The result is mixed: the throughput got a bit higher, but so did the average queue size.
Spin-lock queue
The next step is to replace the mutex with some mechanism that doesn’t involve the OS. Let’s try a spin-lock. Here is a simple implementation that’s made in the usual C++ tradition: the constructor acquires the lock while the destructor releases it:
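A sketch of such a spin-lock over a shared std::atomic_flag (the names are illustrative; the flag would be a field of the queue, and yield() is the same as above):

```cpp
#include <atomic>

// A sketch of a RAII-style spin-lock: the constructor spins until it acquires
// the shared flag, and the destructor releases it.
class spin_lock
{
    std::atomic_flag & flag;

public:
    explicit spin_lock(std::atomic_flag & f) : flag(f)
    {
        while (flag.test_and_set(std::memory_order_acquire))
            yield();                             // spin until acquired
    }

    ~spin_lock()
    {
        flag.clear(std::memory_order_release);
    }
};
```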
This lock can be used instead of the lock_guard in the previous solution.
Note that this solution is not guaranteed to work at all. The spin-lock does not even try to implement any kind of fairness, so if a thread releases the lock and re-acquires it immediately (as our reading thread does when waiting for data), another thread may have no chance to slip through.
Testing confirms that this is indeed a very unreliable solution. The writing time varies between 1000 and 200,000 ns, even though the average queue size doesn’t go too high (700-2000). We’ll discard this solution and try more reliable options.
Atomic queue
All the implementations so far followed one pattern: there is some data structure (a queue in our case), which is not thread-safe on its own, because it consists of multiple values that must be kept consistent with each other. We then use a mutual exclusion mechanism to protect the structure from concurrent access.
In a case of some very complex data structure this is the best we can do; in the case of a queue, however, we can do better. A reader and a writer may use the queue simultaneously as long as they access different parts of it. This is exactly what happens when the queue is neither full nor empty. Obviously, to establish this fact, the threads must read some shared variables, but they don’t need any mutexes for that: atomic variables are sufficient.
The reader and the writer threads share two pointer (technically, index) variables: read_ptr and write_ptr, and they can use these two variables for synchronisation:
- if read_ptr == write_ptr, the queue is empty, and the reader must wait;
- if (write_ptr + 1) % size == read_ptr, the queue is full, and the writer must drop its message;
- otherwise they are accessing different elements of the underlying array and can safely proceed to perform their operations.
Using these two pointer variables as synchronisation tools imposes some requirements on them:
- they must be volatile, which means that they must be written to memory as soon as they are modified and read from memory when they are needed. In short, they mustn’t be placed into registers for any substantial period of time;
- they must be atomic, which means that no thread may ever see any partial modification of them. For instance, 64-bit variables were not atomic on 32-bit processor architectures. Arrays of 1000 bytes are not atomic anywhere;
- there must be some memory order enforced; in particular, we must make sure that the reading thread reads the updated write_ptr only after the new item has already been placed onto the queue, and the writing thread reads the updated read_ptr only after the item has been copied from it.
In C++ all of this can be achieved using atomic variables:
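A sketch of such a queue (illustrative names; default, sequentially consistent, atomics are used for simplicity, and yield() is the same as before):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// A sketch of the atomic circular queue: only the producer writes write_ptr and
// only the consumer writes read_ptr; each side reads the other side's variable.
class AtomicCircularQueue
{
    int32_t * const queue;
    const size_t size;
    std::atomic<size_t> read_ptr;
    std::atomic<size_t> write_ptr;

public:
    explicit AtomicCircularQueue(size_t size)
        : queue(new int32_t[size]), size(size), read_ptr(0), write_ptr(0) {}
    ~AtomicCircularQueue() { delete[] queue; }

    bool write(int32_t value)
    {
        size_t w = write_ptr.load();
        size_t next = (w + 1) % size;
        if (next == read_ptr.load()) return false;   // full: drop
        queue[w] = value;
        write_ptr.store(next);                       // publish after the element is written
        return true;
    }

    int32_t read()
    {
        size_t r = read_ptr.load();
        while (r == write_ptr.load()) yield();       // empty: spin
        int32_t value = queue[r];
        read_ptr.store((r + 1) % size);              // free the slot after copying the value
        return value;
    }
};
```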
This promises to be a big step forward compared to the previous solutions. The reader and the writer are still tightly inter-dependent (they access each other’s variables on every call), but any kind of waiting only happens at the reader side and only when the queue is empty, which is the natural time we should wait. The fairness problem is therefore solved.
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Atomic | 23-25 | 42-46 | 21-23 | 26 |
The improvement is indeed impressive. It’s a leap from a coarse to a fine parallelism. The results are so much better than anything we’ve seen so far that we could stop right here. Let’s, however, look closer at the result – what if it can be improved?
It’s interesting to look at the code generated for the queue operations. Here is write:
Note that here %rdi is the first parameter (this), and %rsi is the second one (the pointer to elem). And here are the fields:
- 32(%rdi) is write_ptr
- 16(%rdi) is size
- 24(%rdi) is read_ptr
- 8(%rdi) is queue.
And here is read:
The strange instruction rep nop mustn’t confuse the reader: its instruction code (F3 90) has been hijacked for pause.
One may wonder about the rep ret in the first sample. This is in fact more cryptic and is a workaround (or a hack, if you prefer this word) for some branch-prediction misbehaviour in AMD processors. It is described in this blog post: repz ret.
The code looks very good, if not absolutely perfect. One could wonder if conditional moves are justified there (a branch misprediction will only happen once in 100,000 iterations), but, even if a plain branch is better, the compiler has no way to establish it. The most striking feature of this code, however, is the absence of any synchronisation instructions: no locks, atomic swaps, or fences. The volatile semantics is, however, provided (the read_ptr is read from memory on each iteration of the loop in read), and the memory order is also observed: the pointer variables are stored as the last operations in the loops. The strong memory ordering of Intel takes care of the rest.
If we replace atomic<size_t> with a simple volatile size_t, we get exactly the same code. No difference.
This is only true on Intel. Other processors (such as ARM) may have much weaker natural memory ordering, and some memory barrier instructions will be required after reading or before writing the pointer variables.
Out of interest, let’s try compiling the code without atomic or volatile:
The code looks a bit different but does the same thing. It fulfils the volatile, atomic and memory order requirements just as well as the atomic and volatile versions did. This is probably just a coincidence: for instance, the C++ compiler didn’t have to generate a memory read in the loop in read (.L32). So this isn’t a safe option.
These code observations kill one possible improvement of the code that I previously had in mind. If reading atomic variables was expensive, one could consider caching their safe approximations in normal variables. For instance, the write routine could store the value of read_ptr in a normal variable and perform no atomic reads until this value is reached, and the read routine could do the same. All of this is not necessary when atomic operations are free (the trick, however, might still be valid for non-Intel processors).
Overcoming false sharing: aligning the variables
Previously we faced the effects of false sharing. It is worthwhile checking any multi-threading solution for possible exposure to this problem.
A memory location may be cached in the local caches of more than one processor. When one processor (or one core) modifies it, a signal is sent to other processors to discard their cached values and rather fetch the new value from the new owner when needed. This is completely reasonable. The problem is that the unit of such discarding and retrieval is a cache line, which on modern Intels is 64 bytes long. One thread modifying some value causes another thread to re-fetch some other value, whose only fault is to share a cache line with the modified one. This is exactly what may happen to our read_ptr and write_ptr values. When the reader updates its pointer, it invalidates the write pointer, and the writer has to fetch it again, and vice versa.
It seems attractive to place read_ptr and write_ptr into different cache lines. The simplest way to achieve this is to align the entire structure and the second one of the pointers by 64:
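A sketch of the change (only the data members are shown; the object itself is aligned by the allocation operators from the test precautions, and alignas pushes the second pointer into its own cache line):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// A sketch: read_ptr and write_ptr now live in different cache lines.
class AlignedAtomicCircularQueue
{
    int32_t * const queue;
    const size_t size;
    std::atomic<size_t> read_ptr;
    alignas(64) std::atomic<size_t> write_ptr;
    // ... write() and read() as in the atomic queue above ...
};
```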
The results really improved:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Aligned Atomic | 12-23 | 30-41 | 24-33 | 7 |
Aligning the variables even more
We made sure read_ptr and write_ptr reside in different cache lines, but what about other variables? Both queue and size are also fields stored in memory. As we saw in the assembly listing above, both are read at each invocation of read() and write(). Both are, therefore, vulnerable to false sharing with the pointers (or, after the fix, only with the read_ptr). Let’s align the read_ptr as well:
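A sketch of the further change (again, only the data members are shown):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// A sketch: read_ptr also gets its own cache line, away from the queue pointer
// and the size, which both threads read all the time.
class AlignedMoreAtomicCircularQueue
{
    int32_t * const queue;
    const size_t size;
    alignas(64) std::atomic<size_t> read_ptr;
    alignas(64) std::atomic<size_t> write_ptr;
    // ... write() and read() as before ...
};
```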
Here are the results (strangely, no big improvement):
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Aligned More Atomic | 15-30 | 35-42 | 24-28 | 6 |
Cached atomic circular queue
Let’s now revisit one of the statements made above:
> These code observations kill one possible improvement of the code that I previously had in mind. If reading atomic variables was expensive, one could consider caching their safe approximation in normal variables.
As we saw, reading atomic variables isn’t more expensive than reading normal variables: the code is identical. However, reading variables that are shared with another thread is more expensive than reading local variables. The idea comes back to life:
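A sketch of this idea (illustrative; each thread keeps its private copy next to its own pointer, so the copies do not introduce new false sharing):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// A sketch of the cached atomic queue: each thread keeps a private copy of the
// other side's pointer and re-reads the shared atomic only when that copy says
// the queue is full (writer) or empty (reader).
class CachedAtomicCircularQueue
{
    int32_t * const queue;
    const size_t size;
    alignas(64) std::atomic<size_t> read_ptr;      // written by the reader
    size_t read_cached_write_ptr = 0;              // reader's private copy of write_ptr
    alignas(64) std::atomic<size_t> write_ptr;     // written by the writer
    size_t write_cached_read_ptr = 0;              // writer's private copy of read_ptr

public:
    explicit CachedAtomicCircularQueue(size_t size)
        : queue(new int32_t[size]), size(size), read_ptr(0), write_ptr(0) {}
    ~CachedAtomicCircularQueue() { delete[] queue; }

    bool write(int32_t value)
    {
        size_t w = write_ptr.load();
        size_t next = (w + 1) % size;
        if (next == write_cached_read_ptr) {                  // looks full: refresh the copy
            write_cached_read_ptr = read_ptr.load();
            if (next == write_cached_read_ptr) return false;  // really full: drop
        }
        queue[w] = value;
        write_ptr.store(next);
        return true;
    }

    int32_t read()
    {
        size_t r = read_ptr.load();
        if (r == read_cached_write_ptr) {                     // looks empty: refresh the copy
            while (r == (read_cached_write_ptr = write_ptr.load()))
                yield();                                      // really empty: spin
        }
        int32_t value = queue[r];
        read_ptr.store((r + 1) % size);
        return value;
    }
};
```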
This version reduces reading of shared variables to the cases when it is absolutely necessary to exchange information between threads. In our case this is when the queue is full or empty. The improvement is visible straight away:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Cached Atomic | 6 | 27-32 | 31-37 | 2.2 |
Dual-array queue
The cached circular queue, being almost ideal, still has a few disadvantages:
- as any circular queue, it is not very cache-friendly. This is not so important for a queue of 100,000 elements, as it fits into the L3 cache, which is the only cache shared between two different cores. Even then, the writer often has to write to memory that is not cached in its L1 or L2 cache. While this is inevitable for the reader (after all, it always reads the memory that has just been modified), it can be avoided for the writer. Much longer queues may present even bigger caching problems;
- even if the entire queue fits into a cache of some level, the cache isn’t there just for the queue: the rest of the program also wants access to this resource, while the circular nature of the queue frequently pushes useful data out of the cache;
- if the read and write pointers aren’t too far from each other, we can experience false sharing between the data being written and the data being read.
Let’s try a completely different approach for the queue design, based on the idea that the writer writes to some part of the queue while the reader reads some other part. We’ll make these parts of the queue two separate arrays, one being written and another one being read, the arrays swapped when the reading one is exhausted.
We’ll follow a client-server approach: the only truly shared variable will be a swap request flag (swap_requested), set by the reader when it runs out of data. Upon receiving such a request, the writer swaps the arrays and resets the flag. Note that the writer can only do this on its next write operation, so the current content of the queue can get stuck there forever if the influx of source data stops. This is a shortcoming of this implementation.
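A sketch of this design (illustrative names; the real code is in the repository). The arrays are swapped by exchanging two pointers, every variable is cache-line aligned, and yield() is the same as before:

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>
#include <utility>

// A sketch of the dual-array queue: the writer fills one array while the reader
// drains the other; the only truly shared variable is swap_requested.
class DualArrayQueue
{
    const size_t size;
    alignas(64) int32_t * write_buf;            // writer's array
    alignas(64) size_t write_count = 0;         // number of elements written into it
    alignas(64) int32_t * read_buf;             // reader's array
    alignas(64) size_t read_pos = 0;            // next element to read
    alignas(64) size_t read_limit = 0;          // number of valid elements in read_buf
    alignas(64) std::atomic<bool> swap_requested;

public:
    explicit DualArrayQueue(size_t size)
        : size(size), write_buf(new int32_t[size]), read_buf(new int32_t[size]),
          swap_requested(false) {}
    ~DualArrayQueue() { delete[] write_buf; delete[] read_buf; }

    bool write(int32_t value)
    {
        if (swap_requested.load()) {            // the reader has run out of data
            std::swap(write_buf, read_buf);
            read_limit = write_count;           // tell the reader how much there is
            write_count = 0;
            swap_requested.store(false);        // respond to the request
        }
        if (write_count == size) return false;  // the write array is full: drop
        write_buf[write_count++] = value;
        return true;
    }

    int32_t read()
    {
        while (read_pos == read_limit) {        // the read array is exhausted
            read_pos = 0;
            swap_requested.store(true);         // ask the writer for a new array
            while (swap_requested.load()) yield();
        }
        return read_buf[read_pos++];
    }
};
```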
Here we preferred to be safe and aligned every single variable on a cache line boundary. Probably, it is possible to ease these requirements. It is also possible that the memory order requirements may be relaxed. I will not concentrate on these issues here.
Other strategies can be used for reader and writer co-operation. For example, if the queue is full at the call to write(), and the swap_requested flag is set, we can, instead of dropping the element, write it to the new array. I don’t think this is a big improvement: we could just make the array size bigger.
Another variation is using a do while loop in the reader to perform at least one yield() between writing the swap request flag and first reading it. I tried this; it makes things a bit slower.
Again, the atomic operations do not show in the Intel code. The volatile keyword would be sufficient on this architecture.
How does this solution compare with the best so far (CachedAtomicCircularQueue)? This is not clear. Here are different comparison points:
CachedAtomicCircularQueue | DualArrayQueue |
---|---|
Cache-unfriendly | More cache-friendly, unless the queue grows too long |
Two atomic variables | One atomic variable |
The reader writes one each time | The reader writes it to signal an empty queue |
The writer writes one each time | The writer writes it to respond |
The reader polls one when the queue is empty | The reader polls it for the response |
The writer reads one when reaching the previous reading point | The writer reads it each time, although it is well-cached most of the time |
Threads signal to each other by a single memory write | The reader starts a whole transaction, with an atomic variable being read and written twice, and two more variables written by the writer and read by the reader (although read_limit only at the next iteration) |
We have no option but testing, and here are the results:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Dual | 3 | 23 | 43.5 | 1 |
Overall there is a small difference in favour of the dual-array solution. The average queue size looks especially impressive.
Dual-array queue improved
One small improvement of the dual-array queue is reducing the number of variables. We can hijack the read_limit as a new swap_flag: this variable being zero will serve as the signal to the writer to provide a new chunk of data.
We know that atomic variables are not expensive on Intel, and that shared variables are fast when accessed by a single thread. This makes it possible for read_limit, apart from being used as a request flag, to serve its direct purpose: indicate the size of the read buffer. On other platforms one could use a thread-local copy of this variable, but we won’t bother with it here.
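A sketch of this reduced version, with read_limit (now the only atomic variable) doubling as the request flag:

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>
#include <utility>

// A sketch: read_limit == 0 means "the reader wants a new chunk"; a non-zero
// value is the size of the chunk just published.
class DualArrayQueue2
{
    const size_t size;
    alignas(64) int32_t * write_buf;
    alignas(64) size_t write_count = 0;
    alignas(64) int32_t * read_buf;
    alignas(64) size_t read_pos = 0;
    alignas(64) std::atomic<size_t> read_limit;     // 0 = swap requested

public:
    explicit DualArrayQueue2(size_t size)
        : size(size), write_buf(new int32_t[size]), read_buf(new int32_t[size]),
          read_limit(0) {}
    ~DualArrayQueue2() { delete[] write_buf; delete[] read_buf; }

    bool write(int32_t value)
    {
        if (read_limit.load() == 0 && write_count > 0) {   // the reader is waiting for data
            std::swap(write_buf, read_buf);
            read_limit.store(write_count);                 // publish the new chunk
            write_count = 0;
        }
        if (write_count == size) return false;             // the write array is full: drop
        write_buf[write_count++] = value;
        return true;
    }

    int32_t read()
    {
        if (read_pos == read_limit.load()) {               // the current chunk is exhausted
            read_pos = 0;
            read_limit.store(0);                           // request a new one
            while (read_limit.load() == 0) yield();        // wait for the writer
        }
        return read_buf[read_pos++];
    }
};
```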
Even if there are performance advantages, they aren’t visible:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Dual2 | 3 | 26 | 38 | 1 |
We’ll still keep this version as a lower-impact one.
Thread-local array swapping
In the code of DualArrayQueue the writer swaps the read and write buffers, and the reader uses the results of this swap. The reader, however, could have its own copy of the read buffer and perform the swapping locally:
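A sketch of this variant: both threads know the two fixed arrays and each keeps its own pointer to the one it is currently using, so no buffer pointer has to be passed from the writer to the reader:

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// A sketch of the thread-local swap: write_buf is private to the writer,
// read_buf is private to the reader, and the two stay in lock-step because
// every request/response pair flips both of them.
class DualArrayQueue3
{
    const size_t size;
    int32_t * const buf0;                           // the two fixed arrays
    int32_t * const buf1;
    alignas(64) int32_t * write_buf;                // writer's current array
    size_t write_count = 0;
    alignas(64) int32_t * read_buf;                 // reader's current array
    size_t read_pos = 0;
    alignas(64) std::atomic<size_t> read_limit;     // 0 = swap requested

public:
    explicit DualArrayQueue3(size_t size)
        : size(size), buf0(new int32_t[size]), buf1(new int32_t[size]),
          write_buf(buf0), read_buf(buf1), read_limit(0) {}
    ~DualArrayQueue3() { delete[] buf0; delete[] buf1; }

    bool write(int32_t value)
    {
        if (read_limit.load() == 0 && write_count > 0) {
            read_limit.store(write_count);                 // publish the current array
            write_buf = write_buf == buf0 ? buf1 : buf0;   // and move to the other one
            write_count = 0;
        }
        if (write_count == size) return false;             // drop on overflow
        write_buf[write_count++] = value;
        return true;
    }

    int32_t read()
    {
        if (read_pos == read_limit.load()) {               // the current chunk is exhausted
            read_pos = 0;
            read_buf = read_buf == buf0 ? buf1 : buf0;     // swap locally, mirroring the writer
            read_limit.store(0);                           // request the next chunk
            while (read_limit.load() == 0) yield();
        }
        return read_buf[read_pos++];
    }
};
```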
The results are a little bit better:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Dual3 | 2 | 24 | 41.6 | 1 |
XOR-based array swapping
We didn’t use a proper swap (reading two variables and writing them back the other way around), because it would require two memory updates at every operation. We used a conditional move instead. It requires two memory reads and a conditional move instruction, or a read, a branch and another read. The GNU C++ compiler chose the latter. It still worked fast, because the jump is very well predicted: it is taken every other time, and processors know how to predict that. We can, however, perform the swap without branches, using XOR:
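A sketch of the trick (the names are hypothetical): the XOR of the two buffer addresses is computed once, and switching the current buffer becomes a single XOR, with no branch and no conditional move:

```cpp
#include <cstdint>

// A sketch of the XOR-based swap of the current buffer pointer.
struct xor_swap_sketch
{
    int32_t * buf0, * buf1;     // the two fixed arrays
    uintptr_t buf_diff;         // (uintptr_t) buf0 ^ (uintptr_t) buf1, computed once
    int32_t * read_buf;         // the buffer currently being read

    void swap_read_buf()
    {
        read_buf = (int32_t *) ((uintptr_t) read_buf ^ buf_diff);
    }
};
```

The same single line replaces the conditional in both read() and write().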
The result is a bit worse than before:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Dual4 | 3 | 25 | 40 | 1 |
The code, however, looks very compact and elegant.
Dual-array queue: asynchronous approach
In all the dual-array versions so far the reader sent a message to the writer when it encountered an empty queue on entering read(). We can improve this by sending this message at the exit of the previous read(). This will allow the message exchange to run in parallel with some useful activity in the reader. It probably won’t affect our artificial test, which has no such activity, but may help in real applications.
As expected, the result hasn’t become better, but it hasn’t become worse, either:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Dual Async | 3 | 25 | 40 | 1 |
The summary
This is where our story of improving the queue in C++ finishes. We’ve gone quite far:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Standard | 280-500 | 380-550 | 2.6 | 230 |
Circular | 364-395 | 430 | 2.3 | 290 |
No wait | 310 | 340 | 2.9 | 1030 |
Atomic | 23-25 | 42-46 | 21-23 | 26 |
Aligned Atomic | 12-23 | 30-41 | 24-33 | 7 |
Aligned More Atomic | 15-30 | 35-42 | 24-28 | 6 |
Cached Atomic | 6 | 27-32 | 31-37 | 2.2 |
Dual | 3 | 23 | 43.5 | 1 |
Dual2 | 3 | 26 | 38 | 1 |
Dual3 | 2 | 24 | 41.6 | 1 |
Dual4 | 3 | 25 | 40 | 1 |
Dual Async | 3 | 25 | 40 | 1 |
Various versions of the dual-array queue are the definite winners. Even though the advantage over atomic versions isn’t big, the queue size shows that dual-array versions are more responsive and cause less latency.
We’ve been working with the queues of 100,000 elements long, each element being four bytes. Let’s see what happens when we change these parameters.
Changing the element size
Let’s try changing the element size to 64, with the same queue size (100K):
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Standard | 360 | 470 | 2.1 | 830 |
Circular | 515 | 530 | 1.9 | 290 |
No wait | 430 | 430 | 2.3 | 89 |
Atomic | 80 | 63 | 15.9 | 544 |
Aligned Atomic | 59 | 59 | 16.9 | 5 |
Aligned More Atomic | 61 | 61 | 16.4 | 2.6 |
Cached Atomic | 60 | 59 | 16.9 | 2.2 |
Dual | 26-43 | 52 | 19.2 | 1 |
Dual2 | 56 | 56 | 17.9 | 1 |
Dual3 | 32-47 | 47 | 21.2 | 1 |
Dual4 | 60 | 45 | 22.2 | 1 |
Dual Async | 44 | 44 | 22.7 | 1 |
Except for the versions using the OS-based synchronisation, everything went quite a bit slower. However, high throughputs are still available. We see that the solutions that were the best last time are still the best.
Changing the queue size
I won’t show all the results here. Here are the observations:
- a queue size of 100 or less simply doesn’t work at any data rate of interest. Some of the atomic and dual-array solutions can still perform at 1000 ns/packet or more. Everything else just fails;
- the standard, circular and no-wait solutions fail at a queue size of 1000. The atomic and dual-array solutions perform well at 100 ns and more. They lose packets at higher speeds;
- at a queue size of 10,000, occasional packet loss is observed at high speed on the atomic solutions, but not on the dual-array ones;
- going in the other direction does not show any difference. Even exceeding the L3 cache size (a queue size of 100M at an element size of 4 bytes) does not affect the achieved throughput on any of the solutions. However, one must remember that we didn’t do anything else on the machine, so there was no useful data to cache.
In any case, the dual-array versions performed the best, followed by the atomic ones, with everything else as the outsiders.
Multi-processor case
Studying the effects of true multi-processor environment was not our objective; however, just out of curiosity let’s run some of our solutions using affinity masks that put our data source and data sink onto different processors:
Queue name | Writing time, ns | Good interval, ns |
---|---|---|
Atomic | 5 | 160 |
Aligned Atomic | 5 | 160 |
Aligned More Atomic | 4 | 140 |
Cached Atomic | 5 | 140 |
Dual | 3 | 140 |
Dual2 | 3 | 140 |
Dual3 | 3 | 140 |
Dual4 | 3 | 135 |
Dual Async | 4 | 135 |
All our solutions look poor; however, the dual-array ones are still the best. The typical times exceeding 100 ns indicate that some uncached memory access takes place.
Let’s increase the queue size to one million (it is not cached anyway), and see what happens:
Queue name | Writing time, ns | Good interval, ns |
---|---|---|
Atomic | 5 | 140 |
Aligned More Atomic | 5 | 80 |
Cached Atomic | 3 | 24 |
Dual Async | 4 | 28 |
The results are surprisingly good. It looks like the memory pre-fetch resolves the uncached memory access problem. What is important is the optimisation of access to the control values. The false-sharing reduction (Aligned More Atomic) helps a lot, and the true-sharing reduction (Cached Atomic) helps even more.
The dual-array versions don’t look too bad, either.
Still, the large queue size at which this result has been achieved shows that the multi-processor setup is not a good choice where low latency is required.
Going Java: dual-array version
Now it’s time to test our solutions on Java. We will only port one version – the last one (asynchronous dual-array). The overall test setup will be the same; however, there are some important differences to keep in mind:
- Java does not have built-in thread affinity control. Although it is possible to write a native-code library for that, we won’t bother with it. Let’s just run the entire program on one physical processor using taskset with appropriate parameters (taskset 0x555 works on my system);
- NUMA considerations stay the same; we’ll use the same NUMA control commands;
- Java does not have direct access to the rdtsc instruction. This means we can’t implement a precise hires_timer. Instead, we’ll be running some arithmetic calculations in a loop to get a controllable delay. Thus the delay parameter won’t be in nanoseconds; however, the reported values still will be;
- Java does not have any control over field alignment inside an object; we’ll have to introduce our own padding where necessary. This is, obviously, not portable: Java does not make any promises regarding the object layout.
Let’s now look at the first version, the DualArrayAsyncIntQueue:
The code for the DataSource, DataSink and Reporter classes can be seen in the repository.
Here are the results:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
DualArrayAsyncIntQueue | 6 | 16 | 62.5 | 1 |
Surprisingly, the throughput is quite a bit higher than in C++.
Going Java: dual-buffer version
This version isn’t really there to be used in Java. We’re going to use it to connect Java and C++. The Java array isn’t the best object to be accessed via JNI; the native buffer is much better. That’s why we’ll put our data elements into a native buffer, and a field that’s going to be shared with C++ (read_limit) into another native buffer. Let’s, however, first implement both ends in Java:
Here we don’t bother aligning any data elements, because this queue won’t be used in a Java to Java setup, and no fields will be shared with other threads when talking to the C++ code.
One point in this code isn’t completely up to standard: where in C++ we used atomic variables, here we just write data into the native buffers. This causes an appropriate memory ordering on Intel; on other processors it might not work. There are two possible workarounds:
- use sun.misc.Unsafe, which has volatile write and fence operations. However, using this class in client code is these days considered not comme il faut;
- just write something into any dummy volatile variable; this will probably work but looks like a hack.
We’ll ignore this problem for now.
Here are the results:
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
ByteBufferAsyncIntQueue | 9 | 25 | 40 | 1 |
As expected, the speed dropped a bit, still staying better than our best C++ version.
Connecting Java to C++: a native data source
We finally arrived at our last topic: a queue that connects Java and C++. Our last Java version allows easy integration with native code. First, we’ll need a Java class with native functions:
Then, a native implementation of the data source:
(A tricky part here is the conversion of an arbitrary memory pointer to a pointer to atomic<uint32_t>. Nothing in the C++ standard promises this will work; however, it looks very natural and works on Windows and Linux.)
Finally, we need the queue itself (only the writing part of it):
Here are the results achieved when sending data from C++ to Java (still, on one physical processor):
Queue name | Writing time, ns | Good interval, ns | Throughput, Mil/sec | Avg size |
---|---|---|---|---|
Native ByteBufferAsyncIntQueue | 1 | 20 | 50 | 1 |
This is really a very good result. It’s better than anything we’ve achieved when both sides of the queue were in C++. This mechanism can be used as a clock generator with very fine resolution.
Conclusions
- We’ve tried multiple strategies to arrange a single-producer, single-consumer queue. Even the simplest implementation, using the standard tools, could sustain a message rate of 2.1 million per second, which is already more than good enough for any coarse parallelism requirements;
- We’ve improved the throughput by a factor of ten (up to 22 million), which took us into the area of fine parallelism;
- This happened at the cost of burning CPU cycles in spin loops;
- The OS mechanisms for thread synchronisation and inter-thread communication are relatively slow; high-performance applications should rather use other, lower-level tools, such as atomic variables and test-and-set instructions;
- The data structures designed for multi-threaded access perform much better than generic structures wrapped in mutexes;
- The dual-array design of a queue works very well and demonstrates better performance than the one based on circular arrays;
- Surprisingly, Java showed better performance than C++;
- We’ve implemented a very fast (50M messages/sec) queue between C++ and Java. This creates a good framework to study the real-time behaviour of Java.
Comments are welcome below or on reddit