Today we are going to do some multi-threaded programming in Java. We’ll look at some shared data structures and try several locking and non-locking approaches. Not all the code, including the final solution, will be completely correct from the Java spec point of view; some will be completely incorrect. We’ll check what consequences it will have.
The source code can be found in this repository.
The problem definition: counters and gauges
The problem originates from real life. Our project included a statistical sub-system that measured and accumulated some values for diagnostic purposes. There were two types of statistics: counters and gauges.
A counter is an object that accumulates the sum of reported values. Here are some examples:
- the number of requests executed
- the number of bytes processed
- the number of packets received
- the number of times the program executed a specific function or a path in the control flow graph.
We’ll call the code that reports counted values “a client”. Here is the client-side interface – the Counter class:
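The original code isn’t included in this text, so here is a minimal sketch of what the client-side part could look like (the method names are assumptions):

```java
// Sketch of the client-side interface. A client only ever adds to a counter.
public abstract class Counter
{
    /** Adds an arbitrary increment (e.g. a number of bytes processed). */
    public abstract void add (long increment);

    /** Convenience method: counts a single event. */
    public void add ()
    {
        add (1);
    }
}
```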
The statistic engine (“the server”) collects and resets the counter regularly (we’ll call the time between collections “the collection interval”; it was one minute in our project). For that it uses the server-side part of the interface:
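Again, only a sketch – the remaining, server-side method of the same Counter class:

```java
// Server-side part of the Counter (sketch): called once per collection interval,
// it returns the accumulated value and resets the counter to zero.
public abstract long getAndReset ();
```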
A gauge is an object that stores a result of an instant measurement of some variable and performs basic aggregations on these values – minimum, maximum, average. The examples are:
- the amount of allocated memory
- the disk utilisation
- the CPU utilisation as reported by OS
- the size of some queue
The need to calculate the average requires recording the number of invocations and the sum of values, so each gauge always contains two counters inside. This technically allows us to implement only gauges and use them as counters. However, we’ll see soon that counters may be implemented more efficiently than gauges.
The gauges can be used as “weighted counters” to record events that are accompanied by a number. The examples are:
- the size of an incoming message
- the time it took to serve a request
Here is the client-side Gauge definition:
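The original definition isn’t included here either; a minimal sketch (the method name is an assumption):

```java
// Sketch of the client-side Gauge interface: a client reports single measurements.
public abstract class Gauge
{
    /** Reports one measured value (e.g. the current queue size). */
    public abstract void report (long value);
}
```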
Some gauges do not require frequent reporting of the value. For instance, it’s sufficient to record CPU utilisation several times a minute. Some, however, benefit from frequent update. Such are the gauges of the second type, but not only them. If we can afford to record the queue size every time we put something into the queue or get something, the measurement will be much more accurate than random sampling, and we’ll be much less likely to miss times when the queue is full or empty.
The server-side interface is more complex than for Counter, because all the values must be retrieved at once:
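A sketch of a possible shape of this part; the getter names are assumptions, but the set of values follows from the definition above (in the repository GaugeValue is an interface, and SimpleGaugeValue is its immutable implementation):

```java
// Sketch: everything the server needs to compute min / max / average for the interval.
public interface GaugeValue
{
    long getCount ();   // number of reported values
    long getSum ();     // sum of the values (needed for the average)
    long getMin ();
    long getMax ();
}

// ... and, in the Gauge class itself, the server-side method:
// public abstract GaugeValue getAndReset ();
```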
These statistics are a bit like trace log files: their purpose is to help developers find what went wrong when something does. They are not shown to the users. The program logic does not depend on them. They are not mission-critical. This doesn’t mean they may be totally wrong, but if one of the counters is out by one or two, it is not the end of the world. However, they must have one very important feature: be fast.
How fast is fast?
Here we must clarify what “fast” means. This is the reasoning.
Let’s assume we are making a handler of some incoming requests. It doesn’t perform any sophisticated processing, only initial parsing and some pre-checks. It figures out who must serve the request and dispatches it further down the processing chain. A typical performance of such a handler on modern CPUs is somewhere between one million and ten million operations per second per core, which translates to 100 to 1000 nanoseconds per request (which, on a typical Xeon, means between 250 and 2500 cycles). Several statistic variables may be updated during each of these operations. Moreover, we should never hesitate to add another one if that’s needed. This gives us a rough estimate of the time we can afford to spend on a counter update: 5 to 10 ns is ideal, and 20 ns is probably still reasonable. Anything above 50 ns is too much, and 100 ns is totally unacceptable. We don’t want to convert our event handler into a stat logger. So the boundary between good and bad is somewhere between 20 and 50 ns.
Another obvious requirement for the stats objects is that they must support a multi-threaded operation. Multiple client threads may exist, plus one server thread.
The tests
We’ll start with counters because they are a bit simpler than gauges; however, we’ll keep gauges in mind and, at the end of the article, run the gauge test as well.
We’ll start several threads that will do their own work (perform some calculations) and update the counters at given intervals. These intervals will vary between zero and several hundred nanoseconds, controlled by a parameter called the delay factor. We expect the counters to cause the biggest performance penalty at the smallest delays, because this is where the conflicts are most probable (several threads updating the counter at the same time). That’s why one of the tests will use the delay of zero, even though this isn’t very realistic.
Each thread will count how many counter update operations it performed – this will be our performance metric.
Here is the thread body we’ll be using:
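The exact code isn’t reproduced in this text; here is a sketch of what such a thread body could look like, with the “useful work” simulated by a floating-point calculation (the details of the real code may differ):

```java
// Sketch of a test thread: simulate some work controlled by the delay factor,
// update the counter, and count how many updates were performed.
class TestThread extends Thread
{
    private final Counter counter;
    private final int delayFactor;
    volatile long operations = 0;       // our performance metric
    volatile boolean running = true;

    TestThread (Counter counter, int delayFactor)
    {
        this.counter = counter;
        this.delayFactor = delayFactor;
    }

    @Override
    public void run ()
    {
        double x = 1.0;
        long ops = 0;
        while (running) {
            for (int i = 0; i < delayFactor; i++)    // simulated event processing
                x = Math.sqrt (x + i);
            counter.add ();
            operations = ++ ops;
        }
        if (x == 42) System.out.println ();          // keep the calculation alive
    }
}
```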
The test spawns a given number of threads, all initialised with the same counter, then runs them for 10 seconds, querying the counter every second (it is one minute in the real-life code, but we want our tests to run quickly). After the threads stop, it collects the performance data as well as the number of times counter.add() was called. The average performance is then reported, in nanoseconds per counter operation:
We’ll be running the test using Java 1.8 on a dual Xeon with 12 physical cores, with hyper-threading switched off. Turbo boost must also be switched off. We’ll run the test with nthreads equal to 1, 2, 6 and 12. Each test will be repeated several times, with the two slowest results discarded as possibly performed on a VM that hasn’t yet warmed up. The average of the rest will be reported as the result of the test.
The empty counter
The first counter we test is an empty counter. It counts nothing:
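The class isn’t shown in the text, but a sketch is trivial:

```java
// A counter that counts nothing: it measures the cost of the test loop itself.
public class EmptyCounter extends Counter
{
    @Override
    public void add (long increment) {}    // deliberately does nothing

    @Override
    public long getAndReset () { return 0; }
}
```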
This test shows base times, which are the times taken by the imaginary event handler without updating any counters. Here are the times for some delay factors and thread counts:
Counter name | Delay factor | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---
EmptyCounter | 0 | 1 | 1 | 1 | 1 |
EmptyCounter | 10 | 47 | 47 | 47 | 47 |
EmptyCounter | 50 | 374 | 373 | 374 | 374 |
EmptyCounter | 100 | 887 | 887 | 883 | 884 |
The measured times are very stable, and the performance does not depend on the number of threads, as long as they fit into the physical cores. As the delay factor changes, the base time varies over a wide range, covering values typical of real-world event processing.
The times are rounded to nanoseconds, as this accuracy is good enough for our problem.
The trivial counter
The next one is the simplest possible implementation, so simple that everybody knows it is incorrect:
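A sketch of this implementation:

```java
// The simplest possible counter: a plain field with no synchronisation whatsoever.
public class TrivialCounter extends Counter
{
    private long count = 0;

    @Override
    public void add (long increment)
    {
        count += increment;
    }

    @Override
    public long getAndReset ()
    {
        long result = count;
        count = 0;
        return result;
    }
}
```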
Why is it incorrect? It isn’t thread-safe, and even our simplest use case is multi-threaded (there is at least one client thread plus one server thread). There are three issues that prevent it from running well in such a case:
1) The count
isn’t volatile
. This means that both the compiler and the processor are allowed to keep the thread-local view of this variable out of sync with
memory for a substantial time, or, if we’re especially unlucky, permanently. When the variable is modified, the compiler may keep the new value in a register, and the only reason it writes it to memory at all is
that it must appear there at the next synchronisation point. Even when the variable is written to memory, the processor is allowed to keep it in a store buffer until
forced to flush it. Meanwhile, all the other cores will see the previous value of the variable. If multiple cores
modify the variable, the new values may be written to memory in any order. Likewise, when the variable is read, the compiler is allowed to
use the cached copy instead of fetching the externally modified value.
2) The count
isn’t protected from concurrent modification by two clients.
3) The count
isn’t protected from concurrent modification by a client and a server.
The most common way of resolving issues 2 and 3 makes issue 1 irrelevant: a variable that’s read or modified in a synchronized
block is kept in sync with memory automatically;
it does not have to be volatile
.
In theory, lack of synchronisation may cause any result. A VM may optimise the program in an arbitrary way, which can literally cause any value to be produced. However, given a typical code generation, we are most likely to observe two effects:
1) Loss of counts due to simultaneous update from two clients, or one client and one server, in a scenario similar to this:
Client thread 1 | Client thread 2 |
---|---|
tmp = count; | tmp = count; |
++ tmp; | ++ tmp; |
count = tmp; | |
| count = tmp; |
This effect causes the counted value to be less than what it should be. More than one increment of count
(and in more than one thread) may happen while
the client thread 1 is holding on to its value, so the loss can be quite big.
2) Missing the reset of the counter:
Client thread | Server thread |
---|---|
tmp = count; | result = count; |
++ tmp; | count = 0; |
count = tmp; | |
This may cause the entire interval of counter updates between two retrievals to be counted twice, making the counted value much bigger than it should be.
Do we really observe these effects, and, as they are working against each other, which of them wins? Let’s run the test. I know, many people will say: “What’s the point of running an incorrect test? You can’t use any of its results anyway”. We’ll try it just out of curiosity. Among other things, we are interested in its speed. Does the lack of synchronisation make this test fast?
Running the tests, we see that the sum of retrieved values is bigger than the total number of times Counter.add()
was invoked. For one thread:
Counter sum: 24954272194
Correct sum: 3827471278
If we look at the retrieve history, we see the increasing values:
Counter retrieved: 357673732
Counter retrieved: 706667851
Counter retrieved: 1047647376
Counter retrieved: 1387938007
Counter retrieved: 1714498868
Very infrequently the value does drop:
Counter retrieved: 2349630076
Counter retrieved: 2512206688
Counter retrieved: 220113172
This means that the second effect really occurs, and is very prominent.
For six threads:
Counter sum: 893999846
Correct sum: 961599387
This time the first effect already outweighs the second one.
For twelve threads:
Counter sum: 903239369
Correct sum: 977822510
Again, the first effect wins over the second one.
How fast does this test run? For convenience, we’ll publish the times subtracting the values for the EmptyCounter
:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
TrivialCounter | 0 | 1 | 2 | 15 | 62 | 105 |
TrivialCounter | 10 | 47 | 0 | 46 | 316 | 693 |
TrivialCounter | 50 | 374 | 1 | 57 | 254 | 781 |
TrivialCounter | 100 | 887 | -3 | 15 | 96 | 569 |
We expected the trivial counter to be very fast, and it was – but only when the thread count was one. For bigger thread counts the slowdown is quite big. It shows that frequent updates of the same memory location from different processor cores are inefficient: they prevent local caching. When one core updates the variable, the new value is stored in its cache and invalidated in all the other caches. When other cores need the value, they must request it from the cache of the first core. This may still be faster than retrieving the value from the main memory, but it is definitely slower than getting it from the local cache.
One special case is when the delay factor is zero. Then the introduced delay is not as big. I think it is because in this case the next update of the counter happens while the previous one is still in the write buffer; but I may be wrong here.
Now we start worrying: if the most trivial counter can be so slow, what can we expect from a properly synchronised one?
The volatile counter
One of the problems with the TrivialCounter
was that its main value wasn’t volatile
. Let’s try another one, called VolatileCounter
:
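A sketch – the only difference from the TrivialCounter is the volatile modifier:

```java
// Same as TrivialCounter, but the field is volatile. The read-modify-write
// sequences are still not atomic.
public class VolatileCounter extends Counter
{
    private volatile long count = 0;

    @Override
    public void add (long increment)
    {
        count += increment;     // read, add, write: three separate steps
    }

    @Override
    public long getAndReset ()
    {
        long result = count;
        count = 0;
        return result;
    }
}
```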
Note that the volatile
modifier does not provide atomic modification of the value. It only makes sure that any read of this variable is
translated into a proper memory read instruction, and any write is flushed from the processor’s write buffer before executing the next Java statement.
As a result, the time between tmp = count;
and count = tmp;
becomes shorter and more predictable, but that does not prevent either of the effects
mentioned above, only reducing them a bit.
Here are the results for one thread, delay factor zero:
Counter sum: 1357072743
Correct sum: 1231492832
For six threads:
Counter sum: 143726338
Correct sum: 192685609
The first effect took over.
Even when the delay factor is big and the number of threads is small, the effects are visible. Here are two outputs with delay factor of 100 and one thread:
Counter sum: 17381285
Correct sum: 17381322
with the first effect and
Counter sum: 20901042
Correct sum: 17459768
with the second one.
Here are the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
VolatileCounter | 0 | 1 | 12 | 138 | 542 | 1556 |
VolatileCounter | 10 | 47 | 1 | 128 | 651 | 1360 |
VolatileCounter | 50 | 374 | 1 | 50 | 608 | 1643 |
VolatileCounter | 100 | 887 | -2 | 25 | 149 | 1121 |
As we can see, the volatile variables really make things slower. We’ll see why now.
Off-topic: the code with and without volatile
It isn’t really relevant for our tests, but at this point I feel curious: what exactly is the difference in the code generated for TrivialCounter
and VolatileCounter
?
Here is the code of the counter update from the non-volatile case:
And this is the volatile case:
In both cases the counter.add()
was inlined. The non-volatile code checked for null
explicitly, while the volatile one made use of the hardware trap – I don’t know why.
The actual incrementing of a value is also done differently: inc
in one case and load-add
-store in another one. These differences, however, are not important.
What’s important is the addl
instruction with the lock
prefix in the volatile case. The lock
prefix here does not protect the volatile
variable. In fact, it does not
protect anything – the memory location the instruction operates on is thread-local (%rsp
), and even that one isn’t really updated (zero added). What this instruction actually
does is flush the processor write buffers.
The description of the lock
prefix is a bit confusing. The corresponding part of the Intel manual begins in a rather scary way:
Causes the processor’s
LOCK#
signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, theLOCK#
signal ensures that the processor has exclusive use of any shared memory while the signal is asserted
If this were indeed the case, any multi-core programming would be a disaster. Any locked instruction would deny access to memory to all the other cores, and, since there are twelve cores (and there are systems with many more), the chances of one of them issuing such an instruction are very high. The entire execution would stall. Fortunately, a relief comes later in the same manual:
Beginning with the P6 family processors, when the
LOCK
prefix is prefixed to an instruction and the memory area being accessed is cached internally in the processor, theLOCK#
signal is generally not asserted. Instead, only the processor’s cache is locked. Here, the processor’s cache coherency mechanism ensures that the operation is carried out atomically with regards to memory
This is much better. The area being accessed (the top of the stack) is very likely to be cached locally, so nothing really stalls. The cache coherency mechanism takes care of propagating any change in the cached data (on the cache line basis) to other caches. It does not, however, affect the processor write buffers; that’s why all of them must be flushed to the processor cache when any locked instruction is performed. This is not limited to the memory area addressed by the locked instruction – all of them are flushed:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors
The JVM uses this to achieve memory consistency required for the volatile
variables.
Other implementations (such as JET) use the MFENCE
instruction for the same purpose.
If we took a non-volatile version of the code and added the LOCK
prefix to the incq
instruction, we would have produced a properly synchronised counter.
Unfortunately, we can’t force the JVM to do it for us.
The simple counter
Now we move to the simplest counter that actually works. All it takes is to make the relevant methods synchronized
. In this case the volatile
modifier is no longer
necessary. All the variables will be saved to memory at the end of the synchronized
block:
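A sketch of this counter:

```java
// The simplest correct counter: every access is guarded by the object's monitor.
public class SimpleCounter extends Counter
{
    private long count = 0;

    @Override
    public synchronized void add (long increment)
    {
        count += increment;
    }

    @Override
    public synchronized long getAndReset ()
    {
        long result = count;
        count = 0;
        return result;
    }
}
```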
The results are now correct. Here is the output for 12 threads, delay factor zero:
Counter sum: 40933218
Correct sum: 40933218
Here are the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
SimpleCounter | 0 | 1 | 39 | 843 | 1578 | 5565 |
SimpleCounter | 10 | 47 | 70 | 724 | 2000 | 7008 |
SimpleCounter | 50 | 374 | 25 | 365 | 2512 | 7981 |
SimpleCounter | 100 | 887 | 20 | 332 | 2344 | 8550 |
The times are awful. When the number of threads is one, the results are not so bad, but not great, either. Any higher number of threads ruins the picture completely. What is remarkable is that the synchronisation overhead is high even when the delay factor is big: the chances of monitor conflicts are lower in that case, but the cost is still there. In short, this counter is totally unusable.
It is really amazing how slow synchronized blocks are. The time of 8550 ns corresponds to only 117,000 operations per second – just for one counter update.
The atomic counter
In the SimpleCounter
implementation we don’t perform a lot of operations inside the synchronized
block. It is either incrementing a value or swapping it with zero.
There is a more lightweight way to do it than the synchronized
block: using atomic operations. A lot of articles have been written on lock-free multithreaded programming in
Java. What’s meant by “lock-free” is usually “atomic” (sometimes the desired result can be achieved by just using volatile
variables, but not in our case). Let’s try it
and check if it really is faster:
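A sketch of such a counter, built on AtomicLong:

```java
import java.util.concurrent.atomic.AtomicLong;

// A lock-free counter: both the update and the retrieval are single atomic operations.
public class AtomicCounter extends Counter
{
    private final AtomicLong count = new AtomicLong (0);

    @Override
    public void add (long increment)
    {
        count.addAndGet (increment);
    }

    @Override
    public long getAndReset ()
    {
        return count.getAndSet (0);
    }
}
```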
This approach is only applicable for counters, not for gauges. Gauges require several operations to be performed atomically (calculating the sum, the maximum and the minimum, and updating the counter).
The reported values are still correct:
Counter sum: 155891214
Correct sum: 155891214
Here are the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
AtomicCounter | 0 | 1 | 11 | 135 | 498 | 926 |
AtomicCounter | 10 | 47 | 1 | 102 | 440 | 1162 |
AtomicCounter | 50 | 374 | 3 | 72 | 581 | 1559 |
AtomicCounter | 100 | 887 | -1 | 27 | 170 | 968 |
The times look much better than for the synchronized
counter. We’ve got really wonderful values for one thread, almost reasonable values for two threads and high delays.
Unfortunately, the values for higher thread counts are still prohibitively bad (although not as ridiculous as for the synchronized
counter).
Off-topic: how it works
What is the code generated for the atomic counter? Let’s look at the appropriate portion of the run()
method (everything was inlined there):
Three instructions (sete
, movzbl
and test
) are in fact unnecessary. I’m not sure what the first test
instruction is for (I suspect this is the way to interrupt
the loop: for that the VM must unmap the page this instruction points to (0x110000
)). Apart from this, the code is very simple: we get the value, add one and write it back
on condition that the memory location still contains the value we found initially. If not, we repeat the procedure. The compare and swap are done atomically using the instruction kindly
provided by Intel: cmpxchg
. It has a lock
prefix, with usual consequences discussed above. The counter value is likely to be cached, so the lock
prefix does not have
to lock anything but the local cache. It will, however, serialise the reads and writes, which has its price.
The code above directly corresponds to the Java code in the AtomicLong
class:
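The exact source isn’t quoted here; from memory, the loop has roughly this shape (this is how it looked in pre-Java-8 versions of AtomicLong; in Java 8 an equivalent loop lives in sun.misc.Unsafe.getAndAddLong):

```java
// Paraphrased retry loop: keep trying until the compare-and-set succeeds.
public final long getAndAdd (long delta)
{
    while (true) {
        long current = get ();
        long next = current + delta;
        if (compareAndSet (current, next))
            return current;
    }
}
```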
This loop contains no provision for fairness. One thread can get stuck there forever. I made a test to verify how many times this loop usually runs. For that, I made my
own version of AtomicLong
(using sun.misc.Unsafe
), which counted the iterations, stored the counts and printed them eventually. I won’t publish the code here,
for it is straightforward. The test showed that almost every time the loop count is one, very seldom two. I never saw any bigger values even on delay factor of zero and
twelve threads.
This means that the cost of updating the atomic counter isn’t much higher than that of the VolatileCounter
. The execution times show the atomic counter is actually faster.
Compound counters
We’ve looked at two bad solutions, which don’t produce correct results, and two good ones, which do. Neither of them was really fast enough on high thread counts. Even the ones that lacked any synchronisation were slow. We need some other approach to implement counters suitable for highly-parallel applications. We’ll try thread-local counters.
The idea is that each thread, when started, requests thread-local views of all the counters it is going to use. The base counter keeps track of all such views and, when requested a value, asks them all for their local values and adds all of those together.
Let’s first add a method to the Counter
base class:
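A sketch of such a method (the default behaviour shown here is an assumption):

```java
// Added to the Counter base class: by default a counter is its own thread-local
// view; compound counters override this to hand out one view per thread.
public Counter getThreadLocalView ()
{
    return this;
}
```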
Then we’ll introduce the CompoundCounter
class, which will be parameterised with the class of the thread-local counter:
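A sketch of what this class could look like (simplified, as noted right below; the use of Class.newInstance() is an assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the compound counter: it keeps a list of thread-local views and adds
// their values to its own on retrieval.
public class CompoundCounter extends Counter
{
    private final Class<? extends Counter> viewClass;
    private final List<Counter> views = new ArrayList<> ();
    private long count = 0;     // the counter's own, directly usable value

    public CompoundCounter (Class<? extends Counter> viewClass)
    {
        this.viewClass = viewClass;
    }

    @Override
    public synchronized Counter getThreadLocalView ()
    {
        try {
            Counter view = viewClass.newInstance ();
            views.add (view);
            return view;
        } catch (InstantiationException | IllegalAccessException e) {
            throw new RuntimeException (e);
        }
    }

    @Override
    public synchronized void add (long increment)
    {
        count += increment;
    }

    @Override
    public synchronized long getAndReset ()
    {
        long result = count;
        count = 0;
        for (Counter view : views)
            result += view.getAndReset ();
        return result;
    }
}
```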
This is a simplistic implementation. The proper one must take care of threads starting and stopping. The counters created by the stopped threads must be removed from the list –
either by means of a finalizer, or by using a WeakReference. We’ll skip this complication for now.
The compound counter has a properly synchronised local value as well, so it can be used directly, too.
We need to modify the counterTest
method, where the threads are created:
Let’s try this CompoundCounter
with all four of the counter types created before:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Trivial | 0 | 1 | 3 | 3 | 3 | 3 |
Compound Trivial | 10 | 47 | 1 | 0 | 0 | 0 |
Compound Trivial | 50 | 374 | 1 | 2 | 1 | 1 |
Compound Trivial | 100 | 887 | 3 | -3 | -2 | -2 |
Compound Volatile | 0 | 1 | 11 | 11 | 11 | 11 |
Compound Volatile | 10 | 47 | 1 | 1 | 1 | 1 |
Compound Volatile | 50 | 374 | 2 | 2 | 4 | 4 |
Compound Volatile | 100 | 887 | -1 | -3 | 1 | 1 |
Compound Simple | 0 | 1 | 67 | 73 | 72 | 85 |
Compound Simple | 10 | 47 | 56 | 53 | 55 | 57 |
Compound Simple | 50 | 374 | 24 | 46 | 37 | 51 |
Compound Simple | 100 | 887 | 17 | 24 | 23 | 26 |
Compound Atomic | 0 | 1 | 11 | 11 | 11 | 11 |
Compound Atomic | 10 | 47 | 1 | 1 | 1 | 1 |
Compound Atomic | 50 | 374 | 5 | 6 | 4 | 4 |
Compound Atomic | 100 | 887 | -1 | -6 | 0 | 1 |
This looks much better. All the results improved a lot, including those that didn’t use any synchronisation. This is the result of not sharing variables between threads. The synchronised and atomic counters also benefit from the absence of synchronisation collisions.
The performance of the compound atomic counter is already within the target limits that we established in the beginning (10 ns). Unfortunately, this solution is only applicable for counters, not for gauges (or, rather, I didn’t manage to come up with a suitable solution for gauges using atomics; it may still exist).
The performance of the synchronised counter (SimpleCounter
) is at the boundary. We probably can afford one synchronised update operation per event handler
at these times, but not more. The times for gauges are likely to be similar, which means that we don’t have any suitable solution for gauges yet, and must
try something else. Perhaps, we can improve the counters as well?
Indirect counters
Let’s try a different approach. Instead of storing the value inside the counter, we’ll store it in its own object and swap the reference. We’ll call this version “an indirect counter”.
We’ll see later that other code may benefit from getting access to that separate object in a generic way, that’s why we’ll create several abstract classes before introducing the real solution.
First, we need an interface to the object that stores the value:
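A sketch (the interface name and its methods are assumptions):

```java
// The object that actually stores a counter's value.
public interface CounterValue
{
    void add (long increment);
    long get ();
}
```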
Then the base class of all the indirect counters:
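A sketch of this base class:

```java
// Sketch: an indirect counter can hand out its value object without reading it,
// so the server may decide later when to look inside.
public abstract class IndirectCounter extends Counter
{
    /** Swaps in a fresh value object and returns the old one, unread. */
    public abstract CounterValue getAndResetValue ();

    @Override
    public long getAndReset ()
    {
        return getAndResetValue ().get ();    // immediate (non-lazy) retrieval
    }
}
```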
Finally, the actual implementation, which we’ll call LazyCounter
, because it allows delayed fetching of its value:
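A sketch of both the value object and the counter (the name MutableCounterValue is used later in the article; the rest of the details are assumptions):

```java
// The mutable value object; one long field, no synchronisation.
class MutableCounterValue implements CounterValue
{
    private long count = 0;

    @Override
    public void add (long increment) { count += increment; }

    @Override
    public long get () { return count; }
}

// The value lives behind a volatile reference: the server swaps the reference
// and reads the old object later; the client picks up the new one quickly.
public class LazyCounter extends IndirectCounter
{
    private volatile MutableCounterValue current = new MutableCounterValue ();

    @Override
    public void add (long increment)
    {
        current.add (increment);       // no synchronisation: one client thread only
    }

    @Override
    public CounterValue getAndResetValue ()
    {
        MutableCounterValue old = current;
        current = new MutableCounterValue ();
        return old;
    }
}
```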
The reference to the object is stored in a volatile
field, which prohibits keeping it in a register: it must be read from memory each time it’s used, even if the same counter
is updated several times in a row. This adds one extra memory access per counter update compared to non-indirect counters. However, as we saw, unlike writing, reading
of volatile
does not incur any additional costs compared to ordinary fields. This one is likely to be cached, so the cost is low. The benefit is that the client always
sees the latest value of this reference, and, when it is swapped, starts updating the new one, while the server is still holding on to the old value.
The nice part of this strategy is that the referenced value may contain more than
one field that must be updated in one transaction. We’ll see it in our implementation of the Gauge
later.
There is no synchronisation anywhere, so the class is still not suitable for simultaneous multi-threaded update. In a multi-client environment it can only be used as a thread-local
view of a compound counter. It does, however, improve over the TrivialCounter
and VolatileCounter
: it completely eliminates the second effect
(missing the reset of the counter). At some point after the swap (and rather quickly, due to the volatile
reference), the counter will start updating the new value. This means that
our results can’t be totally out anymore.
Does it mean that this is a completely correct solution? We know it can’t be. No multi-threaded Java program can be completely correct without using some synchronisation primitives (although, as we’ll see later, some degree of cheating may help incorrect program produce sufficiently correct results). There must be something wrong with this version, and there is. There are at least two problems that are not resolved:
1) When the server thread replaces the value object, there is a short interval of time when both the client and the server threads have access to it. The client may change the value after it has been collected by the server, in which case that update will be lost. In the case of a gauge, where more than one value is kept, there is also a possibility that the server uses a half-updated gauge value.
2) Even if the client has updated the reference and isn’t busy using the old one, the processor may still do it: the last value written into the counter variable may still be in a processor write buffer. We don’t know how long this may last. Probably, not that long (the primary reason why a value may get stuck in that buffer is a cache miss, and our value has just been read and is therefore cached).
Since the LazyCounter
can’t be used on its own with more than one client thread, we’ll run our tests with a CompoundCounter
based on it.
We see that the problem does indeed occur. With the delay factor of zero, we see counter errors even on one thread:
Counter sum: 3860183598
Correct sum: 3860183606
The error, obviously, only gets bigger with more threads. On 12 threads we get:
Counter sum: 34326496626
Correct sum: 34326497846
The error gets smaller with the delay factor growing, but it never becomes zero. Even for the delay of 100 and one thread, we can still see errors:
Counter sum: 16797154
Correct sum: 16797155
If the error was always one or two counts, we could perhaps ignore it (as I said in the beginning, the counters aren’t mission-critical). However, we sometimes observe much bigger differences. We’ll first try to fix those, and only then look at the times.
Indirect counter: lazy retrieval
As said before, the reason for the counter difference is that the last update by the client thread
may coincide with the retrieval by the server thread. The time window for that is very small, though – most likely, not more than 10 CPU cycles. Here comes a solution:
the server must delay using the value until quite sure it is completely stable and isn’t going to change anymore – we’ll call it the cool-down period.
A couple of hundred nanoseconds should be enough. Let’s, however, be completely paranoid and make this period half the collection interval (30 seconds in the real program, half a second in our tests).
The server thread will run twice as often as before and collect the values on the even iterations, while using them on odd ones. This is where our IndirectCounter
interface
becomes handy. We’ll change our test loop in the following way:
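A sketch of the modified server loop, in terms of the IndirectCounter sketch above:

```java
// Collect the value objects on even iterations, read them on odd ones, so that
// every object cools down for half a collection interval before it is used.
void serverLoop (IndirectCounter counter) throws InterruptedException
{
    CounterValue pending = null;
    for (int iteration = 0; ; iteration++) {
        Thread.sleep (500);                          // half a collection interval
        if (iteration % 2 == 0) {
            pending = counter.getAndResetValue ();   // swap now, read later
        } else {
            System.out.println ("Counter retrieved: " + pending.get ());
        }
    }
}
```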
We call it a “lazy retrieval”, because the object being retrieved isn’t used immediately, but rather stored for future use.
The tests of LazyCounter
with one thread look promising: no updates are lost. However, the higher thread counts are still not covered: the
LazyCounter
doesn’t support multiple clients, and CompoundCounter
does not support delayed operation. We need one last step.
Lazy compound indirect counters
This counter will resemble the CompoundCounter
that we saw before, with one important difference: it won’t touch the values until really forced to do so.
In the meantime it will keep the value objects in an array. This is really what is called “lazy operation” – a very common trick in functional programming, where an expression is kept in its symbolic form until the value is really needed.
Here are the classes:
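They are not reproduced here; a sketch of what they could look like (in the repository they live in separate files, and any details beyond the names LazyCompoundCounter and CompoundCounterValue are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// A holder for the collected value objects: they are only read and added together
// when get() is finally called.
class CompoundCounterValue implements CounterValue
{
    private long localValue;
    private final List<CounterValue> values;

    CompoundCounterValue (long localValue, List<CounterValue> values)
    {
        this.localValue = localValue;
        this.values = values;
    }

    @Override
    public void add (long increment) { localValue += increment; }

    @Override
    public long get ()
    {
        long sum = localValue;
        for (CounterValue v : values) sum += v.get ();   // the lazy part
        return sum;
    }
}

// The lazy compound counter, parameterised with the class of the thread-local view.
public class LazyCompoundCounter extends IndirectCounter
{
    private final Class<? extends IndirectCounter> viewClass;
    private final List<IndirectCounter> views = new ArrayList<> ();
    private long count = 0;

    public LazyCompoundCounter (Class<? extends IndirectCounter> viewClass)
    {
        this.viewClass = viewClass;
    }

    @Override
    public synchronized Counter getThreadLocalView ()
    {
        try {
            IndirectCounter view = viewClass.newInstance ();
            views.add (view);
            return view;
        } catch (InstantiationException | IllegalAccessException e) {
            throw new RuntimeException (e);
        }
    }

    @Override
    public synchronized void add (long increment) { count += increment; }

    @Override
    public synchronized CounterValue getAndResetValue ()
    {
        List<CounterValue> collected = new ArrayList<> ();
        for (IndirectCounter view : views)
            collected.add (view.getAndResetValue ());    // swap references only
        CounterValue result = new CompoundCounterValue (count, collected);
        count = 0;
        return result;
    }
}
```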
Let’s run it giving it LazyCounter.class
as a parameter. We don’t see any counter losses. This is the result for delay factor zero, 12 threads:
Counter sum: 34882489418
Correct sum: 34882489418
Although this solution is, as we admitted, not completely corresponding to the Java spec, it is still practically usable. And it is, we hope, fast. Let’s look at the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy | 0 | 1 | 2 | 21 | 2 | 3 |
Compound Lazy | 10 | 47 | 1 | 13 | 48 | 60 |
Compound Lazy | 50 | 374 | -2 | 21 | 98 | 122 |
Compound Lazy | 100 | 887 | -8 | -3 | 20 | 41 |
Here comes the disappointment: the solution is in fact not fast. At least, not faster than the compound counter based on the atomic counter. What went wrong?
The false sharing problem
We’ve mentioned simultaneous update of a variable by multiple threads as a source of a big inefficiency. We saw how it slowed down all our counters before we introduced the compound one.
The problem, however, is much bigger. The caches operate on the cache line basis (64 bytes on our architecture). When a variable is updated, not only its value is invalidated in other caches, but also the values of all the variables that share the same cache line. This is called “false sharing”. Can it be the reason for the observed performance degradation?
The counter value objects are allocated by the server thread in sequence, one immediately after another. They can easily end up in the same cache lines. Let’s check if this happens by printing the addresses of allocated objects. This is how we do it:
This class uses sun.misc.Unsafe
to get hold of the objects’ addresses, collects these addresses into an array and prints the contents of that array when it has enough.
To make its output easier to read, it prints the first address and then the differences.
Eclipse refused to compile this class, at least with its standard error settings. Fortunately, javac compiles it, just giving four warnings.
The LazyCounter
must call the addValue
after allocating a new object:
This is the output we get after several seconds of running with 12 threads:
-177946356:
987 3 3 3 3 3 3 3 3 3 3 17881 3 3 3 3 3 3 3 3 3 3 3 175 3 3 3 3 3 3 3 3 3 3 3
175 3 3 3 3 3 3 3 3 3 3 3 175 3 3 3 3 3 3 3 3 3 3 3 175 3 3 3 3 3 3 3 3 3 3 3
175 3 3 3 3 3 3 3 3 3 3 3 175 3 3 3 3 3 3 3 3 3 3 3 175 3 3 3
The address difference of 3 looks confusing, but is explained easily. By default, JVM uses compressed pointers. These pointers occupy four bytes (that’s why we read the v
variable
using unsafe.getInt()
, not getLong()
) and serve as indices. They are multiplied by 8 and added to a base address. This way a 32-bit address space can address
up to 32Gb heap. What we see in the output is that the objects are allocated 24 bytes apart from each other. Every 12th difference is bigger (175, or 1400 bytes),
because the server allocates some other objects after collecting the values from all the thread local views.
If we prefer uncompressed pointers, we can replace int
with long
in the Dump
class and run the test with pointer compression off:
# java -XX:-UseCompressedOops Test
8288454184:
9656 24 24 24 24 24 24 24 24 24 24 178712 24 24 24 24 24 24 24 24 24 24 24
1832 24 24 24 24 24 24 24 24 24 24 24 1832 24 24 24 24 24 24 24 24 24 24 24
1832 24 24 24 24 24 24 24 24 24 24 24 1832 24 24 24 24 24 24 24 24 24 24 24
1832 24 24 24 24 24 24 24 24 24 24 24 1832 24 24 24 24 24 24 24 24 24 24 24
1832 24 24 24
We’ve learned that an object with a single long
field takes up 24 bytes in memory, and that the objects are allocated next to each other. This means that two of them will
always share the same cache line, sometimes three. The false sharing is really happening here.
One can think of many ways to improve the situation. For instance, we can pre-allocate a lot of the value objects and give them out in an order that’s different from the order they were allocated. We’ll look at some of these options later. For now, since we are not writing production code, let’s follow the easiest route and make the object bigger – pad it to 64 bytes:
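A sketch of the padded value object: a 16-byte object header plus six longs gives 64 bytes.

```java
// Padded version of the value object: the dummy fields are never used, they only
// make sure two value objects never share a cache line.
class MutableCounterValue implements CounterValue
{
    private long count = 0;
    private long pad1, pad2, pad3, pad4, pad5;    // padding to 64 bytes

    @Override
    public void add (long increment) { count += increment; }

    @Override
    public long get () { return count; }
}
```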
The address dump now reports the difference of 64:
8300513448:
9696 64 64 64 64 64 64 64 64 64 64 178752 64 64 64 64 64 64 64 64 64 64 64
1872 64
Why didn’t we suffer from false sharing in our CompoundCounter
test? We just got lucky. The thread local views were allocated in a loop:
Other objects were allocated between calls to getThreadLocalView()
– the Test
object and whatever the class Thread
allocated in its constructor.
Measurements using a tool similar to the Dump class show that the distance between objects was 520 bytes. That’s why it was so fast.
Now it’s time to check the speed:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy | 0 | 1 | 4 | 4 | 4 | 4 |
Compound Lazy | 10 | 47 | 0 | 0 | 0 | 0 |
Compound Lazy | 50 | 374 | -1 | 1 | 2 | 3 |
Compound Lazy | 100 | 887 | -6 | -3 | 1 | 1 |
This is an excellent result. All the times are well below the target value of 10 ns, and generally better than the compound solution based on the atomic counter. We finally found a solution that really works, and is incredibly fast – all at the cost of not being completely legal.
A thread-independent solution
The solution we’ve just presented is very good, but it has one obvious deficiency. It depends on the threads explicitly obtaining the thread-local views of the compound counter. This requires extra effort and may be error-prone (one thread may accidentally use a counter from another one; there isn’t any syntax checking for that). Sometimes thread creation can be outside of our control. For instance, a third-party HTTP server may call our code from whichever thread it finds suitable. There are also general-purpose libraries and other utility code that can be called from anywhere. Can such code still benefit from the thread-local solution?
For the cases like that we’ll create a special adapter that is built on top of the LazyCompoundCounter
and is using the Java’s ThreadLocal
utility class.
We’ll call it simply a FastCounter
:
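A sketch of this class, in terms of the sketches above:

```java
// A thread-independent facade: each thread finds (or lazily creates) its own view
// via ThreadLocal, so no explicit view management is required from the clients.
public class FastCounter extends IndirectCounter
{
    private final LazyCompoundCounter base = new LazyCompoundCounter (LazyCounter.class);

    private final ThreadLocal<Counter> localView =
        ThreadLocal.withInitial (() -> base.getThreadLocalView ());

    @Override
    public void add (long increment)
    {
        localView.get ().add (increment);    // one hash map lookup per update
    }

    @Override
    public Counter getThreadLocalView ()
    {
        return this;     // behave as if there were no thread-local views
    }

    @Override
    public CounterValue getAndResetValue ()
    {
        return base.getAndResetValue ();
    }
}
```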
The thread local views are stored in the ThreadLocal
utility collection, which is implemented as a hash map (a different kind from the standard HashMap
– it uses
re-hashing rather than entry chains) associated with the current Thread
object. The operations on this collection are completely unsynchronised, and don’t cost too much.
On every add()
operation this class checks if there is a thread-local view associated with the current thread, and requests one if not. It adds additional operations,
but does not require any locking.
The example class above re-defines getThreadLocalView()
as a trivial function, just because our test code calls it, and we want this code to behave as if there are
no thread-local views. In real-life code this function will be implemented properly, so that the clients that know their threads can request a thread-local view and save
those hash map lookups.
Here are the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
FastCounter | 0 | 1 | 7 | 7 | 7 | 7 |
FastCounter | 10 | 47 | 1 | 1 | 1 | 1 |
FastCounter | 50 | 374 | -1 | -1 | 3 | 3 |
FastCounter | 100 | 887 | -6 | -7 | -3 | -1 |
This also looks very good. It runs slower than the compound counter with explicit thread-local views only on the delay factors 0 and 10, and even then it is fast enough.
The fast atomic counter
As we remember, the atomic counters also performed well, but only in the thread-local setup. Let’s implement a fast atomic counter, based on the usual
(non-lazy) compound counter and the ThreadLocal
map:
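A sketch, reusing the CompoundCounter and AtomicCounter sketches above:

```java
// The same ThreadLocal facade, but built on the ordinary (non-lazy) compound
// counter with atomic views - fully synchronised, "legal" Java.
public class FastAtomicCounter extends Counter
{
    private final CompoundCounter base = new CompoundCounter (AtomicCounter.class);

    private final ThreadLocal<Counter> localView =
        ThreadLocal.withInitial (() -> base.getThreadLocalView ());

    @Override
    public void add (long increment)
    {
        localView.get ().add (increment);
    }

    @Override
    public long getAndReset ()
    {
        return base.getAndReset ();
    }
}
```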
The results are a bit worse than those of FastCounter
, but the solution is still usable:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
FastAtomicCounter | 0 | 1 | 15 | 15 | 15 | 15 |
FastAtomicCounter | 10 | 47 | 6 | 7 | 7 | 5 |
FastAtomicCounter | 50 | 374 | 16 | 27 | 21 | 21 |
FastAtomicCounter | 100 | 887 | 20 | 24 | 27 | 26 |
The positive feature of this solution is that it will satisfy any Java purist: it is completely correct and properly synchronised.
Gauges
Now it’s time to look at the gauges. This time we won’t implement obviously incorrect solutions. We also won’t need a special IndirectGauge
interface, because gauge’s getAndReset()
method already returns an object rather than a primitive value. There isn’t any Atomic
solution. The classes
implemented for gauges include:
- EmptyGauge: does nothing
- SimpleGaugeValue: an immutable implementation of GaugeValue. It is returned by ordinary (non-lazy) implementations
- SimpleGauge: the most obvious synchronized solution
- CompoundGauge: a non-lazy compound solution based on an arbitrary gauge implementation and the idea of a thread-local view
- LazyGauge: a gauge that keeps its value inside a MutableGaugeValue referenced by a volatile reference
- LazyCompoundGauge: a lazy compound solution, which collects the values from its thread-local views into a collection called CompoundGaugeValue, so that they could be added together later
- FastGauge: a thread-independent facade of the LazyCompoundGauge, based on Java’s ThreadLocal.
The implementations are straightforward and look very similar to those of the counters. I won’t show them all here; they are available in the repository. I’ll only show the core of the lazy compound solution – the thread local views:
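A sketch of what these views could look like (the conversion to GaugeValue and the compound machinery are omitted):

```java
// The mutable value object of the lazy gauge: four fields updated in one go,
// without any synchronisation.
class MutableGaugeValue
{
    long count = 0;
    long sum   = 0;
    long min   = Long.MAX_VALUE;
    long max   = Long.MIN_VALUE;

    void report (long value)
    {
        ++ count;
        sum += value;
        if (value < min) min = value;
        if (value > max) max = value;
    }
}

// The thread-local view: the value object sits behind a volatile reference,
// which the server swaps and reads after the cool-down period.
public class LazyGauge extends Gauge
{
    private volatile MutableGaugeValue current = new MutableGaugeValue ();

    @Override
    public void report (long value)
    {
        current.report (value);      // one client thread only
    }

    public MutableGaugeValue getAndResetValue ()
    {
        MutableGaugeValue old = current;
        current = new MutableGaugeValue ();
        return old;
    }
}
```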
All the versions produced correct results – the counts and the sums were exactly the same as expected. Here are the times:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Simple | 0 | 2 | 14 | 218 | 704 | 1884 |
Simple | 10 | 47 | 16 | 400 | 1067 | 3706 |
Simple | 50 | 394 | 35 | 107 | 1884 | 5866 |
Simple | 100 | 898 | 40 | 83 | 1574 | 6047 |
Compound Simple | 0 | 2 | 14 | 15 | 16 | 19 |
Compound Simple | 10 | 47 | 17 | 17 | 17 | 17 |
Compound Simple | 50 | 394 | 42 | 41 | 43 | 42 |
Compound Simple | 100 | 898 | 49 | 49 | 49 | 52 |
Compound Lazy | 0 | 2 | 8 | 18 | 39 | 44 |
Compound Lazy | 10 | 47 | 5 | 30 | 90 | 86 |
Compound Lazy | 50 | 394 | 6 | 23 | 81 | 99 |
Compound Lazy | 100 | 898 | 13 | 20 | 33 | 41 |
Just like with the counters, the ordinary SimpleGauge
is very slow, while CompoundGauge
built on top of it is better, but still not ideal.
The lazy compound version, which we hoped for so much, is even worse. This time we don’t need to look far for the reason: obviously, we’ve been caught by false sharing again.
The MutableGaugeValue
object is 48 bytes long and can share the same cache line with another such object. We need to pad it to 64 bytes by two extra longs:
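A sketch of the padded version:

```java
// Two dummy longs bring the object from 48 to 64 bytes - the size of one cache line.
class MutableGaugeValue
{
    long count = 0;
    long sum   = 0;
    long min   = Long.MAX_VALUE;
    long max   = Long.MIN_VALUE;

    long pad1, pad2;    // padding only, never used

    void report (long value)
    {
        ++ count;
        sum += value;
        if (value < min) min = value;
        if (value > max) max = value;
    }
}
```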
Here are the results:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy | 0 | 2 | 3 | 4 | 7 | 5 |
Compound Lazy | 10 | 47 | 0 | 10 | 7 | 16 |
Compound Lazy | 50 | 394 | 10 | 33 | 56 | 63 |
Compound Lazy | 100 | 898 | 14 | 23 | 34 | 49 |
They are a bit better but still far from ideal. It is easy to see why. One gauge update involves updates to four fields. Each of them can cause false sharing with any other.
Making the object 64 bytes long will only help if all the objects are aligned by 64 bytes. In practice, they are not, which is easy to check if we use our Dump
tool
again and modify it to print the address modulo 64 as well as the distance to the previous object. Here is the output (with compressed pointers off):
9720:24 64:24 64:24 64:24 64:24 64:24 64:24 64:24 64:24 64:24 64:24 3408:40
64:40 64:40 64:40 64:40 64:40 64:40 64:40 64:40 64:40 64:40 64:40 1648:24
Because of this, the last of the four fields can share the cache line with the first field of the following object. To make sure this doesn’t happen, we need to provide
the distance of at least seven longs (56 bytes) between the blocks of fields. That requires three more dummy fields, which makes the size of our object 88 bytes.
And here are the times:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy | 0 | 2 | 3 | 3 | 3 | 3 |
Compound Lazy | 10 | 47 | 6 | 5 | 5 | 6 |
Compound Lazy | 50 | 394 | 7 | 11 | 11 | 14 |
Compound Lazy | 100 | 898 | 9 | 19 | 20 | 21 |
This is much better. It is still not as fast as the counters, but already within our “almost good” range (under 20 ns). It has a disadvantage, though. This scheme causes each value to occupy two cache lines, preventing any other data from being placed there, and cache lines are valuable resources: there are only 512 of them in a 32K data cache. Aligning the values by 64 bytes would be better, and storing the values used by the same thread next to each other – even better.
Now we are ready to introduce a FastGauge
(the LazyCompoundGauge
with a ThreadLocal
-based facade):
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
FastGauge | 0 | 2 | 14 | 14 | 14 | 14 |
FastGauge | 10 | 47 | 10 | 11 | 11 | 11 |
FastGauge | 50 | 394 | 15 | 15 | 17 | 22 |
FastGauge | 100 | 898 | 22 | 16 | 26 | 32 |
This is a bit slow on high thread counts and high delay factors, but still usable.
Possible problems
Our reference variable is volatile
while our value is not. Java promises that modified non-volatile variables will be flushed to memory before the next
synchronisation point (entering or leaving the synchronized
block or writing a volatile
variable). Our solution may fail in the very unlikely case when
the program does not reach any synchronisation point before we start using collected data. Most likely, it won’t fail even then – typical code generators
prefer not to keep values in registers for such a long time if they can avoid that. However, a theoretical possibility exists.
Thread scheduling may also play a role. When talking about the effects of thread synchronisation, we usually look at multi-core scenarios, where the pieces of code really run in parallel, but in our case the danger lies in the opposite case – when several threads share one core. A client thread may be removed from the processor just before it is ready to write the updated counter back into memory, and only scheduled back after 30 seconds. In real life this is only a theoretical possibility. Proper thread schedulers allocate CPU to threads without causing such delays (unless this is a very low-priority thread; probably, a low priority should be a counter-indication for the use of lazy thread-local statistical variables).
Another case worth mentioning is a garbage collection pause. Let’s imagine the garbage collector kicks in right after the counter update but before the memory write. All the threads are suspended. The GC operation runs for 30 seconds. When the threads continue, the server thread wakes up first and, having discovered that 30 seconds have passed, it proceeds to collect the data and happily uses incorrect values. This means that long garbage collection pauses must also be considered a counter-indication for lazy thread-local variables. However, they should be considered a counter-indication for everything; if the GC delays are so long, there is something wrong with the program. One should consider changing the JVM, or the GC, or its parameters, or the heap size, or, possibly, modifying the program. Perhaps only some long-running batch processors may tolerate such long GC delays, but programs like that do not require high-performance real-time statistical counters.
Anyway, the worst that may happen is a loss of some counts, or a partially updated gauge value, which will result in incorrect averages. However, since these variables aren’t mission-critical, this unlucky event won’t cause real damage. I agree that any really mission-critical code must rely on proper synchronisation and comply with the Java standard fully. And if it is slow, so be it.
If these problems are really experienced, one can consider increasing the cool-down period. It does not have to be half the collection interval. It does not even have to be smaller than that interval. We can make it 10 minutes, although it would require storing the last 10 results somewhere. If some thread suspends for 10 minutes or does not perform any synchronised operation for 10 minutes, the program has a problem much bigger than just some counters being a bit out.
The volatile option
If we feel really paranoid about our main counter variable not being declared volatile
, we can declare it as such.
Let’s add a MutableVolatileCounterValue
class where the value
is volatile
. In the case of a gauge, we don’t need to make all four values volatile
, we can get away
with just one – the last one updated (counter
in our case). See MutableVolatileGaugeValue
class.
We’ll build LazyVolatileCounter
and LazyVolatileGauge
classes on top of these, and test the lazy compound solutions with them. Here are the times for a counter:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy Volatile | 0 | 1 | 10 | 10 | 10 | 10 |
Compound Lazy Volatile | 10 | 47 | 1 | 1 | 1 | 1 |
Compound Lazy Volatile | 50 | 374 | 4 | 5 | 7 | 10 |
Compound Lazy Volatile | 100 | 887 | 3 | 5 | 10 | 18 |
And these are the times for gauges:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound Lazy Volatile | 0 | 2 | 6 | 5 | 5 | 5 |
Compound Lazy Volatile | 10 | 47 | 4 | 4 | 4 | 5 |
Compound Lazy Volatile | 50 | 394 | 6 | 7 | 11 | 19 |
Compound Lazy Volatile | 100 | 898 | 9 | 8 | 14 | 23 |
The counters are a bit slower than the non-volatile
ones, while with gauges the times are roughly the same. This means that both volatile
and non-volatile
solutions
are practically usable. Note that both are incorrect with respect to the Java spec. The volatile
one is a bit more robust against the compiler keeping values in registers,
but both are vulnerable to the thread scheduling oddities and long GC pauses.
Personally, I still prefer the non-volatile
solution. Our test was a bit artificial. All the test function was doing was calculating some floating-point
function. The real-life code is more likely to read and write some memory, which may cause the operations with the lock
prefix to be more expensive. But I agree that
I may be wrong here – after all, I didn’t manage to make an example where volatile
would make the code run much slower.
Alternative solutions
In our implementation of the LazyCounter
and LazyGauge
we allocate a new object each time we request the value. I don’t think this is a big expense, but some people
suggested that we can re-use the old ones. Specifically, if the cool-down period is set as half of the collection interval, we could get away with exactly two value objects.
At each point in time, the client thread is using one and the server thread is using the other one, and at some point they are swapped. We can keep both references inside the
counter object:
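A sketch of this variation (the class name and the details are hypothetical):

```java
// Two pre-allocated value objects inside the counter; the server swaps their roles
// on each collection and reads the retired one only after the cool-down period.
// The server must clear the retired object (e.g. via a reset method, not shown)
// before the next swap.
public class TwoValueLazyCounter extends IndirectCounter
{
    private final MutableCounterValue valueA = new MutableCounterValue ();
    private final MutableCounterValue valueB = new MutableCounterValue ();

    private volatile MutableCounterValue current = valueA;

    @Override
    public void add (long increment)
    {
        current.add (increment);
    }

    @Override
    public CounterValue getAndResetValue ()
    {
        MutableCounterValue old = current;
        current = (old == valueA) ? valueB : valueA;   // the other object takes over
        return old;
    }
}
```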
We could combine these two values into an array:
We could make currentIndex
global and share it between all the counters.
The additional benefit of all these versions is that they help prevent the false sharing problem: all the objects are allocated ahead of time and not necessarily immediately after each other.
We could even go as far as placing both values right inside the counter object and changing the interface:
These are all valid solutions, but they share common disadvantages:
- The schedule of value-swapping is hard-coded into the counter or gauge code
- Previously, the “second effect” (see the TrivialCounter section) was eliminated. The abnormal thread scheduling could cause loss of some counts but not totally incorrect results. Once we start re-using the values, the scenarios with incorrect results become possible. No matter what we do, at some point we must reset the counter value, and the client thread can update it right after that.
This, by the way, makes it difficult to implement a similar scheme in C++. Allocation of new objects doesn’t work there, because no one can ever reliably destroy the object; there is always a possibility that the client thread is still using it. Perhaps the reference must be deposited into some kind of “purgatory”, until the next round of requesting the values makes sure the old ones aren’t used. This is one of the very few cases where the garbage collector has some advantage over traditional memory management.
One other approach worth mentioning is the lazy allocation of value objects inside a counter:
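A sketch of this variant (the class name is an assumption); it matches the race scenario discussed below:

```java
// The value object is created lazily, in the client thread, on the first update
// after a reset; the server simply takes whatever is there (possibly null).
public class LazyAllocCounter extends IndirectCounter
{
    private volatile MutableCounterValue current = null;

    @Override
    public void add (long increment)
    {
        MutableCounterValue cur = current;
        if (cur == null) {                        // first update after a reset
            cur = new MutableCounterValue ();     // allocated in the client thread
            current = cur;
        }
        cur.add (increment);
    }

    @Override
    public CounterValue getAndResetValue ()
    {
        MutableCounterValue old = current;
        current = null;              // may return null if nothing was counted
        return old;
    }
}
```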
The CompoundCounterValue
must be modified to take care of null
values. The memory is allocated in the client threads, and we can hope that the memory
manager allocates the values belonging to different threads far enough from each other, thus resolving the false sharing issue. We can then remove additional padding.
The times look quite good:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound LazyAlloc | 0 | 1 | 2 | 2 | 2 | 2 |
Compound LazyAlloc | 10 | 47 | -1 | 0 | 0 | 0 |
Compound LazyAlloc | 50 | 374 | 1 | 5 | 4 | 6 |
Compound LazyAlloc | 100 | 887 | 4 | 4 | 8 | 9 |
Similar solution can be implemented for gauges:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound LazyAlloc | 0 | 2 | 5 | 4 | 4 | 4 |
Compound LazyAlloc | 10 | 47 | 0 | 0 | 1 | 1 |
Compound LazyAlloc | 50 | 394 | 14 | 18 | 18 | 18 |
Compound LazyAlloc | 100 | 898 | 12 | 14 | 18 | 22 |
Unfortunately, these solutions suffer from a race condition that can’t be resolved by increasing the collection interval. Here is the sequence of events
that causes one counter loss. In the beginning current
is null
:
Client thread | Server thread |
---|---|
cur = current; // null | result = current; // null |
cur = new MutableCounterValue (); | |
current = cur; | |
cur.add (increment); | current = null; |
This is a very rare case: there must be no counter updates for the entire collection interval, and one must happen exactly at the time of collection. Still, it is possible, and, since this is the only counter update in the entire collection interval, we don’t want to miss it.
A very exotic solution
We had to make our gauge object 88 bytes long because there was no way to tell the JVM to align it at 64 bytes. We mentioned that this was inefficient with regards to cache lines.
We can arrange our own heap and make sure it is properly aligned; we’ll allocate our own man-made objects and ignore the Java heap management. Our heap
will live inside a direct byte buffer. The object addresses inside this heap will be ordinary int
s.
The BufferGaugeValue
class will contain a LongBuffer
, an object allocator and a set of static access methods:
On top of these operations we’ll implement updating of the gauge value:
And then we need a proper Java object to encapsulate our artificial object inside a CompoundGaugeValue
:
Finally, we create a LazyBufferGauge
:
Obviously, the buffer operations introduce additional delays, and there are other drawbacks, such as limited hard-coded heap size, but our goal of having our objects aligned by 64 bytes is achieved, and the times are good:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Compound LazyBufferGauge | 0 | 2 | 6 | 5 | 5 | 5 |
Compound LazyBufferGauge | 10 | 47 | 4 | 4 | 4 | 5 |
Compound LazyBufferGauge | 50 | 394 | 8 | 10 | 15 | 17 |
Compound LazyBufferGauge | 100 | 898 | 9 | 12 | 17 | 20 |
I wouldn’t recommend this solution for this specific problem, but it mustn’t be dismissed as something completely ridiculous. We’ll keep it in our toolbox and possibly find some use for it in later articles.
A combined statistics approach
The best way of allocating the thread-local values would be to put together values belonging to the same thread but to different counters and gauges. This would make the most efficient use of cache lines. It would, however, make the overall design of the statistical system too complex, so I won’t follow this idea up here.
Summary
Let’s bring together all the solutions that produce correct or almost correct results. To make the lists shorter, let’s choose one realistic delay factor, 50 (base time 370 – 390 ns). Here are the times for the counters:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
SimpleCounter | 50 | 374 | 25 | 365 | 2512 | 7981 |
AtomicCounter | 50 | 374 | 3 | 72 | 581 | 1559 |
Compound Simple | 50 | 374 | 24 | 46 | 37 | 51 |
Compound Atomic | 50 | 374 | 5 | 6 | 4 | 4 |
Compound Lazy | 50 | 374 | -1 | 1 | 2 | 3 |
FastCounter | 50 | 374 | -1 | -1 | 3 | 3 |
FastAtomicCounter | 50 | 374 | 16 | 27 | 21 | 21 |
Compound Lazy Volatile | 50 | 374 | 4 | 5 | 7 | 10 |
Compound LazyAlloc | 50 | 374 | 1 | 5 | 4 | 6 |
And here are the times for gauges:
Gauge name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
Simple | 50 | 394 | 35 | 107 | 1884 | 5866 |
Compound Simple | 50 | 394 | 42 | 41 | 43 | 42 |
Compound Lazy | 50 | 394 | 7 | 11 | 11 | 14 |
FastGauge | 50 | 394 | 15 | 15 | 17 | 22 |
Compound Lazy Volatile | 50 | 394 | 6 | 7 | 11 | 19 |
Compound LazyAlloc | 50 | 394 | 14 | 18 | 18 | 18 |
Compound Lazy Buffer | 50 | 394 | 8 | 10 | 15 | 17 |
Conclusions
- The ill-effects of not using synchronisation aren’t just a theoretical possibility; they do in fact happen in real life, and are very prominent.
- Absence of synchronisation does not yet make programs fast.
- However, its presence may make them slow. While synchronized blocks are a convenient way to address shared data structures, they aren’t really suitable for high-performance computing.
- Atomic objects of Java perform much better than synchronized blocks, and should be preferred to those wherever possible.
- However, even they aren’t fast enough for high-performance operation. High operation speed can only be achieved by using thread-local objects, not shared ones.
- Updating a variable from multiple threads is slow and should be avoided if ultimate speed is needed.
- False sharing is really a problem. While real sharing is under our control, false sharing is not. It may happen in completely unexpected places, and it is difficult to get rid of reliably. It really slows down multi-threaded programs a lot.
- We’ve made very fast implementations of counters and gauges, which are not completely trivial and also not 100% legal. But they work.
- Counters allow a completely legal implementation based on atomic values. It is only a little bit slower than the illegal one. Gauges, however, do not allow such a solution.
- We expected the biggest overhead when running with a delay factor of zero; in reality, the biggest overhead happened at the highest delay factor. Why this happens requires additional investigation, but the effect is good: the overheads increase exactly when we can afford them.
- The current programming fashion is immutable objects. We, however, achieved a substantial performance gain by using mutable ones. This shows that no rule is absolute.
- When programming in Java, one still has to worry about cache lines and LOCK prefixes. This is sad, because Java is positioned as an application language, not a system programming one. Ideally, all such issues should be taken care of by the language implementation.
- The last point is especially bad, because it undermines the platform-independent status of Java. Who knows what the cache structure is on other processors. Tomorrow Intel may increase the cache line size to 128 bytes, and we’ll have to update the solution.
- If someone knows a suitable solution for gauges using atomics, please let me know.
Coming soon
I want to try all these approaches in C++. Some might still be applicable there.
I also want to investigate some other structures typically used in high-performance computing, such as queues. Our experience with synchronised objects so far was not very encouraging – what if the classes we usually use are not optimal? What is the performance we can achieve?
Update: LongAccumulator
and LongAdder
This article is based on a relatively old project. Since that time some new functionality has been added to Java. During discussion
on reddit, kevinherron
suggested two implementations based on the new classes, which were introduced in Java 1.8 – LongAccumulator
and LongAdder
, both from java.util.concurrent.atomic
package.
Here are the versions he suggested:
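The exact code from that discussion isn’t reproduced here; sketches of what the two counters could look like:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

// Counter based on LongAdder: add() is distributed over internal cells.
public class LongAdderCounter extends Counter
{
    private final LongAdder adder = new LongAdder ();

    @Override
    public void add (long increment) { adder.add (increment); }

    @Override
    public long getAndReset () { return adder.sumThenReset (); }
}

// Counter based on LongAccumulator, configured to add its inputs.
class LongAccumulatorCounter extends Counter
{
    private final LongAccumulator accumulator = new LongAccumulator (Long::sum, 0);

    @Override
    public void add (long increment) { accumulator.accumulate (increment); }

    @Override
    public long getAndReset () { return accumulator.getThenReset (); }
}
```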
Both of the new utility classes offer distributed, cache-aware accumulation over multiple cells chosen per thread; the difference is that the Accumulator is capable of performing arbitrary operations on the incoming numbers, while the Adder can only add. Both are suitable for counters, and both promise to be fast. Let’s try them. Here are the times:
Counter name | Delay factor | Base time | Time, ns, 1 thread | 2 threads | 6 threads | 12 threads
---|---|---|---|---|---|---
LongAccumulatorCounter | 0 | 1 | 13 | 13 | 13 | 13 |
LongAccumulatorCounter | 10 | 47 | 5 | 6 | 6 | 7 |
LongAccumulatorCounter | 50 | 347 | 12 | 13 | 17 | 20 |
LongAccumulatorCounter | 100 | 887 | 8 | 7 | 16 | 21 |
LongAdderCounter | 0 | 1 | 14 | 14 | 14 | 14 |
LongAdderCounter | 10 | 47 | 3 | 4 | 5 | 5 |
LongAdderCounter | 50 | 387 | 12 | 14 | 17 | 20 |
LongAdderCounter | 100 | 887 | 8 | 8 | 16 | 21 |
The times are reasonable and they don’t deteriorate much with the thread count – exactly what we need. It is also worth mentioning that the Accumulator
isn’t slower than
the Adder
– JIT has done a good job here.
Unfortunately, the results are not correct. Here are some sample numbers for the Adder
on
12 threads and delay factor 0:
Counter sum: 8019220551
Correct sum: 8019220615
And here are the results for the Accumulator
:
Counter sum: 8478867056
Correct sum: 8478867115
Even on one thread and delay of 100 we see errors – for example, on Accumulator
:
Counter sum: 11320241
Correct sum: 11320242
This is to be expected, since neither the Adder
, nor Accumulator
offer correct real-time retrieval. They offer correct multi-threaded counting, but the results
must be collected after the threads finish, otherwise some counts may be lost. This is from the Adder
Javadoc:
If there are updates concurrent with this method, the returned value is not guaranteed to be the final value occurring before the reset.
This, however, is very easy to fix. If some updates happen right at the collection time, it is normal for them to be added to either of the two intervals, the one just finished or the one just started. After all, thread scheduling could have caused the collection to happen a bit later or earlier. As long as the counts are not lost, it’s all right. All we need to do is to run the adder (or accumulator) continuously and never reset it. The counter can compute and report the delta to the previous value:
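A sketch of the adder-based version (the accumulator-based one is analogous, using get() instead of sum()):

```java
import java.util.concurrent.atomic.LongAdder;

// The adder is never reset; getAndReset() reports the delta to the sum seen at
// the previous collection.
public class LongAdderCounter extends Counter
{
    private final LongAdder adder = new LongAdder ();
    private long start = 0;          // the sum reported at the previous collection

    @Override
    public void add (long increment) { adder.add (increment); }

    @Override
    public long getAndReset ()
    {
        long sum = adder.sum ();     // no reset, so no updates can be lost
        long result = sum - start;
        start = sum;
        return result;
    }
}
```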
Note that we don’t need to protect the start
variable or make it volatile
: it is only ever accessed from the server thread.
The results are now correct.
We could make continuous accumulation a regular feature of the counter and move the calculation of the delta to the server side. This would eliminate competition between the server and client threads. It would, however, complicate our compound solutions.
This means that my statement above (“When programming in Java, one still has to worry about cache lines and LOCK prefixes”) is not completely correct. In Java 1.8 there are built-in classes that take care of these complexities, and we can go quite far using these classes. Two points, however, keep the statement alive:
- Our lazy compound implementation is still significantly faster (3 ns for FastCounter on 12 threads, delay 50, vs 20 ns for these two)
- These two solutions are only applicable to counters, not for gauges.
Comments are welcome below or on reddit.