Today’s topic may seem trivial. Everybody knows what a
BufferedInputStream is and why it is useful. Reading one
byte at a time from a file or other stream can be incredibly slow, and buffering is the solution. However, few
people know exactly how slow the “incredibly slow” is, so the effect is worth measuring. Besides, today’s story
originates from real life, which means that even some production code can sometimes be improved by adding buffering.
Let’s consider a simple network protocol. A client connects to a server using TCP. The server sends back a sequence of messages. Each message consists of a type (4 bytes), a length (a 4-byte big-endian integer) and a body (a sequence of bytes of the specified length).
Here is the Java client code (the full code for both client and server is in the repository):
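A minimal sketch of what such an unbuffered reader might look like (class and helper names are mine, not necessarily the actual repository code; it assumes blocking stream I/O):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the unbuffered message reader.
class NaiveReader {
    // Reads the big-endian length one byte at a time: four read() calls
    static int readInt(InputStream in) throws IOException {
        int r = 0;
        for (int i = 0; i < 4; i++) {
            int b = in.read();              // one call into the stream per byte
            if (b < 0) throw new EOFException();
            r = (r << 8) | b;
        }
        return r;
    }

    // Fills the whole array, looping in case of short reads
    static void readFully(InputStream in, byte[] dst) throws IOException {
        int off = 0;
        while (off < dst.length) {
            int n = in.read(dst, off, dst.length - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
    }

    // One message: type (4 bytes), length (4-byte big-endian int), body
    static byte[] readMessage(InputStream in, byte[] type) throws IOException {
        readFully(in, type);                // one read() for the type
        int len = readInt(in);              // four read() calls for the length
        byte[] body = new byte[len];
        readFully(in, body);                // one read() for the body
        return body;
    }
}
```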
The test server sends an endless stream of messages of size 100, as fast as possible.
The client receives messages and silently discards them.
The server and the client will run on two Linux servers connected to a 1 Gbit Ethernet network.
When we run it, we get the following (I picked four arbitrary lines from the output; in the future I’ll quote just one line, the most representative):
Time for 1000000 msg: 3630; speed: 275482 msg/s; 29.7 MB/s
Time for 1000000 msg: 3587; speed: 278784 msg/s; 30.1 MB/s
Time for 1000000 msg: 3728; speed: 268240 msg/s; 29.0 MB/s
Time for 1000000 msg: 3458; speed: 289184 msg/s; 31.2 MB/s
This looks very slow. The program isn’t doing anything with the data, so it should be able to receive much more than this. After all, we are spending about 80 CPU cycles on each incoming byte. We should be able to do better than that.
One trivial line of code changes things dramatically:
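The change in question is wrapping the socket’s stream in a BufferedInputStream, roughly like this (a sketch, not the article’s exact code; in the real client the wrapped stream is socket.getInputStream()):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

class BufferedWrap {
    // The one trivial line: wrap the raw stream so that single-byte read()
    // calls are served from an in-memory buffer instead of issuing a
    // system call each time.
    static InputStream wrap(InputStream raw) {
        return new BufferedInputStream(raw);
    }
}
```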
(full source is here)
Now the results are:
Time for 1000000 msg: 931; speed: 1074113 msg/s; 116.0 MB/s
This is almost four times faster, which is easy to explain. Previously, we called
read() six times for each message
(assuming that the message arrived at once): once for the type, four times for the length and once for the body.
Each call to read() on an unbuffered stream causes a system call, which involves an expensive switch into the kernel and back. If we can avoid a system call, we should do so.
Of course, those four times for the length could be reduced to one by reading all four bytes at once:
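A sketch of such a modified readInt() (the article’s actual code may differ in details):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class BulkReadInt {
    // Fetches the four length bytes with bulk read() calls instead of
    // four single-byte ones.
    static int readInt(InputStream in) throws IOException {
        byte[] b = new byte[4];
        int off = 0;
        while (off < 4) {                   // normally a single call suffices
            int n = in.read(b, off, 4 - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
        // assemble the big-endian value
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
    }
}
```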
(the code is here).
Time for 1000000 msg: 1909; speed: 523834 msg/s; 56.6 MB/s
This change made reading almost twice as fast, but the
BufferedInputStream gave us more. It reads everything that’s
available in the socket buffer (not exceeding the length of its own buffer, which is
8192 bytes by default). In the
worst case this means one read operation per message (assuming that the server also uses some kind of buffering and writes
the entire message at once). But the socket buffer may contain more than one message, and, importantly, this is likely
to happen if the client is too slow to read messages; by reading everything from the socket buffer at once, the client recovers.
In other words, there is a negative feedback scheme here, which is always good, as it makes things stable.
Can we read messages faster than this by, for instance, employing both tricks together (
BufferedInputStream plus reading all four length bytes at once)? It is possible, but we can’t test it in this environment. The fastest result achieved so far (
116 MB/s) is
very close to the throughput limit of the
1 Gbit Ethernet. We’ll have to look for a faster network. I was lucky to find
some hosts with
10 Gbit Ethernet interfaces around.
These are the results I got on the 10 Gbit network (note the increased iteration count):
The original test (no buffering, four reads per integer):
Time for 10000000 msg: 31497; speed: 317490 msg/s; 34.3 MB/s
The readInt() modification (no buffering, one read per integer):
Time for 10000000 msg: 16953; speed: 589866 msg/s; 63.7 MB/s
The buffered version (BufferedInputStream, four reads per integer):
Time for 10000000 msg: 2712; speed: 3687315 msg/s; 398.2 MB/s
Both modifications combined (BufferedInputStream, one read per integer):
Time for 10000000 msg: 1885; speed: 5305039 msg/s; 572.9 MB/s
We can see that the performance gain we measured on the 1 Gbit interface (4 times) was misleading: the real speedup from buffering is more than eleven times.
It’s worth noticing that combining both methods produced a combined effect, which is, however, smaller than the
sum of the individual effects. This is normal: while in the unbuffered version the
readInt() modification eliminated very expensive calls
into the kernel, in the buffered version it only eliminates calls to the
BufferedInputStream, which are cheap. Still,
removing these four calls improved speed, even despite one extra memory allocation.
Removing memory allocation
Allocating a fresh object each time it is needed is a common practice in Java. It is tempting to remove
these allocations and instead re-use objects allocated once. Memory allocation is cheap in Java, but more
memory means more frequent garbage collection, and that can be expensive. It is easy to re-use the four-byte
array in readInt(). To remove the other memory allocations, we must modify the interface of the message consumer.
It will now receive a byte array and a length:
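A sketch of what the modified consumer interface might look like (names are illustrative, not the article’s actual code):

```java
// The arrays belong to the reader and are re-used between messages, so the
// consumer must process them before returning and must not store references.
interface MessageConsumer {
    // only the first `length` bytes of `body` belong to the current message
    void consume(byte[] type, byte[] body, int length);
}
```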
Note that this isn’t just a change in method signature: it is also a change in its contract. Previously the method was free to do whatever it wanted with both byte arrays. For instance, it could put them into some container, such as a queue. Now it is prohibited – the method must consume the arrays right there and return. Unfortunately, such restrictions can’t be expressed by Java syntax. They can only be written down in the JavaDoc for human readers, and this is unreliable – who reads JavaDoc, anyway?
In short, such an optimisation isn’t recommended in the general case. However, there are cases when it is appropriate (provided that it actually makes things faster). Often the message consumer is under the direct control of the developer of the socket reader, and all it does with the message is further decoding. We’ll assume this case in the following discussion.
The modified code is in the repository.
It allocates arrays for types and messages, as well as the temporary array for
readInt(). This is what the latter looks like:
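A sketch of the idea (field and class names are mine): the temporary array becomes a field, allocated once and re-used on every call.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReusingReader {
    private final byte[] intBuf = new byte[4];   // allocated once, re-used

    // Same bulk read as before, but into a field instead of a fresh array
    int readInt(InputStream in) throws IOException {
        int off = 0;
        while (off < 4) {
            int n = in.read(intBuf, off, 4 - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
        return ((intBuf[0] & 0xFF) << 24) | ((intBuf[1] & 0xFF) << 16)
             | ((intBuf[2] & 0xFF) << 8)  |  (intBuf[3] & 0xFF);
    }
}
```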
And this is the speed:
Time for 10000000 msg: 2130; speed: 4694835 msg/s; 507.0 MB/s
This is a failure – the code runs slower. This seems counter-intuitive and requires explanation.
I don’t have a definite answer, only a theory. The problem may be in array index (and possibly
null) checks.
The old code allocated an array of size
4 and immediately accessed its elements at constant indices. The compiler
knew that the array wasn’t
null and that all the indices were within range. When the array reference is read from a static field,
the compiler doesn’t know that; it has to generate
null and index check code. The indirect argument in favour
of this theory is that if we replace the entire
return statement in readInt() with one that doesn’t access the array,
the speed immediately recovers.
In short, memory allocation isn’t always bad, and removing it does not always improve speed. Strange, but true.
Using channels and buffers
We have already improved performance by
16 times and could stop here. This, however, would go against the
general policy of these articles, which is to optimise everything to the extreme. It seems that nothing more
can be done within the
InputStream framework, so we’ll have to try the
nio approach, using socket channels and byte buffers.
The primary reason we expect some speedup there is that the operating system supports direct buffers,
so reading data from socket channels into direct byte buffers requires much less data copying than when
working with streams and byte arrays. Some sophisticated implementations can even arrange a
DMA transfer straight from the network card
directly into the
ByteBuffer, although we don’t expect this in our case.
There is also another reason: we can read
int and other primitive values directly from a
ByteBuffer without any need for bit shifting.
getInt(), getLong() and other methods of
DirectByteBuffer are compiled straight into single word-moving Intel instructions;
sun.misc.Unsafe magic takes care of that. Direct byte buffers are highly recommended for everyone
who decodes messages containing
int, long and other primitive values in their natural form.
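For illustration, a small (hypothetical) round trip through a direct buffer; no shifting appears in user code, and the buffer is big-endian by default, matching our protocol’s length field:

```java
import java.nio.ByteBuffer;

class PrimitiveRead {
    // Writes and then reads primitives through a direct buffer
    static long[] roundTrip() {
        ByteBuffer b = ByteBuffer.allocateDirect(12);
        b.putInt(100);                 // 4 bytes
        b.putLong(1234567890123L);     // 8 bytes
        b.flip();                      // switch from writing to reading
        return new long[] { b.getInt(), b.getLong() };
    }
}
```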
Unfortunately, there isn’t any analog of a
BufferedInputStream for byte channels; we’ll have to create one.
This is our first buffer-based version:
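A sketch of such a reader, matching the description that follows (class name and details are mine, assuming messages never exceed the buffer capacity):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// The buffer is normally in "read from buffer" mode: the data between
// position and limit has been received but not yet consumed.
class ChannelReader {
    private final ReadableByteChannel chan;
    private final ByteBuffer buf;

    ChannelReader(ReadableByteChannel chan, int size) {
        this.chan = chan;
        buf = ByteBuffer.allocateDirect(size);
        buf.limit(0);                            // no data available yet
    }

    // Makes sure at least n bytes (n <= capacity) are available between
    // position and limit, reading from the channel if needed.
    void ensure(int n) throws IOException {
        if (buf.remaining() >= n) return;
        int start;
        if (buf.capacity() - buf.position() < n) {
            buf.compact();                       // reached the end of the buffer:
            start = 0;                           // move the leftover to the start
        } else {
            start = buf.position();
            buf.position(buf.limit());           // "read from socket" mode:
            buf.limit(buf.capacity());           // append after the existing data
        }
        while (buf.position() - start < n) {     // fill until we have enough
            if (chan.read(buf) < 0) throw new EOFException();
        }
        buf.limit(buf.position());               // back to "read from buffer" mode
        buf.position(start);
    }

    int readInt() throws IOException { ensure(4); return buf.getInt(); }

    void read(byte[] dst, int len) throws IOException {
        ensure(len);
        buf.get(dst, 0, len);
    }
}
```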
The ensure() method looks like (and is) a spaghetti of buffer manipulation calls. It requires some explanation.
What we do is allocate a big buffer (half a megabyte in our case) and read some portion of data from the socket channel into it.
Then we consume that portion until what’s left there is shorter than the next value we need. In such a case we read from
the socket channel again, appending to the remaining data. This way the current position in the buffer keeps moving forward,
and finally it reaches the buffer’s end. Then we copy the remaining data to the beginning of the buffer (
buf.compact() takes care
of that) and start all over. The
position() and limit() manipulations toggle the buffer between “read from socket” mode (
position is where the data from the socket is written to;
limit is the end of the available space, which is the same as the
capacity) and “read from buffer” mode (the data between
position and limit has been read from the socket and is available for
the application to consume).
Note that we are still using the no-allocation scheme for the messages, but we are not suffering from the array index checks this time
(int values are read straight from the buffer).
This is the speed:
Time for 10000000 msg: 923; speed: 10834236 msg/s; 1170.1 MB/s
This is a significant improvement over the stream-based version (almost twice the speed), which suggests that channels and buffers were
added to Java for a reason. We are now transferring more than ten million messages per second and spending about
2.5 CPU cycles
per byte. What can be better than that?
Running on localhost adapter
Can we do even better? Looking at the numbers, we see that we’re back in a familiar situation: we’ve saturated the network adapter.
A 10 Gbit card can hardly transmit much more than
1170 MB/s, and I don’t have anything faster available.
The only faster option is a local (within the host) network. This is, by the way, still a valid use case –
it is not uncommon for a client and a server to run on the same host.
Running our tests in the local connection scenario, we get the following:
For original unbuffered code:
Time for 10000000 msg: 32803; speed: 304850 msg/s; 32.9 MB/s
For optimised version using streams:
Time for 10000000 msg: 2076; speed: 4816955 msg/s; 520.2 MB/s
(surprisingly a bit slower than on the Ethernet adapter).
For the new version based on channels and byte buffers:
Time for 10000000 msg: 706; speed: 14164305 msg/s; 1529.7 MB/s
So the real speedup from the use of buffers was higher than the value that we saw. Buffers work almost three times faster than streams, not two.
Simplifying the buffer operation
In the ensure() code above we kept appending the new data to the remaining piece of old data until we hit the end of the buffer.
There is a simpler approach: we copy the leftover to the beginning of the buffer and start there each time we are short of data. The new
ensure() looks like this:
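A sketch of the simpler scheme described above (class name and details are mine, with the same assumption that a message fits in the buffer):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Whenever data runs short, the leftover is copied to the start of the
// buffer and filling restarts from there.
class SimpleReader {
    private final ReadableByteChannel chan;
    private final ByteBuffer buf;

    SimpleReader(ReadableByteChannel chan, int size) {
        this.chan = chan;
        buf = ByteBuffer.allocateDirect(size);
        buf.limit(0);                            // no data available yet
    }

    // Makes sure at least n bytes (n <= capacity) are available
    void ensure(int n) throws IOException {
        if (buf.remaining() >= n) return;
        buf.compact();                           // leftover goes to the start;
        while (buf.position() < n) {             // buffer is now in fill mode
            if (chan.read(buf) < 0) throw new EOFException();
        }
        buf.flip();                              // back to read mode, position 0
    }

    int readInt() throws IOException { ensure(4); return buf.getInt(); }

    void read(byte[] dst, int len) throws IOException {
        ensure(len);
        buf.get(dst, 0, len);
    }
}
```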
I first wrote this version for simplicity – the code looks easier to read than the original one. I expected lower performance due to more copying of the leftovers. To my surprise, I got the following performance data:
Time for 10000000 msg: 625; speed: 16000000 msg/s; 1728.0 MB/s
The code is actually faster than our previous code! And I have an explanation for that, too (any true scientist must have some explanation for whatever result he gets). The latest solution is more memory-compact and, therefore, cache-efficient than the previous one. Instead of moving our working area forward in the buffer, we are re-using the same space over and over again, hitting lower (and faster) levels of cache.
This is a really remarkable and rare case, when the simpler and more readable code is also faster than the more complex code. I wish it were always so in computer programming.
Byte buffers vs Byte arrays
In our latest examples the interface to the message consumer is providing the message in a byte array. The message reader copies bytes from the byte buffer,
just to pass the array to the consumer. This is a reasonable approach if we expect that the message consumer stores the array somewhere. But in our case it
is not allowed to do so, since we give it the same array again and again. Why not give it a message in a byte buffer instead? We can do it without copying
memory, since the same memory can be pointed to by more than one byte buffer (this is what
ByteBuffer.duplicate() method is for). Providing messages
in byte buffers is ideal for clients that don’t just accept byte blob as a message, but parse it further, extracting fields. Byte buffers are well-suited
for such operations. This is what the new inner loop of our test looks like:
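The core of the idea can be sketched like this (names are mine, not the article’s actual loop): the consumer is handed a window into the reader’s buffer instead of a copied array.

```java
import java.nio.ByteBuffer;

class MessageView {
    // Returns a window [position, position + len) over the same memory as
    // buf, without copying (duplicate() shares the content but has its own
    // position and limit); the source buffer is then advanced past the message.
    static ByteBuffer messageView(ByteBuffer buf, int len) {
        ByteBuffer view = buf.duplicate();
        view.limit(view.position() + len);
        buf.position(buf.position() + len);
        return view;
    }
}
```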
(the full code is here).
This is the time we get when we run it on localhost:
Time for 10000000 msg: 503; speed: 19880715 msg/s; 2147.1 MB/s
We process more than two gigabytes and almost twenty million messages per second. This is remarkable, considering where we started.
This performance was achieved when the server was sending data continuously in big blocks. In real life it may not be the case. Then the benefit from buffered operation won’t be so high.
If processing of each message is time-consuming, at some point it becomes irrelevant how fast the message receiving loop works. It must just be fast enough.
The solution using streams and a
BufferedInputStream can be just right for the purpose (although I still prefer byte buffers, as they are easier to work with).
We managed to speed things up by 70 times, which is quite remarkable.
The biggest enemy of performance in Linux is a system call. Anything that can eliminate some of them helps improve speed.
BufferedInputStream can help a lot here.
Java channels and byte buffers can help a lot as well.
Byte buffers are also very efficient when you need to extract primitive data values from a byte stream.
Another way byte buffers can help you is by eliminating unnecessary memory copying. However, improvements like this don’t help until the extra system calls are removed.
Eliminating allocation of new objects may be a good idea in general, but often the costs are higher than the benefits. Besides, the costs can be incurred in the least expected places, so no statement about the benefits of allocating or not allocating memory should be made without testing.
The byte buffers and the
BufferedInputStream class are Java-specific. However, the issue of avoiding system calls is not. Whatever language you are writing in, proper arrangement of your I/O (which includes buffering of network streams) is essential.
While optimising the network client, we came across the issue of dynamic memory allocation. It seems that there is no definite answer as to whether memory allocation is good or bad, so the issue requires deeper investigation.