Today we’ll look at one of the most basic and most often used methods from the Java library: System.currentTimeMillis().
This method reports the current time with millisecond accuracy. One may think that, because of this, the performance of the method is irrelevant: who cares if obtaining the current time takes 0.1 or 0.2 ms when the measured interval is 100 or 500 milliseconds long? There are, however, cases when we might still want to invoke this method frequently. Here are some examples:

- Detecting and reporting abnormally long execution times. For instance, we can measure the time it takes to execute an HTTP request. In most cases (we hope) it takes below one millisecond, which this method will report as zero, but we want to be alarmed if the measured time is abnormally long (e.g. exceeds 100 ms). In this case we’ll time every request, and there may be hundreds of thousands, or even millions, of them per second.
- Obtaining timestamps to be associated with some objects, for instance, with cached data – to arrange time-based eviction from a cache.
- Measuring the duration of long, but frequently initiated, asynchronous processes, such as requests to remote servers.
- Time-stamping real-world events. For instance, a trading system may wish to record the timestamps of incoming orders and executed deals.
In short, despite the rather low accuracy of this method, there are cases when it can be called very often, so a very valid question arises: what is the performance of this method?
The way to test the performance of currentTimeMillis() is straightforward: we call it many times, while making sure that the calls aren’t optimised out altogether (for instance, by summing the returned values):
Running it on Windows (my notebook, running Windows 10 on a Core i7-6820HQ @ 2.7 GHz), using Java 1.8.0_92, we get:
```
>java Time
Sum = 2276735854871496107; time = 393; or 3.93 ns / iter
Sum = 2276735892808191170; time = 382; or 3.82 ns / iter
Sum = 2276735930338889327; time = 379; or 3.79 ns / iter
```
This is a very good result. When the time is reported in 3.8 ns, we can put time requests just about anywhere: we can measure the duration of any operation and attach a timestamp to any resource. It is virtually free. We can query the time 260 million times per second.
Let’s run it on Linux. The machine I use for testing is of the RHEL flavour, with kernel version 3.17.4, and it runs on two Xeon® E5-2620 v3 CPUs @ 2.40 GHz.
```
# java Time
Sum = 1499457980079543330; time = 652; or 652.0 ns / iter
Sum = 1499457980725968363; time = 642; or 642.0 ns / iter
Sum = 1499457981368550493; time = 643; or 643.0 ns / iter
```
We had to reduce the iteration count by a couple of zeroes (to one million), because the test runs two hundred times longer. The average time to query the time on Linux is about 640 ns, which is more than half a microsecond. We can only execute this call 1.5 million times per second.
This is really shocking, and it means that we must be careful with our use of currentTimeMillis() on Linux. While still applicable to measuring the duration of long sequential operations, this method can’t really be used for the tasks listed above.
Why is this and what can be done?
currentTimeMillis() is a native function. Its code can be found in the OpenJDK distribution, where it is linked to JVM_CurrentTimeMillis (hotspot/src/share/vm/prims/jvm.cpp), which eventually ends up in os::javaTimeMillis(). This call is OS-dependent.
The Windows version of this code lives in hotspot/src/os/windows/vm/os_windows.cpp.
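Roughly, it amounts to the following (a sketch of the idea, not the verbatim HotSpot source):

```c
#include <windows.h>

long long java_time_millis(void)
{
    /* FILETIME value of 1970-01-01 00:00:00 UTC (the Java epoch) */
    const long long EPOCH_OFFSET = 116444736000000000LL;

    FILETIME wt;
    GetSystemTimeAsFileTime(&wt);
    long long t = ((long long) wt.dwHighDateTime << 32) | wt.dwLowDateTime;
    return (t - EPOCH_OFFSET) / 10000;     /* 100-ns units -> milliseconds */
}
```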
Java time is based on GetSystemTimeAsFileTime(), which returns a FILETIME. This structure dates back to 32-bit systems, and consists of two 32-bit fields, dwLowDateTime and dwHighDateTime, which, combined together, define a 64-bit number of 100-nanosecond intervals since the epoch.
In MSVC, we can call this function and trace its execution in a debugger; the disassembly in 32-bit mode shows how it works.
This is indeed very clever. The function contains no system calls: everything happens in user space. There is a memory area that is mapped into the address space of every process. A background thread regularly updates three double-words there, which contain the high part of the value, the low part, and the high part again. The client reads all three, in this order, and, if the two high parts are equal, the low part is considered consistent with them (the strong memory access ordering of x86 guarantees this). If not, the procedure must be repeated. This is a very unlikely event, which is why the entire procedure is so fast.
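In C, the read procedure amounts to something like this (a sketch; the pointer names are made up, not actual Windows symbols):

```c
#include <stdint.h>

/* hi, lo and hi2 stand for the three shared double-words described above;
   they are not real Windows symbol names. */
uint64_t read_shared_time(const volatile uint32_t *hi,
                          const volatile uint32_t *lo,
                          const volatile uint32_t *hi2)
{
    for (;;) {
        uint32_t h1 = *hi;                 /* high part             */
        uint32_t l  = *lo;                 /* low part              */
        uint32_t h2 = *hi2;                /* high part, once more  */
        if (h1 == h2)                      /* consistent snapshot   */
            return ((uint64_t) h1 << 32) | l;
        /* the writer intervened; retry (a very rare event) */
    }
}
```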
The function looks even better in the 64-bit mode: there the background thread can write the value atomically, and the client can atomically read it. Actually, the code could have been even better – the 64-bit value could have been written in one instruction.
However, the code is much faster than any possible system call anyway.
What is the resolution of this timer? Let’s collect the values into an array and print the changes later.
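A sketch of such a test (not necessarily the exact program used here):

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    enum { N = 1000000 };
    static unsigned long long t[N];

    for (int i = 0; i < N; i++) {
        FILETIME ft;
        GetSystemTimeAsFileTime(&ft);
        t[i] = ((unsigned long long) ft.dwHighDateTime << 32) | ft.dwLowDateTime;
    }
    for (int i = 1; i < N; i++)
        if (t[i] != t[i - 1])              /* print the index and the step, in 100-ns units */
            printf("%d %llu\n", i, t[i] - t[i - 1]);
    return 0;
}
```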
The output is:
```
68208  5044
214315 5015
362575 5037
508633 4987
654960 5022
800065 5007
943784 5041
```
The timer ticks roughly once every half a millisecond (2000 Hz). This is perfectly adequate to serve as the basis of currentTimeMillis().
Linux

This happened to be a much longer journey than I expected. We start at another version of os::javaTimeMillis() – the one in hotspot/src/os/linux/vm/os_linux.cpp – which obtains the time from gettimeofday() and converts microseconds to milliseconds. It is very unlikely that the time is spent in the multiplication or division by 1000, so let’s measure the performance of gettimeofday() itself:
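A test along these lines will do (a sketch, not necessarily the exact program used):

```c
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    enum { N = 10000000 };
    struct timeval start, end, tv;
    long long sum = 0;

    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++) {
        gettimeofday(&tv, NULL);
        sum += tv.tv_usec;                 /* keep the calls from being optimised out */
    }
    gettimeofday(&end, NULL);

    double sec = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) * 1e-6;
    printf("Time for %d: %f s; %f ns (sum=%lld)\n", N, sec, sec * 1e9 / N, sum);
    return 0;
}
```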
Running it, we get:
```
Time for 10000000: 6.251108 s; 625.110800 ns
```
The call to
gettimeofday is indeed slow. Why? Is there perhaps some system call involved?
Let’s run it under gdb. After we have stepped into the call of gettimeofday, we see a short stub performing an indirect jmpq.
This jmpq in fact jumps to the next instruction (+6), and eventually we end up in a function called _dl_runtime_resolve – the dynamic loader’s lazy symbol resolver.
What is actually happening here is that we are linking to the
vDSO (virtual dynamic shared object), which is a small fully relocatable shared library
pre-mapped into the user address space. The linking happens during the first call of
gettimeofday, after which the call is resolved, and the first indirect jump
goes straight into the function. Let’s skip this execution and break before the next call to
gettimeofday. The function looks the same:
but this time the jump takes us to the real implementation inside the vDSO.
The vDSO is mapped to a new address each time (probably a security feature). However, the function is always placed at the same address when run under gdb (I’m not sure what causes this). The function accesses some memory at addresses not far from its code (such as 0x7ffff7ff90f0), always using the “relative-to-IP” addressing mode. This makes the function completely relocatable – it can be mapped to any address in the user space, together with all its data.
We can see that the function employs various options. It can make a
syscall, it can execute an
rdtsc instruction, or it can just read some memory (in a similar
fashion to the Windows implementation). Which of these options is applied in our case?
Let’s look at the source code of our version of the Linux kernel (3.17.4). Note that this is not the latest one (Linux distributors are very conservative people); the code of the latest one differs quite a bit (including some constant definitions). The gettimeofday implementation lives in the vDSO source, arch/x86/vdso/vclock_gettime.c, and revolves around a pseudo-variable called gtod (which, obviously, stands for “get time of day”).
What is going on here is very close to what happened in Windows. There is a data structure that accompanies the vDSO and is mapped into the address space of every process. It is called vvar_vsyscall_gtod_data and is addressed in the code via the pseudo-variable gtod; in the assembly listing above, this structure is what was being accessed at those fixed IP-relative addresses.
Some background thread updates the fields of this structure at regular intervals. Since there are more than two of these fields, the Windows trick of writing the
high part of the number twice doesn’t work. However, a similar trick works. The writer maintains a version number of the data in the structure (the seq field). It gets incremented by one when the writer starts updating the structure, and again by one after it finishes (with an appropriate write barrier instruction being used).
As a result, an odd value means that the data isn’t consistent. The reader must read the number, make sure it’s even (waiting a bit using the pause instruction if not), read all the values of interest from the structure, then read the version number again; if it is the same as in the beginning, the data is considered correct. This is what the gtod_read_begin and gtod_read_retry functions are for.
Read barrier instructions must be used to make sure that the processor doesn’t re-order reading the version number with reading the actual data. However, the strong memory ordering of Intel x86 makes this unnecessary, so the read barrier call (smp_rmb()) is empty in our case.
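In C, the reader’s side of this protocol amounts to something like the following sketch (not the kernel source verbatim):

```c
#include <stdint.h>

/* seq points at the version number field in the shared structure */
static inline unsigned gtod_read_begin(const volatile unsigned *seq)
{
    unsigned s;
    while ((s = *seq) & 1)                 /* odd value: the writer is busy */
        __asm__ __volatile__("pause");
    return s;                              /* on x86 no explicit read barrier is needed */
}

static inline int gtod_read_retry(const volatile unsigned *seq, unsigned start)
{
    return *seq != start;                  /* the data changed while we were reading it */
}

/* typical use:
       unsigned s;
       do {
           s    = gtod_read_begin(&gtod->seq);
           sec  = gtod->wall_time_sec;
           nsec = gtod->wall_time_snsec;
       } while (gtod_read_retry(&gtod->seq, s));
*/
```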
The values of interest are wall_time_sec and wall_time_snsec. The first one is the proper number of seconds since the epoch, as reported by the good old
time() call. In fact,
time() is implemented by reading exactly this value, without any locking or version control.
The second value is the nanosecond time, relative to the last second, shifted left by gtod->shift. Or, rather, it is the time measured in some units, which becomes the nanosecond time when shifted right by gtod->shift, which, in our case, is 25. This is very high precision: one nanosecond divided by 2^25 is about 30 attoseconds. Perhaps the Linux kernel designers looked far ahead, anticipating the future when we’ll need such accuracy for our time measurements.
What is the real resolution of these time values? Let’s read them in a loop and print them when they change. Unless we want to run the program in gdb, we must first find the address at which the gtod structure is mapped into our process (outside of gdb it changes from run to run). Then we can run a loop that detects changes in wall_time_snsec:
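A sketch of such an experiment (not the exact program used here; both the structure’s address and its field layout are assumptions – the address is whatever is observed for the current run, and the layout must be checked against the kernel’s arch/x86/include/asm/vgtod.h):

```c
#include <stdio.h>
#include <stdint.h>

/* Field layout roughly matching struct vsyscall_gtod_data in kernel 3.17 --
   an assumption, verify against your kernel's vgtod.h. */
struct gtod_data {
    unsigned seq;
    int      vclock_mode;
    uint64_t cycle_last;
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
    uint64_t wall_time_snsec;
    uint64_t wall_time_sec;
};

/* Placeholder: the address where the structure was observed (e.g. under gdb);
   substitute the real one. */
#define GTOD_ADDR 0x7ffff7ff9080UL

int main(void)
{
    volatile struct gtod_data *gtod = (volatile struct gtod_data *) GTOD_ADDR;
    uint64_t prev = gtod->wall_time_snsec;

    for (long i = 0; i < 10000000; i++) {
        uint64_t ns = gtod->wall_time_snsec;
        if (ns != prev) {
            printf("%ld: sec=%llu ns=%llu diff=%lld ns_shift=%llu diff_shift=%lld\n",
                   i,
                   (unsigned long long) gtod->wall_time_sec,
                   (unsigned long long) ns,
                   (long long) (ns - prev),
                   (unsigned long long) (ns >> gtod->shift),
                   (long long) (((int64_t) (ns - prev)) >> gtod->shift));
            prev = ns;
        }
    }
    return 0;
}
```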
A typical fragment of the output:
```
3721507: sec=1499630953 ns=16894552383987602 diff=33554102663246 ns_shift=503496896 diff_shift=999990
4367859: sec=1499630953 ns=16928106486650848 diff=33554102663246 ns_shift=504496886 diff_shift=999990
5014216: sec=1499630953 ns=16961660589326449 diff=33554102675601 ns_shift=505496877 diff_shift=999991
```
We see that the wall_time_snsec value is updated rather infrequently: for about 650,000 iterations it stays the same and then jumps by a big value, which, being shifted right by shift, becomes 999990 ns, or almost exactly one millisecond. It’s also interesting to see what happens when the seconds value changes:
```
249316536: sec=1499631433 ns=33542778622228903 diff=33554096688437 ns_shift=999652702 diff_shift=999990
250511726: sec=1499631434 ns=21900718852443 diff=-33520877903376460 ns_shift=652692 diff_shift=-999000010
```
The new nanosecond value doesn’t start at zero; it starts at 652692, which is over half a millisecond. It’s not always that big – I saw values of 200K and 300K.
In short, two variables available in the
gtod structure provide very fast access to a rather coarse time value with a one millisecond resolution. Just like in Windows.
gettimeofday() does not stop here, however: it tries to improve accuracy using other methods. In the code above, this is what the vgetsns() function is responsible for.
The idea is that somewhere in the system there is a high-frequency timer, which we can ask for a current tick count. We record the reading of that timer at the
last tick of the coarse timer (gtod->cycle_last), and get the difference. The gtod->mask and gtod->mult fields help convert the reading of that timer to our
30-attosecond units. This can explain the choice of those units: it’s not that Linux designers wanted to measure times of molecular processes; the unit is chosen very small
to reduce the errors during this conversion.
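Putting it together, the computation is roughly the following (a simplified sketch of the do_realtime path, reusing the structure sketched earlier; read_hw_counter is a placeholder for whichever mechanism is in use):

```c
#include <stdint.h>

/* In the real code this runs inside the seq-number retry loop shown earlier. */
uint64_t realtime_ns(const volatile struct gtod_data *gtod,
                     uint64_t (*read_hw_counter)(void))
{
    uint64_t cycles = read_hw_counter();
    uint64_t delta  = (cycles - gtod->cycle_last) & gtod->mask;      /* vgetsns() */
    uint64_t snsec  = gtod->wall_time_snsec + delta * gtod->mult;    /* 2^-shift ns units */
    uint64_t nsec   = snsec >> gtod->shift;                          /* back to nanoseconds */
    uint64_t sec    = gtod->wall_time_sec;

    while (nsec >= 1000000000ULL) { nsec -= 1000000000ULL; sec++; }
    return sec * 1000000000ULL + nsec;
}
```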
The code above provides for three types of high-frequency timers:
TSC (based on the
RDTSC instruction of Intel x86 instruction set),
HPET (based on the external
hardware, the HPET chip), and
PVCLOCK, which, most probably, has something to do with running in the VM environment.
In any case, the code only makes use of the high-precision timer to get the offset from the coarse time. The timer doesn’t have to stay stable for years; it is only supposed to be stable enough to provide the time offset between the coarse timer ticks. Moreover, its clock frequency may vary, as long as that variance is known to the system and reflected in appropriate updates of the mult and shift fields. The system is very well designed to provide very accurate time very fast.
So what went wrong?
In our case, the
vclock_mode field is set to 2, which means using the HPET (High Precision Event Timer). This timer is a hardware piece included into all modern PCs: a replacement of
the good old 8253/8254 chip. That chip ran a counter, which, upon reaching zero, could trigger an interrupt, and could also be read directly using
IN instructions. It ran at the
frequency of 1.19318 MHz, and, if programmed to expire every 65536 clocks, caused an interrupt every 55 ms (18.2 Hz), which we all remember since the MS DOS days.
The HPET runs at a higher frequency, counts up rather than down, and is accessed via memory-mapped I/O rather than I/O ports.
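The kernel routine that reads the counter boils down to something like this (a sketch; the HPET main counter sits at offset 0xF0 of the memory-mapped register page):

```c
#include <stdint.h>

#define HPET_COUNTER 0xF0   /* offset of the main counter register in the HPET page */

static inline uint32_t read_hpet(const volatile uint8_t *hpet_page)
{
    return *(const volatile uint32_t *) (hpet_page + HPET_COUNTER);
}
```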
This compiles into just one 32-bit memory read instruction. gettimeofday in the case of the HPET time source is thus compiled into a sequence of memory reads plus some shifts and multiplications; no divisions or other expensive operations are present. Why is it then executing so slowly?
I didn’t have a sampling profiler available on the test machine, but there is a poor man’s solution: to attach
gdb to a running program that executes
gettimeofday in a loop, and then interrupt it.
In all the cases that I tried, it always stopped at exactly one place: right after the above-mentioned memory read (at gettimeofday+550). This indicates that this instruction alone takes most of the execution time. This is easy to verify: let’s resolve the address of hpet_page in a similar way to resolving vsyscall_gtod_data and read this location in a loop:
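A sketch of such a loop (the HPET page address here is a placeholder that must be replaced with the one actually observed):

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

#define HPET_COUNTER 0xF0
/* Placeholder: the address at which the HPET register page is mapped,
   as observed under gdb; substitute the real value. */
#define HPET_PAGE    0x7ffff7ff8000UL

int main(void)
{
    const volatile uint32_t *counter =
        (const volatile uint32_t *) (HPET_PAGE + HPET_COUNTER);
    enum { N = 1000000 };
    struct timeval t0, t1;
    uint64_t sum = 0;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < N; i++)
        sum += *counter;                   /* one memory-mapped read per iteration */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("Time for %d: %f s; ns/iter=%f (sum=%llu)\n",
           N, sec, sec * 1e9 / N, (unsigned long long) sum);
    return 0;
}
```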
```
Time for 1000000: 0.564707 s; ns/iter=564.707000
```
It takes more than half a microsecond to read the data from a HPET timer, which makes it a very slow time source. If we print the values we read from this timer,
we’ll see that, unlike in the case of
gtod->wall_time_snsec, the values change each time we access the timer. The typical difference is 8, which means that the timer
ticks 8 times while being read. The timer’s accuracy is way higher than its performance. The difference between timer values measured at the points when the
coarse nanoseconds change (which, as we learned, happens every millisecond), is 14320, which means that the HPET frequency is 14.32 MHz (70 ns per tick).
This is really sad: the timer runs at high frequency, but there is no way to read it at this frequency. We can read it at about 1.8 MHz, which is just a bit higher than the frequency of the MS DOS timer.
And the worst part is that in our use case we are in fact uninterested in such a high accuracy anyway – all we need is
currentTimeMillis(), and the coarse
timer is perfectly suitable for that.
The coarse timer
There is an easy way out, because the coarse timer is available via the user API: the clock_gettime() call. Among the multiple values defined for its clk_id parameter there is one that does exactly what we need: CLOCK_REALTIME_COARSE (value 5).
The code for this function is present in the same vDSO (the do_realtime_coarse function).
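We can time it the same way we timed gettimeofday() above (a sketch):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { N = 100000000 };
    struct timespec start, end, ts;
    long long sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);
    for (int i = 0; i < N; i++) {
        clock_gettime(CLOCK_REALTIME_COARSE, &ts);
        sum += ts.tv_nsec;                 /* keep the calls from being optimised out */
    }
    clock_gettime(CLOCK_REALTIME, &end);

    double sec = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    printf("Time for %d: %f s; ns/iter=%f (sum=%lld)\n", N, sec, sec * 1e9 / N, sum);
    return 0;
}
```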
The nanosecond times reported by this function are 999991 ns (1 ms) apart, and the execution takes 8.4 ns.
The TSC time source
We’ve learned that the slow execution of currentTimeMillis() was caused by two factors:

- the JVM using gettimeofday() rather than a coarse clock;
- gettimeofday() being very slow if the HPET time source is used.
HPET, however, isn’t the only time source these days; perhaps not even the most common one: many systems use the TSC. This could make the entire study irrelevant. However, in our project the servers are configured with the HPET time source for a reason: this time source integrated well with the NTP client, allowing for smooth time adjustment, while the TSC was not as stable (I don’t know the details; this is what the local Linux gurus say, and I have no option but to trust them). Some other developers may find themselves in the same situation. Besides, a Java developer can’t always know what time source will be configured on the machine the program will run on. That’s why I feel that the findings made so far are still important.
However, it is still interesting to find out how the results would change if we used the TSC time source. TSC stands for the time stamp counter, which is simply the number of CPU cycles counted since startup (it is only 64 bits wide, so it will wrap around in 243 years at a 2.4 GHz clock rate). This value can be read using the rdtsc instruction. Traditionally, there were two problems with this value:
- the values from different cores or physical processors may be shifted against each other, as the processors may start at different times
- the clock frequency of a processor may change during execution.
The first one seems to be indeed a problem. I tried getting the
rdtsc value from several cores at once, synchronised on writing into some memory location.
Even in the best case I got differences of a couple of thousand cycles. Sometimes more. However, this is only a problem if the programmer wants to use the TSC
manually; in this case the thread affinity must be set accordingly. The OS knows when it re-schedules threads from one core to another, so it can make all necessary adjustments.
The second problem seems to be a thing of the past. The Intel doc says this:
Processor families increment the time-stamp counter differently:
For Pentium M processors (family [06H], models [09H, 0DH]); for Pentium 4 processors, Intel Xeon processors (family [0FH], models [00H, 01H, or 02H]); and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to busclock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H and higher]); for Intel Core Solo and Intel Core Duo processors (family [06H], model [0EH]); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (family [06H], model [0FH]); for Intel Core 2 and Intel Xeon processors (family [06H], DisplayModel [17H]); for Intel Atom processors (family [06H], DisplayModel [1CH]): the time-stamp counter increments at a constant rate.
It also says:
Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX.
This we can test:
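For instance, using GCC’s cpuid.h helper (a minimal sketch):

```c
#include <stdio.h>
#include <cpuid.h>

/* Invariant TSC: CPUID leaf 0x80000007, bit 8 of EDX */
int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 0x80000007 is not supported\n");
        return 1;
    }
    printf("Invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}
```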
On our test machine it reports that the bit is set.
Additional testing confirms this:
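The idea of the test: measure how many TSC ticks elapse per nanosecond of wall-clock time, once with the core mostly idle and once with it spinning. A sketch (not the original program; the use of __rdtsc() and the 10 ms run length are choices made here):

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>                    /* __rdtsc() */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t t0 = now_ns();
    uint64_t c0 = __rdtsc();
    volatile uint64_t sink = 0;

    do {
        usleep(100);                      /* the delay: comment out for the busy run */
        for (int i = 0; i < 1000; i++)
            sink += i;
    } while (now_ns() - t0 < 10000000);   /* run for roughly 10 ms */

    uint64_t t1 = now_ns();
    uint64_t c1 = __rdtsc();
    printf("time: %llu; clocks/nano: %f\n",
           (unsigned long long) (t1 - t0),
           (double) (c1 - c0) / (double) (t1 - t0));
    return 0;
}
```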
Let’s enable turbo boost and run it pinned to a known processor:
```
# taskset 1 ./t
time: 10000102; clocls/nano: 2.399990
```
While it is running, we’ll check the clock speed of processor zero:
```
# cat /proc/cpuinfo | grep MHz | head -n 1
cpu MHz         : 1337.156
```
Now we comment out
delay and repeat:
```
# taskset 1 ./t
time: 8478796; clocls/nano: 2.399991
```
And the clock speed is
```
# cat /proc/cpuinfo | grep MHz | head -n 1
cpu MHz         : 3199.968
```
The rate of the rdtsc counter is the same in both runs, and it equals the nominal clock rate of the processor, which in both cases differed from the real clock rate of the corresponding core.
This means that the TSC is quite safe to use on these processors. One way to do this is by using the clock_gettime function mentioned above, with a clock id that corresponds to the TSC. The speed, however, isn’t great: we get 343 ns per call.
There is a better option: we keep using gettimeofday but set the TSC as the preferred time source for the entire system. This is very easy:
```
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```
To return it back to HPET we do the same, but use “hpet” instead of “tsc”.
This is temporary (does not survive reboot). Let’s go straight back to our Java program and put two zeroes back into the repetition count. The result is amazing:
```
# java Time
Sum = 2414988828092895332; time = 3636; or 36.36 ns / iter
Sum = 2414989192017338494; time = 3628; or 36.28 ns / iter
Sum = 2414989551179770420; time = 3604; or 36.04 ns / iter
```
This isn’t as good as in Windows (there it was under 4 ns), but still much better than before, and, probably, practically usable for the use cases listed in the beginning.
If we time
gettimeofday(), we get:
```
time for 100000000: 3.000262 s; ns/iter=30.002621
```
Since gettimeofday() returns the time in microseconds, the natural limit of its resolution is 1000 ns, and this limit is indeed reached: if we call this function many times and print the differences between consecutive values, we get many zeroes and occasional ones.
The same test for
clock_gettime (CLOCK_REALTIME) returns 30 ns – exactly the same as the execution time of each call, which is expected.
In short, the TSC time source is both very fast and very accurate, with two reservations:

- on some systems there may be reasons not to use it;
- a Java developer doesn’t always control, or even know, which time source is configured on the machine where the program runs.

We’ve seen that currentTimeMillis() based on the HPET runs for 640 ns (1.5 million operations per second). Is this per core or for the entire system?
Let’s run a similar test to
Time.java, but start N threads, where N is between 1 and 24 (the total number of cores in our dual-processor system,
including hyper-threading ones).
Here are the results for the small numbers of threads:
| Thread count | Avg time/call, ns | Total calls/sec, mil |
|---|---|---|
Here is the average time it takes to execute
currentTimeMillis() for all thread counts:
This looks quite linear, which makes us suspect that the HPET chip serialises the requests, serving only one at a time. Or, possibly, slightly more than one, judging from the transition from one thread to two, where the time per call didn’t double, but rather grew by a factor of 1.5.
Here is the graph of the total performance of the system (the number of calls that can be performed on all cores and processors, in one second):
The performance went up from 1.5M to about 2.1M op/sec and stayed there. The initial increase may have something to do with the fact that we’re testing on a dual-processor
system. Here are the times measured when the execution is limited to one physical processor:
| Thread count | Avg time/call, ns, two processors | Avg time/call, ns, one processor |
|---|---|---|
The single-processor time does not show that abnormal step between one and two threads; it is roughly proportional to the number of threads and (except for the value for one thread) is longer than the dual-processor time.
Testing with multiple processes gives similar results to that with multiple threads.
In short, the performance of the HPET is indeed limited system-wide. Not more than two million time enquiries per second can be performed on the machine, no
matter how we split the load between cores and processes. If 24 cores are loaded evenly, each can perform below 100K operations per second. This means that one must
really be careful when using
currentTimeMillis() in Java programs (and
gettimeofday() in C) if there are chances that the program will be executed using
HPET. This also applies to the use of
C++ time-reporting functionality (such as
std::chrono::system_clock::now()), for it uses clock_gettime (CLOCK_REALTIME), which ends up reading the same HPET registers (although it invokes a syscall in the process).
[A side note. Since processes affect each other when using the HPET, there are potential security problems. One process may run a tight loop calling gettimeofday(), thus starving all the other processes of access to this resource and reducing their performance. Alternatively, some process may call this function and time its execution using the TSC, detecting in this way when other processes query the current time, which may help establish which execution path those processes take. End of side note]
The TSC-based timer does not demonstrate this behaviour: its performance is quite stable, only dropping by about 40% when all the cores (including the hyper-threading ones) are used.
Off-topic: A system call
We’ve seen that the vDSO code can end up making a system call. In the Intel x64 version of Linux, system call 0x60 is exactly that: gettimeofday. How fast is this system call? We can call it directly, modifying the original time measurement of gettimeofday: we just replace the library call with a direct system call.
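For example, using the portable syscall() wrapper (a sketch):

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct timeval tv;
    /* the usual, vDSO-backed call:  gettimeofday(&tv, NULL);      */
    /* the same request issued as a direct system call (0x60):     */
    syscall(SYS_gettimeofday, &tv, NULL);
    printf("%ld.%06ld\n", (long) tv.tv_sec, (long) tv.tv_usec);
    return 0;
}
```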
This is what we get:
| Time source | Time for vDSO, ns | Time for syscall, ns | Cost of syscall, ns |
|---|---|---|---|
Previously, I believed that system calls in Linux were incredibly expensive. This measurement proves this belief false: a system call isn’t exactly free, but it is cheaper, for example, than an L3 cache miss (100 ns).
However, if the performed action is short (as it is for the TSC-based gettimeofday), it is still beneficial to avoid the system call. The vDSO really helps here: in our case, it made the execution almost three times faster.
What to do
The first prize is to make sure that the program only ever runs on Windows, or on Linux with the TSC time source. If this is impossible, there is no way to speed the call up while staying in pure Java, and then the solution is to make sure currentTimeMillis() isn’t called too often. Obviously, one can implement a custom version of this call using JNI, based on clock_gettime() or even on direct reading of the TSC (in the latter case, thread affinity control may be required).
We measured the cost of a JNI call here; it was 8 ns. The total cost will be under 20 ns, which is acceptable.
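Such a JNI implementation could look roughly like this (a sketch; the FastTime class and coarseMillis method names are hypothetical):

```c
#define _GNU_SOURCE
#include <jni.h>
#include <time.h>

/* CLOCK_REALTIME_COARSE is the fast, millisecond-resolution clock discussed above */
JNIEXPORT jlong JNICALL Java_FastTime_coarseMillis(JNIEnv *env, jclass cls)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return (jlong) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
```

On the Java side this would be declared as a native method and loaded via System.loadLibrary().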
Conclusions

The title isn’t completely correct: currentTimeMillis() isn’t always slow. However, there is a case when it is.
- This call is lightning fast on Windows (4 ns), and reasonably fast on Linux with the TSC time source (36 ns).
- It is indeed slow on Linux with the HPET time source: 640 ns.
- The HPET is shared between all running threads of all processes, so the entire system can query it only about two million times per second.
- Even this “slow” timer is sufficient for most applications. However, there are cases where higher performance is required.
- There is an API call that returns the coarse time and works fast: clock_gettime(). The Java VM could have used this call to provide a fast currentTimeMillis(). If absolutely necessary, one can implement it oneself using JNI.
- A Linux system call isn’t as dramatically slow as I believed, but it still takes substantial time (50 – 100 ns).
Update

During the discussion on reddit I received multiple comments pointing out that System.currentTimeMillis() is in fact a bad choice for the first three of the use cases listed in the beginning; System.nanoTime() is the correct one. The reason is that nanoTime() is monotonic while currentTimeMillis() is not: the latter may be affected by time adjustments from the NTP daemon, by leap seconds, or, for that matter, by a user-initiated time change.
Even though these factors do not apply to our project (the machines are synchronised with GPS time when booted, and later adjusted by sub-millisecond intervals; as a result, time never jumps back during a program run: it can only stall for a bit), I fully agree with this comment. I would gladly have used a monotonic coarse timer if there were one in Java.
It is a bit awkward that Java provides two functions that differ in two parameters – the nature of the time and the resolution. These parameters are orthogonal and suggest the need for four functions:

- monotonic coarse;
- monotonic fine (nanoTime());
- real-time coarse (currentTimeMillis());
- real-time fine.
Two are provided. There is, however, a need for the remaining two. The last one, for instance, may be useful to timestamp events in high-frequency trading. And what we needed was actually the first one. In fact, the only reason we needed a coarse timer in the first place was that it had the potential to be fast; actually, this is the only reason for coarse timers to exist at all. If coarse timers are not much faster than fine ones, they are not needed, and two types of timers are sufficient. This, unfortunately, seems to be the case: the coarse timer is not faster, due to currentTimeMillis() being implemented using gettimeofday().
Here are the evaluation times of nanoTime() compared to currentTimeMillis():

| Java method | Windows, ns | Linux, HPET, ns | Linux, TSC, ns |
|---|---|---|---|
The nanosecond timer is a bit slower than the millisecond one on Windows, but it is still fast enough. The Linux version runs at the same speed as the milli timer; it is implemented via the clock_gettime function with the CLOCK_MONOTONIC parameter, which ends up in the same vDSO, in the function do_monotonic.
The timer type is controlled by the same vclock_mode parameter as for do_realtime, which is used in gettimeofday. The execution ends up in the same vgetsns(), which either reads the HPET registers or performs rdtsc. The only difference is in the coarse values it reads: gtod->monotonic_time_* instead of gtod->wall_time_*. No wonder it runs at the same speed.
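For reference, the Linux implementation effectively boils down to this (a sketch, not the verbatim HotSpot source):

```c
#include <time.h>

long long java_time_nanos(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```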
The fact that both real-time and monotonic clocks share the same mode is unfortunate. As I said earlier, the reason for us to use the HPET was that it co-operated with the NTP daemon better. Since monotonic time isn’t affected by NTP, it could have been based on the TSC.
When TSC is used, consecutive calls to
nanoTime() return values that are about 80 ns apart. Consecutive calls to
clock_gettime() return intervals of about 48 – 50 ns.
It looks like the accuracy of
nanoTime() is only limited by its own duration (36 ns; the remaining 12 ns must be the test overhead).
The Windows version is based on QueryPerformanceCounter.
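In essence it does the following (a sketch, not the verbatim HotSpot source):

```c
#include <windows.h>

long long java_time_nanos(void)
{
    static LARGE_INTEGER performance_frequency;   /* queried once */
    LARGE_INTEGER count;

    if (performance_frequency.QuadPart == 0)
        QueryPerformanceFrequency(&performance_frequency);
    QueryPerformanceCounter(&count);

    /* scale the counter reading to nanoseconds without overflowing */
    long long sec = count.QuadPart / performance_frequency.QuadPart;
    long long rem = count.QuadPart % performance_frequency.QuadPart;
    return sec * 1000000000LL + rem * 1000000000LL / performance_frequency.QuadPart;
}
```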
performance_frequency comes from the call to
QueryPerformanceFrequency, which, on my PC, returns 2648441. This is the timer frequency in Hz, and it is
rather low. Windows can measure time very fast (in 16 ns), but in rather big increments (one tick equals 377.5 ns). A test in Java confirms it: consecutive nanoTime() calls return results that differ by zero and, occasionally, by 377.
Out of curiosity, let’s look at the disassembly of QueryPerformanceCounter (this time in 64-bit mode only).
I didn’t include the code where
jne goes if
al (as loaded from
7FFE03C6h) is not equal to 1, because in my case it is. Probably, this is the flag that tells
it to use
rdtsc. The value is read, added to the number from
7FFE03B8h (probably, some base value), and then shifted right by a number read from yet another field; this number is 10 in our case. This explains the magical value of 2648441 for the performance frequency: it is the clock frequency of the TSC on this computer (2.7 GHz) divided by 1024. The resolution could have been higher if a smaller shift factor had been chosen.
Conclusions of the Update
- Coarse timers in Java are not much faster than the fine timers. This means that fine timers could be used everywhere, if only a fine real-time one were provided.
- On Linux, the nanoTime() performance in TSC mode is mostly satisfactory; in HPET mode it is not.
- On Windows, nanoTime() is four times slower than currentTimeMillis(), but still fast enough; the resolution, however, is far from ideal, for an unknown reason.
Comments are welcome below or on reddit.