Today we’ll look at one of the most basic and most often used methods from the Java library: System.currentTimeMillis().
This method reports the current time with millisecond accuracy. One may think that, because of this, the performance of the method is irrelevant: who cares if obtaining the current time takes 0.1 or 0.2 ms when the measured interval is 100 or 500 milliseconds long? There are, however, cases when we might still want to invoke this method frequently. Here are some examples:

- Detecting and reporting abnormally long execution times. For instance, we can measure the time it takes to execute an HTTP request. In most cases (we hope) it takes below one millisecond, which this method will report as zero, but we want to be alarmed if the measured time is abnormally long (e.g. exceeds 100 ms). In this case we’ll time every request, and there may be hundreds of thousands, or even millions, of them per second.
- Obtaining timestamps to be associated with some objects, for instance, with cached data – to arrange time-based eviction from a cache.
- Measuring the duration of long, but frequently initiated, asynchronous processes, such as requests to remote servers.
- Time-stamping real-world events. For instance, a trading system may wish to record the timestamps of incoming orders and executed deals.
In short, despite the rather low accuracy of this method, there are cases when it can be called very often, so a very valid question arises: what is the performance of this method?
The way to test the performance of currentTimeMillis() is straightforward: we call it many times, while making sure that the calls aren’t optimised out altogether (for instance, by summing the returned values):
Running it on Windows (my notebook, running Windows 10 on a Core i7-6820HQ @ 2.7 GHz), using Java 1.8.0_92, we get:
```
>java Time
Sum = 2276735854871496107; time = 393; or 3.93 ns / iter
Sum = 2276735892808191170; time = 382; or 3.82 ns / iter
Sum = 2276735930338889327; time = 379; or 3.79 ns / iter
```
This is a very good result. When the time is reported in 3.8 ns, we can put time requests just about anywhere: we can measure the duration of any operation and attach a timestamp to any resource. It is virtually free. We can query the time 260 million times per second.
Let’s run it on Linux. The machine I use for testing is of the RHEL flavour, with kernel version 3.17.4, and it runs on two Xeon® E5-2620 v3 CPUs @ 2.40 GHz.
```
# java Time
Sum = 1499457980079543330; time = 652; or 652.0 ns / iter
Sum = 1499457980725968363; time = 642; or 642.0 ns / iter
Sum = 1499457981368550493; time = 643; or 643.0 ns / iter
```
We had to reduce the iteration count by a couple of zeroes (to one million), because the test runs two hundred times longer. The average time to query the time on Linux is about 640 ns, which is more than half a microsecond. We can only execute this call 1.5 million times per second.
This is really shocking, and it means that we must be careful with our use of currentTimeMillis() on Linux. While still applicable to measuring the duration of long sequential operations, this method can’t really be used for the tasks listed above.
Why is this and what can be done?
currentTimeMillis() is a native function. Its code can be found in the OpenJDK distribution, where it is linked to JVM_CurrentTimeMillis (hotspot/src/share/vm/prims/jvm.cpp), which eventually ends up in os::javaTimeMillis(). This call is OS-dependent.
The Windows version of this code lives in hotspot/src/os/windows/vm/os_windows.cpp.
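Roughly, it amounts to the following (a sketch of the idea, not the verbatim HotSpot source):

```c
#include <windows.h>

long long java_time_millis(void)
{
    /* FILETIME value of 1970-01-01 00:00:00 UTC (the Java epoch) */
    const long long EPOCH_OFFSET = 116444736000000000LL;

    FILETIME wt;
    GetSystemTimeAsFileTime(&wt);
    long long t = ((long long) wt.dwHighDateTime << 32) | wt.dwLowDateTime;
    return (t - EPOCH_OFFSET) / 10000;     /* 100-ns units -> milliseconds */
}
```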
Java time is based on GetSystemTimeAsFileTime(), which returns a FILETIME. This structure dates back to 32-bit systems, and consists of two 32-bit fields, dwLowDateTime and dwHighDateTime, which, combined together, define a 64-bit number of 100-nanosecond intervals since the epoch.
In MSVC, we can call this function and trace its execution in a debugger; the disassembly in 32-bit mode shows how it works.
This is indeed very clever. The function contains no system calls: everything happens in user space. There is a memory area that is mapped into the address space of every process. A background thread regularly updates three double-words there, which contain the high part of the value, the low part, and the high part again. The client reads all three, in this order, and, if the two high parts are equal, the low part is considered consistent with them (the strong memory access ordering of x86 guarantees this). If not, the procedure must be repeated. This is a very unlikely event, which is why the entire procedure is so fast.
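In C, the read procedure amounts to something like this (a sketch; the pointer names are made up, not actual Windows symbols):

```c
#include <stdint.h>

/* hi, lo and hi2 stand for the three shared double-words described above;
   they are not real Windows symbol names. */
uint64_t read_shared_time(const volatile uint32_t *hi,
                          const volatile uint32_t *lo,
                          const volatile uint32_t *hi2)
{
    for (;;) {
        uint32_t h1 = *hi;                 /* high part             */
        uint32_t l  = *lo;                 /* low part              */
        uint32_t h2 = *hi2;                /* high part, once more  */
        if (h1 == h2)                      /* consistent snapshot   */
            return ((uint64_t) h1 << 32) | l;
        /* the writer intervened; retry (a very rare event) */
    }
}
```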
The function looks even better in the 64-bit mode: there the background thread can write the value atomically, and the client can atomically read it. Actually, the code could have been even better – the 64-bit value could have been written in one instruction.
However, the code is much faster than any possible system call anyway.
What is the resolution of this timer? Let’s collect the values into an array and print the changes later.
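A sketch of such a test (not necessarily the exact program used here):

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    enum { N = 1000000 };
    static unsigned long long t[N];

    for (int i = 0; i < N; i++) {
        FILETIME ft;
        GetSystemTimeAsFileTime(&ft);
        t[i] = ((unsigned long long) ft.dwHighDateTime << 32) | ft.dwLowDateTime;
    }
    for (int i = 1; i < N; i++)
        if (t[i] != t[i - 1])              /* print the index and the step, in 100-ns units */
            printf("%d %llu\n", i, t[i] - t[i - 1]);
    return 0;
}
```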
The output is:
```
68208  5044
214315 5015
362575 5037
508633 4987
654960 5022
800065 5007
943784 5041
```
The timer ticks roughly once every half a millisecond (2000 Hz). This is perfectly adequate to serve as the basis of currentTimeMillis().
Linux

This happened to be a much longer journey than I expected. We start at another version of os::javaTimeMillis() – the one in hotspot/src/os/linux/vm/os_linux.cpp – which obtains the time from gettimeofday() and converts microseconds to milliseconds. It is very unlikely that the time is spent in the multiplication or division by 1000, so let’s measure the performance of gettimeofday() itself:
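A test along these lines will do (a sketch, not necessarily the exact program used):

```c
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    enum { N = 10000000 };
    struct timeval start, end, tv;
    long long sum = 0;

    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++) {
        gettimeofday(&tv, NULL);
        sum += tv.tv_usec;                 /* keep the calls from being optimised out */
    }
    gettimeofday(&end, NULL);

    double sec = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) * 1e-6;
    printf("Time for %d: %f s; %f ns (sum=%lld)\n", N, sec, sec * 1e9 / N, sum);
    return 0;
}
```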
Running it, we get:
```
Time for 10000000: 6.251108 s; 625.110800 ns
```
The call to
gettimeofday is indeed slow. Why? Is there perhaps some system call involved?
Let’s run it under gdb. After we have stepped into the call of gettimeofday, we see a short stub performing an indirect jmpq.
This jmpq in fact jumps to the next instruction (+6), and eventually we end up in a function called _dl_runtime_resolve – the dynamic loader’s lazy symbol resolver.
What is actually happening here is that we are linking to the
vDSO (virtual dynamic shared object), which is a small fully relocatable shared library
pre-mapped into the user address space. The linking happens during the first call of
gettimeofday, after which the call is resolved, and the first indirect jump
goes straight into the function. Let’s skip this execution and break before the next call to
gettimeofday. The function looks the same:
but this time the jump takes us to the real implementation inside the vDSO.
The vDSO is mapped to a new address each time (probably a security feature). However, the function is always placed at the same address when run under gdb (I’m not sure what causes this). The function accesses some memory at addresses not far from its code (such as 0x7ffff7ff90f0), always using the “relative-to-IP” addressing mode. This makes the function completely relocatable – it can be mapped to any address in the user space, together with all its data.
We can see that the function employs various options. It can make a
syscall, it can execute an
rdtsc instruction, or it can just read some memory (in a similar
fashion to the Windows implementation). Which of these options is applied in our case?
Let’s look at the source code of our version of the Linux kernel (3.17.4). Note that this is not the latest one (Linux distributors are very conservative people); the code of the latest one differs quite a bit (including some constant definitions). The gettimeofday implementation lives in the vDSO source, arch/x86/vdso/vclock_gettime.c, and revolves around a pseudo-variable called gtod (which, obviously, stands for “get time of day”).
What is going on here is very close to what happened in Windows. There is a data structure that accompanies the vDSO and is mapped into the address space of every process. It is called vvar_vsyscall_gtod_data and is addressed in the code via the pseudo-variable gtod; in the assembly listing above, this structure is what was being accessed at those fixed IP-relative addresses.
Some background thread updates the fields of this structure at regular intervals. Since there are more than two of these fields, the Windows trick of writing the
high part of the number twice doesn’t work. However, a similar trick works. The writer maintains a version number of the data in the structure (the seq field). It gets incremented by one when the writer starts updating the structure, and again by one after it finishes (with an appropriate write barrier instruction being used).
As a result, an odd value means that the data isn’t consistent. The reader must read the number, make sure it’s even (waiting a bit using the pause instruction if not), read all the values of interest from the structure, then read the version number again; if it is the same as in the beginning, the data is considered correct. This is what the gtod_read_begin and gtod_read_retry functions are for.
Read barrier instructions must be used to make sure that the processor doesn’t re-order reading the version number with reading the actual data. However, the strong memory ordering of Intel x86 makes this unnecessary, so the read barrier call (smp_rmb()) is empty in our case.
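In C, the reader’s side of this protocol amounts to something like the following sketch (not the kernel source verbatim):

```c
#include <stdint.h>

/* seq points at the version number field in the shared structure */
static inline unsigned gtod_read_begin(const volatile unsigned *seq)
{
    unsigned s;
    while ((s = *seq) & 1)                 /* odd value: the writer is busy */
        __asm__ __volatile__("pause");
    return s;                              /* on x86 no explicit read barrier is needed */
}

static inline int gtod_read_retry(const volatile unsigned *seq, unsigned start)
{
    return *seq != start;                  /* the data changed while we were reading it */
}

/* typical use:
       unsigned s;
       do {
           s    = gtod_read_begin(&gtod->seq);
           sec  = gtod->wall_time_sec;
           nsec = gtod->wall_time_snsec;
       } while (gtod_read_retry(&gtod->seq, s));
*/
```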
The values of interest are wall_time_sec and wall_time_snsec. The first one is the proper number of seconds since the epoch, as reported by the good old
time() call. In fact,
time() is implemented by reading exactly this value, without any locking or version control.
The second value is the nanosecond time, relative to the last second, shifted left by gtod->shift. Or, rather, it is the time measured in some units, which becomes the nanosecond time when shifted right by gtod->shift, which, in our case, is 25. This is very high precision: one nanosecond divided by 2^25 is about 30 attoseconds. Perhaps the Linux kernel designers looked far ahead, anticipating the future when we’ll need such accuracy for our time measurements.
What is the real resolution of these time values? Let’s read them in a loop and print them when they change. Unless we want to run the program in gdb, we must first find the address at which the gtod structure is mapped into our process (outside of gdb it changes from run to run). Then we can run a loop that detects changes in wall_time_snsec:
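A sketch of such an experiment (not the exact program used here; both the structure’s address and its field layout are assumptions – the address is whatever is observed for the current run, and the layout must be checked against the kernel’s arch/x86/include/asm/vgtod.h):

```c
#include <stdio.h>
#include <stdint.h>

/* Field layout roughly matching struct vsyscall_gtod_data in kernel 3.17 --
   an assumption, verify against your kernel's vgtod.h. */
struct gtod_data {
    unsigned seq;
    int      vclock_mode;
    uint64_t cycle_last;
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
    uint64_t wall_time_snsec;
    uint64_t wall_time_sec;
};

/* Placeholder: the address where the structure was observed (e.g. under gdb);
   substitute the real one. */
#define GTOD_ADDR 0x7ffff7ff9080UL

int main(void)
{
    volatile struct gtod_data *gtod = (volatile struct gtod_data *) GTOD_ADDR;
    uint64_t prev = gtod->wall_time_snsec;

    for (long i = 0; i < 10000000; i++) {
        uint64_t ns = gtod->wall_time_snsec;
        if (ns != prev) {
            printf("%ld: sec=%llu ns=%llu diff=%lld ns_shift=%llu diff_shift=%lld\n",
                   i,
                   (unsigned long long) gtod->wall_time_sec,
                   (unsigned long long) ns,
                   (long long) (ns - prev),
                   (unsigned long long) (ns >> gtod->shift),
                   (long long) (((int64_t) (ns - prev)) >> gtod->shift));
            prev = ns;
        }
    }
    return 0;
}
```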
A typical fragment of the output:
```
3721507: sec=1499630953 ns=16894552383987602 diff=33554102663246 ns_shift=503496896 diff_shift=999990
4367859: sec=1499630953 ns=16928106486650848 diff=33554102663246 ns_shift=504496886 diff_shift=999990
5014216: sec=1499630953 ns=16961660589326449 diff=33554102675601 ns_shift=505496877 diff_shift=999991
```
We see that the wall_time_snsec value is updated rather infrequently: for about 650,000 iterations it stays the same and then jumps by a big value, which, being shifted right by shift, becomes 999990 ns, or almost exactly one millisecond. It’s also interesting to see what happens when the seconds value changes:
```
249316536: sec=1499631433 ns=33542778622228903 diff=33554096688437 ns_shift=999652702 diff_shift=999990
250511726: sec=1499631434 ns=21900718852443 diff=-33520877903376460 ns_shift=652692 diff_shift=-999000010
```
The new nanosecond value doesn’t start at zero; it starts at 652692, which is over half a millisecond. It’s not always that big – I saw values of 200K and 300K.
In short, two variables available in the
gtod structure provide very fast access to a rather coarse time value with a one millisecond resolution. Just like in Windows.
gettimeofday() does not stop here, however: it tries to improve accuracy using other methods. In the code above, this is what the vgetsns() function is responsible for.
The idea is that somewhere in the system there is a high-frequency timer, which we can ask for a current tick count. We record the reading of that timer at the
last tick of the coarse timer (gtod->cycle_last), and get the difference. The gtod->mask and gtod->mult fields help convert the reading of that timer to our
30-attosecond units. This can explain the choice of those units: it’s not that Linux designers wanted to measure times of molecular processes; the unit is chosen very small
to reduce the errors during this conversion.
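Putting it together, the computation is roughly the following (a simplified sketch of the do_realtime path, reusing the structure sketched earlier; read_hw_counter is a placeholder for whichever mechanism is in use):

```c
#include <stdint.h>

/* In the real code this runs inside the seq-number retry loop shown earlier. */
uint64_t realtime_ns(const volatile struct gtod_data *gtod,
                     uint64_t (*read_hw_counter)(void))
{
    uint64_t cycles = read_hw_counter();
    uint64_t delta  = (cycles - gtod->cycle_last) & gtod->mask;      /* vgetsns() */
    uint64_t snsec  = gtod->wall_time_snsec + delta * gtod->mult;    /* 2^-shift ns units */
    uint64_t nsec   = snsec >> gtod->shift;                          /* back to nanoseconds */
    uint64_t sec    = gtod->wall_time_sec;

    while (nsec >= 1000000000ULL) { nsec -= 1000000000ULL; sec++; }
    return sec * 1000000000ULL + nsec;
}
```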
The code above provides for three types of high-frequency timers:
TSC (based on the
RDTSC instruction of Intel x86 instruction set),
HPET (based on the external
hardware, the HPET chip), and
PVCLOCK, which, most probably, has something to do with running in the VM environment.
In any case, the code only makes use of the high-precision timer to get the offset from the coarse time. The timer doesn’t have to stay stable for years; it is only supposed to be stable enough to provide the time offset between the coarse timer ticks. Moreover, its clock frequency may vary, as long as that variance is known to the system and reflected in appropriate updates of the mult and shift fields. The system is very well designed to provide very accurate time very fast.
So what went wrong?
In our case, the
vclock_mode field is set to 2, which means using the HPET (High Precision Event Timer). This timer is a hardware piece included into all modern PCs: a replacement of
the good old 8253/8254 chip. That chip ran a counter, which, upon reaching zero, could trigger an interrupt, and could also be read directly using
IN instructions. It ran at the
frequency of 1.19318 MHz, and, if programmed to expire every 65536 clocks, caused an interrupt every 55 ms (18.2 Hz), which we all remember since the MS DOS days.
The HPET runs at a higher frequency, counts up rather than down, and is accessed via memory-mapped I/O rather than I/O ports.
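The kernel routine that reads the counter boils down to something like this (a sketch; the HPET main counter sits at offset 0xF0 of the memory-mapped register page):

```c
#include <stdint.h>

#define HPET_COUNTER 0xF0   /* offset of the main counter register in the HPET page */

static inline uint32_t read_hpet(const volatile uint8_t *hpet_page)
{
    return *(const volatile uint32_t *) (hpet_page + HPET_COUNTER);
}
```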
This compiles into just one 32-bit memory read instruction. gettimeofday in the case of the HPET time source is thus compiled into a sequence of memory reads plus some shifts and multiplications; no divisions or other expensive operations are present. Why is it then executing so slowly?
I didn’t have a sampling profiler available on the test machine, but there is a poor man’s solution: to attach
gdb to a running program that executes
gettimeofday in a loop, and then interrupt it.
In all the cases that I tried, it always stopped at exactly one place: right after the above-mentioned memory read (at gettimeofday+550). This indicates that this instruction alone takes most of the execution time. This is easy to verify: let’s resolve the address of hpet_page in a similar way to resolving vsyscall_gtod_data and read this location in a loop:
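A sketch of such a loop (the HPET page address here is a placeholder that must be replaced with the one actually observed):

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

#define HPET_COUNTER 0xF0
/* Placeholder: the address at which the HPET register page is mapped,
   as observed under gdb; substitute the real value. */
#define HPET_PAGE    0x7ffff7ff8000UL

int main(void)
{
    const volatile uint32_t *counter =
        (const volatile uint32_t *) (HPET_PAGE + HPET_COUNTER);
    enum { N = 1000000 };
    struct timeval t0, t1;
    uint64_t sum = 0;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < N; i++)
        sum += *counter;                   /* one memory-mapped read per iteration */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("Time for %d: %f s; ns/iter=%f (sum=%llu)\n",
           N, sec, sec * 1e9 / N, (unsigned long long) sum);
    return 0;
}
```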
```
Time for 1000000: 0.564707 s; ns/iter=564.707000
```
It takes more than half a microsecond to read the data from a HPET timer, which makes it a very slow time source. If we print the values we read from this timer,
we’ll see that, unlike in the case of
gtod->wall_time_snsec, the values change each time we access the timer. The typical difference is 8, which means that the timer
ticks 8 times while being read. The timer’s accuracy is way higher than its performance. The difference between timer values measured at the points when the
coarse nanoseconds change (which, as we learned, happens every millisecond), is 14320, which means that the HPET frequency is 14.32 MHz (70 ns per tick).
This is really sad: the timer runs at high frequency, but there is no way to read it at this frequency. We can read it at about 1.8 MHz, which is just a bit higher than the frequency of the MS DOS timer.
And the worst part is that in our use case we are in fact uninterested in such a high accuracy anyway – all we need is
currentTimeMillis(), and the coarse
timer is perfectly suitable for that.
The coarse timer
There is an easy way out, because the coarse timer is available via the user API: the clock_gettime() call. Among the multiple values defined for its clk_id parameter there is one that does exactly what we need: CLOCK_REALTIME_COARSE (value 5).
The code for this function is present in the same vDSO (the do_realtime_coarse function).
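We can time it the same way we timed gettimeofday() above (a sketch):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { N = 100000000 };
    struct timespec start, end, ts;
    long long sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);
    for (int i = 0; i < N; i++) {
        clock_gettime(CLOCK_REALTIME_COARSE, &ts);
        sum += ts.tv_nsec;                 /* keep the calls from being optimised out */
    }
    clock_gettime(CLOCK_REALTIME, &end);

    double sec = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    printf("Time for %d: %f s; ns/iter=%f (sum=%lld)\n", N, sec, sec * 1e9 / N, sum);
    return 0;
}
```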
The nanosecond times reported by this function are 999991 ns (1 ms) apart, and the execution takes 8.4 ns.
The TSC time source
We’ve learned that the slow execution of currentTimeMillis() was caused by two factors:

- the JVM using gettimeofday() rather than a coarse clock;
- gettimeofday() being very slow if the HPET time source is used.
HPET, however, isn’t the only time source these days; perhaps not even the most common one: many systems use the TSC. This could make the entire study irrelevant. However, in our project the servers are configured with the HPET time source for a reason: this time source integrated well with the NTP client, allowing for smooth time adjustment, while the TSC was not as stable (I don’t know the details; this is what the local Linux gurus say, and I have no option but to trust them). Some other developers may find themselves in the same situation. Besides, a Java developer can’t always know what time source will be configured on the machine the program will run on. That’s why I feel that the findings made so far are still important.
However, it is still interesting to find out how the results would change if we used the TSC time source. TSC stands for the time stamp counter, which is simply the number of CPU cycles counted since startup (it is only 64 bits wide, so it will wrap around in 243 years at a 2.4 GHz clock rate). This value can be read using the rdtsc instruction. Traditionally, there were two problems with this value:
- the values from different cores or physical processors may be shifted against each other, as the processors may start at different times
- the clock frequency of a processor may change during execution.
The first one seems to be indeed a problem. I tried getting the
rdtsc value from several cores at once, synchronised on writing into some memory location.
Even in the best case I got differences of a couple of thousand cycles. Sometimes more. However, this is only a problem if the programmer wants to use the TSC
manually; in this case the thread affinity must be set accordingly. The OS knows when it re-schedules threads from one core to another, so it can make all necessary adjustments.
The second problem seems to be a thing of the past. The Intel doc says this:
Processor families increment the time-stamp counter differently:
For Pentium M processors (family [06H], models [09H, 0DH]); for Pentium 4 processors, Intel Xeon processors (family [0FH], models [00H, 01H, or 02H]); and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to busclock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H and higher]); for Intel Core Solo and Intel Core Duo processors (family [06H], model [0EH]); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (family [06H], model [0FH]); for Intel Core 2 and Intel Xeon processors (family [06H], DisplayModel [17H]); for Intel Atom processors (family [06H], DisplayModel [1CH]): the time-stamp counter increments at a constant rate.
It also says:
Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX.
This we can test:
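For instance, using GCC’s cpuid.h helper (a minimal sketch):

```c
#include <stdio.h>
#include <cpuid.h>

/* Invariant TSC: CPUID leaf 0x80000007, bit 8 of EDX */
int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 0x80000007 is not supported\n");
        return 1;
    }
    printf("Invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}
```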
On our test machine it reports that the bit is set.
Additional testing confirms this:
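The idea of the test: measure how many TSC ticks elapse per nanosecond of wall-clock time, once with the core mostly idle and once with it spinning. A sketch (not the original program; the use of __rdtsc() and the 10 ms run length are choices made here):

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>                    /* __rdtsc() */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t t0 = now_ns();
    uint64_t c0 = __rdtsc();
    volatile uint64_t sink = 0;

    do {
        usleep(100);                      /* the delay: comment out for the busy run */
        for (int i = 0; i < 1000; i++)
            sink += i;
    } while (now_ns() - t0 < 10000000);   /* run for roughly 10 ms */

    uint64_t t1 = now_ns();
    uint64_t c1 = __rdtsc();
    printf("time: %llu; clocks/nano: %f\n",
           (unsigned long long) (t1 - t0),
           (double) (c1 - c0) / (double) (t1 - t0));
    return 0;
}
```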
Let’s enable turbo boost and run it pinned to a known processor:
```
# taskset 1 ./t
time: 10000102; clocls/nano: 2.399990
```
While it is running, we’ll check the clock speed of processor zero:
```
# cat /proc/cpuinfo | grep MHz | head -n 1
cpu MHz         : 1337.156
```
Now we comment out
delay and repeat:
```
# taskset 1 ./t
time: 8478796; clocls/nano: 2.399991
```
And the clock speed is
```
# cat /proc/cpuinfo | grep MHz | head -n 1
cpu MHz         : 3199.968
```
The rate of the rdtsc counter is the same in both runs, and it equals the nominal clock rate of the processor, which in both cases differed from the real clock rate of the corresponding core.
This means that the TSC is quite safe to use on these processors. One way to do this is by using the clock_gettime function mentioned above, with a clock id that corresponds to the TSC. The speed, however, isn’t great: we get 343 ns per call.
There is a better option: we keep using gettimeofday but set the TSC as the preferred time source for the entire system. This is very easy:
```
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```
To return it back to HPET we do the same, but use “hpet” instead of “tsc”.
This is temporary (does not survive reboot). Let’s go straight back to our Java program and put two zeroes back into the repetition count. The result is amazing:
```
# java Time
Sum = 2414988828092895332; time = 3636; or 36.36 ns / iter
Sum = 2414989192017338494; time = 3628; or 36.28 ns / iter
Sum = 2414989551179770420; time = 3604; or 36.04 ns / iter
```
This isn’t as good as in Windows (there it was under 4 ns), but still much better than before, and, probably, practically usable for the use cases listed in the beginning.
If we time
gettimeofday(), we get:
```
time for 100000000: 3.000262 s; ns/iter=30.002621
```
Since gettimeofday() returns the time in microseconds, the natural limit of its resolution is 1000 ns, and this limit is indeed reached: if we call this function many times and print the differences between consecutive values, we get many zeroes and occasional ones.
The same test for
clock_gettime (CLOCK_REALTIME) returns 30 ns – exactly the same as the execution time of each call, which is expected.
In short, the TSC time source is both very fast and very accurate, with two reservations:

- on some systems there may be reasons not to use it;
- a Java developer doesn’t always control, or even know, which time source is configured on the machine where the program runs.

We’ve seen that currentTimeMillis() based on the HPET runs for 640 ns (1.5 million operations per second). Is this per core or for the entire system?
Let’s run a similar test to
Time.java, but start N threads, where N is between 1 and 24 (the total number of cores in our dual-processor system,
including hyper-threading ones).
Here are the results for the small numbers of threads:
| Thread count | Avg time/call, ns | Total calls/sec, mil |
|---|---|---|
Here is the average time it takes to execute
currentTimeMillis() for all thread counts:
This looks quite linear, which makes us suspect that the HPET chip serialises the requests, serving only one at a time. Or, possibly, slightly more than one, judging from the transition from one thread to two, where the time per call didn’t double, but rather grew by a factor of 1.5.
Here is the graph of the total performance of the system (the number of calls that can be performed on all cores and processors, in one second):
The performance went up from 1.5M to about 2.1M op/sec and stayed there. The initial increase may have something to do with the fact that we’re testing on a dual-processor
system. Here are the times measured when the execution is limited to one physical processor:
| Thread count | Avg time/call, ns, two processors | Avg time/call, ns, one processor |
|---|---|---|
The single-processor time does not show that abnormal step between one and two threads; it is roughly proportional to the number of threads and (except for the value for one thread) is longer than the dual-processor time.
Testing with multiple processes gives similar results to that with multiple threads.
In short, the performance of the HPET is indeed limited system-wide. Not more than two million time enquiries per second can be performed on the machine, no
matter how we split the load between cores and processes. If 24 cores are loaded evenly, each can perform below 100K operations per second. This means that one must
really be careful when using
currentTimeMillis() in Java programs (and
gettimeofday() in C) if there are chances that the program will be executed using
HPET. This also applies to the use of
C++ time-reporting functionality (such as
std::chrono::system_clock::now()), for it uses clock_gettime (CLOCK_REALTIME), which ends up reading the same HPET registers (although it invokes a syscall in the process).
[A side note. Since processes affect each other when using the HPET, there are potential security problems. One process may run a tight loop calling gettimeofday(), thus starving all the other processes of access to this resource and reducing their performance. Alternatively, some process may call this function and time its execution using the TSC, detecting in this way when other processes query the current time, which may help establish which execution path those processes take. End of side note]
The TSC-based timer does not demonstrate this behaviour: its performance is quite stable, only dropping by about 40% when all the cores (including the hyper-threading ones) are used.
Off-topic: A system call
We’ve seen that the vDSO code can end up making a system call. In the Intel x64 version of Linux, system call 0x60 is exactly that: gettimeofday. How fast is this system call? We can call it directly, modifying the original time measurement of gettimeofday: we just replace the library call with a direct system call.
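For example, using the portable syscall() wrapper (a sketch):

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct timeval tv;
    /* the usual, vDSO-backed call:  gettimeofday(&tv, NULL);      */
    /* the same request issued as a direct system call (0x60):     */
    syscall(SYS_gettimeofday, &tv, NULL);
    printf("%ld.%06ld\n", (long) tv.tv_sec, (long) tv.tv_usec);
    return 0;
}
```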
This is what we get:
| Time source | Time for vDSO, ns | Time for syscall, ns | Cost of syscall, ns |
|---|---|---|---|
Previously, I believed that system calls in Linux were incredibly expensive. This measurement proves this belief false: a system call isn’t exactly free, but it is cheaper, for example, than an L3 cache miss (100 ns).
However, if the performed action is short (as it is for the TSC-based gettimeofday), it is still beneficial to avoid the system call. The vDSO really helps here: in our case, it made the execution almost three times faster.
What to do
The first prize is to make sure that the program only ever runs on Windows, or on Linux with the TSC time source. If this is impossible, there is no way to speed the call up while staying in pure Java, and then the solution is to make sure currentTimeMillis() isn’t called too often. Obviously, one can implement a custom version of this call using JNI, based on clock_gettime() or even on direct reading of the TSC (in the latter case, thread affinity control may be required).
We measured the cost of a JNI call here; it was 8 ns. The total cost will be under 20 ns, which is acceptable.
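Such a JNI implementation could look roughly like this (a sketch; the FastTime class and coarseMillis method names are hypothetical):

```c
#define _GNU_SOURCE
#include <jni.h>
#include <time.h>

/* CLOCK_REALTIME_COARSE is the fast, millisecond-resolution clock discussed above */
JNIEXPORT jlong JNICALL Java_FastTime_coarseMillis(JNIEnv *env, jclass cls)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return (jlong) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
```

On the Java side this would be declared as a native method and loaded via System.loadLibrary().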
Conclusions

The title isn’t completely correct: currentTimeMillis() isn’t always slow. However, there is a case when it is.
- This call is lightning fast on Windows (4 ns), and reasonably fast on Linux with the TSC time source (36 ns).
- It is indeed slow on Linux with the HPET time source: 640 ns.
- The HPET is shared between all running threads of all processes, so the entire system can query it only about two million times per second.
- Even this “slow” timer is sufficient for most applications. However, there are cases where higher performance is required.
- There is an API call that returns the coarse time and works fast: clock_gettime(). The Java VM could have used this call to provide a fast currentTimeMillis(). If absolutely necessary, one can implement it oneself using JNI.
- A Linux system call isn’t as dramatically slow as I believed, but it still takes substantial time (50 – 100 ns).
Update

During the discussion on reddit I received multiple comments pointing out that System.currentTimeMillis() is in fact a bad choice for the first three of the use cases listed in the beginning; System.nanoTime() is the correct one. The reason is that nanoTime() is monotonic while currentTimeMillis() is not: the latter may be affected by time adjustments from the NTP daemon, by leap seconds, or, for that matter, by a user-initiated time change.
Even though these factors do not apply to our project (the machines are synchronised with GPS time when booted, and later adjusted by sub-millisecond intervals; as a result, time never jumps back during a program run: it can only stall for a bit), I fully agree with this comment. I would gladly have used a monotonic coarse timer if there were one in Java.
It is a bit awkward that Java provides two functions that differ in two parameters – the nature of the time and the resolution. These parameters are orthogonal and suggest the need for four functions:

- monotonic coarse;
- monotonic fine (nanoTime());
- real-time coarse (currentTimeMillis());
- real-time fine.
Two are provided. There is, however, a need for the remaining two. The last one, for instance, may be useful to timestamp events in high-frequency trading. And what we needed was actually the first one. In fact, the only reason we needed a coarse timer in the first place was that it had the potential to be fast; actually, this is the only reason for coarse timers to exist at all. If coarse timers are not much faster than fine ones, they are not needed, and two types of timers are sufficient. This, unfortunately, seems to be the case: the coarse timer is not faster, due to currentTimeMillis() being implemented using gettimeofday().
Here are the evaluation times of nanoTime() compared to currentTimeMillis():

| Java method | Windows, ns | Linux, HPET, ns | Linux, TSC, ns |
|---|---|---|---|
The nanosecond timer is a bit slower than the millisecond one on Windows, but it is still fast enough. The Linux version runs at the same speed as the milli timer; it is implemented via the clock_gettime function with the CLOCK_MONOTONIC parameter, which ends up in the same vDSO, in the function do_monotonic.
The timer type is controlled by the same vclock_mode parameter as for do_realtime, which is used in gettimeofday. The execution ends up in the same vgetsns(), which either reads the HPET registers or performs rdtsc. The only difference is in the coarse values it reads: gtod->monotonic_time_* instead of gtod->wall_time_*. No wonder it runs at the same speed.
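For reference, the Linux implementation effectively boils down to this (a sketch, not the verbatim HotSpot source):

```c
#include <time.h>

long long java_time_nanos(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```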
The fact that both real-time and monotonic clocks share the same mode is unfortunate. As I said earlier, the reason for us to use the HPET was that it co-operated with the NTP daemon better. Since monotonic time isn’t affected by NTP, it could have been based on the TSC.
When TSC is used, consecutive calls to
nanoTime() return values that are about 80 ns apart. Consecutive calls to
clock_gettime() return intervals of about 48 – 50 ns.
It looks like the accuracy of
nanoTime() is only limited by its own duration (36 ns; the remaining 12 ns must be the test overhead).
The Windows version is based on QueryPerformanceCounter.
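In essence it does the following (a sketch, not the verbatim HotSpot source):

```c
#include <windows.h>

long long java_time_nanos(void)
{
    static LARGE_INTEGER performance_frequency;   /* queried once */
    LARGE_INTEGER count;

    if (performance_frequency.QuadPart == 0)
        QueryPerformanceFrequency(&performance_frequency);
    QueryPerformanceCounter(&count);

    /* scale the counter reading to nanoseconds without overflowing */
    long long sec = count.QuadPart / performance_frequency.QuadPart;
    long long rem = count.QuadPart % performance_frequency.QuadPart;
    return sec * 1000000000LL + rem * 1000000000LL / performance_frequency.QuadPart;
}
```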
performance_frequency comes from the call to
QueryPerformanceFrequency, which, on my PC, returns 2648441. This is the timer frequency in Hz, and it is
rather low. Windows can measure time very fast (in 16 ns), but in rather big increments (one tick equals 377.5 ns). A test in Java confirms it: consecutive nanoTime() calls return results that differ by zero and, occasionally, by 377.
Out of curiosity, let’s look at the disassembly of QueryPerformanceCounter (this time in 64-bit mode only).
I didn’t include the code where
jne goes if
al (as loaded from
7FFE03C6h) is not equal to 1, because in my case it is. Probably, this is the flag that tells
it to use
rdtsc. The value is read, added to the number from
7FFE03B8h (probably, some base value), and then shifted right by a number read from yet another field; this number is 10 in our case. This explains the magical value of 2648441 for the performance frequency: it is the clock frequency of the TSC on this computer (2.7 GHz) divided by 1024. The resolution could have been higher if a smaller shift factor had been chosen.
Conclusions of the Update
- Coarse timers in Java are not much faster than the fine timers. This means that fine timers could be used everywhere, if only a fine real-time one were provided.
- On Linux, the nanoTime() performance in TSC mode is mostly satisfactory; in HPET mode it is not.
- On Windows, nanoTime() is four times slower than currentTimeMillis(), but still fast enough; the resolution, however, is far from ideal, for an unknown reason.
Comments are welcome below or on reddit.