In “Optimising division-based hash functions” we optimised a hash function working on long
numbers and based on calculating a remainder.
The idea was to replace a division operation with a multiplication by a reciprocal. In pseudo-code this replacement looked like this. The original hash function:
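(A sketch: `key` stands for the `long` value being hashed, and 946840871 is the divisor used throughout this article.)

```java
// Division-based version: the hash is simply the remainder of the division
public int hashCode ()
{
    return (int) (key % 946840871);
}
```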
The optimised hash function (for simplicity, we’ll limit today’s consideration to an unsigned version):
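(Again a sketch; `**` denotes the 64×64 → 128-bit multiplication, and the constants are the ones discussed below.)

```java
// Pseudo-code: ** is a 64x64 -> 128-bit multiplication, not valid Java
public int hashCode ()
{
    long div = (key ** 2614885092524444427L) >>> 91;  // key / 946840871 via the reciprocal
    return (int) (key - div * 946840871);             // key % 946840871
}
```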
This is pseudo-code and not regular Java, because Java does not have the operation `**`, which multiplies two 64-bit numbers and produces a 128-bit result. We’ve emulated this operation using the following code:
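(A sketch of that emulation; the helpers `uhi`/`ulo` and the field `key` are stand-ins, and `mult_unsigned_hipart` is the routine referred to below.)

```java
// Unsigned high and low 32-bit halves of a 64-bit number
private static long uhi (long x) { return x >>> 32; }
private static long ulo (long x) { return x & 0xFFFFFFFFL; }

// High 64 bits of the unsigned 128-bit product x*y,
// assembled from four 32x32 -> 64-bit partial products
private static long mult_unsigned_hipart (long x, long y)
{
    long A = uhi (x), B = ulo (x);
    long C = uhi (y), D = ulo (y);

    long AC = A * C;
    long AD = A * D;
    long BC = B * C;
    long BD = B * D;

    // lo(AD) + lo(BC) + hi(BD) may not fit in 32 bits, so its carry is extracted separately
    long carry = uhi (ulo (AD) + ulo (BC) + uhi (BD));
    return AC + uhi (AD) + uhi (BC) + carry;
}

public int hashCode ()
{
    // the high 64 bits of the product, shifted right by 27 more bits: a total shift of 91
    long div = mult_unsigned_hipart (key, 2614885092524444427L) >>> 27;
    return (int) (key - div * 946840871);
}
```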
This implementation uses five multiplications: four for a 64-bit multiplication, and one for computing a remainder. It also uses several additions. This is what is added, as a picture:
| high 64 bits of xy | | low 64 bits of xy | |
|---|---|---|---|
| hi (A\*C) | lo (A\*C) | | |
| | hi (A\*D) | lo (A\*D) | |
| | hi (B\*C) | lo (B\*C) | |
| | | hi (B\*D) | lo (B\*D) |
where

x = 2<sup>32</sup>A + B
y = 2<sup>32</sup>C + D
Since the result of the 64-bit multiplication is shifted right by 91 bits, we only need the high part of the product.
The rather complex way of adding all these values together is caused by the possibility that adding `A*D`, `B*C` and `hi(B*D)` may produce a 65-bit (or even a 66-bit) result.
Since publishing this code I came across two more improvements, which I’ll show now.
The first idea: Specialisation
The `mult_unsigned_hipart` above is designed for arbitrary `x` and `y` values. We are, however, working with a special case, where we know the value of `y`: it is 2614885092524444427. Then we have
C = 608825379 = 0x2449F023 < 2<sup>30</sup>
D = 744639243 = 0x2C624B0B < 2<sup>30</sup>
In other words, both C and D are 30-bit numbers.
A 30-bit number multiplied by a 32-bit number produces a 62-bit number, which means

AD < 2<sup>62</sup>
BC < 2<sup>62</sup>
BD < 2<sup>62</sup>
hi(BD) < 2<sup>30</sup>
AD + BC + hi(BD) < 2<sup>62</sup> + 2<sup>62</sup> + 2<sup>30</sup> < 2<sup>63</sup> + 2<sup>63</sup> = 2<sup>64</sup>
In short, the sum of two 62-bit numbers and the high part of a 62-bit number is at most a 64-bit number, so there are no carry bits to worry about. As a result, the multiplication routine can be simplified like this:
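(A sketch; `mult_unsigned_hipart_special` is the routine referred to below, and `uhi`/`ulo` are the helpers from above.)

```java
// Specialised for our fixed multiplier: C and D are 30-bit numbers, so
// AD + BC + uhi(BD) cannot overflow 64 bits and no carry handling is needed
private static long mult_unsigned_hipart_special (long x, long y)
{
    long A = uhi (x), B = ulo (x);
    long C = uhi (y), D = ulo (y);
    return A * C + uhi (A * D + B * C + uhi (B * D));
}
```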
Note that we can’t ignore the term `uhi (B*D)`: it may cause a carry when added to the lower part of `A*D + B*C`. Let’s assume

A = 0
B = 4294967295 = 0xFFFFFFFF

Then

ulo (AD + BC) = 0xDBB60FDD
uhi (BD) = 0x2C624B0A
AC + uhi (AD + BC + uhi (BD)) = 608825379
AC + uhi (AD + BC) = 608825378
Right, but this is a one-bit difference, and in our particular use case the result of the multiplication is shifted right by 27 bits. Is it possible that one carry bit from adding the BD portion gets propagated far enough to the left to affect the result of this shift? It seems highly unlikely, but it turns out that it can happen.
Let’s fix the same value of B as above (0xFFFFFFFF) and look for A that makes the following true:

ulo (AD + BC) + uhi (BD) ≥ 2<sup>32</sup>

while the 27 lowest bits of AC + uhi (AD + BC) are all ones – only then does the extra carry bit propagate far enough to change the result of the shift.
Finding a suitable value of A would have required some sophisticated mathematics just twenty years ago. Thanks to Moore’s law, these days we can run a simple brute-force search – we can test all four billion candidates for A in seconds. The search finds five suitable values of A, the first of which is A = 755792740.
AC + uhi (AD + BC) = 0x662C42B47FFFFFF
(AC + uhi (AD + BC + uhi (BD))) >>> 27 = 3428353385
(AC + uhi (AD + BC)) >>> 27 = 3428353384
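The search itself can be as simple as the following sketch (the names are arbitrary):

```java
// Brute-force search over all 32-bit values of A (B is fixed at 0xFFFFFFFF):
// report every A for which dropping uhi(BD) changes the result of the 27-bit shift
public class CarrySearch
{
    static final long C = 608825379L;   // uhi (2614885092524444427)
    static final long D = 744639243L;   // ulo (2614885092524444427)

    static long uhi (long x) { return x >>> 32; }

    public static void main (String [] args)
    {
        long B = 0xFFFFFFFFL;
        for (long A = 0; A <= 0xFFFFFFFFL; A++) {
            long AC = A * C, AD = A * D, BC = B * C, BD = B * D;
            long exact    = (AC + uhi (AD + BC + uhi (BD))) >>> 27;
            long shortcut = (AC + uhi (AD + BC)) >>> 27;
            if (exact != shortcut) System.out.println (A);
        }
    }
}
```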
The value w(A,B) = 2<sup>32</sup>A + B = 3246105105149198335 produces 0 as a result of the `hashCode()` listed above and 946840871 if `B*D` is removed from `mult_unsigned_hipart_special()`.
Our attempt to perform a 64-bit multiplication using only three 32-bit multiplications has failed.
The second idea: Karatsuba’s multiplication
This doesn’t mean that such a calculation is impossible. The trick, called Karatsuba’s multiplication after Anatoly Alexeevich Karatsuba, who discovered it in 1960 and published it in 1962, is based on a simple observation:

(A + B) (C + D) = AC + AD + BC + BD

which implies

AD + BC = (A + B) (C + D) − (AC + BD)
Then we can modify the original formula

xy = 2<sup>64</sup>AC + 2<sup>32</sup>(AD + BC) + BD

as

xy = 2<sup>64</sup>AC + 2<sup>32</sup>((A + B) (C + D) − (AC + BD)) + BD
Interestingly, Knuth (The Art of Computer Programming, vol. 2, 4.3.3.A) suggests a different method, based on a subtraction rather than an addition:

2<sup>64</sup>AC + 2<sup>32</sup>(AC + BD − (A − B) (C − D)) + BD =
(2<sup>64</sup> + 2<sup>32</sup>)AC + 2<sup>32</sup>(A − B) (D − C) + (2<sup>32</sup> + 1) BD
He mentions Karatsuba’s method too, with a remark that it is “similar, but more complex”. It seems to be the other way around, since Knuth’s method requires signed arithmetic, while Karatsuba’s can be done entirely in unsigned numbers (the single subtraction that it contains is guaranteed to produce a non-negative result).
Both formulae contain only three multiplications. They were not designed just to multiply 64-bit integers – their primary application is multiplication of very long numbers. The methods reduce multiplication of two 2N-bit numbers to three N-bit multiplications (more precisely, two N-bit multiplications and an (N+1)-bit one) and some additions. The additions are performed in O(N) time and therefore do not affect the overall complexity of the algorithm, which is O(N<sup>log<sub>2</sub>3</sup>). An example of recursive Karatsuba multiplication can be seen in the version of `java.math.BigInteger` that comes with Java 8. Look at the `multiplyKaratsuba()` method, which is used for numbers between 80 and 240 `int`s long (that is, between 2560 and 7680 bits). The class uses the Toom-Cook algorithm for longer numbers and the classic quadratic grade-school algorithm for shorter ones. The `BigInteger` in Java 7 always uses the grade-school algorithm and works much slower when the inputs are long.
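For illustration, the recursion fits in a few lines on top of `BigInteger` (a toy sketch for non-negative numbers, not the library’s actual implementation):

```java
import java.math.BigInteger;

public class ToyKaratsuba
{
    // Toy recursive Karatsuba multiplication of non-negative BigIntegers,
    // shown only to illustrate the recursion
    static BigInteger karatsuba (BigInteger x, BigInteger y)
    {
        int n = Math.max (x.bitLength (), y.bitLength ());
        if (n <= 64) return x.multiply (y);              // small inputs: ordinary multiplication

        int half = n / 2;
        BigInteger a = x.shiftRight (half);              // x = a * 2^half + b
        BigInteger b = x.subtract (a.shiftLeft (half));
        BigInteger c = y.shiftRight (half);              // y = c * 2^half + d
        BigInteger d = y.subtract (c.shiftLeft (half));

        BigInteger ac = karatsuba (a, c);
        BigInteger bd = karatsuba (b, d);
        BigInteger ad_bc = karatsuba (a.add (b), c.add (d)).subtract (ac).subtract (bd);

        return ac.shiftLeft (2 * half).add (ad_bc.shiftLeft (half)).add (bd);
    }
}
```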
The original Karatsuba formula looks very attractive, but one obstacle makes it difficult to use for general 64-bit multiplication. Both A + B and C + D are 33-bit numbers, and their product can potentially occupy 66 bits (in Knuth’s version A − B and D − C are 33-bit signed numbers, with a similar implication for their product). We can emulate a 33-bit multiplication, but that will require additional operations and, probably, branches.
However, the specialisation trick that we employed in the previous section is applicable again. We know that C and D are 30-bit numbers, so their sum is a 31-bit number, while A + B is, as before, a 33-bit number. This means that (A + B) (C + D) is at most a 64-bit number, and the multiplication can be performed in standard Java `long`s. Similarly, AD + BC is at most a 63-bit number, and it won’t cause an overflow when added to the high part of BD. Here is the code:
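(A sketch; the routine name is made up, and `uhi`/`ulo` are the same helpers as before.)

```java
// Karatsuba-style high part: three multiplications instead of four.
// Valid for our special y: C + D fits in 31 bits, so (A + B) (C + D) fits in 64 bits
private static long mult_unsigned_hipart_karatsuba (long x, long y)
{
    long A = uhi (x), B = ulo (x);
    long C = uhi (y), D = ulo (y);
    long AC = A * C;
    long BD = B * D;
    long AD_BC = (A + B) * (C + D) - (AC + BD);   // = AD + BC, at most a 63-bit number
    return AC + uhi (AD_BC + uhi (BD));
}
```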
Giving it a try
Let’s add two more classes (`LongPoint62` and `LongPoint63`) to our tests (the entire code is here). Let’s run a micro-benchmark (`HashTime`):
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 578 | 1655 |
LongPoint60 | Optimised, unsigned | 640 | 610 |
LongPoint61 | Optimised, signed | 729 | 727 |
LongPoint62 | Optimised and specialised | 595 | 547 |
LongPoint63 | Karatsuba's multiplication | 622 | 565 |
NullPoint | Null function (does nothing) | 532 | 455 |
Both new versions are faster than the original optimised one; `LongPoint62` is approaching the speed of the original version on Java 7. However, we’ve got a surprise: Karatsuba’s version is slower than the first specialised one. For anyone who was brought up with the idea that multiplication is an expensive operation to be avoided at all costs, this sounds like nonsense. Let’s look at the assembly outputs. This is the disassembly of `LongPoint62`:
And this is the code for `LongPoint63`:
The first code has more multiplications (five instead of four), but fewer instructions in total (16 vs 19). Three ordinary instructions can’t compensate for one multiplication if you look at the total clock count alone. However, according to Agner Fog, one must look at two parameters: instruction throughput and latency. While multiplication has a high latency (3 cycles on Sandy Bridge), it also has a high throughput (one per cycle), which means that the processor can start (and finish) another multiplication every cycle, provided that the arguments are ready. That’s the case in both snippets above – the multiplications do not depend on each other. As a result, the four independent imul instructions in the first case can be executed in 6 cycles, while the three in the second one take 5. This one-cycle difference is totally neutralised by the extra three instructions in the second case, especially taking into account that the shifts and additions that follow the multiplications form a perfect dependency chain.
The full test
Now we are ready to run the full test, our best hopes now placed on `LongPoint62`. Here are the results for our new classes, together with the previous, division-based ones:
Class name | Comment | Time, Java 7 | Time, Java 8 |
---|---|---|---|
LongPoint6 | Original version | 2032 | 1650 |
LongPoint60 | Optimised, unsigned | 2247 | 1877 |
LongPoint61 | Optimised, signed | 2368 | 1885 |
LongPoint62 | Optimised and specialised | 2191 | 1726 |
LongPoint63 | Karatsuba's multiplication | 2239 | 1776 |
As expected, the new versions are both faster than the first optimised unsigned version, Karatsuba’s being the slower of the two. However, the original division-based one (`LongPoint6`) is still the fastest on both Java 7 and Java 8. Our optimisation attempt has failed again.
Conclusions
- It is possible to perform a double-precision multiplication using three single-precision ones.
- However, this technique, while useful for very long numbers, does not help improve performance on very short ones.
- In general, unlike division, multiplication isn’t as expensive as it used to be, and can be used without much hesitation. The overhead of avoiding it may exceed its own cost.
- Specialised functions (those designed to handle only a subset of possible inputs) can be made faster than the generic ones. Obviously, there is a price: the entire solution becomes more error-prone and difficult to maintain. This is why this approach must be used with care and only when absolutely necessary. For instance, in our case reducing execution time from 2247 ms to 2191 ms (by 2.5%) hardly justifies the costs.
- The original solution (using a normal division operation) performed very badly in a micro-benchmark on Java 8, and could be improved a lot. However, all attempts to optimise it failed miserably in the full test. This is a mystery.
Coming soon
I’m going to address this mystery in my next article.