In “Optimising division-based hash functions” we optimised a hash function that works on long numbers and is based on calculating a remainder. The idea was to replace the division operation with a multiplication by a reciprocal. In pseudocode, this replacement looked like this. The original hash function:
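(a sketch: the field name `v` is assumed; the divisor 946840871 is the one this series is built around)

```java
public int hashCode ()
{
    // one 64-bit division per call
    return (int) (v % 946840871);
}
```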
The optimised hash function (for simplicity, we’ll limit today’s consideration to an unsigned version):
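(also a sketch; the reciprocal 2614885092524444427, roughly 2^{91} / 946840871, and the shift by 91 are the values discussed below)

```java
public int hashCode ()
{
    // pseudocode: ** multiplies two 64-bit numbers into a 128-bit result
    long div = (v ** 2614885092524444427) >>> 91;
    return (int) (v - div * 946840871);
}
```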
This is pseudocode and not regular Java, because Java does not have an operation `**`, which would multiply two 64-bit numbers and produce a 128-bit result.
We’ve emulated this operation using the following code:
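(a sketch, written in terms of the uhi/ulo notation, for the unsigned high and low 32-bit halves, that is used throughout this article)

```java
// unsigned high and low 32-bit halves of a 64-bit value
private static long uhi (long x) { return x >>> 32; }
private static long ulo (long x) { return x & 0xFFFFFFFFL; }

// high 64 bits of the unsigned 128-bit product x*y
private static long mult_unsigned_hipart (long x, long y)
{
    long A = uhi (x), B = ulo (x);
    long C = uhi (y), D = ulo (y);
    long AC = A * C;
    long AD = A * D;
    long BC = B * C;
    long BD = B * D;
    // sum the middle 32-bit columns first, collecting their carries
    long mid = ulo (AD) + ulo (BC) + uhi (BD);
    return AC + uhi (AD) + uhi (BC) + uhi (mid);
}
```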
This implementation uses five multiplications: four for the 64-bit multiplication, and one for computing the remainder. It also uses several additions. This is what is added, as a picture:
    |    high 64 bits of x*y    |    low 64 bits of x*y     |
    |  hi (A*C)  |  lo (A*C)    |             |             |
    |            |  hi (A*D)    |  lo (A*D)   |             |
    |            |  hi (B*C)    |  lo (B*C)   |             |
    |            |              |  hi (B*D)   |  lo (B*D)   |
where

x = 2^{32}A + B
y = 2^{32}C + D
Since the result of the 64-bit multiplication is shifted right by 91, we only need the high part of the product.
The rather complex way of adding all these values together is caused by the possibility that adding A*D, B*C and hi(B*D) may produce a 65-bit (or even a 66-bit) result.
Since publishing this code I came across two more improvements, which I’ll show now.
The first idea: Specialisation
The `mult_unsigned_hipart` above is designed for arbitrary `x` and `y` values. We are, however, working with a special case, where we know the value of `y`: it is 2614885092524444427. Then we have

C = 608825379 = 0x2449F023 < 2^{30}
D = 744639243 = 0x2C624B0B < 2^{30}

In other words, both C and D are 30-bit numbers.
A 30-bit number multiplied by a 32-bit number produces a 62-bit number, which means

AD < 2^{62}
BC < 2^{62}
BD < 2^{62}
hi(BD) < 2^{30}
AD + BC + hi(BD) < 2^{62} + 2^{62} + 2^{30} < 2^{63} + 2^{63} = 2^{64}
In short, the sum of two 62-bit numbers and the high part of a 62-bit number fits into 64 bits, so there are no carry bits to worry about. As a result, the multiplication routine can be simplified like this:
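(a sketch; C = 608825379 and D = 744639243 are the high and low halves of 2614885092524444427)

```java
private static final long C = 608825379L;  // uhi (2614885092524444427)
private static final long D = 744639243L;  // ulo (2614885092524444427)

// high 64 bits of the unsigned product x * 2614885092524444427
private static long mult_unsigned_hipart_special (long x)
{
    long A = x >>> 32;
    long B = x & 0xFFFFFFFFL;
    // AD + BC + uhi (BD) < 2^64, so this sum cannot overflow
    return A * C + ((A * D + B * C + ((B * D) >>> 32)) >>> 32);
}
```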
Note that we can't ignore the term uhi (B*D): it may cause a carry when added to the lower part of A*D + B*C. Let's assume

A = 0
B = 4294967295 = 0xFFFFFFFF

Then

ulo (AD + BC) = 0xDBB60FDD
uhi (BD) = 0x2C624B0A
AC + uhi (AD + BC + uhi (BD)) = 608825379
AC + uhi (AD + BC) = 608825378
Right, but this is a one-bit difference, and in our particular use case the result of the multiplication is shifted right by 27 bits. Is it possible that one carry bit from adding the B*D portion propagates far enough to the left to affect the result of this shift? It seems highly unlikely, but a closer study shows that it can happen.
Let's fix the same value of B as above (0xFFFFFFFF) and look for an A that makes the carry happen and lets it propagate all the way through the lowest 27 bits of the high part, i.e. makes both of the following true:

ulo (AD + BC) + uhi (BD) ≥ 2^{32}
(AC + uhi (AD + BC)) mod 2^{27} = 2^{27} − 1
Finding a suitable value of A would have required some sophisticated mathematics just twenty years ago. Thanks to Moore's law, these days we can run a simple brute-force search: all four billion candidates for A can be tested in seconds. The search finds five suitable values of A, the first of which is A = 755792740.
AC + uhi (AD + BC) = 0x662C42B47FFFFFF
(AC + uhi (AD + BC + uhi (BD))) >>> 27 = 3428353385
(AC + uhi (AD + BC)) >>> 27 = 3428353384
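The search itself is a simple loop; here is a sketch (the loop structure is illustrative, following the conditions formulated above):

```java
// brute-force search: find all values of A for which the carry from uhi (BD)
// propagates through the low 27 bits and changes the result of the shift
private static void search ()
{
    final long C = 608825379L, D = 744639243L;
    final long B = 0xFFFFFFFFL;
    final long BD_HI = (B * D) >>> 32;
    final long BC = B * C;
    for (long A = 0; A <= 0xFFFFFFFFL; A++) {
        long mid = A * D + BC;            // AD + BC, fits in 63 bits
        long hi  = A * C + (mid >>> 32);  // AC + uhi (AD + BC)
        if ((mid & 0xFFFFFFFFL) + BD_HI >= 0x100000000L   // the carry happens
            && (hi & 0x7FFFFFFL) == 0x7FFFFFFL)           // low 27 bits are all ones
            System.out.println (A);
    }
}
```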
The value w (A, B) = 3246105105149198335 produces 0 as a result of the `hashCode()` listed above, and 946840871 if the B*D term is removed from `mult_unsigned_hipart_special()`.
Our attempt to perform a 64-bit multiplication using only three 32-bit multiplications has failed.
The second idea: Karatsuba’s multiplication
This doesn't mean that such a calculation is impossible. The trick, called Karatsuba's multiplication after Anatoly Alexeevich Karatsuba, who discovered it in 1960 and published it in 1962, is based on a simple observation:

(A + B) (C + D) = AC + AD + BC + BD

which implies

AD + BC = (A + B) (C + D) − (AC + BD)

Then we can modify the original formula

xy = 2^{64}AC + 2^{32}(AD + BC) + BD

into

xy = 2^{64}AC + 2^{32}((A + B) (C + D) − (AC + BD)) + BD
Interestingly, Knuth (The Art of Computer Programming, vol. 2, section 4.3.3.A) suggests a different method, based on a subtraction rather than an addition:

xy = 2^{64}AC + 2^{32}(AC + BD − (A − B) (C − D)) + BD =
     (2^{64} + 2^{32})AC + 2^{32}(A − B) (D − C) + (2^{32} + 1) BD
He mentions Karatsuba's method too, with a remark that it is “similar, but more complex”. It seems to be the other way around, since Knuth's method requires signed arithmetic, while Karatsuba's can be performed fully in unsigned numbers (the single subtraction that it contains is guaranteed to produce a non-negative result).
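Either identity is easy to verify mechanically. Here is a quick `BigInteger` sanity check (an illustrative test, not part of the benchmark code):

```java
import java.math.BigInteger;
import java.util.Random;

// checks that both expressions for AD + BC hold for random 32-bit halves
public class MiddleTermCheck
{
    public static void main (String [] args)
    {
        Random r = new Random ();
        for (int i = 0; i < 100000; i++) {
            BigInteger A = BigInteger.valueOf (r.nextInt () & 0xFFFFFFFFL);
            BigInteger B = BigInteger.valueOf (r.nextInt () & 0xFFFFFFFFL);
            BigInteger C = BigInteger.valueOf (r.nextInt () & 0xFFFFFFFFL);
            BigInteger D = BigInteger.valueOf (r.nextInt () & 0xFFFFFFFFL);
            BigInteger adbc = A.multiply (D).add (B.multiply (C));
            // Karatsuba: AD + BC = (A + B)(C + D) - (AC + BD)
            BigInteger karatsuba = A.add (B).multiply (C.add (D))
                                    .subtract (A.multiply (C).add (B.multiply (D)));
            // Knuth: AD + BC = AC + BD - (A - B)(C - D)
            BigInteger knuth = A.multiply (C).add (B.multiply (D))
                                .subtract (A.subtract (B).multiply (C.subtract (D)));
            if (! adbc.equals (karatsuba) || ! adbc.equals (knuth))
                throw new AssertionError ("identity failed");
        }
        System.out.println ("OK");
    }
}
```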
Both formulae contain only three multiplications. They were not designed just to multiply 64-bit integers: the primary application is multiplication of very long numbers. The methods reduce a multiplication of two 2N-bit numbers to three N-bit multiplications (more precisely, two N-bit multiplications and an (N+1)-bit one) and some additions. The additions are performed in O(N) time and therefore do not affect the overall complexity of the algorithm, which is O(N^{log_2 3}). An example of recursive Karatsuba multiplication can be seen in the version of `java.math.BigInteger` that comes with Java 8. Look at the `multiplyKaratsuba()` method, which is used for numbers between 80 and 240 `int`s long (that is, between 2560 and 7680 bits). The class uses the Toom-Cook algorithm for longer numbers and the classic square-complexity grade-school algorithm for shorter ones. The `BigInteger` in Java 7 always uses the grade-school algorithm and works much slower on long inputs.
The original Karatsuba formula looks very attractive, but one obstacle makes it difficult to use for general 64-bit multiplication. Both A + B and C + D are 33-bit numbers, and their product can potentially occupy 66 bits (in Knuth's version, A − B and D − C are 33-bit signed numbers, with a similar implication for their product). We can emulate a 33-bit multiplication, but that will require additional operations and, probably, branches.
However, the specialisation trick that we employed in the previous section is applicable again. We know that C and D are 30-bit numbers, so their sum is a 31-bit number, while A + B is, as we know, a 33-bit number. This means that (A + B) (C + D) is at most a 64-bit number, and the multiplication can be performed in standard Java `long`s. Similarly, AD + BC is at most a 63-bit number, and it won't cause an overflow when added to the high part of BD.
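Here is the code, again as a sketch that reuses the hard-coded `C` and `D` from the previous section:

```java
private static final long C = 608825379L;  // uhi (2614885092524444427)
private static final long D = 744639243L;  // ulo (2614885092524444427)

// high 64 bits of x * 2614885092524444427, using Karatsuba's trick:
// three 32-bit multiplications instead of four
private static long mult_unsigned_hipart_karatsuba (long x)
{
    long A = x >>> 32;
    long B = x & 0xFFFFFFFFL;
    long AC = A * C;
    long BD = B * D;
    // (A + B)(C + D) < 2^64; computed modulo 2^64 in a long, which is enough,
    // since the true value of AD + BC fits in 63 bits
    long AD_BC = (A + B) * (C + D) - (AC + BD);
    // AD + BC + uhi (BD) < 2^64: no overflow, as shown above
    return AC + ((AD_BC + (BD >>> 32)) >>> 32);
}
```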
Giving it a try
Let's add two more classes (`LongPoint62` and `LongPoint63`) to our tests (the entire code is here). Let's run a microbenchmark (`HashTime`):
| Class name    | Comment                       | Time, Java 7, ms | Time, Java 8, ms |
|---------------|-------------------------------|-----------------:|-----------------:|
| `LongPoint6`  | Original version              |              578 |             1655 |
| `LongPoint60` | Optimised, unsigned           |              640 |              610 |
| `LongPoint61` | Optimised, signed             |              729 |              727 |
| `LongPoint62` | Optimised and specialised     |              595 |              547 |
| `LongPoint63` | Karatsuba's multiplication    |              622 |              565 |
| `NullPoint`   | Null function (does nothing)  |              532 |              455 |
Both new versions are faster than the original optimised one; `LongPoint62` is approaching the speed of the original version on Java 7. However, we've got a surprise: Karatsuba's version is slower than the first specialised one. For anyone who was brought up with the idea that multiplication is an expensive operation to be avoided at all costs, this is nonsense. Let's look at the assembly outputs. This is the disassembly of `LongPoint62`:
And this is the code for `LongPoint63`:
The first code has more multiplications (five instead of four), but fewer instructions in total (16 vs 19). Three ordinary instructions can't compensate for one multiplication, if you look at the total clock count alone. However, according to Agner Fog, one must look at two parameters: instruction throughput and latency. While multiplication has a high latency (3 cycles on Sandy Bridge), it also has a high throughput (one multiplication per cycle), which means that the processor can start (and finish) another multiplication every cycle, provided that the arguments are ready. That's the case in both snippets above: the multiplications do not depend on each other. As a result, the four independent imul instructions in the first case can be executed in 6 cycles, while the three in the second one take 5. This one-cycle difference is totally neutralised by the extra three instructions in the second case, especially taking into account that the shifts and additions that follow the multiplications form a perfect dependency chain.
The full test
Now we are ready to run the full test, our best hopes now placed on `LongPoint62`. Here are the results for our new classes, together with the previous division-based ones:
| Class name    | Comment                       | Time, Java 7, ms | Time, Java 8, ms |
|---------------|-------------------------------|-----------------:|-----------------:|
| `LongPoint6`  | Original version              |             2032 |             1650 |
| `LongPoint60` | Optimised, unsigned           |             2247 |             1877 |
| `LongPoint61` | Optimised, signed             |             2368 |             1885 |
| `LongPoint62` | Optimised and specialised     |             2191 |             1726 |
| `LongPoint63` | Karatsuba's multiplication    |             2239 |             1776 |
As expected, the new versions are both faster than the first optimised unsigned version, Karatsuba's being the slower of the two. However, the original division-based version (`LongPoint6`) is still the fastest on both Java 7 and Java 8. Our optimisation attempt has failed again.
Conclusions

- It is possible to perform a double-precision multiplication using three single-precision ones.
- However, this technique, while useful for very long numbers, does not help improve performance on very short ones.
- In general, unlike division, multiplication isn't as expensive as it used to be, and can be used without much hesitation. The overheads of avoiding it may exceed its own cost.
- Specialised functions (those designed to handle only a subset of possible inputs) can be made faster than generic ones. Obviously, there is a price: the entire solution becomes more error-prone and difficult to maintain. This is why this approach must be used with care and only when absolutely necessary. For instance, in our case reducing the execution time from 2247 ms to 2191 ms (by 2.5%) hardly justifies the costs.
- The original solution (using a normal division operation) performed very badly as a microbenchmark on Java 8 and could be improved a lot. However, all attempts to optimise it in the full test failed miserably. This is a mystery.
Coming soon
I’m going to address this mystery in my next article.