Experiments in program optimisation

More optimisations of division-based hash: Karatsuba's method

Story: Conway's Game of Life

Tags: Java optimisation hashmap life

20 Oct 2015

In “Optimising division-based hash functions” we optimised a hash function working on long numbers and based on calculating a remainder. The idea was to replace a division operation with a multiplication by a reciprocal. In pseudo-code this replacement looked like this. The original hash function:

public int hashCode ()
{
    return (int) (v % 946840871);
}

The optimised hash function (for simplicity, we’ll limit today’s consideration to an unsigned version):

public int hashCode ()
{
    long div = (v ** 2614885092524444427L) >> 91;
    return (int) (v - div * 946840871);
}

This is pseudo-code rather than regular Java, because Java has no ** operation that multiplies two 64-bit numbers and produces a 128-bit result. We’ve emulated this operation using the following code:

static long mult_unsigned_hipart (long x, long y)
{
    long A = uhi (x);
    long B = ulo (x);
    long C = uhi (y);
    long D = ulo (y);

    long AC = A * C;
    long AD = A * D;
    long BC = B * C;
    long BD = B * D;

    long ADl_BCl_BDh = ulo (AD) + ulo (BC) + uhi (BD);
    return AC + uhi (AD) + uhi (BC) + uhi (ADl_BCl_BDh);
}

public int hashCode ()
{
    long div = mult_unsigned_hipart (v, 2614885092524444427L) >>> 27;
    return (int) (v - div * 946840871);
}
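
Here, uhi() and ulo() extract the high and low 32 bits of a 64-bit number as unsigned values. They were defined in the previous article; a minimal sketch, reconstructed from the way they are used here, looks like this:

static long uhi (long x)    // high 32 bits of x, as an unsigned value
{
    return x >>> 32;
}

static long ulo (long x)    // low 32 bits of x, as an unsigned value
{
    return x & 0xFFFFFFFFL;
}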

This implementation uses five multiplications: four for a 64-bit multiplication, and one for computing a remainder. It also uses several additions. This is what is added, as a picture:

|------- high 64 bits of xy -------|-------- low 64 bits of xy -------|
| hi (A*C)       | lo (A*C)        |                |                 |
|                | hi (A*D)        | lo (A*D)       |                 |
|                | hi (B*C)        | lo (B*C)       |                 |
|                |                 | hi (B*D)       | lo (B*D)        |

where

x = 2^32 A + B
y = 2^32 C + D

Since the 128-bit product is shifted right by 91, we only need its high 64 bits. The rather complex way of adding all these values together is caused by the possibility that adding A*D, B*C and hi (B*D) may overflow 64 bits (the sum can be a 65-bit number).

Since publishing this code I came across two more improvements, which I’ll show now.

The first idea: Specialisation

The mult_unsigned_hipart above is designed for arbitrary x and y values. We are, however, working with a special case, where we know the value of y: it is 2614885092524444427. Then we have

C = 608825379 = 0x2449F023 < 2^30
D = 744639243 = 0x2C624B0B < 2^30

In other words, both C and D are 30-bit numbers.

A 30-bit number multiplied by a 32-bit number produces at most a 62-bit number, which means

AD < 2^62
BC < 2^62
BD < 2^62
hi (BD) < 2^30
AD + BC + hi (BD) < 2^62 + 2^62 + 2^30 < 2^63 + 2^63 = 2^64

In short, the sum of two 62-bit numbers and the high part of a third 62-bit number fits in 64 bits, so there are no carry bits to worry about. As a result, the multiplication routine can be simplified like this:

static long mult_unsigned_hipart_special (long x, long y)
{
    long A = uhi (x);
    long B = ulo (x);
    long C = uhi (y);
    long D = ulo (y);
                   
    return A*C + uhi (A*D + B*C + uhi (B*D));
}

Note that we can’t ignore the term uhi (B*D): it may cause a carry when added to the lower part of A*D + B*C. Let’s assume

x = 4294967295 = 0xFFFFFFFF

Then

A = 0
B = 4294967295 = 0xFFFFFFFF
ulo (AD + BC) = 0xDBB60FDD
uhi (BD) = 0x2C624B0A
AC + uhi (AD + BC + uhi (BD)) = 608825379
AC + uhi (AD + BC) = 608825378
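
This can be reproduced with a few lines of Java (a quick check of my own, using the uhi/ulo helpers shown above; the comments give the expected output):

static void verify_example ()
{
    long x = 0xFFFFFFFFL;
    long y = 2614885092524444427L;
    long A = uhi (x), B = ulo (x), C = uhi (y), D = ulo (y);
    System.out.printf ("%X%n", ulo (A*D + B*C));              // DBB60FDD
    System.out.printf ("%X%n", uhi (B*D));                    // 2C624B0A
    System.out.println (A*C + uhi (A*D + B*C + uhi (B*D)));   // 608825379
    System.out.println (A*C + uhi (A*D + B*C));               // 608825378
}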

Right, but this is a one-bit difference, and in our particular use case the result of the multiplication is shifted right by 27 bits. Is it possible that the carry bit from adding the B*D portion propagates far enough to the left to affect the result of this shift? It seems highly unlikely, but a study shows that it can happen.

Let’s fix the same value of B as above (0xFFFFFFFF) and look for an A that makes the following true:

(ulo (AC) + uhi (AD + BC)) mod 2^27 = 2^27 − 1
ulo (AD + BC) + uhi (BD) ≥ 2^32

Finding a suitable value of A would have required some sophisticated mathematics just twenty years ago. Thanks to Moore’s law, these days we can run a simple brute-force search, testing all four billion candidates for A in seconds. The search finds five suitable values of A, the first of which is A = 755792740:

AC + uhi (AD + BC + uhi (BD)) = 0x662C42B48000000
AC + uhi (AD + BC) = 0x662C42B47FFFFFF
(AC + uhi (AD + BC + uhi (BD))) >>> 27 = 3428353385
(AC + uhi (AD + BC)) >>> 27 = 3428353384

The value w(A, B) = 3246105105149198335 (the long number with A in its high 32 bits and B in its low 32 bits) produces 0 as the result of the hashCode() listed above, and 946840871 if B*D is removed from mult_unsigned_hipart_special(). Our attempt to perform a 64-bit multiplication using only three 32-bit multiplications has failed.
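
For reference, the brute-force search might have looked like this (a sketch of my own, not the article’s actual test code; instead of encoding the two conditions above, it simply compares the two variants of the routine directly):

static void search ()
{
    final long B = 0xFFFFFFFFL;        // fixed, as above
    final long C = 608825379L;         // uhi (2614885092524444427)
    final long D = 744639243L;         // ulo (2614885092524444427)
    for (long A = 0; A <= 0xFFFFFFFFL; A++) {
        long full    = A*C + uhi (A*D + B*C + uhi (B*D));   // with B*D
        long reduced = A*C + uhi (A*D + B*C);               // without B*D
        if ((full >>> 27) != (reduced >>> 27))
            System.out.println (A);    // prints five values; 755792740 comes first
    }
}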

The second idea: Karatsuba’s multiplication

This doesn’t mean that such a calculation is impossible. The trick is called Karatsuba’s multiplication, after Anatoly Alexeevich Karatsuba, who discovered it in 1960 and published it in 1962. It is based on a simple observation:

(A + B)(C + D) = AC + BC + AD + BD

which implies

AD + BC = (A + B) (C + D) − (AC + BD)

Then we can modify the original formula as

(2^32 A + B)(2^32 C + D) = 2^64 AC + 2^32 (AD + BC) + BD =
2^64 AC + 2^32 ((A + B)(C + D) − (AC + BD)) + BD

Interestingly, Knuth (The Art of Computer Programming, Vol. 2, section 4.3.3.A) suggests a different method, based on subtraction rather than addition:

(2^32 A + B)(2^32 C + D) =
2^64 AC + 2^32 (AC + BD − (A − B)(C − D)) + BD =
(2^64 + 2^32) AC + 2^32 (A − B)(D − C) + (2^32 + 1) BD

(the first equality uses AD + BC = AC + BD − (A − B)(C − D); the second collects the 2^32 AC term with 2^64 AC and the 2^32 BD term with BD, and swaps C and D to flip the sign of the product)

He mentions Karatsuba’s method too, remarking that it is “similar, but more complex”. It seems to be the other way around, since Knuth’s method requires signed arithmetic, while Karatsuba’s can be performed entirely in unsigned numbers (the single subtraction it contains is guaranteed to produce a non-negative result).

Both formulae contain only three multiplications. They were not designed just to multiply 64-bit integers – the primary application is multiplication of very long numbers. The methods reduce a multiplication of two 2N-bit numbers to three N-bit multiplications (more precisely, two N-bit multiplications and an (N+1)-bit one) and some additions. The additions are performed in O(N) time and therefore do not affect the overall complexity of the algorithms, which is O(N^(log2 3)) ≈ O(N^1.585). An example of recursive Karatsuba multiplication can be seen in the version of java.math.BigInteger that comes with Java 8: look at the multiplyKaratsuba() function, which is used for numbers between 80 and 240 ints long (that is, between 2560 and 7680 bits). The class uses the Toom-Cook algorithm for longer numbers and the classic square-complexity grade school algorithm for shorter ones. The BigInteger in Java 7 always uses the grade school algorithm and is much slower on long inputs.
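
To illustrate the recursive reduction, here is a toy Karatsuba for non-negative numbers built on BigInteger (a sketch of my own, not the JDK’s multiplyKaratsuba(), which works on int arrays and uses a much higher threshold):

import java.math.BigInteger;

static BigInteger karatsuba (BigInteger x, BigInteger y)
{
    int n = Math.max (x.bitLength (), y.bitLength ());
    if (n <= 64) return x.multiply (y);             // base case: grade school

    int half = n / 2;
    BigInteger a = x.shiftRight (half);             // x = 2^half * a + b
    BigInteger b = x.subtract (a.shiftLeft (half));
    BigInteger c = y.shiftRight (half);             // y = 2^half * c + d
    BigInteger d = y.subtract (c.shiftLeft (half));

    BigInteger ac    = karatsuba (a, c);
    BigInteger bd    = karatsuba (b, d);
    BigInteger ad_bc = karatsuba (a.add (b), c.add (d)).subtract (ac.add (bd));

    return ac.shiftLeft (2 * half).add (ad_bc.shiftLeft (half)).add (bd);
}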

The original Karatsuba formula looks very attractive, but one obstacle makes it difficult to use for general 64-bit multiplication. Both A + B and C + D are 33-bit numbers, and their product can potentially occupy 66 bits (in Knuth’s version, A − B and D − C are 33-bit signed numbers, with a similar implication for their product). We could emulate a 33-bit multiplication, but that would require additional operations and, probably, branches.

However, the specialisation trick that we employed in the previous section is applicable again. We know that C and D are 30-bit numbers. Their sum

C + D = 0x2449F023 + 0x2C624B0B = 0x50AC3B2E

is a 31-bit number. And, as we know, A + B is a 33-bit number.

This means that (A + B)(C + D) is at most a 64-bit number, and the multiplication can be performed in standard Java longs. Similarly, AD + BC is at most a 63-bit number, and it won’t cause an overflow when added to the high part of BD. Here is the code:

static long mult_unsigned_hipart_special (long x, long y)
{
    long A = uhi (x);
    long B = ulo (x);
    long C = uhi (y);
    long D = ulo (y);
        
    long AC = A * C;
    long BD = B * D;
    long AD_BC = (A+B) * (C+D) - (AC + BD);
    return AC + uhi (AD_BC + uhi (BD));
}
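
A quick way to convince ourselves that this routine is correct is to compare it against BigInteger on random inputs, with y fixed to our constant (a sanity check of my own, not part of the article’s test suite):

import java.math.BigInteger;
import java.util.Random;

static void sanity_check ()
{
    final long y = 2614885092524444427L;
    final BigInteger Y = BigInteger.valueOf (y);
    Random random = new Random (0);
    for (int i = 0; i < 1_000_000; i++) {
        long x = random.nextLong ();
        // the true high 64 bits of the unsigned 128-bit product x * y
        long expected = new BigInteger (Long.toUnsignedString (x))
                        .multiply (Y).shiftRight (64).longValue ();
        if (mult_unsigned_hipart_special (x, y) != expected)
            throw new AssertionError ("Mismatch at x = " + x);
    }
}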

Giving it a try

Let’s add two more classes (LongPoint62 and LongPoint63) to our tests (the entire code is here) and run a micro-benchmark (HashTime):

Class name    Comment                        Time, Java 7   Time, Java 8
LongPoint6    Original version                        578           1655
LongPoint60   Optimised, unsigned                     640            610
LongPoint61   Optimised, signed                       729            727
LongPoint62   Optimised and specialised               595            547
LongPoint63   Karatsuba's multiplication              622            565
NullPoint     Null function (does nothing)            532            455

Both new versions are faster than the first optimised one; on Java 7, LongPoint62 approaches the speed of the original version. However, we’ve got a surprise: Karatsuba’s version is slower than the plain specialised one. For anyone brought up with the idea that multiplication is an expensive operation to be avoided at all costs, this looks like nonsense. Let’s look at the assembly outputs. This is the disassembly of LongPoint62:

0x00007f101930c610: mov    r11,r10
0x00007f101930c613: shr    r11,0x20
0x00007f101930c617: mov    r8d,r10d
0x00007f101930c61a: imul   r9,r11,0x2449f023    ; C = 608825379
0x00007f101930c621: imul   rcx,r8,0x2c624b0b    ; D = 744639243
0x00007f101930c628: imul   r8,r8,0x2449f023     ; C = 608825379
0x00007f101930c62f: shr    rcx,0x20
0x00007f101930c633: imul   r11,r11,0x2c624b0b   ; D = 744639243
0x00007f101930c63a: add    r11,r8
0x00007f101930c63d: add    r11,rcx
0x00007f101930c640: shr    r11,0x20
0x00007f101930c644: add    r9,r11
0x00007f101930c647: sar    r9,0x1b
0x00007f101930c64b: imul   r11,r9,0x386fa527    ; 946840871
0x00007f101930c652: sub    r10,r11
0x00007f101930c655: mov    eax,r10d

And this is the code for LongPoint63:

0x00007fac89309750: mov    r11,r10
0x00007fac89309753: shr    r11,0x20
0x00007fac89309757: mov    r8d,r10d
0x00007fac8930975a: mov    r9,r11
0x00007fac8930975d: add    r9,r8
0x00007fac89309760: imul   r11,r11,0x2449f023   ; C = 608825379
0x00007fac89309767: imul   r9,r9,0x50ac3b2e     ; C + D
0x00007fac8930976e: imul   r8,r8,0x2c624b0b     ; D = 744639243
0x00007fac89309775: mov    rcx,r11
0x00007fac89309778: add    rcx,r8
0x00007fac8930977b: sub    r9,rcx
0x00007fac8930977e: shr    r8,0x20
0x00007fac89309782: add    r9,r8
0x00007fac89309785: shr    r9,0x20
0x00007fac89309789: add    r9,r11
0x00007fac8930978c: shr    r9,0x1b
0x00007fac89309790: imul   r11,r9,0x386fa527    ; 946840871
0x00007fac89309797: sub    r10,r11
0x00007fac8930979a: mov    eax,r10d

The first snippet has more multiplications (five instead of four), but fewer instructions in total (16 vs 19). If you look at the total clock count alone, three extra ordinary instructions shouldn’t be able to compensate for a whole multiplication. However, according to Agner Fog, one must look at two parameters: latency and throughput. While multiplication has a high latency (3 cycles on Sandy Bridge), it also has a high throughput (one per cycle), which means that the processor can start (and finish) another multiplication every cycle, provided that the arguments are ready. That’s the case in both snippets above – the multiplications do not depend on each other. As a result, the four independent imul instructions in the first snippet can be executed in 6 cycles, while the three in the second one take 5. This one-cycle difference is totally neutralised by the three extra instructions in the second snippet, especially taking into account that the shifts and additions that follow the multiplications form a perfect dependency chain.

The full test

Now we are ready to run the full test, with our best hopes placed on LongPoint62. Here are the results for the new classes, together with the previous division-based ones:

Class name    Comment                        Time, Java 7   Time, Java 8
LongPoint6    Original version                       2032           1650
LongPoint60   Optimised, unsigned                    2247           1877
LongPoint61   Optimised, signed                      2368           1885
LongPoint62   Optimised and specialised              2191           1726
LongPoint63   Karatsuba's multiplication             2239           1776

As expected, both new versions are faster than the first optimised unsigned version, with Karatsuba’s the slower of the two. However, the original division-based version (LongPoint6) is still the fastest on both Java 7 and Java 8. Our optimisation attempt has failed again.

Conclusions

- Specialising a general 64-bit multiplication for a known constant factor lets us drop the carry handling and saves instructions.
- Karatsuba’s trick reduces the number of multiplications from four to three, but on a modern pipelined processor it does not pay off: independent multiplications are cheap, while the extra additions and subtractions make the code longer.
- Despite all the optimisations, the original division-based version is still the fastest in the full test.

Coming soon

I’m going to address this mystery in my next article.
