In the previous article (“Implementing Peano arithmetic”) we were dealing with a dramatic situation: the hackers from the Anti-Numeric League damaged our Java compiler and made it impossible to use numbers. We needed numbers urgently for a student’s appointment, that’s why we had to implement arithmetic from scratch. We tried to implement the Peano arithmetic directly, using sets as numbers, but that was way too slow (although, surprisingly, worked when the numbers involved were small, and even produced some results). A sequence of improvements brought us to the final implementation, where all the numbers produced during program execution were collected into one doubly-linked list, in their natural order.
The most surprising part was that it worked and even was practically usable. We tested it on two problems: printing all the Pythagorean triples and finding all the perfect numbers, both within the given range. The speed achieved was about 1000 times less than on a native Java code, and that speed was achieved only after some manual optimisation of the user code that imitated the work of the compiler.
Here are the achieved times, in seconds, for Pythagorean, for several upper boundaries:
Version | 100 | 200 | 500 | 1,000 |
---|---|---|---|---|
Unoptimised | 7.2 | 203 | 466 in 14000s | |
Optimised | 0.070 | 0.73 | 22.3 | 360 |
Native | 0.002 | 0.021 | 0.093 | 0.571 |
And these are the times for Perfect
:
Version | 100 | 1000 | 5,000 | 10,000 |
---|---|---|---|---|
Unoptimised | 0.016 | 3.2 | 460 | 4154 |
Optimised | 0.011 | 0.48 | 41 | 336 |
Native | 0.001 | 0.006 | 0.074 | 0.280 |
Another surprising fact was that, while I considered Peano arithmetic a joke appropriate for the festive season, others take it seriously: https://wiki.haskell.org/Peano_numbers, https://hackage.haskell.org/package/peano-inf-0.6.5/peano-inf-0.6.5.tar.gz.
But anyway, the code is way too slow. Now we are going to try another approach. We’ll store numbers in the binary format – as sequences of zeroes and ones. Let’s see if this is faster.
We’ll be using the last column of the above tables as a benchmark: 1,000 for pythagorean
and 10,000 for perfect
.
The new code will be available in this repository.
The implementation details
The numbers will be represented as linked lists, where elements contain digits. We’ll have to make several technical decisions:
1) The representation of digits. We could use Booleans, but the purists may argue that Booleans are also numbers, especially when we start performing operations on them, such as “exclusive OR”. That’s why we’ll avoid them for now (perhaps, later try an implementation used on Booleans as well), and rather use the enums:
2) Singly or doubly linked list? It seems that for our current purpose a singly linked list is sufficient:
Still, if the implementation shows any advantage in having a doubly-linked list, we may consider using it.
3) Which direction do they grow? Since operations must be synchronised (the digits in the same positions must be processed together), the list must grow from the lowest to the highest, the first element being the lowest digit of the number.
4) Two classes or one? We can make a separate number object, which encapsulates the list, or we can make the list element itself a number. The potential advantage of the
first approach is that we can make a null list the default representation of zero, which will make the code more regular. The disadvantage is in the extra object.
We’ll use the second approach, and will represent zero as a list consisting of just one element (ZERO
).
5) Leading zeroes conventions. It takes some effort to get rid of leading zeroes (the subtraction code becomes more complex), but it helps elsewhere (for example, in comparisons).
Our numbers will always have the highest digit ONE
, with the single exception of the number zero. The test for zero looks like this:
6) Mutable or immutable? Sometimes it looks attractive to make the class mutable. For instance, a value 10 (the four-element list consisting of 0
-1
-0
-1
)
could increment itself by just replacing the lowest bit. However, we can do almost as well if we allocate a new element with the digit “1” and point it to the second
element of number 10:
This way we can use both numbers. This requires immutable design, because the number that is referenced more than once mustn’t modify itself. This promises to save a lot of
space, that’s why we’ll make our numbers immutable and declare all the fields as final
.
The implementation
First, we define the smallest numbers:
Then we need the operation to increment a number:
This code shows the mentioned idea of re-using parts of numbers. When we call S()
on 10 (binary 1010), it will create the element with the digit ONE
and link it to the
high part of 10 (binary 101)
Subtracting one is more complex, because we must take care of leading zeroes:
We are temporarily representing zeroes as nulls, until we reach the top level of recursion. We’ll use the same approach in subtraction.
Now we need a routine to add two numbers. We must iterate the lists, each time adding three bits: two from the source numbers, and a carry. This addition produces two bits – the new result bit and the new carry. The code to add three bits is ugly and difficult to read, that’s why it helps to split the routine into two cases: adding with carry and adding without:
The same approach works for subtraction, except we must take care of leading zeroes:
Now we need comparisons:
Unfortunately, comparisons are not very cheap: we must traverse both numbers to find out which one is bigger. Even if one is longer, we don’t know that until we traverse the lists (we can’t keep the list sizes, for this requires numbers).
Now we come to multiplication and division. We’ll perform both using classic school textbook approach. First, we need shifts:
And now we are ready for multiplication:
and for division:
The conversion to strings and back is the same as in the original number implementation (“Implementing Peano arithmetic”). The test routines are also almost identical:
The first test
Now we are ready to run the first test. We are using the original (unoptimised) versions of the user code (by which we mean pythagorean
and perfect
), and also
we haven’t yet applied any optimisations to our algorithms. Here are the times:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Peano6 | Optimised Peano implementation | 360 | 336 |
Binary | Original binary implementation | 275 | 25 |
The results look good, especially for perfect
. We have already outperformed the best Peano
version.
The equality comparison
Last time we got some (albeit small) improvement by replacing “less than” comparison in the loops with “not equals”. The reason it helped is that equality comparison
was much faster (direct reference comparison instead of a list traversal). In our case it is not so: both call compare()
. However, numbers can be compared for equality
faster: any bit difference means “not equal”, so there is no need to check higher bits if the lower ones differ. Let’s try that (the new class is Binary2
and the new
methods are pythagorean2
and perfect2
):
We’ve got a small improvement again:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary | Original binary implementation | 275 | 25 |
Binary2 | Equality test improved | 263 | 24 |
Improvement of multiplication and division
Let’s return to the code of multiplication:
Tis code has a problem: it shifts the first argument left as it proceeds with multiplication, and then spends time adding these additional zeroes to the result. It is better to add the unshifted argument and shift the result, but this requires performing multiplication starting from high bits, which requires recursion:
Division suffers from the same problem:
We shift the divisor left until it becomes as long as the dividend, and then subtract it from the dividend, together with all the trailing zeroes. We could instead shift the dividend right, and subtract short numbers. Here is the new code:
Here are the results:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary2 | Equality test improved | 263 | 24 |
Binary3 | Multiplication and division improved | 221 | 18 |
We’ve got a really remarkable result: the recursive solutions work faster than iterative ones. Sure, the improvement didn’t come just because we replaced iteration with recursion. In both cases it was due to different algorithm, but both new algorithms (multiplication and division) are very difficult to write down without recursion.
Keeping the links
Until now, all the number objects were independent from each other. They were just linked lists of bits, created fresh when they were needed. Sometimes a portion of the list
could be re-used as a part of another number. The fundamental numbers ZERO
, ONE
and TWO
are often
re-used. However, unused numbers are not kept, which saves memory, but costs time – we have to re-create the numbers that we had before.
It is attractive to keep and re-use all the numbers, although this approach isn’t applicable to all computational programs. Some of them use billions of different numbers.
In our test programs, however, the number of numbers used is at most millions, so we can use this approach.
What we’ll do is keep the numbers resulting from appending ZERO
or ONE
to a given number, as fields in the object, populated on demand. We’ll replace all calls to constructors
with calls to a special method:
This is the example of a use of the new method:
The numbers ZERO
and ONE
are created as before (calling the constructor), but TWO
must be created using new method, to make sure that ONE
gets
correct value in its link0
:
Now there is at most one object for each number, and that object is the head of a link list that ends at ONE
, unless the number is ZERO
. These are the only two numbers with
next
set to null
.
Since the numbers are now singletons, we can replace the equality test with comparison of references:
Here are the new times:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary3 | Multiplication and division improved | 221 | 18 |
Binary4 | Making numbers singletons | 199 | 17.5 |
Caching the increments
The singleton nature of numbers allows another improvement: caching the results of S()
, in the same way we were doing it in Peano
numbers:
We won’t cache results of D()
for now, because D()
is currently never called.
Here are the times:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary4 | Making numbers singletons | 199 | 17.5 |
Binary5 | Caching results of S() | 190 | 16.9 |
Manual optimisations of the user code: pythagorean
While working with Peano numbers, we’ve performed manually some optimisations that the compiler usually does automatically. We can now apply the same transformations again:
Version | Comment | Pythagorean, time for 1,000, sec |
---|---|---|
Binary5.pythagorean2 | Caching results of S() | 190 |
Binary5.pythagorean3 | Some code moved out of loops | 106.6 |
Binary5.pythagorean4 | More code moved out of loops | 60.7 |
Binary5.pythagorean5 | Strength reduction | 30.9 |
Binary5.pythagorean6 | More strength reduction | 33.4 |
We see some performance degradation in pythagorean6
. The difference between the last two versions is that pythagorean5
maintains variables c_square
, b_square
and
a_square
, and calculates
while pythagorean6
maintains c_square_minus_b_square
instead of b_square
.
Possibly, maintenance of this variable is more expensive, because it requires subtraction (two of those per loop), which is slower than addition.
Improving remainder calculation
In Peano
project we improved the speed of perfect
by creating a specialised version of division: the routine that checks if one number is divisible by another one
without reporting quotient or remainder. We can do something similar in our case, too. When only the remainder is required, there is no need to calculate the quotient.
This makes the DivResult
structure unnecessary and saves some memory allocations. There is no need to create a special divisible
method, because internally it would
calculate the remainder anyway. This means that we don’t need a special version of perfect()
, which we had in Peano
(perfect3()
), we’ll rather make a new version of
rem()
(see Binary6.java
):
This is the time:
Version | Comment | Perfect, time for 10,000, sec |
---|---|---|
Binary5 | Caching results of S() | 16.9 |
Binary6 | Improved rem() | 15.9 |
I expected better improvement, but 6% isn’t insignificant.
The final touch
In Peano
project the last improvement we made in perfect()
was to run the internal loop to i/2
instead of i
. Since division by 2 was slow, we implemented it by
introducing the induction variable that equals double the loop variable.
Now division by 2 is fast, so our job is easier:
Here is the result:
Version | Comment | Perfect, time for 10,000, sec |
---|---|---|
Binary6.perfect2 | Improved rem() | 15.9 |
Binary6.perfect4 | Running the inner loop to i/2 | 10.3 |
The binary tree
It is interesting to look at the structure that our numbers form. The number 0
is separate – it consists of a single node with the digit 0
.
Everything else is arranged into a binary tree, with the number 1
as a root.
Each node has a parent reference next
(node 1
has it equal to null
), and two children references link0
and link1
, which are drawn as pointing to the left
and the right child, respectively. For a node representing the number n these children represent numbers 2n and 2n + 1.
The nodes are marked with digits corresponding to their position relative to the parent node: left nodes are marked “0”, right ones “1”.
This is the lowest digit in the binary representation of the number. The path from the number to node 1
gives the entire binary representation,
the root node being the highest digit. Every number, except for 0
, has the highest digit one, which agrees with the convention of avoiding the leading zeroes.
The tree doesn’t have to be fully built; the children are added as they are needed. For instance, while running pythagorean5
, the biggest number seen
is 1,000,000, while the size of the tree is 430,743.
This tree is a pretty picture, but it doesn’t help performing operations with numbers. Some algorithms might exist that perform computations faster using graph processing
techniques, but I don’t know them. Moreover, the value of keeping this tree in memory isn’t great – when we introduced it in Binary4
, the times only improved from 221 sec to
199 sec and from 18 sec to 17.5 sec. If we undo the changes introduced in Binary4
and Binary5
and run the latest version of the user code, we get
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary6 | Best versions of both tests | 30.9 | 10.3 |
Binary6X | Best versions of both tests, no tree in memory | 32.7 | 10.7 |
Additional versions: null as zero
While discussing possible strategies for implementing binary numbers, we mentioned representing number zero as an empty list. This could make the program more regular –
for instance, subtraction wouldn’t have to check for null
and convert it to ZERO
. The need for separare D0()
and D()
would disappear, and so on. The problem is that
we can’t invoke any methods on null
. Either all the methods must be replaced with ordinary static binary functions, or another class must be introduced that would
encapsulate the list.
We’ve tried the first approach (see class BinaryN
). The conversion is straightforward. I’ll show some of the new methods. This is the new D()
:
And this is multiplication:
The rest of the code is converted in a similar fashion.
The user code must be converted, too, and this is where the disadvantage of the new structure becomes clearly visible. The new code looks ugly. To illustrate this, I’ll show a single line of the old code and its version in the new code.
This like from pythagorean6
now looks like this:
The object notation looks much prettier and easier to read (although, operator overloading, such as in C++, would have looked even better).
The results aren’t great:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary6 | Best results so far | 30.9 | 10.3 |
BinaryN | null as zero, no object notation | 30.8 | 10.4 |
Additional versions: booleans as digits
We also planned to try using boolean
s to represent digits. The change is easy. This is the example of the new code (class BinaryB
):
Here are the results:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Binary6 | Best results so far | 30.9 | 10.3 |
BinaryB | Booleans as digits | 30.0 | 9.0 |
This is the best solution we’ve produced.
The summary
Here is the full table of results:
Version | Comment | Pythagorean, time for 1,000, sec | Perfect, time for 10,000, sec |
---|---|---|---|
Peano6 | Optimised Peano implementation | 360 | 336 |
Binary | Original binary implementation | 275 | 25 |
Binary2 | Equality test improved | 263 | 24 |
Binary3 | Multiplication and division improved | 221 | 18 |
Binary4 | Making numbers singletons | 199 | 17.5 |
Binary5 | Caching results of S() | 190 | 16.9 |
Binary5.pythagorean3 | Some code moved out of loops | 106.6 | |
Binary5.pythagorean4 | More code moved out of loops | 60.7 | |
Binary5.pythagorean5 | Strength reduction | 30.9 | |
Binary5.pythagorean6 | More strength reduction | 33.4 | |
Binary6 | Improved rem() | 15.9 | |
Binary6.perfect4 | Running the inner loop to i/2 | 10.3 | |
Binary6X | Best versions of both tests, no tree in memory | 32.7 | 10.7 |
BinaryN | null as zero, no object notation | 30.8 | 10.4 |
BinaryB | Booleans as digits | 30.0 | 9.0 |
Native | Classic Java implementation | 0.571 | 0.28 |
Conclusions
-
This is still a valid computation-free implementation: there are no
int
variables or numeric constants in the code. -
This implementation does not use any classes from the standard library.
-
As we expected, the new version is faster than the strict
Peano
version: the code runs 10-30 times faster. The speed difference may increase if we move to bigger numbers. -
Unlike the
Peano
version, the new code isn’t limited in number size. Even the caching versions only keep the numbers that were seen, and caching can be switched off with relatively small performance loss. -
The code now runs 30 – 60 times slower than the native version, as if it was running on a CPU with 40-80 MHz clock speed, such as the first Pentium (66 MHz). Not just student’s assignments - serious business applications ran on those computers. In the previous article we’ve made a journey into the late 70’s - early 80’s, now we’ve been in middle 90’s.
Coming soon
We’ve had a some fun playing with the aftificial numbers, but this is the end of this story. Now it’s time to get back to our previous activity. We were implementing the game of Life using a hash table. Many improvements have been applied, but I have some more ideas.