After quite some time in researching algorithms and coding a new arbitrary precision library, it's getting close to completion.

The new library uses radix 1e8 instead of 1e9 to store and process the long numbers, which on one side increases memory usage by a 9/8 factor (this is a bad thing), but on the other side it allows for some speed improvements that are not possible with the 1e9 radix that was being used.

Basically, 1e9 can go only twice (and fraction) into a signed 32-bit integer. This forces carry correction after every operation, while 1e8 fits more than 20 times in a signed integer. This allows for example to perform several additions without worrying about carry or overflow.

The new library uses multiply-high methods to perform all divisions, which doesn't have any major impact on the PC because of its hardware division capability, but it is expected to have a major impact on ARM, where divisions were done in software. Also, some routines were optimized to process 3 long-digits (words containing 8-digits packs) at once, taking advantage of the triple pipeline on ARM.

Overall, there's no benchmarks done on ARM yet, but on the PC the new library computes a Taylor series sin() expansion 20% faster than mpdecimal working with the same number of digits, not bad considering it has to do 12.5% more multiplications, additions, etc. The speedup on ARM is supposed to be even higher.

But 20% speedup is not good enough reason to justify a new library from scratch. The new 1e8 radix allows an optimization that should increase transcendental functions performance by a larger factor. The typical decimal CORDIC algorithm requires 20 additions per decimal digit. An optimized algorith was used that requires only 8, but one of the down sides is that requires the multiplication of a long number by a small constant (2 and 5). This can be accomplished by using more than one addition (going back to the 20 additions), and in each case required slow carry propagations when using the higher radix. Now it can be done in a single operation, with the exact same speed cost of an addition. Since additions need to align the digits of one of the parameters, there's an implicit shift operation in all additions, which is performed through multiply-high methods. The small constant was included in the magic numbers, so that secondary multiplication is done at virtually zero-cost.

The down side of all this is the delay of the next demo until the new library completely replaces mpdecimal. But progress doesn't stop.