Integer factoring and primality testing performance #2625

@fredrik-johansson

Description

Our factoring and primality testing code is probably never really going to be the absolute state of the art, since there is a lot of highly specialized software out there for these tasks. However, a reasonable goal is that we shouldn't be any slower than other general-purpose number theory libraries, e.g. Pari.

A global observation is that we should be using dedicated 128-bit arithmetic for all modular arithmetic that fits in 128 bits and something like mpn_mod (possibly in combination with Montgomery arithmetic) for larger moduli, instead of generic fmpz code. This can improve performance more than 2x in many cases. But the most glaring inefficiencies we have right now are probably more algorithmic in nature.

Here is a quick comparison with Pari/GP for (rigorous) primality testing (with #2624 merged), showing average timings given random input in [0,2^bits-1] up to 1024 bits:

[Figure: average primality testing time vs. bit size, FLINT and Pari/GP]

This looks pretty good, which is expected since we use similar algorithms: Brillhart-Lehmer-Selfridge when possible and APRCL as the general fallback. FLINT is a bit slower than Pari between 82 and 128 bits, which should be a high priority to investigate and fix. There may also be a regression in APRCL starting around 700 bits.

Pari also has ECPP, but according to the documentation it's only used beyond 1500 bits. Adding ECPP to FLINT would certainly be a worthwhile project, though I think improving the existing code for <1000-bit integers would be more useful in the near term. (To prove primality of huge general-form integers, it's currently best to use the ECPP software developed by Andreas Enge.)

A similar graph for factoring up to 256 bits:

[Figure: average factoring time vs. bit size, FLINT and Pari/GP]

The main observation is that FLINT's factoring code very seriously underperforms around 65-160 bits, being up to 20x slower than Pari. A very small part of this is inefficient trial division; see #2499. My guess is that the main issue is that we aren't using ECM where appropriate and/or that our ECM is poorly optimized for <128-bit factorizations. Or maybe QS is appropriate at least at the higher end but our QS is poorly tuned. Or it's a combination of these factors. There is also the serious bug #2250 in our QS to fix: though ostensibly a failure to terminate, the underlying bug may also be causing slowdowns in cases where QS does terminate.

I have found at least one issue with our use of ECM for large input; not sure if this also affects smaller input. The rule of thumb is that one should run ECM to look for factors up to 1/3 the bit size of the input, or in our case min(bits/3, 100) since our ECM isn't tuned for smoothness bounds exceeding 100 bits. The factor 1/3 itself is just a heuristic and may want better empirical tuning, which is a separate issue.

Our general-purpose factoring code calls fmpz_factor_smooth which searches for factors up to 1/3 the bit size of the original input. However, we should arguably stop ECM and switch to QS when we reach 1/3 the bit size of the remaining cofactor. This can be illustrated with the example

2056802480868100646375721251575555494408897387375737955882170045672576386016591560879707933101909539325829251496440620798637813 =
280673 * 2756163353 * 598990818061 * 4527716228491 * 248158049830971629 * 117445937227520353139789517076610399 * 33637310674071348724927955857253537

which I took from the README for yafu. Currently FLINT takes 208 seconds to factor this; Pari does it in 29 seconds.

The input has 420 bits, so we start fmpz_factor_smooth with a 100-bit goal. At 50 bits, we've found all the "small" factors (<= 248158049830971629). Now the remaining cofactor has only 232 bits, so we ought to adjust the smoothness bound down to 232/3 = 77 bits. If we implement this early termination in fmpz_factor_smooth, the time drops to 53 seconds. If we assume that the 1/3 heuristic isn't optimal and stop ECM already at 60 bits, it takes 30 seconds.

An independent issue is that our ECM code (unlike QS) isn't multithreaded. The benchmark results above are all single-threaded, but we should obviously focus on multithreaded performance too. I'm not too familiar with the algorithm, but it looks like one should be able to parallelize ECM by just running the curves in parallel.
