UPSTREAM PR #30864: Ml dsa speedup #670

Open: loci-dev wants to merge 6 commits into main from loci/pr-30864-ml-dsa-speedup

@loci-dev

Note

Source pull request: openssl/openssl#30864

Drop value barrier from ML-DSA reduce_once

[ The second commit is the substance of this PR; the first commit is just the CT tests from #30863. ]

This mirrors the corresponding code in ML-KEM and works under
the same conditions/assumptions.
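
To make the change concrete, here is a minimal sketch of a single conditional subtraction modulo the ML-DSA prime q = 8380417, written once with a value barrier on the select mask and once without. It is not the OpenSSL source: every `sketch_` name is hypothetical, and the empty inline-asm barrier is only an assumption about how such a barrier is commonly written for GCC and Clang. Whether a compiler keeps the barrier-free variant branch-free is exactly the condition the description refers to, and this sketch cannot demonstrate it; it only shows that both variants compute the same result.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_Q 8380417u /* ML-DSA modulus q = 2^23 - 2^13 + 1 */

/* Hypothetical barrier: hides the value of x from the optimizer. */
static uint32_t sketch_value_barrier_u32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    __asm__ volatile("" : "+r"(x));
#endif
    return x;
}

/* Reduce x in [0, 2q) to [0, q), with a barrier on the select mask. */
static uint32_t sketch_reduce_once_barrier(uint32_t x)
{
    uint32_t subtracted = x - SKETCH_Q;      /* wraps around when x < q */
    uint32_t mask = 0u - (subtracted >> 31); /* all ones iff x < q */

    mask = sketch_value_barrier_u32(mask);
    return (mask & x) | (~mask & subtracted);
}

/* The same reduction with the barrier dropped. */
static uint32_t sketch_reduce_once(uint32_t x)
{
    uint32_t subtracted = x - SKETCH_Q;
    uint32_t mask = 0u - (subtracted >> 31);

    return (mask & x) | (~mask & subtracted);
}

/* Check that both variants agree with x % q on the whole range [0, 2q). */
static void sketch_reduce_once_self_check(void)
{
    uint32_t x;

    for (x = 0; x < 2 * SKETCH_Q; x++) {
        assert(sketch_reduce_once(x) == x % SKETCH_Q);
        assert(sketch_reduce_once_barrier(x) == x % SKETCH_Q);
    }
}
```

Calling `sketch_reduce_once_self_check()` walks all inputs below 2q. Verifying that the generated code is still constant time is a separate matter and is what the CT tests in the merge-base are for.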

Intentionally uses the constant time instrumentation PR as its
merge-base, so it should be merged after that has baked in for a few
days and shows working CT tests in daily CI runs.

Sample before (-) and after (+) performance pairs for one x86_64 CPU:

                keygens/s    sign/s  verify/s
   -  ML-DSA-44   18066.4    6014.1   23375.7
   +  ML-DSA-44   20404.4    7105.4   26455.0
   -  ML-DSA-65   10131.3    3567.9   14172.5
   +  ML-DSA-65   11148.6    4358.6   15762.0
   -  ML-DSA-87    7239.2    2912.2    8214.2
   +  ML-DSA-87    8098.4    3518.5    9299.8
Checklist
  • documentation is added or updated
  • tests are added or updated

Viktor Dukhovni added 2 commits April 16, 2026 20:02
Also slightly refactor the ML-KEM version to share the necessary
defines, and add a daily CI run to check both (presently on just some
platforms with known working valgrind support).
Don't declassify rho_prime; it needs to stay protected.
Move constish_time_non_zero() to <internal/constant_time.h> as requested
by reviewers, and rename it constish_time_true(), better reflecting the
expected 0/1 boolean input.
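
The rename is easier to follow with a small example of why a 0/1 input matters for a mask-building helper. The sketch below assumes the helper turns a boolean flag into a 0 or all-ones mask by arithmetic negation; the commit message does not show the definition, so `sketch_mask_from_bool()` and `sketch_select()` are hypothetical stand-ins, not the code in <internal/constant_time.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: turn a 0/1 flag into a 0 / all-ones mask. */
static uint32_t sketch_mask_from_bool(uint32_t b)
{
    return 0u - b;
}

/* Return a when flag is 1 and b when flag is 0, without branching on flag. */
static uint32_t sketch_select(uint32_t flag, uint32_t a, uint32_t b)
{
    uint32_t mask = sketch_mask_from_bool(flag);

    return (mask & a) | (~mask & b);
}

/* Show what the helper yields for boolean and non-boolean inputs. */
static void sketch_bool_mask_check(void)
{
    assert(sketch_mask_from_bool(0) == 0u);
    assert(sketch_mask_from_bool(1) == 0xFFFFFFFFu);
    /* A merely non-zero input such as 2 does not give an all-ones mask. */
    assert(sketch_mask_from_bool(2) == 0xFFFFFFFEu);
}
```

Under that assumption, a name that promises "non-zero" handling would be misleading, since only the input 1 produces a usable mask.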
Viktor Dukhovni added 3 commits April 18, 2026 15:06
- New CONSTTIME_SECRET_VECTOR() and CONSTTIME_DECLASSIFY_VECTOR() macros
  simplify CT labeling of ML-DSA vectors and avoid incorrect sizing.

- New constant_time_declassify_u32() inline function mirrors a similar
  function in BoringSSL. With this we declassify the pass/fail output
  of rejection tests, rather than their numeric inputs, matching similar
  code in BoringSSL.
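
The two items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the macros or function from the PR: the `sketch_` types, the labeling stub, and the macro bodies are all hypothetical. In a validation build the labeling would presumably map to Valgrind's `VALGRIND_MAKE_MEM_UNDEFINED` and `VALGRIND_MAKE_MEM_DEFINED`; here a stub just records the length so the sizing can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical polynomial and vector types, for illustration only. */
typedef struct { uint32_t coeff[256]; } sketch_poly;
typedef struct { sketch_poly *poly; size_t num_poly; } sketch_vector;

/* Stub labeler: records the byte count instead of calling Valgrind. */
static size_t sketch_last_len;
static void sketch_label(const void *ptr, size_t len)
{
    (void)ptr;
    sketch_last_len = len;
}

#define SKETCH_SECRET(ptr, len)     sketch_label((ptr), (len))
#define SKETCH_DECLASSIFY(ptr, len) sketch_label((ptr), (len))

/* Derive the size from the vector itself so callers cannot get it wrong. */
#define SKETCH_VECTOR_BYTES(v) ((v)->num_poly * sizeof((v)->poly[0]))
#define SKETCH_SECRET_VECTOR(v) \
    SKETCH_SECRET((v)->poly, SKETCH_VECTOR_BYTES(v))
#define SKETCH_DECLASSIFY_VECTOR(v) \
    SKETCH_DECLASSIFY((v)->poly, SKETCH_VECTOR_BYTES(v))

/* Declassify a single 32-bit value and hand it back. */
static uint32_t sketch_declassify_u32(uint32_t v)
{
    SKETCH_DECLASSIFY(&v, sizeof(v));
    return v;
}

/*
 * Rejection test sketch: returns 1 when value >= bound.
 * Assumes both operands are below 2^31. Only the pass/fail bit is
 * declassified; the secret value itself stays labeled.
 */
static int sketch_rejects(uint32_t secret_value, uint32_t bound)
{
    uint32_t too_big = 1u ^ ((secret_value - bound) >> 31);

    return sketch_declassify_u32(too_big) != 0;
}
```

Declassifying only the outcome keeps the checker's view of the numeric inputs unchanged, so a later accidental branch on them would still be reported.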
Use rank not 2 in ML-KEM decap classify_bytes
This mirrors the corresponding code in ML-KEM and works under
the same conditions/assumptions.  Also adjusted related
functions that used two layers of constant_time selects
where one suffices (now also matching BoringSSL).

Intentionally uses the constant time instrumentation PR as its
merge-base, so it should be merged after that has baked in for a few
days and shows working CT tests in daily CI runs.

Sample before/after performance pairs and percent throughput
increases for one x86_64 CPU (rows per parameter set: before,
after, increase):

              keygens/s    sign/s  verify/s
    ML-DSA-44   18728.3    6061.2   23251.6
    ML-DSA-44   21077.2    7392.4   27244.3
    ML-DSA-44     12.5%     22.0%     17.2%

    ML-DSA-65   10084.3    3603.0   13988.6
    ML-DSA-65   11197.9    4549.7   16208.4
    ML-DSA-65     11.0%     26.3%     15.9%

    ML-DSA-87    7184.8    2917.3    8141.0
    ML-DSA-87    8132.4    3693.7    9430.7
    ML-DSA-87     13.2%     26.6%     15.8%

and here's the same for an Apple silicon M2:

              keygens/s    sign/s  verify/s
    ML-DSA-44   17235.7    3099.3   15744.5
    ML-DSA-44   21855.2    4907.6   22849.0
    ML-DSA-44     26.8%     58.3%     45.1%

    ML-DSA-65    9165.8    1908.5   10058.3
    ML-DSA-65   11262.7    3069.6   14348.1
    ML-DSA-65     22.9%     60.8%     42.6%

    ML-DSA-87    6596.1    1563.6    6330.8
    ML-DSA-87    8404.9    2584.6    8767.6
    ML-DSA-87     27.4%     65.3%     38.5%
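
The percentage rows in the tables above are consistent with the usual relative throughput gain, (after / before - 1) * 100. A minimal sketch of that arithmetic, with a hypothetical helper name, in case readers want to recompute the figures for their own measurements:

```c
#include <assert.h>

/* Relative throughput increase in percent, given ops/s before and after. */
static double sketch_percent_increase(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Spot-check two table entries, allowing for rounding to one decimal. */
static void sketch_percent_check(void)
{
    /* x86_64, ML-DSA-44 keygens/s: listed as 12.5% */
    double a = sketch_percent_increase(18728.3, 21077.2);
    /* Apple M2, ML-DSA-87 sign/s: listed as 65.3% */
    double b = sketch_percent_increase(1563.6, 2584.6);

    assert(a > 12.45 && a < 12.55);
    assert(b > 65.25 && b < 65.35);
}
```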
loci-dev force-pushed the loci/pr-30864-ml-dsa-speedup branch from 7df28a7 to 231580f on April 20, 2026 03:43
loci-dev force-pushed the main branch 5 times, most recently from 421b135 to 770bf14 on April 28, 2026 03:44