At wolfSSL we’re excited about stateful hash-based signature schemes and CNSA 2.0, and we recently held a webinar on the subject. As you may recall, we previously added initial support for LMS/HSS and XMSS/XMSS^MT through external integration with the hash-sigs and xmss-reference implementations.
Recently, however, we completed our own wolfCrypt implementations of these algorithms, and we would like to share benchmarking results and some of the available build options. In general, the wolfCrypt implementations of these signature schemes are faster and offer more options for tuning build size and performance.
With that said, we’ll review the most relevant build options and benchmarking data for LMS/HSS and XMSS/XMSS^MT. These benchmarks were obtained on a Fedora 38 workstation with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, using a single core. For all tests, wolfSSL was built with --enable-intelasm to take advantage of assembly speedups. Note: LMS/HSS and XMSS/XMSS^MT support a very wide range of parameters. For the sake of conciseness, only a targeted range is benchmarked here.
LMS build options and benchmarking
The five main defines that customize the wolfCrypt LMS/HSS build are the following:
- WOLFSSL_LMS_LARGE_CACHES
- WOLFSSL_WC_LMS_SMALL
- WOLFSSL_LMS_MAX_LEVELS=N
- WOLFSSL_LMS_MAX_HEIGHT=H
- WOLFSSL_LMS_VERIFY_ONLY
The define WOLFSSL_LMS_LARGE_CACHES will cache more of the authentication path into memory, speeding up signing operations for larger height trees.
The define WOLFSSL_WC_LMS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations. The performance impact on verification, however, is negligible.
The defines WOLFSSL_LMS_MAX_LEVELS and WOLFSSL_LMS_MAX_HEIGHT set compile-time limits on the size of the LMS/HSS hypertree, and mainly reduce code footprint without impacting performance. They can be used to slim the build if you are only interested in a specific range of parameter sets. More specifically, WOLFSSL_LMS_MAX_LEVELS sets the maximum number of levels allowed in HSS (the number of trees stacked in the hypertree), while WOLFSSL_LMS_MAX_HEIGHT sets the maximum height allowed per tree for both LMS and HSS.
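To make the levels/height distinction concrete, here is a small Python sketch (not part of wolfSSL) that parses parameter names of the form used in the benchmarks below, such as L2_H10_W4, and reports how many signatures such a key can produce. It assumes the name encodes levels, per-tree height, and Winternitz width in that order, and that an HSS key with L levels of height H yields 2^(L*H) signatures. The helper names `parse_params`, `hss_capacity`, and `fits_limits` are hypothetical, and `fits_limits` reflects only our reading of how the two compile-time limits apply.

```python
import re


def parse_params(name):
    """Split a name like 'L2_H10_W4' into (levels, height, winternitz)."""
    match = re.fullmatch(r"L(\d+)_H(\d+)_W(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized parameter name: {name}")
    levels, height, winternitz = map(int, match.groups())
    return levels, height, winternitz


def hss_capacity(name):
    """Total signatures available: each tree has 2**height leaves,
    and stacking `levels` trees multiplies the leaf counts together."""
    levels, height, _ = parse_params(name)
    return 2 ** (levels * height)


def fits_limits(name, max_levels, max_height):
    """Illustrative check (an assumption, not wolfSSL code): does a
    parameter set stay within the two compile-time limits?"""
    levels, height, _ = parse_params(name)
    return levels <= max_levels and height <= max_height


if __name__ == "__main__":
    for name in ["L2_H10_W2", "L3_H5_W4", "L3_H10_W4", "L4_H5_W8"]:
        ok = fits_limits(name, max_levels=2, max_height=10)
        print(f"{name}: {hss_capacity(name)} signatures, "
              f"fits LEVELS=2/HEIGHT=10: {ok}")
```

Under these assumptions, a build limited to two levels of height 10 would cover L2_H10_W2 but not the three- and four-level parameter sets.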
The define WOLFSSL_LMS_VERIFY_ONLY restricts the build to a smaller verify-only subset (LMS API and data structures needed for keygen/signing are omitted). This does not impact verify performance, and is intended for embedded targets that need verify-only functionality (e.g. wolfBoot). WOLFSSL_LMS_VERIFY_ONLY can be combined with WOLFSSL_WC_LMS_SMALL, WOLFSSL_LMS_MAX_LEVELS, and WOLFSSL_LMS_MAX_HEIGHT for further footprint reduction.
In Table 1 we show benchmarking results (obtained with ./wolfcrypt/benchmark/benchmark -lms_hss) for these different build options, with the external LMS/HSS implementation provided for comparison.
In general we see the default wolfCrypt LMS/HSS performance (wc_lms) is much faster than the external integration (ext_lms) for all categories of operation (keygen, signing, verifying). The WOLFSSL_LMS_LARGE_CACHES (wc_lms large) option speeds up signing operations for larger height trees, but otherwise does not impact performance. The small variations in verify speed across wc_lms, wc_lms large, and wc_lms small are likely just system noise and do not represent a systematic trend. The WOLFSSL_WC_LMS_SMALL option (wc_lms small) significantly reduces signing speed, but leaves verification speed basically unchanged, making this option attractive for verify-only applications in embedded systems.
Table 1: Comparison of wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), wolfCrypt LMS/HSS with WOLFSSL_WC_LMS_SMALL (wc_lms small), and the external integration implementation (ext_lms). All values in units of ops/sec.
| Parameter set and operation | wc_lms | wc_lms large | wc_lms small | ext_lms |
| --- | --- | --- | --- | --- |
| L2_H10_W2 keygen | 6.482 | 6.494 | 12.828 | 1.330 |
| L2_H10_W2 sign | 4437.469 | 5521.796 | 6.526 | 786.083 |
| L2_H10_W2 verify | 13954.450 | 14087.794 | 13874.450 | 4789.383 |
| L2_H10_W4 keygen | 3.567 | 3.592 | 6.954 | 0.764 |
| L2_H10_W4 sign | 2452.361 | 3052.326 | 3.562 | 443.225 |
| L2_H10_W4 verify | 6482.891 | 6707.271 | 6962.215 | 2281.440 |
| L3_H5_W4 keygen | 70.926 | 73.673 | 227.376 | 17.467 |
| L3_H5_W4 sign | 4660.370 | 4669.019 | 74.653 | 820.640 |
| L3_H5_W4 verify | 4632.118 | 4670.963 | 4790.742 | 1756.355 |
| L3_H5_W8 keygen | 9.395 | 9.413 | 29.041 | 2.265 |
| L3_H5_W8 sign | 609.408 | 605.199 | 9.542 | 106.059 |
| L3_H5_W8 verify | 561.759 | 554.635 | 573.341 | 214.093 |
| L3_H10_W4 keygen | 2.384 | 2.368 | 7.128 | 0.569 |
| L3_H10_W4 sign | 2459.698 | 3067.848 | 2.376 | 444.601 |
| L3_H10_W4 verify | 4895.203 | 4345.130 | 4793.853 | 1618.676 |
| L4_H5_W8 keygen | 7.045 | 7.017 | 29.258 | 1.770 |
| L4_H5_W8 sign | 608.915 | 607.318 | 7.168 | 106.881 |
| L4_H5_W8 verify | 446.384 | 443.804 | 438.542 | 145.672 |
Graph 1: Signing speeds for wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), and the external integration implementation (ext_lms). All values in units of ops/sec.
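For readers who want to turn the signing numbers into ratios, the following Python sketch divides the Table 1 throughputs against each other. The values are copied from the table. The function names `sign_speedup_vs_ext` and `large_cache_gain` are ours and purely illustrative.

```python
# Signing throughput in ops/sec, copied from the "sign" rows of Table 1.
SIGN_OPS = {
    "L2_H10_W2": {"wc_lms": 4437.469, "wc_lms large": 5521.796, "ext_lms": 786.083},
    "L2_H10_W4": {"wc_lms": 2452.361, "wc_lms large": 3052.326, "ext_lms": 443.225},
    "L3_H5_W4": {"wc_lms": 4660.370, "wc_lms large": 4669.019, "ext_lms": 820.640},
    "L3_H5_W8": {"wc_lms": 609.408, "wc_lms large": 605.199, "ext_lms": 106.059},
    "L3_H10_W4": {"wc_lms": 2459.698, "wc_lms large": 3067.848, "ext_lms": 444.601},
    "L4_H5_W8": {"wc_lms": 608.915, "wc_lms large": 607.318, "ext_lms": 106.881},
}


def sign_speedup_vs_ext(name):
    """How many times faster default wc_lms signs than ext_lms."""
    row = SIGN_OPS[name]
    return row["wc_lms"] / row["ext_lms"]


def large_cache_gain(name):
    """Signing throughput with WOLFSSL_LMS_LARGE_CACHES relative to default."""
    row = SIGN_OPS[name]
    return row["wc_lms large"] / row["wc_lms"]


if __name__ == "__main__":
    for name in SIGN_OPS:
        print(f"{name}: wc_lms/ext_lms = {sign_speedup_vs_ext(name):.2f}x, "
              f"large/default = {large_cache_gain(name):.2f}x")
```

The second ratio makes the height dependence visible: it is noticeably above 1 for the H10 parameter sets and close to 1 for the H5 sets, consistent with the large-cache option mainly helping taller trees.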
XMSS build options and benchmarking
Three important defines that customize the wc_xmss build are:
- WOLFSSL_WC_XMSS_SMALL
- WOLFSSL_XMSS_MAX_HEIGHT=N
- WOLFSSL_XMSS_VERIFY_ONLY
The define WOLFSSL_WC_XMSS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations and 20-30% slower verification.
The define WOLFSSL_XMSS_MAX_HEIGHT=N sets a compile-time limit on the maximum height of the hypertree, and mainly reduces code size without impacting performance.
The define WOLFSSL_XMSS_VERIFY_ONLY restricts the build to a smaller verify-only subset, and can be combined with WOLFSSL_WC_XMSS_SMALL and WOLFSSL_XMSS_MAX_HEIGHT for further size reduction. It does not impact verify performance.
In Table 2 we show benchmarking results for XMSS/XMSS^MT with these options (obtained with ./wolfcrypt/benchmark/benchmark -xmss_xmssmt_sha256), with the external XMSS/XMSS^MT implementation included for comparison. The default wolfCrypt XMSS/XMSS^MT (wc_xmss) is generally faster than the external integration (ext_xmss) for all operations. The gap between wc_xmss and ext_xmss is smaller than the gap between wc_lms and ext_lms, though, because ext_xmss can benefit from assembly speedups whereas ext_lms cannot. Similar to LMS, the WOLFSSL_WC_XMSS_SMALL option (wc_xmss small) significantly reduces signing performance, but verify speeds remain fast, making this a good option for embedded verify-only targets.
Table 2: Comparison of wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.
| Parameter set and operation | wc_xmss | wc_xmss small | ext_xmss |
| --- | --- | --- | --- |
| XMSS-SHA2_10_256 keygen | 1.587 | 1.079 | 0.943 |
| XMSS-SHA2_10_256 sign | 363.693 | 1.106 | 226.782 |
| XMSS-SHA2_10_256 verify | 3050.276 | 2044.995 | 1892.234 |
| XMSSMT-SHA2_20/2_256 keygen | 0.808 | 1.100 | 0.472 |
| XMSSMT-SHA2_20/2_256 sign | 298.138 | 0.551 | 191.214 |
| XMSSMT-SHA2_20/2_256 verify | 1307.295 | 982.836 | 852.348 |
| XMSSMT-SHA2_20/4_256 keygen | 9.880 | 35.274 | 7.309 |
| XMSSMT-SHA2_20/4_256 sign | 390.942 | 8.681 | 290.516 |
| XMSSMT-SHA2_20/4_256 verify | 729.433 | 517.298 | 443.444 |
| XMSSMT-SHA2_40/4_256 keygen | 0.406 | 1.107 | 0.237 |
| XMSSMT-SHA2_40/4_256 sign | 294.738 | 0.276 | 161.656 |
| XMSSMT-SHA2_40/4_256 verify | 750.591 | 487.257 | 424.986 |
| XMSSMT-SHA2_40/8_256 keygen | 5.604 | 35.318 | 3.755 |
| XMSSMT-SHA2_40/8_256 sign | 469.764 | 4.374 | 293.184 |
| XMSSMT-SHA2_40/8_256 verify | 361.289 | 262.160 | 225.254 |
| XMSSMT-SHA2_60/6_256 keygen | 0.266 | 1.099 | 0.159 |
| XMSSMT-SHA2_60/6_256 sign | 280.160 | 0.185 | 144.637 |
| XMSSMT-SHA2_60/6_256 verify | 521.610 | 352.718 | 295.882 |
| XMSSMT-SHA2_60/12_256 keygen | 4.143 | 35.280 | 2.505 |
| XMSSMT-SHA2_60/12_256 sign | 514.658 | 2.910 | 292.371 |
| XMSSMT-SHA2_60/12_256 verify | 247.682 | 170.459 | 152.471 |
Graph 2: Verify speeds for wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.
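As a rough way to quantify the verify-speed tradeoff of the small build, the Python sketch below computes, for each parameter set in Table 2, how much slower wc_xmss small verifies than the default build and how it compares to ext_xmss. The values are copied from the table. The function names `small_verify_slowdown` and `small_vs_ext` are ours and purely illustrative.

```python
# Verify throughput in ops/sec from Table 2: (wc_xmss, wc_xmss small, ext_xmss).
VERIFY_OPS = {
    "XMSS-SHA2_10_256": (3050.276, 2044.995, 1892.234),
    "XMSSMT-SHA2_20/2_256": (1307.295, 982.836, 852.348),
    "XMSSMT-SHA2_20/4_256": (729.433, 517.298, 443.444),
    "XMSSMT-SHA2_40/4_256": (750.591, 487.257, 424.986),
    "XMSSMT-SHA2_40/8_256": (361.289, 262.160, 225.254),
    "XMSSMT-SHA2_60/6_256": (521.610, 352.718, 295.882),
    "XMSSMT-SHA2_60/12_256": (247.682, 170.459, 152.471),
}


def small_verify_slowdown(name):
    """Fractional drop in verify throughput of the small build vs default."""
    default, small, _ext = VERIFY_OPS[name]
    return 1.0 - small / default


def small_vs_ext(name):
    """Verify throughput of the small build relative to ext_xmss."""
    _default, small, ext = VERIFY_OPS[name]
    return small / ext


if __name__ == "__main__":
    for name in VERIFY_OPS:
        print(f"{name}: small is {small_verify_slowdown(name):.0%} slower than "
              f"default, {small_vs_ext(name):.2f}x ext_xmss")
```

On these numbers the small build gives up part of the default build's verify throughput, yet still verifies at least as fast as ext_xmss for every parameter set listed.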
Conclusions
In general, our wolfCrypt implementations of LMS/HSS and XMSS/XMSS^MT are significantly faster than the external reference implementations, with speedups ranging from 20-30% up to 3x-4x depending on the combination of operation, algorithm, and parameters.
The small footprint build shows fast verification speeds for all parameters, making it an attractive choice for embedded verify-only applications (e.g. wolfBoot).
Overall, our LMS/HSS implementation is faster than our XMSS/XMSS^MT implementation (at least on x86), which is consistent with what is known about these two schemes. However, which of the two is more appropriate for your use case will ultimately depend on other factors as well, such as signature size, target environment, and the parameters used.
If you’re interested in learning more about our post-quantum work, or want to learn more about stateful hash-based signature schemes, contact us at wolfSSL by emailing facts@wolfSSL.com or calling us at +1 425 245 8247 to reach out to your regional wolfSSL business director.