wolfCrypt implementations of LMS/HSS and XMSS/XMSS^MT signatures: build options and benchmarks (Intel x86)

At wolfSSL we’re excited about stateful hash-based signature schemes and CNSA 2.0, and we recently held a webinar on the subject. As you may recall, we previously added initial support for LMS/HSS and XMSS/XMSS^MT through external integration with the hash-sigs and xmss-reference implementations.

Recently, however, we completed our own wolfCrypt implementations of these algorithms, and we would like to share benchmarking results and some of the available build options. In general, the wolfCrypt implementations of these signature methods are faster and offer more options for tuning build size and performance.

With that said, we’ll review some of the more relevant build options and benchmarking data for LMS/HSS and XMSS/XMSS^MT. These benchmarks were obtained on a Fedora 38 workstation with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. Only a single core was used. wolfSSL was built with --enable-intelasm to utilize assembly speedups for all tests. Note: LMS/HSS and XMSS/XMSS^MT support a very wide range of parameters. For the sake of conciseness, only a targeted range is benchmarked here.

LMS build options and benchmarking

The five main defines that customize the wolfCrypt LMS/HSS build are the following:

  • WOLFSSL_LMS_LARGE_CACHES
  • WOLFSSL_WC_LMS_SMALL
  • WOLFSSL_LMS_MAX_LEVELS=N
  • WOLFSSL_LMS_MAX_HEIGHT=H
  • WOLFSSL_LMS_VERIFY_ONLY

The define WOLFSSL_LMS_LARGE_CACHES caches more of the authentication path in memory, which speeds up signing operations for trees of larger height.

The define WOLFSSL_WC_LMS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations. The performance impact on verification, however, is negligible.

The defines WOLFSSL_LMS_MAX_LEVELS and WOLFSSL_LMS_MAX_HEIGHT set compile-time limits on the size of the LMS/HSS hypertree, and mainly reduce code footprint without impacting performance. They can be used to slim the build if you are only interested in a specific range of parameter sets. More specifically, WOLFSSL_LMS_MAX_LEVELS sets the maximum number of levels allowed in HSS (the number of trees in the hypertree), while WOLFSSL_LMS_MAX_HEIGHT sets the maximum height allowed per tree for both LMS and HSS.
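When picking values for these two limits, it can help to see how levels and height translate into signing capacity. The following is a minimal sketch, not part of wolfCrypt: hss_total_signatures is a hypothetical helper name, and it assumes every level uses the same tree height, as in the L<levels>_H<height>_W<winternitz> parameter names benchmarked below. Under that assumption an HSS key can produce 2^(levels * height) signatures.

```c
#include <stdint.h>

/* Hypothetical helper (not a wolfSSL API): number of signatures an HSS key
 * can produce when all levels share the same tree height.
 * Returns 0 for invalid input or when the count does not fit in 64 bits. */
uint64_t hss_total_signatures(unsigned levels, unsigned height)
{
    unsigned bits;

    if (levels == 0 || height == 0) {
        return 0;
    }
    bits = levels * height;      /* capacity is 2^(levels * height) */
    if (bits >= 64) {
        return 0;                /* would overflow a 64-bit counter */
    }
    return (uint64_t)1 << bits;
}
```

For example, L2_H10 and L4_H5 both give 2^20 (1,048,576) signatures, so a build limited to 2 levels of height 10 covers the same capacity as one that allows 4 levels of height 5.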

The define WOLFSSL_LMS_VERIFY_ONLY restricts the build to a smaller verify-only subset (LMS API and data structures needed for keygen/signing are omitted). This does not impact verify performance, and is intended for embedded targets that need verify-only functionality (e.g. wolfBoot). WOLFSSL_LMS_VERIFY_ONLY can be combined with WOLFSSL_WC_LMS_SMALL, WOLFSSL_LMS_MAX_LEVELS, and WOLFSSL_LMS_MAX_HEIGHT for further footprint reduction.
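As a rough illustration of combining these defines, the sketch below shows what a verify-only configuration might look like. The limit values (2 levels, height 10) are assumptions chosen only for the example, and lms_params_fit_build is a hypothetical helper, not a wolfSSL function. In a real project the defines would typically go into your build flags or a settings header rather than a source file, and LMS support itself would still need to be enabled separately.

```c
/* Illustrative verify-only configuration. The values 2 and 10 are
 * assumptions for this example; pick limits that match the parameter
 * sets you actually need to verify. */
#define WOLFSSL_LMS_VERIFY_ONLY
#define WOLFSSL_WC_LMS_SMALL
#define WOLFSSL_LMS_MAX_LEVELS 2
#define WOLFSSL_LMS_MAX_HEIGHT 10

/* Hypothetical helper (not a wolfSSL API): reports whether a parameter
 * set stays within the compile-time limits configured above. */
int lms_params_fit_build(int levels, int height)
{
    return levels >= 1 && levels <= WOLFSSL_LMS_MAX_LEVELS
        && height >= 1 && height <= WOLFSSL_LMS_MAX_HEIGHT;
}
```

With these example limits, L2_H10_W2 fits the build, while L3_H5_W4 (too many levels) and any height 15 parameter set (trees too tall) do not.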

In Table 1 we show benchmarking results (obtained with ./wolfcrypt/benchmark/benchmark -lms_hss) for these different build options, with the external LMS/HSS implementation provided for comparison.

In general we see the default wolfCrypt LMS/HSS performance (wc_lms) is much faster than the external integration (ext_lms) for all categories of operation (keygen, signing, verifying). The WOLFSSL_LMS_LARGE_CACHES (wc_lms large) option speeds up signing operations for larger height trees, but otherwise does not impact performance. The small variations in verify speed across wc_lms, wc_lms large, and wc_lms small are likely just system noise and do not represent a systematic trend. The WOLFSSL_WC_LMS_SMALL option (wc_lms small) significantly reduces signing speed, but leaves verification speed basically unchanged, making this option attractive for verify-only applications in embedded systems.


Table 1: Comparison of wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), wolfCrypt LMS/HSS with WOLFSSL_WC_LMS_SMALL (wc_lms small), and the external integration implementation (ext_lms). All values in units of ops/sec.

Parameters | Operation |    wc_lms | wc_lms large | wc_lms small |  ext_lms
L2_H10_W2  | keygen    |     6.482 |        6.494 |       12.828 |    1.330
L2_H10_W2  | sign      |  4437.469 |     5521.796 |        6.526 |  786.083
L2_H10_W2  | verify    | 13954.450 |    14087.794 |    13874.450 | 4789.383
L2_H10_W4  | keygen    |     3.567 |        3.592 |        6.954 |    0.764
L2_H10_W4  | sign      |  2452.361 |     3052.326 |        3.562 |  443.225
L2_H10_W4  | verify    |  6482.891 |     6707.271 |     6962.215 | 2281.440
L3_H5_W4   | keygen    |    70.926 |       73.673 |      227.376 |   17.467
L3_H5_W4   | sign      |  4660.370 |     4669.019 |       74.653 |  820.640
L3_H5_W4   | verify    |  4632.118 |     4670.963 |     4790.742 | 1756.355
L3_H5_W8   | keygen    |     9.395 |        9.413 |       29.041 |    2.265
L3_H5_W8   | sign      |   609.408 |      605.199 |        9.542 |  106.059
L3_H5_W8   | verify    |   561.759 |      554.635 |      573.341 |  214.093
L3_H10_W4  | keygen    |     2.384 |        2.368 |        7.128 |    0.569
L3_H10_W4  | sign      |  2459.698 |     3067.848 |        2.376 |  444.601
L3_H10_W4  | verify    |  4895.203 |     4345.130 |     4793.853 | 1618.676
L4_H5_W8   | keygen    |     7.045 |        7.017 |       29.258 |    1.770
L4_H5_W8   | sign      |   608.915 |      607.318 |        7.168 |  106.881
L4_H5_W8   | verify    |   446.384 |      443.804 |      438.542 |  145.672

Graph 1: Signing speeds for wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), and the external integration implementation (ext_lms). All values in units of ops/sec.
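To put a number on "much faster", the signing rows of Table 1 can be turned into speedup ratios. The sketch below copies the wc_lms and ext_lms signing figures from Table 1 and divides one by the other. The type and function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* One signing row from Table 1, in ops/sec. */
typedef struct {
    const char *params;
    double wc_lms;   /* default wolfCrypt build */
    double ext_lms;  /* external integration */
} SignRow;

static const SignRow lms_sign_rows[] = {
    { "L2_H10_W2", 4437.469, 786.083 },
    { "L2_H10_W4", 2452.361, 443.225 },
    { "L3_H5_W4",  4660.370, 820.640 },
    { "L3_H5_W8",   609.408, 106.059 },
    { "L3_H10_W4", 2459.698, 444.601 },
    { "L4_H5_W8",   608.915, 106.881 },
};
static const size_t lms_sign_row_count =
    sizeof(lms_sign_rows) / sizeof(lms_sign_rows[0]);

/* Ratio of two throughputs; values above 1.0 mean wc_ops is faster. */
double speedup(double wc_ops, double ext_ops)
{
    return wc_ops / ext_ops;
}

/* Smallest wc_lms/ext_lms signing speedup across the rows above. */
double min_sign_speedup(void)
{
    double best = speedup(lms_sign_rows[0].wc_lms, lms_sign_rows[0].ext_lms);
    for (size_t i = 1; i < lms_sign_row_count; i++) {
        double s = speedup(lms_sign_rows[i].wc_lms, lms_sign_rows[i].ext_lms);
        if (s < best) best = s;
    }
    return best;
}

/* Largest wc_lms/ext_lms signing speedup across the rows above. */
double max_sign_speedup(void)
{
    double best = speedup(lms_sign_rows[0].wc_lms, lms_sign_rows[0].ext_lms);
    for (size_t i = 1; i < lms_sign_row_count; i++) {
        double s = speedup(lms_sign_rows[i].wc_lms, lms_sign_rows[i].ext_lms);
        if (s > best) best = s;
    }
    return best;
}
```

For the parameter sets benchmarked here, the default wolfCrypt build signs roughly 5.5x to 5.7x faster than the external integration.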

XMSS build options and benchmarking

Three important defines that customize the wc_xmss build are:

  • WOLFSSL_WC_XMSS_SMALL
  • WOLFSSL_XMSS_MAX_HEIGHT=N
  • WOLFSSL_XMSS_VERIFY_ONLY

The define WOLFSSL_WC_XMSS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations and roughly 25-35% slower verification in the Table 2 benchmarks.

The define WOLFSSL_XMSS_MAX_HEIGHT=N sets a compile-time limit on the maximum height of the hypertree, and mainly reduces code size without impacting performance.
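For choosing a height limit, it may help to recall how XMSS^MT parameter names encode the hypertree. In a name such as XMSSMT-SHA2_20/2_256, 20 is the total hypertree height and 2 is the number of layers, so each individual tree has height 20 / 2 = 10 and the key can sign 2^20 messages. The sketch below works through that arithmetic. Both function names are hypothetical and are not part of wolfCrypt.

```c
#include <stdint.h>

/* Hypothetical helper: height of each individual tree in an XMSS^MT
 * hypertree with the given total height and layer count.
 * Returns -1 if the layers do not divide the height evenly. */
int xmssmt_layer_height(int total_height, int layers)
{
    if (total_height <= 0 || layers <= 0 || total_height % layers != 0) {
        return -1;
    }
    return total_height / layers;
}

/* Hypothetical helper: number of signatures available for a given total
 * height, i.e. 2^total_height. Returns 0 if it does not fit in 64 bits. */
uint64_t xmss_total_signatures(int total_height)
{
    if (total_height <= 0 || total_height >= 64) {
        return 0;
    }
    return (uint64_t)1 << total_height;
}
```

For instance, XMSSMT-SHA2_60/12_256 is built from 12 layers of height 5 trees, while XMSSMT-SHA2_60/6_256 uses 6 layers of height 10 trees. Both have a total height of 60.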

The define WOLFSSL_XMSS_VERIFY_ONLY restricts the build to a smaller verify-only subset, and can be combined with WOLFSSL_WC_XMSS_SMALL and WOLFSSL_XMSS_MAX_HEIGHT for further size reduction. It does not impact verify performance.

In Table 2 we show benchmarking results for XMSS/XMSS^MT for these options (obtained with ./wolfcrypt/benchmark/benchmark -xmss_xmssmt_sha256), with the external XMSS/XMSS^MT implementation for comparison. The default wolfCrypt XMSS/XMSS^MT (wc_xmss) is generally faster than the external integration (ext_xmss) for all operations. The gap between wc_xmss and ext_xmss is smaller than the gap between wc_lms and ext_lms, though, because ext_xmss can benefit from assembly speedups whereas ext_lms cannot. Similar to LMS, the WOLFSSL_WC_XMSS_SMALL option (wc_xmss small) significantly reduces signing performance, but verify speeds remain fast, making this a good option for embedded verify-only targets.

Table 2: Comparison of wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.

Parameters            | Operation |  wc_xmss | wc_xmss small | ext_xmss
XMSS-SHA2_10_256      | keygen    |    1.587 |         1.079 |    0.943
XMSS-SHA2_10_256      | sign      |  363.693 |         1.106 |  226.782
XMSS-SHA2_10_256      | verify    | 3050.276 |      2044.995 | 1892.234
XMSSMT-SHA2_20/2_256  | keygen    |    0.808 |         1.100 |    0.472
XMSSMT-SHA2_20/2_256  | sign      |  298.138 |         0.551 |  191.214
XMSSMT-SHA2_20/2_256  | verify    | 1307.295 |       982.836 |  852.348
XMSSMT-SHA2_20/4_256  | keygen    |    9.880 |        35.274 |    7.309
XMSSMT-SHA2_20/4_256  | sign      |  390.942 |         8.681 |  290.516
XMSSMT-SHA2_20/4_256  | verify    |  729.433 |       517.298 |  443.444
XMSSMT-SHA2_40/4_256  | keygen    |    0.406 |         1.107 |    0.237
XMSSMT-SHA2_40/4_256  | sign      |  294.738 |         0.276 |  161.656
XMSSMT-SHA2_40/4_256  | verify    |  750.591 |       487.257 |  424.986
XMSSMT-SHA2_40/8_256  | keygen    |    5.604 |        35.318 |    3.755
XMSSMT-SHA2_40/8_256  | sign      |  469.764 |         4.374 |  293.184
XMSSMT-SHA2_40/8_256  | verify    |  361.289 |       262.160 |  225.254
XMSSMT-SHA2_60/6_256  | keygen    |    0.266 |         1.099 |    0.159
XMSSMT-SHA2_60/6_256  | sign      |  280.160 |         0.185 |  144.637
XMSSMT-SHA2_60/6_256  | verify    |  521.610 |       352.718 |  295.882
XMSSMT-SHA2_60/12_256 | keygen    |    4.143 |        35.280 |    2.505
XMSSMT-SHA2_60/12_256 | sign      |  514.658 |         2.910 |  292.371
XMSSMT-SHA2_60/12_256 | verify    |  247.682 |       170.459 |  152.471

Graph 2: Verify speeds for wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.
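The verification cost of the small build can be read directly off Table 2 as a percentage. The following is a minimal sketch of that arithmetic. percent_slower is a hypothetical helper name, and the inputs in the usage note are the verify figures from Table 2.

```c
/* Hypothetical helper: percentage by which `other_ops` is slower than
 * `base_ops`, with both throughputs given in ops/sec. */
double percent_slower(double base_ops, double other_ops)
{
    return 100.0 * (base_ops - other_ops) / base_ops;
}
```

For XMSS-SHA2_10_256, percent_slower(3050.276, 2044.995) comes out at about 33%, and for XMSSMT-SHA2_20/2_256, percent_slower(1307.295, 982.836) comes out at about 25%. In both cases the small build still verifies faster than ext_xmss.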

Conclusions

In general, our wolfCrypt implementations of LMS/HSS and XMSS/XMSS^MT are significantly faster than the external implementations. In the benchmarks above, speedups range from roughly 1.3x (some XMSS^MT operations) to more than 5x (LMS/HSS signing), depending on the combination of operation, algorithm, and parameters.

The small-footprint builds still show fast verification speeds for all parameters tested, making them an attractive choice for embedded verify-only applications (e.g. wolfBoot).

Overall, our LMS/HSS implementation is faster than our XMSS/XMSS^MT implementation (at least on x86), which is consistent with what is known about these two schemes. However, which of the two is more appropriate for your use case will ultimately depend on other factors as well, such as signature size, target environment, and the parameters used.

If you’re interested in learning more about our post-quantum work, or want to learn more about stateful hash-based signature schemes, contact us at wolfSSL by emailing facts@wolfSSL.com or calling us at +1 425 245 8247 to reach out to your regional wolfSSL business director.
