Thumb2 Gets Assembly Code for AES and SHA-2 Algorithms in wolfSSL 5.6.4

In an effort to improve our Thumb2 support for Cortex-M4 and the like, wolfSSL 5.6.4 includes assembly code for the AES-ECB/CBC/CTR/GCM, SHA-256 and SHA-512 algorithms.

Of particular interest are the AES-CBC and AES-GCM performance improvements you will see when changing from the C code implementations in wolfSSL 5.6.3. Take for example running wolfSSL on a Cortex-M4 at 80 MHz. With wolfSSL 5.6.3 the performance numbers for the AES-CBC and AES-GCM algorithms are:

AES-128-CBC-enc            425 KiB took 1.000 seconds,  425.000 KiB/s
AES-128-CBC-dec            450 KiB took 1.024 seconds,  439.453 KiB/s
AES-192-CBC-enc            375 KiB took 1.039 seconds,  360.924 KiB/s
AES-192-CBC-dec            375 KiB took 1.008 seconds,  372.024 KiB/s
AES-256-CBC-enc            325 KiB took 1.027 seconds,  316.456 KiB/s
AES-256-CBC-dec            325 KiB took 1.000 seconds,  325.000 KiB/s
AES-128-GCM-enc            325 KiB took 1.062 seconds,  306.026 KiB/s
AES-128-GCM-dec            325 KiB took 1.063 seconds,  305.738 KiB/s
AES-192-GCM-enc            275 KiB took 1.012 seconds,  271.739 KiB/s
AES-192-GCM-dec            275 KiB took 1.015 seconds,  270.936 KiB/s
AES-256-GCM-enc            250 KiB took 1.024 seconds,  244.141 KiB/s
AES-256-GCM-dec            250 KiB took 1.023 seconds,  244.379 KiB/s

Add the following defines so the assembly code is compiled in:

#define WOLFSSL_ARMASM
#define WOLFSSL_ARMASM_INLINE
#define WOLFSSL_ARMASM_NO_HW_CRYPTO
#define WOLFSSL_ARMASM_NO_NEON
#define WOLFSSL_ARM_ARCH 7
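
One way to keep these defines from leaking into builds for other targets is to guard them in the settings header. The fragment below is only a sketch, assuming the project is built with WOLFSSL_USER_SETTINGS so that wolfSSL reads a user_settings.h, and assuming a GCC or Clang toolchain, which predefines `__thumb2__` when compiling Thumb2 code. The guard itself is an illustration rather than something wolfSSL requires, and the values inside it may need adjusting for cores other than the Cortex-M4.

```c
/* user_settings.h fragment (sketch).
 * Assumption: the build defines WOLFSSL_USER_SETTINGS so that wolfSSL
 * includes this file. */

/* Assumption: __thumb2__ is predefined by GCC and Clang for Thumb2
 * targets; other toolchains may need a different check. */
#if defined(__thumb2__)
    #define WOLFSSL_ARMASM
    #define WOLFSSL_ARMASM_INLINE
    #define WOLFSSL_ARMASM_NO_HW_CRYPTO
    #define WOLFSSL_ARMASM_NO_NEON
    #define WOLFSSL_ARM_ARCH 7
#endif
```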

And now, with wolfSSL 5.6.4, the performance is:

AES-128-CBC-enc           1000 KiB took 1.008 seconds,  992.063 KiB/s
AES-128-CBC-dec            850 KiB took 1.007 seconds,  844.091 KiB/s
AES-192-CBC-enc            850 KiB took 1.020 seconds,  833.333 KiB/s
AES-192-CBC-dec            825 KiB took 1.023 seconds,  806.452 KiB/s
AES-256-CBC-enc            725 KiB took 1.008 seconds,  719.246 KiB/s
AES-256-CBC-dec            700 KiB took 1.000 seconds,  700.000 KiB/s
AES-128-GCM-enc            425 KiB took 1.000 seconds,  425.000 KiB/s
AES-128-GCM-dec            425 KiB took 1.004 seconds,  423.307 KiB/s
AES-192-GCM-enc            400 KiB took 1.020 seconds,  392.157 KiB/s
AES-192-GCM-dec            400 KiB took 1.019 seconds,  392.542 KiB/s
AES-256-GCM-enc            375 KiB took 1.032 seconds,  363.372 KiB/s
AES-256-GCM-dec            375 KiB took 1.027 seconds,  365.141 KiB/s

AES-CBC encryption is more than twice as fast as the C code, while decryption is at least 90% faster! AES-GCM gets an impressive 38-49% boost.
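
The gains can be reproduced from the two benchmark runs. The following is a minimal sketch in C: the throughput figures are copied from the output above, while `aes_result`, `percent_gain` and `print_aes_gains` are illustrative names and not part of wolfSSL.

```c
#include <stdio.h>

/* Throughput in KiB/s, copied from the two benchmark runs above. */
struct aes_result {
    const char *name;
    double c_kib_s;    /* wolfSSL 5.6.3, C implementation */
    double asm_kib_s;  /* wolfSSL 5.6.4, Thumb2 assembly  */
};

static const struct aes_result aes_results[] = {
    { "AES-128-CBC-enc", 425.000, 992.063 },
    { "AES-128-CBC-dec", 439.453, 844.091 },
    { "AES-192-CBC-enc", 360.924, 833.333 },
    { "AES-192-CBC-dec", 372.024, 806.452 },
    { "AES-256-CBC-enc", 316.456, 719.246 },
    { "AES-256-CBC-dec", 325.000, 700.000 },
    { "AES-128-GCM-enc", 306.026, 425.000 },
    { "AES-128-GCM-dec", 305.738, 423.307 },
    { "AES-192-GCM-enc", 271.739, 392.157 },
    { "AES-192-GCM-dec", 270.936, 392.542 },
    { "AES-256-GCM-enc", 244.141, 363.372 },
    { "AES-256-GCM-dec", 244.379, 365.141 },
};

/* Gain of 'after' over 'before' as a percentage (100.0 means doubled). */
double percent_gain(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Print one line per algorithm with its percentage gain. */
void print_aes_gains(void)
{
    size_t i;
    for (i = 0; i < sizeof(aes_results) / sizeof(aes_results[0]); i++) {
        printf("%-16s %6.1f%%\n", aes_results[i].name,
               percent_gain(aes_results[i].c_kib_s,
                            aes_results[i].asm_kib_s));
    }
}
```

Calling `print_aes_gains()` from `main` lists the gain for each algorithm. AES-128-CBC-enc, for example, works out to roughly 133%.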

SHA-256 and SHA-512 see more modest improvements, but they are still worthwhile for getting the best out of wolfSSL on your embedded device.

Let us know if there are other cryptographic algorithms on Thumb2 for which you would like to see better performance.

If you have questions about any of the above, please contact us at facts@wolfSSL.com or call us at +1 425 245 8247.

Thumb2 and Arm32 Public Key Gets Massive Speedup in wolfSSL 5.6.4

In the latest release of wolfSSL, version 5.6.4, a significant effort has been put into improving the performance of public key algorithms for 32-bit ARM chips.

wolfSSL now has arguably the best performance for P256 ECC, Curve25519 and Ed25519 on Cortex-M4 and Cortex-A32. With highly optimized assembly implementations of the multiplication and squaring operations, you now get about twice as many operations per second!

By compiling in the high performance SP code and using the assembly versions you get the best performance for your embedded device.

Take for example running wolfSSL on a Cortex-M4 at 80MHz with the following defines:

#define WOLFSSL_HAVE_SP_ECC
#define WOLFSSL_SP_NO_MALLOC
#define WOLFSSL_SP_ARM_CORTEX_M_ASM
#define WOLFSSL_SP_SMALL

With wolfSSL 5.6.3 the performance numbers for the ECC and Curve25519/Ed25519 algorithms are:

ECC   [      SECP256R1]   256  key gen        32 ops took 1.000 sec, avg 31.250 ms, 32.000 ops/sec
ECDHE [      SECP256R1]   256    agree        16 ops took 1.098 sec, avg 68.625 ms, 14.572 ops/sec
ECDSA [      SECP256R1]   256     sign        24 ops took 1.019 sec, avg 42.458 ms, 23.553 ops/sec
ECDSA [      SECP256R1]   256   verify        12 ops took 1.141 sec, avg 95.083 ms, 10.517 ops/sec
CURVE  25519  key gen        32 ops took 1.020 sec, avg 31.875 ms, 31.373 ops/sec
CURVE  25519    agree        32 ops took 1.012 sec, avg 31.625 ms, 31.621 ops/sec
ED     25519  key gen        80 ops took 1.000 sec, avg 12.500 ms, 80.000 ops/sec
ED     25519     sign        64 ops took 1.031 sec, avg 16.109 ms, 62.076 ops/sec
ED     25519   verify        28 ops took 1.011 sec, avg 36.107 ms, 27.695 ops/sec

But with wolfSSL 5.6.4 the performance is massively improved:

ECC   [      SECP256R1]   256  key gen        72 ops took 1.027 sec, avg 14.264 ms, 70.107 ops/sec
ECDHE [      SECP256R1]   256    agree        34 ops took 1.036 sec, avg 30.471 ms, 32.819 ops/sec
ECDSA [      SECP256R1]   256     sign        44 ops took 1.020 sec, avg 23.182 ms, 43.137 ops/sec
ECDSA [      SECP256R1]   256   verify        24 ops took 1.082 sec, avg 45.083 ms, 22.181 ops/sec
CURVE  25519  key gen        80 ops took 1.000 sec, avg 12.500 ms, 80.000 ops/sec
CURVE  25519    agree        84 ops took 1.020 sec, avg 12.143 ms, 82.353 ops/sec
ED     25519  key gen       165 ops took 1.000 sec, avg 6.061 ms, 165.000 ops/sec
ED     25519     sign       110 ops took 1.000 sec, avg 9.091 ms, 110.000 ops/sec
ED     25519   verify        74 ops took 1.008 sec, avg 13.622 ms, 73.413 ops/sec

Most operations are about twice as fast, while the Curve25519 operations and Ed25519 verify are more than 2.5 times as fast!
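
To put these timings in terms of clock cycles: at 80 MHz, one millisecond is 80,000 cycles. A minimal sketch of that conversion follows. `CPU_CLOCK_HZ`, `cycles_per_op` and `speedup` are hypothetical helper names introduced for illustration, not wolfSSL APIs.

```c
/* Clock frequency of the Cortex-M4 used for the benchmarks above. */
#define CPU_CLOCK_HZ 80000000.0

/* Convert an average time per operation (in ms) to CPU clock cycles. */
double cycles_per_op(double clock_hz, double avg_ms)
{
    return clock_hz * avg_ms / 1000.0;
}

/* How many times faster the newer timing is than the older one. */
double speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}
```

For example, `cycles_per_op(CPU_CLOCK_HZ, 30.471)` gives about 2.44 million cycles for a P-256 ECDHE agreement in 5.6.4, and `speedup(31.625, 12.143)` gives about 2.6 for the Curve25519 agree operation.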

RSA has seen more modest gains when compiling for small SP code. Before, with 5.6.3:

RSA     2048   public        38 ops took 1.043 sec, avg 27.447 ms, 36.433 ops/sec
RSA     2048  private         2 ops took 2.016 sec, avg 1008.000 ms, 0.992 ops/sec

And after, with 5.6.4:

RSA     2048   public        42 ops took 1.039 sec, avg 24.738 ms, 40.423 ops/sec
RSA     2048  private         2 ops took 1.329 sec, avg 664.500 ms, 1.505 ops/sec

Notably, the RSA private key operation, which corresponds to RSA signing, is about 50% faster. (Watch this space for further improvements to these numbers!)

Equivalent improvements are seen with Arm32 CPUs that have the UMAAL instruction. This includes all CPUs implementing ARMv7-A and ARMv8-A.

Try it out and get the best public key cryptography performance for your device.
