Hi @zkertesz,
The short answer is parallelization of the decrypts.
The software solution encrypts block by block and also decrypts block by block, both in sequence, which is why you see similar times for the two. There is no way around encrypting in sequence, because the input to the next block depends on the encrypted output of the current block, so you can't start on the next block until the current one is finished (this is the whole point of "block chaining").
However, when decrypting you can parallelize, because all the blocks are already encrypted, and to decrypt a given block you only need that block plus the encrypted block that preceded it. So let's imagine this scenario:
BlockA -> BlockB -> BlockC -> BlockD -> BlockE
During encryption you have to encrypt in sequence:
BlockA before you can encrypt BlockB, BlockB before BlockC, and so on. That is NOT the case for decryption.
During decryption you can decrypt BlockE using BlockD's encrypted version while at the same time decrypting BlockD with BlockC's encrypted version, and BlockC with BlockB's encrypted version, and so on. It takes more memory because you load one copy of the encrypted BlockC for decrypting BlockD while also loading another copy of the encrypted BlockC to decrypt BlockC itself (hope this all makes sense).
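To make that concrete, here is a rough Python sketch of the idea. It is not our actual code: `toy_encrypt_block` and `toy_decrypt_block` are made-up, insecure stand-ins for the AES block operation, and all names here are hypothetical. It only shows that encryption has to walk the chain in order, while each decrypted block needs nothing more than its own ciphertext and the previous ciphertext block, so the decrypts can be handed out to workers independently.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes per block, same as AES


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


# Toy stand-in for the AES block cipher (NOT secure, illustration only):
# XOR with the key, then rotate the bytes left by one position.
def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]


def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    unrotated = block[-1:] + block[:-1]  # rotate right by one
    return xor(unrotated, key)


def cbc_encrypt(key: bytes, iv: bytes, plain_blocks: list) -> list:
    # Strictly sequential: each block needs the previous ciphertext block.
    cipher_blocks, prev = [], iv
    for block in plain_blocks:
        prev = toy_encrypt_block(key, xor(block, prev))
        cipher_blocks.append(prev)
    return cipher_blocks


def cbc_decrypt_parallel(key: bytes, iv: bytes, cipher_blocks: list) -> list:
    # Every "previous" block is already known before any decrypt starts.
    prevs = [iv] + cipher_blocks[:-1]

    def decrypt_one(i: int) -> bytes:
        return xor(toy_decrypt_block(key, cipher_blocks[i]), prevs[i])

    # Each index is independent, so the work can be spread across workers.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decrypt_one, range(len(cipher_blocks))))


key = bytes(range(BLOCK_SIZE))
iv = bytes([0xA5] * BLOCK_SIZE)
plain = [bytes([65 + i] * BLOCK_SIZE) for i in range(5)]  # BlockA..BlockE

encrypted = cbc_encrypt(key, iv, plain)
assert cbc_decrypt_parallel(key, iv, encrypted) == plain
```

Python threads will not actually make this faster; the thread pool is only there to show that no decrypt has to wait for another one. A real implementation would spread the blocks over hardware pipelines or native threads.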
Anyway, long story short: we could achieve something similar in software as well. We just have not yet added a parallelized software solution, whereas AES-NI on Intel hardware already implements the parallelized decrypt.
Hope this helps.
- K