designated ARM microcontrollers as an important benchmarking platform for
its Post-Quantum Cryptography standardization process (NISTPQC).
In view of this, we explore the design
space of the NISTPQC finalist Saber on the Cortex-M4 and
its close relation, the Cortex-M3. In the process, we investigate
various optimization strategies and memory-time tradeoffs for number-theoretic
transforms (NTTs).
Recent work by Chung et al. has shown that NTT multiplication is superior to
Toom–Cook multiplication for unprotected Saber implementations on the
Cortex-M4 in terms of speed.
However, it remains unclear whether NTT multiplication can outperform Toom–Cook
in masked implementations of Saber.
It is also an open question whether Saber with NTTs can outperform Toom–Cook
in terms of stack usage.
We answer both questions in the affirmative.
Additionally, we present a Cortex-M3 implementation of Saber using NTTs
outperforming an existing Toom–Cook implementation.
Our stack-optimized unprotected M4 implementation uses around the same
amount of stack as the most
stack-optimized implementation using Toom–Cook while being 33%-41% faster.
Our speed-optimized masked M4 implementation is 16% faster than the
fastest masked implementation using Toom–Cook.
For the Cortex-M3, we outperform existing implementations by 29%-35% in speed.
We conclude that for both stack- and speed-optimization purposes, one
should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4.
In particular, in many cases, composite moduli NTTs perform best.