NTTX is an FFTX-like package that leverages SPIRAL's capability of autonomous code generation and platform-based autotuning to generate high-performance code of Number Theoretic Transforms (NTTs), which are widely used to accelerate Fully Homomorphic Encryption (FHE), a post-quantum encryption scheme. Mirroring the structure of FFTW and FFTX, the NTTX package offers FFTW-style C/C++ API in line with FFTX-style code generation.
Fully Homomorphic Encryption (FHE) allows direct computation on sensitive information through mathematical transforms upon encrypted data. FHE uses lattice-based cryptography to encode vectors of numerical data onto mathematical structures (lattices and rings), thereby enabling a set of basic arithmetic operations to be performed while encrypted. These operations can be combined for applications like pattern matching, linear algebra, basic statistics, and machine learning.
To encode and operate on different types of data, several schemes have been proposed, including BGV, BFV, and CKKS. All these schemes are based on lattice cryptography, requiring a core set of mathematical operations in the form of integer modulo vector arithmetic. A major optimization for FHE schemes is using Number Theoretic Transform (NTT) to accelerate the computation of vector convolutions, similar to the fast Fourier Transform (FFT) in the signal processing domain. Akin to FFT, the need for automatic high-performance code generation and autotuning NTT is therefore imminent.
SPIRAL is a code generation system that takes in high-level mathematical abstractions and synthesizes highly-optimized implementations. Leveraging SPIRAL's capability of autonomous code generation and platform-based autotuning, we expand SPIRAL to the NTT domain based on prior work on FFTX and modular FFT. While the current prototype supports ANSI C, CUDA, and a specialized vector ISA, the code generation backend targets all SPIRAL-supported platforms.
GPU
We benchmarked SPIRAL-generated batch NTTs' performance on Bridges-2 GPU nodes at Pittsburgh Supercomputing Center, using one NVIDIA Tesla V100 SXM2 node with 32GB GPU memory and compute capability 7.0.
Ring Processing Unit
The Ring Processing Unit (RPU) is a high-performance multi-tile architecture built specifically for FHE acceleration. A cycle-level performance and functional simulator was developed to study RPU and conduct design space exploration. It was written in C++ and models each aspect of the design, faithfully processing B512 code, the vector ISA supported by RPU, generated by SPIRAL. The simulator is configurable to consider different VDM banking strategies, allocations of HPLEs, pipeline depths, and component IP (e.g., multiplier).
A “First Look” version of the SPIRAL NTTX package is coming soon.
Copyrights to many of the above papers are held by the publishers. The attached PDF files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder. Some links to papers above connect to IEEE Xplore with permission from IEEE, and viewers must follow all of IEEE's copyright policies.
Please email Naifeng Zhang at [naifengz at cmu dot edu] or visit Franz Franchetti’s projects page.