Wafer-Scale Engine and SPIRAL

What is it?

We explore programming scientific computing and cryptographic kernels on Cerebras’ Wafer-Scale Engine 2 (WSE-2) through the Neocortex project at the Pittsburgh Supercomputing Center. With 40 GB of distributed SRAM, 20 PB/s of memory bandwidth, and 850,000 programmable cores, WSE-2 offers the high bandwidth and low latency these workloads require. Our ultimate goal is to extend SPIRAL to automatically generate optimized code in the Cerebras Software Language (CSL) that fully exploits the performance of WSE-2.

Generating cryptographic kernels on WSE-2

SPIRAL is a code generation system that originally focused on fast Fourier transforms (FFTs) and has since expanded to a wide range of performance-critical applications, including cryptographic kernels such as the Number Theoretic Transform (NTT) in the NTTX project. The high bandwidth and low latency of WSE-2 are essential for accelerating NTT kernels and NTT-based applications such as Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKPs). We have hand-coded NTT kernels in CSL and validated their correctness on WSE-2. The next section shows a small portion of our codebase; we are actively working on using SPIRAL to generate optimized CSL code automatically.
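For readers unfamiliar with the NTT, the following host-side Python sketch shows what such a kernel computes and why it matters for polynomial arithmetic in FHE and ZKPs. It is only an illustration: the modulus Q = 17, length N = 8, root OMEGA = 2, and all function names are assumptions chosen for readability, not parameters or code from our CSL implementation.

```python
# Toy parameters chosen for illustration only (assumptions, not project values).
Q = 17      # prime modulus
N = 8       # transform length; N must divide Q - 1
OMEGA = 2   # primitive N-th root of unity mod Q: 2**8 % 17 == 1, 2**4 % 17 == 16


def ntt_naive(x, omega=OMEGA, q=Q):
    """Direct O(N^2) evaluation of X[k] = sum_j x[j] * omega^(j*k) mod q."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def intt_naive(X, omega=OMEGA, q=Q):
    """Inverse transform: use omega^-1 and scale by N^-1 (Fermat inverses, q prime)."""
    n = len(X)
    omega_inv = pow(omega, q - 2, q)
    n_inv = pow(n, q - 2, q)
    return [(n_inv * v) % q for v in ntt_naive(X, omega_inv, q)]


def cyclic_convolution(a, b, q=Q):
    """Schoolbook cyclic convolution mod q, used here as the reference result."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out


def ntt_multiply(a, b, q=Q):
    """Cyclic convolution via pointwise products in the NTT domain."""
    A, B = ntt_naive(a), ntt_naive(b)
    return intt_naive([(u * v) % q for u, v in zip(A, B)])


if __name__ == "__main__":
    a = [1, 2, 3, 4, 0, 0, 0, 0]
    b = [5, 6, 7, 0, 0, 0, 0, 0]
    assert intt_naive(ntt_naive(a)) == a
    assert ntt_multiply(a, b) == cyclic_convolution(a, b)
```

A slow reference like this is one simple way to check a fast kernel's output element by element; production parameters (large moduli, long transforms) would differ.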

WSE-2 held by a SPIRAL researcher


Example code in CSL

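// Radix-2 butterfly on one operand from each of the south and north input queues.
task butterfly() void {
   // Block this task and both receive tasks while the current pair is processed.
   @block(butterfly_id);
   @block(recv_north_id);
   @block(recv_south_id);
   var a : u32 = pop_south();
   var b : u32 = pop_north();
   // Scale b by the twiddle factor, then form the modular sum and difference.
   b = modMul(b, twiddle);
   (X_ptr.*)[0] = modAdd(a, b);
   (X_ptr.*)[1] = modSub(a, b);
   send_east();
}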

var curr_ntt : u32 = 0;

// send result of butterflies to next col
fn send_east() void {
   curr_ntt += 1;
   var send_east_dsd = @get_dsd(fabout_dsd, .{.fabric_color = east_color_out,
         .extent = 2,
         .output_queue = send_east_oq});
   if (pe_x == 1) {
      @mov32(send_east_dsd, X_dsd, .{.async = true, .unblock = recv_stage_input_id});
   } else {
      @mov32(send_east_dsd, X_dsd, .{.async = true, .activate = ublk_stage_1_id});
   }
   if (curr_ntt == num_ntts) memcpy.unblock_cmd_stream();
}

// two stage ublk in order to avoid a race with the redirect operations
task ublk_stage_1() void {
   if (north_push_idx != north_pop_idx) @unblock(butterfly_id);
   @unblock(recv_north_id);
   @activate(ublk_stage_2_id);
}

task ublk_stage_2() void {
   if (south_push_idx != south_pop_idx) @activate(butterfly_id);
   @unblock(recv_south_id);
}
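The arithmetic performed by the butterfly task can be modeled on the host in a few lines of Python. The sketch below is an assumption-laden illustration, not our CSL code: mod_add, mod_sub, and mod_mul are hypothetical stand-ins for modAdd, modSub, and modMul, the toy values Q = 17 and OMEGA = 2 are chosen for readability, and the fabric routing and task scheduling shown above are not modeled at all. It builds a complete radix-2 decimation-in-time NTT out of the same butterfly and compares it against the direct definition.

```python
Q = 17      # toy prime modulus (assumption)
OMEGA = 2   # primitive 8th root of unity mod 17 (assumption)


def mod_add(a, b, q=Q):
    return (a + b) % q


def mod_sub(a, b, q=Q):
    return (a - b) % q


def mod_mul(a, b, q=Q):
    return (a * b) % q


def butterfly(a, b, twiddle, q=Q):
    """Same arithmetic as the CSL task: scale b, then emit sum and difference."""
    b = mod_mul(b, twiddle, q)
    return mod_add(a, b, q), mod_sub(a, b, q)


def bit_reverse(x):
    """Reorder a power-of-two-length list by bit-reversed index."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]


def ntt_butterflies(x, omega=OMEGA, q=Q):
    """Iterative radix-2 decimation-in-time NTT built only from butterfly()."""
    n = len(x)
    X = bit_reverse(list(x))
    half = 1
    while half < n:
        # Twiddle step for this stage: a primitive (2*half)-th root of unity.
        w_step = pow(omega, n // (2 * half), q)
        for start in range(0, n, 2 * half):
            w = 1
            for k in range(half):
                i, j = start + k, start + k + half
                X[i], X[j] = butterfly(X[i], X[j], w, q)
                w = mod_mul(w, w_step, q)
        half *= 2
    return X


def ntt_naive(x, omega=OMEGA, q=Q):
    """Direct O(N^2) definition, used as the reference."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


if __name__ == "__main__":
    x = [3, 1, 4, 1, 5, 9, 2, 6]
    assert ntt_butterflies(x) == ntt_naive(x)
```

In this model each pass of the while loop is one butterfly stage; how stages and butterflies are assigned to processing elements on the wafer is a separate design decision that the model leaves out.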

References

Coming soon.
