By Tom Herbert

Domain Specific Accelerators: The Good, the Bad, and the Ugly

The writing is on the wall! In order to keep computing technology moving forward, we need to integrate more hardware into the programmable critical path. Domain Specific Accelerators, or DSAs, are the (proposed) answer. The premise is simple enough: use specialized hardware components that accelerate certain workloads within a specific domain or application. But like almost everything else these days, they're no panacea! I think it's a case of the good, the bad, and the ugly. There are different solutions in the space, each with its pluses and minuses (i.e. the good and the bad). The ugly is the set of impediments to taking full advantage of Domain Specific Accelerators.


In this discussion, we'll focus on the networking domain, but a lot of this will apply to other domains. We propose three fundamental models for Domain Specific Accelerators in networking:


  • Offloads

  • Acceleration instructions

  • Accelerator engines


These models have been applied in different solutions for networking. The good and the bad of each model mean there is no one-size-fits-all solution-- sometimes the models are complementary, and sometimes there's tension between them. My belief is that hybrid solutions that leverage the best of all three models are the future! Let's take a look at the good, the bad, and the ugly of each model and how we can address the ugly.

Example system architecture for accelerators. This diagram shows a hybrid accelerator model that employs offloads, acceleration instructions, and accelerator engines.

Offloads

Offload is an early form of DSA. Almost every commercial NIC (Network Interface Card) in existence offers some form of offload. The basic idea is that functionality that would otherwise run on the host CPU runs in a hardware device instead. Use cases of offloads include checksum offload, TCP segmentation offload, receive side coalescing, TLS offload, and TCP offload.


The programming model of offloads is a “tail call” from a program running on the host CPU to the device, or from the device to the host CPU. In the transmit path, instructions to request offloaded functions are specified in a transmit descriptor; the NIC interprets the descriptor and performs the requested offload processing (for instance, computing and setting the TCP checksum). Receive offload works in reverse: the NIC autonomously performs offload processing on received packets and reports the results in receive descriptors (for instance, the ones’ complement sum over the packet for verifying checksums).
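
As a concrete illustration, here is a minimal C sketch of a transmit descriptor that requests checksum offload. The field names and layout are hypothetical-- real NICs each define their own descriptor formats-- but the shape of the interface is typical.

/*
 * Hypothetical transmit descriptor with a checksum offload request.
 * Field names and layout are illustrative only.
 */
#include <stdint.h>

struct tx_descriptor {
    uint64_t buffer_addr;   /* DMA address of the packet buffer */
    uint16_t buffer_len;    /* packet length in bytes */
    uint16_t flags;         /* per-packet offload requests */
    uint8_t  csum_start;    /* offset where checksumming begins */
    uint8_t  csum_offset;   /* where the NIC writes the computed checksum */
    uint16_t reserved;
};

#define TXD_FLAG_CSUM   0x0001  /* ask the NIC to compute the L4 checksum */

/* Fill a descriptor to send one packet with TCP checksum offload
 * ("fire and forget"-- the driver never sees the checksum itself). */
static void fill_tx_descriptor(struct tx_descriptor *d, uint64_t dma_addr,
                               uint16_t len, uint8_t tcp_offset)
{
    d->buffer_addr = dma_addr;
    d->buffer_len  = len;
    d->flags       = TXD_FLAG_CSUM;
    d->csum_start  = tcp_offset;        /* start of the TCP header */
    d->csum_offset = tcp_offset + 16;   /* offset of the TCP checksum field */
    d->reserved    = 0;
}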


The good of offloads is the simple programming and security model-- the details of an offload are abstracted away from the programmer by the operating system and device driver. The "fire and forget" aspect means the program doesn't need to worry about status or errors for the offloaded operations; the system takes care of that.


The bad of offloads is that there’s no way to leverage offload functionality outside of the networking path with any granularity. For instance, if a NIC has an encryption engine, it can only be used in the context of sending a packet; a user program running on the CPU can’t call the engine to encrypt an arbitrary block of host memory.


The ugly of offloads is that their deployment and ubiquity are underwhelming. Only a few simple offloads, like checksum offload, have been widely deployed. The impediment is that hardware offloads are only approximations of the software functionality; even small discrepancies can lead to nondeterministic or incorrect behaviors.


To address the ugliness we define the “fundamental requirement of offloads”:


The functionality of a hardware offload must be exactly the same as that of the CPU software being offloaded.


While the fundamental requirement is easily stated, it raises the obvious question: how can the requirement be met? Our answer is to run the same code on the CPU and on the offload device.

Programmable devices are an enabler for this. We can infer that an offload is viable if the same program code runs on both the host CPU and the device. Specifically, we would generate two images from the same source code: one that runs on the CPU and one that runs on the offload device. There is quite a bit of work needed to make this a first-class solution, but we are on the right path. For instance, target-agnostic programming models, like the Common Parser Representation, help a lot.
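
To make the idea concrete, here is a minimal sketch of the "one source, two images" approach using the ones' complement (Internet) checksum. The function is plain C; the build commands in the comment are illustrative assumptions (a host compiler plus a hypothetical device toolchain), not a SiPanda workflow.

/*
 * checksum.c -- one source file, two build targets (sketch).
 * Compiling the same code for the host CPU and for the offload device
 * is one way to satisfy the "fundamental requirement of offloads":
 *
 *   host:    cc -O2 -c checksum.c -o checksum_host.o
 *   device:  device-cc -O2 -c checksum.c -o checksum_dev.o   (hypothetical toolchain)
 */
#include <stdint.h>
#include <stddef.h>

/* Standard ones' complement (Internet) checksum over a buffer. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint16_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {               /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len)                        /* trailing odd byte */
        sum += *(const uint8_t *)p;

    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}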

TCP transmit checksum offload. This diagram shows the processing and players for doing TCP checksum offload on transmit

Acceleration instructions

Acceleration instructions are CPU instructions that perform some domain specific functionality (more than just typical integer operations). Examples of acceleration instructions for networking are AES, CRC, and parser instructions.


The good of acceleration instructions is the super simple programming model. Typically, the instructions are used in the back end of library functions or are emitted by the compiler, such that the programmer doesn’t even know they’re using them.
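
To make that concrete, here is a small example using the x86 SSE4.2 CRC32 instruction through its compiler intrinsic (x86 is used purely for illustration; the same idea applies to the AES, CRC, and parser instructions mentioned above). Callers just use the helper; the acceleration instruction is an implementation detail of the library.

/*
 * CRC32C helper built on the x86 SSE4.2 CRC32 instruction via its
 * compiler intrinsic. Build with: cc -O2 -msse4.2 crc32c.c
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <nmmintrin.h>      /* _mm_crc32_u64, _mm_crc32_u8 */

uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t c = crc;

    while (len >= 8) {      /* one instruction per 8 bytes */
        uint64_t v;
        memcpy(&v, p, sizeof(v));
        c = _mm_crc32_u64(c, v);
        p += 8;
        len -= 8;
    }
    while (len--)           /* remaining tail bytes */
        c = _mm_crc32_u8((uint32_t)c, *p++);

    return (uint32_t)c;
}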


The bad of acceleration instructions is that they're limited by the bounds of the CPU. Acceleration instructions work best on data already in the CPU cache and with functions that don’t require a lot of state. For memory-intensive functions, the benefits of acceleration instructions can easily be negated by the cost of moving data in and out of the CPU.


The ugly of acceleration instructions is that they need to be implemented in the CPU, and modifying a CPU isn't for the faint of heart! That may require licensing the ISA, standardizing new instructions, and adding compiler support for them. Fortunately, an open Instruction Set Architecture (ISA), like that of RISC-V, is a game changer. SiPanda has developed a general acceleration instruction for RISC-V, called the DISC instruction (Dynamic Instruction Set Computer), that supports a large class of custom accelerations. An immediate operand to the instruction is a function number that is dynamic and is mapped to a hardware function at runtime. As an example, we implemented SipHash as a DISC instruction; below is the code to compute a SipHash over sixty-four bytes-- performance is 97 cycles for RISC-V with DISC instructions compared to 793 cycles using plain integer instructions!


// Set up length and keys
li   a4, 64
mv   a5, a10
mv   a6, a11

// Hash first 32 bytes
ld   a0, 0(a9)
ld   a1, 8(a9)
ld   a2, 16(a9)
ld   a3, 24(a9)
disc a8, siphash_start

// Hash 16 more bytes
ld   a0, 32(a9)
ld   a1, 40(a9)
disc a8, siphash_round

// Hash final 16 bytes, write result in a8
ld   a0, 48(a9)
ld   a1, 56(a9)
disc a8, siphash_end2


DISC acceleration instructions. This diagram shows accelerator functions for computing SipHash and the processing flow for DISC instructions.
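
To illustrate the dynamic mapping described above, here is a hedged C sketch of how DISC function numbers (the immediates such as siphash_start, siphash_round, and siphash_end2 in the listing) might be bound to hardware functions at runtime. All names and types here are hypothetical illustrations, not the actual SiPanda interface.

/* Hypothetical sketch of runtime binding for DISC function numbers. */
#include <stdint.h>

enum disc_fn {                 /* immediates used by the disc instruction */
    DISC_FN_SIPHASH_START = 0,
    DISC_FN_SIPHASH_ROUND = 1,
    DISC_FN_SIPHASH_END2  = 2,
};

/* One entry per function number: which hardware function implements it. */
struct disc_binding {
    uint16_t hw_unit;          /* accelerator unit selected at runtime */
    uint16_t hw_opcode;        /* operation within that unit */
};

static struct disc_binding disc_map[256];

/* Bind a function number to a hardware implementation before use. */
static inline void disc_bind(enum disc_fn fn, uint16_t unit, uint16_t op)
{
    disc_map[fn].hw_unit   = unit;
    disc_map[fn].hw_opcode = op;
    /* A real implementation would also program the mapping into the CPU,
       e.g. via a control register write; that step is omitted here. */
}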

Accelerator Engines

Accelerator engines are external IP blocks accessed by the CPU over a physical interconnect, for example one based on BoW and AXI. A message-based interface allows a CPU to send requests to accelerators and get replies; the programming model is a type of Remote Procedure Call (RPC). Accelerator engines usually operate on large amounts of memory (up to gigabytes), so they optimize for data movement to and from memory. Use cases include encryption, compression, CRC, and regex matching, as well as lookups like route lookup and firewall rule matching.
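
A minimal sketch of the RPC-like model might look like the following. The request, reply, and accel_call() names are hypothetical; today each vendor defines its own interface, which is exactly the "ugly" discussed below.

/*
 * RPC-like accelerator invocation (sketch). All types and functions are
 * hypothetical illustrations of the programming model.
 */
#include <stdint.h>
#include <stddef.h>

struct accel_request {
    uint32_t opcode;        /* e.g. CRC, compress, encrypt, lookup */
    uint32_t flags;
    uint64_t src_addr;      /* input buffer (DMA-able memory) */
    uint64_t dst_addr;      /* output buffer */
    uint32_t src_len;
    uint32_t dst_len;
};

struct accel_reply {
    int32_t  status;        /* 0 on success */
    uint32_t out_len;       /* bytes produced */
    uint64_t result;        /* small results, e.g. a CRC or hash value */
};

/* Stub: a real implementation would write the request into the engine's
 * submission queue and poll or sleep until the reply arrives. */
static int accel_call(int engine, const struct accel_request *req,
                      struct accel_reply *rep)
{
    (void)engine; (void)req; (void)rep;
    return -1;              /* no hardware present in this sketch */
}

/* Example: compress a buffer on engine 0. */
static int compress_block(const void *src, size_t len, void *dst, size_t cap)
{
    struct accel_request req = {
        .opcode   = 1,      /* hypothetical COMPRESS opcode */
        .src_addr = (uintptr_t)src, .src_len = (uint32_t)len,
        .dst_addr = (uintptr_t)dst, .dst_len = (uint32_t)cap,
    };
    struct accel_reply rep;

    if (accel_call(0, &req, &rep) != 0 || rep.status != 0)
        return -1;
    return (int)rep.out_len;
}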


The good of accelerator engines is the RPC-like programming model which allows a lot of granularity in accelerator functions. Since the accelerator engines are external to the CPU, a fair amount of flexibility and customization is facilitated. Accelerator engines have been implemented in chiplets using ASICs or FPGAs.


Ironically, the bad of accelerator engines is also the RPC-like model. Allowing a CPU to interact directly with accelerator hardware raises the specter of isolation and security issues. For the lowest overhead and highest performance, we want to invoke accelerator engines directly from user space applications, but that requires a lot of infrastructure for security and isolation when running on a general purpose CPU and OS. An alternative is to only invoke accelerator engines from CPUs in a closed and secure environment, like a CPU in a SmartNIC.


The ugly is the lack of a standard message interface between CPUs and accelerators. What we really want is a simple and common messaging interface for the command logic, to the extent that we'd be able to build a system that mixes and matches different CPUs and accelerator hardware from different vendors and things would just work!

SiPanda has developed a simple message format for accelerator requests and replies. The format defines a sixty-four byte message in which the first sixty-four bits hold a control header for message delivery and the following fifty-six bytes carry arguments to the accelerator function. The arguments may contain pointers to memory, including references to scatter/gather lists of memory buffers. For hardware-friendly scatter/gather lists, we propose Packet Vector Buffers, or Pvbufs: glorified iovec arrays with the twist that an entry may point to another array. We also propose accelerator pipelining, where accelerators can be linked together such that the output of one accelerator is the immediate input of another, creating “super accelerators” that run in sequence without CPU intervention. We'll talk more about Pvbufs and accelerator pipelines in future blogs!
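
Here is a hedged C sketch of what such a message and a Pvbuf entry might look like. Only the overall sizes (a sixty-four bit control header followed by fifty-six bytes of arguments) come from the description above; the individual fields and flags are assumptions for illustration.

/*
 * Sketch of the 64-byte accelerator message and a Pvbuf entry. Only the
 * sizes come from the text; the field breakdown is assumed.
 */
#include <stdint.h>

struct accel_msg {
    uint64_t ctrl;            /* control header for message delivery */
    uint8_t  args[56];        /* arguments to the accelerator function,
                                 may include pointers / Pvbuf references */
};
_Static_assert(sizeof(struct accel_msg) == 64, "message must be 64 bytes");

/* Pvbuf entry: like an iovec, but an entry may refer to another array of
 * entries instead of a data buffer, giving a hardware-friendly tree of
 * scatter/gather lists. */
struct pvbuf_entry {
    uint64_t addr;            /* data buffer or nested pvbuf array */
    uint32_t len;             /* length in bytes */
    uint32_t flags;
#define PVBUF_F_NESTED 0x1    /* addr points to another pvbuf array */
};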

Example accelerator engine. A function is called to compute the SipHash over a large block of memory. The SipHash accelerator engine is invoked to perform the hash.

SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions, from cloud service providers to edge compute (5G), that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. The IP described here is covered by patent USPTO 12,026,546 and other patents pending.
