CPU-in-the-datapath: We can have our cake and eat it too!

Tom Herbert, SiPanda CTO, October 21, 2024

A few years ago I was attending a conference and happened to overhear a conversation between a couple of engineers working for router vendors. The gist of it was that a CPU could never be in the router fast path; we'll always need specialized hardware! Well, that got me thinking... I know it's conventional wisdom that CPUs are "too slow" for the performance-critical datapath, but is that inherently true? Can we achieve full datapath programmability at the speed of hardware, so that we can have our cake and eat it too?!


Domain Specific CPUs to the rescue!

I've come to realize that our router engineer friends made one critical assumption: they were only considering general purpose, commodity CPUs. Nowadays we can turn to Domain Specific Architecture, where we architect the solution around the problem. This motivates pursuing Domain Specific CPUs: the idea is to customize the CPU and supporting uncore to optimize network datapath processing. Just five years ago this might have been a non-starter, but thanks to the recent trend towards Open ISA (Instruction Set Architecture), and RISC-V in particular, Domain Specific CPUs are now feasible. The upshot is that we can buck conventional wisdom and get the programmability and flexibility of CPUs with performance rivaling that of dedicated hardware! We call this CPU-in-the-datapath!


To pull off CPU-in-the-datapath we have to leave no stone unturned to squeeze out every ounce of performance. To that end, we came up with seven design principles:

  • Chuck unnecessary CPU features: Sometimes it's not what you put in, but what you take out! The Floating Point Unit, Vector Unit, and probably even the MMU aren't needed in the high performance network datapath so they're out. RISC-V is a modular design so we can easily remove them to benefit PPA (Power, Performance, and Area).

  • Adios OS: At the performance levels we're talking about, we just can't afford the overhead of an OS. A big OS like Linux is right out, and even running an RTOS would be dicey. So we'll run application code on bare metal CPUs, and OS-like services are provided by the infrastructure, either as custom instructions or hardware accelerators.

  • Parallelism and the threading model are paramount: Parallelism is critical to getting any sort of performance from CPUs, so we'll want a bunch of CPUs running in parallel. We need the techniques of horizontal and vertical parallelism, as well as threading and synchronization. All of this is supported by a specialized hardware thread scheduler.

  • Manage memory access: Cache misses can quickly kill performance, so our goal is zero cache misses in the critical path! The hardware scheduler can prefetch all the data a function might access before even running a thread. A nice side effect and design simplification is that we don't need CPUs to be cache coherent.

  • Use accelerator instructions: Once we've bought into an open ISA, developing custom accelerator instructions is pretty straightforward (we've already developed a few dozen instructions for the network processing domain). See the sketch after this list for what invoking one might look like.

  • Use accelerator engines: Some things are better left to hardware rather than trying to force everything through the CPU. In particular, functions on packet payload, like encryption and compression, as well as lookup tables that need a lot of memory, are better handled by accelerator engines. The program in the CPU is the orchestrator, and we want it to seamlessly invoke inline accelerators with near-zero overhead.

  • An easy-to-use programming model: The programming model for hardware needs to be easy to use or no one's going to want the hardware! In our design, we want to let users easily program their datapath in the language of their choice and compile to a range of software and hardware targets, including, but not limited to, domain specific CPUs.
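
As a concrete illustration of the accelerator-instruction principle above, here's a minimal sketch of what invoking such an instruction could look like from bare-metal C on RISC-V. The opcode, funct fields, and the prs_helper() wrapper are illustrative placeholders, not SiPanda's actual encodings; the point is that a custom instruction can be emitted with the standard GNU assembler .insn directive, so no toolchain changes are needed.

    #include <stdint.h>

    /* Hypothetical "parse helper" instruction on the RISC-V custom-0 opcode
     * (0x0b), encoded with the assembler directive:
     *   .insn r opcode, funct3, funct7, rd, rs1, rs2
     */
    static inline uint64_t prs_helper(uint64_t hdr_ptr, uint64_t hdr_len)
    {
            uint64_t result;

            __asm__ volatile (".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                              : "=r" (result)
                              : "r" (hdr_ptr), "r" (hdr_len));
            return result;
    }

A wrapper like this compiles down to a single instruction, so the accelerator is exposed to ordinary C code with no function call or driver overhead.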


Design for CPU-in-the-datapath

Now that we've spelled out the design principles, let's talk a little about the actual design. The architecture we're presenting here is very high level, and there are a lot of details and a whole bunch of IP we're leaving out for brevity, but it should nevertheless be enough to convey an elegant and feasible design. The picture below illustrates the top-level design.

The design features a number of components that form a loosely defined pipeline. Components communicate by sending messages on lockless hardware FIFOs. Packets are input through front-end interfaces and are immediately dispatched by the Dispatcher to CPU clusters for processing. CPU clusters are where the programmable processing happens. For the performance levels we've been talking about, it's going to take a lot of CPUs running in parallel. For high-end performance of 1 terabit per second of throughput and 1 billion packets per second, we project we'll want between 128 and 256 CPUs. To keep things manageable, we split the CPUs into clusters of about eight CPUs each. The anatomy of a cluster is shown below.
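
To give a feel for the Dispatcher handoff, the sketch below shows a plausible work item layout and round-robin cluster selection. The struct fields mirror the information listed in step 2 of the flow below; the fifo_push_start() helper and the NUM_CLUSTERS value are assumptions made for illustration, not the actual hardware interface.

    #include <stdint.h>

    #define NUM_CLUSTERS 16         /* illustrative: e.g. 128 CPUs / 8 per cluster */

    /* Work item describing a received packet. */
    struct work_item {
            uint64_t pkt_addr;      /* pointer to the packet in packet memory */
            uint32_t pkt_len;       /* packet length */
            uint16_t csum;          /* checksum computed on receive */
            uint16_t rx_intf;       /* receiving interface */
            uint64_t timestamp;     /* receive timestamp */
    };

    /* Hypothetical helper: enqueue a START message carrying the work item on a
     * cluster's lockless hardware input FIFO. */
    void fifo_push_start(unsigned int cluster, const struct work_item *wi);

    /* Clusters are identical and there is no flow affinity, so simple round
     * robin is enough to pick one. */
    void dispatch_packet(const struct work_item *wi)
    {
            static unsigned int next_cluster;

            fifo_push_start(next_cluster, wi);
            next_cluster = (next_cluster + 1) % NUM_CLUSTERS;
    }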

The basic flow for processing a received packet is straightforward. Referring to the diagram below, the processing steps are:

  1. A packet is received on one of the interfaces. This could be a physical interface like Ethernet, or a PCIe interface to the host CPU running Linux, or some sort of virtual interface like one that might be used to reinject packets.

  2. The Dispatcher gets an indication from the interface and creates a work item describing all the pertinent information for processing the packet (including packet pointer, packet length, packet checksum, timestamp, etc.). The Dispatcher selects a cluster and sends a START message containing the work item to commence processing. All clusters are expected to be identical and we don't rely on flow affinity, so the Dispatcher doesn't need a fancy algorithm to select a cluster; round robin or random selection should be sufficient.

  3. The Cluster Frontend dequeues START messages and allocates a "packet state" in cluster local memory that consists of a parsing buffer, a metadata buffer, and a cluster work item for the packet.

  4. The Cluster Frontend streams the first N bytes of the packet into the parsing buffer, copies the external work item to the local packet state work item, and zeroes the metadata buffer.

  5. The Cluster Frontend selects a parser (round robin is fine) and sends a START_PACKET message with a reference to the packet on the parser's input FIFO.

  6. The parser dequeues a START_PACKET message and runs the user's parser program in the CPU. The program can use parser instructions.

  7. At various points in the parser program the prs.runthread instruction can be executed. For each instance, a thread work item is created that describes all the pertinent information for processing the associated protocol header including the offset and length of the current header (or offset and length of a data header in the case of a TLV).

  8. For each thread request, the parser sends a message to the cluster scheduler with a reference to the thread work item. Multiple threads might be scheduled for processing a packet. The last request message for a packet has type CLOSE_THREAD_SET and messages before the last one have type START_THREAD_SET.

  9. The cluster scheduler dequeues messages from its input FIFO. When the first thread request for a packet is received, the cluster scheduler allocates a thread set to process the packet. For each thread request message in the train (up to and including the one with type CLOSE_THREAD_SET), the cluster scheduler selects an available worker CPU and sends a START_THREAD message to its CPU scheduler. The message includes a reference to the thread work item (see the accounting sketch after these steps).

  10. The CPU scheduler dequeues a START_THREAD message and schedules the function to run in the requested CPU thread. The message includes the function number (from prs.runthread) as well as input arguments to the function. When the thread completes, the CPU scheduler sends a DONE message to the cluster scheduler, and once all the threads of a thread set have finished, the thread set is done and can be reused.
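
As mentioned in steps 9 and 10, the cluster scheduler has to track when a thread set is complete. The sketch below shows one plausible way to do that accounting; the struct layouts and the select_worker_cpu(), send_start_thread(), and free_thread_set() helpers are illustrative assumptions, not the actual scheduler interface.

    #include <stdbool.h>
    #include <stdint.h>

    struct thread_work_item;                  /* header offset/length, function number, ... */

    /* Per-packet accounting of threads scheduled on worker CPUs. */
    struct thread_set {
            uint16_t threads_started;
            uint16_t threads_done;
            bool     closed;                  /* CLOSE_THREAD_SET has been seen */
    };

    unsigned int select_worker_cpu(void);     /* pick an available worker CPU */
    void send_start_thread(unsigned int cpu, const struct thread_work_item *twi);
    void free_thread_set(struct thread_set *ts);

    /* Steps 8 and 9: a thread request arrives from the parser. */
    void cluster_sched_thread_request(struct thread_set *ts,
                                      const struct thread_work_item *twi,
                                      bool close_thread_set)
    {
            send_start_thread(select_worker_cpu(), twi);  /* START_THREAD to a CPU scheduler */
            ts->threads_started++;
            if (close_thread_set)
                    ts->closed = true;
    }

    /* Step 10: a worker's CPU scheduler reports DONE. */
    void cluster_sched_thread_done(struct thread_set *ts)
    {
            ts->threads_done++;
            if (ts->closed && ts->threads_done == ts->threads_started)
                    free_thread_set(ts);      /* all threads finished; thread set can be reused */
    }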

The CPU scheduler, cluster scheduler, and global scheduler work together to do dependency resolution; and the CPU scheduler, cluster scheduler, and accelerator engines work together to allow programs to invoke accelerator engines. We'll talk more about this in later blogs!


SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions, from cloud service providers to edge compute (5G), that don't require the compromises inherent in today's network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.
