Tom Herbert, SiPanda CTO, October 28, 2024
Back in the late 1990s and early 2000s there was a ton of hype around TCP Offload Engines. The basic premise of TCP Offload Engines, or just TOE, was to run the TCP/IP networking stack in NIC hardware instead of in host software. By throwing hardware at the problem we expected to increase TCP performance by at least an order of magnitude. There was a lot of effort and investment in TOE; in fact, I even worked on it at a startup called Silverback Systems. Oh how naive we were! We completely underestimated the realities of putting a protocol as complex as TCP in hardware, and I think we can now safely say that TOE was an abysmal failure. But like so many dumb ideas before it, the motivation for TOE, or more specifically TCP Offload, never completely went away. Today we'll discuss a modern take on it!
Where did TCP Offload Engines go wrong?
So why did TOE fail so spectacularly? It turns out there were several issues, mostly having to do with the inflexibility of hardware:
1. A TCP implementation in hardware is difficult to fix, which becomes especially problematic when security issues in the TCP stack are found
2. Under some edge conditions, a hardware implementation of TCP may actually yield worse performance than a TCP stack running in the host OS
3. The cost of offload is the same for every TCP connection; there's no way for the operating system to optimize specific use cases
4. The NIC code wasn't written with the full breadth of TCP in mind, so not all TCP features are implemented
5. For the offload to be transparent, TCP acceleration required seamless integration into the host OS stack. Microsoft was able to do this with Chimney; however, any attempt to introduce that sort of integration into Linux is simply a non-starter
A modern-day rationale for TCP Offload
TOE as TCP in hardware may have failed, but there is still a rationale for TCP Offload that doesn't necessarily put all of TCP in hardware. As it turns out, the problem isn't so much about running TCP faster; it's about where in the system TCP is being run! Let me explain...
The first observation we can make is that even while generic TOE solutions failed to get traction, one last bastion of it has seen deployment: RDMA/TCP and later NVMe/TCP. The difference between these and generic TOE is that they assume a constrained set of use cases and workloads, which mitigates TOE problems 1-4. Another advantage is that the offload doesn't need to be integrated into the host TCP stack; instead, the device provides block interfaces like RDMA queue pairs to the host and internally maps the queues over TCP connections, which addresses problem 5.
The second observation is that cloud providers want to make money! One way they do that is by monetizing their server CPUs, charging for CPU cycles used, so when those CPUs are doing something other than executing customers' code the providers tend to get cranky! Burning precious CPU cycles on network stack processing, or really any sort of infrastructure processing, is perceived as lost monetization. TCP Offload is one way to free up CPU cycles from infrastructure processing; even if it doesn't make TCP run faster, it's still a benefit to cloud providers.
Introducing TCP Offload via AF_XDP Sockets
Okay, let's assume that there's good motivation for TCP Offload. The question is then how to do TCP Offload without hitting the problems of TOE. Our answer is called TCP Offload via AF_XDP Sockets. In a nutshell, we run a software TCP stack on a NIC CPU alongside an "offload proxy". An application running on the server CPU communicates with the proxy to create connections and perform data transfer operations using a lightweight message protocol that runs over AF_XDP sockets. The net result is that almost all of the "heavy lifting" for processing TCP is handled by the NIC CPU, thereby freeing up host CPU cycles for application use. The picture below gives the gist of the idea.
The diagram below provides a little more detail. The server application, shown on the left, links with the xdp_transport library. The library implements the XTS sockets interface that provides connection management and data operations for offloaded TCP connections. XTS sockets look a whole bunch like BSD sockets-- for instance, there are xts_socket, xts_connect, xts_accept, xts_sendmsg, and xts_recvmsg calls. XTS sockets provide the same semantics, and the functions take the same arguments as those in BSD sockets, except that sockets are referred to by an XTS handle instead of a file descriptor. The user can write their network program using XTS sockets following the same basic program flow as with BSD sockets, as in the sketch below.
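For concreteness, here is a minimal client sketch using the XTS calls named above. The exact signatures, the xts_handle_t type, the xdp_transport.h header name, and the xts_close call are assumptions on my part; they simply mirror their BSD socket counterparts as described.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#include "xdp_transport.h"   /* XTS sockets library header (name assumed) */

/* Sketch of a simple client over an offloaded TCP connection. Signatures are
 * assumed to mirror their BSD socket counterparts, with an xts_handle_t in
 * place of a file descriptor. */
int xts_client_example(const char *req, size_t req_len)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };

    xts_handle_t h = xts_socket(AF_INET, SOCK_STREAM, 0);
    if (xts_connect(h, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* Send: the library turns this into a "send data" message to the proxy */
    struct iovec siov = { .iov_base = (void *)req, .iov_len = req_len };
    struct msghdr smsg = { .msg_iov = &siov, .msg_iovlen = 1 };
    xts_sendmsg(h, &smsg, 0);

    /* Receive: data arriving on the proxied connection is delivered back */
    char resp[1024];
    struct iovec riov = { .iov_base = resp, .iov_len = sizeof(resp) };
    struct msghdr rmsg = { .msg_iov = &riov, .msg_iovlen = 1 };
    ssize_t n = xts_recvmsg(h, &rmsg, 0);

    xts_close(h);   /* close call name assumed */
    return n < 0 ? -1 : 0;
}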
The offload proxy, to the right in the diagram below, runs in userspace on the NIC CPU. A "NIC CPU", or "App(lication) CPU", is a CPU integrated into the NIC that can run application code for management and data operations; for instance, ARM CPUs are commonly integrated into SmartNICs as Application CPUs. On one side of the proxy is the control interface to the host application. AF_XDP sockets are mapped to the NIC hardware queues, and offloaded connections are created and managed by the proxy via messages sent and received on the AF_XDP sockets. On the other side of the proxy are normal TCP connections represented by TCP sockets. The offload proxy receives commands to create connections from the host application, establishes the TCP connections, and signals success to the host application. Once a connection is established, data operations commence: the host application can send data over the proxied connection, and the offload proxy can deliver received data to the host application. A sketch of this control flow follows.
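Here is a hedged sketch of how the proxy might service a "create connection" request; send_host_msg() and register_offloaded_conn() are assumed helper names standing in for the AF_XDP message transport and the proxy's connection table, not a real API.

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Assumed helpers (not a real API): send a status message back to the host
 * over the AF_XDP channel, and record the TCP fd for an offloaded connection. */
extern void send_host_msg(int conn_id, const char *status);
extern void register_offloaded_conn(int conn_id, int fd);

/* Sketch: service a "create connection" request received from the host. */
static void handle_open_request(const struct sockaddr_in *dst, int conn_id)
{
    /* On the NIC CPU the proxy just uses a normal kernel TCP socket. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0 || connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
        if (fd >= 0)
            close(fd);
        send_host_msg(conn_id, "connection failed");
        return;
    }

    register_offloaded_conn(conn_id, fd);
    send_host_msg(conn_id, "connection opened");
}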
A key facet of this design is that zero changes to the kernel are required, neither on the server side nor on the NIC CPU. Minimally, we just need a kernel that supports AF_XDP sockets mapped to NIC queues (conceptually, DPDK could alternatively be used). Changes are needed to the application, but those are mostly straightforward-- it's a matter of linking to the xdp_transport library and using xts_* socket calls in the same way normal socket calls would be used. The hardware queues used for TCP Offload pretty much look like any other NIC hardware queues; they are just dedicated to the particular purpose of internal messaging. The one hard requirement in this design is that these queues must be lossless: if a message is enqueued on one side it must be dequeued and processed in order by the other side. The hardware should support proper flow control mechanisms to ensure losslessness.
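To illustrate the dedicated-queue idea, here is a minimal sketch of binding an AF_XDP socket to one NIC hardware queue using the plain kernel interface. The interface name and queue id are placeholders, and a real program must also register a UMEM and the fill/completion/RX/TX rings with setsockopt() before the bind succeeds (libxdp's xsk_socket__create() is the usual way to wrap all of that).

#include <stdio.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44   /* older libc headers may not define AF_XDP */
#endif

int main(void)
{
    int fd = socket(AF_XDP, SOCK_RAW, 0);
    if (fd < 0) {
        perror("socket(AF_XDP)");
        return 1;
    }

    struct sockaddr_xdp sxdp = {
        .sxdp_family = AF_XDP,
        .sxdp_ifindex = if_nametoindex("eth0"),  /* NIC carrying the message queues (placeholder) */
        .sxdp_queue_id = 0,                      /* dedicated hardware queue (placeholder) */
    };

    if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
        perror("bind");   /* fails here unless the UMEM and rings were set up first */

    close(fd);
    return 0;
}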
Application <-> offload proxy messages
A bidirectional, message-based interface is defined for managing offloaded connections. xdp_xport_msg_hdr is a common message header: the first byte indicates the message type and the second byte is a sequence number used to detect lost or out-of-order messages. The message type indicates the format of the rest of the message. Messages with types to create a connection, close a connection, or send data on a connection are sent from the host application to the offload proxy; messages with types like connection opened, new incoming connection, and received data are sent from the offload proxy to the host application. As shown below, the messages are encapsulated in an Ethernet frame with an experimental EtherType. This allows the internal messages to be distinguished from normal Ethernet packets that might be sent on the AF_XDP sockets, and ensures that the packets are dropped if for some reason the NIC were to send them on the network.
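A minimal sketch of what the header and a few message types might look like in C: the field sizes follow the description above, while the enum names, their values, and the example EtherType are illustrative assumptions.

#include <stdint.h>

/* Example EtherType for internal messages: 0x88B5 is one of the IEEE
 * "local experimental" EtherTypes (the actual value used is an assumption). */
#define XTS_ETH_P_OFFLOAD 0x88B5

/* Common message header, per the description: one byte of type, one byte
 * of sequence number. */
struct xdp_xport_msg_hdr {
    uint8_t type;   /* message type; selects the format of the rest of the message */
    uint8_t seqno;  /* sequence number to detect lost or out-of-order messages */
} __attribute__((packed));

/* Illustrative message types (names and values assumed) */
enum xdp_xport_msg_type {
    /* Host application -> offload proxy */
    XTS_MSG_OPEN_CONN = 1,    /* create (connect) a new offloaded connection */
    XTS_MSG_CLOSE_CONN,       /* close an offloaded connection */
    XTS_MSG_SEND_DATA,        /* send data on a connection */
    /* Offload proxy -> host application */
    XTS_MSG_CONN_OPENED,      /* connection established */
    XTS_MSG_CONN_INCOMING,    /* new incoming connection */
    XTS_MSG_RECV_DATA,        /* data received on a connection */
};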
ULP offload
The concepts described here for TCP Offload can be extended to support ULP (Upper Layer Protocol) offload sockets, thereby offloading even more processing from server CPUs. These effectively create user space sockets in the host application that allow sending and receiving of discrete protocol messages in the TCP stream. The protocol used with a ULP socket is completely programmable, so ULP sockets can be used for various protocols including RDMA/TCP, NVMe/TCP, TLS, and RPC protocols like gRPC. The picture below shows an example of creating "HTTP/2 sockets". The application just sends and receives HTTP/2 messages over the socket. As the picture suggests, functionality like TLS or message delineation can be achieved by logically pushing modules onto a TCP socket.
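To give a flavor of how this might look to an application, here is a purely illustrative sketch of an HTTP/2 ULP socket; xts_push_ulp, xts_send_ulp_msg, and xts_recv_ulp_msg are hypothetical names used only for illustration, not a published interface.

#include <sys/socket.h>
#include <netinet/in.h>

#include "xdp_transport.h"   /* XTS sockets library header (name assumed) */

/* Hypothetical sketch: push TLS and HTTP/2 ULP modules onto an offloaded
 * connection, then exchange whole HTTP/2 messages. */
void http2_over_xts(struct sockaddr_in *server_addr,
                    const char *request, size_t request_len)
{
    xts_handle_t h = xts_socket(AF_INET, SOCK_STREAM, 0);
    xts_connect(h, (struct sockaddr *)server_addr, sizeof(*server_addr));

    xts_push_ulp(h, "tls");     /* terminate TLS on the NIC CPU (hypothetical call) */
    xts_push_ulp(h, "http2");   /* delineate HTTP/2 messages in the TCP stream */

    xts_send_ulp_msg(h, request, request_len);        /* one HTTP/2 message out */

    char response[4096];
    xts_recv_ulp_msg(h, response, sizeof(response));  /* one HTTP/2 message in */

    xts_close(h);   /* close call name assumed */
}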
Souped up performance
As we described, a core concept of this design is that the offloaded TCP stack is a normal Linux stack running on the NIC CPU. This includes running the TCP state machine and datapath operations in a software stack. Using plain commodity CPUs achieves the offload functionality and gives us the flexibility of software that was lacking in TOE, but without more elaboration all this is really doing is moving the same work from one CPU to another. For higher performance, we want to use any available hardware accelerations. Since NIC CPUs run in a closed environment and in close proximity to hardware, there is ample opportunity to integrate various datapath accelerations, including those we've talked about before like segmentation offloads, Domain Specific Accelerators, CPU-in-the-datapath, and programmable parsers.
SiPanda
SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions for cloud service providers to edge compute (5G) that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.