
Segmentation offload and protocols: Let's be friends!

Tom Herbert, SiPanda CTO, September 30, 2024

One of the lessons that host networking developers have learned over the years is that packet processing is expensive! Each packet requires processing by the NIC, processing by the device driver, and processing by each layer of the host stack. A common technique to address this expense is to reduce the number of packets that need to be processed without reducing the amount of data processed; this is called segmentation offload. We'll give an overview of segmentation offload and provide some guidelines for new protocol development that takes segmentation offload into consideration.


High level design

Segmentation offload lets the host stack process one large packet in place of many small ones: packets are coalesced in the receive path, and a large packet is split late in the transmit path. Segmentation offload has variants to support IPv4 and IPv6, TCP and UDP, and has been implemented in both host software and NIC hardware. Most of the cost in packet processing, in either the transmit or receive path, is incurred as each packet traverses the software protocol stack layers; segmentation offload amortizes that per-packet overhead and reduces the cost.


Segmentation offload is widely deployed and is supported in Linux, FreeBSD, Windows, and DPDK. It's considered essential to achieving high throughput and managing power for servers. For instance, anyone serving video on the Internet is almost certainly using TSO with TCP or USO with QUIC. When used in tandem with zero copy send and checksum offload, serving video requires a fraction of the number of CPUs that would otherwise be needed without these optimizations.


TSO and GSO

TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) perform segmentation offload when transmitting TCP packets. TSO is a hardware offload dating back to the 1990s, and GSO is a pure software implementation that came a bit later. Since GSO is in software, it offers more flexibility and support for different protocols than TSO, but since TSO goes all the way to hardware it can give somewhat better performance.


TSO and GSO pretty much work the same way:

  1. The networking stack creates a large packet, greater than the MTU, from data enqueued on the send socket buffer

  2. The big packet is processed by various layers in the host stack in the transmit path

  3. A segmentation offload engine splits the big packet into a set of packets, each no larger than the MTU, for transmission on the wire

    • In the case of GSO, this happens right before the packet is given to the NIC device

    • For TSO, the big packet is sent to the NIC and a TSO engine in the device splits the packet

  4. Each of the small packets that were created is transmitted on the wire. At this point they're just plain packets with no hint that they were created by segmentation offload


A big packet is split up by dividing the TCP payload into segments and then prepending Ethernet, IP, and TCP headers to each segment to create a packet. The segment size is chosen so that each created packet is less than or equal to the MTU. The headers for the small packets are mostly copied from the big packet, with adjustments made to certain header fields on a per-packet basis. The IPv4 total length field or IPv6 payload length field is updated to match the shorter payload. Hop-by-Hop Options and Destination Options extension headers are copied as is (other extension headers, including AH, ESP, and Fragment headers, aren't compatible with segmentation offload).


The TCP header, including options, is copied to each small packet with some adjustments. The sequence number in each small packet is set to the sequence number of the previous packet plus the segment size (the sequence number in the first packet is taken from the big packet being split up). The FIN and PSH flags are only reflected in the last segment, and CWR is only reflected in the first segment. The TCP checksum is computed for each packet with a simple adaptation of checksum offload.

Example of TSO/GSO and LRO/GRO. On the left GSO and TSO are demonstrated, and on the right LRO and GRO are demonstrated.
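The splitting procedure above can be sketched in a few lines of Python. This is a simplified model, not kernel code: headers are plain dicts, and `gso_split` and the flag constants are illustrative names of our own.

```python
# Simplified sketch of how a GSO/TSO engine splits a big TCP packet.
# Headers are modeled as dicts; a real engine works on raw packet buffers.

FIN, PSH, CWR = 0x01, 0x08, 0x80  # TCP flag bits

def gso_split(big_hdr, payload, mss):
    """Split `payload` into segments of at most `mss` bytes, replicating
    `big_hdr` into each small packet with per-packet adjustments."""
    packets = []
    seq = big_hdr["seq"]
    for off in range(0, len(payload), mss):
        seg = payload[off:off + mss]
        hdr = dict(big_hdr)       # copy headers from the big packet
        hdr["seq"] = seq          # sequence number advances per segment
        hdr["ip_len"] = len(seg)  # IP length reflects the shorter payload
        last = off + mss >= len(payload)
        if not last:
            hdr["flags"] &= ~(FIN | PSH)  # FIN/PSH only on the last segment
        if off != 0:
            hdr["flags"] &= ~CWR          # CWR only on the first segment
        packets.append((hdr, seg))
        seq += len(seg)
    return packets
```

For example, splitting a 2,500-byte payload with a segment size of 1,000 yields three packets with sequence numbers 1,000 apart, CWR only on the first, and FIN/PSH only on the last.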


LRO and GRO

Large Receive Offload (LRO) and Generic Receive Offload (GRO) are the receive side complements of TSO and GSO. LRO is a hardware offload, and GRO is a pure software implementation. GRO is more flexible than LRO, and due to the inherent complexities of receive segmentation offload, LRO has not seen much deployment; GRO is now the preferred technique.


LRO and GRO pretty much work the same way:

  1. Receive TCP packets and validate their checksum

  2. Identify received TCP packets that belong to the same flow

  3. Collect the segments for each flow and attempt to reassemble them; that is, create a larger segment from smaller segments that are contiguous per their sequence numbers

  4. At some point, such as when a timer expires or the TCP PSH bit is set, create a fully qualified packet for the reassembled segment and send it up the stack

    • For LRO, the big packet is enqueued for the host stack to process. The host stack will dequeue the packet and process it

    • For GRO, the packet is passed to the higher layers of the host stack for processing

  5. A reassembled packet is processed normally by the stack; the fact that it was created by segmentation offload isn't pertinent


When packets for a flow are being reassembled, temporary state is created that holds the packet under reassembly. When a TCP packet is received, a lookup is performed to see if there is a reassembly in progress for the flow. If no state is found for the flow, then new reassembly state is created and the received packet becomes the packet under reassembly.


If a reassembly state is found for a received packet then the packet is compared against the packet in the reassembly state using a kind of header prediction. This entails:

  • Checking that most of the fields in the IP and TCP headers, including TCP options, are equal to those in the packet being reassembled

  • Fields that can vary per packet, like checksums and payload length, aren't compared but are checked for validity in the received packet

  • Some fields are ignored in the comparison; in particular, IPv6 Hop-by-Hop and Destination Options are not compared

  • The sequence number of the received packet is checked to be equal to the sequence number of the packet in reassembly plus its length (i.e. the packet carries the next in-order data)


If the received packet is matched then its payload is merged into the reassembled packet, and the payload length of the reassembled packet is updated accordingly. If the FIN flag or PSH flag is set in the received packet then the flag is set in the reassembled packet. The TCP checksum of the reassembled packet can be set by adding up the checksums covering the payload in each packet being reassembled. After reassembly, if the URG, RST, PSH, or FIN flags are set then the reassembled packet is sent up the stack and the reassembly state is cleared.
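The checksum-combining step relies on the Internet checksum being a 16-bit ones' complement sum: partial sums over adjacent byte ranges can simply be added together. A minimal sketch (function names are ours, and even-length byte ranges are assumed for simplicity):

```python
# Combining per-packet payload checksums into one checksum for the
# reassembled packet, using ones' complement arithmetic.

def csum_add(a, b):
    """Ones' complement addition of two 16-bit partial sums."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)  # fold the carry back in

def csum_bytes(data):
    """16-bit ones' complement sum over `data` (even length assumed)."""
    s = 0
    for i in range(0, len(data), 2):
        s = csum_add(s, (data[i] << 8) | data[i + 1])
    return s
```

The sum over two concatenated payloads equals the ones' complement addition of their individual sums, which is why reassembly never needs to rescan the merged payload bytes.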


If the packet is not matched to an existing reassembly state then the packet being reassembled can be passed to the stack and the received packet takes its place as the packet being reassembled for the flow. A timer is also set in the reassembly state and when it fires the reassembled packet is passed to the stack and the state is cleared.
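The lookup, merge, and flush logic described above can be sketched as a small engine. This is a hypothetical model with dict-based headers and a caller-supplied flow key; a real GRO implementation also flushes on a timer, which is omitted here.

```python
# Simplified sketch of GRO-style flow matching and coalescing.

PSH, FIN = 0x08, 0x01  # TCP flag bits

class GroEngine:
    def __init__(self):
        self.flows = {}  # flow key -> (header, payload) under reassembly

    def receive(self, key, hdr, payload):
        """Process one packet. Returns a packet to push up the stack,
        or None if the packet was merged or held for further coalescing."""
        held = self.flows.get(key)
        if held is None:
            self.flows[key] = (hdr, bytearray(payload))
            return self._maybe_flush(key)
        hhdr, hpay = held
        # Header prediction: matching headers and next in-order data?
        if hdr["seq"] == hhdr["seq"] + len(hpay) and self._match(hhdr, hdr):
            hpay += payload                            # merge the payload
            hhdr["flags"] |= hdr["flags"] & (PSH | FIN)
            return self._maybe_flush(key)
        # No match: flush the held packet; the new packet takes its place
        self.flows[key] = (hdr, bytearray(payload))
        return (hhdr, bytes(hpay))

    def _match(self, a, b):
        # Compare headers, ignoring flags that may legitimately differ
        return (a["flags"] & ~(PSH | FIN)) == (b["flags"] & ~(PSH | FIN))

    def _maybe_flush(self, key):
        hdr, _ = self.flows[key]
        if hdr["flags"] & (PSH | FIN):  # PSH/FIN terminates the reassembly
            h, p = self.flows.pop(key)
            return (h, bytes(p))
        return None
```

Feeding three in-order packets where the last carries PSH produces one reassembled packet with the concatenated payload; an out-of-order packet instead flushes the held packet and starts a fresh reassembly.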

USO and URO

UDP Segmentation Offload, or USO, is a variant of transmit segmentation offload for UDP. In USO the host stack creates one super-sized UDP packet that contains a list of payloads for individual packets. The super-sized packet is a UDP/IP packet where the UDP payload is the list of payloads, or segments. All segments are the same size, the segment size, except for the last one.


UDP Receive Segment Coalescing, or URO, is a variant of receive segmentation offload for UDP. In URO, the URO engine, which can be in the host stack or NIC device, collects UDP packets of the same flow and creates a super-sized UDP/IP packet containing a list of segments holding the UDP payloads. Each of the constituent segments must be the same size except for the last one. The segment size is attached to the packet's metadata so that upper layers can derive the original UDP packets.


USO and URO follow similar procedures as GSO and GRO described above except that the UDP header is considered instead of TCP. This makes USO and URO significantly simpler since the only field in the UDP header that can vary per packet is the UDP checksum. Also, UDP has no concept of sequence numbers, so there are no concerns about segment order.

Example USO and URO with QUIC and Hop-by-Hop Options. On the left USO is demonstrated and on the right URO is demonstrated.
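The segment model above means a super-sized UDP payload plus a single segment size fully determines the individual wire packets, in both directions. A minimal sketch (function names are ours):

```python
# USO/URO segment model: splitting and coalescing UDP payloads.

def uso_segments(payload, seg_size):
    """USO direction: split a super-sized UDP payload into per-packet
    payloads; all are seg_size bytes except possibly the last."""
    return [payload[i:i + seg_size] for i in range(0, len(payload), seg_size)]

def uro_coalesce(segments):
    """URO direction: coalesce same-flow UDP payloads into one super-sized
    payload. Valid only if all segments except the last are the same size;
    that size is recorded in the packet's metadata for the upper layers."""
    seg_size = len(segments[0])
    assert all(len(s) == seg_size for s in segments[:-1])
    return b"".join(segments), seg_size
```

Note the round trip: coalescing the segments produced by a split recovers the original payload and segment size, which is exactly what lets the upper layers derive the original UDP packets.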


Jumbograms and segmentation offload

In segmentation offload, the largest payload size in a reassembled packet is 64K since the IPv6 payload length and IPv4 total length fields are sixteen bits. A solution to this limitation is to use jumbograms with segmentation offload for packets larger than 64K bytes. Jumbograms are specified in RFC 2675 and employ an IPv6 Hop-by-Hop option to extend the maximum length of a packet to four gigabytes.


The use of jumbograms with segmentation offload is straightforward. When creating a big packet with a payload size greater than 64K, the jumbogram option is used to reflect the larger size. When splitting a big packet with the jumbogram option into small packets, the option isn't copied into the small packets. Note this means the jumbogram option isn't actually sent on the wire.


Using jumbograms with segmentation offload. In this example a big TCP packet with 100,500 bytes of payload is processed by the host stack. GSO would create 101 packets to send the data, and GRO could reassemble 101 packets to create a big packet with 100,500 bytes of payload.
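Per RFC 2675, the Jumbo Payload option is a Hop-by-Hop option of type 0xC2 carrying a 32-bit length, and the IPv6 payload length field is set to zero when it is present. A minimal sketch of encoding and decoding the option (function names are ours):

```python
# RFC 2675 Jumbo Payload option: type 0xC2, option length 4,
# followed by a 32-bit jumbo payload length.

import struct

JUMBO_OPT_TYPE = 0xC2

def encode_jumbo_option(length):
    """Encode the Jumbo Payload option for a big in-stack packet."""
    assert length > 0xFFFF, "jumbograms are only for payloads over 64K"
    return struct.pack("!BBI", JUMBO_OPT_TYPE, 4, length)

def decode_jumbo_option(opt):
    """Recover the real payload length from the option bytes."""
    typ, olen, length = struct.unpack("!BBI", opt)
    assert typ == JUMBO_OPT_TYPE and olen == 4
    return length
```

For the 100,500-byte example above, the option encodes that length in its 32-bit field while the sixteen-bit IPv6 payload length field stays zero; the option is stripped when the big packet is split, so it never reaches the wire.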


Segmentation offload friendly protocols

Segmentation offload has been implemented and deployed without needing any protocol changes. The implementation adapted to the protocols as defined, albeit sometimes with a bit of creativity like in the use of jumbograms. So given its pervasiveness, it follows that segmentation offload should be taken into consideration when developing new network layer or transport layer protocols. We provide some general guidelines for that...


The simplest guidance in protocol development is to try to avoid adding new protocol fields that have to be set on a per packet basis or necessarily change in each packet. Specifically, new length fields, checksums, and sequence numbers that vary per packet make life more difficult for segmentation offload.


Hop-by-Hop and Destination Options are compatible with segmentation offload. In GSO or USO, they are copied from the big packet to each of the small packets. In GRO and URO, the extension headers in the reassembled packet are taken from the first packet, and extension headers in later packets are forgotten. This means that new HBH or DestOpt options should be designed to allow replication and to be expendable (that is, if they are dropped from a packet then correctness is not sacrificed). Note that the jumbogram option does not have these properties; however, in that case the IP payload length field is zero, which is a hint to skip segmentation offload.


Homogeneity in the packets sent on a flow is a good thing! It increases the potential efficiency of receive segmentation offload, but also benefits other mechanisms in the path like ECMP. So the guidance is to try to make packets sent on the same flow look the same. For instance, it's not great for segmentation offload if different TCP options are sent in every other packet of a flow.


Also, even though the RFC 2675 jumbogram option might never actually be used on the wire, it is a big win for segmentation offload (there is a variant for IPv4 that relies on some internal signaling attached to the packet, but it's not nearly as elegant). It would be nice if we could continue to use jumbograms! :-)


SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions for cloud service providers to edge compute (5G) that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.
