Tom Herbert

Sep 237 min read

How 'bout compiling data structures and not just the language?

Updated: Sep 30

maintaingTom Herbert, SiPanda CTO, September 23, 2024

We typically think of compilers in terms of compiling a program written in some language. The goal of the compiler is to parse and interpret the program with regards to the syntax and semantics of the programming language. That’s great! But what if we take compilers one step farther? What if compilers had a deeper understanding of the semantics of the program itself? For instance, instead of just compiling the code in language constructs, what if compilers could compile and optimize based on the specialized semantics of data structures in the program? In other words: What if the data structure was the code?

Data structures as the code

It’s better to define and specify a parser by declarative representation rather than imperative representation. “declarative representation” is just a fancy term for representation in data structures (and “imperative representation” is just a fancy term for code with if-then-else constructs). This makes perfect sense. A parser is intuitively defined by a parse graph which is a type of graph, and a graph can be programmed as a set of data structures for nodes with pointers as links connecting nodes.

So what the developer really wants is to code up their parser as a set of data structures populated with various field values and functions. Pretty much any generic programming language supports this. For example in C, the nodes of the parser could be coded as data structures that include pointers to functions. In C++, the nodes could be objects of a class with constant values and member functions.

In a very real sense, the data structure for a parse graph is the code for the parse graph, hence we can say the data structure is the code. If we augment the compiler to understand the semantics of the parser data structures then the compiler can emit optimized binary with regard to those semantics. This implies some new compiler functionality:

A compiler is able to create an internal representation of the data structures. Specifically, it would be taught how to identify data structures with specialized semantics.
The compiler is able to compile the functions in a specialized data structure in the context of the data structure semantics (we show some examples of this below in the parameterized functions discussion).
The compiler can interpret the semantics of the data structures and emit an Intermediate Representation or binary that implements the semantics. This may include necessary support infrastructure like parser flow logic, a parsing loop, and bookkeeping for header lengths and checks.

Example annotated parse graph. This shows a parse graph for common protocols with some annotations. The parser structure is shown in my_parser at the upper right. The annotations for the ipv4_node structure are at the left, and the table for the IP Proto switch (ip_table) is shown in the center. The annotations match the data structure fields and are shown in the equivalent Common Parser Representation JSON.

Parser in C data structures

To illustrate the concepts, let's look at defining a parser in C with a focus on the IPv4 node.

Data structures

We’ll define three rudimentary data structures: parser, parse_node, and proto_table_entry.

struct parser { const struct parse_node *root_node; }

struct parse_node { int min_len; int (*len)(const void *hdr); int (*next_proto)(const void *hdr); const struct proto_table_entry *proto_table; };

struct proto_table_entry { int value; const struct parse_node *node; };

Parser definition

A parser can be declared using these data structures as shown below. The parser my_parser is defined with root node ether_node. The parse node ipv4_node is defined and includes functions to compute the IPv4 header length and next protocol number (other parse nodes including ether_node are not shown). The protocol table for IPv4 is ipv4_table (other protocol tables are not shown).

const struct parser my_parser = { .root_node = &ether_node };

static inline int ipv4_len(const void *viph) { return ((struct iphdr )viph)->ihl * 4; }

static inline int ipv4_proto(const void *viph) { return ((struct iphdr )viph)->protocol; }

const struct parse_node ipv4_node = {

.min_len = sizeof(struct iphdr);

.len = ipv4_len;

.next = ipv4_proto;

.proto_table = &ipv4_table;

};

const struct proto_table_entry ipv4_table[] = {

{ 6, &tcp_node },

{ 17, &udp_node },

. . .

};

Compiled Intermediate Representation

The parser in data structures can be compiled to the Common Parser Representation. Note the correspondence between “name” attributes in the JSON and the variable names of the parser structures in C.

"parsers": [ { "name": "my_parser", "root-node": "ether_node" } ],

“parse-nodes”: [

{

“name”: “ipv4_node”, “min-hdr-length”: 20,

“next-proto”: { “field-off”: 9, “field-len” : 1, “table” “ip_table”},

“hdr-length”: { “field-off”: 0, “mask”: “0xf”, “field-len”: 1, “multiplier”: 4 }

. . . ]

"proto-tables": [

{ "name": "ip_table",

"ents": [

{ "key": "6", "node": "tcp_node" },

{ "key": "17", "node": "udp_node" }

. . . ]

. . .

]

How does the compiler deal with this?

For the most part, all the compiler needs to do is identify the declaration of parser data structures and output an equivalent structure in Common Parser Representation. The variable name of a C parser structure is set as the “name” of the corresponding structure in Common Parser Representation; for instance, struct parser my_parser maps to "name": "my_parser" in JSON. Constant fields are instantiated by the values set in the constant data structure; for instance .min_len = sizeof(struct iphdr) becomes “min-hdr-length”: 20. Pointer references to other parser structures are followed to deduce a text name from the variable name; for instance { 6, &tcp_node } in C becomes { "key": "6", "node": "tcp_node" } in JSON.

For functions, the compiler tries to infer a parameterized function. These are functions that are specifically interpreted in the context of parser operations. We’ll discuss the use of parameterized functions for header length and next protocol.

Parameterized functions

A function for deriving the header length can be specified by a parameterized function:

HLEN<pfield-off, pfield-len, pmask, pshift-right, pmultiplier, padd> = [(READ(HDR, pfield-off, pfield-len) & pmask) >> pshift-right] * pmultiplier + padd

The READ function reads data from the header pointer in HDR. pfield-off is the offset of the length field in the header, and pfield-len is the size of the length field in bytes. pmask is a mask, pshift-right is number of bits to shift right, pmultiplier is a multiplier value, and padd is a value added to the final result. An instance is given by <pfield-off, pfields-len, pmask, pshift-right, pmultiplier, padd>. For example, the function to compute the IPv4 header length is denoted by HLEN<0, 1, 0xf, 0, 4, 0>, and the function to compute the length of an IPv6 Hop-by-Hop Options Extension Header is denoted by HLEN<1, 1, 0xff, 0, 8, 8>.

From the C code above, the IPv4 header length is computed as:

((struct iphdr )viph)->ihl * 4;

The compiler will logically convert this to something like:

((__u8 *)viph)[0] & 0xf << 2;

Now it’s just a matter of pattern matching to see if this can be written as a parameterized length function. The [0] implies pfield-len equals 0 and __u8 * implies pfield-len equals 1, so ((u8 )viph)[0] is equivalent to READ(HDR, pfield-off= 0, pfield-len= 1). The & 0xf implies pmask equals 0xf, and << 2 implies pmultiplier equals 4. There’s no right shift or added value, so pmultiplier and padd equal 0. Putting all these together, the parameterized function for IPv4 header length derived from the C code is HLEN<0, 1, 0xf, 0, 4, 0> (just like we mentioned above!).

The JSON “hdr-length” attribute in Common Parser Representation encodes the parameterized function for header length. The sub-attributes correspond to the parameters of the HLEN function: “field_off” is pfield-off, “field_len” is pfield-len, “mask” is pfield-mask, “shift_right” is pshift-right, ”multiplier” is pmultiplier, and ”add” is padd.

Putting this all together, the C code function for IPv4 length:

((struct iphdr )viph)->ihl * 4

compiles to Common Parser Representation in JSON IR as:

“hdr-length”: { “field-off”: 0, “mask”: “0xf”, “field-len”: 1, “multiplier”: 4 }

where default values of 0 are assumed for “shift-right” and “add”.

The next protocol can similarly be defined by a parameterized function:

PTYPE<pfield-off, pfield-len, pmask, pshift_right> = (READ(HDR, pfield-off, pfield-len) & pmask) >> pshift-right

An instance is given by <pfield-off, pfield-len, pmask, pshift-right>. For example, the function to derive the next protocol from an IPv4 header is denoted by PTYPE<9, 1, 0xff, 0>, and the function to derive the next header for an IPv6 Hop-by-Hop Options Header is denoted by PTYPE<0, 1, 0xff, 0>.

Following the same logic for deriving the header length, the C code function for IPv4 next protocol:

iph->protocol

compiles to Common Parser Representation in JSON IR as:

“next-proto”: { “field-off”: 9, “field-len” : 1, “table”: “ip_table” }

where default a values of 0 is assumed for “mask” and “add”. Note that “table” is an additional attribute that is not part of the PTYPE function.

Generalizing the idea

The concept of the data structure being the code is pretty profound. It’s applicable to a whole class of programming problems where the program is defined by a constant data structure. This includes not only parsers, but potentially any sort of Finite State Machine. Most mainstream programming languages have data structure constructs, so this is a simple and intuitive way to introduce declarative representation in an imperative language without forcing developers to switch to a Domain Specific Language.

The changes needed in the compiler aren't as scary as you might think! We implemented all of this using LLVM tools without any changes to the core compiler. This entails employing a standalone tool that operates on the Clang AST (Abstract Syntax Tree). The tool identifies and applies semantics to parser data structures in the AST, and uses the pattern matching capabilities of AST to infer parameterized parser functions.

SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions for cloud service providers to edge compute (5G) that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.