Bringing CAMs to the masses (or at least to software!)

When I was growing up, at least in grad school and early in my career :-), I heard about this wondrous invention called a CAM (Content Addressable Memory). We just give the hardware engine an input key and it immediately spits out an associated value. It's like magic! I couldn't wait to use these things in my code!

Here we are now, so many years later, I have never, not even once, been able to use a CAM in software. Needless to say, I'm disappointed :-(. It turns out that CAMs are really expensive hardware components relegated to uses deep in the system like to support set associative caches or do MAC address filtering in a NIC. I learned this the hard way when I asked the hardware guys to incorporate a programmable CAM into our design and they laughed at me! (okay, maybe they just pushed back a little bit...). The good news is that I view initial feedback from the hardware team to be the start of negotiations, and after a few rounds we were able to define a user programmable CAM that fit our needs and doesn't blow the budget!

So what's a CAM?

A CAM is a type of memory used for very fast key-value lookups. It can be organized as a table of key-value pairs. When a key is input into the engine, all of the keys are compared in parallel and if there is a match the associated value is returned. To do this, each memory bit of each key entry, called a cell, has its own comparison circuit. Additionally, match outputs from each cell need to be combined to yield a match signal for each entry, and those signals need to be considered to find the matching entry. A CAM is expensive because of the comparison logic and logic for combining output signals, and the total cost of the CAM is a function of the number of bits in the key and the number of entries. Note that we're just talking about CAMs today, TCAMs are a whole 'nother story...

Why do we need these things?

I like to think of a CAM being like a switch statement in C. As an example, consider a switch statement on EtherType that might be used when parsing an Ethernet header:

 switch (v) {
 case 0x800:
      /* Process IPv4 packet */
      break;
 case 0x86DD:
      /* Process IPv6 packet */
      break;
   ...
default:
      /* Process unknown protocol */
}

The switch statement in C is brilliant in its elegance of syntax. Unfortunately, compiling this into CPU instructions isn't so easy. In the worst case, the compiler will build a binary search tree of cases and searching the tree is O(log2(N)) where N is the number of cases. There are some tricks that compilers can do to try to reduce the cost. If all the case values are small integer numbers then a jump table might be employed. For instance, if all cases are greater than or equal to zero and less than sixteen, then something like this might work:

 if (v >= 16)
    goto default;
 else
    goto jump_table[v & 0xf]

jump_table would be a sixteen element array of addresses. Each element must be set, either to an address of code for processing the case, or the address of a default handler. The downside to jump tables is that they don't scale well if case values get big. For instance, the EtherType lookup above could require allocating a 65,536 element array which is prohibitive.

There are other techniques for dealing with sparse tables and large case values, including using combinations of hashes, Hamming codes, and jump tables-- but at the end of the day no method comes close in efficiency to what a CAM could do for switch statements. It could be as simple as:

 goto CAM_LOOKUP(v);

Where CAM_LOOKUP is a hardware function that returns either an address for a matched key or a default address if no key is matched. This is an O(1) operation.

Hurray, we finally get to use CAMs!

The whole reason I asked for CAMs in the first place was that we had a particular use case in mind. Namely, we need CAMs for very fast lookups on protocol numbers in a programmable parser (like the EtherType lookup shown above). With a particular use case in mind (domain specific use case if you're into the latest buzz words), we are able to tailor the CAM design exactly to the problem eliciting an elegant and cost effective solution.

In our design, each CAM entry has a 20-bit key and a 32-bit target. The target may be an encoded address or a parser error code, or have some other interpretation based on the context of the lookup. There is only one CAM in the system, so the key is encoded with a selector to allow multiple sub-tables. The number of entries is configurable (a power of two), and in our current designs we are looking at CAMs with thirty-two or sixty-four entries.

CAM keys

A CAM key is structured in one of two ways as indicated by the four high order bits of the key:

union {
 struct {
        Kind: 4 // Set to non-zero
        MatchVal16: 16
   } Match16
   struct {
        Kind: 4 // Set to zero
        Selector: 8
        MatchVal8 : 8
   } Match8
}

This is a union where the high order bits of the key, the Kind field, determine which structure is used. If Kind equals zero, then the Match8 structure is used: Selector selects one of 256 sub-tables with an 8-bit value to match in MatchVal8. If Kind is non-zero then the Match16 structure is used: Kind selects one of fifteen sub-tables (numbered 1 to 15) with a 16-bit value to match in MatchVal16. This arrangement allows matching either 16-bit values (like EtherType) using Match16 or 8-bit values (like IP protocol) using Match8.

Programming the CAM

The CAM is nominally part of the parser, and it's invoked and programmed by parser instructions. The CAM is programmed by writing entries in the CAM table. CAM entries are written using the prs.cam.write instruction where the first source operand is a table index, and the second operand encodes the key and the target value; the key occupies the high order thirty-two bits of the second operand, and the target occupies the low order thirty-two bits. A CAM entry is removed by writing a key value of 0xFFFFFF which is an invalid key value. A CAM entry can be read using prs.cam.read. Example instructions are:

prs.cam.write a0, paccum

prs.cam.read a5, paccum

Instructions for CAM lookups

CAM lookup instructions specify the CAM sub-table in a four bit immediate (this refers to the Kind field in the key structure above). Values one through fifteen refer to the Match16 sub-tables-- these are shared tables since they can be referred to by multiple instructions. If the immediate is zero this refers to a Match8 table-- this is a non-shared table and the Selector derived from the low order bits of the program counter (PC) for the instruction by:

Selector = (PC >> 6) & 0xFF

There are different variants of parser instructions for CAM lookup. Common operands include:

The MatchVal part of the key from a parser register (8 or 16 bit values)
A table selector immediate value expressed as number between 1 and 15 for shared tables, or pc for a non-shared table
A "miss" directive that indicates the action to take on a CAM miss. Actions are indicated by a suffix with the mnemonic:
- No suffix: continue
- .wild: Jump to handler in WildCard register
- .stop: Exit parser with success
- .stopsub: Stop the current sub-node or loop iteration with success
- .fail: Exit parser with failure

Referring to the example table below, some examples of lookup instructions are:

prs.cam.h paccum,paccum[2],10

Lookup an Ethertype in the third word of the accumulator register and return the target in the accumulator. If no match is found just proceed to the next instruction

prs.cam.b.wild.stp pnext,pflags[3],pc

Lookup an IP protocol in the fourth byte of the flags register and set the target in the Next register. If no match is found then jump to the wildcard handler. Note that (PC >> 6) & 0xff == 0xd7 (the selector is set in the table is set to match the address of the instruction)

prs.camjumptlvloop.b.stopsub.stp paccum,pc

Lookup a TCP option for the first byte in the accumulator register and jump to the target node in the context of a TLV loop. If no match is found then exit the TLV loop with success. Note that (PC >> 6) & 0xff == 0x63 (the selector is set in the table is set to match address of the instruction)

SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions for cloud service providers to edge compute (5G) that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.