As I study software defined networking architectures, I’ve observed that none of them are exactly alike. There are common approaches, but once diving into the details of what’s being done and how, even the common approaches seem to have as many differences as similarities.
One of the most interesting elements of SDN architectures is traffic forwarding. How does traffic get from one point to another in an SDN world? Cisco ACI’s traffic forwarding approach is intriguing in that it neither relies on the controller for forwarding instructions, nor does it rely on a group of network engineers to configure it. Rather, Cisco ACI fabric is a self-configuring cloud of leaf and spine switches that forward traffic between endpoints without forwarding tables being programmed by the Application Policy Infrastructure Controller (APIC). APIC assumes forwarding will happen, worrying instead about policy.
The notion of a self-configuring fabric that delivers traffic with no outside instruction sounds mystical. How, exactly, does Cisco ACI forward traffic? Wanting to understand the basics myself, I spent time reviewing presentations by engineers from the Cisco ACI team, and have distilled the information down as best as I could.
Let’s start at the beginning.
What does an ACI topology look like?
A Cisco ACI switch topology is a two-tier, leaf-spine design. The big idea is to keep the fabric simple, and a two-tier architecture has efficiency, predictability, and resilience to commend it. Two-tier leaf-spine design is the most common data center network reference architecture recommended by the industry today. It’s hard to read a data center related technical whitepaper without leaf-spine being mentioned. Cisco has not done anything strange here.
What I can’t tell you is whether three-tier leaf-spine designs — sometimes used to scale for host density — are supported, but I tend to think not. The Cisco ACI behavior discussed in the videos implies a non-blocking, two-tier fabric throughout. Three-tier leaf-spine designs are not non-blocking. Therefore, very large data centers wishing to run thousands of physical hosts within a single ACI fabric would scale horizontally, adding spine switches. Considering the high port density of certain switches in the Nexus 9000 family, I can’t imagine scale being a limitation for almost anyone.
Initial ACI fabric setup
The initial setup process seems simple enough. I’m probably oversimplifying it, but the way I understand it, the Application Policy Infrastructure Controller (APIC) is connected to the ACI switch fabric. APIC discovers the switch topology. APIC then assigns IP addresses to each node (a leaf switch), to be used as VxLAN tunnel endpoints (VTEPs). VXLAN is the sole transport between leaf and spine across the fabric. Why? Well, a couple of reasons. One, policy is enforced in part by shipping traffic between specific leaf nodes. Two, you get 16M segments instead of 4K VLANs or a limited number of VRFs. VxLAN offers virtualized network segments with essentially no practical limit. For what it’s worth, NVGRE and STT have the same 16M limit, and presumably Geneve does as well, although I’d have to dig out the IETF draft and check.
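The 16M figure falls out of the VXLAN header itself: the VNID field is 24 bits wide, versus the 12-bit VLAN ID in an 802.1q tag. A quick back-of-the-envelope check:

```python
# Segment ID space: 802.1q carries a 12-bit VLAN ID, VXLAN a 24-bit VNID.
VLAN_ID_BITS = 12
VXLAN_VNID_BITS = 24

vlan_segments = 2 ** VLAN_ID_BITS       # 4,096 -- the "4K VLANs"
vxlan_segments = 2 ** VXLAN_VNID_BITS   # 16,777,216 -- the "16M segments"

print(vlan_segments, vxlan_segments)
```

NVGRE (24-bit VSID) and STT (24-bit context ID within its 64-bit field usage) land in the same 16M ballpark, which is why the numbers keep recurring across overlay protocols.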
It’s worth noting that in Cisco ACI terminology, this leaf-spine fabric is known as the infrastructure space. The uplinks facing endpoints are the user space. Within the infrastructure space, there are only VTEPs. Nodes are switches. Endpoints are hosts – you know, the things actually generating traffic.
Packet flow through the ACI fabric
How does a packet make it through the Cisco ACI fabric? First, be aware that the default gateway for any network segment is distributed across all leaf nodes in the ACI fabric. Therefore, whatever leaf node an endpoint is connected to will be the layer 3 gateway. This could hardly be otherwise when considering what happens when a packet arrives at a leaf node.
When a packet flows into a leaf switch, headers are stripped. Whether the frame or packet had its own 802.1q, NVGRE, or VxLAN tags or wrappers when arriving at the leaf switch – or if it was untagged – it just doesn’t matter to ACI. That information is not important to deliver a packet across the fabric. Once the headers are stripped, the remaining packet is encapsulated in VxLAN and forwarded across the fabric from the ingress leaf node to the egress leaf node. ACI determines the egress leaf node by working out where the packet needs to go based on ACI policy. When the VXLAN packet arrives at the egress leaf node, the VXLAN header is removed, and the packet is encapsulated in whatever way is needed to deliver it to the destination host. The packet does NOT have to be re-encapsulated the same way it was originally encapsulated when arriving at the ingress leaf node. So, a packet could arrive at the ACI fabric edge with a VxLAN encapsulation, but leave with an 802.1q tag.
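To make the normalize-then-re-encapsulate idea concrete, here’s a toy model of that flow. All the names are mine, packets are just dictionaries, and this is nothing like ACI’s real data path — it only illustrates that the arriving encapsulation is discarded, the payload rides fabric VXLAN, and the egress encapsulation is chosen independently.

```python
# Toy sketch (assumed names): strip whatever wrapper arrived, carry the
# payload in fabric VXLAN between VTEPs, re-wrap at egress as needed.

def ingress_leaf(packet, egress_vtep, fabric_vnid):
    payload = packet["payload"]           # drop the 802.1q/VXLAN/NVGRE wrapper
    return {"encap": "vxlan", "vtep": egress_vtep,
            "vnid": fabric_vnid, "payload": payload}

def egress_leaf(fabric_packet, dest_encap):
    payload = fabric_packet["payload"]    # remove the fabric VXLAN header
    return {"encap": dest_encap, "payload": payload}

# A packet can arrive VXLAN-encapsulated yet leave with an 802.1q tag:
pkt = {"encap": "vxlan", "vnid": 5000, "payload": "ip-packet"}
in_fabric = ingress_leaf(pkt, egress_vtep="10.0.0.7", fabric_vnid=86016)
out = egress_leaf(in_fabric, dest_encap="802.1q")
print(out)   # {'encap': '802.1q', 'payload': 'ip-packet'}
```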
The key to grasping all of this is understanding that endpoints (hosts) are part of ACI groups. The groups form the common bond. Moving packet traffic between group members is the job of an invisible transport that the network engineer does not have to configure. Yes, I’m talking in this article about what’s going on under the hood, but in actual practice, network operators shouldn’t have to open ACI’s hood to fire up the engine.
Let’s consider an example here of forwarding VMware NSX traffic across an ACI fabric. Yes, you can do that. ACI is a more than viable underlay for NSX. Okay, let’s assume NSX is sending VxLAN encapsulated traffic from ACI user space into ACI infrastructure space. The original NSX VxLAN encapsulation will be stripped by the ingress leaf node, and a new ACI-specific VxLAN encapsulation with a new “VNID” (virtual network identifier) applied instead. The packet will be shipped across the fabric to the egress leaf switch. The egress leaf switch will place the original NSX VxLAN header back onto the packet, and forward it to the endpoint. If you’re wondering why Cisco doesn’t just do a double encapsulation (wrapping the ACI VxLAN encapsulation around the original encapsulation), their response is that a double encap wastes bandwidth. Fair enough.
Another point worth making here is that endpoints have no idea about any of this encapsulation and decapsulation. To hosts, ACI presents itself as plain old Ethernet. The special things ACI fabric does are transparent to the hosts. For example, hosts still use standard Address Resolution Protocol (ARP) to map a known IP address to an unknown MAC address. ARP queries are broadcast packets. (Hey, if any of you listening on this segment own IP address A.B.C.D, could you respond so that I can learn your MAC address and send you an Ethernet frame?) These broadcast queries still flow into a leaf switch, but then a bit of unique ACI handling occurs. The broadcast from user space is turned effectively into a unicast shipped across infrastructure space. Where is it sent? To the leaf switch uplinking the host that can answer the ARP. ACI already knows where all the endpoints are in the fabric, so why bother broadcasting ARP queries to the entire network segment? Flooding is an undesirable networking behavior that assumes ignorance. In this case, knowledge equals efficiency.
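The broadcast-to-unicast conversion is easy to sketch. In this hypothetical model (table contents and function names are mine, not ACI’s), the fabric consults its endpoint database and sends the ARP query only to the one leaf that fronts the target, flooding only as a fallback for endpoints it has never learned:

```python
# Toy sketch (assumed names): the fabric knows which leaf VTEP each
# endpoint sits behind, so an ARP broadcast becomes a single unicast.

endpoint_table = {                 # learned mapping: endpoint IP -> leaf VTEP
    "10.1.1.20": "leaf3-vtep",
    "10.1.1.30": "leaf5-vtep",
}

def handle_arp_broadcast(target_ip):
    vtep = endpoint_table.get(target_ip)
    if vtep is not None:
        return ("unicast", vtep)   # send only to the leaf that owns the target
    return ("flood", None)         # unknown endpoint: fall back to flooding

print(handle_arp_broadcast("10.1.1.20"))  # ('unicast', 'leaf3-vtep')
```

Knowledge equals efficiency: the lookup replaces a segment-wide flood with one targeted delivery.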
A final interesting detail is that an ACI fabric is treated as a single routed hop, even though there will be three physical devices a packet is forwarded through: the ingress leaf node, a spine switch, and an egress leaf node. The “appears as a single routed hop” notion is logical as the specific physical devices in the ACI fabric are NOT individual routed hops from the perspective of the endpoint.
More about spine forwarding
Part of the Cisco’s pitch for ACI is scale. You can build a really large ACI fabric on Nexus 9000 hardware that Cisco has priced aggressively. The Nexus 9000 gear is based in part on Broadcom silicon, just like what’s found in a lot of other switches, so where’s the unique value proposition? The uniqueness comes in the “plus” silicon as in the term “merchant plus” used to describe the ASICs found in the Nexus 9000. Here’s how I understand it.
- The fabric cards found in spine switches are based on the Broadcom chipset.
- The line cards in the spine switches are where the “plus” ASICs built by the Insieme team are found. The line cards are what allow the “million address database” to be built. That seems like a sizable enough fabric to handle the data centers of most of us.
Remember that the ACI fabric auto-configures forwarding. Neither network engineers nor APIC program the forwarding tables or configure routing. So, how does an ACI fabric learn about all the VTEPs? There are three main components.
- A DHCP pool is used in the spine to assign addresses to the VTEPs.
- Automatic learning of VTEPs is via LLDP.
- ISIS builds a routing table of VTEP IP addresses using the information discovered via LLDP.
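The three steps above can be modeled as a simple pipeline. This is a loose sketch under my own assumptions — real ACI does all of this inside the switch OS — but it captures the ordering: addresses first, adjacency discovery second, then a full any-VTEP-to-any-VTEP routing table.

```python
# Rough sketch (assumed names) of fabric bring-up: DHCP-like address
# assignment, LLDP-like neighbor discovery, IS-IS-like route building.

def assign_vtep_addresses(nodes, pool):
    # DHCP step: hand each node the next address from the spine's pool.
    return {node: pool.pop(0) for node in nodes}

def discover_adjacencies(links):
    # LLDP step: each node learns its directly connected neighbors.
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def build_vtep_routes(vteps, adjacencies):
    # IS-IS step (end state only): every node gets a route to every VTEP.
    return {node: dict(vteps) for node in adjacencies}

nodes = ["leaf1", "leaf2", "spine1"]
vteps = assign_vtep_addresses(nodes, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
adj = discover_adjacencies([("leaf1", "spine1"), ("leaf2", "spine1")])
routes = build_vtep_routes(vteps, adj)
print(routes["leaf1"]["leaf2"])   # 10.0.0.2
```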
What happens during multicast forwarding?
Multicast over ACI follows the theme of efficiency. Simply, the ACI fabric builds a multicast tree. Incoming multicast frames are copied by spines to the leaf switches whose endpoints have requested to see the stream. Leaf switches then replicate the multicast frame to the endpoints that should receive it. At a glance, this strikes me as similar to how Avaya Shortest Path Bridging forwards multicast traffic through an SPB fabric.
What happens when a leaf uplink to a spine fails?
One of the most likely physical failures an ACI fabric will experience is that of a leaf-spine link failing. When this happens, there is nearly instant convergence and re-hash of the ECMP forwarding on that leaf node, as fast as 125 microseconds (not milliseconds). The convergence is not dependent on a link aggregation protocol.
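The re-hash behavior can be illustrated with a minimal sketch. The hash function and flow-key format here are my own stand-ins, not what the Nexus ASICs do; the point is only that each flow is deterministically hashed over the set of healthy uplinks, so removing a failed link from the set immediately moves affected flows to a survivor — no LACP or similar protocol required.

```python
# Minimal ECMP re-hash sketch (assumed names): flows hash across the
# currently healthy uplinks; a failed link is simply dropped from the set.
import zlib

def pick_uplink(flow_id, uplinks):
    # Deterministic hash of the flow key over the available uplinks.
    return uplinks[zlib.crc32(flow_id.encode()) % len(uplinks)]

uplinks = ["spine1", "spine2", "spine3", "spine4"]
flow = "10.1.1.10:443->10.2.2.20:55012"
before = pick_uplink(flow, uplinks)

uplinks.remove("spine2")            # leaf-to-spine2 link fails
after = pick_uplink(flow, uplinks)  # flow re-hashes onto a surviving link
print(after in uplinks)             # True
```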
What happens when a node or indirect link fails?
Another failure that ACI fabric might need to respond to is a node (leaf switch) falling out of the fabric. From a topological standpoint, the failure of an indirect link is similar to a node failure. How does ACI react to these events?
IS-IS, the routing protocol between leaf and spine switches, will converge as quickly as any IGP. IS-IS is building a routing table between VTEPs – any VTEP to any VTEP, and practically speaking, any leaf node to any leaf node. IS-IS convergence happens independently of APIC.
Remember that APIC is not interested in how to forward traffic; packet delivery is assumed by APIC, which contrasts with the OpenFlow model, focused primarily on flow forwarding. This is key from a couple of different perspectives.
- There is no latency penalty incurred while APIC figures out the new topology, runs an algorithm to settle on how best to forward traffic, and then updates forwarding tables in the fabric. APIC isn’t involved in that part of the process.
- The assumption of traffic delivery is somewhat parallel to overlay designs that assume the underlay will deliver traffic, but are not controlling how that forwarding is done. ACI has a leg up in the sense that although APIC is not building the underlay, ACI as a system can still provide insight into what’s going on in the physical fabric. The decoupling of policy (APIC) from delivery (ACI fabric) hasn’t left network operators blind.
How large does ACI fabric scale?
ACI scales up to 1M entries for MAC, IPv4, and IPv6. All reachability information is stored in all spine switches, while leaf nodes learn only about the remote hosts they need to know. Therefore, not every leaf switch knows about every endpoint everywhere.
So, what happens if a leaf needs to forward to an endpoint it doesn’t know about? The leaf switch uses the spine switch as a penalty-free proxy (aka a “zero penalty miss”). “Hey, I don’t know how to deliver this. You know everything, so please send this where it needs to go.” This is a zero penalty miss because the spine would have received this traffic anyway. In a sense, this is like default routing.
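The default-routing analogy suggests a simple model. In this hypothetical sketch (table contents and names are mine), the leaf forwards directly on a hit and hands misses to the spine, which holds the full endpoint database — and since the traffic was headed to a spine anyway, the miss costs nothing extra:

```python
# Toy sketch (assumed names) of the "zero penalty miss": leaf hits forward
# directly; misses are proxied by a spine that knows every endpoint.

spine_table = {                     # spines hold the full endpoint database
    "10.1.1.20": "leaf3-vtep",
    "10.2.2.40": "leaf7-vtep",
}
leaf_table = {"10.1.1.20": "leaf3-vtep"}   # a leaf caches only what it needs

def leaf_forward(dest_ip):
    if dest_ip in leaf_table:
        return ("direct", leaf_table[dest_ip])
    # Miss: ship it to the spine, which resolves the egress leaf.
    return ("spine-proxy", spine_table[dest_ip])

print(leaf_forward("10.1.1.20"))  # ('direct', 'leaf3-vtep')
print(leaf_forward("10.2.2.40"))  # ('spine-proxy', 'leaf7-vtep')
```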
How does ACI fabric connect to the outside world?
There are a couple of discussions worth having about connecting ACI fabric to the rest of the network — one about layer three, and one about layer two.
First, ACI fabric can peer with the outside world using traditional routing protocols such as BGP or OSPF. You can even use static routes if you like. This is done at a leaf node, not a spine, which might be counterintuitive if you were thinking of a spine switch like a core switch. The spine switch isn’t in the role of core switch when it comes to forwarding – remember that the spine’s primary role is to move data between VTEPs terminating on leaf nodes. Okay. So, when the leaf node peers with the outside world, the leaf sends what it has learned to the spine, and the spine, functionally, becomes a route-reflector. Remember how the spine is a proxy for destinations the leaf doesn’t know? The same logic applies to external routes as for routes within the fabric.
Second, ACI leaf switches can connect to existing switches you might already have on your network at layer two. The main point of this sort of interconnect is to allow pre-existing network hardware to leverage ACI policy groups. To accomplish this, an existing switch can connect to an ACI leaf node switch. The default gateways are moved to the ACI fabric. Traffic then flows from the host, through the legacy switch, and into the ACI leaf switch. From there, the standard ACI forwarding process applies.
Surely, APIC has something to do with all this forwarding wizardry, no?
I think I’ve hammered this point home, but it’s such a key distinguisher in what Insieme built for Cisco here that it’s worth dwelling on. So, one more time, APIC is not in the traffic forwarding business. If the ACI fabric itself tells packets where to go, APIC tells packets when to go, so to speak. APIC’s role is to make the connection between endpoint groups. APIC is about policy — who can talk to who via what chain of services or rules — and not about forwarding. Forwarding is a function that the controller assumes is taken care of.
Another item to note about Cisco’s APIC is that, at least today, APIC needs Nexus 9K switches running in ACI mode to accomplish anything. APIC cannot be used to control other network hardware. APIC and ACI come as a matched hardware and software set. I don’t know if this changes in the future or not. Remember that the Nexus 9K platform has custom ASICs that help make ACI work, so it’s not as if a simple software upgrade on other Nexus gear (or other Broadcom-based switches from other vendors) can make something APIC can control. How ACI will roll out across the rest of the Cisco infrastructure remains to be seen.
Does the way the Cisco ACI fabric forwards traffic actually matter?
That might seem like a dumb question to ask at the end of a long blog post about how Cisco ACI forwards traffic, but I think the answer is sort of interesting. In a certain sense, how ACI fabric forwards traffic doesn’t matter at all. Cisco intends ACI fabric to be an invisible transport. Packets come in, get forwarded, and go out.
You could think of ACI fabric as a chassis switch with line cards. Do you know what happens inside of a chassis switch? You might have researched the topic. There are whitepapers out there explaining ASICs, front port mappings, crossbar fabrics, virtual output queues, and so on. But aside from perhaps raw capacity planning and QoS capabilities, how much of what happens in a chassis switch does a network engineer really have to know? Not a lot — a chassis switch can be treated like a black box. Is it helpful to know a bit more, especially when a chassis switch underperforms or starts to fail? Yes, as that knowledge can speed problem resolution. But still, deep chassis switch knowledge is not a requirement to operate a chassis switch.
My point is that it will be possible for organizations to deploy Cisco ACI fabric with essentially no knowledge of how traffic is forwarded between leaf nodes. Is the information interesting? Yes. Is it important information to fully grasp ACI scale? Absolutely. Is understanding fabric latency useful data? Yes, especially for those applications expecting certain network fabric characteristics. But generally speaking, organizations should be able to think about ACI fabric as a chassis switch. Traffic goes in, gets forwarded, and comes out again.
The APIC is the real magic, and where organizational attention should fall. With Cisco ACI, the network is all about the policy, thus the moniker “Application Centric Infrastructure.” It’s not Network Centric Infrastructure. Strategically, ACI is attempting to get the network out of the way.
I spent a lot of time putting this post together. While I believe I got the major parts right, there’s enough nuance and detail here that I possibly got something wrong. I enthusiastically welcome any corrections. You can comment below (best, so that everyone can see) or e-mail me directly. Either way, I’ll incorporate changes to this post as soon as I reasonably can.
I strongly recommend that you watch the following videos where Cisco folks Joe Onisick and Lilian Quan talk about ACI fabric forwarding in even more detail than I do here. These videos are where I got practically all of my information, and are valuable references if this sort of nerdery is interesting to you.