
Cisco ACI Fabric Forwarding In A Nutshell


As I study software defined networking architectures, I’ve observed that none of them are exactly alike. There are common approaches, but once you dive into the details of what’s being done and how, even the common approaches seem to have as many differences as similarities.

One of the most interesting elements of SDN architectures is traffic forwarding. How does traffic get from one point to another in an SDN world? Cisco ACI’s traffic forwarding approach is intriguing in that it neither relies on the controller for forwarding instructions, nor does it rely on a group of network engineers to configure it. Rather, Cisco ACI fabric is a self-configuring cloud of leaf and spine switches that forward traffic between endpoints without forwarding tables being programmed by the Application Policy Infrastructure Controller (APIC). APIC assumes forwarding will happen, worrying instead about policy.

The notion of a self-configuring fabric that delivers traffic with no outside instruction sounds mystical. How, exactly, does Cisco ACI forward traffic? Wanting to understand the basics myself, I spent time reviewing presentations by engineers from the Cisco ACI team, and have distilled the information down as best as I could.

Let’s start at the beginning.

What does an ACI topology look like?

A Cisco ACI switch topology is a two-tier, leaf-spine design. The big idea is to keep the fabric simple, and a two-tier architecture has efficiency, predictability, and resilience to commend it. Two-tier leaf-spine design is the most common data center network reference architecture recommended by the industry today. It’s hard to read a data center related technical whitepaper without leaf-spine being mentioned. Cisco has not done anything strange here.

What I can’t tell you is if three-tier leaf-spine designs — sometimes used to scale for host density — are supported, but I tend to think not. The Cisco ACI behavior discussed in the videos implies a non-blocking, two-tier fabric throughout. Three-tier leaf-spine designs are not non-blocking. Therefore, very large data centers wishing to run thousands of physical hosts within a single ACI fabric would scale horizontally, adding spine switches. Considering the high port density of certain switches in the Nexus 9000 family, I can’t imagine scale being a limitation for almost anyone.


Initial ACI fabric setup

The initial setup process seems simple enough. I’m probably oversimplifying it, but the way I understand it, the Application Policy Infrastructure Controller (APIC) is connected to the ACI switch fabric. APIC discovers the switch topology. APIC then assigns IP addresses to each node (a leaf switch), to be used as VXLAN tunnel endpoints (VTEPs). VXLAN is the sole transport between leaf and spine across the fabric. Why? Well, a couple of reasons. One, policy is enforced in part by shipping traffic between specific leaf nodes. Two, you get 16M segments instead of 4K VLANs or a limited number of VRFs. VXLAN offers virtualized network segments with essentially no practical limit. For what it’s worth, NVGRE and STT have the same 16M limit, and presumably Geneve does as well, although I’d have to dig out the IETF draft and check.
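
The segment-count difference falls straight out of the header field sizes: 802.1Q carries a 12-bit VLAN ID, while VXLAN carries a 24-bit network identifier. A quick sketch of the arithmetic, nothing ACI-specific here:

```python
# Segment ID space: 12-bit 802.1Q VLAN ID vs. 24-bit VXLAN network identifier.
# Plain header-field arithmetic, nothing ACI-specific.

VLAN_ID_BITS = 12   # 802.1Q VLAN ID field
VNID_BITS = 24      # VXLAN VNID field (NVGRE's VSID is also 24 bits)

vlan_segments = 2 ** VLAN_ID_BITS   # 4,096, minus a few reserved values in practice
vxlan_segments = 2 ** VNID_BITS     # 16,777,216 -- the "16M" figure

print(f"802.1Q VLANs: {vlan_segments:,}")
print(f"VXLAN VNIDs:  {vxlan_segments:,}")
```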

It’s worth noting that in Cisco ACI terminology, this leaf-spine fabric is known as the infrastructure space. The uplinks facing endpoints are the user space. Within the infrastructure space, there are only VTEPs. Nodes are switches. Endpoints are hosts – you know, the things actually generating traffic.

Packet flow through the ACI fabric

How does a packet make it through the Cisco ACI fabric? First, be aware that the default gateway for any network segment is distributed across all leaf nodes in the ACI fabric. Therefore, whatever leaf node an endpoint is connected to will be the layer 3 gateway. This could hardly be otherwise when considering what happens when a packet arrives at a leaf node.
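
One way to picture the pervasive gateway, assuming (as the discussion above suggests) that every leaf presents the same gateway IP and a shared virtual MAC for a given subnet. The addresses and class names below are purely illustrative:

```python
# Sketch of a pervasive (distributed) default gateway: every leaf presents the
# same gateway IP and virtual MAC for a subnet, so whichever leaf an endpoint
# attaches to is its first-hop router. Addresses and names are illustrative.

GATEWAY = {"subnet": "10.1.1.0/24", "gw_ip": "10.1.1.1", "gw_mac": "02:00:00:00:00:01"}

class LeafSwitch:
    def __init__(self, name):
        self.name = name
        self.gateways = [GATEWAY]   # identical on every leaf in the fabric

    def answers_arp_for(self, ip):
        return any(ip == gw["gw_ip"] for gw in self.gateways)

leaves = [LeafSwitch(f"leaf{i}") for i in range(1, 5)]
assert all(leaf.answers_arp_for("10.1.1.1") for leaf in leaves)
```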

When a packet flows into a leaf switch, headers are stripped. Whether the frame or packet had its own 802.1Q, NVGRE, or VXLAN tags or wrappers when arriving at the leaf switch – or if it was untagged – it just doesn’t matter to ACI. That information is not important to deliver a packet across the fabric. Once the headers are stripped, the remaining packet is encapsulated in VXLAN and forwarded across the fabric from the ingress leaf node to the egress leaf node. ACI determines where the egress leaf node is by sorting out where the packet needs to go based on ACI policy. When the VXLAN packet arrives at the egress leaf node, the VXLAN header is removed, and the packet is encapsulated in whatever way is needed to deliver it to the destination host. The packet does NOT have to be re-encapsulated the same way it was originally encapsulated when arriving at the ingress leaf node. So, a packet could arrive at the ACI fabric edge with a VXLAN encapsulation, but leave with an 802.1Q tag.
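
To make that normalize-then-re-encapsulate flow concrete, here’s a toy model of the edge behavior just described. The table contents, field names, and VNID value are invented for illustration; this is not the real ACI pipeline:

```python
# Toy model of the ACI edge behavior: the ingress leaf strips whatever
# encapsulation arrives (802.1Q, VXLAN, NVGRE, or none), carries the inner
# packet across the fabric in an ACI VXLAN tunnel between VTEPs, and the egress
# leaf re-encapsulates in whatever format the destination expects -- which need
# not match the ingress encapsulation. Names and structures are illustrative.

# endpoint MAC -> (egress leaf VTEP, encapsulation the endpoint expects)
ENDPOINT_DB = {
    "00:00:00:aa:aa:aa": ("vtep-leaf3", "802.1q"),
    "00:00:00:bb:bb:bb": ("vtep-leaf7", "vxlan"),
}

def ingress_leaf(frame, aci_vnid):
    inner = frame["inner"]                             # outer encapsulation is discarded here
    dst_vtep, egress_encap = ENDPOINT_DB[inner["dst_mac"]]
    return {"encap": "aci-vxlan", "vnid": aci_vnid,    # fabric transport header
            "dst_vtep": dst_vtep, "inner": inner,
            "egress_encap": egress_encap}

def egress_leaf(fabric_pkt):
    return {"encap": fabric_pkt["egress_encap"],       # e.g. arrived as VXLAN, leaves as 802.1Q
            "inner": fabric_pkt["inner"]}

frame_in = {"encap": "vxlan", "inner": {"dst_mac": "00:00:00:aa:aa:aa", "data": b"hello"}}
out = egress_leaf(ingress_leaf(frame_in, aci_vnid=90210))
assert out["encap"] == "802.1q"                        # re-encapsulated differently than it arrived
```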

The key to grasping all of this is understanding that endpoints (hosts) are part of ACI groups. The groups form the common bond. Moving packet traffic between group members is the job of an invisible transport that the network engineer does not have to configure. Yes, I’m talking in this article about what’s going on under the hood, but in actual practice, network operators shouldn’t have to open ACI’s hood to fire up the engine.

Let’s consider an example here of forwarding VMware NSX traffic across an ACI fabric. Yes, you can do that. ACI is a more than viable underlay for NSX. Okay, let’s assume NSX is sending VXLAN encapsulated traffic from ACI user space into ACI infrastructure space. The original NSX VXLAN encapsulation will be stripped by the ingress leaf node, and a new ACI-specific VXLAN encapsulation with a new “VNID” (virtual network identifier) applied instead. The packet will be shipped across the fabric to the egress leaf switch. The egress leaf switch will place the original NSX VXLAN header back onto the packet, and forward it to the endpoint. If you’re wondering why Cisco doesn’t just do a double encapsulation (wrapping the ACI VXLAN encapsulation around the original encapsulation), their response is that a double encap wastes bandwidth. Fair enough.

Another point worth making here is that endpoints have no idea about any of this encapsulation and decapsulation. To hosts, ACI presents itself as plain old Ethernet. The special things ACI fabric does are transparent to the hosts. For example, hosts still use standard Address Resolution Protocol (ARP) to map a known IP address to an unknown MAC address. ARP queries are broadcast packets. (Hey, if any of you listening on this segment own IP address A.B.C.D, could you respond so that I can learn your MAC address and send you an Ethernet frame?) These broadcast queries still flow into a leaf switch, but then a bit of unique ACI handling occurs. The broadcast from user space is effectively turned into a unicast shipped across infrastructure space. Where is it sent? To the leaf switch uplinking the host that can answer the ARP. ACI already knows where all the endpoints are in the fabric, so why bother broadcasting ARP queries to the entire network segment? Flooding is an undesirable networking behavior that assumes ignorance. In this case, knowledge equals efficiency.
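
A sketch of that optimization, assuming the leaf holds (or can fetch) a mapping of endpoint IPs to the leaves that front them. The table contents are made up:

```python
# Sketch of ACI's ARP handling: instead of flooding a broadcast ARP request
# across the segment, the ingress leaf looks up which leaf fronts the target IP
# and ships the query as a unicast to that leaf's VTEP only. Tables are made up.

ENDPOINT_TO_LEAF = {
    "10.1.1.20": "vtep-leaf4",   # target host lives behind leaf 4
    "10.1.1.30": "vtep-leaf9",
}

def handle_arp_request(target_ip):
    vtep = ENDPOINT_TO_LEAF.get(target_ip)
    if vtep is not None:
        return f"unicast ARP query to {vtep}"   # knowledge beats flooding
    return "unknown target: fall back to the spine proxy (or flood)"

print(handle_arp_request("10.1.1.20"))   # -> unicast ARP query to vtep-leaf4
```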

A final interesting detail is that an ACI fabric is treated as a single routed hop, even though there will be three physical devices a packet is forwarded through: the ingress leaf node, a spine switch, and an egress leaf node. The “appears as a single routed hop” notion is logical as the specific physical devices in the ACI fabric are NOT individual routed hops from the perspective of the endpoint.

More about spine forwarding

Part of Cisco’s pitch for ACI is scale. You can build a really large ACI fabric on Nexus 9000 hardware that Cisco has priced aggressively. The Nexus 9000 gear is based in part on Broadcom silicon, just like what’s found in a lot of other switches, so where’s the unique value proposition? The uniqueness comes in the “plus” silicon, as in the term “merchant plus” used to describe the ASICs found in the Nexus 9000. Here’s how I understand it.

  • The fabric cards found in spine switches are based on the Broadcom chipset.
  • The line cards in the spine switches are where the “plus” ASICs built by the Insieme team are found. The line cards are what allow the “million address database” to be built. That seems like a sizable enough fabric to handle the data centers of most of us.

Remember that the ACI fabric auto-configures forwarding. Neither network engineers nor APIC program the forwarding tables or configure routing. So, how does an ACI fabric learn about all the VTEPs? There are three main components (sketched in code after this list).

  1. A DHCP pool is used in the spine to assign addresses to the VTEPs.
  2. VTEPs are learned automatically via LLDP.
  3. IS-IS builds a routing table of VTEP IP addresses using the information discovered via LLDP.
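
Strung together, the bring-up might look something like the sketch below. The function names and data are invented for illustration; the real work is done by DHCP, LLDP, and IS-IS running on the switches themselves, not by APIC or an engineer:

```python
# The three pieces above, strung together as an illustrative bring-up sequence.
# Function names and data are invented; the real work is done by DHCP, LLDP,
# and IS-IS running on the switches themselves, not by APIC or an engineer.

def assign_vtep_addresses(nodes, dhcp_pool):
    """Step 1: spine-hosted DHCP hands each fabric node a VTEP address."""
    return dict(zip(nodes, dhcp_pool))

def discover_adjacencies(cabling):
    """Step 2: LLDP tells each node which fabric neighbors it is wired to."""
    return list(cabling)   # in reality, learned per-port from LLDP TLVs

def build_vtep_routes(vteps):
    """Step 3: IS-IS floods the topology and computes any-VTEP-to-any-VTEP reachability."""
    return {me: sorted(ip for node, ip in vteps.items() if node != me_node)
            for me_node, me in vteps.items()}

nodes = ["leaf1", "leaf2", "spine1"]
vteps = assign_vtep_addresses(nodes, ["10.0.0.11", "10.0.0.12", "10.0.0.1"])
adjacencies = discover_adjacencies([("leaf1", "spine1"), ("leaf2", "spine1")])
routes = build_vtep_routes(vteps)
print(routes["10.0.0.11"])   # every other VTEP is reachable: ['10.0.0.1', '10.0.0.12']
```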

What happens during multicast forwarding?

Multicast over ACI follows the theme of efficiency. Simply, the ACI fabric builds a multicast tree. Incoming multicast frames are copied by spines to the leaf switches that have receivers requesting the stream. Leaf switches then replicate the multicast frame to the endpoints that should receive it. At a glance, this strikes me as similar to how Avaya Shortest Path Bridging forwards multicast traffic through an SPB fabric.
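
A toy version of that replication tree, with made-up group membership data:

```python
# Toy model of fabric multicast replication: the spine copies an incoming
# multicast frame only to the leaves with interested receivers, and each of
# those leaves copies it only to its interested endpoints. Data is made up.

GROUP_MEMBERS = {                       # multicast group -> {leaf: [endpoints]}
    "239.1.1.1": {"leaf2": ["host-a", "host-b"], "leaf5": ["host-c"]},
}

def spine_replicate(group):
    return list(GROUP_MEMBERS.get(group, {}))           # one copy per interested leaf

def leaf_replicate(group, leaf):
    return GROUP_MEMBERS.get(group, {}).get(leaf, [])   # one copy per interested endpoint

for leaf in spine_replicate("239.1.1.1"):
    print(leaf, "->", leaf_replicate("239.1.1.1", leaf))
# leaf2 -> ['host-a', 'host-b']
# leaf5 -> ['host-c']
```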

What happens when a leaf uplink to a spine fails?

One of the most likely physical failures an ACI fabric will experience is that of a leaf-spine link failing. When this happens, there is nearly instant convergence and re-hash of the ECMP forwarding on that leaf node, as fast as 125 microseconds (not milliseconds). The convergence is not dependent on a link aggregation protocol.
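
One way to picture the re-hash: flows are spread across a leaf’s spine uplinks by a hash, and losing an uplink simply shrinks the set the hash maps into. A toy version (real hardware does this in ASICs, not Python):

```python
# Toy ECMP re-hash: flows are spread across a leaf's spine uplinks by hashing
# the flow; when an uplink fails, the hash simply maps into the smaller
# surviving set. Real hardware re-hashes in microseconds; this is just the idea.

import hashlib

def pick_uplink(flow, uplinks):
    digest = hashlib.md5(flow.encode()).digest()
    return uplinks[int.from_bytes(digest, "big") % len(uplinks)]

uplinks = ["spine1", "spine2", "spine3", "spine4"]
flow = "10.1.1.10:49152->10.2.2.20:443"

print("before failure:", pick_uplink(flow, uplinks))
uplinks.remove("spine2")                              # leaf-to-spine2 link goes down
print("after failure: ", pick_uplink(flow, uplinks))  # flow maps onto a surviving uplink
```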

What happens when a node or indirect link fails?

Another failure that ACI fabric might need to respond to is a node (leaf switch) falling out of the fabric. From a topological standpoint, the failure of an indirect link is similar to a node failure. How does ACI react to these events?

IS-IS, the routing protocol between leaf and spine switches, will converge as quickly as any IGP. IS-IS builds a routing table between VTEPs: any VTEP to any VTEP, and, practically speaking, any leaf node to any leaf node. IS-IS convergence happens independently of APIC.

Remember that APIC is not interested in how to forward traffic; APIC assumes packets will be delivered, a contrast with the OpenFlow model, which is focused primarily on flow forwarding. This is key from a couple of different perspectives.

  1. There is no latency penalty incurred while APIC figures out the new topology, runs an algorithm to settle on how best to forward traffic, and then updates forwarding tables in the fabric. APIC isn’t involved in that part of the process.
  2. The assumption of traffic delivery is somewhat parallel to overlay designs that assume the underlay will deliver traffic, but are not controlling how that forwarding is done. ACI has a leg up in the sense that although APIC is not building the underlay, ACI as a system can still provide insight into what’s going on in the physical fabric. The decoupling of policy (APIC) from delivery (ACI fabric) hasn’t left network operators blind.

How large does ACI fabric scale?

ACI scales up to 1M entries for MAC, IPv4, and IPv6. All reachability information is stored in all spine switches, while leaf nodes only learn information about the remote hosts they need to know about. Therefore, not every leaf switch knows about every endpoint everywhere.

So, what happens if a leaf needs to forward to an endpoint it doesn’t know about? The leaf switch uses the spine switch as a penalty-free proxy (aka a “zero penalty miss”). “Hey, I don’t know how to deliver this. You know everything, so please send this where it needs to go.” This is a zero penalty miss because the spine would have received this traffic anyway. In a sense, this is like default routing.
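
The lookup logic reads almost like a default route with extra context. A sketch, with invented addresses and table contents:

```python
# Sketch of the "zero penalty miss": a leaf tunnels straight to the egress VTEP
# for destinations it has learned, and hands everything else to the spine proxy,
# which holds the full endpoint database. Addresses and tables are invented.

SPINE_PROXY_VTEP = "10.0.0.1"    # proxy address served by the spines

LOCAL_LEAF_CACHE = {             # only the remote endpoints this leaf has needed so far
    "10.1.1.20": "vtep-leaf4",
}

def next_hop_for(dst_ip):
    vtep = LOCAL_LEAF_CACHE.get(dst_ip)
    if vtep is not None:
        return vtep              # known endpoint: tunnel directly to its leaf
    return SPINE_PROXY_VTEP      # miss: the spine is in the path anyway, so no penalty

print(next_hop_for("10.1.1.20"))   # vtep-leaf4
print(next_hop_for("10.9.9.9"))    # 10.0.0.1 -- behaves much like a default route
```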

How does ACI fabric connect to the outside world?

There are a couple of discussions worth having about connecting ACI fabric to the rest of the network — one about layer three, and one about layer two.

First, ACI fabric can peer with the outside world using traditional routing protocols such as BGP or OSPF. You can even use static routes if you like. This is done at a leaf node, not a spine, which might be counterintuitive if you were thinking of a spine switch like a core switch. The spine switch isn’t in the role of core switch when it comes to forwarding – remember that the spine’s primary role is to move data between VTEPs terminating on leaf nodes. Okay. So, when the leaf node peers with the outside world, the leaf sends what it has learned to the spine, and the spine, functionally, becomes a route-reflector. Remember how the spine is a proxy for destinations the leaf doesn’t know? The same logic applies to external routes as for routes within the fabric.
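
A rough model of how that might hang together, with invented names throughout: the border leaf learns a prefix from an outside peer, hands it to the spine, and the spine then answers for it the same way it answers for endpoints a leaf hasn’t learned. This is my interpretation of the description above, not a documented ACI mechanism:

```python
# Rough model of external connectivity as described above: a border leaf peers
# with the outside world, reflects what it learns to the spine, and the spine
# then proxies for those external prefixes just as it does for endpoints a leaf
# hasn't learned. Names and structures are invented for illustration.

spine_proxy_table = {}   # prefix -> VTEP of the border leaf that can reach it

def border_leaf_learns(prefix, border_leaf_vtep):
    """Border leaf learns a route externally (BGP/OSPF/static) and tells the spine."""
    spine_proxy_table[prefix] = border_leaf_vtep

def spine_proxy_lookup(prefix):
    """A leaf that misses locally asks the spine, including for external prefixes."""
    return spine_proxy_table.get(prefix, "unknown")

border_leaf_learns("0.0.0.0/0", "vtep-border-leaf1")   # default route from an outside peer
print(spine_proxy_lookup("0.0.0.0/0"))                 # vtep-border-leaf1
```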

Second, ACI leaf switches can connect to existing switches you might already have on your network at layer two. The main point of this sort of interconnect is to allow pre-existing network hardware to leverage ACI policy groups. To accomplish this, an existing switch can connect to an ACI leaf node switch. The default gateways are moved to the ACI fabric. Traffic then flows from the host, through the legacy switch, and into the ACI leaf switch. From there, the standard ACI forwarding process applies.

Surely, APIC has something to do with all this forwarding wizardry, no?

I think I’ve hammered this point home, but it’s such a key distinguisher in what Insieme built for Cisco here that it’s worth dwelling on. So, one more time, APIC is not in the traffic forwarding business. If the ACI fabric itself tells packets where to go, APIC tells packets when to go, so to speak. APIC’s role is to make the connection between endpoint groups. APIC is about policy — who can talk to whom via what chain of services or rules — and not about forwarding. Forwarding is a function that the controller assumes is taken care of.
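
Reduced to its simplest form, the policy question is something like “is there a rule connecting these two groups, and what does it permit?” A very rough sketch; the group names, ports, and rule structure are invented, and this is not the APIC object model:

```python
# Very rough sketch of the policy idea: endpoints belong to groups, and traffic
# between two groups is allowed only if a rule (ACI calls these contracts)
# connects them. Group names, ports, and structure are invented; this is not
# the APIC object model.

CONTRACTS = {
    ("web-epg", "app-epg"): {"allow": [("tcp", 8080)]},
    ("app-epg", "db-epg"):  {"allow": [("tcp", 3306)]},
}

def permitted(src_epg, dst_epg, proto, port):
    contract = CONTRACTS.get((src_epg, dst_epg))
    return contract is not None and (proto, port) in contract["allow"]

print(permitted("web-epg", "app-epg", "tcp", 8080))   # True  -- a contract permits it
print(permitted("web-epg", "db-epg",  "tcp", 3306))   # False -- no contract between these groups
```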

Another item to note about Cisco’s APIC is that, at least today, APIC needs Nexus 9K switches running in ACI mode to accomplish anything. APIC cannot be used to control other network hardware. APIC and ACI come as a matched hardware and software set. I don’t know if this changes in the future or not. Remember that the Nexus 9K platform has custom ASICs that help make ACI work, so it’s not as if a simple software upgrade on other Nexus gear (or other Broadcom-based switches from other vendors) can turn it into something APIC can control. How ACI will roll out across the rest of the Cisco infrastructure remains to be seen.

Does the way the Cisco ACI fabric forwards traffic actually matter?

That might seem like a dumb question to ask at the end of a long blog post about how Cisco ACI forwards traffic, but I think the answer is sort of interesting. In a certain sense, how ACI fabric forwards traffic doesn’t matter at all. Cisco intends ACI fabric to be an invisible transport. Packets come in, get forwarded, and go out.

You could think of ACI fabric as a chassis switch with line cards. Do you know what happens inside of a chassis switch? You might have researched the topic. There are whitepapers out there explaining ASICs, front port mappings, crossbar fabrics, virtual output queues, and so on. But aside from perhaps raw capacity planning and QoS capabilities, how much of what happens in a chassis switch does a network engineer really have to know? Not a lot — a chassis switch can be treated like a black box. Is it helpful to know a bit more, especially when a chassis switch underperforms or starts to fail? Yes, as that knowledge can speed problem resolution. But still, deep chassis switch knowledge is not a requirement to operate a chassis switch.

My point is that it will be possible for organizations to deploy Cisco ACI fabric with essentially no knowledge of how traffic is forwarded between leaf nodes. Is the information interesting? Yes. Is it important information to fully grasp ACI scale? Absolutely. Is understanding fabric latency useful data? Yes, especially for those applications expecting certain network fabric characteristics. But generally speaking, organizations should be able to think about ACI fabric as a chassis switch. Traffic goes in, gets forwarded, and comes out again.

The APIC is the real magic, and where organizational attention should fall. With Cisco ACI, the network is all about the policy, thus the moniker “Application Centric Infrastructure.” It’s not Network Centric Infrastructure. Strategically, ACI is attempting to get the network out of the way.


Disclaimer

I spent a lot of time putting this post together. While I believe I got the major parts right, there’s enough nuance and detail here that I possibly got something wrong. I enthusiastically welcome any corrections. You can comment below (best, so that everyone can see) or e-mail me directly. Either way, I’ll incorporate changes to this post as soon as I reasonably can.

I strongly recommend that you watch the following videos where Cisco folks Joe Onisick and Lilian Quan talk about ACI fabric forwarding in even more detail than I do here. These videos are where I got practically all of my information, and are valuable references if this sort of nerdery is interesting to you.

18 comments

  • Great write-up. I’ve read many articles on Cisco ACI, and in most cases you finish reading them scratching your head, trying to work out whether the person who wrote it understood the topic, because as a reader you didn’t.

    I’d like to challenge your response to “Does the way the Cisco ACI fabric forwards traffic actually matter?”. In my mind, network engineers running critical services inside a major data center MUST understand how their chassis works. They need to understand various performance aspects (latency, bandwidth, pps, etc.), they need to understand what will failover smoothly through clustering and what will break, what will be the impact of certain maintenance work, etc. Our experience, by the way, is that the chassis devices are always at the top of the list of devices our customers want us to cover.

    I see ACI going the same way. The magical behavior is based on software, hardware and a ton of levers you can play around with. Even at its young age, the 9K already has an extensive troubleshooting guide (http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/6-x/troubleshooting/guide/b_Cisco_Nexus_9000_Series_NX-OS_Troubleshooting_Guide.pdf). Engineers who don’t fully understand what’s going on risk getting caught off guard.

    • Yoni,

      I agree completely that network engineers will want to have access to the inner workings of fabric forwarding. We intend for this to be completely transparent, not black box wizardry. If something is missing it’s because we missed it, or are writing it. Feel free to point us to information that is lacking.

      I think Ethan’s point was more to the idea that if we got it right, you won’t be worried about forwarding on a daily basis. Instead you’ll be focused on adapting, and building new business initiatives (apps and services.)

      Joe Onisick
      Principal Engineer, Cisco ACI/Nexus 9000

  • If link-state forwarding using IS-IS is mystical, that’s a sad commentary on the networking industry. IMO we should have had fabrics in the base firmware image 5 years ago.

    One thing I would add is that packet headers can trigger a policy before they get stripped; they’re not ignored. For example, mapping a VLAN ID to a group ID.

  • Great article – at some point you mention “Forwarding is function that is assumed by the controller.” Is that a typo? I don’t believe the controller forwards any packets.

  • So, BigF*Fabric, abstracted and transparent (until it goes wrong). That’s fine, and if it’s reliable enough that I don’t have to think about it (like firmware or the knobs of old) that’s great. This is a big deal because….? ECMP is ‘dumb’ on a number of levels, it’s not attractive or sophisticated. Clos less so but there’s nothing here that excites.

    • In fairness to ACI, the fabric is only one aspect. I could argue that the fabric is not even the key aspect of the ACI strategy (although the fabric *is* a critical component ACI can’t live without at present).

      I think ACI’s real magic is in APIC. APIC leverages ACI fabric to control how traffic travels between endpoint groups using policy described at the APIC console. APIC policy integrates with third parties who want to take part. Cisco’s made this part of ACI “open” in the sense that they’ll work with anyone who wants to get on board.

      Case in point, I spoke to Nathan Pierce at F5 yesterday who discussed briefly how F5 Scale-N integrates with APIC/ACI using Cisco’s device package concept. The way he explained it to me, you get drag/drop ADC functionality, all from APIC. Not a full-featured implementation of everything you can do with BIG-IP, but a good start. That sort of functionality is where ACI becomes intriguing to me. Bringing it all together.

      • Great write up, thanks for sharing, when do you think NSX folks will realize their white-box-fabric is wrong ;)

  • So I have two data centers 20 Km apart with fiber between the sites.
    Can this over time work for us too without putting a routing layer between sites in place ?

    • Let’s assume that you have enough fiber (or other means) available to you to wire up the leaf-spine topology in the expected way over the 20 km distance, so that you’ve still got a 2-tier leaf-spine, just distributed across two DCs 20 km apart. You’d still have different latency characteristics for different flows. Traffic between endpoints in the same DC would have lower latency than traffic between endpoints in a different DC. This could introduce anomalies in your application. 20 km doesn’t introduce much latency, but it’s not nothing. Depending on the application and the tiers involved, this is a possible issue to consider.

      Another issue to consider is split-brain. What happens to an ACI fabric if the 20Km links are severed, leaving 2 ACI fabric islands? I’m not certain. Assuming the two fabrics just converge on themselves and continue forwarding, there’s another question of the APIC. Let’s assume the APIC design is redundant with one in each DC. Okay – maybe this works for a while. But what happens when the links are returned to service? How do the APICs sort themselves out? What are the caveats involved in a major fabric convergence like that? Again, I’m not certain.

      All this to say that I can conceive of stretched ACI fabric being possible as easily as not possible, so it’s a question for Cisco whether they have an ACI reference design for this sort of a topology. They very well might, but I’ve been through some of the whitepapers, and don’t recall that kind of design. Could be my memory is faulty, though.

      I know some of the folks at Cisco were monitoring this post, so perhaps they can share some design wisdom here.

  • It’s true enough that APIC is what gets my attention. The policy related concepts in particular.

    I want the fabric to be a black box that ‘just works’ (sorry low level VLAN creating, STP tweaking types). I’ve no practical knowledge or understanding (yet) to know if I can ‘trust’ it to that level today but I’m sure the industry will get there. When it does, I still expect the ‘old guard’ to be standing around with their pitchforks talking about rebasing cost metrics.

    As you may know, based on my recent PP article, I’m rather frustrated by the limited functionality and usability of a number of ‘integrations’ I’ve seen or used lately, with F5 unfortunately being no exception. A chicken and egg situation I know, why invest before there is real demand, but there will be no demand unless you do. And who/what to choose?

    Ultimately I’ve reluctantly come to the conclusion that although I would rather it were not the case a big vendor is needed to bring a number of concepts I think we all agree upon to life. I’m talking about policy, automation, smooth and simple integration and to some extent ‘plug and play’ functionality for layer three and below. For layer four and above things are more complex but I still think there’s much room for progress and opportunities for simplification. By definition this means more complex software (and perhaps related hardware functions) but less complex user interfaces and interaction.

    I see some huge ‘gaps’ that simply aren’t being addressed when they should be, for many reasons, mostly commercial. I also see opportunities for those with vision and the skill to execute on it (not me sadly). Currently, real options for those with the skill and opex capital to move forward are lacking. But of course this circles back to that big vendor.

    Business and operations culture is what must be changed but without suitable tools to execute on intent and desire, networking has nothing but the past and memories to bring it into the future; hardly an inspiring foundation to reach from.

    On that note, yawn, standards are slow. Can we add a CPU usage value to LLDP and base some decisions on that? Same for LAG but b/w related. Simple, minor things that make a big difference.

  • Hi,
    Just to be super-duper-stick-in-the-ass correct.
    “The line cards in the spine switches are where the “plus” ASICs built by the Insieme team are found. The line cards are what allow the “million address database” to be built. ” Nnnnot precisely true. The Insieme ASIC in the spine switch (the Alpine chip) doesn’t actually store those addresses. Those are actually stored in the “backplane” Tridents (those fabric cards in the back of the switch).
    Whole thing is a bit more complicated (synthetic IP and MAC and DHT and whatnot), but just to be precise.

    As for DC distance, AFAIK 10km (long-distance QSFP) is currently supported.

    • THANK YOU for being pedantic. More information is better. I will make a note to dig into that architecture more deeply, and update my article when I’ve had the chance. In the meantime, your comment is here for folks to consider as they research the N9K platform.

  • Oh, forgot the split-brain.
    Very briefly – spines are Oracles, and they have a Council*. As in ancient Greece, they know their shit.
    One of the spines is the master Oracle (Pythia). Pythia decides what’s right and what’s not. And we all bow to her wisdom.
    And if Pythia dies, another one takes her place.

    * And they have a plan. Council of Oracles (Protocol).

    (and no, it’s not a joke)

By Ethan Banks