Over the last few years, I have been spread across so many technologies that I have almost forgotten where my roots began: in the world of the data center. Therefore, I decided to delve deeper into what’s prevalent and headed straight to Ivan Pepelnjak’s EVPN webinar, hosted by Dinesh Dutt.
I have known of Dinesh since he was the chief scientist at Cumulus Networks, and for me, he is a leader in this field. Before reading his book on EVPN, I decided to give Dinesh a call to exchange our views about the beginning of EVPN. We talked about the practicalities and limitations of the data center. Here is an excerpt from our discussion.
The truth is, we are currently in a process of transition, and the data center architecture has gone through a number of design phases: let’s call them waves. Ideally, transitioning through each wave is a type of network hygiene. We are now in wave 3, which is based on Ethernet VPN (EVPN). There is a bias in technology that tilts it in certain ways, and presently the tilt is towards layer 3 networks. As more enterprises adopt the leaf and spine topology, EVPN is the technology they use to make traditional applications work on this architecture. For a better understanding, let’s look at the mechanics of the three waves.
The data center starting point
The first wave was an attempt to build a data center in the same way enterprise networks had been built in the past. It included the standard access, aggregation and core layers, which led to an end-of-row architecture. Wave one comprised a pair of “God” boxes central to the network that served as the gateway to the external world.
The wave one architecture was suitable for north to south traffic flows. This type of traffic flow goes in and out of the data center. However, many applications operating inside the data center required cross communication and the ability to talk amongst themselves. In addition, virtualization was a big driver for this new east to west traffic flow.
Unfortunately, the wave one design did not work efficiently for east to west traffic. The primary problems were scale and reliability issues caused by the reliance on layer 2. Layer 2 networks relied on a brittle protocol soup, and in many cases, included proprietary extensions.
The entire access, aggregation and core architecture was structured around the layer 2 switching model. This means all traffic was forwarded using layer 2 headers until it reached the “God” boxes located in the center of the network. Technologies such as spanning tree protocol (STP), multi-chassis link aggregation (MLAG), transparent interconnection of lots of links (TRILL) and Cisco’s FabricPath were all structured around the idea of using the layer 2 switching model.
However, the layer 2 switching model exposed many disadvantages in network redundancy, scale, and reliability compared with the robust layer 3 network model, which started the second design wave.
The primary difference between layer 2 and layer 3 networks is the way packet forwarding works. Within a layer 2 network, the lookup is performed on the media access control (MAC) address, and packets are forwarded by MAC address. This is in contrast to layer 3 networks, where forwarding works at the network layer and is based on the Internet Protocol (IP) address.
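To make the difference concrete, here is a minimal Python sketch of the two lookups. The tables, port names and next-hop names are made up for illustration, and the sketch assumes the usual behavior of an exact match on the MAC address at layer 2 and a longest-prefix match on the IP address at layer 3.

```python
import ipaddress

# Hypothetical tables, for illustration only.
mac_table = {
    "aa:bb:cc:00:00:01": "port1",
    "aa:bb:cc:00:00:02": "port2",
}
route_table = {
    ipaddress.ip_network("10.0.0.0/8"): "spine1",
    ipaddress.ip_network("10.1.0.0/16"): "spine2",
    ipaddress.ip_network("0.0.0.0/0"): "border",  # default route
}


def l2_lookup(dst_mac):
    """Layer 2: exact match on the full MAC address.

    Returns None when the address is unknown, which is the case
    where a switch would fall back to flooding.
    """
    return mac_table.get(dst_mac)


def l3_lookup(dst_ip):
    """Layer 3: longest-prefix match on the destination IP address."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in route_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return route_table[best]


print(l2_lookup("aa:bb:cc:00:00:02"))  # port2
print(l3_lookup("10.1.2.3"))           # spine2 (the /16 beats the /8)
print(l3_lookup("192.0.2.1"))          # border (only the default route matches)
```

The point to notice is that a MAC address is flat, so the table needs one entry per host, while IP prefixes can summarize many hosts in one entry.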
How to achieve a robust network
The rule for a robust network is that it should not contain a single failure domain that can cause the network to partition. Therefore, you need to design with redundancy in mind. However, when you have redundancy in the network, how do you forward packets in a way that does not cause a loop?
Fortunately, layer 3 networks have routing protocols, for example BGP, OSPF, IS-IS and Cisco’s EIGRP. A routing protocol enables the construction of a loop-free forwarding topology.
The second aspect is a field in the IP header known as the time to live (TTL). Every time a packet traverses a layer 3 device, the TTL is decremented by one. Once the TTL goes to zero, the packet is dropped.
As a result, in layer 3 network designs, we have two mechanisms to prevent the formation of loops: the routing protocols, which build a loop-free forwarding path for a given prefix, and the TTL field in the IP header, which ensures that a packet cannot circulate forever. In layer 2 networks, however, we are not fortunate enough to have such mechanisms.
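The TTL safety net can be sketched in a few lines of Python. This is a simplified model, not router code: the router names and the `next_hop` dictionaries are hypothetical, and each entry simply says where a router sends packets for the destination.

```python
def forward(next_hop, node, dst, ttl):
    """Follow next hops until the packet is delivered or its TTL runs out.

    Returns a (status, hops) tuple, where hops counts the links crossed.
    """
    hops = 0
    while node != dst:
        ttl -= 1                  # every layer 3 device decrements the TTL
        if ttl <= 0:
            return ("dropped", hops)
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# A healthy, loop-free path: r1 -> r2 -> r3
healthy = {"r1": "r2", "r2": "r3"}
print(forward(healthy, "r1", "r3", ttl=64))   # ('delivered', 2)

# A transient routing loop: r1 and r2 point at each other
looped = {"r1": "r2", "r2": "r1"}
print(forward(looped, "r1", "r3", ttl=64))    # ('dropped', 63)
```

Even when the routing tables are briefly wrong, the looping packet dies after a bounded number of hops instead of circulating forever.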
The layer 2 ease of use model
The layer 2 model is structured around ease of use. In earlier times, routing protocols were considered complex to configure and operate. Hardware packet switching, and hence higher throughput and lower latency, started with layer 2 networks. Most vendors, even to this day, charge an additional license fee for routing but none for layer 2. Layer 2 networks therefore offered complete ease of connectivity.
However, under the hood, the layer 2 model opens a Pandora’s box of some pretty dangerous drawbacks. The layer 2 model did not have a protocol that enabled the construction of a loop-free forwarding path. Rather, it is based on flooding. Essentially, a layer 2 switch takes note of the source MAC address of each packet and the port it arrived on, and if it doesn’t know the destination MAC address, it floods the packet out of all ports other than the one the packet came in on.
We need redundancy in the layer 2 network so that a single failure does not partition the network or cause a node to be blackholed. To avoid such issues, the network requires redundant paths.
However, if you don’t have a TTL or a mechanism to construct a loop-free forwarding path for a given prefix, then a loop can cause a packet to circulate forever. The switch simply floods the packet, and in the case of redundant paths, a flooded packet can quickly result in a complete network meltdown.
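The flood-and-learn behavior described above can be sketched as a tiny Python class. The class name, port numbers and MAC labels are hypothetical, and real switches add aging timers, VLANs and much more.

```python
class LearningSwitch:
    """Minimal flood-and-learn layer 2 switch (illustrative sketch only)."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports to send it out of."""
        # Learn: remember which port the source address lives behind.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood out of every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is on the same segment the frame came from: nothing to do.
            return []
        return [out_port]


sw = LearningSwitch(ports=[1, 2, 3])
print(sw.receive(1, "A", "B"))  # [2, 3]  B is unknown, so flood
print(sw.receive(2, "B", "A"))  # [1]     A was learned on port 1
print(sw.receive(1, "A", "B"))  # [2]     B is now known on port 2
```

Nothing in this logic knows about the topology, which is why redundant links are dangerous without an extra protocol on top.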
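To get a feel for how quickly this goes wrong, the following Python sketch counts the copies of a single broadcast frame in flight when every switch floods and nothing expires the frame. The topologies and switch names are invented for the example, and the model ignores MAC learning and link speeds.

```python
from itertools import combinations


def flood_rounds(links, origin, rounds):
    """Count frame copies in flight after each round of blind flooding."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each copy in flight is a (sender, receiver) pair.
    in_flight = [(origin, n) for n in neighbors[origin]]
    counts = [len(in_flight)]
    for _ in range(rounds - 1):
        next_round = []
        for sender, receiver in in_flight:
            # Flood out of every link except the one the copy arrived on.
            next_round.extend(
                (receiver, n) for n in neighbors[receiver] if n != sender
            )
        in_flight = next_round
        counts.append(len(in_flight))
    return counts


switches = ["s1", "s2", "s3", "s4"]
full_mesh = list(combinations(switches, 2))          # redundant paths everywhere
star = [("s1", "s2"), ("s1", "s3"), ("s1", "s4")]    # no redundancy, no loops

print(flood_rounds(full_mesh, "s1", 4))  # [3, 6, 12, 24]
print(flood_rounds(star, "s1", 4))       # [3, 0, 0, 0]
```

In the loop-free star the frame is delivered once and disappears. In the redundant mesh the number of copies doubles every round, and with no TTL there is nothing to stop it.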
Preventing loops in layer 2 networks
In order to prevent layer 2 network loops, the spanning tree protocol (STP) was introduced. Spanning tree looks at all the redundant paths and then removes them from use by blocking the redundant links. This is what I call an interesting approach!
The spanning tree protocol constructs a loop-free topology that applies to every packet in the network, unlike a routing protocol that constructs a loop-free topology per source router.
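The effect of spanning tree on a redundant topology can be sketched as follows. This is not the STP algorithm itself: real STP elects the root by bridge ID and compares path costs using BPDUs, whereas this sketch simply picks the lowest-named switch as root and runs a breadth-first search. The switch names are hypothetical.

```python
from collections import deque


def spanning_tree(links):
    """Return (root, active_links, blocked_links) for an undirected topology."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    root = min(neighbors)          # stand-in for the root bridge election
    seen = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for n in sorted(neighbors[node]):
            if n not in seen:
                seen.add(n)
                active.add(frozenset((node, n)))   # link stays forwarding
                queue.append(n)

    # Every link that is not part of the tree is blocked.
    blocked = {frozenset(link) for link in links} - active
    return root, active, blocked


triangle = [("s1", "s2"), ("s2", "s3"), ("s1", "s3")]
root, active, blocked = spanning_tree(triangle)
print(root)                               # s1
print(sorted(sorted(l) for l in active))  # [['s1', 's2'], ['s1', 's3']]
print(sorted(sorted(l) for l in blocked)) # [['s2', 's3']]
```

The loop is gone, but so is a third of the bandwidth in this small example: the blocked link carries no traffic until something fails.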
Vendors introduced various tricks to make spanning tree perform more efficiently, but the fundamental cause of network instability was the flooding. It made the end stations suffer and opened the door to DoS attacks.
Fail closed or fail open
A routing protocol fails closed. If a routing protocol does not know how to get to a destination, it will not send the packet to that destination.
However, the spanning tree protocol does the opposite and fails open. Spanning tree was constructed on the principle, “if I don’t hear from you, then I assume you need a packet from me.” This is in contrast to routing protocols, which follow the principle, “if I don’t hear a hello from you, I presume you don’t want to talk to me.”
Failing open is dangerous in a number of ways. For example, a bad cable causing a unidirectional connection, or a loaded CPU unable to send out a hello packet in time, could create a loop. With routing, the TTL will eventually kill the packet, thereby stopping it from looping forever. However, since there is no TTL in layer 2, you have inherently created an unstable network.
Layer 2 networks have a large blast radius and there is no fine-grained failure domain. A single link failure can affect the entire network. All these factors rendered the first wave of data center design ineffective. A cleaner design is to move from the layer 2 switching model to IP and network routing protocols.
The 2nd wave of data center design
The next wave of data center design was to build scalable networks, the ones that were predictable, robust, and supported fine-grained failure domain designs. In addition to efficiently supporting east to west traffic, we needed as much forwarding capacity as possible. STP prevented the use of additional bandwidth and so an alternative protocol had to be used.
This challenge was no different from what was faced in telephony networks back in the early 1950s. Charles Clos solved this with the Clos network topology. The Clos network topology can scale to a number of tiers. Pioneering web scale companies used small white box switches with a low port count, so they needed 8 layers of the Clos network. However, for the majority, two tiers are good enough.
The new wave of data center design is known as leaf and spine, which essentially is a Clos network. The design allows you to build a network that is not limited by the scale of a single unit. In layer 2 networks, by contrast, the scale was limited by the two central “God” boxes.
The capacity was not just controlled by the number of ports but also by the control plane. What also mattered was how quickly it could send out STP packets without the risk of causing a network meltdown. The leaf and spine network topology, by contrast, enables high capacity by utilizing all the redundant links. IP and routing work flawlessly with this topology since it eliminates the need for vendor-specific kludges.
Therefore, by switching to IP, the instability of layer 2 evaporated and we could build high-capacity, straightforward networks with IP forwarding. The leaf and spine network topology design offers very simple building blocks.
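As a rough illustration of why the fabric is not limited by a single unit, here is a back-of-the-envelope Python calculation for a two-tier leaf and spine. It rests on assumptions that are not from the discussion above: every switch is identical, each leaf splits its ports evenly between servers and spines (no oversubscription), and each leaf has exactly one link to every spine.

```python
def two_tier_clos(ports_per_switch):
    """Size a non-oversubscribed two-tier leaf and spine built from one switch model."""
    down = ports_per_switch // 2      # leaf ports facing servers
    up = ports_per_switch - down      # leaf ports facing spines
    spines = up                       # one uplink from each leaf to each spine
    leaves = ports_per_switch         # each spine port connects one leaf
    return {
        "spines": spines,
        "leaves": leaves,
        "servers": leaves * down,
    }


print(two_tier_clos(32))  # {'spines': 16, 'leaves': 32, 'servers': 512}
print(two_tier_clos(64))  # {'spines': 32, 'leaves': 64, 'servers': 2048}
```

Under these assumptions, the number of server ports grows with the square of the switch port count, and adding another tier multiplies it again, which is how low port count switches can still build very large networks.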
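One common way routed fabrics use all the redundant links at once is equal-cost multipath (ECMP), where a hash of the flow identifiers picks the uplink. The following Python sketch illustrates the idea only: the function name and the use of CRC32 are assumptions made for the example, and switch hardware uses its own hash functions.

```python
import zlib


def ecmp_pick(uplinks, src_ip, dst_ip, proto, src_port, dst_port):
    """Pick an uplink for a flow by hashing its 5-tuple.

    Packets of the same flow always hash to the same uplink, which avoids
    reordering, while different flows spread across the uplinks.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]


uplinks = ["spine1", "spine2", "spine3", "spine4"]
first = ecmp_pick(uplinks, "10.0.1.10", "10.0.2.20", "tcp", 40000, 443)
again = ecmp_pick(uplinks, "10.0.1.10", "10.0.2.20", "tcp", 40000, 443)
print(first == again)  # True: a flow sticks to one path
```

Unlike spanning tree, no link is blocked: every uplink is a valid next hop, and the hash decides which flows land on which one.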
The move to white box
So now, we had the opportunity to build very simple networks with IP forwarding. However, deploying equipment from the big players of the time, such as Cisco or Juniper, came with a heavy premium that you didn’t need to pay, as the network topology was now very simple.
All that was needed was IP routing and a forwarding protocol. As a result, we began to see the introduction of white box switches. Initially, white box switches did not have good CPUs; therefore, pioneering companies like Google could not run a routing protocol on them. All the control logic was centralized and moved out of the box, leaving the local device to only program the merchant silicon. This led to the rise of the OpenFlow model.
As time progressed, other web scale companies designed with traditional distributed routing protocols to set up the forwarding rather than pulling it out to a centralized location.
Routing protocols with leaf and spine architectures
There are two types of routing protocols: distance vector and link state. To understand the difference between them, you need to know how they convey information.
With distance vector, you tell your neighbors about your entire view of the world. Link state protocols, by contrast, piece together the global perspective by stitching together everyone’s local perspective.
Generally, network operators prefer link state protocols over distance vector. This is because link state protocols are better and faster at resolving paths to destinations in the presence of failures, whereas distance vector protocols can sometimes get confused and run into problems such as count-to-infinity.
However, in the case of link state, the scale of the link state database can become an issue. Link state protocols have the notion of areas (in OSPF) or levels (in IS-IS) that are used to break domains into a hierarchy to avoid scaling problems.
Border gateway protocol (BGP) is a variation of distance vector known as a path vector protocol. BGP runs the Internet and is simple to operate, yet it is a sophisticated protocol. It is a mature protocol with probably the most deployment and operational experience of all the routing protocols. It is available in a number of implementations, including open source routing suites.
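The two styles can be contrasted in a short Python sketch. The function names, node names and costs are hypothetical. The distance vector half applies a Bellman-Ford style update to a neighbor’s advertised table, and the link state half runs Dijkstra’s shortest path algorithm over a database that every node is assumed to have already received by flooding.

```python
import heapq


def dv_update(my_table, neighbor, link_cost, neighbor_table):
    """Distance vector: merge a neighbor's view of the world into my table.

    my_table maps destination -> (distance, next_hop).
    neighbor_table maps destination -> distance as advertised by the neighbor.
    Returns True if anything changed (and would need to be re-advertised).
    """
    changed = False
    for dest, dist in neighbor_table.items():
        candidate = link_cost + dist
        if dest not in my_table or candidate < my_table[dest][0]:
            my_table[dest] = (candidate, neighbor)
            changed = True
    return changed


def ls_spf(lsdb, source):
    """Link state: shortest paths computed locally from the full database.

    lsdb maps node -> {neighbor: cost}; each entry is one node's local view.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in lsdb.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


# Distance vector: node "a" hears from neighbor "b" over a link of cost 1.
table_a = {"a": (0, None)}
dv_update(table_a, "b", 1, {"a": 1, "b": 0, "c": 1})
print(table_a)  # {'a': (0, None), 'b': (1, 'b'), 'c': (2, 'b')}

# Link state: node "a" computes the same answer from everyone's local view.
lsdb = {"a": {"b": 1, "c": 5}, "b": {"a": 1, "c": 1}, "c": {"a": 5, "b": 1}}
print(ls_spf(lsdb, "a"))  # {'a': 0, 'b': 1, 'c': 2}
```

Both arrive at the same distances here. The difference is in what is exchanged: processed results in one case, raw topology in the other, which is also why the link state database is the part that has to scale.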
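What makes a path vector protocol different from plain distance vector is that each route carries the list of autonomous systems it has passed through, which gives a simple loop check. A minimal Python sketch of that idea, with hypothetical function names and private AS numbers chosen for the example:

```python
def accept_route(local_as, as_path):
    """Reject any route whose AS path already contains our own AS number."""
    return local_as not in as_path


def advertise(local_as, as_path):
    """Prepend our AS number before passing the route to an external peer."""
    return [local_as] + list(as_path)


path_from_leaf = advertise(65011, [])            # leaf originates a prefix
path_via_spine = advertise(65001, path_from_leaf)
print(path_via_spine)                            # [65001, 65011]
print(accept_route(65012, path_via_spine))       # True: another leaf accepts it
print(accept_route(65011, path_via_spine))       # False: it would loop back
```

Because the path itself is the loop detector, the count-to-infinity problem of classic distance vector protocols does not arise in the same way.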
The 3rd wave of data center design
However, the problem that still existed was that the applications assumed they lived in the old first wave model. The applications assumed they were operating in a layer 2 network where they could talk to everyone in the neighborhood by broadcasting a “hello.”
However, with routing, a broadcast “hello” packet was suppressed and thrown away. An end station could not shout out and expect everyone to hear. Applications designed during the first wave still wanted to operate as if they were in the first wave of infrastructure even though they were in the second wave of data center design. They still wanted to shout a “hello” for discovery instead of using DNS service records for service discovery.
As a result, the need surfaced to find a way to marry the first wave applications with the second wave data center, which gave rise to Ethernet VPN. It is still structured around the notion of a fabric, but it is built using an overlay. The overlay gives the illusion that the first wave applications are sitting in a layer 2 network.
Network virtualization is the technology used to create the above-mentioned illusion and is carried out by layering, i.e., the creation of an overlay.
Network virtualization builds a network tunnel, which is much like a real-world tunnel. Two endpoints at either end of a real-world tunnel cannot communicate unless you somehow go around or pass through the tunnel.
The way you build a tunnel in a network is essentially by adding another layer of headers to an existing packet. A well-known example is multiprotocol label switching (MPLS), which adds a label layer to an IP packet. However, MPLS was complex in the data center, which raised the question: why not use a technology that is IP based? This gave rise to virtual extensible LAN (VXLAN).
So let us summarize: layer 2 networks have STP as a control plane, and layer 3 networks use routing protocols. The question that comes to mind is, what control plane can be used for VXLAN? Essentially, the control protocol for VXLAN needs to do two things. Firstly, it needs to tell you which endpoints are available through which tunnels, i.e., the mapping of a destination to a tunnel. Secondly, it needs to tell you where the tunnels are and how many you have. EVPN is the answer to that.
BGP already supports carrying MAC reachability information, not just IP addresses, which is what is needed to create the illusion. Within EVPN, BGP is the control plane protocol used to construct the virtual tunnels, enabling first wave applications to operate on the second wave networks.
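To show what “adding another layer of headers” looks like in practice, here is a Python sketch that builds and parses the 8-byte VXLAN header as laid out in RFC 7348: one flags byte with the “VNI valid” bit set, 24 reserved bits, a 24-bit VXLAN network identifier (VNI), and 8 more reserved bits. The function names are made up, and the sketch leaves out the outer Ethernet, IP and UDP headers that carry this payload between tunnel endpoints.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag in the first header byte


def vxlan_encap(vni, inner_frame):
    """Prefix an inner Ethernet frame with a VXLAN header."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(payload):
    """Split a VXLAN payload back into (vni, inner_frame)."""
    flags_word, vni_word = struct.unpack("!II", payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, payload[8:]


inner = b"\xaa\xbb\xcc\x00\x00\x02" + b"\xaa\xbb\xcc\x00\x00\x01" + b"payload"
packet = vxlan_encap(100, inner)
print(packet[:8].hex())               # 0800000000006400
print(vxlan_decap(packet) == (100, inner))  # True
```

The inner frame, MAC addresses included, travels untouched inside an ordinary IP packet, so the routed fabric in the middle never needs to know about it.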
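The two jobs of the control plane described above, mapping a destination to a tunnel and knowing which tunnels exist, can be pictured as a table that each tunnel endpoint fills in from the MAC reachability it hears over BGP. The Python sketch below is purely illustrative: the class and method names are hypothetical, and it models none of the actual BGP message handling.

```python
class TunnelTable:
    """What a tunnel endpoint might keep after hearing MAC advertisements."""

    def __init__(self):
        self.remote_macs = {}  # (vni, mac) -> IP address of the remote tunnel endpoint

    def on_mac_advertisement(self, vni, mac, endpoint_ip):
        """A remote endpoint announces that a MAC address lives behind it."""
        self.remote_macs[(vni, mac)] = endpoint_ip

    def on_mac_withdraw(self, vni, mac):
        """The MAC address is no longer reachable through that endpoint."""
        self.remote_macs.pop((vni, mac), None)

    def tunnel_for(self, vni, dst_mac):
        """Job 1: which tunnel leads to this destination? None if unknown."""
        return self.remote_macs.get((vni, dst_mac))

    def known_endpoints(self, vni):
        """Job 2: which tunnels exist for this virtual network?"""
        return sorted({ip for (v, _), ip in self.remote_macs.items() if v == vni})


table = TunnelTable()
table.on_mac_advertisement(100, "aa:bb:cc:00:00:01", "10.0.0.11")
table.on_mac_advertisement(100, "aa:bb:cc:00:00:02", "10.0.0.12")
print(table.tunnel_for(100, "aa:bb:cc:00:00:02"))  # 10.0.0.12
print(table.known_endpoints(100))                  # ['10.0.0.11', '10.0.0.12']
```

Because the table is populated by a routing protocol rather than by flooding, the overlay keeps the fail-closed behavior of the second wave while still looking like layer 2 to the applications.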
This article is published as part of the IDG Contributor Network.