Thursday, August 20, 2015

BGP in an Arista Data Center



The following is a practical analysis of the use of BGP in the data center (DC) on Arista platforms, based largely on Petr Lapukhov's work with BGP in hyperscale DCs.

Why Layer 3 (L3)?


There are several reasons to run an L3 routing protocol in the data center rather than a legacy layer 2 (L2) design: standards-based routing protocols avoid vendor lock-in, converge faster, minimize L2 fault domains, and allow for better traffic engineering.

Extension of L2


Naturally, a question that comes up in an L3 switch fabric is, “What if I need L2 adjacency between hosts?” For Arista, the extension of L2 services across an L3 switch fabric is provided by Virtual eXtensible LAN (VXLAN). While closely related, an in-depth discussion of the “network overlay” provided by VXLAN is outside the scope of this article.

Why BGP?


Some might question the use of BGP within the data center because it was designed as, and has historically been leveraged primarily as, an Exterior Gateway Protocol (EGP). However, BGP provides several benefits in a data center switch fabric, such as:
  • Less complexity in protocol design
  • Relies on TCP for transport rather than protocol-specific adjacency formation/maintenance and flow-control mechanisms
  • Less “chatty”
  • Supports third-party (recursively-resolved) next-hops
  • With proper ASN usage, built-in loop prevention via AS_PATH
  • Better support for traffic engineering

General Operations Within the Fabric


ASN Scheme Option 1 (Per-Rack Leaf AS)

Below is a diagram derived mostly from an IETF draft for very large data centers that is a joint effort by Facebook, Google, and Arista Networks. It is shown in its entirety to help visualize the reuse of ASNs at the leaf layer.



Notice that ASNs are reused at the leaf layer for each “cluster” or “pod”. This is achievable by configuring the leaf switches with allowas-in, and it minimizes the need for 4-byte ASNs, which not all vendors support. Loop prevention is provided at the spine layer by placing the spine switches of each cluster/pod within the same AS.
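For example, in the configuration excerpts at the end of this article, each leaf switch accepts routes carrying its own AS once in the AS_PATH via the spine-facing peer group:

router bgp 65000
neighbor ARISTA allowas-in 1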

Going forward, the focus will be on the spine/leaf elements.


ASN Scheme Option 2 (Leaf Layer AS)

Another option could be to use the same AS at every leaf:




Advantages of this design include fewer ASNs used, simpler configuration templatization, and, with that, better support for automation. A disadvantage is losing the ability to trace routes via AS_PATH (at least without additional configuration). These topics are discussed further in later sections.

Intra-AS Spine-to-Spine Reachability

Because the spine switches are in the same AS, direct spine-to-spine reachability within that AS is not possible without breaking the loop prevention inherent in the design. However, the impact of this is minimal and can generally be ignored.

Dynamic Neighbor Discovery


As a data center scales out, many more leaf switches may be added. To make life easier, it is desirable to automate and reduce configuration wherever possible. BGP dynamic neighbors, configured at the spine layer, provide a mechanism to dynamically accept new leaf switches as BGP peers based on an IP address range. To support this, care must be taken with IP addressing, as described in the “IP Addressing and Prefix Advertisement” section.

BGP Dynamic Neighbors with Per-Rack Leaf AS

Below is a diagram depicting the utilization of a separate AS at each rack:


In this configuration, a bgp listen router configuration command is added for each new rack with the IP address range accounting for the routed links going to each leaf switch within the rack. This reduces the amount of required configuration versus static configuration for each individual leaf.
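On SPINE1 in the Option 1 configuration excerpts at the end of this article, for example, one listen range is defined per rack's block of point-to-point links, each tied to that rack's AS:

router bgp 64600
bgp listen range 192.168.255.0/30 peer-group ARISTA remote-as 65000
bgp listen range 192.168.255.4/31 peer-group ARISTA remote-as 65001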

BGP Dynamic Neighbors with Leaf Layer AS

Below, the second option leverages the same AS at every leaf:




In this configuration, a single bgp listen router configuration command is all that is required at each spine. No further configuration is needed at the spine as additional leaf switches are added to the environment. This makes the configuration easy to templatize and automate across all spine switches.
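On SPINE1 in the Option 2 configuration excerpts at the end of this article, this collapses to a single listen range covering the entire point-to-point address block:

router bgp 64600
bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000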

Making it Easy to Automate


It is much easier to automate configuration when it can be applied to a container of devices rather than having to have specific, per-device configuration.

For example, Ansible is an automation platform whose playbooks are written in YAML, and it can be leveraged to push configuration to Arista switches. When configuration is simple and repeatable across as many devices as possible, it can more easily be compartmentalized into a template. Ansible uses “playbooks,” which are essentially lists of hosts and the tasks that should be performed on them. Below is an excerpt from a playbook showing a single template configuration that could be applied to all spine switches in a per-rack Leaf AS design (given a whopping two racks, but you get the idea):




In summary, the more configuration you can compartmentalize and template - and the fewer items that need to be defined as per-device variables while still meeting requirements, of course - the better.

Route Source Tracing via AS_PATH with Leaf Layer AS

As mentioned earlier, a disadvantage to using the same AS at every leaf is not having the ability to trace routes via AS_PATH:


To get around this, an ASN could be prepended using a route map on each leaf switch, for example:

BGPDC-LEAF1(config)#route-map BGP_PREPEND
BGPDC-LEAF1(config-route-map-BGP_PREPEND)#set as-path prepend 65101

BGPDC-LEAF2(config)#route-map BGP_PREPEND
BGPDC-LEAF2(config-route-map-BGP_PREPEND)#set as-path prepend 65102


Then on both leaf switches, apply the route-map to all advertised routes:

router bgp 65000
neighbor ARISTA route-map BGP_PREPEND out


Now from LEAF3’s perspective, it is easier to see where the routes were sourced from:



IP Addressing and Prefix Advertisement


One challenge with an L3 switch fabric in the data center is how to handle the point-to-point routed links between spine and leaf switches. Not advertising them can increase troubleshooting difficulty, but advertising them inefficiently can lead to a large number of /31 entries in the Routing Information Base (RIB) that may become resource-intensive.

As a hypothetical example, if an Arista 7050QX-32 switch were to be used as a spine switch, assuming no ports are used for uplinks to the core or edge, and only 40Gb links are used to connect leaf switches, you would have a total of 32 links with /31 addressing. This means they could be aggregated as follows:
  • 16 /30s
  • 8 /29s
  • 4 /28s
  • 2 /27s
  • 1 /26
A single /26 can represent all possible point-to-point connections that the particular spine switch could have. So if you started with 192.168.255.0 for your IP addressing on your point-to-point links, you could allocate contiguous blocks of IP addresses starting with 192.168.255.0/26 for the first spine switch, 192.168.255.64/26 for the second spine switch, and so on. Then on each spine switch, only advertise its loopback and the aggregate with the aggregate-address summary-only BGP router configuration command.
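As a sketch of how that might look on the hypothetical first spine switch above (the small-scale examples at the end of this article use a /28 aggregate instead of a /26):

router bgp 64600
network 192.168.254.1/32
aggregate-address 192.168.255.0/26 summary-only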

What this will ultimately look like from the leaf layer perspective is an optimized RIB containing the loopbacks of all switches in the fabric, and a single additional entry per spine switch to reach point-to-point IP addresses for troubleshooting purposes.

To help visualize, consider a scenario where two spine switches are connected to just three leaf switches. The two spine switches have been configured to advertise every leaf connection, like so:

BGPDC-SPINE1(config-router-bgp)#sh active
router bgp 64600
router-id 192.168.254.1
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000
neighbor ARISTA peer-group
neighbor ARISTA fall-over bfd
neighbor ARISTA password 7 6x5GIQqJNWigZDc2QCgeMg==
neighbor ARISTA maximum-routes 12000
network 192.168.254.1/32
network 192.168.255.0/31
network 192.168.255.2/31
network 192.168.255.4/31


Looking at the routing table of LEAF1, there are a number of /31 entries learned via BGP:


As you can imagine, in larger environments the number of entries could become excessive. Instead, the spine switches can be configured with the aggregate-address BGP router configuration command:

BGPDC-SPINE1(config-router-bgp)#sh active
router bgp 64600
router-id 192.168.254.1
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000
neighbor ARISTA peer-group
neighbor ARISTA fall-over bfd
neighbor ARISTA password 7 6x5GIQqJNWigZDc2QCgeMg==
neighbor ARISTA maximum-routes 12000
network 192.168.254.1/32
aggregate-address 192.168.255.0/28 summary-only



The same output from LEAF1 now shows no /31 entries learned from BGP:


Instead, there is a single /28 entry from each spine learned via BGP:



BGP at the Spine Layer


The Need for Fast Failure Detection

In a BGP switch fabric it is critical for failures to be detected as quickly as possible at the spine layer. Consider the following hypothetical scenario:



NODE1 is communicating with NODE2. LEAF1 and LEAF2 are MLAG peers. In an ECMP-routed switch fabric, asymmetric routing can occur, where NODE1’s traffic flow is taking the path through LEAF2, while the return traffic is taking the path through LEAF1. This is fine for now, but let’s consider what happens when LEAF1 is rebooted:




With the default behavior of MLAG, when LEAF1 finishes rebooting and re-establishes the MLAG session with LEAF2, LEAF1 will place all non-peer-link interfaces into the err-disable state for 300 seconds (five minutes) in order to allow the topology to reconverge. So, including the reboot, you are looking at roughly 7 minutes before the switch begins forwarding again. Even though LEAF2 provides a redundant gateway, the reboot of LEAF1 would actually create a problem when combined with default spine configuration in this particular situation of asymmetric routing. Default BGP keepalive and hold timers are 60 seconds and 180 seconds, respectively, so for up to 180 seconds (3 minutes) the spine switches will still consider LEAF1 an established BGP neighbor and still consider the path to LEAF1 a viable route, even though it clearly is not. Any flow that gets hashed to this path is effectively blackholed. BGP timers could be tuned to provide some relief, but failover time may still be considered unacceptable, and aggressive timers introduce the possibility of network instability from BGP peer flapping.

By default, Arista switches will immediately tear down an eBGP neighbor adjacency when the interconnecting link goes down, which alleviates the issue described above. To further improve failure detection, Bidirectional Forwarding Detection (BFD) should also be leveraged. BFD is a very simple, low-overhead hello protocol that works in conjunction with BGP to provide sub-second failure detection. BFD adds protection in “gray” failure conditions, such as a degraded fiber strand: the interface status may remain up, and therefore not trigger the eBGP neighbor tear-down, while traffic passes intermittently or not at all. In those cases, network performance suffers, and the BGP keepalive/hold-down mechanism by itself may not be of much help.
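In the configuration excerpts at the end of this article, BFD is enabled for all fabric peers through the peer group, for example:

router bgp 64600
neighbor ARISTA fall-over bfd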

BGP at the Leaf Layer


MLAG - To Peer or not to Peer

When running a pair of leaf switches in an MLAG configuration with VARP, you want the two leaf switches to also peer with each other via BGP. Although unlikely to actually happen, consider the following failure scenario:




NODE1 is communicating with NODE2. LEAF1 and LEAF2 are MLAG peers and are configured to run VARP to provide an active/active redundant first hop gateway for NODE1, but they are not BGP peers. On LEAF1, both of its uplinks have failed. While both SPINE1 and SPINE2 would certainly notice this failure and reconverge, some things will not be affected, such as:
  • MLAG running between LEAF1 and LEAF2 would not notice any change, and continue functioning as normal, which in turn means that the port-channel between NODE1 and LEAF1/LEAF2 would remain up and function as normal
  • VARP would continue to function as normal
This is important because traffic leaving NODE1 that gets hashed to LEAF2 would be fine, but any traffic hashed to LEAF1 would effectively be blackholed:




Peering LEAF1 and LEAF2 alleviates this. Traffic hashed to LEAF1 would follow the only remaining route pointing to LEAF2, and then be ECMP-routed to the spine layer:




Additionally, when this failure scenario does not exist, there is no concern about a suboptimal path: without the failure described above, the path through the peering leaf switch is longer than the paths directly to the spine layer and will not be preferred:


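In the configuration excerpts at the end of this article, the leaf-to-leaf BGP peering is formed over a dedicated SVI (VLAN 4093) carried across the MLAG peer link. On LEAF1, for example:

interface Vlan4093
ip address 1.1.1.2/31
!
router bgp 65000
neighbor 1.1.1.3 peer-group ARISTA
neighbor 1.1.1.3 remote-as 65000
neighbor 1.1.1.3 next-hop-self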
Optimizations


update wait-for-convergence

The update wait-for-convergence BGP router configuration command prevents BGP from programming routes into hardware and from advertising routes until a convergence event is resolved. A convergence event can occur when initializing BGP for the first time, clearing the BGP process, terminating the Rib agent, and so on. The benefit is reduced CPU churn from repeatedly programming routes into and out of hardware during a convergence event.

Best practice is to only use this on spine switches. Enabling this on leaf switches (especially leaf switches acting as host gateways) can result in traffic being blackholed until BGP finishes installing routes in hardware.

NOTE: Enabling this feature in vEOS will cause routes not to be advertised.

update wait-install

The update wait-install BGP router configuration command prevents routes from being advertised to peers until they are programmed in hardware and the hardware sends an “Ack” back to software. This prevents traffic from being dropped in situations where routes have been advertised but are not yet programmed in hardware.

NOTE: Enabling this feature in vEOS will cause routes not to be advertised.
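Both commands appear together on the spine switches in the configuration excerpts at the end of this article, for example:

router bgp 64600
update wait-for-convergence
update wait-install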

ip hardware fib next-hop update event bfd

The ip hardware fib next-hop update event bfd global configuration command allows a BFD-detected failure to immediately remove the affected interface from the hardware next-hop entries of routes using it. The benefit is reduced traffic loss compared to the normal operation of BFD, which is to bring down the BGP neighborship, recompute the next hop for all routes using the failed interface, and then reprogram hardware. This is configured by default whenever BFD is enabled.

ip hardware fib route unprogrammed parent-drop


The ip hardware fib route unprogrammed parent-drop global configuration command makes it so that if a more specific route cannot be programmed in hardware for whatever reason, its parent route will be programmed to point to a drop index (similar to null0). This is default behavior.

NOTE: This is not really an optimization; it is more of a feature. The “benefit” is that traffic is black-holed rather than forwarded on to possibly undesirable destinations. An unfortunate side effect is that if there is no other parent route, the default route will be pointed to a drop index instead. In summary, if there is no need to prevent traffic from being routed via parent routes when a more specific route cannot be programmed, it is recommended to disable this feature with the “no” form of the same command.
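The configuration excerpts below do exactly that on every switch:

no ip hardware fib route unprogrammed parent-drop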

Sources



Configuration Excerpt Examples


Configuration examples are from a small-scale switch fabric created using virtual Extensible Operating System (vEOS) version 4.15.0F.

Per-Rack Leaf AS (Option 1)


SPINE1

interface Ethernet1
no switchport
ip address 192.168.255.0/31
!
interface Ethernet2
no switchport
ip address 192.168.255.2/31
!
interface Ethernet3
no switchport
ip address 192.168.255.4/31
!
interface Loopback0
ip address 192.168.254.1/32
!
ip routing
no ip hardware fib route unprogrammed parent-drop
!
router bgp 64600
router-id 192.168.254.1
update wait-for-convergence
update wait-install
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
bgp listen range 192.168.255.0/30 peer-group ARISTA remote-as 65000
bgp listen range 192.168.255.4/31 peer-group ARISTA remote-as 65001
neighbor ARISTA peer-group
neighbor ARISTA fall-over bfd
neighbor ARISTA password p4ssw0rd
network 192.168.254.1/32
aggregate-address 192.168.255.0/28 summary-only



LEAF1

vlan 10
name HOST_10
!
vlan 4093
name BGP_PEER
trunk group BGP_PEER
!
vlan 4094
name MLAG
trunk group MLAG
!
interface Port-Channel1
switchport mode trunk
switchport trunk group BGP_PEER
switchport trunk group MLAG
!
interface Ethernet5
channel-group 1 mode active
!
interface Ethernet6
no switchport
ip address 192.168.255.1/31
!
interface Ethernet7
no switchport
ip address 192.168.255.17/31
!
interface Loopback0
ip address 192.168.254.3/32
!
interface Vlan10
ip address 192.168.10.2/24
ip virtual-router address 192.168.10.1/24
!
interface Vlan4093
ip address 1.1.1.2/31
!
interface Vlan4094
ip address 1.1.1.0/31
!
ip routing
no ip hardware fib route unprogrammed parent-drop
!
!
route-map BGP_PREPEND permit 10
set as-path prepend 65101
!
router bgp 65000
router-id 192.168.254.3
update wait-install
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
neighbor ARISTA peer-group
neighbor ARISTA remote-as 64600
neighbor ARISTA allowas-in 1
neighbor ARISTA fall-over bfd
neighbor ARISTA route-map BGP_PREPEND out
neighbor ARISTA password p4ssw0rd
neighbor 1.1.1.3 peer-group ARISTA
neighbor 1.1.1.3 remote-as 65000
neighbor 1.1.1.3 next-hop-self
neighbor 192.168.255.0 peer-group ARISTA
neighbor 192.168.255.16 peer-group ARISTA
network 192.168.10.0/24
network 192.168.254.3/32



Leaf Layer AS (Option 2)


SPINE1


interface Ethernet1
no switchport
ip address 192.168.255.0/31
!
interface Ethernet2
no switchport
ip address 192.168.255.2/31
!
interface Ethernet3
no switchport
ip address 192.168.255.4/31
!
interface Loopback0
ip address 192.168.254.1/32
!
ip routing
no ip hardware fib route unprogrammed parent-drop
!
router bgp 64600
router-id 192.168.254.1
update wait-for-convergence
update wait-install
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000
neighbor ARISTA peer-group
neighbor ARISTA fall-over bfd
neighbor ARISTA password p4ssw0rd
neighbor ARISTA maximum-routes 12000
network 192.168.254.1/32
aggregate-address 192.168.255.0/28 summary-only



LEAF1

vlan 10
name HOST_10
!
vlan 4093
name BGP_PEER
trunk group BGP_PEER
!
interface Port-Channel1
switchport mode trunk
switchport trunk group BGP_PEER
switchport trunk group MLAG
!
interface Ethernet5
channel-group 1 mode active
!
interface Ethernet6
no switchport
ip address 192.168.255.1/31
!
interface Ethernet7
no switchport
ip address 192.168.255.17/31
!
interface Loopback0
ip address 192.168.254.3/32
!
interface Vlan10
ip address 192.168.10.2/24
ip virtual-router address 192.168.10.1
!
interface Vlan4093
ip address 1.1.1.2/31
!
ip routing
no ip hardware fib route unprogrammed parent-drop
!
route-map BGP_PREPEND permit 10
set as-path prepend 65101
!
router bgp 65000
router-id 192.168.254.3
update wait-install
bgp log-neighbor-changes
distance bgp 20 200 200
maximum-paths 32 ecmp 32
neighbor ARISTA peer-group
neighbor ARISTA remote-as 64600
neighbor ARISTA allowas-in 1
neighbor ARISTA fall-over bfd
neighbor ARISTA route-map BGP_PREPEND out
neighbor ARISTA password p4ssw0rd
neighbor 1.1.1.3 peer-group ARISTA
neighbor 1.1.1.3 remote-as 65000
neighbor 1.1.1.3 next-hop-self
neighbor 192.168.255.0 peer-group ARISTA
neighbor 192.168.255.16 peer-group ARISTA
network 192.168.10.0/24
network 192.168.254.3/32