An ACI Multi-Site deployment uses dedicated overlay control-plane and data-plane functions to connect endpoints that are deployed across different sites.
MP-BGP EVPN is used as the control plane between spine switches to exchange host information for endpoints discovered in the separate fabrics, allowing east-west communication between those endpoints.
Once that endpoint information is exchanged, the VXLAN data plane is used for intersite communication.
Let's look at how the control plane and data plane are handled in detail.
Multi-Site Overlay Control Plane
As we know so far, for endpoints in different sites to communicate with each other, their endpoint information (EID) needs to be shared between sites, so ACI Multi-Site uses MP-BGP EVPN between the spine switches across sites.
For that endpoint information to be shared with other sites, Cisco NDO needs to indicate which EPGs to stretch across sites, or whether the endpoint belongs to a non-stretched EPG with a contract that allows communication with an EPG in a different site. So there are two scenarios here (illustrated in the sketch after this list):
1- If the endpoint has an IP address, it is shared across sites via MP-BGP EVPN.
2- If the endpoint has no IP address, it is shared only when Layer 2 stretch is enabled on the NDO.
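As a rough illustration of these two scenarios, here is a minimal Python sketch of the decision. The field names (ip, l2_stretch_enabled, has_cross_site_contract) are made up for the example and are not APIC or NDO object names; the rule for IP endpoints assumes an intersite policy (stretch or contract) exists, as described above.

```python
# Minimal sketch: when is a locally learned endpoint advertised to remote
# sites via an EVPN Type-2 route? Field names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Endpoint:
    mac: str
    ip: Optional[str]               # None for a MAC-only endpoint
    l2_stretch_enabled: bool        # EPG/BD stretched across sites in NDO
    has_cross_site_contract: bool   # contract with an EPG in another site

def advertise_across_sites(ep: Endpoint) -> bool:
    """Return True if the endpoint information should be exchanged via MP-BGP EVPN."""
    if ep.ip is not None:
        # An endpoint with an IP address is shared once an intersite policy
        # (a stretched EPG or a cross-site contract) makes it relevant.
        return ep.l2_stretch_enabled or ep.has_cross_site_contract
    # A MAC-only endpoint is shared only when Layer 2 stretch is enabled.
    return ep.l2_stretch_enabled

print(advertise_across_sites(Endpoint("00:50:56:aa:bb:01", "10.1.1.10", False, True)))  # True
print(advertise_across_sites(Endpoint("00:50:56:aa:bb:02", None, False, True)))         # False
```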
Before going to the overlay data plane, let's understand how MP-BGP EVPN is used to share endpoint information across sites:
EP1 and EP2 are connected to Site 1 and Site 2, respectively.
These endpoints are locally learned on their leaf nodes, and a COOP control-plane message is generated for each endpoint at each site and sent to the spine nodes.
The spine nodes at Site 1 learn the locally connected EP1 behind its leaf node, and the same happens at Site 2. Up to this point, no information is exchanged between sites because there is no policy in place yet indicating a need for those endpoints to communicate.
An intersite policy is defined in Cisco Nexus Dashboard Orchestrator and is then pushed and rendered in the two sites.
Once the intersite policy is configured, EVPN Type-2 updates are triggered across sites to exchange information about EP1 and EP2. The endpoint information is associated with the O-UTEP address that identifies the site where each endpoint was discovered.
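To visualize what such an update carries, here is a simplified Python model of an EVPN Type-2 (MAC/IP) entry whose next hop is the originating site's O-UTEP. The field names, addresses, and IDs are illustrative and do not reflect the real BGP NLRI encoding.

```python
# Simplified model of an EVPN Type-2 (MAC/IP) advertisement as used for
# intersite endpoint exchange. Fields are illustrative, not the actual
# BGP NLRI/attribute encoding.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvpnType2Route:
    mac: str
    ip: Optional[str]   # MAC-only endpoints can be advertised without an IP
    vnid: int           # VNID the endpoint belongs to
    next_hop: str       # anycast O-UTEP address of the originating site

# Site 1 spines advertise EP1 with Site 1's O-UTEP as the next hop
# (all addresses and IDs below are made up for the example).
ep1_route = EvpnType2Route(
    mac="00:50:56:aa:bb:01",
    ip="10.1.1.10",
    vnid=16777200,
    next_hop="172.16.1.1",   # O-UTEP A (Site 1)
)

# A Site 2 spine receiving this route installs EP1 in its COOP database,
# pointing at the remote site's O-UTEP rather than at a local leaf TEP.
print(f"EP1 reachable via remote O-UTEP {ep1_route.next_hop}")
```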
Multi-Site Overlay Data Plane
After endpoint information is exchanged across sites, the VXLAN data plane is used to allow intersite Layer 2 and Layer 3 communication. Let's understand how ACI Multi-Site handles the different traffic scenarios (BUM and unicast traffic).
BUM Traffic between sites:
The deployment of VXLAN allows the use of a logical abstraction so that endpoints separated by multiple Layer 3 hops can communicate as if they were part of the same logical Layer 2 domain.
ACI Multi-Site uses ingress replication for BUM traffic on the source VXLAN TEP (VTEP) devices, which create multiple unicast copies of each BUM frame, one for each remote site where endpoints belonging to the same Layer 2 domain are connected. Once a BUM frame, encapsulated toward the unicast O-MTEP address, reaches the destination site, that site replicates the frame and floods it locally; this is the head-end replication model.
BUM traffic is forwarded to other sites only when "Intersite BUM Traffic Allow" is enabled on the bridge domain, which lets the flooded traffic reach the other sites as well.
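A minimal sketch of this head-end replication idea, assuming made-up O-UTEP/O-MTEP addresses and a simplified packet structure, could look like this:

```python
# Minimal sketch of head-end (ingress) replication for intersite BUM
# traffic: one unicast VXLAN copy of the BUM frame is made per remote site.
from dataclasses import dataclass

@dataclass
class VxlanPacket:
    src_ip: str     # local anycast O-UTEP of the source site
    dst_ip: str     # remote site's O-MTEP (used only for BUM traffic)
    vnid: int       # bridge-domain VNID
    payload: bytes  # original Layer 2 BUM frame

def replicate_bum(frame: bytes, bd_vnid: int, local_o_utep: str,
                  remote_o_mteps: list[str]) -> list[VxlanPacket]:
    """Create one unicast VXLAN copy per remote site where the BD is stretched."""
    return [VxlanPacket(local_o_utep, o_mtep, bd_vnid, frame)
            for o_mtep in remote_o_mteps]

copies = replicate_bum(frame=b"<broadcast frame>", bd_vnid=16777200,
                       local_o_utep="172.16.1.1",                # O-UTEP A
                       remote_o_mteps=["172.16.2.2", "172.16.3.2"])
for pkt in copies:
    print(f"BUM copy -> {pkt.dst_ip} (VNID {pkt.vnid})")
```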
There are three different types of Layer 2 BUM traffic, each handled as described below (and summarized in the sketch after this list):
Layer 2 Broadcast frames (B): Frames are always forwarded across sites when "Intersite BUM Traffic Allow" is enabled for the bridge domain.
Layer 2 unknown unicast frames (U): These frames are flooded only when "Layer 2 Unknown Unicast" is set to Flood in the bridge domain; this BD setting applies regardless of Multi-Site.
Layer 2 Multicast frames (M): The traffic is forwarded across sites, whether it is Layer 2 or Layer 3 multicast, when the bridge domain is stretched across sites with "Intersite BUM Traffic Allow" enabled.
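The decision logic in the three bullets above can be summarized in a small sketch; the flag names are simplified stand-ins for the bridge-domain settings, not actual APIC or NDO properties.

```python
# Minimal sketch of the per-frame-type decision for forwarding BUM traffic
# across sites, based on the bridge-domain settings described above.
def forward_bum_across_sites(frame_type: str,
                             intersite_bum_allow: bool,
                             l2_unknown_unicast_flood: bool) -> bool:
    """frame_type is one of 'broadcast', 'unknown_unicast', 'multicast'."""
    if not intersite_bum_allow:
        # Without "Intersite BUM Traffic Allow", no BUM leaves the site.
        return False
    if frame_type == "unknown_unicast":
        # Unknown unicast also requires the BD to be in flood mode.
        return l2_unknown_unicast_flood
    # Broadcast and (L2/L3) multicast follow the intersite BUM setting.
    return frame_type in ("broadcast", "multicast")

print(forward_bum_across_sites("unknown_unicast", True, False))  # False
print(forward_bum_across_sites("broadcast", True, False))        # True
```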
In the example below, let's check how Layer 2 BUM traffic is handled:
EP1 belongs to a bridge domain and generates a Layer 2 BUM frame.
The frame is VXLAN-encapsulated and sent to the specific multicast group (Group IP address outer [GIPo]) associated with the bridge domain within the fabric, along one of the specific multidestination trees associated with that GIPo, so it can reach all the other leaf and spine nodes in the same site.
One of the spine nodes connected to the ISN is elected as the designated forwarder for that specific bridge domain (this election is held between the spine nodes using IS-IS protocol exchanges). The designated forwarder is responsible for replicating each BUM frame for that bridge domain to all the remote sites with the same stretched bridge domain.
The designated forwarder makes a copy of the BUM frame and sends it to the remote sites. The destination IP address used when the packet is encapsulated with VXLAN is the special IP address (O-MTEP) identifying each remote site and is used specifically for the transmission of BUM traffic across sites. The source IP address for the VXLAN-encapsulated packet is instead the anycast O-UTEP address deployed on all the local spine nodes connected to the ISN.
One of the remote spine nodes receives the packet, translates the VNID value contained in the header to the locally significant VNID value associated with the same bridge domain, and sends the traffic within the site along one of the local multidestination trees for the bridge domain.
The traffic is forwarded within the site and reaches all the spine and leaf nodes with endpoints that are actively connected to the specific bridge domain.
The receiving leaf nodes use the information that is contained in the VXLAN header to learn the site location for endpoint EP1 that sourced the BUM frame. They also send the BUM frame to all (or some of) the local interfaces that are associated with the bridge domain, so that endpoint EP2 (in this example) can receive it.
Depending on the number of configured bridge domains, the same GIPo address may be associated with different bridge domains. Thus, when flooding for one of those bridge domains is enabled across sites, BUM traffic for the other bridge domains using the same GIPo address is also sent across the sites and is then dropped on the receiving spine nodes. This behavior can increase bandwidth utilization in the intersite network.
To avoid wasting ISN bandwidth on this unnecessary traffic, when a bridge domain is configured as stretched with "Intersite BUM Traffic Allow" enabled from Cisco NDO, a GIPo address is by default assigned from a separate range of multicast addresses. This is reflected in the user interface by the "Optimize WAN Bandwidth" flag, which is enabled by default for bridge domains created by NDO.
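Conceptually, this is just a matter of drawing the GIPo from a dedicated pool for stretched bridge domains, as in the sketch below; the multicast ranges shown are placeholders, not the actual ACI GIPo ranges.

```python
# Minimal sketch of the idea behind "Optimize WAN Bandwidth": stretched
# bridge domains with intersite BUM enabled draw their GIPo from a separate
# multicast range, so they never share a GIPo with site-local BDs.
# The ranges below are placeholders, not the real ACI GIPo pools.
import itertools

LOCAL_GIPO_POOL = (f"225.0.{i // 256}.{i % 256}" for i in itertools.count())
STRETCHED_GIPO_POOL = (f"225.1.{i // 256}.{i % 256}" for i in itertools.count())

def allocate_gipo(stretched_with_intersite_bum: bool) -> str:
    """Pick the GIPo from the pool matching the BD's intersite behavior."""
    pool = STRETCHED_GIPO_POOL if stretched_with_intersite_bum else LOCAL_GIPO_POOL
    return next(pool)

print(allocate_gipo(True))   # e.g. 225.1.0.0 -> only stretched BDs use this range
print(allocate_gipo(False))  # e.g. 225.0.0.0 -> site-local BDs
```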
In the example below, we can verify that the policy configured by NDO is reflected on ACI.
Unicast Traffic between sites:
For endpoints on the same subnet to communicate with each other, they need to exchange ARP information, and in ACI the ARP handling depends on the ARP Flooding setting in the bridge domain.
There are two different scenarios to consider:
ARP flooding is enabled in the bridge domain: When ARP flooding is enabled, the Intersite BUM Traffic Allow in the same bridge domain needs to be enabled as well because the ARP request is handled as normal broadcast traffic and is flooded.
ARP flooding is disabled in the bridge domain: When ARP flooding is disabled, the ARP request is handled as a routed unicast packet.
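Both scenarios can be summarized in a minimal sketch of the leaf's decision, assuming simplified flag names that do not correspond to actual switch or APIC object names:

```python
# Illustrative decision only; not actual leaf/APIC logic.
def handle_arp_request(arp_flooding_enabled: bool,
                       intersite_bum_allow: bool,
                       target_ip_known_on_leaf: bool) -> str:
    if arp_flooding_enabled:
        # ARP is treated as ordinary broadcast traffic; it crosses sites
        # only if "Intersite BUM Traffic Allow" is also enabled on the BD.
        return ("flood locally and to remote sites"
                if intersite_bum_allow else "flood locally only")
    # ARP flooding disabled: the request is handled as a routed unicast packet.
    if target_ip_known_on_leaf:
        return "forward as unicast toward the leaf/site where the target lives"
    return "send toward the spine proxy anycast TEP for a COOP lookup"

print(handle_arp_request(arp_flooding_enabled=False,
                         intersite_bum_allow=False,
                         target_ip_known_on_leaf=False))
```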
1. EP1 generates an ARP request for the EP2 IP address.
2. Since ARP flooding is disabled, the local leaf node inspects the ARP payload and checks the target IP address, which is EP2's. Assuming that EP2's IP information is initially unknown on the local leaf, the ARP request is encapsulated and sent toward the Proxy A anycast TEP address defined on all the local spine nodes to perform a lookup in the COOP database.
3. One of the local spine nodes receives the ARP request from the local leaf node.
4. Because the EP2 IP address is known in the COOP database (information received via the MP-BGP EVPN control plane from the remote spine nodes), the spine node encapsulates the ARP request and sends it across sites toward the remote site. The capability of forwarding ARP requests across sites in "unicast mode" depends mainly on this knowledge of the remote endpoint's IP address.
5. The VXLAN frame is received by one of the remote spine nodes, which translates the original VNID and class ID values to locally significant ones, re-encapsulates the ARP request, and sends it toward the local leaf node to which EP2 is connected.
6. The leaf node receives the frame, de-encapsulates it, and learns the class ID and site location information for remote endpoint EP1.
7. The frame is then forwarded to the local interface to which EP2 is connected, assuming that ARP flooding is disabled on the bridge domain in Site 2 as well.
8. EP2 sends an ARP reply to EP1.
9. The local leaf node encapsulates the traffic toward the remote O-UTEP A address.
10. The spine nodes also rewrite the source IP address of the VXLAN-encapsulated packet with the local O-UTEP B address identifying Site 2.
11. The VXLAN frame is received by the spine node, which translates the original VNID and class ID values of Site 2 to locally significant ones (Site 1) and sends it toward the local leaf node to which EP1 is connected.
12. The leaf node receives the frame, de-encapsulates it, and learns the class ID and site location information for remote endpoint EP2.
13. The frame is then sent to the local interface on the leaf node and reaches EP1.
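A minimal sketch of what the spine nodes do in steps 10 and 11 (source-IP rewrite to the local O-UTEP, then VNID and class-ID translation at the receiving site) is shown below; the translation tables and addresses are invented for the example.

```python
# Minimal sketch of the spine-node handling of an intersite VXLAN packet on
# the ARP-reply path: rewrite the outer source IP to the local O-UTEP and,
# on the receiving side, translate VNID/class ID to locally significant values.
from dataclasses import dataclass

@dataclass
class VxlanHeader:
    src_ip: str
    dst_ip: str
    vnid: int
    class_id: int   # source EPG identifier carried in the VXLAN header

# Example translation tables programmed on Site 1 spines for Site 2 values.
VNID_XLATE = {16777201: 16777301}   # Site 2 BD VNID -> Site 1 BD VNID
CLASS_XLATE = {49154: 32771}        # Site 2 EPG class ID -> Site 1 class ID

def site2_spine_egress(hdr: VxlanHeader, o_utep_b: str) -> VxlanHeader:
    # Step 10: source IP rewritten to the local O-UTEP B identifying Site 2.
    return VxlanHeader(o_utep_b, hdr.dst_ip, hdr.vnid, hdr.class_id)

def site1_spine_ingress(hdr: VxlanHeader) -> VxlanHeader:
    # Step 11: translate VNID and class ID to Site 1's locally significant values.
    return VxlanHeader(hdr.src_ip, hdr.dst_ip,
                       VNID_XLATE[hdr.vnid], CLASS_XLATE[hdr.class_id])

reply = VxlanHeader("10.2.0.33", "172.16.1.1", 16777201, 49154)  # leaf -> O-UTEP A
reply = site2_spine_egress(reply, o_utep_b="172.16.2.1")
reply = site1_spine_ingress(reply)
print(reply)   # VNID/class ID now carry Site 1 values; source is O-UTEP B
```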
Let's recap what we have covered in this topic:
How EP information is exchanged between different sites in different scenarios (EP with an IP address and EP without an IP address).
How the overlay control plane works.
How the overlay data plane works.
How the different types of data plane traffic are exchanged between sites.