BGP Prefix Independent Convergence
Convergence is a very wide topic: every hardware component and protocol involved in the end-to-end service, including customer CPEs and their configuration, must be tuned together to achieve fast convergence. In today's networks, BGP is the protocol that carries service prefixes through the operator's network. In a standard deployment, a BGP speaker holds no backup route at all by default, so achieving fast convergence with BGP is quite challenging. It is clear that to achieve sub-second convergence, routing nodes must calculate backup routes in advance and program them into the forwarding plane alongside the primary route. Pre-computing the backup route and programming it into the forwarding table is the key requirement in each protocol involved. BGP PIC is the solution designed to meet this requirement whenever BGP is the protocol involved.
BGP was designed as a slow protocol from a route-update point of view. It can take seconds or even minutes to converge: the bigger the routing table, the longer the convergence time. To fully converge, BGP goes through the stages below, which take seconds to minutes depending on the number of affected prefixes.
- Failure detection – BGP learns about the failure either through the IGP or through BFD.
- Route withdrawal – BGP withdraws the routes from the RIB; the RIB then withdraws them from the FIB.
- Peer update – BGP sends route-withdrawal messages to all peers.
- Route calculation – BGP calculates the new best path for the affected prefixes.
- RIB/FIB update – BGP installs the new best path for the affected prefixes into the RIB, and the RIB installs it into the FIB.
If a backup route were calculated in advance, traffic could be shifted to it immediately after the detection step, because both the best route and the backup would already be programmed in the forwarding table. The remaining convergence steps would then continue in parallel in the background.
Before going into the details of the BGP PIC/FRR design model, we need to understand a few of the challenges associated with BGP and the possible solutions to overcome them.

Figure 1- IPv4 BGP Prefix Global BGP table


Figure 2- VPNv4/IPv4 BGP Customer Prefix in Global or VRF table
The drawings above present two use cases:
- The first is where the same prefixes are advertised by different peering partners, by different data centres, or by different sites of the same customer.
- The second is where a customer has opted for a resilient service and advertises the same prefixes into the operator network to achieve resiliency. The service can be a VPN with VRFs on the operator side, or a regular Internet service in the IPv4 global table.
- R1 and R2 receive the 10.0.0.0/16 NLRI from peering partners and advertise it to router R3. R3 can be any BGP speaker (edge, core, or RR); in this example it is an RR (Route Reflector), but it could be any BGP-speaking router.
- R3 receives the same 10.0.0.0/16 NLRI with two IGP next hops (R1 and R2), which we call two paths for the same NLRI.
- R3 runs the BGP decision algorithm and selects one path as best, which may be via either R1 or R2 depending on the attributes associated with the NLRI.
- BGP always advertises the selected best path of an NLRI to all peers except the one from which the best path was learnt.
- If R1 is the selected best path, R3 advertises the best route to R2 and R4.
- If the best selected path fails, R3 must re-run the BGP decision algorithm to select a new best path, and here is the problem: this may take seconds or even minutes, depending entirely on the size of the BGP table.
- By default, BGP neither calculates backup paths nor advertises more than one path to other BGP speakers. Both limitations are serious from a convergence perspective: every node needs at least two paths, so that one can be primary and the other can act as backup.
- Problem No. 1 – no backup path advertisement.
- Problem No. 2 – even if a backup path is advertised, there is no default option to install it on the receiving BGP nodes.
In our example, R3 should advertise both primary and backup paths, and other nodes such as R4 should install both primary and backup in the forwarding table. We need a solution that fixes both problems.
Let us first discuss the mechanisms that address these two problem statements.
This capability was developed for BGP in RFC 7911. In Cisco software, the BGP “add-path” knobs under the BGP process are designed to fix both issues discussed above. The knobs can be applied globally under a particular address family or per neighbor; if applied under the address family, they apply to all BGP neighbors defined under it. A hedged configuration sketch follows the list below.
- BGP add-path “advertise” – when configured, the BGP node advertises the second-best path to iBGP peers. This is used for backup path advertisement.
- BGP add-path “install” – when applied, the backup path is installed in the RIB/FIB/CEF. This knob is used for backup path installation.
- “Advertise best-external” – this knob is mainly applied on edge nodes where multi-homed CPEs are connected. This single knob serves both purposes: advertising the best eBGP route as a backup and installing that route in the forwarding plane. If it is applied, the add-path knobs can be skipped. IOS XE does not allow both to be configured together; the IOS XR CLI, however, accepts both configurations.
- Different RD values per VPN site – applicable only to VPNv4, not IPv4. The customer's resilient connections must terminate on edge routers with different RDs, so that the routes advertised by the edge nodes from their VRFs to the RR carry different RDs. This is a partial rather than a complete solution: with different RDs, add-path “advertise” can be avoided on the edge nodes, but add-path “install” is still a must. Essentially, a different RD creates a different NLRI from the VPN perspective, so the route is advertised to all peers.
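As a concrete illustration, here is a minimal IOS XE style configuration sketch of these knobs. The AS number, neighbor address, and VRF name are assumptions for illustration, and exact syntax and availability vary by platform and release, so verify against your software.

router bgp 65000
 address-family ipv4
  bgp additional-paths select best 2                 ! compute a 2nd-best path
  bgp additional-paths send receive                  ! negotiate the add-path capability
  bgp additional-paths install                       ! program the backup into RIB/FIB/CEF
  neighbor 192.0.2.3 advertise additional-paths best 2
 exit-address-family
!
! Alternative on a multi-homed edge node: best-external instead of add-path
router bgp 65000
 address-family ipv4
  bgp advertise-best-external
!
! VPNv4 only: unique RD per edge node for the same VRF
vrf definition CUST-A
 rd 65000:101        ! use a different RD, e.g. 65000:102, on the other edge node

Note that the address-family level “select best 2” computes the second-best path, while the neighbor-level “advertise additional-paths” statement controls which peers actually receive it.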
To understand BGP PIC, we first need to understand the default behaviour of BGP route calculation and updates, and the two critical BGP timers on which BGP convergence relies.
BGP route calculation – BGP selects best routes by running the BGP best-path algorithm, and any change in attributes causes the algorithm to re-run and select a new best route. The selected best route must be communicated to all BGP peers via a BGP UPDATE message, which carries routes to be withdrawn and/or new NLRI to be added. A BGP route withdrawal can be implicit or explicit.
Explicit BGP route withdrawal – if previously advertised BGP prefixes are no longer valid or have vanished from the network, the BGP speaker must send a BGP UPDATE packet listing those prefixes in the withdrawn-routes field.
Implicit BGP route withdrawal – if a BGP speaker re-advertises a previously advertised NLRI with changed attributes (for example an IGP metric change or any path-attribute change), this is an implicit withdrawal: the receiving node replaces the old path, but the prefix has not vanished from the network.
R#sh bgp ipv4 unicast neighbors <Neigh IP> | include Withdraw
Implicit Withdraw: 17 0
Explicit Withdraw: 2 3
As we know, BGP relies completely on the IGP: for every NLRI there is an associated BGP next hop, which is learnt only through the IGP. Any IGP change, whether an IGP metric change, a BGP next-hop change due to node failure, or a change in path attributes, causes BGP to re-run the best-route selection algorithm. The result of the algorithm is propagated to BGP speakers in the form of explicit or implicit route withdrawals.
In this section we look at how and when BGP reacts to changes in the IGP or in path attributes, and what BGP does when there is no change.
BGP Scan-Timer:
By default, BGP periodically scans all BGP prefixes installed in the table to find the best route and a valid next-hop IP for each prefix, in case anything has changed. Validating the next hop means resolving it recursively through the router's RIB and, where necessary, changing the forwarding information in response to IGP events. This process runs every 60 seconds; the interval is called the BGP scan timer.
IGP protocols are designed to react quickly to any change in the network, and the RIB imports new next hops for IGP prefixes almost immediately. These RIB changes may also change the next hop of BGP prefixes in the BGP table, but BGP cannot detect them until the next scan, which may be up to 60 seconds away. Until then, BGP is unaware of the changes, so some BGP prefixes may follow a sub-optimal path or traffic may even be blackholed until BGP learns the new next hop. The periodic behaviour of the BGP scanner is far too slow to respond effectively to IGP events, leading to very poor convergence wherever BGP prefixes are the sources or destinations.
Moreover, the BGP table can be very large, and scanning it every 60 seconds can hog the router CPU. BGP scanning is neither a suitable nor an optimal way of reflecting RIB changes in the BGP table.
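For completeness, on classic IOS/IOS XE the scanner interval itself can be lowered, though this only trades CPU for staleness and is largely superseded by next-hop tracking, described next. A sketch, with an illustrative AS number:

router bgp 65000
 bgp scan-time 30    ! default is 60 seconds; the permitted range is platform dependent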
- BGP Next-Hop Tracking (NHT) is a newer way of reflecting RIB changes in the BGP table, achieving much faster convergence than the BGP scan timer.
Next-hop tracking is an optimization feature that reduces the processing time of the BGP best-path algorithm by monitoring changes to the routing table.
BGP route: prefix/mask –> next hop
[Ex: 172.1.0.0/16 –> 1.1.0.2]
IGP route: prefix/mask –> next hop, interface
[Ex: 1.1.0.2 –> 2.2.12.1, Gi0/0/0/0]
BGP routes are recursive in nature: a BGP next hop is resolved through an IGP route, and IGP routes usually point directly at an interface (these are called non-recursive routes).
BGP next-hop tracking (NHT) reduces BGP convergence time by monitoring BGP next-hop address changes in the routing table. It is enabled by default and is event-based: when it detects a routing-table change, it schedules a next-hop scan to adjust the next hops in the BGP table. The scan runs after a default delay of 5 seconds, and this delay is configurable. NHT also supports dampening penalties, which increase the scan delay for next-hop addresses that keep changing in the routing table.
bgp nexthop trigger delay XX
With NHT, the BGP process registers its next-hop values with the RIB “watcher” process and receives a callback every time information about the prefix corresponding to a next hop changes. BGP registers only the IGP next hops, whose number is bounded by the number of edge routers in the network; typically, the number of registered next-hop values equals the number of exits from the local AS, or the number of PEs in an MPLS/BGP VPN environment.
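A hedged sketch of NHT tuning follows. Note that the units differ by platform (seconds on IOS XE, milliseconds on IOS XR); the AS number and delay values are illustrative, and aggressive values should be tested before deployment.

! IOS XE: per address-family trigger delay, in seconds
router bgp 65000
 address-family ipv4
  bgp nexthop trigger delay 1
!
! IOS XR: trigger delay for critical events, in milliseconds
router bgp 65000
 address-family ipv4 unicast
  nexthop trigger-delay critical 50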
There are two types of IGP events that bring change to the BGP table:
- An IGP prefix becoming unreachable.
- An IGP prefix metric change.
IGP prefix becoming unreachable-
In the case of an edge-node failure, the BGP next hop changes and the BGP process must start the BGP router sub-process to recalculate best paths. This affects every prefix whose next hop changed because of the IGP event and can take a significant amount of time, proportional to the number of prefixes associated with that next hop. For example, if an AS has two connections to the Internet and receives full BGP tables over both, a single exit failure forces a full-table walk over more than 900k prefixes. BGP must then upload the new forwarding information to the RIB/FIB, with the overall delay again proportional to the table size. BGP convergence in response to an IGP event is therefore non-deterministic: there is no well-defined finite time for the process to complete.
IGP prefix metric change-
If the IGP change does not affect the BGP next hop itself but changes the metric to reach it, for example because of a core link failure, the effect is the same as discussed above: the BGP process kicks off a recalculation of the best paths. If, however, a link or core-node failure does not change the metric to the BGP next hop, BGP does not need to be informed at all and convergence is handled entirely at the IGP level.
- The use of a hierarchical FIB on Cisco platforms, and of indirect next hops on Juniper platforms, makes convergence faster than the flat FIB model.
In any service provider network, all service prefixes are BGP NLRI: whether they are Internet prefixes or VPN services, in both cases the prefix type is BGP. It is therefore very important to understand how fast BGP can converge, so that the traffic impact on end customers is minimized. Control plane and data plane together make up the solution for faster BGP convergence, and that solution is named BGP PIC. As discussed above, during a BGP scan (triggered either by the standard 60-second timer or by the BGP NHT process) all prefixes are checked against the BGP best-path algorithm, which means the longer the BGP table, the poorer the convergence: the last prefix in the table waits longer than any prefix in the middle or at the beginning. With the standard approach, convergence cannot be made prefix-independent.
There are two main BGP PIC options, which contribute to service convergence:
- BGP PIC Core
- BGP PIC Edge
BGP PIC Core –
- Covers IGP path changes in the core network: an IGP metric change, a core link failure, or a core node failure.
- After any such core link or node failure, the BGP next hop still resolves through the same IGP route; only the metric or the outgoing interface used to reach that IGP route changes.

Figure- Representation of BGP PIC Core
The failures (1, 2, 3) shown in the diagram are core failures that do not change the BGP next hop for node R6: R2 remains the BGP next hop, but R6 now uses a different outgoing interface to reach it.
BGP PIC Core has no explicit configuration; it relies solely on two factors:
- How fast the IGP converges, through the IGP control plane (SPF and link-state advertisements) and the core forwarding plane (IP FRR, Remote LFA or TI-LFA). This is entirely independent of BGP, yet BGP convergence depends on it: it is implicitly understood that the core network must be configured for fast convergence if services are to converge fast. See the IGP tuning sketch after this list.
- How efficiently the changes signalled by BGP NHT and the BGP scan process are handled by the data plane. This depends on how prefixes (the FIB) are programmed in hardware and on how changes can be minimized to reduce the impact of the scan process.
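To illustrate the first point, the core IGP is tuned for fast convergence independently of BGP. A minimal IOS XR IS-IS sketch with per-prefix LFA is shown below; the process name, interface, and timer values are illustrative assumptions, and the TI-LFA variant additionally requires segment routing to be enabled.

router isis CORE
 address-family ipv4 unicast
  spf-interval maximum-wait 5000 initial-wait 50 secondary-wait 200
 interface GigabitEthernet0/0/0/0
  address-family ipv4 unicast
   fast-reroute per-prefix           ! classic per-prefix LFA
   ! fast-reroute per-prefix ti-lfa  ! TI-LFA variant, assuming segment routing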
On Cisco platforms, moving from the flat FIB architecture to a hierarchical FIB solved the second problem and made convergence prefix-independent. Let's look briefly at how the hierarchical FIB achieves this.

FLAT CEF-
Let's understand the flat FIB from the perspective of edge router R6 in the topology above.
R6 holds prefixes from routers R7/R8/R10; they could be Internet or VPN prefixes. A flat FIB design creates a one-to-one mapping for every prefix.
For this example we take just five prefixes; the flat FIB on R6 programs an outgoing interface against each BGP prefix.
BGP Prefix on R6 | Outgoing Interface
---------------- | ------------------
10.1.0.0/16      | gi0/0/0/5
10.2.0.0/16      | gi0/0/0/5
10.3.0.0/16      | gi0/0/0/5
172.1.0.7/32     | gi0/0/0/4
172.1.0.10/32    | gi0/0/0/5
When IGP changes occur (a core link failure, a metric change, and so on), BGP NHT triggers a BGP scan as soon as the information is flooded to router R6. The scan kicks in for all prefixes, which are walked one by one, and a new outgoing interface may be programmed for each.
This is highly inefficient, and convergence depends on the size of the table. We used five prefixes in this example, but a real network may carry a full Internet table of over 900k prefixes.
Hierarchical FIB-
A hierarchical FIB introduces multiple pointers in memory for storing prefix forwarding information, creating an indirect relationship between the actual prefix and the outgoing interface used by the router.
A hierarchical FIB is composed of the following:
- BGP NLRI – the actual BGP-learnt prefix for which the router builds the BGP routing table.
- BGP path list – the BGP next-hop information, one of the BGP attributes. Each BGP NLRI must have a valid BGP next hop that is recursively reachable through the IGP. There may be multiple BGP next hops if the NLRI is learnt through multiple endpoints or Internet gateways, or if the same prefixes are received from multiple MPLS edge nodes in resilient VPN CPE sites.
- IGP path list – the IGP paths and neighbors used to reach the BGP next hops listed in the BGP path list; in other words, the IGP information for reaching the BGP next hop. In the given topology, if there are ECMP paths to a BGP next hop, each neighbor is a member of the IGP path list.
- Outgoing interface – the interface, or set of interfaces, used to reach the BGP next hop. With ECMP paths to the BGP next hop, there can be multiple interfaces.

In a hierarchical FIB, each of the components above has its own pointer to a separate memory location. If an IGP path change to a BGP next hop results in a new IGP neighbor and outgoing interface, only the IGP path list and outgoing-interface entries in memory are updated; the BGP NLRI entries do not change at all. This ensures that the length of the BGP NLRI table does not affect convergence of the BGP table. The relation between BGP NLRI and the router's IGP entries is pure pointer indirection, and any IGP change is just a redirection of pointers.
In the flat FIB, by contrast, the outgoing interface was stored as a separate one-to-one entry with every BGP NLRI, so changing the outgoing interface meant scanning the whole BGP table.
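The layers of this hierarchy can be inspected on the router itself. The commands below exist on Cisco platforms, though their output format varies by release; the prefix is simply one of the examples from the table above.

! IOS XE
show ip cef 10.1.0.0 255.255.0.0 internal    ! path list and output-chain detail
show bgp ipv4 unicast 10.1.0.0/16
! IOS XR
show cef 10.1.0.0/16 detail
show route 10.1.0.0/16 detail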
- BGP PIC Core depends entirely on the hierarchical FIB and on core-network data-plane convergence. There is no separate configuration for BGP PIC Core.
- BGP PIC Core covers only IGP path list and outgoing-interface changes, not BGP path list changes; those are handled by BGP PIC Edge.
If a link or node failure in the core infrastructure has no impact on the local router's ability to reach the BGP next-hop entries listed in the BGP path list, BGP does not trigger a scan.
Example – in the topology above, if the link between R4 and R5 fails, both routers flood link-state updates to the network. The local router receives the IGP update and runs SPF; if there is no change in metric, immediate next router, or outgoing interface, the BGP NHT process registers no change and no BGP scan is triggered.
- For any IGP change, SPF runs first, and the resulting IGP path list and outgoing interfaces are then compared against the current entries.
When there are ECMP paths in the IGP path list for a BGP next hop listed in the BGP path list, failures of individual ECMP paths do not trigger a BGP scan or the best-path algorithm as long as at least one path in the ECMP list remains.
Failure scenarios with BGP PIC Core-

In the topology above, the failure scenarios are listed below:
- Failure cases 1 to 8 fall under BGP PIC Core convergence.
- Failure cases 9 to 14 fall under BGP PIC Edge convergence.
- Failure cases 15 to 16 fall under neither BGP PIC Core nor Edge.
- Failure cases 18 to 20 fall under neither BGP PIC Core nor Edge.
Failures 1 to 6 under BGP PIC Core convergence:
Failure No. 1 & 2 – the link between R6 and R4 fails, or router R4 itself fails.

- From local router R6's perspective, the 10.x/16 and 172.1.0.7/32 prefixes are reachable via BGP next hops R1 and R2, which are listed in the BGP path list. Both R1 and R2 advertise these prefixes to the RR, and the RR advertises them to R6.
- R6 has ECMP paths to reach R1 and R2 via its immediate IGP neighbors R4 and R5. Both ECMP paths are programmed in hardware and forwarding is flow-based.
- If R4, or the link between R4 and R6, goes down, the second ECMP path is still available to forward traffic to BGP next hops R1 and R2.
- The IGP detects the failure within a few milliseconds; BFD is expected to be enabled for fast detection for BGP PIC Core to work. The IGP on R6 completes SPF on the updates received as a result of the failure, and through the BGP NHT feature BGP learns about the resulting changes. Since the second ECMP link is still alive, SPF yields nothing better than the existing path and no change in IGP metric to reach R1/R2, so BGP runs no scan to recompute the IGP path list entries in the tables above.
Failure No. 3 – the link between R6 and R5 fails.
This failure replicates the scenario above for the 10.x/16 and 172.1.0.7/32 NLRI, but for the 172.1.0.10/32 NLRI the IGP path list changes: the SPF run on R6 yields a new best path to reach node R11, which is listed in the BGP path list. BGP NHT is notified, and a BGP scan is triggered to update the IGP path list memory pointer.
FIB during failure–

FIB After failure

- A BGP scan is not limited to a single prefix; in fact, a change in the IGP path list for any BGP NLRI is notified to BGP via BGP NHT, which triggers a BGP scan of the whole table.
Failure No. 5 to 8 – failure of one of the links between the core nodes R4/R5 and the edge nodes R1/R2.
If any single one of failures 5 to 8 occurs (but not two or more together), the IGP path list may change:
- Failure point 5 – the link between R4 and R1 fails. In this case SPF on R6 concludes that traffic to R1 is no longer via ECMP but only via R5. This modifies the IGP path list; traffic to R2 is still via ECMP, so its entry is unchanged. The IGP path list is updated, but no BGP scan is required.
- Failure point 8 – the link between R5 and R2 fails. In this case SPF on R6 concludes that traffic to R2 is no longer via ECMP but only via R4. This modifies the IGP path list; traffic to R1 is still via ECMP, so its entry is unchanged. The IGP path list is updated, but no BGP scan is required.
- Failures 6 and 7 are analogous to failures 5 and 8.
Failure No. 9 and 10 – node R1 or R2 fails.
- Failures 9 and 10 take a BGP next hop down. They change the BGP path list, not the IGP path list, so they are not covered by BGP PIC Core; we discuss this use case in the PIC Edge section.
PIC Edge–
BGP PIC Edge covers failures of the edge node itself, or failures behind the edge node towards the CPE.
- Failure cases 9 to 14 fall under BGP PIC Edge convergence.
- Failure cases 15 and 16 fall under neither BGP PIC Core nor Edge. However, if CPEs R7 and R8 belonged to the same customer in a dual-CPE resilience design, with iBGP between the two CPEs, this would be another perfect real-world PIC Edge use case.
- Remote edge PE node failure – in our topology R1/R2/R11 are remote edge PEs. Such a failure changes the BGP next hop of the NLRI listed in the BGP path list. If the BGP path list changes, the IGP path may change too, which in turn may change the IGP path list and the outgoing interface in the FIB table.
- PE-CE link failure, or CE failure – this failure scenario is topology dependent:
- If the resiliency design has a single CPE advertising the prefixes and that CPE fails, with no second CPE, no convergence is possible for such a topology.
- If the resiliency design has a single CPE but dual WAN connectivity to the edge nodes, failure of either WAN link is covered by BGP PIC Edge.
- If a single CPE is deployed with single WAN connectivity, no fast convergence is possible.
- In the topology above, CPEs R7/R8 are deployed as single CPEs with dual WAN connectivity to R1 and R2, while CPE R10 is deployed with single connectivity to edge node R11.

Figure- Representation of BGP PIC Edge
Let's understand BGP PIC Edge in more detail with an example.
Failure No. 9 – edge node R1 is the primary node for traffic to CPEs R7/R8, and it fails.
During failure-

After failure-

- R1 and R2 are the edge nodes advertising the CPE R7/R8 prefixes towards the RR in the operator network.
- By default the RR chooses one best path and advertises it to the other edge nodes, such as R6 in this topology.
- Let's consider R1 the primary node, whether policy-driven or selected by the RR on IGP metric. If R1 fails:
- The RR and the other edge nodes learn about the failure either via BGP NHT or via BGP fast fallover with BFD. The IGP update is much faster than BFD here, so in practice the trigger is the IGP, acting through BGP NHT.
- For the affected BGP NLRI this is a BGP next-hop failure, which triggers a BGP scan on the RR and the other edge nodes to recalculate the next hop for every NLRI whose next hop was R1.
- An interesting point: other edge nodes may not peer directly with R1 but only via the RR. In this topology R6 has no direct BGP session with R1; it peers with the RR (R3). It is the RR's job to send a BGP UPDATE (withdraw) message to all its BGP peers, and it does so immediately after learning of the failure: when the failure is in the BGP path list (next hop), the RR triggers the update packet without waiting for BGP scanning. The edge node should react immediately and run a BGP scan for new next-hop calculation once it receives the withdraw. In practice, though, the BGP update may take longer to reach the other nodes, and through BGP NHT the IGP may already have informed them of the failure. So R6 in our topology may already know about R1's failure and be working on calculating the second-best next hop.
- After sending the BGP UPDATE (withdraw) message to its peers, the RR starts scanning the table for the new best next hop, and the other edge nodes do the same.
- There is a complete traffic drop towards the CPE during the period in which R1 has failed and the RRs and other edge nodes are busy finding the new best next hop.
- The RR finds the new next hop and advertises it to its peers.
- The other nodes install the new next hop in the FIB and start forwarding traffic via R2.
- The points above clearly show that convergence relies entirely on the BGP control plane, and the BGP control plane relies on fast failure detection.
- What if we calculated primary and backup paths in advance, before any failure, on the edge nodes and the RR? During a failure, an edge node could shift traffic to the backup path and give the BGP control plane all the time it needs to find the new best next hop. Is that not a good idea? That is exactly what we call BGP PIC Edge.
- The CPE prefixes were advertised by both R1 and R2, but with default BGP behaviour only the best route is programmed into the RIB and FIB. When the best next hop fails, no route is installed in the RIB/FIB until BGP finds a new best next hop.
- BGP PIC not only ensures that a backup BGP next hop is calculated, but also that it is programmed in the RIB/FIB, so that failure of the primary immediately activates the backup for traffic forwarding.
- In a BGP PIC solution, all nodes:
- calculate the best route, as per default BGP behaviour;
- calculate a backup route, unlike default BGP behaviour;
- advertise both primary and backup paths to other peers, unlike default BGP behaviour;
- and, on edge nodes, install both primary and backup in the RIB/FIB tables.
- Installing the backup path in the FIB is critical for sub-second convergence; but since RRs are not placed inline in the topology, the installation step (the last point above) can be skipped on them. Backup route installation on an RR may create scale issues, so the solution must be considered carefully.
- The labs demonstrated in this workbook cover BGP PIC Edge, because BGP PIC Core requires no configuration and relies implicitly on core convergence and the platform FIB structure.
BGP PIC Edge- Control Plane-
We have chosen a simple topology for a low-level explanation; a hedged configuration sketch follows the walkthrough below.

- R7, connected to R1 and R2, advertises 172.1.0.7 via eBGP.
- R1 and R2 receive 172.1.0.7 in their BGP tables. Both edge routers advertise this prefix to the RR node R3. At this stage R1 and R2 each hold a single path for the NLRI and no backup route yet.
- The RR, R3, receives the NLRI 172.1.0.7.
- The RR runs the best-path algorithm to select the best path for the NLRI.
- The RR is already configured for add-path, so after best-route selection it also finds the second-best (backup) path.
- The RR advertises these updates to all three edge nodes (R1, R2, R6) shown in the topology.
- R6 receives the BGP update from the RR and programs both primary and backup paths in its RIB/FIB tables.
- R1 and R2 also receive the BGP update from the RR.
- If no explicit policy (such as local preference) enforces primary-PE selection, then by default both R1 and R2 keep their eBGP-received routes as best.
- If a policy is applied, the primary path is strictly enforced and one of the edge nodes accepts the primary path via iBGP rather than eBGP.
- In our example we have made R2 primary, which means that on node R1 the best route is via R2; the eBGP route on R1 then becomes the backup.
- Each edge node thus has one primary path, and a second path (learnt via either eBGP or iBGP) as backup; both are programmed in the forwarding plane.
- Now all edge nodes (R6/R1/R2) have both primary and backup routes in the RIB/FIB/CEF.
- The RR does not need to program a backup path: if the RR is not inline (not in the forwarding plane), route installation is not important for it.
- BGP add-path configuration should be present on the CPEs as well. Without it, if one link fails, the CPE takes a long time to fall back to the second link. For end-to-end sub-second convergence, every element in the path should be configured for fast-convergence features.
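Pulling the control-plane pieces together, a hedged end-to-end sketch for this topology might look like the following. AS numbers, neighbor addresses, the VRF name, and the policy name are assumptions for illustration; the IOS XE and IOS XR fragments express the same intent in each syntax and should be verified against your release.

! RR (R3), IOS XE style: compute and advertise the 2nd-best path
router bgp 65000
 address-family vpnv4
  bgp additional-paths select best 2
  bgp additional-paths send receive
  neighbor 10.0.0.6 activate
  neighbor 10.0.0.6 advertise additional-paths best 2
!
! Receiving PE (R6), IOS XE style: install the backup in the VRF
router bgp 65000
 address-family ipv4 vrf CUST-A
  bgp additional-paths install
!
! IOS XR equivalent on a PE: backup selection via route policy
route-policy ADD-PATH-BACKUP
  set path-selection backup 1 install
end-policy
router bgp 65000
 address-family vpnv4 unicast
  additional-paths receive
  additional-paths selection route-policy ADD-PATH-BACKUP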
BGP PIC Edge Data Plane-
As discussed above, all edge nodes have both primary and backup paths programmed for NLRI 172.1.0.7/32. CPE R7 is also configured for BGP PIC, so both the R1 and R2 routes are programmed. Let's look at what happens if the link R2-R7 fails, or edge node R2 itself fails.
Link between PE and CPE (R2-R7) fails
- Traffic from R9 to R7 is taking the primary path via R2.
- While traffic is transiting R2 and the R2-R7 link, the R2-R7 link fails.
- R2 and R7 are the first to detect this failure; BFD must be configured to ensure sub-second failure detection.
- The other edge and core nodes do not yet know about the failure, so R6 keeps sending traffic via R2.
- If R2 did not react after detection, traffic would drop at R2 until R6's control plane converged and made R1 the primary. But R2 already has the backup path programmed, via R1.
- Immediately after detecting the link failure, R2 switches traffic to the backup path. This takes only a few milliseconds, because the path is already programmed in the PFE.
- At this stage no control-plane convergence has happened, which means that for all other nodes R2 is still the primary. Traffic keeps heading to R2, and R2 then forwards it on to R1.
- Since the backup path is via another MPLS edge node, where a VPN label lookup will happen, it is very important that the primary node R2 swaps the VPN label to R1's VPN label before switching the packet towards R1.
- R1 also has both best and backup paths programmed, but being the backup node (policy enforced) its LFIB shows the local label entry with an outgoing label of “unlabelled”, and both primary and backup exit via the eBGP interface. This LFIB entry looks strange for a local VPNv4 label, but it is a critical point to understand.
- With per-prefix VPN labels there is always an “unlabelled” entry against the local label in the LFIB. Packet forwarding to the CPE happens purely on the local label match and the outgoing interface in the LFIB; there is no IP lookup in the per-prefix VPN label case.
- Since R1 is acting as backup, its own primary path points back to R2. If packets sent by R2 to R1 were routed that way, R1 would switch them back to R2, creating a loop between R2 and R1 until the control plane converged.
- It is therefore very important that when R1 receives labelled packets from R2, the LFIB lookup shows both paths towards the CPE interface and the packet is switched directly on the local label entry.
- If for any reason an IP lookup happens, R1 switches the packet back to R2. To avoid this, the LFIB programs both best and backup paths via the CPE interface.
- For the same reason, BGP PIC does not work with “per-VRF label” mode: there, an IP lookup is mandatory before the packet is switched to the CPE, which with BGP PIC can create the loop described above, so per-VRF label is not supported with a BGP PIC configuration (see the label-mode sketch after this list).
- In parallel with switching packets onto the backup path, R2 is expected to send a BGP withdraw message to the RR.
- The RR has already learnt about the failure through BGP NHT. The RR (R3) also sends BGP withdraw messages to R6 and R1.
- Without BGP PIC, R6 and R1 would spend time calculating another best route; but since BGP PIC is configured on R6, on receiving the BGP withdraw message from the RR (or on the BGP NHT notification) R6 immediately shifts to the backup path via R1.
- Traffic from R6 is now shifted to R1, and R2 is no longer expected to receive any traffic.
- Meanwhile, R6 converges its control plane in sync with the RR and calculates the best route, which is again via R1.
- R1 is notified of the failure both through the IGP and through the BGP withdraw message. Immediately afterwards it updates its LFIB/CEF/RIB tables, after which only a single entry remains.
- The whole network is not expected to converge at the same instant; until full convergence is achieved, individual nodes use their backup paths, programmed in advance, to deliver sub-second convergence.
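Two final illustrative snippets. First, the label-allocation caveat from the list above: where the label mode is configurable, per-prefix is typically the default, and the per-VRF mode is the one to avoid with PIC Edge. Second, the pre-programmed repair path can be verified in CEF before any failure occurs. Command availability and output vary by platform and release, and the VRF and prefix names are taken from this example.

! IOS XR: label mode is set under the VRF address-family; per-prefix is the default
router bgp 65000
 vrf CUST-A
  address-family ipv4 unicast
   ! label mode per-vrf    <- avoid this mode with BGP PIC Edge
!
! IOS XE: the global per-vrf mode exists but should be avoided with BGP PIC Edge
! mpls label mode all-vrfs protocol bgp-vpnv4 per-vrf
!
! Verification: look for a repair/backup path in the output
! IOS XE
show ip cef vrf CUST-A 172.1.0.7 detail
show bgp vpnv4 unicast vrf CUST-A 172.1.0.7/32
! IOS XR
show cef vrf CUST-A 172.1.0.7/32 detail
show bgp vrf CUST-A 172.1.0.7/32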