We have now connected our two data networks together for redundancy and have run into a very strange issue.
Just to be clear:
THW Core ae0 ----------------------------------- ae0 Core HEX
iBGP is running between the core loopback interfaces.
ISIS is the IGP
eBGP peering over xe-1/2/5
THW Core aggregated routes remain advertised when the connection is made.
HEX aggregated routes disappear completely.
If I disconnect the data centres, the HEX aggregated routes re-appear.
Any ideas anyone?
Funny you should mention that as I have an update..... and this will be good learning for troubleshooting.
Take the two sites, as follows:
THW CORE --------------------------------- HEX CORE
THW Core has aggregated, advertised routes to one upstream ISP and HEX Core has different advertised aggregated routes to another upstream ISP.
When they were separate the routing was fine to the upstream ISPs. When connected it was not. However, after completing some troubleshooting I found out the following:
run show route advertising-protocol bgp <peer address>
Before the connection was made between the data centres: all aggregated routes appeared.
After the connection was made between the data centres: no aggregated routes showed.
I know why this is occurring now. All routing from HEX Core is now going across iBGP or IS-IS (I need to investigate a little further) to THW Core and out to that upstream ISP. Naturally, as the HEX Core routes are not advertised there, nothing will work from HEX, only from THW.
So, I'm getting there with the troubleshooting, now I have to find a way of making HEX routes exit via HEX Core and not THW Core.
I will leave open and update as I go for a good bit of troubleshooting for other people.
I'm open to any suggestions on what to look for to influence the routes (local preference, MEDs)......
Yes. Preference of an internal route is what I suspected too. I will investigate today and let you know the results.
There have to be "live" contributing routes for aggregate routes to be advertised in BGP.... these do exist.... so, I'll check the routing and forwarding tables and the policies and see what I can find there.
If you want me to post the config here then please say.
Okay. I will try and explain this as best I can over forum messaging.
When looking at the routing tables I found that IS-IS was being preferred. The reason for this was that, unknown to me, someone had configured THW-Core to advertise the complete internal network as an aggregate. So, I removed that aggregate route and the ISO configuration on the connecting interfaces and, hey presto, all of the aggregated routes for HEX re-appeared. All good so far...... except....
One of the tests we need to complete is a loss of upstream ISP on one site and ensure the routes traverse the interconnects and exit the opposing site, therefore ensuring failover. So, for example, if we shut the upstream ISP interface on HEX, we expect all routing to go to THW from HEX across the core routers. Again, here is the topology (Basic):
THW CORE -------------------------- HEX CORE
Port xe-1/2/5 is to the Upstream ISP.
Disable xe-1/2/5 on HEX Core. I expect all routes that did exit xe-1/2/5 on HEX to now traverse across the connection to THW and exit xe-1/2/5 on THW.
So, I disabled the xe-1/2/5 interface on HEX Core. I checked the routes on THW and none of the HEX routes appeared (I did add them to the THW aggregate list):
run show route advertising-protocol bgp <peer address> on THW. Only THW routes appeared.
run show route protocol aggregate detail - No contributing routes showing.
So, I found the right policy that was only adding "Direct" routes to iBGP. So, I configured the following as a test:
On HEX Core:
set policy-options policy-statement internal-bgp-peers term 2 from protocol isis
And the routes appeared on THW.
So, now I am left with the following position:
I can route from a DSL CPE to THW xe-1/2/5 interface, but no further. The routes appear in the aggregated table, they also appear in the contributing list.... I have also checked on "lookingglass" and the routes are seen there in the BGP tables. It's almost like iBGP is not getting the routes to eBGP somehow.
I have attached a basic overview of the network.....
Please let me know what other information you would like to point me in the right direction.
I will carry on investigating.
As an add on to this..... A thought I had regarding what is currently occurring is that the "other" upstream ISP might not be accepting those new routes. Waiting for a response from them as to what filters they have in place.
Edit add on:
No. It appears the other upstream ISP is accepting those routes.
So, given the diagram I attached, imagine a CPE DSL customer the other side of Wholesale, who comes into HEX-LNS-02, out of ae1 to HEX-CORE-02 and, normally, through xe-1/2/5 to the upstream ISP. But in this DR test I have disabled xe-1/2/5 on HEX-CORE-02.
Let's say the two routes should be:
set routing-options aggregate route 192.168.10.0/24
set routing-options aggregate route 192.168.100.0/24
Now, when I run the following command on THW-CORE-01:
run show route advertising-protocol bgp <peer address>
I see the following:
* 192.168.10.0/24 Self I
* 192.168.10.0/30 Self I
* 192.168.100.0/24 Self I
* 192.168.100.0/30 Self I
And if I run the following command:
run show route protocol aggregate detail:
192.168.10.0/24 (1 entry, 1 announced)
        *Aggregate Preference: 130
                Next hop type: Reject, Next hop index: 0
                Address: 0x2a2f284
                Next-hop reference count: 13
                State: <Active Int Ext>
                Local AS: 11111
                Age: 18:42:58
                Validation State: unverified
                Task: Aggregate
                Announcement bits (3): 0-KRT 2-BGP_RT_Background 7-Resolve tree 4
                AS path: I (LocalAgg)
                Flags: Depth: 0 Active
                AS path list:
                AS path: I Refcount: 2
                Contributing Routes (1):
                        192.168.10.0/30 proto BGP
192.168.100.0/24 (1 entry, 1 announced)
        *Aggregate Preference: 130
                Next hop type: Reject, Next hop index: 0
                Address: 0x2a2f284
                Next-hop reference count: 13
                State: <Active Int Ext>
                Local AS: 11111
                Age: 17:42:08
                Validation State: unverified
                Task: Aggregate
                Announcement bits (3): 0-KRT 2-BGP_RT_Background 7-Resolve tree 4
                AS path: I (LocalAgg)
                Flags: Depth: 0 Active
                AS path list:
                AS path: I Refcount: 1
                Contributing Routes (1):
                        192.168.100.1/32 proto BGP
If I complete a traceroute from the CPE at HEX side, I can get through to the THW-CORE-01 ae0 interface. If I run the following command on THW-CORE-01 ae1 interface:
run monitor traffic interface ae0 no-resolve size 1500 matching "net 192.168.10.1" -----
and ping the xe-1/2/5 interface on THW-CORE-01, I see the packet flow. When I run traceroute from the CPE I get to the ae0 interface and no further, and if I try to ping 18.104.22.168 I see no traffic.... so, I think there is an issue with the advertising or the policy on THW-CORE-01 rather than HEX-CORE-02.......
Any help here would be great ..... if the above has not completely confused you 🙂
But with the attached network topology it will make sense...
I am not sure I follow the topology so apologies if this is off.
The issue I see is how the site will know that the ISP has failed at the remote site.
We have to have the aggregate routes ready to advertise from what we see on the remote site link to our own ISP. But only when the remote ISP is down.
It would be easy to advertise them all the time with a different AS-path prepend, a different local preference towards the ISP, or different prefix lengths. But turning this on and off based on an event on the remote router is tricky.
My simple solution would be:
For the primary site, advertise two aggregate routes, each being half of the IP space. This will make the longest match win, so most traffic will come here.
On the backup site, advertise the full aggregate as a single prefix so it is in the ISP tables and ready to go but is not generally used.
This could then be up all the time with no event detection needed.
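As a rough sketch of that split in Junos set syntax: this assumes 192.168.10.0/24 is the block homed at the primary site and that contributing routes exist inside each half. The policy name export-to-upstream and the term names are invented for illustration, and lines starting with # are notes for the reader rather than commands.

```
# Primary site (assumption: 192.168.10.0/24 lives here).
# Originate both halves so the longest match draws traffic to this site.
set routing-options aggregate route 192.168.10.0/25
set routing-options aggregate route 192.168.10.128/25
set policy-options policy-statement export-to-upstream term halves from protocol aggregate
set policy-options policy-statement export-to-upstream term halves from route-filter 192.168.10.0/24 prefix-length-range /25-/25
set policy-options policy-statement export-to-upstream term halves then accept

# Backup site: originate only the covering /24, always advertised.
set routing-options aggregate route 192.168.10.0/24
set policy-options policy-statement export-to-upstream term covering from protocol aggregate
set policy-options policy-statement export-to-upstream term covering from route-filter 192.168.10.0/24 exact
set policy-options policy-statement export-to-upstream term covering then accept
```

The backup site only originates its /24 while it still holds a contributing route, so the customer prefixes have to reach it over iBGP. With real address space, bear in mind that upstreams commonly filter IPv4 prefixes longer than /24, so the halves of a /24 would probably not be accepted; the same idea applies one size up (a /23 split into two /24s).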
No need for apologies. I am very appreciative of the help here.
I have attached here a new, more detailed network topology to try and show where BGP is running and the traffic flow. I have created two made up customers. So, with regards to the attached document, here is the scenario:
2 x data centres: 1 at Harbour Exchange (HEX) and one at Telehouse West (THW). We will have a 50/50 split estate across the two sites.
A customer at site HEX (as marked with IP 192.168.12.10 on the diagram) would have a traffic flow through HEX-LNS-02, across interface ae1 to ae1 on HEX-CORE-02 and then out of xe-1/2/5 on HEX-CORE-02 to the upstream ISP (marked as ASN 23456 on the diagram).
The same flow would occur on THW (but obviously through the THW equipment to the upstream ISP marked as ASN 98765 on the diagram).
So, one of the Disaster Recovery tests I need to complete is to simulate a loss of one ISP from one of the sites. So, on HEX-CORE-02 I disable interface xe-1/2/5 to simulate the loss (this is from remote so commit confirmed is always used).
Now the traffic flow should be (so no loss of service for customers at HEX):
Customer at 192.168.12.10 will route through HEX-LNS-02, then the ae1 interface to HEX-CORE-02, then across iBGP, so ae0 to THW-CORE-01 and then out of xe-1/2/5 at THW-CORE-01 to the upstream ISP on ASN 98765.
Hopefully, with the diagram and that explanation it should look a lot better.
Please let me know what information you would like to see?
Output from troubleshooting commands?
Anything you need, then please let me know.
I will continue troubleshooting and will also try your suggestion.
Okay. So, I have got the routing working by enabling IS-IS on the ae0 interfaces, but this appears to have caused another issue....
I cannot peer with the HEX upstream ISP when re-enabling the port.
Reason: Prefix limit reached.
It appears that the IGP is carrying the complete Internet routing table between the two sites. Not sure if that is meant to happen. More troubleshooting required.
For the IGP, it sounds like IS-IS is importing BGP routes, which you will not want in your case since the only use for the IGP is to carry the loopbacks and internal routing. All the BGP should be done in the BGP peerings.
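One way to check for that, sketched as operational-mode commands (prefix them with run from configuration mode). These are generic checks, not output from this network, and lines starting with # are notes rather than commands.

```
# Is any export policy applied to IS-IS at all?
show configuration protocols isis | display set | match export

# How many IS-IS routes are in the table? Expect loopbacks and
# internal links only, not hundreds of thousands of prefixes.
show route protocol isis | count

# Per-protocol route counts for each routing table.
show route summary
```

If an export policy is attached to IS-IS, the policy it names is the place to look for a term that matches protocol bgp.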
For the iBGP, are these the only two sites, or are there multiple sites and a route reflector somewhere?
It also sounds like you don't need full tables for this setup. Since you have each site using only one ISP they only really need the default route to reach the internet. You would only need full tables to mix the upstream usage of more than one ISP.
Between the sites on iBGP you would advertise the default route from the ISP and your customer prefixes for reachability. On the import policy you would mark these at a lower local preference so that this default is only used when the local ISP is lost.
The export to the ISP policies I described above. By using the more specific prefixes at the preferred site most traffic will arrive there but the longer prefix will be out and available for failover when the primary site is lost.
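A minimal sketch of the import side, assuming the iBGP group is called internal-peers and that the locally learned default keeps the Junos default local preference of 100. The policy name ibgp-import, the term name and the value 50 are placeholders; lines starting with # are notes, not commands.

```
# Mark the default route learned from the other site as a backup.
set policy-options policy-statement ibgp-import term backup-default from protocol bgp
set policy-options policy-statement ibgp-import term backup-default from route-filter 0.0.0.0/0 exact
set policy-options policy-statement ibgp-import term backup-default then local-preference 50
set policy-options policy-statement ibgp-import term backup-default then accept

# Apply it to the iBGP group (group name is an assumption).
set protocols bgp group internal-peers import ibgp-import
```

While the local upstream is up, its default (local preference 100) wins. When that session drops, the lower-preference default from the other site is the only one left and takes over without any event detection.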
Thank you for the reply. Much appreciated.....
From an iBGP perspective, I have configured the following group which only has 1 policy applied:
set protocols bgp group internal-peers type internal
set protocols bgp group internal-peers local-address 192.168.1.2 ----- loopback address
set protocols bgp group internal-peers export internal-bgp-route
set protocols bgp group internal-peers peer-as 200994
set protocols bgp group internal-peers neighbor 192.168.1.5 ----- peer loopback
set policy-options policy-statement internal-bgp-route term 2 from protocol direct
set policy-options policy-statement internal-bgp-route term 2 then accept
set policy-options policy-statement internal-bgp-route term next-hop-self then next-hop self
This is replicated on HEX.
Can I assume then, given your information, that this policy is not correct? Should I remove the "next-hop-self"? (I'm not sure I should, or the iBGP peer will not know where to route everything.)
I assume you have other terms in the policy?
This would only send directly connected routes which I assume are your customer prefixes downstream.
To have the backup ISP link you would need to send over the ISP route. Currently this is the full BGP table that you mentioned, which also seems to be too large to share. Here I suggest you get a default only instead, then have a term in this policy that matches and sends only the BGP default route.
The import policy would accept this bgp default route from the iBGP peer and set a local preference lower than the local ISP default route so it will only be a backup should the local route be lost.
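A hedged sketch of the export side, adding a term to the internal-bgp-route policy shown earlier in the thread. It assumes the upstream has been asked to send a default route only; the term name default-from-isp is invented. Lines starting with # are notes, not commands.

```
# Pass the BGP-learned default to the iBGP peer, rewriting the next hop
# to this router so the peer can resolve it via the loopback.
set policy-options policy-statement internal-bgp-route term default-from-isp from protocol bgp
set policy-options policy-statement internal-bgp-route term default-from-isp from route-filter 0.0.0.0/0 exact
set policy-options policy-statement internal-bgp-route term default-from-isp then next-hop self
set policy-options policy-statement internal-bgp-route term default-from-isp then accept
```

Because the term matches protocol bgp rather than static, the default is withdrawn from the iBGP peer as soon as the upstream session stops delivering it.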
So, currently, I have the following configured and set on the eBGP peering interface:
set protocols bgp group External-Peers export isis-default
set policy-options policy-statement isis-default term ipv4 from protocol static
set policy-options policy-statement isis-default term ipv4 from route-filter 0.0.0.0/0 exact
set policy-options policy-statement isis-default term ipv4 then accept
Is this what I should also export in iBGP but with the obvious option of changing the local-preference so it becomes a backup?
Almost. You don't want to create a static default route at all. You will want that route to come in via BGP from the upstream carrier. That way the route will leave when you lose the carrier link.
If you have this as a static route, it won't go away in some cases, only if you lose the physical link to the carrier. For failover to occur, the default route needs to be removed when the upstream path is gone for any reason, not just physical link failure.
Okay, I have just set the default route and removed the static and configured it as an "export" policy on the iBGP group.
Now, here is the issue we are experiencing:
1: The complete table has gone, which is great.
2: If I now run a traceroute from a DSL subscriber on HEX (as per the diagram) when I shut down the upstream ISP (GTT) interface, I get a loop from the Core to the LNS, backwards and forwards.
On the Core router I have a 0.0.0.0/0 default route pointing towards the LNS. If I remove this then I have no routes available for DSL subscribers and also lose connectivity to HEX (commit confirmed is useful 🙂).
This is a little confusing as with the default created policy applied as an export on the iBGP group, I would have thought it would have worked.
It may be that I need to put the iBGP physical link (not the peering loopback) into IS-IS and retry.
The solution is nearly always a little piece of configuration that is missed.
So, when they were two separate sites, the default route had to point back to the LNS. And it was STILL pointing to the LNS. I changed this to point to the peer physical interfaces between the iBGP peers and, hey presto, it is all working exactly as we want it to.
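For anyone following along, the change amounts to something like the sketch below. Both next-hop addresses are placeholders from the documentation range, not the real ones: 192.0.2.1 stands in for the LNS and 192.0.2.6 for the far end of the ae0 link between the cores. Lines starting with # are notes, not commands.

```
# Old default, left over from when the sites were separate (LNS address is a placeholder).
delete routing-options static route 0.0.0.0/0 next-hop 192.0.2.1

# New default towards the other core over the inter-site link (placeholder address).
set routing-options static route 0.0.0.0/0 next-hop 192.0.2.6
```

One thing to keep in mind: a Junos static route has a default preference of 5, which beats BGP at 170. If a BGP-learned default from the local upstream is meant to win while that upstream is up, the static would need a preference above 170, for example by adding preference 200 to the route.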
I do have a separate question that I discovered during this troubleshooting, but I will ask that in a separate topic.
Thank you for your help Steve, much appreciated.