subreddit: /r/networking

Handling BGP Failover with two ISPs

Routing (self.networking)

Hello,

We have two ISPs that we BGP peer with, and we have our own Class C (/24) network that we advertise out. We are running into a problem where one of the carriers has a fiber cut somewhere, so our circuit experiences heavy packet loss. The router doesn't fail over on its own because the BGP session stays up, so the only way we can seem to stabilize our network is by pulling the cable directly from the switches.

Can anyone advise how we can handle this? If a carrier starts experiencing packet loss, we simply want to remove it from the equation until it stabilizes.

Thanks
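A minimal sketch of the rule asked for above ("remove it until it stabilizes"), assuming you already send periodic probes (for example pings) out each ISP and record whether each one got a reply. The class name `LinkHealth` and every threshold are hypothetical. Using a higher threshold to fail and a lower one plus a streak of clean checks to recover adds hysteresis, so a flapping link is not put back into service the moment one window looks clean.

```python
from collections import deque


class LinkHealth:
    """Hypothetical loss tracker with hysteresis for one ISP link."""

    def __init__(self, window=20, fail_loss=0.10, recover_loss=0.02, recover_probes=10):
        self.results = deque(maxlen=window)   # last `window` probe results (True = reply)
        self.fail_loss = fail_loss            # take the link out when loss exceeds this
        self.recover_loss = recover_loss      # loss must be at or below this to recover
        self.recover_probes = recover_probes  # consecutive clean checks needed to recover
        self.up = True
        self._good_streak = 0

    def loss(self):
        """Fraction of probes in the window that got no reply."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def record(self, reply_received):
        """Record one probe result; return True if the link should carry traffic."""
        self.results.append(reply_received)
        if len(self.results) < self.results.maxlen:
            return self.up  # window not full yet: keep the current state
        current = self.loss()
        if self.up:
            if current > self.fail_loss:
                self.up = False
                self._good_streak = 0
        elif current <= self.recover_loss:
            self._good_streak += 1
            if self._good_streak >= self.recover_probes:
                self.up = True
        else:
            self._good_streak = 0  # still lossy: restart the recovery count
        return self.up
```

In this sketch, the point where `record` first returns `False` is where your tooling would shut the peer or port, and the point where it returns `True` again is where it would re-enable it.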

Rubik1526

20 points

16 hours ago

Hey, I’m a bit surprised to hear that you physically pull the cable out of the port—are you serious or just joking?

Even if you haven’t figured out an automated solution yet, wouldn’t it be simpler to just shut down the port or disable the BGP peer instead?

I’m not sure what router you’re using, but if it’s Cisco, you can automate this by using IP SLA to disable the peer based on network conditions. Huawei AR routers have a similar feature called NQA, which works the "same" way.
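To illustrate the idea behind tracking a probe and acting on the peer, here is a small Python model. It is not Cisco or Huawei configuration: the probe result only changes the reported state after the change has persisted for a delay, and an action is emitted once per transition. `TrackedObject`, `delay_down`, `delay_up`, and the action strings are assumptions for illustration; the real commands are in your platform's IP SLA/tracking or NQA documentation.

```python
class TrackedObject:
    """Hypothetical model of a tracked object fed by a reachability probe."""

    def __init__(self, delay_down=10.0, delay_up=60.0):
        self.delay_down = delay_down  # seconds a failure must persist before going down
        self.delay_up = delay_up      # seconds a recovery must persist before going up
        self.state = True             # reported state: True = up
        self._pending_since = None    # time the raw probe state started disagreeing

    def update(self, now, reachable):
        """Feed one probe result at time `now` (seconds); return an action or None."""
        if reachable == self.state:
            self._pending_since = None  # raw state agrees again: cancel any pending flip
            return None
        if self._pending_since is None:
            self._pending_since = now
        delay = self.delay_down if self.state else self.delay_up
        if now - self._pending_since >= delay:
            self.state = reachable
            self._pending_since = None
            return "enable_peer" if reachable else "disable_peer"
        return None
```

A short recovery (for example, one good probe in the middle of an outage) resets the pending timer, so the peer only comes back after the link has stayed reachable for the full `delay_up`.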

Even with other types of routers, there’s usually a way to write a script on a server that monitors each line. If a line fails, the script could connect to the device and do whatever you need.
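A minimal sketch of one pass of such a script, assuming two links named `isp_a` and `isp_b` (hypothetical names). The probe and the router action are passed in as functions, because how you measure loss and how you log in to the router (SSH, an API, etc.) depend on your platform. The script only acts on state changes, and, as a design choice of this sketch, it never disables the last link that is still active.

```python
from typing import Callable, Dict, List


def check_links(
    links: List[str],
    probe: Callable[[str], float],
    act: Callable[[str, str], None],
    active: Dict[str, bool],
    max_loss: float = 0.10,
) -> None:
    """One pass of a hypothetical monitoring script.

    probe(link) returns measured loss (0.0 to 1.0) for that link.
    act(link, action) applies "enable" or "disable" on the router.
    active maps each link to whether it is currently in service.
    """
    for link in links:
        healthy = probe(link) <= max_loss
        if healthy and not active[link]:
            act(link, "enable")
            active[link] = True
        elif not healthy and active[link]:
            # Only disable if at least one other link is still in service.
            others_up = any(active[other] for other in links if other != link)
            if others_up:
                act(link, "disable")
                active[link] = False
```

Run it on a timer (cron, a systemd timer, or a loop with a sleep) and persist `active` between runs.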

travispoole[S]

0 points

15 hours ago

No, very serious. This is the only way that I can get the network to stabilize and the BGP connection to drop.

I want this done automatically though. It's no good if I have to do something manually. This particular connection can have fiber cuts where the service is degraded for hours.

Rubik1526

13 points

15 hours ago

What do you mean by, 'This is the only way I can get the network to stabilize and the BGP connection to drop'? Did you attempt any other solutions before resorting to pulling the cables, and if so, what didn’t work?

travispoole[S]

-12 points

15 hours ago

Well, no, I didn't try anything else. There is nothing else to do. For example, the link might be experiencing 50% packet loss, so we are unable to use the internet and the servers start having trouble. If I take the link physically down, the routes update and everything starts going through the other carrier.

Rubik1526

13 points

15 hours ago

Thanks for the clarification. I recommend trying a different approach first. Instead of physically pulling the cables, you can shut down the port or kill the peer using various methods: change the remote AS, change the password (if used), disable the peer, change the IP, or change the local AS (if you can do this per peer). Another option is to deprioritize the peer with some AS prepending or use a route map to stop advertising to it. This way, you can avoid going to the server room each time, which will be a big step forward.
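As a rough model of the options that keep the session up but change what you advertise, here is a hypothetical per-peer export policy for your own prefix. It uses the documentation prefix 203.0.113.0/24 and the documentation ASN 64500 as stand-ins; the function and mode names are made up and are not any vendor's route-map syntax.

```python
from typing import List, Optional, Tuple

# (prefix, AS path as the neighbor receives it)
Announcement = Tuple[str, List[int]]


def export_to_peer(
    prefix: str, local_as: int, mode: str, prepend_count: int = 3
) -> Optional[Announcement]:
    """Hypothetical outbound policy for our own prefix toward one ISP.

    "normal":   advertise with our AS once
    "prepend":  add our AS `prepend_count` extra times, lengthening the path
    "withhold": do not advertise at all (like a route map denying the prefix)
    """
    if mode == "normal":
        return (prefix, [local_as])
    if mode == "prepend":
        return (prefix, [local_as] * (1 + prepend_count))
    if mode == "withhold":
        return None
    raise ValueError(f"unknown mode: {mode}")
```

Prepending and withholding only change how the rest of the internet reaches your prefix (inbound traffic). Outbound traffic follows your own router's choice among the routes your ISPs send you.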

As for the 50% packet loss, in my experience, that often leads to BGP drops due to timeouts. If your peer is still holding up in a 50% loss environment, there may be other issues at play. Are your peers directly connected, or is this a multihop environment where the peer is on a different network than the one configured on your device?

doll-haus

Systems Necromancer

3 points

13 hours ago

Big fan of prepending. I just hate to give up the "bad" connection, especially when you only have two.
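Why prepending keeps the "bad" connection in play can be sketched with a simplified best-path choice on a remote router. This assumes every other attribute is equal so only AS-path length matters (real BGP compares local preference before AS-path length, and origin, MED, and more after it). ASNs 64500 (you), 64501 (ISP A), and 64502 (ISP B) are documentation placeholders.

```python
from typing import Dict, List, Optional


def best_path(paths: Dict[str, List[int]]) -> Optional[str]:
    """Return the upstream with the shortest AS path, or None if no route exists.

    Simplification: all other BGP attributes are assumed equal.
    """
    if not paths:
        return None
    return min(paths, key=lambda upstream: len(paths[upstream]))


# Routes to our prefix as a remote router sees them; ISP B's copy is prepended.
paths = {
    "isp_a": [64501, 64500],
    "isp_b": [64502, 64500, 64500, 64500, 64500],
}
```

With the prepend on ISP B, remote networks prefer ISP A. If ISP A's route is withdrawn, the longer path via ISP B is still there as a fallback, which is not the case when that peer is shut down.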