BGP Issue - Peer flapping

  • 1.  BGP Issue - Peer flapping

    Posted 01-04-2018 14:01

    Dear all,

    I have a problem with a BGP peering between an SRX220 and an MX10. At least once every two hours the peer goes down due to a "Hold Timer Expired Error". I don't know how to identify the event that causes this periodic interruption; I don't see link flaps or packet loss.

     

    Below are the log messages and the relevant configuration from both devices.

     

    SRX Config:

    set interfaces ge-0/0/0 description TRUNK
    set interfaces ge-0/0/0 unit 0 family ethernet-switching port-mode trunk
    set interfaces ge-0/0/0 unit 0 family ethernet-switching vlan members VLAN-551
    
    set interfaces vlan unit 551 family inet address 10.0.5.11/29
    
    set vlans VLAN-551 vlan-id 551
    set vlans VLAN-551 l3-interface vlan.551
    
    set routing-instances VRF-VLAN551 instance-type virtual-router
    set routing-instances VRF-VLAN551 interface vlan.551
    
    set routing-instances VRF-VLAN551 protocols bgp family inet unicast
    set routing-instances VRF-VLAN551 protocols bgp local-as 3597
    set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 description PEER-VLAN551
    set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 local-address 10.0.5.11
    set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 import rm-import
    set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 export rm-export
    set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 peer-as 3597

     

    MX10 Config:

    set interfaces ae0 unit 551 description VLAN551
    set interfaces ae0 unit 551 vlan-id 551
    set interfaces ae0 unit 551 family inet address 10.0.5.9/29
    set interfaces ae0 unit 551 family iso
    
    set protocols bgp group GIOL-VRFINST type internal
    set protocols bgp group GIOL-VRFINST family inet unicast
    set protocols bgp group GIOL-VRFINST cluster 5.5.5.5
    set protocols bgp group GIOL-VRFINST peer-as 3597
    set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 local-address 10.0.5.9
    set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 import rm-import
    set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 export rm-export

    Example of the log messages during an event; this happens once every 2 or 3 hours.

     

    MX10

    Jan  4 18:21:59.954957 bgp_hold_timeout:4055: NOTIFICATION sent to 10.0.5.11 (Internal AS 3597): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.0.5.11 (Internal AS 3597), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 3256096123 snd_nxt: 3256096180 snd_wnd: 16384 rcv_nxt: 3443979671 rcv_adv: 3443996055, hold timer out 90s, hold timer remain 0s
    Jan  4 18:21:59.955057 bgp_peer_close: closing peer 10.0.5.11 (Internal AS 3597), state is 7 (Established)
    Jan  4 18:21:59.955107 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Established event HoldTime new state Idle
    Jan  4 18:22:00.172348 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Idle event Start new state Active
    Jan  4 18:22:32.173849 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Active event ConnectRetry new state Connect
    Jan  4 18:23:47.173435 bgp_connect_complete: error connecting to 10.0.5.11 (Internal AS 3597): Socket is not connected
    Jan  4 18:23:47.173584 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Connect event OpenFail new state Idle
    Jan  4 18:23:47.173966 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Idle event Start new state Connect
    Jan  4 18:23:47.173996 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Connect event ConnectRetry new state Connect
    Jan  4 18:24:56.423226 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state Connect event Open new state OpenSent
    Jan  4 18:24:56.424201 advertising graceful restart receiving-speaker-only capability to neighbor 10.0.5.11 (Internal AS 3597)
    Jan  4 18:24:56.430079 advertising graceful restart receiving-speaker-only capability to neighbor 10.0.5.11 (Internal AS 3597)
    Jan  4 18:24:56.430136
    Jan  4 18:24:56.430136 BGP SEND 10.0.5.9+179 -> 10.0.5.11+63774
    Jan  4 18:24:56.430176 BGP SEND message type 1 (Open) length 59
    Jan  4 18:24:56.430202 BGP SEND version 4 as 3597 holdtime 90 id 168.96.6.10 parmlen 30
    Jan  4 18:24:56.430310 BGP SEND MP capability AFI=1, SAFI=1
    Jan  4 18:24:56.430334 BGP SEND Refresh capability, code=128
    Jan  4 18:24:56.430354 BGP SEND Refresh capability, code=2
    Jan  4 18:24:56.430377 BGP SEND Restart capability, code=64, time=120, flags=
    Jan  4 18:24:56.430400 BGP SEND 4 Byte AS-Path capability (65), as_num 3597
    Jan  4 18:24:56.430433
    Jan  4 18:24:56.430433 BGP SEND 10.0.5.9+179 -> 10.0.5.11+63774
    Jan  4 18:24:56.430470 BGP SEND message type 3 (Notification) length 21
    Jan  4 18:24:56.430492 BGP SEND Notification code 6 (Cease) subcode 7 (Connection collision resolution)
    Jan  4 18:24:56.443023 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state OpenSent event RecvOpen new state OpenConfirm
    Jan  4 18:24:56.443137 bgp_read_message: 10.0.5.11 (Internal AS 3597): 0 bytes buffered
    Jan  4 18:24:56.456538 bgp_event: peer 10.0.5.11 (Internal AS 3597) old state OpenConfirm event RecvKeepAlive new state Established

    SRX

    Jan  4 18:21:59.970738 bgp_read_v4_message:10642: NOTIFICATION received from 10.0.5.9 (Internal AS 3597): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 5, snd_una: 3443979671 snd_nxt: 3443979728 snd_wnd: 16384 rcv_nxt: 3256096202 rcv_adv: 3256112565, hold timer out 90s, hold timer remain 41.093555s
    Jan  4 18:21:59.971008 bgp_peer_close: closing peer 10.0.5.9 (Internal AS 3597), state is 7 (Established)
    Jan  4 18:21:59.971247 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Established event RecvNotify new state Idle
    Jan  4 18:21:59.976374 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Idle event Start new state Active
    Jan  4 18:22:31.973760 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Active event ConnectRetry new state Connect
    Jan  4 18:23:46.973987 bgp_connect_complete: error connecting to 10.0.5.9 (Internal AS 3597): Socket is not connected
    Jan  4 18:23:46.974366 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Connect event OpenFail new state Idle
    Jan  4 18:23:46.977161 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Idle event Start new state Connect
    Jan  4 18:23:46.977342 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Connect event ConnectRetry new state Connect
    Jan  4 18:24:56.438429 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Connect event Open new state OpenSent
    Jan  4 18:24:56.438722 advertising graceful restart receiving-speaker-only capability to neighbor 10.0.5.9 (Internal AS 3597)
    Jan  4 18:24:56.443393 bgp_pp_recv:3396: NOTIFICATION sent to 10.0.5.9 (Internal AS 3597): code 6 (Cease) subcode 7 (Connection collision resolution), Reason: dropping 10.0.5.9 (Internal AS 3597), connection collision prefers 10.0.5.9+51507 (proto)
    Jan  4 18:24:56.443984 bgp_peer_close: closing peer 10.0.5.9 (Internal AS 3597), state is 4 (OpenSent)
    Jan  4 18:24:56.444535 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state OpenSent event Stop new state Idle
    Jan  4 18:24:56.445281 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Idle event Start new state Active
    Jan  4 18:24:56.448441 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state Active event Open new state OpenSent
    Jan  4 18:24:56.448638 advertising graceful restart receiving-speaker-only capability to neighbor 10.0.5.9 (Internal AS 3597)
    Jan  4 18:24:56.449074 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state OpenSent event RecvOpen new state OpenConfirm
    Jan  4 18:24:56.460226 bgp_event: peer 10.0.5.9 (Internal AS 3597) old state OpenConfirm event RecvKeepAlive new state Established

    I need help understanding these log messages to determine whether the problem is in the BGP config. I've read about the Cease and connection collision events, but I don't see why they occur with the current config; I have similar configurations between the same devices (on another VLAN) with no problems.

     

    Regards!

     



  • 2.  RE: BGP Issue - Peer flapping

     
    Posted 01-05-2018 03:02

    The configuration looks correct.

     

    I would continue to dig on the link itself.  

    Are there incrementing errors?

    Confirm that the MTU matches.

     

    I am wondering why there is an AE on the MX and just a ge on the SRX.  This is just a single link between two devices?

    You are not creating an AE on the MX then connecting the physical interfaces to multiple other units?

     

    You also could add BFD to the link to get a better picture on the quality of the connection.
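
    A minimal sketch of what BFD on this session could look like, reusing the group and neighbor names from the configuration in the first post. The 1000 ms interval and the multiplier of 3 are assumed starting values, not tuned recommendations, and BFD has to be configured on both ends before the BFD session comes up.

```
# SRX (inside the routing instance)
set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 bfd-liveness-detection minimum-interval 1000
set routing-instances VRF-VLAN551 protocols bgp group CORE neighbor 10.0.5.9 bfd-liveness-detection multiplier 3

# MX
set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 bfd-liveness-detection minimum-interval 1000
set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 bfd-liveness-detection multiplier 3

# After commit, from operational mode on either device
show bfd session detail
```

    With these values BFD declares the path down after roughly 3 seconds without packets, so the BFD session history should give a much tighter timestamp for an outage than the 90-second BGP hold timer does.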

     



  • 3.  RE: BGP Issue - Peer flapping

    Posted 01-05-2018 08:25

    Check your MTU on both interfaces.

    Try jumbo frames.
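
    One way to check the path MTU end to end, sketched with the addresses and routing instance from the first post: send pings with the do-not-fragment bit set. A size of 1472 is the ICMP payload that fills a 1500-byte IP packet (1472 + 8 bytes of ICMP header + 20 bytes of IP header). The count values are arbitrary.

```
# From the SRX: this size fits in the default 1500-byte protocol MTU
ping 10.0.5.9 routing-instance VRF-VLAN551 source 10.0.5.11 size 1472 do-not-fragment count 20

# One byte more is expected to fail while the inet MTU stays at 1500
ping 10.0.5.9 routing-instance VRF-VLAN551 source 10.0.5.11 size 1473 do-not-fragment count 5

# Longer run to look for intermittent loss on the provider path
ping 10.0.5.9 routing-instance VRF-VLAN551 source 10.0.5.11 rapid count 1000
```

    If the 1472-byte pings are lost while small ones pass, something in the path is carrying less than 1500 bytes of IP.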



  • 4.  RE: BGP Issue - Peer flapping

     
    Posted 01-05-2018 10:51

    Hi Folks,

    Just my 2 cents on this…

     

    Can they add this and see whether it keeps BGP up when it happens again, just in case?

     

    # set protocols bgp precision-timers

     

    Also capture a CPU snapshot every second for 24 hours and share it, so we can look for any resource crunch on the box in terms of CPU cycles:

     

    start shell

    top -s 1 -d 86400 -n 100 >> /var/tmp/top.txt &



  • 5.  RE: BGP Issue - Peer flapping

    Posted 01-05-2018 12:50

    Dear all, thank you very much for your replies.

    Here are some more details about the topology; I have attached an image.

     

    juni-forum-1.png

     

    As you can see, there is an ae on the MX side because it is connected to a stack of EX3300s, so the connection between the MX and the SRX is not direct: in the middle there are two devices under our administration (an EX3300 stack and a standalone EX3300) and two QinQ links contracted from two providers. The ports on my EX3300s connected to each provider are simply dot1q trunks; the QinQ encapsulation is transparent to me. These ports on the EX3300s do not carry the same VLANs, so I don't see problems related to STP or L2 loops.

     

    The case I described before relates to VLAN 551, but on VLAN 553, shown in the attached image, I have a similar configuration between the same devices (another BGP peer, same configuration over another subnet), and it works fine; the session has been up for 4 weeks. So there may be a problem related to "provider 1".

     

    Regarding MTU, are you referring to the media MTU or the protocol MTU? Would it be advisable to increase the media MTU on all interfaces involved in this path? I know that both providers support jumbo frames. However, I don't see errors on the SRX interfaces; how can I troubleshoot MTU issues more accurately?

     

    SRX - interface ge-0/0/0:

     

    admin@"SRX"# run show interfaces ge-0/0/0 extensive    
    Physical interface: ge-0/0/0, Enabled, Physical link is Up
      Interface index: 134, SNMP ifIndex: 508, Generation: 137
      Description: TRUNK-"PROVIDER1"
      Link-level type: Ethernet, MTU: 1514, Link-mode: Full-duplex, Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled,
      Auto-negotiation: Enabled, Remote fault: Online
      Device flags   : Present Running
      Interface flags: SNMP-Traps Internal: 0x0
      Link flags     : None
      CoS queues     : 8 supported, 8 maximum usable queues
      Hold-times     : Up 0 ms, Down 0 ms
      Current address: f4:b5:2f:89:3b:bb, Hardware address: f4:b5:2f:89:3b:bb
      Last flapped   : 2017-11-15 23:57:51 ART (7w1d 17:20 ago)
      Statistics last cleared: 2017-11-28 01:46:09 ART (5w3d 15:32 ago)
      Traffic statistics:
       Input  bytes  :        8511109371495              5197400 bps
       Output bytes  :        5757191123681             10281808 bps
       Input  packets:           1645955461                 1423 pps
       Output packets:             54680961                  902 pps
      Input errors:
        Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0, FIFO errors: 0, Resource errors: 0
      Output errors:
        Carrier transitions: 0, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0, Resource errors: 0
      Egress queues: 8 supported, 4 in use
      Queue counters:       Queued packets  Transmitted packets      Dropped packets
        0 best-effort                    0                    0                    0
        1 expedited-fo                   0                    0                    0
        2 assured-forw                   0                    0                    0
        3 network-cont             1769399              1769399                    0
      Queue number:         Mapped forwarding classes
        0                   best-effort
        1                   expedited-forwarding
        2                   assured-forwarding
        3                   network-control
      Active alarms  : None
      Active defects : None
      MAC statistics:                      Receive         Transmit
        Total octets                 8511109371495    5757097345534
        Total packets                   1645955461         52362499
        Unicast packets                 1642723238         50037184
        Broadcast packets                  1363152             4595
        Multicast packets                  1869071          1769399
        CRC/Align errors                         0                0
        FIFO errors                              0                0
        MAC control frames                       0                0
        MAC pause frames                         0                0
        Oversized frames                         0
        Jabber frames                            0
        Fragment frames                          0
        VLAN tagged frames                       0
        Code violations                          0
      Filter statistics:
        Input packet count                       0
        Input packet rejects                     0
        Input DA rejects                         0
        Input SA rejects                         0
        Output packet count                                       0
        Output packet pad count                                   0
        Output packet error count                                 0
        CAM destination filters: 1, CAM source filters: 0
      Autonegotiation information:
        Negotiation status: Complete
        Link partner:
            Link mode: Full-duplex, Flow control: None, Remote fault: OK, Link partner Speed: 1000 Mbps
        Local resolution:
            Flow control: None, Remote fault: Link OK
      Packet Forwarding Engine configuration:
        Destination slot: 0
      CoS information:
        Direction : Output
        CoS transmit queue               Bandwidth               Buffer Priority   Limit
                                  %            bps     %           usec
        0 best-effort            95      950000000    95              0      low    none
        3 network-control         5       50000000     5              0      low    none
      Interface transmit statistics: Disabled
    
      Logical interface ge-0/0/0.0 (Index 79) (SNMP ifIndex 512) (Generation 144)
        Flags: SNMP-Traps 0x0 Encapsulation: ENET2
        Traffic statistics:
         Input  bytes  :        8511109371495
         Output bytes  :        5757226511661
         Input  packets:           1645955461
         Output packets:             54131898
        Local statistics:
         Input  bytes  :                    0
         Output bytes  :            129166127
         Input  packets:                    0
         Output packets:              1769399
        Transit statistics:
         Input  bytes  :        8511109371495              5197400 bps
         Output bytes  :        5757097345534             10281600 bps
         Input  packets:           1645955461                 1423 pps
         Output packets:             52362499                  902 pps
        Security: Zone: Null
        Flow Statistics :  
        Flow Input statistics :
          Self packets :                     0
          ICMP packets :                     0
          VPN packets :                      0
          Multicast packets :                0
          Bytes permitted by policy :        0
          Connections established :          0 
        Flow Output statistics: 
          Multicast packets :                0
          Bytes permitted by policy :        0 
        Flow error statistics (Packets dropped due to): 
          Address spoofing:                  0
          Authentication failed:             0
          Incoming NAT errors:               0
          Invalid zone received packet:      0
          Multiple user authentications:     0 
          Multiple incoming NAT:             0
          No parent for a gate:              0
          No one interested in self packets: 0       
          No minor session:                  0 
          No more sessions:                  0
          No NAT gate:                       0 
          No route present:                  0 
          No SA for incoming SPI:            0 
          No tunnel found:                   0
          No session for a gate:             0 
          No zone or NULL zone binding       0
          Policy denied:                     0
          Security association not active:   0 
          TCP sequence number out of window: 0
          Syn-attack protection:             0
          User authentication errors:        0
        Protocol eth-switch, MTU: 0, Generation: 158, Route table: 0
          Flags: Is-Primary, Trunk-Mode
    

     

     

    SRX - interface vlan.551

     

    admin@"SRX"# run show interfaces vlan.551 extensive    
      Logical interface vlan.551 (Index 91) (SNMP ifIndex 535) (Generation 158)
        Flags: SNMP-Traps 0x0 VLAN-Tag [ 0x8100.551 ]  Encapsulation: ENET2
        Bandwidth: 0
        Traffic statistics:
         Input  bytes  :        6835490047251
         Output bytes  :        4469740006029
         Input  packets:           8153298551
         Output packets:          10030630940
        Local statistics:
         Input  bytes  :             15745185
         Output bytes  :             25891389
         Input  packets:               271821
         Output packets:               338622
        Transit statistics:
         Input  bytes  :        6835474302066                    0 bps
         Output bytes  :        4469714114640                    0 bps
         Input  packets:           8153026730                    0 pps
         Output packets:          10030292318                    0 pps
        Security: Zone: Null
        Flow Statistics :  
        Flow Input statistics :
          Self packets :                     0
          ICMP packets :                     0
          VPN packets :                      0
          Multicast packets :                0
          Bytes permitted by policy :        0
          Connections established :          0 
        Flow Output statistics: 
          Multicast packets :                0
          Bytes permitted by policy :        0 
        Flow error statistics (Packets dropped due to): 
          Address spoofing:                  0
          Authentication failed:             0
          Incoming NAT errors:               0
          Invalid zone received packet:      0
          Multiple user authentications:     0 
          Multiple incoming NAT:             0
          No parent for a gate:              0
          No one interested in self packets: 0       
          No minor session:                  0 
          No more sessions:                  0
          No NAT gate:                       0 
          No route present:                  0 
          No SA for incoming SPI:            0 
          No tunnel found:                   0
          No session for a gate:             0 
          No zone or NULL zone binding       0
          Policy denied:                     0
          Security association not active:   0 
          TCP sequence number out of window: 0
          Syn-attack protection:             0
          User authentication errors:        0
        Protocol inet, MTU: 1500, Generation: 175, Route table: 5
          Flags: Sendbcast-pkt-to-re, Is-Primary
          Addresses, Flags: Is-Default Is-Preferred Is-Primary
            Destination: 10.0.5.8/29, Local: 10.0.5.11, Broadcast: 10.0.5.15, Generation: 180
    

     

    I will try to implement BFD; it could help to see more details.

     

    Again, thank you for your time.

    Regards.

     



  • 6.  RE: BGP Issue - Peer flapping

     
    Posted 01-05-2018 20:02

    Hi Folks,

    Please enable BGP traceoptions on both the MX and the SRX, and share the output after the next flap.

     

    To trace BGP protocol traffic, include the traceoptions statement at the [edit protocols bgp] hierarchy level. For routing instances, include the traceoptions statement at the [edit routing-instances routing-instance-name protocols bgp] hierarchy level.

     

    traceoptions {
        file filename <files number> <size size> <world-readable | no-world-readable>;
        flag flag <flag-modifier> <disable>;
    }

    You can specify the following BGP protocol-specific trace options using the flag statement:

     

    keepalive—BGP keepalive messages.

    packets—All BGP protocol packets.

    refresh—BGP refresh packets.

    update—BGP update packets. These packets provide routing updates to BGP systems.
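
    As a sketch, the generic syntax above could be filled in for this peering as follows. The file name bgp-vlan551 and the size and file counts are arbitrary choices, not values from the thread; the group, neighbor and instance names come from the configuration in the first post.

```
# SRX: trace inside the routing instance
set routing-instances VRF-VLAN551 protocols bgp traceoptions file bgp-vlan551 size 5m files 5
set routing-instances VRF-VLAN551 protocols bgp traceoptions flag keepalive detail
set routing-instances VRF-VLAN551 protocols bgp traceoptions flag state

# MX: trace only the affected neighbor
set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 traceoptions file bgp-vlan551 size 5m files 5
set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 traceoptions flag keepalive detail
set protocols bgp group GIOL-VRFINST neighbor 10.0.5.11 traceoptions flag state

# After the next flap, from operational mode
show log bgp-vlan551 | match "KeepAlive|HoldTime|NOTIFICATION"
```

    Comparing the keepalive send and receive timestamps on both ends should show in which direction the keepalives stop arriving before the hold timer expires.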

     

     



  • 7.  RE: BGP Issue - Peer flapping

     
    Posted 01-06-2018 10:50

    Thanks for the diagram; it makes this a lot easier to visualize.

     

    The two major reasons I see for this type of error are mtu mismatch or physical interface issues.  Either could occur at any interface along the path.

     

    You are fortunate to have a working and non-working peer that share a lot of the same interfaces so I agree we can narrow this down to the provider link and those associated interfaces.

     

    MTU should be good since you have this at the default on all devices so we don't need to worry about mismatches.

     

    On the SRX, ge-0/0/0 looks clean.  I generally look at the extensive interface output, matching on error, flap and carrier, to look for errors, flaps and carrier transitions.  These all look clean there.  Please check this on the EX3300 ge-0/0/0 as well.

    show interfaces ge-0/0/0 extensive | match "flap|error|carrier"

     

    If this also comes back clean, I would open a ticket with the carrier to test the circuit.

     

     

     



  • 8.  RE: BGP Issue - Peer flapping

    Posted 07-24-2018 13:51

    Have you tried disabling seq-check on the SRX? I have had an issue where BGP comes up but goes down after the hold time expires. I changed the hold timer to 200s, and then my BGP neighbourship went down after 200s instead of 90s. After a lot of troubleshooting, articles and logs, I found out that packets were being dropped because of the SYN check. I disabled it and there has been no more BGP flapping.

     

    https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/security-edit-no-sequence-check.html

     

    Here's what I have added:

    set security flow tcp-session no-syn-check
    set security flow tcp-session no-sequence-check
    set security flow tcp-session tcp-initial-timeout 4
    set security flow tcp-session time-wait-state session-ageout
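
    Before relaxing these checks, it may be worth confirming that the flow module is actually handling the BGP session. The read-only commands below are a sketch of how to check that: the first shows the forwarding mode, and the other two look for a flow session on TCP port 179 (BGP) in either direction. Whether this applies to the original poster's SRX is an assumption, since the vlan.551 output earlier in the thread shows "Security: Zone: Null".

```
show security flow status
show security flow session destination-port 179
show security flow session source-port 179
```

    If no session for the peer address shows up, the TCP checks above are unlikely to be what drops the keepalives.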