Communication issue btw SRX1500 and multiple vSRX's

  • 1.  Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-21-2019 08:44

    We have 3 vSRXs and 2 clusters of SRX1500s, and the vSRXs are having issues communicating with their gateway (a reth interface on one SRX1500 cluster) using the fxp0 interface that is within the mgmt_junos routing instance. It doesn't matter where the test ping is sourced from.

     

    They can communicate with the other SRX1500 cluster with no issues and anything else on the same network.

     

    Layer2 ARP resolves the correct address on both sides. 

     

    tcpdump from the vSRX shows the traffic going out with no response. tcpdump on the SRX1500 reth interface shows nothing coming in... only the ARP request and reply.

     

    Attaching a diagram of the topology to clarify: vsrx-to-1500-diagram.png

     

    **updated with correct ip of vSRX in diagram



  • 2.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-21-2019 09:52

    Hi, based on the diagram, reth1.48 on SRX-1500-FW-01 has the same address as the fxp0 on the vSRX. Is this correct?

     



  • 3.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-21-2019 10:51

    sorry... goof on my part. Copy/Paste issues... the vSRX fxp0 interface is 10.2.48.60

     

    I have updated the original post with the correct image



  • 4.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-21-2019 11:47

    Thanks for the confirmation. Can you take a packet capture on the reth interface of the SRX 1500-FW-01?

     

    https://kb.juniper.net/InfoCenter/index?page=content&id=KB11709

     

    I believe the tcpdump works in the same way as the "monitor traffic interface" command and in this case you might not see ICMP traffic: https://rtodto.net/monitor-traffic-doesnt-show-any-icmp-traffic/

     

    Either try with different type of traffic or take the packet capture.



  • 5.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-21-2019 12:45

    Didn't know that. Good to know about ICMP.

     

    I set up the pcap just like the KB, but it never generates the file. Instead I just did another tcpdump with SSH traffic and I get the same results. I can see the traffic leaving fxp0 on the vSRX, but the SRX1500 reth1.48 interface never receives it.

     

    VSRX tcpdump
    
    
    root@vSRX:/var/home/derek # tcpdump -i fxp0 host 10.2.48.1
    verbose output suppressed, use <detail> or <extensive> for full protocol decode
    Address resolution is ON. Use <no-resolve> to avoid any reverse lookup delay.
    Address resolution timeout is 4s.
    Listening on fxp0, capture size 96 bytes
    
    Reverse lookup for 10.2.48.60 failed (check DNS reachability).
    Other reverse lookup failures will not be reported.
    Use <no-resolve> to avoid reverse lookups on IP addresses.
    
    19:37:51.400258 Out IP truncated-ip - 4 bytes missing! 10.2.48.60.53269 > 10.2.48.1.ssh: S 2745767260:2745767260(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 2163394968 0,[|tcp]>
    19:37:54.401375 Out IP truncated-ip - 4 bytes missing! 10.2.48.60.53269 > 10.2.48.1.ssh: S 2745767260:2745767260(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 2163397969 0,[|tcp]>
    19:37:57.604031 Out IP truncated-ip - 4 bytes missing! 10.2.48.60.53269 > 10.2.48.1.ssh: S 2745767260:2745767260(0) win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 2163401172 0,[|tcp]>
    .........
    138 packets received by filter
    0 packets dropped by kernel
    
    
    
    
    
    SRX1500 tcpdump
    
    
    root@SRX1500-FW-01:/var/home/derek # tcpdump -i reth1.48 host 10.2.48.60
    verbose output suppressed, use <detail> or <extensive> for full protocol decode
    Address resolution is ON. Use <no-resolve> to avoid any reverse lookup delay.
    Address resolution timeout is 4s.
    Listening on reth1.48, capture size 96 bytes
    
    Reverse lookup for 10.2.48.40 failed (check DNS reachability).
    Other reverse lookup failures will not be reported.
    Use <no-resolve> to avoid reverse lookups on IP addresses.
    
    19:37:47.082032  In arp who-has 10.2.48.40 tell 10.2.48.60
    19:37:47.082043  In arp who-has 10.2.48.40 tell 10.2.48.60
    ^C19:41:21.199939  In 
    272 packets received by filter
    0 packets dropped by kernel

     



  • 6.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-21-2019 13:00

    Are you able to perform a port-mirror on the switch interfaces facing the SRX1500? With the data we have now, it looks like the packets are sent by the vSRX but we are unsure if they are received by the SRX1500. The port mirror on the switch will confirm this.

     

    Check for any error on the reth interface of the SRX1500:

     

    > show interfaces extensive

     



  • 7.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-21-2019 13:24

    Wouldn't the SRX1500 reth interface receiving traffic from other devices on that network confirm that the underlying L2 connectivity is good? The same goes for the vSRX and any of the other devices it can communicate with.

     

    I checked the reth interface on the 1500 ...

     

     derek@SRX1500-FW-01> show interfaces reth1 extensive 
    Physical interface: reth1, Enabled, Physical link is Up
      Interface index: 129, SNMP ifIndex: 581, Generation: 132
      Description: Trust Interfaces - xe-0/0/17 & xe-7/0/17
      Link-level type: Ethernet, MTU: 1518, Speed: 10Gbps, BPDU Error: None, Ethernet-Switching Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Minimum links needed: 1,
      Minimum bandwidth needed: 1bps
      Device flags   : Present Running
      Interface flags: SNMP-Traps Internal: 0x4000
      Current address: 00:10:db:ff:10:01, Hardware address: 00:10:db:ff:10:01
      Last flapped   : 2018-09-05 13:53:06 UTC (50w0d 06:24 ago)
      Statistics last cleared: Never
      Traffic statistics:
       Input  bytes  :       10290601298180              3962784 bps
       Output bytes  :       19833025646885              7397984 bps
       Input  packets:          22992581469                 1122 pps
       Output packets:          25814933197                 1200 pps
      Dropped traffic statistics due to STP State:
       Input  bytes  :                    0
       Output bytes  :                    0
       Input  packets:                    0
       Output packets:                    0
      MAC statistics:                      Receive         Transmit
        Broadcast packets                        0                0
        Multicast packets                        0                0
      Input errors:
        Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Giants: 0, Policed discards: 0, Resource errors: 0
      Output errors:
        Carrier transitions: 0, Errors: 0, Drops: 0, MTU errors: 0, Resource errors: 0
      Ingress queues: 8 supported, 4 in use
      Queue counters:       Queued packets  Transmitted packets      Dropped packets
        0                                0                    0                    0
        1                                0                    0                    0
        2                                0                    0                    0
        3                                0                    0                    0
      Egress queues: 8 supported, 4 in use
      Queue counters:       Queued packets  Transmitted packets      Dropped packets
        0                         41763063             41763063                    0
        1                                0                    0                    0
        2                                0                    0                    0
        3                          2230645              2230645                    0
      Queue number:         Mapped forwarding classes
        0                   best-effort
        1                   expedited-forwarding
        2                   assured-forwarding
        3                   network-control
    
     Logical interface reth1.48 (Index 70) (SNMP ifIndex 593) (Generation 171)
        Flags: Up SNMP-Traps 0x4000 VLAN-Tag [ 0x8100.48 ]  Encapsulation: ENET2
        Statistics        Packets        pps         Bytes          bps
        Bundle:
            Input :     204809093          0   67079691724          536
            Output:     581838493          0  198983752937            0
        Adaptive Statistics:
            Adaptive Adjusts:          0
            Adaptive Scans  :          0
            Adaptive Updates:          0
        Link:
          xe-0/0/17.48
            Input :     187715561          0   64908877270          264
            Output:     581836782          0  198981945970            0
          xe-7/0/17.48
            Input :      17093532          0    2170814454          272
            Output:          1906          0       1817001            0
        Marker Statistics:   Marker Rx     Resp Tx   Unknown Rx   Illegal Rx
          xe-0/0/17.48               0           0            0            0
          xe-7/0/17.48               0           0            0            0
        Security: Zone: internal-juniper
        Allowed host-inbound traffic : https ping snmp ssh traceroute
        Flow Statistics :  
        Flow Input statistics :
          Self packets :                     668399
          ICMP packets :                     2389843
          VPN packets :                      0
          Multicast packets :                0
          Bytes permitted by policy :        62661552343
          Connections established :          4008625 
        Flow Output statistics: 
          Multicast packets :                0
          Bytes permitted by policy :        198977390027 
        Flow error statistics (Packets dropped due to): 
          Address spoofing:                  0
          Authentication failed:             0
          Incoming NAT errors:               0
          Invalid zone received packet:      0
          Multiple user authentications:     0 
          Multiple incoming NAT:             0
          No parent for a gate:              0
          No one interested in self packets: 0       
          No minor session:                  0 
          No more sessions:                  0
          No NAT gate:                       0 
          No route present:                  1 
          No SA for incoming SPI:            0 
          No tunnel found:                   0
          No session for a gate:             0 
          No zone or NULL zone binding       0
          Policy denied:                     1224074
          Security association not active:   0 
          TCP sequence number out of window: 3
          Syn-attack protection:             0
          User authentication errors:        0
        Protocol inet, MTU: 1500
        Max nh cache: 100000, New hold nh limit: 100000, Curr nh cnt: 13, Curr new hold cnt: 0, NH drop cnt: 0
        Generation: 189, Route table: 6
          Flags: None
          Addresses, Flags: Is-Preferred Is-Primary
            Destination: 10.2.48/24, Local: 10.2.48.1, Broadcast: 10.2.48.255, Generation: 166

     

     



  • 8.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-21-2019 13:40

    Wouldn't the SRX1500 reth interface receiving traffic from other devices on that network confirm that the underlying L2 connectivity is good?

     

    R/ Yes, you are right, but it doesn't prove that the switch actually forwarded those specific packets from the vSRX towards the SRX1500. Also, it is important to note that we see ARP requests from the vSRX on the reth, which means there is some sort of communication between them; however, these ARP requests are non-IP packets, and that may be a difference between the traffic that works and the vSRX traffic that doesn't. Anyway, performing the port-mirror will confirm whether the switch indeed sent the packets.

     

    Also, I can see that the reth has xe-0/0/17 & xe-7/0/17. The switch interfaces connecting to them shouldn't be configured as a bundle. Can you confirm this is not the case?

     

    Is it possible, as a test, to configure a standalone port instead of the reth interface? I'm trying to isolate a problem with the reth configuration itself, and that test could prove it.

     

    Please also share the "show interfaces extensive" command from xe-0/0/17 & xe-7/0/17 to check for any errors at layer 2.

     



  • 9.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-22-2019 10:22

    mrojas thanks for the help so far it is greatly appreciated

     

    performing the port-mirror will confirm that the switch indeed sent the packets.

     

    I worked with my internal network team this morning and we could not do a port-mirror. Because this is a remote location and we didn't have any non-configured cabled ports, we were unable to get it working (Cisco Nexus 5k with FEX). The FEX created some issues when trying to configure a SPAN, which forced us to abandon it.

     

    Also, I can see that the reth has xe-0/0/17 & xe-7/0/17. The switch interfaces connecting to them shouldn't be configured as a bundle. Can you confirm this is not the case?

     

    They are not configured in a bundle. The only configuration on those switchport links is a trunk (Cisco Nexus 5k)

     

    Is it possible, as a test, to configure a standalone port instead of the reth interface? I'm trying to isolate a problem with the reth configuration itself, and that test could prove it.

     

    Unfortunately this is in a remote DC and we don't have any extra non-reth interfaces configured or cabled.

     

    Please also share the "show interfaces extensive" command from xe-0/0/17 & xe-7/0/17 to check for any errors at layer 2.

     

    I have attached this as the output is quite long. I saw some L2 channel errors on the interfaces, but monitoring/refreshing the output did not cause these counters to rise. There were also some counters increasing for "policed discards". There is a hardening filter applied to the lo0 interface. To test, the input filter was disabled, but there was no change in connectivity.




  • 10.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-22-2019 10:27

    I did want to add and be transparent.... at some point last week, this was working.

     

    The only change that was made in the remote DC over the weekend was a migration to a new SRX1500 cluster (SRX1500-FW-02 in the diagram) from an old different vendor firewall. However, this firewall is not in play with communication between these vSRX's. 

     

    Since these vSRXs are still being deployed, their connectivity was not verified after the migration; only other production devices on that same network were checked.



  • 11.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-22-2019 13:15

    Hi,

     

    I understand that the "L2 channel errors" and "Policed discards" counters were cleared and were not increasing but were you sending traffic from the vSRX at that time?

     

    I believe they could be related to the issue:

     

    • L2 channel errors: Number of times the software did not find a valid logical interface for an incoming frame.
    • Policed discards: Number of frames that the incoming packet match code discarded, as they were not recognized or not of interest. Usually, this field reports protocols that the JUNOS software does not handle.

    Ref: https://kb.juniper.net/KB24601

     

    -Is the reth1.48 configured for vlan-tagging? (I understand that the Nexus ports are configured as trunk)

    -Are the correct vlans permitted over that trunk? (At SRX and Nexus side)

    -Is the Nexus port facing the vSRX configured on the correct vlan?

    -Is it possible to remove vlan-tagging on the reth and on the trunk port at the nexus side to test the communication without vlan tagging?

    -Are you able to perform a failover to test the communication when the active link resides on the other node of the cluster?

     



  • 12.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-22-2019 14:04

    "L2 channel errors" and "Policed discards" counters were cleared and were not increasing but were you sending traffic from the vSRX at that time?

     

    Yes I was sending traffic at the time watching the counters, both ICMP and SSH. The L2 channel errors never incremented and the policed discards incremented about once every 5 seconds. Even with no traffic sourced from the vSRX the slow increment on policed discards still occurs.

     

    -Is the reth1.48 configured for vlan-tagging? (I understand that the Nexus ports are configured as trunk)

    -Are the correct vlans permitted over that trunk? (At SRX and Nexus side)

    -Is the Nexus port facing the vSRX configured on the correct vlan?

    Yes.

     

    SRX1500-FW-01

    derek@SRX1500-FW-01> show configuration interfaces reth1       
    description "Interfaces - xe-0/0/17 & xe-7/0/17";
    vlan-tagging;
    redundant-ether-options {
        redundancy-group 1;
    }
    unit 0 {
        disable;
        vlan-id 3967;
    }
    unit 40 {
        vlan-id 40;
        family inet {
            address 10.2.40.200/24;
        }
    }
    unit 45 {
        vlan-id 45;
        family inet {
            address 10.2.45.1/24;
        }
    }
    unit 48 {
        vlan-id 48;
        family inet {
            address 10.2.48.1/24;
        }
    }

    Nexus:

     

    ##Both Nexus 5k contain SRX1500 Interfaces for xe-0/0/17 and xe-7/0/17##
    
    interface Ethernet1/10
      description **SRX1500-FW-01-Trunk**
      switchport mode trunk
      switchport trunk allowed vlan 40,45,48
     
    
    ##Nexus FEX for vSRXs##
    
    interface Ethernet103/1/27
      description **JunESX2 10GB port2**
      switchport mode trunk
    
    interface Ethernet102/1/27
      description **JunESX2 10GB port1**
      switchport mode trunk
    
    
    ##vSRX##
    
    derek@SUW-NC-FW# show interfaces fxp0 
    unit 0 {
        family inet {
            address 10.2.48.60/24 {
                master-only;
            }
        }
    }

     

     

    -Is it possible to remove vlan-tagging on the reth and on the trunk port at the nexus side to test the communication without vlan tagging?

     

    Unfortunately not. The SRX1500 reth interface is trunked for a couple of production VLANs

     

    -Are you able to perform a failover to test the communication when the active link resides on the other node of the cluster?

     

    I should be able to get this done and report back.



  • 13.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 08-22-2019 14:24

    Just in case.... below is the portgroup settings in vmware attached to the fxp0 port (the first nic on the vSRX).

     

    [Images: jnpr-vmware-pg-fxp0-promiscuous.png, jnpr-vmware-pg-fxp0-promiscuous-security.png]



  • 14.  RE: Communication issue btw SRX1500 and multiple vSRX's

     
    Posted 08-22-2019 15:57

    Regarding the configuration on the interfaces between the Nexus and the vSRX, I can see that the Nexus side is a trunk but the fxp0 side is not using any tagging. Will VMware take care of this?

     

    You may also try a counter on the reth interface to determine if the pings/ssh packets are reaching the SRX:

     

    set firewall family inet filter ICMP_FILTER term 1 from source-address 10.2.48.60
    set firewall family inet filter ICMP_FILTER term 1 from destination-address 10.2.48.1
    set firewall family inet filter ICMP_FILTER term 1 from protocol icmp
    set firewall family inet filter ICMP_FILTER term 1 then accept
    set firewall family inet filter ICMP_FILTER term 1 then count COUNTER
    set firewall family inet filter ICMP_FILTER term ACCEPT_ALL then accept
    
    set interfaces reth1.48 family inet filter input ICMP_FILTER

     

    After committing the config, run the test and then check with "> show firewall". If the counter remains at zero, then you might need to check the switching side.

     

    If possible try deleting the reth interface and re-apply it; and keep us posted with the failover results.

     

     



  • 15.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-03-2019 10:22

    Sorry for the lack of replies; I was out of the office all last week at the Juniper Tech Fest event in Chicago. I will be updating this as soon as I can. Thanks everyone.



  • 16.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-09-2019 14:03

    An update to this saga....

     

    We got into esxcli and did some packet captures using "pktcap-uw" on the physical nics as well as the vswitchports that the vSRX was connected to.

     

    At the dvSwitch level, the vSRX vm is passing all ICMP traffic destined to the SRX1500 reth1.48 interface.

     

    pktcap-uw --switchport 50331672

    The above switchport id was found via esxtop

     

    However, at the physical NIC the capture did not see any ICMP traffic when the vSRX was pinging the SRX1500 reth1.48 interface.

    pktcap-uw --uplink vmnic0 --proto 0x01

     

    There are 2 physical 10Gb nics in a team and the above command was checked on both nics in the event traffic was exiting the other nic. There was no traffic seen on the 2nd nic in the team. ESXTOP confirmed the vSRX is only using the 1 vmnic.

     

    When pinging another working SRX1500 cluster on the same subnet, we did see traffic in the capture.

     

    I am hoping that tomorrow we can do some additional testing inside vmware to rule out any dvswitch issues.



  • 17.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-09-2019 22:55


     

    Interesting, please keep us posted.

     



  • 18.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-10-2019 08:12

    This morning we removed a physical nic as an uplink in the dvswitch portgroup for the vSRX. There was no change in connectivity.

     

    For testing, we decided to create a standard vSwitch with a VM portgroup as a trunk, using the physical NIC we removed from the dvSwitch portgroup, and voila, the vSRX can now communicate with its gateway.

     

    Will be doing some additional troubleshooting to determine what is wrong with the distributed vswitch and reconnect it. I will update once we determine the issue.



  • 19.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-10-2019 12:28

    Some additional testing from today:

     

     

    One thing we noticed was that after a reboot of the vSRX, fxp0 is the first interface that is initialized and active, and at that point it can reach its gateway. Once the revenue interfaces become active, the fxp0 interface can no longer reach its gateway. And again, this is the only address that is not reachable.

     

    We also noticed some interesting behavior. Our fxp0 interface is in its own portgroup with promiscuous mode enabled. This portgroup is assigned VLAN ID 48, as we are not tagging the fxp0 interface on the vSRX. The revenue port on the vSRX is connected to another portgroup that is trunked (tagged) for all VLANs. The reth interface on the vSRX is tagged for each VLAN interface configured. There is no VLAN 48 interface configured on the vSRX. If we change that trunked portgroup in VMware to trunk all VLANs except VLAN 48, it begins to work just fine.

     

    The above can be a workaround, but there is a possibility of running into this issue in the future, so a cause/fix needs to be determined. It is still odd that the only device it is unable to communicate with in that subnet is its gateway address.

     

    We have also tested and created a new dvswitch and portgroup and still run into the same exact problems.

     

    We have opened a ticket with VMware to do some sanity checks.



  • 20.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-13-2019 10:36

    We opened a ticket with VMware in which they initially tried to claim this was an unsupported configuration. We eventually convinced them and they verified in a lab that it should work. Their lab was using a 6.5 dvswitch.

     

    Ultimately they don't know what the actual problem is; however, upgrading to ESXi 6.5 with a 6.5 dvSwitch solved the problem when we tested this in our own separate lab. We deployed a vSRX on a 6.5 host with a 6.0 dvSwitch and observed the same problem. Upgrading the dvSwitch to 6.5 solved it. This suggests there is an issue with the 6.0 dvSwitch; we just don't know exactly what it is or why it only affects the gateway address.

     

    We have our infrastructure team performing a vCenter, ESXi, and dvSwitch upgrade for our hosts with vSRXs. I will report back once it is complete.



  • 21.  RE: Communication issue btw SRX1500 and multiple vSRX's

    Posted 09-16-2019 08:05

    Our infrastructure team upgraded vCenter, the ESXi hosts, and the dvSwitch to 6.5, and we are still having the problems. Very odd, since the upgrade fixed it when we replicated this in a lab. They are headed back to VMware with the results.



  • 22.  RE: Communication issue btw SRX1500 and multiple vSRX's
    Best Answer

    Posted 09-18-2019 07:55

    So I have finally solved this issue. I also believe there were 2 issues at play with one more problematic than the other.

     

    This ultimately came down to MAC address conflicts on the reth interfaces between the 2 physical SRX1500 clusters and the 3 vSRX clusters. All clusters had the same cluster-id, and the SRX generates the MAC address for each reth interface based on the cluster-id, so the same MAC was generated on all clusters. This wasn't too much of an issue and wasn't noticeable previously because of the use of separate VLANs for each firewall.
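
    To make the collision concrete, here is a minimal sketch of how a reth MAC could be derived from the cluster-id. The byte layout is an assumption on my part (fixed prefix 00:10:db:ff, cluster-id in the high nibble of the fifth octet, reth index in the last octet); it happens to match the 00:10:db:ff:10:01 shown for reth1 earlier in this thread, but it is not taken from the posts and only covers cluster-ids 1 through 15. The function name `reth_mac` is hypothetical.

```python
def reth_mac(cluster_id: int, reth_index: int) -> str:
    """Build a reth virtual MAC under the ASSUMED layout described above."""
    if not 1 <= cluster_id <= 15:
        raise ValueError("this sketch only covers cluster-ids 1-15")
    if not 0 <= reth_index <= 255:
        raise ValueError("reth index must fit in one byte")
    # Assumption: cluster-id sits in the high nibble, the low nibble is zero.
    fifth_octet = cluster_id << 4
    return "00:10:db:ff:{:02x}:{:02x}".format(fifth_octet, reth_index)


# Every cluster configured with cluster-id 1 derives the same MAC for reth1.
print(reth_mac(1, 1))
# A different cluster-id changes the fifth octet, so the MACs no longer collide.
print(reth_mac(2, 1))
```

    Under that assumption, any two clusters on the same L2 domain that share a cluster-id also share their reth MACs, which is the situation described above.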

     

    The ESXi dvswitch version, according to lab testing, may have played a role, albeit not the root cause.

     

    I found an error message on the console of one of the vSRXs (duplicate tnp address found on interface x), which led me to this thread that I need to give credit to:

    J-NET: Two-Pairs-of-SRX-Clusters-on-MAC-Address-Conflicts

     

    I pulled all MAC addresses from every SRX/vSRX and this is when I noticed we had a MAC conflict problem.
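
    For anyone repeating that check, a small sketch of the comparison is below. It assumes you have already collected the interface MACs from each device by hand into a dictionary; the helper name `find_mac_conflicts` and the vSRX names and MACs in the sample inventory are made up for illustration. Only the SRX1500-FW-01 reth1 address comes from this thread.

```python
from collections import defaultdict


def find_mac_conflicts(inventory):
    """Return {mac: [(device, interface), ...]} for MACs seen on more than one device."""
    seen = defaultdict(list)
    for device, interfaces in inventory.items():
        for ifname, mac in interfaces.items():
            # Normalise case so 00:10:DB... and 00:10:db... compare equal.
            seen[mac.lower()].append((device, ifname))
    return {
        mac: owners
        for mac, owners in seen.items()
        if len({device for device, _ in owners}) > 1
    }


# Hypothetical inventory; only the SRX1500-FW-01 entry is from the thread.
inventory = {
    "SRX1500-FW-01": {"reth1": "00:10:db:ff:10:01"},
    "vSRX-A": {"reth1": "00:10:DB:FF:10:01"},
    "vSRX-B": {"reth1": "00:10:db:ff:20:01"},
}

conflicts = find_mac_conflicts(inventory)
for mac, owners in conflicts.items():
    print(mac, owners)
```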

     

    The vSRX's were not in production so I was able to reset the cluster-id on each one and verify the MAC addresses changed on the reth interfaces.   After hours we did the same for one of the production SRX1500 clusters and everything came up correctly and working.

     

    user@SRX> set chassis cluster cluster-id #

    After thinking about the process, the packet captures on the ESXi hosts make sense. When the vSRX would ARP for its gateway, the MAC address in the reply matched a local interface on the vSRX itself, so the traffic would never leave the ESXi host. Once we removed that VLAN from the trunked portgroup, it started working because the MAC address was no longer local in the same VLAN.
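
    The reasoning above can be sketched as a toy forwarding decision. This is only a simplified model of a virtual switch, not VMware's actual logic: the `forward` function and the port name `vsrx-revenue` are hypothetical, and it assumes the vSRX revenue port carried the same MAC as the gateway's reth1 (the MAC conflict described in this post).

```python
def forward(dst_mac, vlan, local_ports):
    """Toy model: deliver locally if a local vNIC owns dst_mac in this VLAN, else use the uplink.

    local_ports maps a port name to (mac, set of VLAN IDs the port carries).
    """
    for port, (mac, vlans) in local_ports.items():
        if mac == dst_mac and vlan in vlans:
            return port
    return "uplink"


GATEWAY_MAC = "00:10:db:ff:10:01"  # reth1 on SRX1500-FW-01, as shown earlier in the thread
ALL_VLANS = set(range(1, 4095))

# Hypothetical: the vSRX revenue port has the conflicting MAC and is trunked for all VLANs.
before = {"vsrx-revenue": (GATEWAY_MAC, ALL_VLANS)}
# The workaround: the same port, trunked for everything except VLAN 48.
after = {"vsrx-revenue": (GATEWAY_MAC, ALL_VLANS - {48})}

print(forward(GATEWAY_MAC, 48, before))
print(forward(GATEWAY_MAC, 48, after))
```

    In this model the first call resolves to the local port and the second to the uplink, which lines up with the pktcap-uw captures: traffic was visible at the switchport but never on the physical NIC until VLAN 48 was removed from the trunk.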