SRX


Ask questions and share experiences about the SRX Series, vSRX, and cSRX.

Odd behavior with RPM and IP monitoring

  • 1.  Odd behavior with RPM and IP monitoring

    Posted 3 days ago
    Edited by TacticalDonut164 3 days ago

    Hey guys,

    I am having an odd issue with RPM and IP monitoring. 

    With the services block active, this ip-monitoring policy will eventually trip and fail me over to my secondary ISP. The higher I set the intervals, the longer I can go before it fails over, but it always fails eventually.

    I know for a fact I am not losing 20 seconds' worth of pings consecutively (which is what it would take to trip this: 4 successive lost probes at a 5-second interval), and there is no network-level issue that would stop pings from the SRX from transiting for 20 seconds. The CPU doesn't appear to spike during those windows, and a DHCP client renew on either internet router doesn't correlate either.

    Preemption doesn't work either: 8.8.8.8/google.com remains reachable through reth3.500 (since the failover was not genuine to begin with), but the preferred route is never withdrawn. The only way to recover from this event is to delete or deactivate the services block.

    Looking in the logs I just see that the probe succeeds, succeeds, succeeds, and then suddenly it starts failing and never stops failing. 

    Deactivating and reactivating the block resets the failover and everything starts working again until the next event. 

    I've also tried a suggestion to add an HTTP GET test to the probe, so that both tests have to fail, since a GET is less likely to be dropped. After doing this, everything dropped not even 10 minutes later. Of course, deactivating services brought it all back up, almost like there never was a real failure to begin with.
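
    (For clarity, "deactivating services" just means the usual from configuration mode:

        deactivate services
        commit

    and the corresponding activate services / commit to put it back.)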

    Policy - FAIL-TO-SECONDARY-INET (Status: FAIL)
      RPM Probes:
        Probe name             Test Name       Address          Status
        ---------------------- --------------- ---------------- ---------
        PROBE-PRIMARY-INET     TEST-PRIMARY-INET-ICMP 8.8.8.8   FAIL
        PROBE-PRIMARY-INET     TEST-PRIMARY-INET-HTTP           FAIL
    
      Route-Action (Adding backup routes when FAIL):
        route-instance    route             next-hop         state
        ----------------- ----------------- ---------------- -------------
        inet.0            0.0.0.0/0         10.255.250.6     APPLIED

    Hoping to get some additional eyes on this!

    Thank you!

    Topology:

    [SRX345 Cluster] <-- .1 -- 10.255.250.0/30 -- .2 --> Internet Router 1 <-> ISP 1
                     <-- .5 -- 10.255.250.4/30 -- .6 --> Internet Router 2 <-> ISP 2

    Config:

    rpm {
        probe PROBE-PRIMARY-INET {
            test TEST-PRIMARY-INET-ICMP {
                target address 8.8.8.8;
                probe-count 4;
                probe-interval 5;
                test-interval 10;
                thresholds {
                    successive-loss 4;
                }
                destination-interface reth3.500;
            }
            test TEST-PRIMARY-INET-HTTP {
                probe-type http-get;
                target url https://www.google.com;
                test-interval 10;
                thresholds {
                    successive-loss 3;
                }
                destination-interface reth3.500;
            }
        }
    }
    ip-monitoring {
        policy FAIL-TO-SECONDARY-INET {
            match {
                rpm-probe PROBE-PRIMARY-INET;
            }
            then {
                preferred-route {
                    route 0.0.0.0/0 {
                        next-hop 10.255.250.6;
                    }
                }
            }
        }
    }
    


  • 2.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    I'm not completely sure about this, but I think you should also specify next-hop on the RPM probes to force them to use ISP-1 during a failover condition; otherwise they may try to use ISP-2 (due to the ip-monitoring policy action) with the source address of the ISP-1 interface, which will make them fail indefinitely.
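
    Something along these lines (untested; the .2 next hop is taken from your topology diagram):

        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP next-hop 10.255.250.2
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-HTTP next-hop 10.255.250.2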

    As for why the thing fails over in the first place, if you're absolutely sure you're not running into some transient ISP failure, then try enabling RPM traceoptions to get more details on what's going on with the RPM probes: https://www.juniper.net/documentation/us/en/software/junos/cli-reference/topics/ref/statement/traceoptions-edit-services-rpm.html
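
    I don't remember the exact trace flags offhand (the linked page has the full list for your release), but something like:

        set services rpm traceoptions file rpm-trace size 1m files 3
        set services rpm traceoptions flag all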



    ------------------------------
    Nikolay Semov
    ------------------------------



  • 3.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    (FYI - broke out an old lab firewall and am now testing on that so I don't just drop myself every five minutes)

    New topology:

    [SRX320 Lab Firewall] < .13 -- 10.255.250.12/30 -- .14 > Internet Router 0 <-> ISP 1
                          < .17 -- 10.255.250.16/30 -- .18 > Internet Router 1 <-> ISP 2

    Unfortunately this did not fix preemption - after adding next-hop 10.255.250.14 to both probes:

    LabBR> show configuration services rpm
    probe PROBE-PRIMARY-INET {
        test TEST-PRIMARY-INET-ICMP {
            target address 8.8.8.8;
            probe-count 4;
            probe-interval 5;
            test-interval 10;
            thresholds {
                successive-loss 4;
            }
            destination-interface ge-0/0/5.501;
            next-hop 10.255.250.14;
        }
        test TEST-PRIMARY-INET-HTTP {
            probe-type http-get;
            target url https://www.google.com;
            test-interval 10;
            thresholds {
                successive-loss 3;
            }
            destination-interface ge-0/0/5.501;
            next-hop 10.255.250.14;
        }
    }

    Then I simulated an upstream failure by stripping VLAN 501 off the switch and rolling it back after a few minutes; the probes still continued to fail. To recover I have to override the static route, either by adding a lower-preference route or by deactivating the services block. Once the IP monitoring route is removed or overridden, everything starts working normally again. From that point I can reactivate the services block and/or delete the [routing-options static] override without any issues.
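
    By "overriding" I mean something like a plain static default toward router 0; as I understand it, an ordinary static route (default preference 5) wins over the Static/10 route that ip-monitoring injects:

        set routing-options static route 0.0.0.0/0 next-hop 10.255.250.14
        commit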

    The "good news" is that I have not seen an uncommanded failover yet, and it's been active for quite a while. So we "might" have resolved that issue, but I'm not holding my breath.

    Preemption not working is less of an issue than everything just dying, but still something I would like to get resolved. I'm not sure why it doesn't work. 

    To give some more details: when everything is working okay, this is what the routing table looks like:

    0.0.0.0/0          *[BGP/200] 1w0d 18:42:56, localpref 100
                          AS path: 64513 ?, validation-state: unverified
                        >  to 10.255.250.14 via ge-0/0/5.501
                        [BGP/250] 5d 18:58:02, localpref 100
                          AS path: 64514 ?, validation-state: unverified
                        >  to 10.255.250.18 via ge-0/0/5.551

    When it fails over, the routing table changes to:

    0.0.0.0/0          *[Static/10] 00:02:19, metric2 0
                        >  to 10.255.250.18 via ge-0/0/5.551
                        [BGP/200] 00:00:49, localpref 100
                          AS path: 64513 ?, validation-state: unverified
                        >  to 10.255.250.14 via ge-0/0/5.501
                        [BGP/250] 5d 19:04:38, localpref 100
                          AS path: 64514 ?, validation-state: unverified

    For what it's worth, when the ip-monitoring route is in effect, I cannot ping 8.8.8.8 even when sourcing from the 501 interface:

    LabBR# run ping 8.8.8.8 interface ge-0/0/5.501
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    ^C
    --- 8.8.8.8 ping statistics ---
    2 packets transmitted, 0 packets received, 100% packet loss



  • 4.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    Packets sourced from the ISP1 IP address are not expected to work over ISP2; that's normal.

    I think I'm missing something with the RPM config. It's supposed to be able to send probes out a configured interface to a configured next-hop regardless of what the routing table says; I just don't know what the exact right configuration is supposed to be. Maybe it's a source-address + next-hop combination? Maybe RPM traceoptions can help with that, too.

    Also, if you're already running BGP with the ISPs, why do you need RPM on top of that? If a connection breaks, the BGP route ought to disappear, no?



    ------------------------------
    Nikolay Semov
    ------------------------------



  • 5.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago
    Edited by TacticalDonut164 2 days ago

    I'll keep looking at what options are available and configure some traceoptions.

    With regards to BGP, I am not running BGP with the ISPs, only with the internet routers, so that I have failover capability if one router fails. In my testing, when I simulate a failure on the primary ISP by unplugging router 0, the 0.0.0.0/0 route does not disappear from the routing table, and therefore still gets advertised, so all traffic gets blackholed.

    When I tried adding source-address 10.255.250.13, I then got this:

    LabBR# run show services rpm history-results test TEST-PRIMARY-INET-ICMP owner PROBE-PRIMARY-INET | last 2
        PROBE-PRIMARY-INET, TEST-PRIMARY-INET-ICMP Mon Jul  7 13:51:54 2025 Mon Jul  7 13:51:54 2025No route to target
        PROBE-PRIMARY-INET, TEST-PRIMARY-INET-ICMP Mon Jul  7 13:51:59 2025 Mon Jul  7 13:51:59 2025No route to target

    I did not see anything too useful in the traceoptions, but I'll keep monitoring.




  • 6.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    That's weird; try with just source-address and next-hop, without destination-interface. Keep turning knobs and flipping switches; there's got to be a way to make sure the probes always go out of .501 under all circumstances. I normally have different ISPs split into different VRs, so I've never had to configure this exact scenario, but I'm pretty sure it's possible.
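
    That is, something like this (adjust to your lab addressing; same idea for the HTTP test):

        delete services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP destination-interface
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP source-address 10.255.250.13
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP next-hop 10.255.250.14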



    ------------------------------
    Nikolay Semov
    ------------------------------



  • 7.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago
    Edited by TacticalDonut164 2 days ago

    Unfortunately, removing the destination-interface did not do it.

    Still nothing helpful in the traces. But I know what the issue is now. 

    Even when I explicitly set the source address, next hop, and destination interface, for some reason it stubbornly sends packets out router 1. Consider this, where 551 goes to router 1 and 501 goes to the primary router 0:

    LabBR> show arp interface ge-0/0/5.501 no-resolve
    MAC Address       Address         Interface                Flags
    60:15:2b:cb:ef:30 10.255.250.14   ge-0/0/5.501             none
    
    LabBR> show arp interface ge-0/0/5.551 no-resolve
    MAC Address       Address         Interface                Flags
    34:e5:ec:48:12:30 10.255.250.18   ge-0/0/5.551             none

    Then a SPAN on the switch reveals:

    Ethernet II, Src: JuniperNetwo_c7:e5:cd (40:7f:5f:c7:e5:cd), Dst: PaloAltoNetw_48:12:30 (34:e5:ec:48:12:30)

    This traffic was generated from "ping 8.8.8.8 source 10.255.250.13 rapid count 100".

    Extremely frustrating. I don't know why it is doing this.




  • 8.  RE: Odd behavior with RPM and IP monitoring
    Best Answer

    Posted 2 days ago

    Don't jump to conclusions about RPM traffic based on a manual ping. The ping command will use whatever route you have in the routing table; specifying a source address does not change that behavior. And you can expect that to fail (as I mentioned before, using the ISP-1 IP address over ISP-2 is supposed to fail most of the time).

    The next-hop value you specify in the RPM probe, on the other hand, is supposed to determine where the probe traffic goes regardless of the routing table.

    Since you have SPAN configured already, just look for the RPM probes rather than generating more traffic on top of that.



    ------------------------------
    Nikolay Semov
    ------------------------------



  • 9.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago
    Edited by TacticalDonut164 2 days ago

    Took a closer look at the SPAN. 

    I saw ICMP responses.

    And... these logs being spammed over and over again.

    USER.ERR: Jul  7 15:52:45 LabBR RT_IDS: RT_SCREEN_IP: IP spoofing! source: 8.8.8.8, destination: 10.255.250.13, protocol-id: 1, zone name: Untrust, interface name: ge-0/0/5.501, action: drop

    When I did 'delete security screen ids-option IDS-Untrust ip spoofing', the issue was resolved. This was probably also the cause of the probes eventually failing at random.

    Very frustrating that it was such a simple issue and that I did not catch on to it earlier.

    Thanks for your help with this. 




  • 10.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    Oh, yeah, I didn't think about that. The IP Spoofing screen also uses the routing table to make its decisions. If it fits your network, you should also be able to avoid triggering the screen by having both ISP interfaces in the same zone, rather than disabling the screen altogether.
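
    For example, something like this (assuming the .551 interface currently lives in a different zone; the Untrust zone name is taken from your screen log, and you'd remove the interface from its current zone first):

        set security zones security-zone Untrust interfaces ge-0/0/5.551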

    The random failure is unlikely to be related to the screen, though. Also, earlier you mentioned that adding an HTTP test made things more stable compared to ICMP alone; that effect may have come simply from having a second test, since all tests must fail for the probe to be considered failed as a whole and trigger your ip-monitoring policy. So the more tests you have, the lower the chance of a false positive.
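
    If you want to stack the odds further, a third test in the same probe against a different target would make a simultaneous false positive on all tests even less likely. A sketch reusing your existing knobs (the test name and the 1.1.1.1 target are just examples):

        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 target address 1.1.1.1
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 probe-count 4
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 probe-interval 5
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 test-interval 10
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 thresholds successive-loss 4
        set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP2 next-hop 10.255.250.14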



    ------------------------------
    Nikolay Semov
    ------------------------------



  • 11.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    The first thing I want to point out is that you are using a redundant Ethernet (reth) interface. That, and the fact that 8.8.8.8 is a very long route.

    In my configuration I'm using RPM and IP monitoring for World of Tanks, both for PC and mobile, so my example is not similar, and I'm not saying you can't do it for Google, but let me go over what's good, and kind of weird, about my WoT RPM and IP monitoring.

    The good thing is that it works when you skip a day of playing WoT. What I mean is that the topology resets, and the "2x" on the tank icons in the game resets back to "2x" (or whatever number the day starts with), so what's happening is that the topology is having an event. Please remember RPM has its own failover. I find it finicky, but it has always worked for me without failure; that's me and my topology, though. Here is my example. I have a setup of RPM and IP monitoring for the Microsoft Store too. My use of only IP addresses does operate properly.

    probe Probe-External-WOT-ipv4 {
        test ExternalWOT {
            target address 92.223.56.72;
            probe-count 10;
            probe-interval 5;
            test-interval 5;
            thresholds {
                successive-loss 10;
            }
            next-hop 92.223.56.1;
        }
    }

    policy WOT-Tracking-ipv4 {
        match {
            rpm-probe Probe-External-WOT-ipv4;
        }
        then {
            preferred-route {
                route 92.223.56.0/32 {
                    next-hop 92.223.56.72;
                }
            }
        }
    }



    ------------------------------
    Adrian Aguinaga
    B.S.C.M. I.T.T. Tech
    (Construction Management)
    A.A.S. I.T.T. Tech
    (Drafting & Design)
    ------------------------------



  • 12.  RE: Odd behavior with RPM and IP monitoring

    Posted 2 days ago

    Yeah, I have no idea why mine doesn't reliably work. Oddly, although it did originally induce a failure, the ICMP + HTTP configuration has held solid for at least 24 hours, something the ICMP-only configuration wasn't able to do.

    If it continues to hold, I'll try ICMP-only again.

    The inconsistency leads me to think it's somehow a platform issue; the last time I had issues this inconsistent with a Juniper product was with an ACX1100 that kept running out of TCAM entries and causing incomprehensible behavior.