Oh, yeah, didn't think about that. The IP Spoofing screen also uses the routing table to make its decisions. If it fits your network, you should also be able to avoid triggering the screen by having both ISP interfaces in the same zone, rather than disabling the screen altogether.
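Roughly something like this, if it fits -- the zone name and the .551 unit are taken from your logs and config, and <current-secondary-zone> is just a placeholder for wherever ge-0/0/5.551 lives today (you'd also want to double-check that your security policies still line up afterwards):
delete security zones security-zone <current-secondary-zone> interfaces ge-0/0/5.551
set security zones security-zone Untrust interfaces ge-0/0/5.551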
The random failure is unlikely to be related. Also, earlier you mentioned that adding an HTTP test made things more stable compared to just ICMP; the effect may have come simply from having a second test, since all tests must fail for the probe to be considered failed as a whole and trigger your ip-monitoring policy. So, the more tests you have, the less chance of false positives.
Original Message:
Sent: 07-07-2025 17:33
From: TacticalDonut164
Subject: Odd behavior with RPM and IP monitoring
Took a closer look at the SPAN.
I saw ICMP responses.
And... these logs being spammed over and over again.
USER.ERR: Jul 7 15:52:45 LabBR RT_IDS: RT_SCREEN_IP: IP spoofing! source: 8.8.8.8, destination: 10.255.250.13, protocol-id: 1, zone name: Untrust, interface name: ge-0/0/5.501, action: drop
When I did 'delete security screen ids-option IDS-Untrust ip spoofing', the issue was resolved. This was probably also the cause of the probes eventually failing at random.
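For anyone else who hits this: besides the RT_SCREEN_IP syslog messages, the drops should also show up in the per-zone screen counters, e.g.:
show security screen statistics zone Untrust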
Very frustrating that it was such a simple issue and that I did not catch on to it earlier.
Thanks for your help with this.
Original Message:
Sent: 07-07-2025 15:55
From: Nikolay Semov
Subject: Odd behavior with RPM and IP monitoring
Don't jump to conclusions about the RPM traffic based on a manual ping. The ping command will use whatever route is in the routing table; specifying a source address does not change that behavior. And you can expect that to fail (as I mentioned before, using the ISP-1 IP address over ISP-2 is supposed to fail most of the time).
The next-hop value you specify in the RPM probe, on the other hand, is supposed to determine where the probe traffic goes regardless of the routing table.
Since you have SPAN configured already, just look for the RPM probes rather than generating more traffic on top of that.
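You can also cross-check on the SRX with the probe results themselves, e.g. (owner/test names from your config):
show services rpm probe-results owner PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP
That won't tell you which interface the packets actually left on, but it will tell you whether RPM thinks it is getting answers, which you can line up against what you see in the SPAN.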
------------------------------
Nikolay Semov
Original Message:
Sent: 07-07-2025 15:41
From: TacticalDonut164
Subject: Odd behavior with RPM and IP monitoring
Unfortunately, removing the destination-interface did not do it.
Still nothing helpful in the traces. But I know what the issue is now.
Even when I explicitly set the source address, next hop, and destination interface, it stubbornly sends packets out toward router 1 for some reason. Consider this, where .551 goes to router 1 and .501 goes to the primary router 0:
LabBR> show arp interface ge-0/0/5.501 no-resolve
MAC Address        Address          Interface       Flags
60:15:2b:cb:ef:30  10.255.250.14    ge-0/0/5.501    none

LabBR> show arp interface ge-0/0/5.551 no-resolve
MAC Address        Address          Interface       Flags
34:e5:ec:48:12:30  10.255.250.18    ge-0/0/5.551    none
Then a SPAN on the switch reveals:
Ethernet II, Src: JuniperNetwo_c7:e5:cd (40:7f:5f:c7:e5:cd), Dst: PaloAltoNetw_48:12:30 (34:e5:ec:48:12:30)
This traffic was generated from "ping 8.8.8.8 source 10.255.250.13 rapid count 100".
Extremely frustrating. I don't know why it is doing this.
Original Message:
Sent: 07-07-2025 14:58
From: Nikolay Semov
Subject: Odd behavior with RPM and IP monitoring
That's weird. Try with just source-address and next-hop, without destination-interface -- something like the snippet below. Keep turning knobs and flipping switches; there's got to be a way to make sure the probes always go out of .501 under all circumstances. I normally have different ISPs split into different VRs, so I've never had to configure this exact scenario, but I'm pretty sure it's possible.
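Something along these lines, untested on my end, with the addresses taken from your lab topology (.13 is your side of the .501 subnet, .14 is router 0):
set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP source-address 10.255.250.13
set services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP next-hop 10.255.250.14
delete services rpm probe PROBE-PRIMARY-INET test TEST-PRIMARY-INET-ICMP destination-interface
and the same for the HTTP test if you keep it.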
------------------------------
Nikolay Semov
Original Message:
Sent: 07-07-2025 14:50
From: TacticalDonut164
Subject: Odd behavior with RPM and IP monitoring
I'll keep looking at what options are available and configure some traceoptions.
With regards to BGP, I am not running BGP with the ISPs, only with the internet routers, so that I can fail over if one router dies. In my testing, when I simulate a primary ISP failure by unplugging router 0 from the ISP side, the 0.0.0.0/0 route does not disappear from the routing table and therefore still gets advertised, so all traffic gets blackholed.
When I tried adding source-address 10.255.250.13, I then got this:
LabBR# run show services rpm history-results test TEST-PRIMARY-INET-ICMP owner PROBE-PRIMARY-INET | last 2
PROBE-PRIMARY-INET, TEST-PRIMARY-INET-ICMP    Mon Jul  7 13:51:54 2025    Mon Jul  7 13:51:54 2025    No route to target
PROBE-PRIMARY-INET, TEST-PRIMARY-INET-ICMP    Mon Jul  7 13:51:59 2025    Mon Jul  7 13:51:59 2025    No route to target
Original Message:
Sent: 07-07-2025 14:44
From: Nikolay Semov
Subject: Odd behavior with RPM and IP monitoring
Packets sourced from the ISP-1 IP address are not expected to work over ISP-2; that's normal.
I think I'm missing something with the RPM config. It's supposed to be able to go out of a configured interface to a configured next-hop regardless of what the routing table says; I just don't know what the exact right configuration is. Maybe it has to be source-address + next-hop? RPM traceoptions might help with that, too.
Also, if you're already running BGP with the ISPs, why do you need RPM on top of that? If a connection breaks, the BGP route ought to disappear, no?
------------------------------
Nikolay Semov
Original Message:
Sent: 07-07-2025 14:36
From: TacticalDonut164
Subject: Odd behavior with RPM and IP monitoring
(FYI - broke out an old lab firewall and am now testing on that so I don't just drop myself every five minutes)
New topology:
[SRX320 Lab Firewall] < .13 -- 10.255.250.12/30 -- .14 > Internet Router 0 <-> ISP 1 < .17 -- 10.255.250.16/30 -- .18 > Internet Router 1 <-> ISP 2
Unfortunately this did not fix preemption. After adding next-hop 10.255.250.14 to both tests:
LabBR> show configuration services rpm
probe PROBE-PRIMARY-INET {
    test TEST-PRIMARY-INET-ICMP {
        target address 8.8.8.8;
        probe-count 4;
        probe-interval 5;
        test-interval 10;
        thresholds {
            successive-loss 4;
        }
        destination-interface ge-0/0/5.501;
        next-hop 10.255.250.14;
    }
    test TEST-PRIMARY-INET-HTTP {
        probe-type http-get;
        target url https://www.google.com;
        test-interval 10;
        thresholds {
            successive-loss 3;
        }
        destination-interface ge-0/0/5.501;
        next-hop 10.255.250.14;
    }
}
Then I simulated an upstream failure by stripping VLAN 501 off the switch and rolling it back after a few minutes; the probes still continued to fail. To recover, I have to override the static route, either by adding a lower-metric route or by deactivating the services block. Once the IP-monitoring route is removed or overridden, everything starts working normally again. From there, I can reactivate the services block and/or delete the [routing-options static] override without any issues.
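By "adding a lower-metric route" I mean something like this (the preference value is arbitrary, it just has to beat the injected Static/10, and I delete it again once the BGP default is back in charge):
set routing-options static route 0.0.0.0/0 next-hop 10.255.250.14 preference 5
commit
(wait for the probes to recover)
delete routing-options static route 0.0.0.0/0
commit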
The "good news" is that I have not seen an uncommanded failover yet, and it's been active for quite a while. So we "might" have resolved that issue, but I'm not holding my breath.
Preemption not working is less of an issue than everything just dying, but still something I would like to get resolved. I'm not sure why it doesn't work.
To give some more detail, this is what the routing table looks like when everything is working okay:
0.0.0.0/0          *[BGP/200] 1w0d 18:42:56, localpref 100
                      AS path: 64513 ?, validation-state: unverified
                    >  to 10.255.250.14 via ge-0/0/5.501
                    [BGP/250] 5d 18:58:02, localpref 100
                      AS path: 64514 ?, validation-state: unverified
                    >  to 10.255.250.18 via ge-0/0/5.551
When it fails over, the routing table changes to:
0.0.0.0/0          *[Static/10] 00:02:19, metric2 0
                    >  to 10.255.250.18 via ge-0/0/5.551
                    [BGP/200] 00:00:49, localpref 100
                      AS path: 64513 ?, validation-state: unverified
                    >  to 10.255.250.14 via ge-0/0/5.501
                    [BGP/250] 5d 19:04:38, localpref 100
                      AS path: 64514 ?, validation-state: unverified
For what it's worth, when the route is in effect, I cannot ping even sourced from the 501 interface:
LabBR# run ping 8.8.8.8 interface ge-0/0/5.501
PING 8.8.8.8 (8.8.8.8): 56 data bytes
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
Original Message:
Sent: 07-07-2025 10:41
From: Nikolay Semov
Subject: Odd behavior with RPM and IP monitoring
I'm not completely sure about this, but I think you should also specify next-hop on the RPM probes to force them to use ISP-1 during the failover condition; otherwise they may try to use ISP-2 (due to the ip-monitoring policy action) with the source address of the ISP-1 interface, which will make them fail indefinitely.
As for why the thing fails over in the first place, if you're absolutely sure you're not running into some transient ISP failure, then try enabling RPM traceoptions to get more details on what's going on with the RPM probes: https://www.juniper.net/documentation/us/en/software/junos/cli-reference/topics/ref/statement/traceoptions-edit-services-rpm.html
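A minimal starting point would be something like the following (file name is just an example, and the exact flag names can vary by release, so check with "?" completion; "all" is noisy but fine for a lab):
set services rpm traceoptions file rpm-trace size 1m files 3
set services rpm traceoptions flag all
Then watch it with "show log rpm-trace" while the probes are running.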
------------------------------
Nikolay Semov
Original Message:
Sent: 07-05-2025 18:54
From: TacticalDonut164
Subject: Odd behavior with RPM and IP monitoring
Hey guys,
I am having an odd issue with RPM and IP monitoring.
With the services block active, this policy will eventually trip and fail me over to my secondary ISP. The higher I set the intervals, the longer I can go before it fails me over. But it always will eventually fail.
I know for a fact I am not losing 20 seconds' worth of pings consecutively, and there cannot be a network-level issue that stops pings from the SRX from transiting for 20 seconds. The CPU doesn't appear to spike during those times, nor does a DHCP client renewal on either internet router correlate.
Preempt doesn't work either: 8.8.8.8/google.com remains reachable through reth3.500 (since the failover was not genuine to begin with), but the injected route is never withdrawn. The only way to recover from this event is to delete or deactivate the services block.
Looking in the logs I just see that the probe succeeds, succeeds, succeeds, and then suddenly it starts failing and never stops failing.
Deactivating and reactivating the block resets the failover and everything starts working again until the next event.
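Concretely, the reset each time is just deactivating that part of the config and re-activating it, i.e.:
deactivate services ip-monitoring
commit
activate services ip-monitoring
commit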
I've also tried the suggestion of adding an HTTP GET test to the probe, so that both tests have to fail, and a GET is less likely to be dropped. After doing this, everything dropped not even 10 minutes later. Of course, deactivating services brought it all back up, almost as if there never was a real failure to begin with.
Policy - FAIL-TO-SECONDARY-INET (Status: FAIL)
  RPM Probes:
    Probe name              Test Name                 Address          Status
    ----------------------  ------------------------  ---------------  ---------
    PROBE-PRIMARY-INET      TEST-PRIMARY-INET-ICMP    8.8.8.8          FAIL
    PROBE-PRIMARY-INET      TEST-PRIMARY-INET-HTTP                     FAIL
  Route-Action (Adding backup routes when FAIL):
    route-instance     route              next-hop         state
    -----------------  -----------------  ---------------  -------------
    inet.0             0.0.0.0/0          10.255.250.6     APPLIED
Hoping to get some additional eyes on this!
Thank you!
Topology:
[SRX345 Cluster] <-- .1 -- 10.255.250.0/30 -- .2 --> Internet Router 1 <-> ISP 1 <-- .5 -- 10.255.250.4/30 -- .6 --> Internet Router 2 <-> ISP 2
Config:
rpm {
    probe PROBE-PRIMARY-INET {
        test TEST-PRIMARY-INET-ICMP {
            target address 8.8.8.8;
            probe-count 4;
            probe-interval 5;
            test-interval 10;
            thresholds {
                successive-loss 4;
            }
            destination-interface reth3.500;
        }
        test TEST-PRIMARY-INET-HTTP {
            probe-type http-get;
            target url https://www.google.com;
            test-interval 10;
            thresholds {
                successive-loss 3;
            }
            destination-interface reth3.500;
        }
    }
}
ip-monitoring {
    policy FAIL-TO-SECONDARY-INET {
        match {
            rpm-probe PROBE-PRIMARY-INET;
        }
        then {
            preferred-route {
                route 0.0.0.0/0 {
                    next-hop 10.255.250.6;
                }
            }
        }
    }
}