
Expert Advice: Making Scripts Robust: Handling Routing Engine Switchovers

By Erdem posted 08-11-2015 14:46


Part 3 of 5 of Making Scripts Robust


Previous Article: Making Scripts Robust: Checking Status

Next Article: Making Scripts Robust: Recovering from Power Loss or Routing Engine Restart


NOTE: This applies to SLAX version 1.0 and higher.


Handling Routing Engine Switchovers


“There is nothing permanent except change.” – Heraclitus


If your script is running on a Junos device with multiple routing engines (REs), the device will very likely switch operation from one RE to the other at some point. It’s also safe to assume that your script needs to run regardless of which RE is currently master. Your script can determine RE status either through an explicit query or by detecting an RE switchover event. An explicit check of the current RE and its status is probably the easiest and most reliable approach. Detecting (and acting on) an RE switchover event requires defining an event policy and an action for that event. This latter approach is the more complicated of the two, so I don’t generally recommend it unless it is an explicit requirement dictated by your customer or other circumstances.


Detecting Current RE using $junos-context

As I mentioned earlier, the most direct and reliable method is for your script to determine which RE it’s executing on, as well as that RE’s status, and proceed accordingly. $junos-context gives Junos automation scripts the capability to do this. For example, assume we have a dual-RE configuration and a script that needs to run (periodically or continually) to perform some interface data collection. We want this script to always run, regardless of which RE is currently the master. However, our script can only collect interface data if it’s running on the master RE (the backup RE has extremely limited interface visibility). Given this, we can design our script so that it always runs on both REs:


  • If the script detects that it is running on the Master RE, it proceeds with the interface data collection.
  • If it detects that it’s running on the Backup RE, it sleeps or exits, depending on whether it’s a time-based event script or a self-starting script running in a loop.

As of Junos OS Release 11.1, the special global variable $junos-context is available and contains useful information about a script’s operating environment. From it we can extract the following:


  • $junos-context/routing-engine-name: This contains the name of the RE (e.g., re0 or re1) that the script is currently executing on.

    var $re = $junos-context/routing-engine-name;
    expr jcs:output("We are running on ", $re);


  • $junos-context/re-master: This element only exists if the RE on which the script is running is the master (it doesn't exist on the backup RE).

    /* $wait-interval and the do_stuff template are defined elsewhere in the script. */
    if (!$junos-context/re-master) {
        expr jcs:output("Not master. Nothing to do but wait.");
        expr jcs:sleep($wait-interval);
    } else {
        expr jcs:output("Running on master. Do stuff.");
        call do_stuff;
    }

In the above example we see that it’s possible to have an identical copy of a script that always runs on both REs. On the backup RE, the script does nothing; on the master RE, it does its work.
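To show how the fragment above might sit inside a complete script, here is a minimal op-script sketch. It is only an illustration under some assumptions: the template name collect-interface-data is a hypothetical stand-in for do_stuff, the script makes a single pass rather than looping or sleeping, and counting physical interfaces is just a placeholder for whatever data collection you actually need.

```slax
version 1.0;

ns junos = "http://xml.juniper.net/junos/*/junos";
ns xnm = "http://xml.juniper.net/xnm/1.1/xnm";
ns jcs = "http://xml.juniper.net/junos/commit-scripts/1.0";

import "../import/junos.xsl";

match / {
    <op-script-results> {
        var $re = $junos-context/routing-engine-name;

        if (not($junos-context/re-master)) {
            /* Backup RE: interface visibility is limited, so do nothing. */
            expr jcs:output("Running on " _ $re _ " (backup). Nothing to do.");
        } else {
            expr jcs:output("Running on " _ $re _ " (master). Collecting data.");
            call collect-interface-data();
        }
    }
}

/* Hypothetical stand-in for the do_stuff template in the fragment above. */
template collect-interface-data () {
    var $rpc = <get-interface-information> {
        <terse>;
    }
    var $interfaces = jcs:invoke($rpc);
    expr jcs:output("Physical interfaces seen: " _ count($interfaces/physical-interface));
}
```

The same file can be installed on both REs unchanged; only the branch taken differs.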


One more thing: if your script is saving data to a file locally, take note of which RE you’re executing on. Each RE has its own file system, so for example, if your script is executing on RE0 and writes data to "/var/tmp/foo", that data will only exist as re0:/var/tmp/foo. If your script needs to share that data with another instance that is running (or may run) on RE1, then it is up to your script to copy that data to the other RE's filesystem as needed.
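As a rough sketch of that copy step, the script below pushes a file to the other RE using the RPC form of the "file copy" command. It assumes a dual-RE chassis whose REs are named re0 and re1, that the other RE is present and reachable, and it reuses the /var/tmp/foo path from the paragraph above purely as a placeholder. Treat it as a starting point, not a tested implementation.

```slax
version 1.0;

ns junos = "http://xml.juniper.net/junos/*/junos";
ns xnm = "http://xml.juniper.net/xnm/1.1/xnm";
ns jcs = "http://xml.juniper.net/junos/commit-scripts/1.0";

import "../import/junos.xsl";

/* Placeholder path, borrowed from the /var/tmp/foo example. */
var $file = "/var/tmp/foo";

match / {
    <op-script-results> {
        var $re = $junos-context/routing-engine-name;

        /* Assumption: exactly two REs, named re0 and re1. */
        var $other-re = {
            if ($re == "re0") {
                expr "re1";
            } else {
                expr "re0";
            }
        }

        /* RPC equivalent of "file copy /var/tmp/foo reX:/var/tmp/foo". */
        var $rpc = <file-copy> {
            <source> $file;
            <destination> $other-re _ ":" _ $file;
        }
        var $result = jcs:invoke($rpc);

        if ($result/..//xnm:error) {
            expr jcs:output("Copy to " _ $other-re _ " failed: " _ $result/..//xnm:error/message);
        } else {
            expr jcs:output("Copied " _ $file _ " to " _ $other-re);
        }
    }
}
```

If the other RE is absent or unreachable, the copy returns an error, so the script checks the result instead of assuming the data arrived.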


Detecting RE Switchover Events

As I mentioned earlier, it’s possible to detect and accommodate RE switchover events as they happen. This requires creating an event policy and action to detect the RE switchover. One way to do this is with an event policy that triggers on the event whose message text matches “SNMP trap generated: redundancy switchover”.


Message:     SNMP trap generated: <trap> (<argument1> <value1>, <argument2>
             <value2>, <argument3> <value3>, <argument4> <value4>,
             <argument5> <value5>, <argument6> <value6>, <argument7>
             <value7>, <argument8> <value8>, <argument9> <value9>,
             <argument10> <value10>)
Help:        chassisd generated SNMP trap
Description: The chassis process (chassisd) generated a Simple Network
             Management Protocol (SNMP) trap with the ten indicated
             argument-value pairs.
Type:        Event: This message reports an event, not an error
Severity:    notice
Facility:    LOG_DAEMON


I include this approach and description here only for completeness’ sake. No example code for this approach is included in this article because I don’t want to encourage you to use this technique.


Here’s my primary argument against trying to directly detect and act on an RE switchover event. When an RE switchover takes place, there are usually many things going on behind the scenes in Junos. Most of them are probably happening at the same time as your event script, or at a higher priority. So if you assume that catching the RE switchover event will get your script a quicker (re)start on collecting data, you may be surprised.


You might get it started, but you’re more likely to run into time-outs when fetching data, since you’re probably contending with other processes (of equal or greater priority) that are dealing with the RE switchover. While it is true that you can guard against that possibility by adding retries and extra checks, you’re now adding more code and more contention. The result is code that is more complex (and thus more bug-prone), so you really need to think hard about whether this approach provides any significant benefit over one that simply checks $junos-context at opportune times.





Written by Douglas McPherson
Solutions Consultant at Juniper
