Expert Advice: Making Scripts Robust: Recovering from Power Loss or Routing Engine Restart

By Erdem posted 08-11-2015 14:48

Part 4 of 5 of Making Scripts Robust

Previous Article: Making Scripts Robust: Handling Routing Engine Switchovers

Next Article: Making Scripts Robust: Managing Retries

NOTE: This applies to SLAX version 1.0 and higher.

Recovering from Power Loss or Routing Engine Restart

“There is only one day left, always starting over: it is given to us at dawn and taken away from us at dusk.” -- Jean-Paul Sartre

One thing is certain: sooner or later, every computerized device will be rebooted or power-cycled. Whether or not these events are planned, the device must be able to recover to a known, operational state in a completely automated and predictable fashion. For network gear such as routers and switches, this means recovering the operational configuration and forwarding packets per that configuration as soon as possible after restart. If you’ve gone to the effort of developing and deploying Junos automation scripts on those routers or switches, you will want those scripts to recover just as rapidly and gracefully.

There are two types of long-lived Junos scripts:

those that are executed at regular intervals via a timer event policy (a tick-tock event), and
those that are self-restarting and handle their own timing intervals and loop counts.

Automatic restart after a reboot or power cycle is handled slightly differently for those two types.

Timer-Based Event Scripts

If the script runs at regular intervals from a tick-tock event policy, you don't need to do anything. As long as eventd is started as part of the normal Junos startup (as it always should be), your policy will be loaded and your script will run at its prescribed interval, with no extra effort required on your part.
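For reference, a tick-tock policy of this kind might look like the sketch below, written in the same element style as the system-started policy later in this article. The event name every-5-min, the policy name tick-tock, the 300-second interval, and the script name myeventscript.slax are all placeholders chosen for illustration, not values from a real deployment.

```slax
<event-options> {
    /* Hypothetical timer: eventd generates "every-5-min" every 300 seconds. */
    <generate-event> {
        <name> "every-5-min";
        <time-interval> "300";
    }
    /* Hypothetical policy: run the script each time the timer event fires. */
    <policy> {
        <name> "tick-tock";
        <events> "every-5-min";
        <then> {
            <event-script> {
                <name> "myeventscript.slax";
            }
        }
    }
}
```

Because eventd reloads this configuration at every startup, the timer simply starts ticking again after a reboot.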

Non-Timer Based "Daemonic" Scripts

Sometimes we need a long-lived script that manages its own polling intervals, loop iterations, and restarts. (For an explanation of why you'd want to assume those extra coding responsibilities, please refer to the previous article: SLAX: Design Consideration for Long-Lived On-box SLAX Scripts.)

This type of script does not use a tick-tock policy, so it needs its own event policy for eventd to start it automatically at system startup. This system-started policy is triggered by the system message “Starting of initial processes complete”, which indicates that the RE is running properly and all other requisite processes have been started.

An event policy to kick off our script myeventscript.slax would appear as follows:

<event-options> {
    <policy> {
        <name> "system-started";
        <events> "system";
        <attributes-match> {
            <from-event-attribute> "system.message";
            <condition> "matches";
            <to-event-attribute-value> "Starting of initial processes complete";
        }
        <then> {
            <event-script> {
                <name> "myeventscript.slax";
            }
        }
    }
}

Hold-Down Timers

In some cases, RE startups can initiate system processes that can take a long time (minutes) to complete. Even though your script was honestly informed “Starting of initial processes complete”, some of those processes might still be very busy handling other startup-related issues, and consequently might not be responsive to queries from your script. In those cases, you can avoid some potential problems by instituting a hold-down timer in your event script, so that when it detects it’s being started within a specified amount of time following an RE startup, it can quiesce its operations (sleep) for a certain amount of time before proceeding.

Such a quiesce-startup check could appear as follows:

match / {
    <event-script-results> {
        var $con = jcs:open();
        if (not($con)) {
            expr util:emit-msg('error', 'Error connecting to mgd. Exiting.');
            <xsl:message terminate="yes">;
        }

        /* Don't do anything else unless|until $holdTime seconds have passed since boot. */
        call util:quiesce-startup($con, $holdTime);

        /* Do stuff. Or not. */
    }
}

In the above example, $holdTime is a global variable that represents the number of seconds to wait after an RE startup, even though the script already received the “Starting of initial processes complete” event. In this example, I’ve written it as a template “util:quiesce-startup()”, as shown below:

/*****************************************************************************************
 * Template: util:quiesce-startup()
 * Description: keep script from starting until after $holdTime seconds of system boot.
 * Inputs:
 *   $con - handle for rpc,
 *   $holdTime (in seconds)
 * Outputs: none.
 * Either returns immediately or sleeps until the RE has been operational for $holdTime seconds.
 */
template util:quiesce-startup($con, $holdTime) {
    var $now = date:seconds();
    var $bootTime = jcs:sysctl("kern.boottime","i");
    var $startAfterTime = $bootTime + $holdTime;
    var $waitTime = $startAfterTime - $now;
    /* sleep $waitTime seconds, if we're still inside our hold-down window */
    if ($waitTime > 0) {
        expr util:emit-msg('info', concat("Script execution quiesced for ", $waitTime, " seconds due to recent reboot."));
        expr jcs:sleep($waitTime);
        expr util:emit-msg('info', "Script execution continuing.");
    }
}

As shown in this example, the template simply checks whether at least $holdTime seconds have passed since the RE booted. If not, it sleeps for the remainder; otherwise it returns immediately. Either way, the calling script continues once the template returns.
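To make the arithmetic concrete, here is a small sketch of how the global might be declared, with two worked cases traced through the template's variables in comments. The 300-second hold time and the timestamps are invented for illustration; pick a value that suits your own platform and script.

```slax
/* Assumed for illustration: hold the script until the RE has been up 5 minutes. */
var $holdTime = 300;

/*
 * Worked example with made-up timestamps (seconds since the epoch):
 *   $bootTime       = 1439300000
 *   $now            = 1439300120   (script started 120 s after boot)
 *   $startAfterTime = 1439300000 + 300 = 1439300300
 *   $waitTime       = 1439300300 - 1439300120 = 180    -> sleep 180 s
 *
 * If instead $now = 1439300400 (script started 400 s after boot):
 *   $waitTime       = 1439300300 - 1439300400 = -100   -> no sleep, return at once
 */
```

The second case is the one you will see when the script is restarted long after boot, for example by hand during testing.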

You may find, as part of your script testing, that you don’t really need this sort of quiesce operation. I mention it here only because I did run into this sort of issue for a script that used <get-snmp-object> operations to collect data on subscriber demux0 interfaces that numbered in the tens of thousands. As it turned out, snmpd needed a little more time at system startup to sort out ifIndex assignments for all those dynamic interfaces, and would only cooperate when my script gave it a little extra time at system boot.

Your results will vary. The nature of your script and the data you’re collecting will determine if this is a potential issue. You just need to ensure that your script test cases include testing for proper automated startup after RE boot.

Written by Douglas McPherson
Solutions Consultant at Juniper

#routingengine
#ExpertAdvice
#junoscript
#Slax
#eventscript
