Expert Advice: Embracing Automation in Your Data Center: Work On Your Six-Pack

By Erdem posted 04-19-2015 16:42

Embracing a New Philosophy for Automation

In the data center, the value of automation is apparent when considering a series of manually-demanding tasks that are required to be repeated at scale. If we organize machines to execute these tasks, we typically see huge increases in speed allied with repeatable precision. Humans just cannot compete with a machine’s capacity to execute the same task over and over again with unfailing accuracy.

So in the quest for greater productivity, do we simply look to automate everything and ditch the business liability that is the human workforce?

Well, no. Firstly, the point of automation is to allow a business to re-focus its workforce on tackling higher-level issues that will have far greater impact on that business’s chance of success. Secondly, through automated design, we have a new opportunity to re-engineer how parts of the business operate so that they are better aligned to your overall business goals. Thirdly, and perhaps the most immediately tangible benefit of all, automation eliminates operational risk.

Considering Automation from the Outset

Automation is often discussed as though it is a set of tools, but that is only part of the story. The most powerful results come when you consider Automation in a top-down fashion, letting it shape how you design, build, and operate the network.

One example of this is in the realm of Network Design Verification Testing (N-DVT). As Engineers, we typically create test environments to validate a prospective network design. We build heavyweight test plans of many hundreds of tests to validate the various features enabled and to stress and discover limits of scale and availability. Upon completion of the plan, provided enough of the tests have passed, we satisfy ourselves that the design can be put into production.

But wait—do we really understand the potential behaviour of this new network?

Have we tested all possible scenarios and permutations?
Was the test plan over-simplified, or were certain tests de-scoped, because of a desire to hit a project deadline?
How much time was dedicated to analysis of the test results, or did we just accept a pass without checking more deeply that it was not a false positive?

The psychology behind this approach is probably what many have documented as the ‘optimism bias’. Whilst it is a well-meaning evolutionary trait that helps us tackle adversity, it can significantly skew our ability to assess true levels of risk.

This bias is introduced in the following ways:

Scenarios or permutations missed due to assumptions or an incomplete understanding of what the network is for
De-scoping of tests because they are considered low risk as ‘nothing has changed’ and have consistently passed in the past
Signing off on a design based on initial review of the test results. This is done because of the time it typically takes to write the full test report (sometimes weeks).

A fully automated approach protects you from this bias—in all its forms. It places test execution, result analysis, and result reporting into the machine space. This brings all the benefits machines have to offer – speed of execution, consistency across multiple executions, and reporting the results in an easily consumed format.

This releases engineers to do what they are good at: coming up with new tests to automate, analyzing the test results and, most importantly, communicating with the various stakeholders to ensure effective coordination.

But how much faster than a person can a machine be at performing testing? Based on our own experiences, a machine can do in one hour what would take a person one month. Counting a working month as about 160 hours, that is a speed-up of roughly 160 times. Such a massive reduction in time makes it practical to run every test in every execution. Assuming it takes 12 hours to execute all tests, you could evaluate a different version of software every single day.
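
The 160-times figure works out if a person-month is taken as 20 working days of 8 hours. That conversion is our assumption for illustration; your own working calendar may differ. A minimal sketch of the arithmetic, including the daily cadence that a 12-hour full run allows:

```python
# Assumption for illustration: one person-month = 20 working days of 8 hours.
WORK_DAYS_PER_MONTH = 20
HOURS_PER_WORK_DAY = 8


def speedup(manual_hours: float, machine_hours: float) -> float:
    """Return how many times faster the machine run is than the manual one."""
    return manual_hours / machine_hours


manual_hours = WORK_DAYS_PER_MONTH * HOURS_PER_WORK_DAY  # 160 hours of manual effort
factor = speedup(manual_hours, machine_hours=1)

# A 12-hour full run fits inside a 24-hour day, so a complete
# evaluation of one software version per day is feasible.
full_run_hours = 12
full_runs_per_day = 24 // full_run_hours

print(f"speed-up: {factor:.0f}x, full runs that fit in a day: {full_runs_per_day}")
```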

This approach can be truly transformative to the business. No longer is testing a cumbersome, incomplete exercise where resources seemingly disappear for months at a time. Instead, it becomes responsive, transparent, and guarantees the network delivers promised capabilities visibly aligned to the business.

Work on Your Six-Pack

When you consider the Operational lifecycle as a whole, from design through to build and operation, Juniper Networks is considering six key areas of focus for any business looking to fundamentally de-risk their network operation whilst accelerating their service deployment process.

There are six use cases for automation in the Operational Lifecycle:

Certification
Build
Product Lifecycle
Service Lifecycle
Audit
Troubleshoot

1. Certification

As outlined in our N-DVT example above, an automated approach allows more effective characterization of your network’s behavior by weeding out the potential for human short-cutting or oversight, and lets people focus on analysis of data rather than test execution. This should be allied with a test environment that is as close to identical to your production environment as possible (or your production should mirror your lab; it rather depends on your point of view!). In addition, a ‘test everything, every time’ approach should be taken regardless of how trivial the design change is considered. Furthermore, the automated test cases should be written to test behavior, not state (what the network must do, rather than what a given device reports at one moment). This will allow you to build the tests once and use them for the lifespan of the network.
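
To make the ‘behavior not state’ distinction concrete, here is a toy sketch. A state test would assert something like ‘this link is up’; the behavior test below asserts that two endpoints can still reach each other after the loss of any single link. The topology, the node names and the graph search standing in for real traffic generation are all hypothetical.

```python
from collections import deque

# Hypothetical lab topology: two leaves, each dual-homed to two spines.
LINKS = {
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
}


def reachable(links, src, dst):
    """Breadth-first search over undirected links.

    In a real lab this would be replaced by sending test traffic.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def survives_any_single_link_failure(links, src, dst):
    """Behavior test: the path must survive the loss of any one link."""
    return all(reachable(links - {failed}, src, dst) for failed in links)


print(survives_any_single_link_failure(LINKS, "leaf1", "leaf2"))
```

Because the test names no particular interface or device value, it stays valid if the design later gains a third spine or different hardware.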

Version control is a concept long familiar to software developers and should be applied to the network design process too. Wouldn’t it be great if a configuration template within your network ‘source code’ resulted in automated provision of this new ‘network version’? How about following that up with automated traffic flow generation and test battery execution with all results captured and normalized for trend analysis whilst highlighting failures that require deeper inspection?

It sounds too good to be true, but a lot of the heavy lifting has already been provided by the Open Source community: source control, automatically building the network, running through the tests, accessing the data, and graphing and centrally archiving the results. Combine that with the power of Junos to present all configuration and state in a structured (XML-type) format that can be retrieved using simple API calls (via NETCONF), and you are there.
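
As a rough illustration of why structured output matters, the sketch below turns an XML reply into a Python dictionary using only the standard library. The sample reply is hand-written and simplified; a real reply would arrive over NETCONF, carry XML namespaces and contain many more fields, so treat the element handling here as an assumption to adapt.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample shaped like an interface status reply.
# Real replies include XML namespaces and far more detail.
SAMPLE_REPLY = """
<interface-information>
  <physical-interface>
    <name>ge-0/0/0</name>
    <oper-status>up</oper-status>
  </physical-interface>
  <physical-interface>
    <name>ge-0/0/1</name>
    <oper-status>down</oper-status>
  </physical-interface>
</interface-information>
"""


def interface_status(xml_text):
    """Map each interface name to its operational status."""
    root = ET.fromstring(xml_text.strip())
    return {
        intf.findtext("name").strip(): intf.findtext("oper-status").strip()
        for intf in root.iter("physical-interface")
    }


status = interface_status(SAMPLE_REPLY)
down = sorted(name for name, state in status.items() if state == "down")
print(status, down)
```

No screen-scraping or regular expressions are needed, which is what makes the results easy to normalize and trend.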

2. Build

The automated provisioning used within the Certification step can be re-used upon arrival at that shiny new data center. This not only accelerates the network stand-up but ensures you are deploying your network as designed. It also obviates the need for someone to stage the equipment (another potential source of risk and therefore delay). At time of writing, there is at least one Open Source example of configuration management tooling that can drive configuration changes via console thus eliminating the need for IP management before you can begin.

3. and 4. Product and Service Lifecycle

Products are the devices in your network whether physical or virtual. We understand how important an inventory of these devices is but how well is it maintained? Is it always synchronized with the live network for adds and removals? Does it have full visibility of network software version changes? To get the most accurate readings, your network inventory should be ‘plugged’ into the live network so new devices and any software changes are detected in real-time, i.e. automatically. This is absolutely key so your organization can get a handle on potential issues with specific hardware or software before they manifest themselves.
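
A minimal sketch of that reconciliation, assuming you can obtain two mappings of device name to software version: one from your inventory records and one discovered from the live network. The function name, device names and version strings are hypothetical.

```python
def reconcile(recorded, discovered):
    """Compare the recorded inventory with what the live network reports.

    Both arguments map device name -> software version.
    """
    recorded_names, live_names = set(recorded), set(discovered)
    return {
        "added": sorted(live_names - recorded_names),
        "removed": sorted(recorded_names - live_names),
        "version_changed": sorted(
            name
            for name in recorded_names & live_names
            if recorded[name] != discovered[name]
        ),
    }


# Hypothetical data for illustration only.
recorded = {"leaf1": "1.0", "leaf2": "1.0", "old1": "1.0"}
discovered = {"leaf1": "1.0", "leaf2": "1.1", "new1": "1.0"}
report = reconcile(recorded, discovered)
print(report)
```

Run on a schedule, or triggered by network events, a check like this keeps the inventory from silently drifting away from reality.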

Probably the greatest opportunity for network destabilization is in the provision and tear-down of services, simply because these constitute the bulk of operational changes. Add to that the fact that many Enterprises make changes via the Command Line Interface (CLI) without strict control of service configuration templates, and you have a disaster waiting to happen. We might blame lack of skill for introducing these problems, with remediation focused on better training and more rigor in peer review. However, the Automation illuminati would argue strongly that the narrow corridor of Service Delivery operation should have been clearly identified as part of Certification, and then enforced through system-driven workflow or even fully software-driven service rollout.
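
One way to picture that ‘narrow corridor’ is a service template that only accepts validated parameters, so an operator cannot type free-form configuration. The sketch below is an assumption on our part, not a prescribed workflow: the template text, the naming rules and the port pattern are illustrative and would need to match your own certified design.

```python
import re
from string import Template

# Illustrative template for a simple VLAN service (set-style configuration).
SERVICE_TEMPLATE = Template(
    "set vlans $name vlan-id $vlan_id\n"
    "set interfaces $port unit 0 family ethernet-switching vlan members $name"
)


def render_service(name, vlan_id, port):
    """Render the service configuration, rejecting out-of-corridor input."""
    # Hypothetical naming rule: letter first, then letters, digits, _ or -.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_-]{0,31}", name):
        raise ValueError(f"invalid service name: {name!r}")
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN id out of range: {vlan_id}")
    # Hypothetical port rule: only these interface types are allowed.
    if not re.fullmatch(r"(ge|xe|et)-\d+/\d+/\d+", port):
        raise ValueError(f"invalid port: {port!r}")
    return SERVICE_TEMPLATE.substitute(name=name, vlan_id=vlan_id, port=port)


print(render_service("web", 100, "ge-0/0/1"))
```

The same validated template can then be exercised during Certification and reused unchanged for every production rollout.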

5. Audit

Assuming you have a solid view of your device inventory then the next step to capitalizing on this is to cross-check it regularly for any potential software or hardware issues as published by your vendors. This is an important pro-active step that allows you to eliminate risks before they become the production issues that so often collapse network operation into a state of fire-fighting. In this panic state all optimization activities are nearly always put on hold and so events like this stagnate progression, not to mention the time spent chewing over how you got there in the first place! If you are able to put in place an automated mechanism to provide this cross-check then, in our opinion at least, you will be one step closer to network nirvana.
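
The cross-check itself can be very simple once both data sets are in a structured form. In this sketch the inventory records, the advisory records and their field names are all hypothetical; real vendor advisory feeds have their own formats and would need a parsing step first.

```python
# Hypothetical inventory and advisory records for illustration only.
INVENTORY = [
    {"name": "leaf1", "model": "model-a", "version": "1.0"},
    {"name": "leaf2", "model": "model-a", "version": "1.1"},
    {"name": "spine1", "model": "model-b", "version": "1.0"},
]
ADVISORIES = [
    {"id": "ADV-1", "model": "model-a", "versions": {"1.0"}},
    {"id": "ADV-2", "model": "model-b", "versions": {"0.9"}},
]


def affected(inventory, advisories):
    """Return (device name, advisory id) pairs where model and version match."""
    return [
        (device["name"], adv["id"])
        for device in inventory
        for adv in advisories
        if device["model"] == adv["model"]
        and device["version"] in adv["versions"]
    ]


print(affected(INVENTORY, ADVISORIES))
```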

Automated auditing should also be considered for configuration to understand which parts of your network have drifted from your golden configuration (even with automated provisioning this is still likely owing to break-fix or the odd industrious operative electing to deviate from best practice). It should also be considered for network state as this ultimately defines how traffic is forwarded through the network. Network state is the heart rate and blood pressure of your network, and it should be observed and recorded before and after you have given the patient its dose of service re-configuration.
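
For configuration drift, a plain text diff against the golden configuration is often enough to start with. The sketch below uses Python's standard difflib module; the two configuration snippets are made-up examples, and in practice the running configuration would be retrieved from the device.

```python
import difflib


def config_drift(golden, running):
    """Return unified-diff lines; an empty list means no drift."""
    return list(
        difflib.unified_diff(
            golden.splitlines(),
            running.splitlines(),
            fromfile="golden",
            tofile="running",
            lineterm="",
        )
    )


# Made-up snippets for illustration only.
GOLDEN = "set system host-name edge1\nset system services ssh"
RUNNING = GOLDEN + "\nset system services telnet"

drift = config_drift(GOLDEN, RUNNING)
print("\n".join(drift))
```

Lines prefixed with a plus sign exist only on the device, and lines prefixed with a minus sign are missing from it.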

How else can you track whether your intervention has created any potentially unsavoury side-effects? In fact, for the most robust assessment you should check every network vital statistic for deviation against pre-determined limits and thresholds, and this is where Automation can help you manage the scale. An Automated network state validation process should therefore be intrinsic to any Service Delivery process if you are to be sure of the continuing health of the network.
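
A before-and-after check of this kind might be sketched as follows. The metric names and the allowed deviations are hypothetical; the snapshots would normally be collected from the devices immediately before and after the change.

```python
# Hypothetical limits: the largest change allowed in each metric.
MAX_CHANGE = {"bgp_peers_up": 0, "route_count": 50}


def state_violations(before, after, max_change):
    """Return the metrics whose change exceeds the allowed deviation."""
    violations = {}
    for metric, limit in max_change.items():
        delta = after[metric] - before[metric]
        if abs(delta) > limit:
            violations[metric] = delta
    return violations


# Hypothetical snapshots taken around a service change.
before = {"bgp_peers_up": 4, "route_count": 1200}
after = {"bgp_peers_up": 3, "route_count": 1210}

violations = state_violations(before, after, MAX_CHANGE)
print(violations)
```

An empty result lets the workflow proceed; anything else is a signal to stop and roll back or investigate.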

6. Troubleshoot

Bad things can happen (sorry) and in the event that they do, you will want immediate and relevant data capture. The best way to achieve this is through on-device event handling that alerts your operations team to the specific device(s) actually having the issue. Firstly, this eliminates the time needed to hunt for the problem’s source. Secondly, it provides you with all the relevant data captured at the time of the issue, so that your vendor support organization can respond with full root-cause analysis in the fastest possible time.

In conclusion, an Automation philosophy that spans your Operational lifecycle will improve your Speed by orders of magnitude but where it is fundamentally of value is in eliminating Risk and permitting organizational Focus on the science of analysis and optimisation.

How To Get Started

Introducing something new is difficult and typically the first time you do something in a new way the old way would have been faster. Where Automation really pays off is on tasks that are to be executed a large number of times. Unfortunately, there is no magic number of iterations whereby Automation should be considered – you will need to assess investment in effort versus reward in terms of speed of execution and consistency of results for your specific requirement.
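
While there is no universal number, you can estimate a break-even point for your own task. The sketch below is a deliberately crude model that counts time only and ignores the consistency benefit; the function name and the example figures are hypothetical.

```python
import math


def break_even_runs(build_hours, manual_hours_per_run, automated_hours_per_run):
    """Return the number of runs after which automation has repaid its build cost.

    Returns None when automation saves no time per run, in which case
    the justification has to come from consistency rather than speed.
    """
    saving = manual_hours_per_run - automated_hours_per_run
    if saving <= 0:
        return None
    return math.ceil(build_hours / saving)


# Hypothetical figures: 40 hours to build, 2 hours by hand, 15 minutes automated.
runs = break_even_runs(build_hours=40, manual_hours_per_run=2, automated_hours_per_run=0.25)
print(runs)
```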

Here are some best practices for getting started:

Introduce automation to a process you completely control and avoid processes dependent on people/functions you do not control. This will allow you to accurately measure the level of effort it took to Automate and the level of benefit. This should then be shared with relevant colleagues so they can see the impact – helpful for when your ambition extends to cross organizational boundaries.
Choose something relatively small and a process you understand very well so that the overhead you introduce learning to Automate will be marginal. Often attempts to automate fail catastrophically if you are learning the process at the same time.
Do not expect to get it right the first time. Automation is not a state, it is a process and one that your organisation can and will learn to improve over time. As with software development, efficiencies and improvements are made through iteration.
Do expect that as more things become automated, new approaches and practices will become available. In the industrial age, large-scale automation permitted large-scale cultural shifts, which in turn permitted the great innovation that marks the Information age of today. An organization built on an Automation philosophy should expect a similar cultural shift, albeit in microcosm.
Review your end goal every six months or so to make sure it is still relevant.
Do not be discouraged if you have large ambitions for automation but, upon evaluation, the energy required to get started appears too great. Instead, seek help from the professionals and make sure the upskilling of your organization is part of this engagement. The rewards in doing so will be truly immense.

Written by David Gethings, Solutions Consultant at Juniper Networks

Written by Richard Balmer, Senior Systems Engineer at Juniper Networks Financial Services

#networkautomation
#ExpertAdvice