Increase Implementation Success With a Proven Network Change Management Process
Every network engineer fears unexpected results while implementing a change at 2:00 AM. Unanticipated impact experienced during or after changes can affect relationships with your management, business units and customers. This can lead to greater scrutiny and a lack of trust when planning future changes.
Though there is always a chance for unexpected issues (e.g., bugs) to occur during a change, critical network change management steps can be taken to reduce the likelihood of such occurrences. The best way to avoid undue stress while implementing a change is by spending a little extra time in the planning stages.
This guide outlines the essential change management process you should follow during the planning phase of a network change. It is based on the firsthand experiences of the Optanix Remote Management Service (RMS) team, which has implemented thousands of network changes for our customers. The steps below have helped us drive a high change implementation success rate within our teams. As a result, some of our customer IT organizations have adopted these steps to bolster their methodologies around change governance.
For context, an estimated 90% of the network changes that Optanix RMS engineers implement are fixes to resolve incidents. As a result, our teams implement changes in live production environments, where we need to be certain that we will not cause any impact beyond what is anticipated. However, the steps in this guide can also be applied as new features or services are integrated into a production environment.
6 Steps to Increase the Success Rate of Network Change Implementations
Follow the steps in this section when planning your next network change to improve the odds of that change being implemented successfully.
Step #1: Identify a Measurable End State
Often when approaching a change to resolve an incident, the end state can be very easily defined. For example:
- Module 6 has an operational status and connected hosts can communicate without any errors.
- A data circuit has 0% packet loss and the applications using the circuit are experiencing the necessary performance.
For more complex changes, there may be multiple factors to measure to identify success, but they should be clearly documented along with the expected results. Without knowing where we are going, we are unable to identify how to get there, which limits the ability to be efficient and successful.
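One way to keep an end state measurable is to write each success criterion down next to its expected result and a check that can be evaluated after the change. The following is a minimal sketch in Python, not a description of Optanix tooling. The criterion names, the 50 ms threshold, and the measured values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """One measurable success condition for a change."""
    name: str
    expected: str                    # human-readable expected result
    check: Callable[[float], bool]   # True when the measured value is acceptable


def unmet_criteria(criteria: List[Criterion], measurements: Dict[str, float]) -> List[str]:
    """Return the names of criteria that failed or were never measured."""
    failed = []
    for criterion in criteria:
        value = measurements.get(criterion.name)
        if value is None or not criterion.check(value):
            failed.append(criterion.name)
    return failed


# Hypothetical end state for a data circuit repair.
criteria = [
    Criterion("circuit packet loss (%)", "0", lambda v: v == 0),
    Criterion("application round-trip time (ms)", "<= 50", lambda v: v <= 50),
]

# Hypothetical values gathered after the change.
measurements = {
    "circuit packet loss (%)": 0.0,
    "application round-trip time (ms)": 72.0,
}

print(unmet_criteria(criteria, measurements))
```

A criterion with no measurement is treated as unmet, so a forgotten check surfaces as a failure rather than passing silently.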
Step #2: Research the Path to Take
With a clear goal (or goals) defined, you can now identify how to achieve it. Thoroughly researching ahead of time how to reach the desired end state avoids spending time on research during the implementation. This should eliminate the need for unexpected decisions or additional research during the change window.
Researching how to accomplish a particular task can take many forms. Examples of research methods include, but are not limited to:
- Reading a vendor’s documentation to understand the requirements for a feature
- Reading the bug details to understand the workarounds or permanent fixes
- Engaging a vendor to share any information that is not public about a bug
Other than understanding the specific feature, it is also important to understand the current network configuration. This could involve reviewing network diagrams, design documentation and the current state of the environment.
For any non-trivial changes, or actions that have not been implemented before, it is beneficial to test these in a non-production environment. While it can be difficult to get an exact replica of the production environment in a test or lab environment, validating as much as possible can identify areas of concern. It can also help identify areas where additional implementation or verification actions may be necessary and will help increase confidence going into the change window.
Step #3: Identify Impact and Risk
It is your role as a network engineer to understand what technical impact your actions may cause and what technical risk is associated with those actions. The network does not exist in a vacuum, and any actions that you perform can have a downstream effect on the services that depend on the network. For example, if your change is to replace a 48-port access layer switch that experienced an intermittent hardware failure, the following may be true:
- Up to 48 connected end devices will be impacted during the change as they will not have network connectivity and no redundant network connectivity exists for these end devices.
- If the switch is not replaced soon, the switch is at risk of experiencing another intermittent hardware failure at an unpredictable time.
By working with your organization’s stakeholders, you can determine the business risk of the change you are proposing. Based on your technical impact and risk assessments, the business can weigh its own risk and decide how to proceed. This will then influence how you implement your change, such as:
- Dictating the change window
- Identifying other technical teams that should be involved (e.g., application, server, cloud, DevOps)
- Determining any mitigation actions that should be implemented at different levels of the technology stack to avoid impact
Impact and risk assessments are a collaborative effort. By working with other technical teams and your organization’s stakeholders, you can identify the best path forward for your change with the least impact and risk.
Step #4: Create an Implementation and Backout Plan
Engineers within a managed service provider (MSP) organization are often scheduled on shifts, which means the engineer preparing the change may not be the one implementing the change. When it comes to network change management, close attention should be paid to documenting the specific steps within an implementation and backout plan so that there is no ambiguity. The order in which actions are to be taken, the specific commands to use on a CLI or options to check within a GUI, and the description of what is being performed should be as obvious as possible.
Several principles to keep in mind when documenting these plans are:
- Each step should have a single specific purpose and that purpose should be described.
- When multiple network devices are involved, each step should belong to a single device, and it should be obvious which device the step is for.
- Verification steps should be included throughout the implementation process to confirm that the network device’s state reflects what was just performed. This will help avoid implementing many steps only to realize that an unexpected issue was experienced at step 1.
- For complex changes, separate an implementation plan into phases. This can help when communicating with stakeholders about the status of the implementation of a change, and also when reviewing a change.
The same level of detail should still be documented even when the same engineer is both planning and implementing a change. Change tickets/documents are often a historical reference of what was performed and could be used by other engineers in the future to implement the same or a similar change. Following the principles listed above can help increase network change efficiency for your organization.
Backout plans do not need to be an exact reverse of the implementation plan. The goal of a backout plan is often to restore service in the event that what was implemented caused unexpected results and impact. While it may have taken 10 steps to configure a new feature, often that feature can be disabled in a single step.
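The documentation principles in this step can also be checked mechanically before a plan goes to review. The sketch below assumes a plan is held as a list of structured steps. The `Step` fields, the `lint_plan` helper, the device name, the placeholder commands, and the limit of three steps between verifications are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    device: str                    # the single device this step applies to
    purpose: str                   # why the step exists
    action: str                    # exact CLI command or GUI action
    is_verification: bool = False  # True for steps that only check state


def lint_plan(steps: List[Step], max_unverified: int = 3) -> List[str]:
    """Return human-readable problems found in an implementation plan."""
    problems = []
    unverified = 0  # consecutive steps since the last verification
    for number, step in enumerate(steps, start=1):
        if not step.device.strip():
            problems.append(f"step {number}: no device named")
        if not step.purpose.strip():
            problems.append(f"step {number}: no purpose described")
        if step.is_verification:
            unverified = 0
        else:
            unverified += 1
            if unverified > max_unverified:
                problems.append(
                    f"step {number}: more than {max_unverified} steps without verification"
                )
                unverified = 0  # report once per run, not on every later step
    return problems


# Placeholder actions; real plans would list the exact commands.
plan = [
    Step("sw-access-01", "Enable the new feature on the uplink", "<configuration command>"),
    Step("sw-access-01", "Confirm the uplink is still up", "<show command>", is_verification=True),
    Step("", "Save the configuration", "<save command>"),
]

print(lint_plan(plan))
```

A check like this does not judge whether the steps are technically correct. It only catches the ambiguity that makes a plan hard for a different engineer to follow.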
Step #5: Create a Verification Plan
Now that there is a detailed understanding of the current environment and exactly how the change will be implemented, you can identify how to measure that the end state was met successfully. It is important to understand the full scope of the state to verify. For example, when upgrading a network switch, it would be easy to define the end state as simply confirming that the switch is running the new version of software. However, what if a critical uplink was left in a down state after the upgrade and nobody noticed? While it may not impact end users if there are redundant uplinks, the business is at risk.
The Optanix RMS team advocates for identifying verification plans for three key areas:
- Verifying the specific scope of the change (e.g., the switch is running the new version of software)
- Verifying the state of all other features/functions of the switch which were impacted during the change (e.g., when reloading a switch to perform an upgrade, the state of device hardware, network transit interfaces, trunks, layer 2 neighbors, routing, etc. should be verified)
- Verifying that critical business applications are reachable
The scope of each of these areas depends on the context of the change. Each verification step should be documented following the same principles outlined in the “Step #4: Create an Implementation and Backout Plan” section above (single specific purpose, associated with a specific device, etc.).
For key area 2 (verify state of all features/functions), a detailed snapshot of the device should be taken before any changes have been implemented and after the change is complete. This often involves gathering a large amount of CLI output to a log file before and after a change, then using a text diff tool to identify and review changes between the before and after state. When it comes to capturing state information on the device, the more output the better! One added benefit of gathering this level of data is that it makes it easier to identify whether or not a change implemented over a weekend is the cause of any issues experienced when the business opens on Monday morning.
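The before and after comparison described above can be done with any text diff tool. As one possible sketch, Python's standard `difflib` module can reduce two snapshots to just the lines that changed. The snapshot contents below are simplified, made-up text rather than real device output, and `changed_lines` is a hypothetical helper name.

```python
import difflib
from typing import List


def changed_lines(before: str, after: str) -> List[str]:
    """Return only the removed ('-') and added ('+') lines between two snapshots."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two lines are the '---'/'+++' file headers; '@@' lines mark hunks.
    return [line for line in diff[2:] if not line.startswith("@@")]


# Made-up snapshots standing in for captured CLI output.
before = "version 1.0\nuplink1 up\nuplink2 up"
after = "version 2.0\nuplink1 up\nuplink2 down"

for line in changed_lines(before, after):
    print(line)
```

In this example the version change is expected, while the `uplink2` line is the kind of unintended side effect that checking only the software version would miss.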
Step #6: Peer Review the Plan
After the implementation, backout, and verification plans have been created, have them peer reviewed by a colleague. A second set of eyes can help identify any issues that may be encountered, as each engineer involved has different prior experiences. It also allows knowledge to be shared across a team.
The peer reviewer is there to constructively criticize the plan and provide additional detail which can be incorporated, thus driving the success of a change. When reviewing a change, a peer should:
- Understand that they are equally responsible for the success of a change
- Review any related tickets (e.g., Incidents) to understand the history and context
- Perform a similar level of research as the engineer creating the change to confirm the technical steps of the change are accurate to the desired outcome
Ticket Systems and Change Documentation
Although an implementation, backout, and verification plan often aligns with fields in most ticket systems, Optanix RMS engineers have found that these plans are better documented in a program like Microsoft Word. Text boxes in ticketing systems do not allow you to convey meaning through proper white space, bullet/number points, and contextual highlighting (e.g., of CLI commands).
Changes detailed in a document like Word are not meant to replace a change ticketing system, but to supplement it and better assist an engineer who is implementing a change. This also allows for a lot more flexibility when it comes to linking to external reference documentation, embedding helpful images/screenshots or documenting specifics that do not match to a particular area of a change ticket (e.g., terminal servers and console ports to use for each network device).
The Future of Network Change Management
With automation and Infrastructure as Code (IaC) seeing an ever-increasing presence in organizations, network configuration management (and implementing changes) is changing dramatically. While these may seem to overcomplicate what has for a long time been a manual process, the benefits are clear: increased efficiency, scale, repeatability, and consistency, all leading to decreased implementation time.
Organizations taking a cautious approach to automation can still make use of it immediately for non-impacting tasks in their change process. Such tasks can include automating the gathering of before and after state snapshots from network devices, the analysis of network state to identify errors, and the verification of critical business applications (through synthetic transactions). As the comfort level grows with automation, using it for no- or low-impact changes can help an organization gain confidence in the automation solution before moving on to more complex changes.
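As a small illustration of a non-impacting task, the sketch below checks whether a list of critical applications accepts TCP connections. It is a simplified assumption of what such a check might look like: a TCP connect is far weaker than a full synthetic transaction, and the application names, hosts, and ports are placeholders. The probe is passed in as a parameter so the example can run here with a stand-in instead of touching a real network.

```python
import socket
from typing import Callable, Dict, List, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_apps(
    apps: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> List[str]:
    """Return the names of applications whose probe failed."""
    return [name for name, (host, port) in apps.items() if not probe(host, port)]


# Placeholder applications; replace with the services that matter to the business.
apps = {
    "intranet portal": ("portal.example.com", 443),
    "order database": ("db.example.com", 5432),
}


def fake_probe(host: str, port: int) -> bool:
    """Stand-in probe for demonstration: pretends only port 443 answers."""
    return port == 443


print(unreachable_apps(apps, probe=fake_probe))
```

Running the same check before and after the change window, with the default `tcp_reachable` probe, separates applications that were already unreachable from those affected by the change.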
This blog post was authored by Brian Yaklin, a senior member of Optanix’s route/switch engineering team. It is part of an ongoing series of posts by Optanix engineers focusing on the importance of security in the IT management space.