
How To Select the Right NOC Service Provider to Scale Your MSP Business

Author: Javid Khan, CTO, IT By Design

Are you looking to:

  • Increase your profitability?
  • Lower your upfront costs?
  • Focus your business on what you do best?
  • Be prepared to scale?

If the answer is "yes" to one or more of the above questions, then you should consider an outsourced NOC. But how do you ensure you choose the right one?

Because of the complexities involved with today's networks and business continuity services, especially in light of increased usage of cloud-based infrastructure and SaaS applications, outsourcing can be a practical way for your MSP to expand its network operations center (NOC) capabilities. But much is at stake when outsourcing a NOC, especially regarding security measures and around-the-clock network management. You simply can't afford to partner with a provider that doesn't adhere to the right NOC practices.

Despite being critical to the success of a technical support operation, the majority of NOCs fail to extend robust service support. Most of the time, the root cause of an underperforming NOC is the lack of a centralized support framework that incorporates and executes best practices. Therefore, evaluate your NOC choice against the following best practices, which have been tested for process efficiency and network performance.

Multi-tier NOC activities management

The best NOCs perform fault detection, troubleshooting, and tracking to identify and resolve network issues in different tiers. By following an operational methodology that uses a tiered support structure aligned with the ITIL framework, a NOC can respond rapidly to incidents.

Classifying NOC activities is often the first step in implementing a tiered structure. For best efficiency, Tier 1 of the NOC should provide 24x7 surveillance of the core and local IP, TDM, DWDM, and FTTC networks. It should also troubleshoot customer issues by correlating them with network events, and handle remote diagnostics and restoration of the network. Tier 2 should coordinate planned network change events and drive the root-cause analysis of unplanned interruptions. Tier 3 must coordinate and escalate cases to the network planning and engineering team.

Clear escalation path(s)

Building an escalation table and understanding how incidents are prioritized by business impact can help foster NOC efficiency. To maintain a reasonable escalation path, all team members should be clear on the proper protocol and channels for escalating issues. For example, a critical problem that is not solved within 30 minutes should be escalated up the management ladder until someone responds and/or takes ownership.
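The 30-minute rule above can be sketched as a small time-based lookup. This is only an illustration under stated assumptions: the ladder of roles and the function name escalation_target are hypothetical, and the same 30-minute interval is assumed for every rung, which the article does not specify.

```python
from datetime import datetime, timedelta

# Assumption: each rung of the ladder gets 30 minutes, mirroring the
# article's example threshold for critical problems.
ESCALATION_INTERVAL = timedelta(minutes=30)

# Hypothetical management ladder; a real NOC would define its own roles.
ESCALATION_LADDER = [
    "on-call engineer",
    "shift lead",
    "NOC manager",
    "director of operations",
]


def escalation_target(opened_at: datetime, now: datetime) -> str:
    """Return who should hold an unresolved critical incident at `now`."""
    # Guard against clock skew producing a negative elapsed time.
    elapsed = max(now - opened_at, timedelta(0))
    # One rung per full interval elapsed, capped at the top of the ladder.
    rung = min(elapsed // ESCALATION_INTERVAL, len(ESCALATION_LADDER) - 1)
    return ESCALATION_LADDER[rung]


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 9, 0)
    for minutes in (10, 45, 300):
        print(minutes, escalation_target(opened, opened + timedelta(minutes=minutes)))
```

In practice the check would stop as soon as someone acknowledges the incident; that acknowledgement step is left out here to keep the sketch short.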

Identification and mitigation of incidents based on priority

Inconsistency is one of the main reasons NOCs don't perform at optimal levels. Being reliably consistent requires a standardized process framework that arms a NOC with specific procedures for handling various support situations. A priority-based ticketing system can enable a NOC to keep track of all open issues based on severity and urgency.

In most NOCs, issues should be prioritized and organized into a set of queues so that each can be handled by the appropriate group. With a classification system in place, a NOC team can determine which incidents have the biggest impact on network operations. An efficient incident response and triage hierarchy designates in advance which team member should handle P1, P2, and P3 incidents, which keeps everyone on the same page.
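A priority-based queue of this kind can be sketched with a heap. The owner roles in OWNERS and the TriageQueue class are hypothetical names chosen for illustration; the article only says that P1, P2, and P3 incidents should each have a designated handler.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical routing table: which role owns each priority level.
OWNERS = {1: "senior engineer", 2: "tier 2 engineer", 3: "tier 1 operator"}


@dataclass(order=True)
class Ticket:
    priority: int                        # 1 = P1 (most severe)
    sequence: int                        # arrival order, breaks ties
    summary: str = field(compare=False)  # not used for ordering


class TriageQueue:
    """Hands out the most severe ticket first; oldest first within a priority."""

    def __init__(self) -> None:
        self._heap: list[Ticket] = []
        self._counter = count()

    def add(self, priority: int, summary: str) -> None:
        if priority not in OWNERS:
            raise ValueError(f"unknown priority: {priority}")
        heapq.heappush(self._heap, Ticket(priority, next(self._counter), summary))

    def next_ticket(self) -> tuple[Ticket, str]:
        """Pop the next ticket and return it with its designated owner."""
        ticket = heapq.heappop(self._heap)
        return ticket, OWNERS[ticket.priority]


if __name__ == "__main__":
    queue = TriageQueue()
    queue.add(3, "slow VPN for one user")
    queue.add(1, "core router down")
    ticket, owner = queue.next_ticket()
    print(ticket.summary, "->", owner)
```

The sequence counter keeps tickets of equal priority in arrival order, so an older P1 is never starved by a newer one.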

Process automation

For enterprise networks, NOC teams can face more than 10,000 network incidents per month on a varying scale. Handling this volume manually, even with a large team, is nearly impossible. Automation can make a NOC more effective by providing end-to-end visibility, faster diagnoses, and streamlined collaboration. IT process automation also empowers a Tier 1 team to deal with issues that otherwise might require escalation to the Tier 2 team.

It's not unusual for NOC teams to face the recurrence of a previous issue. By automating best practices for previously solved problems, NOCs can significantly reduce MTTR. Automating the diagnosis process is also critical for any NOC when there are thousands of incidents every week, each with thousands of potential root causes.
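As a minimal sketch of automating previously solved problems, a recurring issue can be matched against a runbook of known fixes, with anything unrecognized handed to a person. The signatures, remediation steps, and the handle_incident function are all hypothetical; a real NOC would draw them from its own tooling and documentation.

```python
# Hypothetical runbook: issue signature -> documented remediation steps.
RUNBOOK = {
    "disk_full": ["clear temporary files", "rotate logs"],
    "service_down": ["restart service", "verify health check"],
}


def handle_incident(signature: str) -> dict:
    """Return the automated plan for a known issue, or escalate an unknown one."""
    steps = RUNBOOK.get(signature)
    if steps is None:
        # No documented fix yet: hand off to an engineer.
        return {"action": "escalate", "steps": []}
    # Known issue: replay the documented steps (copied so callers can't mutate the runbook).
    return {"action": "auto_remediate", "steps": list(steps)}


if __name__ == "__main__":
    print(handle_incident("disk_full"))
    print(handle_incident("bgp_flap"))
```

Each newly resolved incident would add an entry to the runbook, so fewer issues need escalation over time.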

Dynamic documentation

Documentation is essential to a NOC's ability to function efficiently over the long term. This process includes building playbooks, documenting workflow processes, creating structured databases for storing and retrieving information, and recording business results for analysis and optimization.

An ideal NOC service provider must document all incidents to create a centralized source of information for its staff. This knowledge base should be accessible to all team members and contain structured information about previously resolved issues, highlighting the most common ones. It's critical to the success of the team to treat these as living documents. NOC engineers resolve incidents more quickly and efficiently when they can rely on documented experience.

Platform integrations and consolidated data for action

Most NOCs need to bring customer portals, knowledge bases, playbooks, and workflow management tools into the NOC. Without proper integrations connecting these tools and platforms, NOC engineers are faced with tracking and managing multiple screens for incident information; manually collecting information from multiple sources for the purposes of documentation, notification, and escalation; and then attempting to manage workflow toward service restoration. This makes it nearly impossible to monitor and report on SLA metrics, let alone optimize performance. The results inevitably include operational inefficiencies, missed SLAs, and undue stress on NOC operators.

Meaningful operational metrics

Without an understanding of alarm activity, ticket activity, and common causes for outages and trends, a NOC team remains limited to responses that are reactive and tactical, rather than proactive and strategic.

Since the amount of data available to a NOC is daunting, a service provider must choose the metrics and KPIs that are specific and actionable. Some KPIs to consider include first-call resolution, percentage of abandoned calls, average time to restore, and the number of tickets and calls handled.
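The KPIs listed above can be computed from basic call records, as in the sketch below. The Call record and the kpis function are hypothetical, and the formulas are one common set of definitions rather than a standard: first-call resolution is taken here as a share of answered calls, and abandoned calls as a share of all calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    abandoned: bool
    resolved_on_first_call: bool = False
    minutes_to_restore: Optional[float] = None  # None if no restoration was needed


def kpis(calls: list[Call]) -> dict:
    """Summarize a batch of calls into the KPIs named in the article."""
    total = len(calls)
    answered = [c for c in calls if not c.abandoned]
    restore_times = [
        c.minutes_to_restore for c in answered if c.minutes_to_restore is not None
    ]
    first_call = sum(c.resolved_on_first_call for c in answered)
    return {
        "calls_handled": len(answered),
        "abandoned_pct": 100 * (total - len(answered)) / total if total else 0.0,
        "first_call_resolution_pct": 100 * first_call / len(answered) if answered else 0.0,
        "avg_minutes_to_restore": (
            sum(restore_times) / len(restore_times) if restore_times else None
        ),
    }


if __name__ == "__main__":
    sample = [
        Call(abandoned=False, resolved_on_first_call=True, minutes_to_restore=20),
        Call(abandoned=False, resolved_on_first_call=False, minutes_to_restore=40),
        Call(abandoned=True),
        Call(abandoned=False, resolved_on_first_call=True, minutes_to_restore=30),
    ]
    print(kpis(sample))
```

When comparing providers, it is worth asking how each one defines these figures, since a different denominator can change a percentage noticeably.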

Business continuity plan

A business continuity plan (BCP) is essential for managing risk in any NOC operation. The BCP provides a blueprint for NOC staff to follow when recovering from a disaster. Without an effective BCP in place, a NOC will almost certainly remain vulnerable if a disaster or significant workforce disruption impacts the operation.

Key representatives from a cross-section of a NOC need to be involved in creating a BCP. This may even include outside vendors. An analysis of all security threats, a list of action items required to maintain operations (both for short- and long-term interruptions), and steps required to make the backup site(s) operational should be considered before agreed-upon processes and procedures are documented for future operational reference.

Scalable operation

A NOC's scalability is a measure of its ability to handle a growing amount of work without compromising the level of service. The ability to grow or absorb expansion requires a high staff utilization percentage across the various NOC activities (at least 80%), a distributed, redundant architecture (bandwidth, CPU, memory, etc.) that can expand, and cutting-edge monitoring tools.

Comprehensive staff onboarding and ongoing training

A new NOC operator, in the absence of training, can unintentionally cause equipment damage or downtime of critical business services. Therefore, an extensive onboarding program covering users and permissions, troubleshooting, teams, and important contacts should be put in place by a service provider for new NOC employees. A truly comprehensive training program can take up to six months before an engineer is ready to take on NOC support responsibilities.

After work has begun, monthly or quarterly training sessions should be scheduled to reflect on upskilling opportunities. In the world of network management, change is constant. Failing to provide training on emerging NOC technologies and tools always has consequences.

There are plenty of good reasons for MSPs to outsource their NOC to a service provider, but doing so comes with inherent risk. While juggling servers, databases, firewalls, and IoT devices, NOC teams can suffer from a lack of insight into emerging technologies and technical know-how, alongside poor communication and collaboration across teams. By following the best practices above, a NOC can use processes and technology to maximize network availability and improve performance. Scaling your MSP successfully and choosing the right NOC partner require you to do your due diligence and ask detailed questions about these best practices.


This guest blog is courtesy of IT By Design. Read more IT By Design guest blogs here. Regularly contributed guest blogs are part of ChannelE2E’s sponsorship program.