As a managed service provider (MSP), you must be able to explain your alert handling process to your clients clearly and effectively. Failure to do so may leave clients with the impression that they are not being adequately taken care of, or that your best efforts are falling short — even if you are following industry best practices, with a competent and fully provisioned team that resolves issues in a timely manner.
This guide provides a framework for communicating your alert handling process in a way that your customers can understand. This ensures that realistic expectations are set and that trust is maintained during incidents.
What is the difference between alerts and incidents?
An incident is an interruption to the IT services you supply to, or manage for, your clients. Incidents are usually unplanned, and can range from degraded internet service due to a failed connection to a complete file server outage due to a system crash. Cybersecurity incidents such as data breaches may not interrupt service immediately, but could lead to other negative side effects.
An alert is a notification produced as a result of an incident and sent to your tech team so that they are aware of it. It can be generated by monitoring systems or cybersecurity platforms, and may be sent on the first occurrence of an incident (e.g., a server stops responding), or when a specific threshold is reached (e.g., a user reaches a set number of failed login attempts). Alerts can also be human-driven, coming in the form of a support ticket, and can reveal unusual activity or malfunctions in your infrastructure that automated systems may miss.
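To make the threshold idea concrete, here is a minimal Python sketch of a failed-login alert. The class name, the threshold of 5 failures, and the 10-minute window are hypothetical values chosen for illustration; they are not taken from any particular monitoring product.

```python
from collections import defaultdict, deque

# Hypothetical values chosen for illustration only.
FAILED_LOGIN_THRESHOLD = 5   # failures before an alert is raised
WINDOW_SECONDS = 600         # only count failures from the last 10 minutes


class FailedLoginMonitor:
    """Raises an alert when a user has too many failed logins in a time window."""

    def __init__(self, threshold=FAILED_LOGIN_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self._failures = defaultdict(deque)  # user -> timestamps of recent failures

    def record_failure(self, user, timestamp):
        """Record one failed login; return an alert dict if the threshold is reached."""
        recent = self._failures[user]
        recent.append(timestamp)
        # Drop failures that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            recent.clear()  # reset so one burst of failures produces one alert
            return {"user": user, "reason": "failed_login_threshold", "at": timestamp}
        return None


monitor = FailedLoginMonitor()
# Five failures within 40 seconds: only the fifth call returns an alert.
alerts = [monitor.record_failure("alice", t) for t in (0, 10, 20, 30, 40)]
```

The same five failures spread over a longer period than the window would not trigger an alert, which is the kind of trade-off worth explaining to clients when they ask why some events are flagged and others are not.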
What is the alert handling process?
The alert handling (or alert triage) process is how you configure and respond to alerts. An effective alert handling process will ensure sensible thresholds are set, and that alerts are prioritized and escalated when necessary, so that nothing is missed and critical incidents take priority over minor annoyances.
For your MSP team, the goal of the alert handling process is visibility without alert fatigue. For your clients, the process must be made understandable: you need to be able to demystify the technical reasons why your alert triage process is set up the way it is. This way, when a severe incident does inevitably occur, they will be reassured knowing that it has taken priority over other work in progress, and that your team will already be on top of it.
Explaining the alert handling process clearly
Technology can help you communicate your team’s readiness to clients. You can demonstrate your monitoring software, showing dashboards and notifications in action. You can share monitoring and alerting policies (including escalation tiers) through your MSP’s documentation platform, and use your helpdesk software for clear client communication and status updates during an incident.
Quarterly business reviews (QBRs) are an opportunity to show reports generated from monitoring and helpdesk metrics, demonstrating that your alert handling process has been effective.
When explaining your alert handling process, the following questions should be clearly answered for your clients:
- What happens when something goes wrong?
- How fast will you respond?
- How do I know issues won’t slip through the cracks?
None of these answers should feel abstract. They should be concrete, grounded in your policies, and borne out when a real-world need arises.
Requirement 1: Break down the alert lifecycle in plain language
Avoid technical jargon when explaining your monitoring systems and alert methods. Cover the following technologies and concepts, and explain how you use them to respond to your clients’ issues:
- Detection: Explain the monitoring tools you use to identify issues
- Ticket creation: Explain how tickets are created both automatically from monitoring systems, and by users to report issues
- Triage: Detail how your technician will validate and assess the priority of an alert
- Response: If the alert has detected an incident, walk through the process of how it is assigned and resolved
- Escalation: Make sure the customer understands that if an incident is not resolved promptly, the alert moves to higher-level staff or a third-party vendor (for example, it may be necessary to escalate to Microsoft support for a Microsoft 365 issue that your team cannot resolve on its own)
- Closure and reporting: Explain that resolution details and times are documented, included in reports, and used for future review and improvement
This can be aided with diagrams and flowcharts showing how different alerts progress. For example, you might show how an initially low priority alert is escalated as thresholds are reached.
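If a flowchart is not at hand, the lifecycle above can also be sketched as a small set of allowed transitions. The following Python example is an assumed, simplified workflow: the stage names mirror the list above, but the transition rules (for instance, closing a false alarm directly after triage) are illustrative choices, not a prescribed process.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    TICKETED = "ticket created"
    TRIAGED = "triaged"
    IN_RESPONSE = "response in progress"
    ESCALATED = "escalated"
    CLOSED = "closed and reported"


# Allowed moves between stages (an assumed, simplified workflow).
TRANSITIONS = {
    Stage.DETECTED: {Stage.TICKETED},
    Stage.TICKETED: {Stage.TRIAGED},
    Stage.TRIAGED: {Stage.IN_RESPONSE, Stage.CLOSED},  # CLOSED covers false alarms
    Stage.IN_RESPONSE: {Stage.ESCALATED, Stage.CLOSED},
    Stage.ESCALATED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


def advance(current, target):
    """Move to the target stage, rejecting moves the workflow does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def describe(path):
    """Validate each step in a list of stages and return a readable summary."""
    current = path[0]
    for target in path[1:]:
        current = advance(current, target)
    return " -> ".join(stage.value for stage in path)


# An alert that needed escalation before it could be closed.
summary = describe([
    Stage.DETECTED, Stage.TICKETED, Stage.TRIAGED,
    Stage.IN_RESPONSE, Stage.ESCALATED, Stage.CLOSED,
])
print(summary)
```

Writing the stages down this way makes it easy to show a client that an alert cannot skip triage or be closed without passing through a defined step.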
Requirement 2: Translate SLAs into scenarios
Practical scenarios can help reassure clients by giving them an example they can map onto their own business processes to assess the impact your alerting process will have.
For example, instead of just quoting a response time, contextualize it: “If your server goes offline at 2 a.m., our system immediately generates an alert, and an on-call technician responds within 15 minutes. We generally expect to have it back up within 30 minutes of that, and if we fail to do so, we will escalate to an on-site technician within the hour.”
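The timings in a scenario like this can be turned into concrete clock times for a client. The Python sketch below uses the 15-minute and 30-minute figures from the example; it assumes that "within the hour" means within one hour of the original alert, and the date is arbitrary. Adjust both to match your actual SLA.

```python
from datetime import datetime, timedelta

RESPONSE_TARGET = timedelta(minutes=15)  # technician responds after the alert
RESTORE_TARGET = timedelta(minutes=30)   # measured from the response deadline
ONSITE_ESCALATION = timedelta(hours=1)   # assumption: measured from the original alert


def sla_deadlines(alert_time):
    """Return the clock time by which each SLA commitment should be met."""
    respond_by = alert_time + RESPONSE_TARGET
    return {
        "respond_by": respond_by,
        "restore_by": respond_by + RESTORE_TARGET,
        "escalate_onsite_by": alert_time + ONSITE_ESCALATION,
    }


# Server goes offline at 2 a.m. (the date is arbitrary).
deadlines = sla_deadlines(datetime(2025, 1, 15, 2, 0))
for name, when in deadlines.items():
    print(f"{name}: {when:%H:%M}")
# respond_by: 02:15
# restore_by: 02:45
# escalate_onsite_by: 03:00
```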
Requirement 3: Clarify escalation paths
Explain how you make sure an incident will not get stuck in a queue, or with a technician who has been unable to resolve it. Cover:
- Who handles first response (Tier 1)
- Who escalates complex issues (Tier 2/3 or third-party vendor)
- How clients and stakeholders are notified during escalation
Requirement 4: Show your clients what they will see when something does go wrong
As part of your client onboarding, show them examples of what they will see during and after incidents have triggered alerts. This may include reports that show resolved alerts and uptime metrics or client-facing dashboards with live data.
Highlight exceptional cases in QBRs to demonstrate the effectiveness of your alert handling process.
Requirement 5: Differentiate between alerts and noise
You should explain the technologies behind your monitoring and alerts and how they enable your team to provide efficient services that prioritize uptime. Alert fatigue happens when technicians are inundated with alerts for minor issues, causing them to miss important incidents.
Show your clients how you avoid this, and how thresholds, filtering, and automation are used to make sure incidents are sensibly prioritized based on their severity.
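As a rough illustration of filtering and prioritization, the Python sketch below drops alerts under a severity cutoff, removes duplicates, and sorts what remains so the most severe items come first. The severity names, the cutoff, and the alert fields (`device`, `check`, `severity`) are hypothetical and would differ in a real monitoring platform.

```python
# Lower rank means more severe. These labels are illustrative assumptions.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "info": 3}
MIN_SEVERITY = "minor"  # anything less severe is treated as noise here


def triage(alerts, min_severity=MIN_SEVERITY):
    """Filter out noise and duplicates, then order alerts by severity."""
    cutoff = SEVERITY_RANK[min_severity]
    seen = set()
    actionable = []
    for alert in alerts:
        if SEVERITY_RANK[alert["severity"]] > cutoff:
            continue  # below the severity threshold
        key = (alert["device"], alert["check"])
        if key in seen:
            continue  # the same check on the same device is already queued
        seen.add(key)
        actionable.append(alert)
    # sorted() is stable, so alerts of equal severity keep their arrival order.
    return sorted(actionable, key=lambda a: SEVERITY_RANK[a["severity"]])


raw = [
    {"device": "printer-2", "check": "toner", "severity": "info"},
    {"device": "fs-01", "check": "disk", "severity": "minor"},
    {"device": "fs-01", "check": "ping", "severity": "critical"},
    {"device": "fs-01", "check": "ping", "severity": "critical"},
]
queue = triage(raw)
# The file server outage is first, the disk warning second;
# the toner notice and the duplicate ping alert are dropped.
```

A before-and-after view like this (four raw alerts, two actionable items) can help clients see why fewer notifications can mean better coverage.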
NinjaOne provides a unified solution for monitoring, alerting, ticketing, and documentation
NinjaOne gives MSPs a comprehensive toolchain for monitoring, alerting, and incident response. Thresholds can be set per-client, and when something happens, notifications can be delivered via SMS, email, and push notification, so that technicians are never far behind with a solution in hand.
Tickets in NinjaOne’s helpdesk can be automatically generated from alerts from a wide range of servers, tools, and platforms, and dashboards can be generated to show uptime and resolution metrics. NinjaOne’s built-in documentation can be used to store SLAs, process documentation, charts, reports, examples, and other client-facing documents that help add valuable context and reassurance to your alert handling process.
