Endpoint Monitoring and Alerting Playbook

Topic

This guide provides recommendations for building your endpoint monitoring and alerting strategy as well as step-by-step instructions for building 30+ custom endpoint monitoring conditions in NinjaOne.

This article is a copy of the NinjaOne Best Practice guide from our NinjaOne Resource Center. The data can be downloaded as a PDF at the bottom of this page.

Environment

NinjaOne platform

Index

Introduction
- What Does Good Monitoring Look Like?
Device Health Monitoring
Drive Monitoring
Application Monitoring
Network Monitoring
Security Monitoring
4 Keys to Leveling-up Your Monitoring
Ticketing & Alerting Best Practices

Device Alerts on the System Dashboard

Access the Alerts data on the system dashboard under the Devices tab.

Figure 1: Access Alerts on the system dashboard

This dashboard also allows you to quickly filter these activities by the following options:

Column	Properties
Date range	Choose a date range (last day, last week, last month, or last three months) for when the alert was initially triggered.
Device types	Choose a specific device type or types to filter the list by.
Conditions	Search for specific conditions that are currently triggered.
Organizations	Choose a specified organization or organizations to filter the list by.
Locations	Search for a specified location or locations to filter the list by. If any organization filters are selected, the list of available locations updates to show only the locations under those organizations.
Devices	Filter the list by specified devices.

Each column can be sorted. By default, all available columns are displayed in the table; use the gear icon under the filter dropdowns to manage column visibility. You can refresh the alert list at any time by clicking Refresh at the top of the list.
The names of the devices, organizations, and locations are hyperlinks that take you to the dashboards for those devices, organizations, or locations.
Click the data in the Alert column to see the details in a pop-up window. The alert details also contain the device name and the date/time that the alert was created.
Select the checkbox for one or more alerts to see the reset option.

Introduction

What Does Good Monitoring Look Like?

Monitoring and alerting are central to the effective use of an RMM. Good monitoring practices enable you to proactively identify issues, resolve them faster, and be more effective. Better monitoring can also play a key role in generating additional revenue and keeping your clients more satisfied.

The challenge is knowing what to monitor for, what requires an alert, which issues can be automatically resolved, and which need a personal touch. That knowledge can take years to develop, and even then the best teams can still struggle with reducing alert fatigue and ticket noise across client devices.

To help those just getting started condense that ramp-up time and narrow their focus, we’ve put together this list of ideas for 25+ conditions to monitor. These recommendations are based on suggestions from our partners and from NinjaOne’s experience helping MSPs build effective, actionable monitoring.

For each condition we describe what is being monitored, how to set up the monitor in NinjaOne, and what actions should be taken if the condition is triggered. Some monitoring suggestions are concrete while others may require a small amount of customization to fit them to your use case.

These monitoring ideas are obviously not exhaustive, and may not apply to every situation or circumstance. Once you’ve gotten started building out your monitoring around these suggestions, you’ll need to develop a more customized and robust monitoring strategy specific to your clients and their needs. We end this guide with additional recommendations to help with that effort and make monitoring, alerting, and ticketing a competitive advantage for your MSP.

Device Health Monitoring

Monitor for continuous critical events	Condition: Critical Events; Threshold: 80 critical events over 5 minutes; Action: Ticket and investigate
Identify when a device is unintentionally rebooted	Condition: Windows Event; Event Source: Microsoft-Windows-Kernel-Power; Event ID: 41; Note: This condition is better suited for servers, as workstations and laptops can create this error from user intervention; Action: Ticket and investigate
Identify devices in need of a reboot	Condition: System Uptime; Threshold recommendation: 30 or 60 days; Action: Restart the device during an appropriate window. Automated remediation may work for workstations.
Monitor for offline endpoints	Condition: Device Down; Threshold recommendation: 10 minutes or less (servers), 5 days or longer (workstations); Action: Ticket and investigate, Wake-on-LAN (servers only)
Monitor for hardware changes	Activity: System; Name: Adapter added / changed, CPU added / removed, Disk drive added / removed, Memory added / removed; Action: Ticket and investigate
Monitor for prolonged high CPU usage	Condition: CPU; Thresholds: 90% or greater to reduce noise, with 95%+ also being common, over a period of 15 minutes or longer; Action: Ticket and investigate
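
To illustrate why a sustained threshold produces less noise than an instantaneous one, here is a minimal Python sketch of a "90% or greater for 15 minutes" check. It is not how NinjaOne evaluates conditions internally; the function name `sustained_breach`, the `(timestamp, value)` sample format, and the one-minute sampling interval are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 90.0,
    window_seconds: float = 15 * 60,
) -> bool:
    """Return True if readings stay at or above `threshold` for at least
    `window_seconds` without interruption.

    `samples` is an iterable of (timestamp_in_seconds, value) pairs sorted
    by timestamp. This sample format is a hypothetical choice for the sketch.
    """
    run_start = None  # timestamp of the first sample in the current breach run
    for timestamp, value in samples:
        if value >= threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= window_seconds:
                return True
        else:
            # Any reading below the threshold restarts the clock.
            run_start = None
    return False


# One reading per minute for 16 minutes (timestamps 0..900 seconds).
steady = [(minute * 60, 95.0) for minute in range(16)]
# Same load, but with a single dip at minute 8 that interrupts the run.
dipped = [(minute * 60, 50.0 if minute == 8 else 95.0) for minute in range(16)]

print(sustained_breach(steady))  # True
print(sustained_breach(dipped))  # False
```

A brief dip resets the run, so short spikes on either side of it never add up to a ticket. The same pattern applies to the other duration-based thresholds in this guide.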

Drive Monitoring

Monitor for potential disk failure	Condition: Windows SMART Status Degraded; Condition: Windows Event; Event Source: Disk; Event IDs: 7, 11, 29, 41, 51, 153; Action: Ticket and investigate
Identify when disk space is approaching capacity	Condition: Disk Free Space; Threshold: 20% and again at 10%; Action: Perform disk cleanup and delete temporary files
Monitor for potential RAID failures	Condition: RAID Health Status; Thresholds: Critical and Non-Critical for all attributes; Action: Ticket and investigate
Monitor for prolonged high disk usage	Condition: Disk Usage; Thresholds: 90% or greater to reduce noise, with 95%+ also being common, over 30- or 60-minute periods; Action: Ticket and investigate
Monitor for high disk activity rate	Condition: Disk Active Time; Thresholds: Greater than 90% for 15 minutes; Action: Ticket and investigate
Monitor for high memory usage	Condition: Memory Usage; Thresholds: Greater than 90% for 15 minutes; Action: Ticket and investigate
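
The two-step disk space threshold (20%, then again at 10%) can be sketched as a simple tiering function. This is an illustrative Python example only; the function name `free_space_tier`, the tier labels, and the choice to treat "at or below" the percentage as a breach are assumptions, not NinjaOne behavior.

```python
import os
import shutil


def free_space_tier(
    total_bytes: int,
    free_bytes: int,
    warn_pct: float = 20.0,
    critical_pct: float = 10.0,
) -> str:
    """Classify free disk space as 'ok', 'warning', or 'critical'.

    A volume at or below `warn_pct` free is a warning; at or below
    `critical_pct` free it is critical. Labels are hypothetical.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= critical_pct:
        return "critical"
    if free_pct <= warn_pct:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Check the volume that holds the current working directory's root.
    usage = shutil.disk_usage(os.path.abspath(os.sep))
    print(free_space_tier(usage.total, usage.free))
```

Alerting at two tiers gives you a chance to run automated cleanup at the first threshold and reserve a ticket for the second.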

Application Monitoring

Identify if required applications exist on an endpoint	Condition: Software; Usage: Client line-of-business applications (examples: AutoCAD, SAP, Photoshop), client productivity solutions (examples: Zoom, Microsoft Teams, Dropbox, Slack, Office, Acrobat), client support tools (examples: TeamViewer, CCleaner, AutoElevate, BleachBit); Action: Automatically install the application if it is missing and required
Monitor whether critical applications are running (particularly for servers)	Condition: Process / Service; Threshold: Down for at least 3 minutes; Example processes: workstations (TeamViewer, RDP, DLP), Exchange server (MSExchangeServiceHost, MSExchangeIMAP4, MSExchangePOP3, etc.), Active Directory server (Netlogon, dnscache, rpcss, etc.), SQL server (mssqlserver, sqlbrowser, sqlwriter, etc.); Action: Restart the service or process
Monitor resource usage for applications known to cause performance issues	Condition: Process Resource; Threshold: 90%+ for at least 5 minutes; Example processes: Outlook, Chrome, and TeamViewer; Action: Ticket and investigate, disable at startup
Monitor for application crashes	Condition: Windows Event; Event Source: Application Hang; Event ID: 1002; Action: Ticket and investigate

Network Monitoring

Monitor for unexpected bandwidth usage	Condition: Network Utilization; Direction: Out; Threshold: Determined by the type of endpoint and network capacity. Each server should have its own threshold based on its use case. Workstation thresholds should be high enough to trigger only when a client’s network is at risk; Action: Ticket and investigate
Ensure network devices are up	Condition: Device Down; Duration: 3 minutes
Monitor which ports are open	Condition: Cloud monitor; Ports: 80 (HTTP), 443 (HTTPS), 25 (SMTP), 21 (FTP)
Monitor client website availability	Monitor: Ping; Target: Client website; Condition: Failure (5 times); Action: Ticket and investigate
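
The "Failure (5 times)" idea, alerting only after several consecutive failed checks, can be sketched in Python as follows. Note that this example substitutes a TCP connection attempt for an ICMP ping, because raw ICMP usually needs elevated privileges. The names `tcp_probe` and `fails_consecutively` and the host `client.example.com` are hypothetical and used only for illustration.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This stands in for a ping; it is not an ICMP echo request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fails_consecutively(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> bool:
    """Return True only if `probe` fails `attempts` times in a row.

    A single successful probe ends the check early with False.
    """
    for attempt in range(attempts):
        if probe():
            return False
        if delay_seconds and attempt < attempts - 1:
            time.sleep(delay_seconds)
    return True
```

A call such as `fails_consecutively(lambda: tcp_probe("client.example.com", 443), delay_seconds=10)` would raise an alert only after five failed attempts, which filters out one-off network blips.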

Security Monitoring

Identify if Windows Firewall has been turned off	Condition: Windows Event; Event Source: System; Event ID: 5025; Action: Turn on Windows Firewall
Identify if antivirus and security tools are installed and/or running on an endpoint	Condition: Software; Presence: Doesn’t Exist; Software (examples): Huntress, Cylance, Threatlocker, Sophos; Action: Automate the installation of the missing security software. AND Condition: Process / Service; State: Down; Process (examples): threatlockerservice.exe, EPUpdateService.exe; Action: Restart the process
Monitor for unintegrated AV / EDR threats detected	Condition: Windows Event; Example (Sophos): Event Source: Sophos Anti-Virus; Event IDs: 6, 16, 32, 42
Monitor for failed user logon attempts	Condition: Windows Event; Event Source: Microsoft-Windows-Security-Auditing; Event IDs: 4625, 4740, 644 (local accounts) and 4777 (domain login); Action: Ticket and investigate
Monitor for the creation, elevation, or removal of users on an endpoint	Condition: Windows Event; Event Source: Microsoft-Windows-Security-Auditing; Event IDs: 4720, 4732, 4729; Action: Ticket and investigate
Identify if the drives on an endpoint are encrypted/unencrypted	Condition: Script Result; Script (Custom): Check Encryption Status; Action: Ticket and investigate
Monitor backup failures (NinjaOne Backup)	Activity: NinjaOne Backup; Name: Backup job failed
Monitor backup failures (other backup vendors)	Condition: Windows Event; Example (Veeam): Event Source: Veeam Agent, Event ID: 190, Text Contains: Failed; Example (Acronis): Event Source: Online Backup System, Event ID: 1, Text Contains: Failed
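
Many of the conditions above share the same shape: an event source, a set of event IDs, and optionally a text match. The Python sketch below models that shape so you can reason about which events a rule would catch. The `EventRule` class and the case-insensitive matching are assumptions made for illustration; they do not describe NinjaOne's implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class EventRule:
    """A hypothetical model of a Windows Event condition."""

    source: str
    event_ids: FrozenSet[int]
    text_contains: Optional[str] = None

    def matches(self, source: str, event_id: int, message: str = "") -> bool:
        """Return True if the event satisfies source, ID, and text filters."""
        if source.lower() != self.source.lower():
            return False
        if event_id not in self.event_ids:
            return False
        if self.text_contains is not None:
            # Assumption: the text match ignores case.
            return self.text_contains.lower() in message.lower()
        return True


# Rules built from the values listed in the table above.
VEEAM_FAILED = EventRule("Veeam Agent", frozenset({190}), "Failed")
FAILED_LOGON = EventRule(
    "Microsoft-Windows-Security-Auditing",
    frozenset({4625, 4740, 644, 4777}),
)

print(VEEAM_FAILED.matches("Veeam Agent", 190, "Backup job Failed"))  # True
print(VEEAM_FAILED.matches("Veeam Agent", 190, "Backup job succeeded"))  # False
```

Adding a text filter is what keeps a rule like the Veeam example from firing on every backup event with the same ID, including successful ones.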

4 Keys to Leveling-up Your Monitoring

Create a baseline device health monitoring template.
Talk to customers about their priorities.
- Which servers and workstations are important?
- What are their critical line-of-business or productivity applications?
- Where are their IT pain points?
Monitor your PSA / ticketing system for recurring issues.
- Adjust alerting to avoid ticket noise.
Monitor clients’ event logs for recurring issues.

Ticketing & Alerting Best Practices

Only alert on actionable information - if you don't have a specific response associated with a monitor, don't monitor it.
Categorize your alerts to go to different service boards in your PSA.
Host regular alert housekeeping meetings to discuss:
- Which alerts are causing the most noise? Can they be removed or narrowed in scope?
- What is not being monitored or creating notifications that should be?
- Which common alerts can be automatically remediated?
- Are there any upcoming projects that may generate alerts?
Clean up your tickets and alerts when they are resolved.
- In NinjaOne, many conditions have a ‘Reset when no longer true’ or ‘Reset when not true for x period’ option to help you clean up notifications for issues that resolve themselves.
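
To show how a "reset when not true for x period" rule behaves, here is a small Python sketch of an alert that clears itself once the condition has stayed false for long enough. The class `AutoResetAlert` and its method names are hypothetical and only model the idea; they are not part of NinjaOne.

```python
from typing import Optional


class AutoResetAlert:
    """Hypothetical alert that resets after the condition stays false
    for `reset_after_seconds`. A value of 0 resets as soon as the
    condition is no longer true."""

    def __init__(self, reset_after_seconds: float) -> None:
        self.reset_after_seconds = reset_after_seconds
        self.active = False
        self._clear_since: Optional[float] = None

    def observe(self, timestamp: float, condition_true: bool) -> bool:
        """Record one evaluation and return whether the alert is active."""
        if condition_true:
            self.active = True
            self._clear_since = None  # any recurrence restarts the clock
        elif self.active:
            if self._clear_since is None:
                self._clear_since = timestamp
            if timestamp - self._clear_since >= self.reset_after_seconds:
                self.active = False
                self._clear_since = None
        return self.active


alert = AutoResetAlert(reset_after_seconds=300)
print(alert.observe(0, True))     # True: condition triggered
print(alert.observe(60, False))   # True: clear for 0 seconds so far
print(alert.observe(200, True))   # True: recurred, clock restarts
print(alert.observe(260, False))  # True: clear for 0 seconds so far
print(alert.observe(560, False))  # False: clear for 300 seconds, reset
```

Requiring the condition to stay false for a period prevents a flapping issue from repeatedly closing and reopening the same notification.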