What is MSP proactive monitoring and why is it important?

MSP proactive monitoring uses automation to track device health, networks, applications, and security events in real time. By identifying issues early, MSPs reduce downtime, cut ticket noise, and resolve problems before they impact end users.

What are the most critical IT conditions to monitor for MSPs?

Key monitoring areas include system uptime, disk health, CPU and memory usage, bandwidth spikes, failed backups, firewall and antivirus status, failed login attempts, and unauthorized user account changes. These conditions help MSPs maintain reliable IT environments.

How can IT automation reduce alert fatigue for MSPs?

Automation reduces alert fatigue by only triggering actionable alerts, categorizing tickets by priority, and auto-remediating common issues such as restarting services or cleaning up disk space. This ensures MSPs focus on high-priority incidents.

What automation examples improve IT security monitoring?

MSPs can automate monitoring for disabled firewalls, missing or inactive antivirus/EDR tools, unencrypted drives, failed login attempts, and security events flagged by tools like Sophos or ThreatLocker. Automated remediation strengthens endpoint protection.

Which tools help MSPs automate IT monitoring and remediation?

Remote monitoring and management (RMM) platforms like NinjaOne allow MSPs to automate device monitoring, detect security risks, track software health, and remediate common IT issues at scale, reducing manual workloads.

28 Essential IT Automation Examples: Monitoring and Alerting

Key Points

Proactive MSP Monitoring: Implement automation for device health, applications, network, drives, and security to prevent downtime, reduce alert fatigue, and improve IT service delivery.
Device Health Checks: Monitor uptime, offline endpoints, unexpected reboots, and hardware changes to detect issues early and automate remediation when possible.
Drive and Storage Automation: Track SMART disk failures, RAID health, disk usage, and free space thresholds to avoid data loss and performance bottlenecks.
Application Monitoring: Ensure critical business apps and services (Exchange, SQL, AD, productivity tools) are installed, running, and not over-consuming resources.
Network Visibility: Monitor bandwidth spikes, open ports, device uptime, and client website availability to maintain reliable connectivity.
Security Automation: Detect firewall status, AV/EDR installation and threats, failed login attempts, unauthorized account changes, and disk encryption compliance.
Backup Monitoring: Automate alerts for failed backup jobs across Ninja Data Protection, Veeam, Acronis, and other solutions to ensure data protection.
Reduce Alert Fatigue: Use actionable alerts, categorize tickets, automate remediation of common issues, and fine-tune thresholds to cut noise.
MSP Best Practices: Build baseline monitoring templates, align with client priorities, track recurring issues, and hold regular alert housekeeping sessions.

Endpoint monitoring and alerting is a central part of IT management. If you’re an MSP, good monitoring and alerting practices help you identify issues proactively, resolve them faster, and save yourself and your users time and frustration later.

The challenge is knowing

what to monitor for,
what requires an alert,
which issues can be automatically resolved, and
which need a personal touch.

That knowledge can take years to develop, and even then, the best IT teams can still struggle with reducing alert fatigue and ticket noise across their networks and devices.

To help condense that ramp-up time and narrow your focus, we’ve put together this list of ideas for conditions to monitor for, along with suggested triggers and actions for automation. These are based on recommendations from our customers and on NinjaOne’s experience in helping IT teams build more effective, actionable monitoring.

How to Use the MSP Monitoring Checklist Below

For each condition, we describe what’s being monitored, how to set up the monitor in NinjaOne, and what actions should be taken if the condition is triggered. Some monitoring suggestions are concrete while others may require a small amount of customization to fit them to your use case.

Note: While we’ve written this checklist with NinjaOne and our customers in mind, these monitoring ideas should be easily adaptable to any RMM or endpoint management solution.

This list is not exhaustive, and not every item will apply to every environment.

Once you’ve gotten started building out your monitoring plan around these suggestions, you’ll want to develop a more customized and robust monitoring strategy specific to your needs. We’ll close out this post with additional recommendations to help with that effort and make monitoring, alerting, and ticketing more streamlined and effective.

Device Health Monitoring

Monitor for continuous critical events

Condition: Critical Events
Threshold: 80 critical events over 5 minutes
Note: The threshold may need fine-tuning or exclusion filters as it could create noise in environments with chatty logs.
Action: Ticket and investigate
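
To make the "80 critical events over 5 minutes" threshold concrete, here is a minimal sketch of a sliding-window counter in Python. It is not how NinjaOne evaluates the condition internally; the class name `EventRateMonitor` and its parameters are hypothetical, and timestamps are assumed to be plain seconds.

```python
from collections import deque


class EventRateMonitor:
    """Hypothetical helper: fires when `threshold` events fall inside a sliding window."""

    def __init__(self, threshold=80, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one critical event; return True if the threshold is met."""
        self._timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


monitor = EventRateMonitor(threshold=80, window_seconds=300)
# 80 events, one every 3 seconds, all land inside a single 5-minute window.
results = [monitor.record(t * 3.0) for t in range(80)]
```

In this run only the 80th event returns True. Raising `threshold` or adding exclusion filters before calling `record` is one way to model the fine-tuning mentioned in the note above.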

Identify when a device is unintentionally rebooted

Condition: Windows Event
Event Source: Microsoft-Windows-Kernel-Power
Event ID: 41
Note: This condition is better suited to servers, because workstations and laptops can log this event after user actions such as a forced power-off.
Action: Ticket and investigate

Identify devices in need of a reboot

Condition: System Uptime
Threshold Recommendation: 30 or 60 days (though this may be aggressive for servers with stable workloads)
Action: Restart the device during an appropriate window. Automated remediation may work for workstations.

Monitor for offline endpoints

Condition: Device Down
Threshold Recommendation:
- 10 minutes or less (servers)
- 24+ hours (workstations)
Action:
- Ticket and investigate
- Wake-on-LAN (servers only)
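
The per-device-type thresholds above can be sketched as a small lookup. This is an illustration only: the names `OFFLINE_THRESHOLDS` and `offline_actions` are hypothetical, and the action strings are placeholders for whatever your RMM or PSA actually triggers.

```python
from datetime import datetime, timedelta

# Thresholds taken from the checklist above; adjust per client.
OFFLINE_THRESHOLDS = {
    "server": timedelta(minutes=10),
    "workstation": timedelta(hours=24),
}


def offline_actions(device_type, last_seen, now):
    """Return the actions to take, or an empty list if the device is within its threshold."""
    if now - last_seen < OFFLINE_THRESHOLDS[device_type]:
        return []
    actions = ["ticket"]
    if device_type == "server":
        # Wake-on-LAN is attempted for servers only, as in the checklist.
        actions.append("wake_on_lan")
    return actions


now = datetime(2024, 1, 1, 12, 0)
server_actions = offline_actions("server", now - timedelta(minutes=15), now)
workstation_actions = offline_actions("workstation", now - timedelta(minutes=15), now)
```

Here the server that has been silent for 15 minutes gets a ticket and a Wake-on-LAN attempt, while the workstation is left alone until it has been offline for a full day.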

Monitor for hardware changes

Activity: System
Name: Adapter added/changed, CPU added/removed, Disk drive added/removed, Memory added/removed
Action: Ticket and investigate

Drive Monitoring

Monitor for potential disk failure

Condition: Windows SMART Status Degraded and/or Windows Event
Event Source: Disk
Event IDs: 7, 11, 29, 41, 51, 153
Action: Ticket and investigate

Identify when disk space is approaching capacity

Condition: Disk Free Space
Threshold: 20% free space remaining, and again at 10%
Action: Perform disk cleanup and delete temporary files
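
For teams that want to test the two-tier threshold outside their RMM, the following sketch classifies free space using Python's standard `shutil.disk_usage`. The function name `free_space_level` and the level labels are assumptions made for this example.

```python
import os
import shutil


def free_space_level(total_bytes, free_bytes, warn_pct=20.0, critical_pct=10.0):
    """Classify free space against the 20% and 10% thresholds."""
    free_pct = free_bytes / total_bytes * 100
    if free_pct <= critical_pct:
        return "critical"
    if free_pct <= warn_pct:
        return "warning"
    return "ok"


# Check the root of the current drive (for example C:\ on Windows, / on Linux).
usage = shutil.disk_usage(os.path.abspath(os.sep))
level = free_space_level(usage.total, usage.free)
```

A "warning" result is a reasonable point to run automated cleanup, while "critical" probably deserves a ticket if cleanup did not free enough space.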

Monitor for potential RAID failures

Condition: RAID Health Status
Thresholds: Critical and Non-Critical for all attributes
Action: Ticket and investigate

Monitor for prolonged high disk usage

Condition: Disk Usage
Thresholds: 90% or greater, sustained over a 30- or 60-minute period (95%+ is also common to reduce noise)
Action: Ticket and investigate

Monitor for high disk activity rate

Condition: Disk Active Time
Thresholds: Greater than 90% for 15 minutes
Action: Ticket and investigate

Monitor for high memory usage

Condition: Memory Utilization
Thresholds: Greater than 90% for 15 minutes
Action: Ticket and investigate
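
Several of the conditions above share the same shape: a value must stay over a limit for a set duration before an alert fires. The sketch below shows one way to model that in Python. The class `SustainedThreshold` is hypothetical, samples are assumed to arrive in time order, and timestamps are plain seconds.

```python
class SustainedThreshold:
    """Hypothetical helper: alerts only when a value stays above a limit for a full duration."""

    def __init__(self, limit_pct=90.0, duration_seconds=900):
        self.limit_pct = limit_pct
        self.duration_seconds = duration_seconds
        self._breach_started = None

    def update(self, timestamp, value_pct):
        """Feed one sample; return True once the breach has lasted long enough."""
        if value_pct <= self.limit_pct:
            # Any sample at or below the limit resets the timer.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.duration_seconds


detector = SustainedThreshold(limit_pct=90.0, duration_seconds=15 * 60)
# One sample per minute at 95% for minutes 0 through 15.
alerts = [detector.update(minute * 60, 95.0) for minute in range(16)]
```

Only the final sample, taken 15 minutes after the breach began, returns True. Requiring a sustained breach is what keeps short spikes from generating tickets.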

Application Monitoring

Identify if required applications exist on an endpoint

Condition: Software
Usage:
- Client line-of-business applications (e.g., AutoCAD, SAP, Photoshop)
- Client productivity solutions (e.g., Zoom, Microsoft Teams, Dropbox, Slack, Office, Acrobat)
- Client support tools (e.g., TeamViewer, CCleaner, AutoElevate, BleachBit)
Action: Flag missing applications for review or automatic deployment where appropriate

Monitor whether critical applications are running (particularly for servers)

Condition: Process or Service
Threshold: Down for at least 3 minutes
Example Processes:
- For workstations: TeamViewer, RDP, DLP
- For an Exchange server: MSExchangeServiceHost, MSExchangeIMAP4, MSExchangePOP3
- For an Active Directory server: Netlogon, dnscache, rpcss
- For a SQL server: mssqlserver, sqlbrowser, sqlwriter
Action: Restart the service or process
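
As a rough illustration, the example services above can be kept in a per-role table and compared against what is actually running. The role keys and the `services_to_restart` function are hypothetical, and the list of running services is assumed to come from your own inventory tooling. Names are compared case-insensitively, which is an assumption of this sketch.

```python
# Example services from the checklist, keyed by a hypothetical role name.
REQUIRED_SERVICES = {
    "exchange": ["MSExchangeServiceHost", "MSExchangeIMAP4", "MSExchangePOP3"],
    "active_directory": ["Netlogon", "dnscache", "rpcss"],
    "sql": ["mssqlserver", "sqlbrowser", "sqlwriter"],
}


def services_to_restart(role, running_services):
    """Return the required services for a role that are not currently running."""
    running = {name.lower() for name in running_services}
    return [name for name in REQUIRED_SERVICES[role] if name.lower() not in running]


missing = services_to_restart("sql", ["MSSQLSERVER", "SQLWriter"])
```

In this example `missing` contains only `sqlbrowser`. A real monitor would also wait for the 3-minute threshold before restarting anything.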

Monitor resource usage for applications known to cause performance issues

Condition: Process Resource
Threshold: 90%+ for at least 5 minutes
Example Processes: Outlook, Chrome, and TeamViewer
Action:
- Ticket and investigate
- Disable at startup

Monitor for application crashes

Condition: Windows Event
Source: Application Hang
Event ID: 1002
Action: Ticket and investigate

Network Monitoring

Monitor for unexpected bandwidth usage

Condition: Network Utilization
Direction: Out
Threshold: Determined by the type of endpoint and network capacity
- Each server should have its own threshold based on its use case
- Workstation network monitor thresholds should be high enough to trigger only when a client’s network is at risk
Action: Ticket and investigate

Ensure network devices are up

Condition: Device Down
Duration: 3 minutes

Monitor which ports are open

Condition: Cloud Monitor
Ports: 80 (HTTP), 443 (HTTPS), 25 (SMTP), 21 (FTP)
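
If you want to spot-check the same ports by hand, a plain TCP connection attempt is usually enough. The sketch below uses Python's standard `socket.create_connection`. The function names are hypothetical, and the approach is an approximation of what a cloud monitor does, not NinjaOne's implementation.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def open_ports(host, ports=(80, 443, 25, 21), timeout=3.0):
    """Return the subset of ports that accept a TCP connection."""
    return [port for port in ports if is_port_open(host, port, timeout)]
```

For example, `open_ports("client.example.com")` would return the ports from the list above that answer on that host (the hostname is a placeholder). Run the check from outside the client's network if the goal is to see what is exposed to the internet.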

Monitor client website availability

Monitor: Ping
Target: Client Website
Condition: Failure (5 times)
Action: Ticket and investigate
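
The "Failure (5 times)" condition can be sketched as a counter of consecutive failed checks. Treating the five failures as consecutive is an assumption here, and the class name `ConsecutiveFailureAlert` is hypothetical. The result of each ping or HTTP check is passed in as a boolean.

```python
class ConsecutiveFailureAlert:
    """Hypothetical helper: fires once when a check fails a set number of times in a row."""

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, success):
        """Record one check result; return True only on the failure that hits the limit."""
        if success:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures == self.max_failures


alert = ConsecutiveFailureAlert(max_failures=5)
checks = [False, False, False, False, True, False, False, False, False, False]
fired = [alert.record(result) for result in checks]
```

The single successful check in the middle resets the count, so only the last entry in `fired` is True. Firing once, rather than on every later failure, avoids opening duplicate tickets for the same outage.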

Basic Security Monitoring

Identify if Windows Firewall has been turned off

Condition: Windows Event
Event Source: System
Event ID: 5025
Action: Turn on Windows Firewall

Identify if antivirus and security tools are installed and/or running on an endpoint

Condition: Software
Presence: Doesn’t Exist
Software Examples: Huntress, Cylance, ThreatLocker, Sophos
Action: Automate the installation of the missing security software
Condition: Process or Service
State: Down
Example Processes: threatlockerservice.exe, EPUpdateService.exe
Action: Restart the process

Monitor for threats detected by AV/EDR tools that aren’t integrated with your RMM

Condition: Windows Event
Example: Sophos
Event Source: Sophos Anti-Virus
Event IDs: 6, 16, 32, 42

Monitor for failed user logon attempts

Condition: Windows Event
Event Source: Microsoft-Windows-Security-Auditing
Event IDs: 4625, 4740, 644 (local accounts); 4777 (domain login)
Action: Ticket and investigate
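
A single failed logon is rarely worth a ticket on its own, so one option is to group matching events by account before alerting. The sketch below assumes events have already been exported as dictionaries with `event_id` and `account` keys. That shape, the function name, and the limit of 5 are all assumptions for illustration and do not come from the checklist.

```python
from collections import Counter

# Event IDs from the checklist above.
FAILED_LOGON_EVENT_IDS = {4625, 4740, 644, 4777}


def accounts_over_limit(events, limit=5):
    """Return accounts with at least `limit` matching events, sorted by name."""
    counts = Counter(
        event["account"]
        for event in events
        if event["event_id"] in FAILED_LOGON_EVENT_IDS
    )
    return sorted(account for account, count in counts.items() if count >= limit)


sample_events = (
    [{"event_id": 4625, "account": "alice"}] * 5
    + [{"event_id": 4625, "account": "bob"}] * 2
    + [{"event_id": 4624, "account": "carol"}] * 9
)
flagged = accounts_over_limit(sample_events, limit=5)
```

Only `alice` is flagged here. The events for `carol` use ID 4624, which is not in the monitored set, so they are ignored.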

Monitor for the creation, elevation, or removal of users on an endpoint

Condition: Windows Event
Event Source: Microsoft-Windows-Security-Auditing
Event IDs: 4720, 4732, 4729
Action: Ticket and investigate

Identify if the drives on an endpoint are encrypted/unencrypted

Condition: Script Result
Script (Custom): Check Encryption Status
Action: Ticket and investigate

Monitor backup failures (Ninja Data Protection)

Activity: Ninja Data Protection
Name: Backup job failed

Monitor backup failures (other backup vendors)

Condition: Windows Event
Example Source/IDs (Veeam):
- Event Source: Veeam Agent
- Event ID: 190
- Text Contains: Failed
Example Source/IDs (Acronis):
- Event Source: Online Backup System
- Event ID: 1
- Text Contains: Failed
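
The vendor examples above all follow the same pattern of source, event ID, and a text match, so they can be expressed as data. In the sketch below, the `EventRule` class and `failed_backup_vendor` function are hypothetical, and the text match is case-insensitive by assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRule:
    """One Windows Event match rule: source, ID, and required message text."""

    vendor: str
    source: str
    event_id: int
    text_contains: str

    def matches(self, source, event_id, message):
        return (
            source == self.source
            and event_id == self.event_id
            and self.text_contains.lower() in message.lower()
        )


# Rules taken from the Veeam and Acronis examples in the checklist.
BACKUP_FAILURE_RULES = [
    EventRule("Veeam", "Veeam Agent", 190, "Failed"),
    EventRule("Acronis", "Online Backup System", 1, "Failed"),
]


def failed_backup_vendor(source, event_id, message):
    """Return the vendor whose failure rule matches the event, or None."""
    for rule in BACKUP_FAILURE_RULES:
        if rule.matches(source, event_id, message):
            return rule.vendor
    return None
```

Adding another backup product then means adding one more `EventRule` entry, once you have confirmed the source and event ID that product writes on failure.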

4 Keys to Leveling Up Your MSP Monitoring

Create a baseline device health monitoring template.
Talk to customers about their priorities.
1. Which servers and workstations are important?
2. What are their critical line-of-business or productivity applications?
3. Where are their IT pain points?
Monitor your PSA/ticketing system for recurring issues. Also, adjust alerting to avoid ticket noise.
Monitor clients’ event logs for recurring issues.

Ticketing and Alerting Best Practices

Only alert on actionable information. If you don’t have a specific response associated with a monitor, don’t monitor it.
Categorize your alerts to go to different service boards in your PSA based on the type or priority.
Host regular alert housekeeping meetings to answer the following questions:

- Which alerts are causing the most noise? Can they be removed or narrowed in scope?
- What isn’t being monitored, or isn’t generating notifications, that should be?
- Which common alerts can be automatically remediated?
- Are there any upcoming projects that may generate alerts?

Clean up your tickets and alerts when they’re resolved.

- In NinjaOne, many conditions have a “Reset when no longer true” or “Reset when not true for x period” option to help you resolve and clean up notifications for issues that clear on their own.
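
The practice of routing alerts to different service boards by type or priority can be sketched as a simple lookup. The board names, category keys, and `route_alert` function below are all hypothetical. Your PSA will have its own boards and its own integration settings.

```python
# Hypothetical mapping from alert category to PSA service board.
BOARDS_BY_CATEGORY = {
    "security": "Security",
    "backup": "Backup",
    "hardware": "Infrastructure",
}


def route_alert(category, priority):
    """Pick a service board: critical alerts escalate, the rest route by category."""
    if priority == "critical":
        return "Escalations"
    # Unknown categories land on a triage board instead of being dropped.
    return BOARDS_BY_CATEGORY.get(category, "Triage")


board = route_alert("backup", "normal")
```

Keeping the mapping in one place also gives alert housekeeping meetings a single artifact to review when a category keeps landing on the wrong board.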

More MSP Monitoring Ideas

See Kelvin Tegelaar’s excellent series on remote monitoring using PowerShell. He covers how to monitor everything from network traffic to Active Directory health, Office 365 failed logins, Shodan results, and more. Best of all, he shares PowerShell scripts that are designed to be RMM agnostic. You can also read our blog post on PowerShell vs CMD Prompt differences and when to use each.

We regularly feature Tegelaar’s blog posts along with plenty of additional tools and resources in our weekly MSP Bento newsletter. Subscribe now to get the latest edition along with a special list of the most popular tools and resources we’ve shared.

In addition, if you’re looking for software to help you automate how you monitor all your IT assets, NinjaOne’s IT asset management solution offers a complete, real-time view of your resources and lets you manage software on your endpoints at scale. Watch a demo of NinjaOne in action or sign up for a free trial of the software.