This article was edited in February 2026 and has also been published on Medium

Much has been written about the right way to handle alarms and alerts for sysadmins, ops and reliability engineers. I find that examining anti-patterns can be just as instructive as studying best practices. Below are some common pitfalls I’ve encountered. This article focuses primarily on the organisational and communication aspects of alerting rather than technical implementation details, with particular attention to out-of-hours ‘on-call’ scenarios, though the principles apply broadly.

No Alerting

While this might seem obvious, it’s worth stating: having no alerting system at all creates blind spots that can lead to serious issues going unnoticed until they become critical.

Confusing or conflating logging with alerting

Logging and alerting serve different purposes. Logs provide context for investigating problems, while alerts signal that immediate action is required. When logs are sent as alerts, it creates noise that obscures genuinely actionable items. If logging data needs to be distributed via email, consider using a dedicated logging address rather than subscribing people to a mailing list they’ll likely mute. The key distinction: if something doesn’t require immediate action, it shouldn’t be treated as an alert.
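This distinction can be enforced mechanically. As a minimal sketch using Python’s standard `logging` module (the logger name and the `pages` list standing in for a real paging integration are illustrative), routine records stay in the log stream while only records at or above a paging threshold reach the on-call system:

```python
import logging

class PagingHandler(logging.Handler):
    """Forward only genuinely actionable records to the pager.

    Records below the handler's level never reach emit(), so routine
    log lines stay in the ordinary log stream as context and nobody
    is paged for them.
    """
    def __init__(self, level=logging.CRITICAL):
        super().__init__(level)
        self.pages = []  # stand-in for a real paging integration

    def emit(self, record):
        self.pages.append(self.format(record))

log = logging.getLogger("payments")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler())  # full log stream: context
pager = PagingHandler()
log.addHandler(pager)                    # pager: action required only

log.info("request served in 120ms")      # logged, nobody paged
log.critical("payment queue stalled")    # logged AND paged
```

The same record flows to both handlers; only the handler’s level decides whether it becomes an alert, which keeps the ‘requires immediate action’ test in one place.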

Having standing reports and billing sent to the same location

Mixing different types of communications dilutes their effectiveness. Alerts require immediate attention, while reports and billing information are typically reviewed on a schedule. Sending these to the same channel or mailing list reduces the signal-to-noise ratio. Consider separating these streams - if weekly reports or uptime summaries are valuable to stakeholders, they can opt into those channels rather than having them mixed with critical alerts.

Having meaningless alerts

It’s tempting to monitor everything that can be monitored, but alerts should only be created for conditions that require action. Alerts that don’t warrant a response add cognitive load to responders and reduce the effectiveness of the overall alerting system. Consider whether each alert passes this test: does this require someone to take action?

Having alerts for ‘everything is OK’

Status update notifications confirming normal operation can quickly become noise. While it’s understandable to want confirmation that monitoring is working, consider alternative approaches like heartbeat checks or periodic reviews rather than real-time status updates. The same applies to notifications about tickets being opened or closed when they don’t involve your team - these updates may be valuable in a ticketing system but don’t belong in alert channels.
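A heartbeat check (sometimes called a dead man’s switch) inverts the ‘everything is OK’ message: the monitored job records a timestamp on each successful run, and the monitor alerts only on silence. A minimal sketch, with an illustrative 300-second threshold:

```python
import time

HEARTBEAT_MAX_AGE = 300  # seconds; illustrative threshold

def heartbeat_stale(last_beat, now=None, max_age=HEARTBEAT_MAX_AGE):
    """Return True when the last heartbeat is older than max_age.

    The monitored job writes a timestamp on every successful run;
    alerting only on a stale timestamp means normal operation
    generates no messages at all.
    """
    now = time.time() if now is None else now
    return (now - last_beat) > max_age

now = 1_700_000_000
print(heartbeat_stale(now - 60, now=now))   # False: fresh beat, stay quiet
print(heartbeat_stale(now - 600, now=now))  # True: silence, fire the alert
```

You still get confirmation that monitoring works (the check itself runs on a schedule), but no human ever reads an ‘OK’ message.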

Alerting governance and stakeholder alignment

Effective alerting requires clear ownership and appropriate stakeholder involvement. Several related issues often arise in this area:

Clarity about who responds to alerts

Alert configuration works best when those who will respond to alerts have primary input into how those alerts are set up. When people who won’t be responding configure alerts without adequate consultation, misalignments can occur. For example, consider a billing alarm set for 75% quota usage checked at the end of a 24-hour period. At two minutes to midnight, either you get no alert (meaning you’re using far less than 75% of what you’re paying for) or the alert fires but there’s no time to take meaningful action. While well-intentioned, this type of configuration reveals a disconnect between the alert designer and operational reality. These issues are best resolved through collaborative discussion during planning, making them ideal topics for brainstorming sessions. If you’re not capable of responding to an alert, you may not be the right person to configure it.
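One way a responder might fix that particular example is to evaluate the alarm continuously against a projected end-of-period figure rather than once at the end of the day. A sketch, where the 75% threshold comes from the example above and the linear projection is an assumption:

```python
def over_budget(used, quota, elapsed_fraction, threshold=0.75):
    """Fire when projected end-of-period usage would exceed the threshold.

    Linearly projects current usage to the end of the billing period,
    so the alarm can fire mid-period while there is still time to act,
    rather than at two minutes to midnight.
    """
    if elapsed_fraction <= 0:
        return False
    projected = used / elapsed_fraction
    return projected >= threshold * quota

print(over_budget(50, 100, 0.5))   # True: on track for 100% of quota
print(over_budget(20, 100, 0.5))   # False: on track for 40%
```

The point is not this exact formula but that a responder would naturally ask ‘when can I still act on this?’, a question the original end-of-day check never considered.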

The anxiety problem: alerts to people without context

When people receive alerts they don’t understand, particularly those containing technical jargon or negative-sounding terms, it naturally creates anxiety. This anxiety generates noise in the system - well-meaning escalations, urgent questions, and requests for updates that distract responders from actually resolving issues. Whether it’s junior staff triaging without adequate context or management “keeping an eye” on technical alerts, if recipients lack the knowledge to act on an alert, their involvement often adds overhead rather than value. Playbooks can help to some extent, but only for clearly defined scenarios with staff who have been properly trained. This reinforces the distinction between alerting (immediate, actionable) and reporting (periodic, informational) - stakeholders who need visibility into system health are often better served by appropriate reporting mechanisms they can review during business hours.

Matching authority with responsibility

Those monitoring and managing alert channels should ideally be the same people responsible for responding to them. If customer service leads manage customer service alerts, that’s appropriate. However, when alert channels are managed without adequate consultation with the responders, it can create friction.

There’s a fundamental trust issue here: if you trust your technical team to manage and resolve issues with a service, that trust should extend to letting them manage the alerting for that service. Processing alerts is part of the response workflow, and those closest to the work are best positioned to configure it effectively. If you need visibility into what’s happening, work with the team to establish appropriate reporting or logging that you can review during business hours rather than inserting yourself into the real-time alerting flow. If you find yourself unable to extend this trust, that may indicate a deeper organisational issue that won’t be solved by managing alert channels.

Channel proliferation

Creating numerous Slack channels or mailing lists for different alert categories can be counterproductive unless there are truly separate teams handling each category. While categorisation has its place, excessive fragmentation can increase monitoring overhead rather than reducing it. Consider whether the organisational structure truly benefits from this level of separation.

Having too many communication applications

Each messaging platform requires separate installation, management, and monitoring. When alert communications are spread across email, Slack, MS Teams, Skype for Business, Yammer, PagerDuty, and internal alerting channels, the operational burden multiplies. Consolidating communication channels where feasible can significantly reduce overhead.

Assigning all alerts the same priority/urgency

Not all alerts warrant the same level of urgency. Consider whether a given alert justifies waking someone at 3 AM or if it could wait until business hours. A useful test: would you be comfortable as a manager receiving this alert out of hours, contacting technical staff, and waiting for resolution before returning to rest? Some alerts genuinely require immediate attention, while others can be handled during business hours. Using priority tiers (e.g., critical vs. non-urgent channels) helps responders focus on what truly matters.
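That 3 AM test can be encoded directly in the routing logic. A sketch, with hypothetical tier and channel names:

```python
def route(severity, hour, business_hours=range(9, 18)):
    """Decide whether an alert interrupts someone now or waits.

    Encodes the test from the text: would you wake someone at 3 AM
    for this? Only 'critical' passes; everything else queues until
    business hours.
    """
    if severity == "critical":
        return "page-oncall"     # interrupt, at any hour
    if hour in business_hours:
        return "ops-channel"     # visible now, nobody paged
    return "morning-queue"       # reviewed at start of business

print(route("critical", 3))   # page-oncall
print(route("warning", 3))    # morning-queue
print(route("warning", 10))   # ops-channel
```

Even two tiers like this force the ‘does this justify a page?’ conversation at configuration time rather than at 3 AM.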

Having duplicate alerting

Sending the same alert through multiple channels (e.g., Slack and email simultaneously) or creating notification chains where automated alerts trigger manual notifications creates unnecessary redundancy. Like code, alerting configuration benefits from following DRY (Don’t Repeat Yourself) principles: one incident, one notification.
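A small deduplication layer in front of the notification fan-out is one way to apply this. A sketch, with an illustrative key scheme and suppression window:

```python
class Deduplicator:
    """Suppress repeats of the same alert within a time window.

    Keyed on the alert's identity (e.g. check name + resource); a
    second copy arriving inside the window is dropped, so one
    incident does not page repeatedly or through multiple channels.
    """
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.last_sent = {}

    def should_send(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and (now - last) < self.window:
            return False  # duplicate inside the window: suppress
        self.last_sent[key] = now
        return True

dedup = Deduplicator(window_seconds=600)
print(dedup.should_send("disk-full:web-1", 0))    # True: first occurrence
print(dedup.should_send("disk-full:web-1", 120))  # False: suppressed
print(dedup.should_send("disk-full:web-1", 700))  # True: window elapsed
```

Most paging tools offer this natively (often called grouping or suppression); the sketch just shows the logic worth turning on.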

Nesting alerts

Alert complexity impedes rapid response. Consider this scenario: receiving a phone call, then checking a Slack channel for discussion, which references an automated alert in another Slack channel, which links to a PagerDuty page, which requires accessing Splunk/ELK, which shows a Cloudwatch alarm, which relates to a resource in a different AWS account. That’s a chain of: phone call → 2 Slack channels → PagerDuty → Splunk/ELK → Cloudwatch → target system. Whether responding at 3 AM or while mobile with family, this layered complexity significantly hinders effective response. Streamlining the path from alert to actionable information helps responders help you.

A few tips

Reflective practice

Effective alerting requires ongoing communication and review. The Google Site Reliability Engineering handbook offers one valuable framework, but the key is regularly reviewing your alerts and learning from responses. If team members have thousands of auto-filtered emails they’ve never read, or have muted most operational Slack channels, it’s worth examining why those communications were considered valuable initially. Regular reflection helps identify where alert fatigue has set in and provides opportunities to refine the system. If alerts are routinely ignored without consequence, that suggests a fundamental mismatch between the alerting system and actual operational needs.

Organisation

Ideally, there should be a clear, direct path from alert trigger (e.g., Cloudwatch) to responder, with escalation capabilities built in rather than multiple nested layers. Leveraging tools like AWS Organizations for consolidated management can also simplify alert routing.

Periodic review

Regular review of alerts and alarms helps keep the system relevant. Even without formal retrospectives after each incident, periodic reviews (frequent enough that the team still recalls recent events) ensure that alerts remain appropriate. Alert requirements often evolve as systems mature - what was necessary during initial deployment may become noise once the system is stable. Retiring outdated alerts keeps the signal clear.

The Golden principles

  • Alerts != Logging
  • Do not mix alerts with logging or status updates
  • Manage by Exception. This well-established principle from formal management frameworks like PRINCE2 emphasises that status updates and ‘situation OK’ reports don’t belong with alerts.
  • Silence is golden. When alerts are silent, operations should be running smoothly. This should be the target state.
  • Alert configuration works best when those responding to alerts have primary input. For technical alerts, technical team leadership should guide the approach. If broader stakeholder visibility is needed:
    • Ensure clear explanation of alert configurations
    • Establish appropriate logging systems for review during business hours
  • Prioritise urgency of alerts appropriately
  • From Lean processes: ‘stop doing things that have no value.’ If an activity doesn’t add value to a process, it detracts from it.
  • Concentrate on solutions rather than problems - effective alert management focuses on service reliability rather than individual heroics. As the saying goes: “the cemeteries are full of indispensable people” - systems should be sustainable and manageable, not dependent on extraordinary individual effort.