Alert Management: Maintenance Windows

Keep’s Maintenance Windows feature provides a critical mechanism for managing alert noise during scheduled maintenance periods or other planned events. By defining Maintenance Window rules, users can suppress alerts that are irrelevant during these times, ensuring that only actionable alerts reach the operations team.

Introduction

In dynamic IT environments, it’s common to have periods where certain alerts are expected and should not trigger incident responses. Keep’s Maintenance Windows feature allows users to define specific rules that temporarily suppress alerts based on various conditions, such as time windows or alert attributes. This helps prevent unnecessary alert fatigue and ensures that teams can focus on critical issues.

How It Works

  1. Maintenance Window Rule Definition: Users define Maintenance Window rules specifying the conditions under which alerts should be suppressed.
  2. Condition Specification: A CEL (Common Expression Language) query is associated with each Maintenance Window rule to define the conditions for suppression.
  3. Time Window Configuration: Maintenance Window rules can be set for specific start and end times, or based on a relative duration.
  4. Alert Suppression: During the active period of a Maintenance Window rule, any alerts matching the defined conditions are either suppressed and not shown in alerts feed or shown in the feed in suppressed status (this is configurable).

Practical Example

Suppose your team schedules a database upgrade that could trigger numerous non-critical alerts. You can create a Maintenance Window rule that suppresses alerts from the database service during the upgrade window. This ensures that your operations team isn’t overwhelmed by non-actionable alerts, allowing them to focus on more critical issues.

Core Concepts

  • Maintenance Window Rules: Configurations that define when and which alerts should be suppressed based on time windows and conditions.
  • CEL Query: A query language used to specify the conditions under which alerts should be suppressed. For example, a CEL query might suppress alerts where the source is a specific service during a maintenance window.
  • Time Window: The specific start and end times or relative duration during which the Maintenance Window rule is active.
  • Alert Suppression: The process of ignoring alerts that match the Maintenance Window rule’s conditions during the specified time window.

Status-Based Filtering in Maintenance Windows

In Keep, certain alert statuses are automatically ignored by Maintenance Window rules. Specifically, alerts with the statuses RESOLVED and ACKNOWLEDGED are not suppressed by Maintenance Window rules. This is intentional to ensure that resolving alerts can still be processed and appropriately close or update active incidents.

Why Are Some Statuses Ignored?

• RESOLVED Alerts: These alerts indicate that an issue has been resolved. By allowing these alerts to bypass Maintenance Window rules, Keep ensures that any active incidents related to the alert can be properly closed, maintaining the integrity of the alert lifecycle. • ACKNOWLEDGED Alerts: These alerts have been acknowledged by an operator, signaling that they are being addressed. Ignoring these alerts in Maintenance Windows ensures that operators can track the progress of incidents and take necessary actions without interference.

By excluding these statuses from Maintenance Window suppression, Keep allows for the continuous and accurate management of alerts, even during Maintenance Window periods, ensuring that resolution processes are not disrupted.

Creating a Maintenance Window Rule

To create a Maintenance Window rule:

  1. Define the Maintenance Window Name and Description: Provide a name and optional description for the Maintenance Window rule to easily identify its purpose.
  2. Specify the CEL Query: Use CEL to define the conditions under which alerts should be suppressed (e.g., source == "database").
  3. Set the Time Window: Choose a specific start and end time, or define a relative duration for the Maintenance Window.
  4. Enable the Rule: Decide whether the rule should be active immediately or scheduled for future use.

Best Practices

  • Plan Maintenance Windows in Advance: Schedule Maintenance Window periods in advance for known maintenance windows to prevent unnecessary alerts.
  • Use Specific Conditions: Define precise CEL queries to ensure only the intended alerts are suppressed.
  • Review and Update Maintenance Windows: Regularly review active Maintenance Window rules to ensure they are still relevant and adjust them as necessary.