Specifications and Stress Testing of Keep

If you are using Keep and have performance issues, we will be more than happy to help you. Just join our slack and shoot a message on the #help channel.

Overview

Spec and stress testing are crucial to ensuring the robust performance and scalability of Keep. This documentation outlines the key areas of focus for testing Keep under different load conditions, considering both the simplicity of setup for smaller environments and the scalability mechanisms for larger deployments.

Keep was initially designed to be user-friendly for setups handling less than 10,000 alerts. However, as alert volumes increase, users can leverage advanced features such as Elasticsearch for document storage and Redis + ARQ for queue-based alert ingestion. While these advanced configurations are not fully documented here, they are supported and can be discussed further in our Slack community.

How To Reproduce

To reproduce the stress testing scenarios mentioned above, please refer to the STRESS.md file in Keep’s repository. This document provides step-by-step instructions on how to set up, run, and measure the performance of Keep under different load conditions.

Performance Testing

Factors Affecting Specifications

The primary parameters that affect the specification requirements for Keep are:

Alerts Volume: The rate at which alerts are ingested into the system.
Total Alerts: The cumulative number of alerts stored in the system.
Number of Workflows: How many automation run as a result of alert.

Main Components:

Keep Backend - API and business logic. A container that serves FastAPI on top of gunicorn.
Keep Frontend - Web app. A container that serves the react app.
Database - Stores the alerts and any other operational data.
Elasticsearch (opt out by default) - Stores alerts as document for better search performance.
Redis (opt out by default) - Used, together with ARQ, as an alerts queue.

Testing Scenarios:

Low Volume (< 10,000 total alerts, hundreds of alerts per day):
- Setup: Use a standard relational database (e.g., MySQL, PostgreSQL) with default configurations.
- Expectations: Keep should handle queries and alert ingestion with minimal resource usage.
Medium Volume (10,000 - 100,000 total alerts, thousands of alerts per day):
- Setup: Scale the database to larger instances or clusters. Adjust best practices to the DB (e.g. increasing innodb_buffer_pool_size)
- Expectations: CPU and RAM usage should increase proportionally but remain within acceptable limits.

High Volume (100,000 - 1,000,000 total alerts, >five thousands of alerts per day):
- Setup: Deploy Keep with Elasticsearch for storing alerts as documents.
- Expectations: The system should maintain performance levels despite the large alert volume, with increased resource usage managed through scaling strategies.
Very High Volume (> 1,000,000 total alerts, tens of thousands of alerts per day):
- Setup: Deploy Keep with Elasticsearch for storing alerts as documents.
- Setup #2: Deploy Keep with Redis and with ARQ to use Redis as a queue.

Recommended Specifications by Alert Volume

Number of Alerts	Keep Backend	Keep Database	Redis	Elasticsearch
< 10,000	1 vCPUs, 2GB RAM	2 vCPUs, 8GB RAM	Not required	Not required
10,000 - 100,000	4 vCPUs, 8GB RAM	8 vCPUs, 32GB RAM, optimized indexing	Not required	Not required
100,000 - 500,000	8 vCPUs, 16GB RAM	8 vCPUs, 32GB RAM, advanced indexing	4 vCPUs, 8GB RAM	8 vCPUs, 32GB RAM, 2-3 nodes
> 500,000	8 vCPUs, 16GB RAM	8 vCPUs, 32GB RAM, advanced indexing, sharding	4 vCPUs, 8GB RAM	8 vCPUs, 32GB RAM, 2-3 nodes

Performance by Operation Type, Load, and Specification

Operation Type	Load	Specification	Execution Time
Digest Alert	100 alerts per minute	4 vCPUs, 8GB RAM	~0.5 seconds
Digest Alert	500 alerts per minute	8 vCPUs, 16GB RAM	~1 second
Digest Alert	1,000 alerts per minute	16 vCPUs, 32GB RAM	~1.5 seconds
Run Workflow	10 workflows per minute	4 vCPUs, 8GB RAM	~1 second
Run Workflow	50 workflows per minute	8 vCPUs, 16GB RAM	~2 seconds
Run Workflow	100 workflows per minute	16 vCPUs, 32GB RAM	~3 seconds
Ingest via Queue	100 alerts per minute	4 vCPUs, 8GB RAM, Redis	~0.3 seconds
Ingest via Queue	500 alerts per minute	8 vCPUs, 16GB RAM, Redis	~0.8 seconds
Ingest via Queue	1,000 alerts per minute	16 vCPUs, 32GB RAM, Redis	~1.2 seconds

Table Explanation:

Operation Type: The specific operation being tested (e.g., digesting alerts, running workflows).
Load: The number of operations per minute being processed (e.g., number of alerts per minute).
Specification: The CPU, RAM, and additional services used for the operation.
Execution Time: Approximate time taken to complete the operation under the given load and specification.

Fine Tuning

As any deployment has its own characteristics, such as the balance between volume vs. total count of alerts or volume vs. number of workflows, Keep can be fine-tuned with the following parameters:

Number of Workers: Adjust the number of Gunicorn workers to handle API requests more efficiently. You can also start additional API servers to distribute the load.
Distinguish Between API Server Workers and Digesting Alerts Workers: Separate the workers dedicated to handling API requests from those responsible for digesting alerts, ensuring that each set of tasks is optimized according to its specific needs.
Add More RAM to the Database: Increasing the RAM allocated to your database can help manage larger datasets and improve query performance, particularly when dealing with high volumes of alerts.
Optimize Database Configuration: Keep was mainly tested on MySQL and PostgreSQL. Different database may have different fine tuning mechanisms.
Horizontal Scaling: Consider deploying additional instances of the API and database services to distribute the load more effectively.

FAQ

1. How do I estimate the spec I need for Keep?

To estimate the specifications required for Keep, consider both the number of alerts per minute and the total number of alerts you expect to handle. Refer to the Recommended Specifications by Alert Volume table above to match your expected load with the appropriate resources.

2. How do I know if I need Elasticsearch?

Elasticsearch is typically needed when you are dealing with more than 50,000 total alerts or if you require advanced search and query capabilities that are not efficiently handled by a traditional relational database. If your system’s performance degrades significantly as alert volume increases, it may be time to consider Elasticsearch.

3. How do I know if I need Redis?

Redis is recommended when your alert ingestion rate exceeds 1,000 alerts per minute or when you notice that the API is becoming a bottleneck due to high ingestion rates. Redis, combined with ARQ (Asynchronous Redis Queue), can help manage and distribute the load more effectively.

4. What should I do if Keep’s performance is still inadequate?

If you have scaled according to the recommendations and are still facing performance issues, consider:

Optimizing your database configuration: Indexing, sharding, and query optimization can make a significant difference.
Horizontal scaling: Distribute the load across multiple instances of the API and database services.
Reach out to our Slack community: For personalized support, reach out to us on Slack, and we’ll help you troubleshoot and optimize your Keep deployment.

For any additional questions or tailored advice, feel free to join our Slack community where our team and other users are available to assist you.

Overview

Alerts

Incidents

AIOps

Workflow Automation

Providers

Deployment

Development

Keep API

Keep CLI

Getting started

Specifications and Stress Testing of Keep

Overview

How To Reproduce

Performance Testing

Factors Affecting Specifications

Main Components:

Testing Scenarios:

Recommended Specifications by Alert Volume

Performance by Operation Type, Load, and Specification

Table Explanation:

Fine Tuning

FAQ

1. How do I estimate the spec I need for Keep?

2. How do I know if I need Elasticsearch?

3. How do I know if I need Redis?

4. What should I do if Keep’s performance is still inadequate?

Overview

Alerts

Incidents

AIOps

Workflow Automation

Providers

Deployment

Development

Keep API

Keep CLI

​Specifications and Stress Testing of Keep

​Overview

​How To Reproduce

​Performance Testing

​Factors Affecting Specifications

​Main Components:

​Testing Scenarios:

​Recommended Specifications by Alert Volume

​Performance by Operation Type, Load, and Specification

​Table Explanation:

​Fine Tuning

​FAQ

​1. How do I estimate the spec I need for Keep?

​2. How do I know if I need Elasticsearch?

​3. How do I know if I need Redis?

​4. What should I do if Keep’s performance is still inadequate?

Specifications and Stress Testing of Keep

Overview

How To Reproduce

Performance Testing

Factors Affecting Specifications

Main Components:

Testing Scenarios:

Recommended Specifications by Alert Volume

Performance by Operation Type, Load, and Specification

Table Explanation:

Fine Tuning

FAQ

1. How do I estimate the spec I need for Keep?

2. How do I know if I need Elasticsearch?

3. How do I know if I need Redis?

4. What should I do if Keep’s performance is still inadequate?