Monitoring is one of the most important means of maintaining the reliability, availability, and performance of your Amazon ElastiCache resources. In this article, we look at how to use Amazon CloudWatch and external tools to keep a Redis cluster healthy and prevent unexpected outages. We also discuss several ways to forecast and prepare for scaling needs.
Amazon ElastiCache and Amazon CloudWatch: a synergy
The combination of Amazon ElastiCache and Amazon CloudWatch can greatly improve the visibility of core performance metrics related to resources. Additionally, Amazon CloudWatch alarms can help you set metric thresholds and trigger notifications, ensuring timely reminders when preventive action is required.
Monitoring trends over time can also help you detect continued workload growth. Amazon CloudWatch retains metric data for up to 455 days (15 months), and you can review how metrics evolve across this window to forecast future resource utilization.
Monitoring resources
The health of an Amazon ElastiCache Redis cluster is determined by the utilization of key components such as CPU, memory, and network. Excessive use of these components can result in increased latency and reduced overall performance; over-provisioning, on the other hand, leads to underutilized resources and undermines cost optimization.
Amazon ElastiCache provides metrics that enable you to monitor your clusters; as of this writing, Amazon Cloud Technologies has released 18 new Amazon CloudWatch metrics.
Amazon CloudWatch metrics for Amazon ElastiCache fall into two main categories: engine-level metrics (generated by the Redis INFO command) and host-level metrics (from the ElastiCache node's operating system). These metrics are measured and published for each cache node at 60-second intervals. Although Amazon CloudWatch lets you choose any statistic and period for each metric, only a well-chosen combination is actually useful. For example, the Average, Minimum, and Maximum statistics of CPU usage are meaningful, but the Sum statistic has little practical value.
CPU
Redis can use different CPUs to perform auxiliary operations such as snapshot saving or UNLINK, but can only run commands through a single thread. In other words, Redis can only process one command at a time.
Given the single-threaded nature of Redis, Amazon ElastiCache offers the EngineCPUUtilization metric, which reports the load of the Redis process itself and helps you understand the health of your Redis workload.
Different use cases have different tolerances for high EngineCPUUtilization, and there is no universal threshold. As a best practice, however, this article recommends keeping EngineCPUUtilization below 90%.
Benchmarking your cluster with your application and expected workload can help you correlate EngineCPUUtilization with actual system performance. We recommend that you set up multiple Amazon CloudWatch alarms at different tiers for EngineCPUUtilization to notify you when each threshold is reached (for example, 65% WARN, 90% HIGH) before your cluster performance is actually impacted.
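As a sketch of what tiered alarms could look like, the following Amazon CLI command creates the WARN-level alarm; the cluster ID my-cluster-001 and the SNS topic ARN are placeholders for your own resources, and the HIGH-level alarm is the same command with a different name and a threshold of 90:
$ aws cloudwatch put-metric-alarm \
    --alarm-name my-cluster-engine-cpu-warn \
    --namespace AWS/ElastiCache \
    --metric-name EngineCPUUtilization \
    --dimensions Name=CacheClusterId,Value=my-cluster-001 \
    --statistic Average --period 60 --evaluation-periods 5 \
    --threshold 65 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:my-alert-topic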
If your cluster has a high EngineCPUUtilization, you can consider the following remedies:
- A high EngineCPUUtilization most likely originates from specific Redis operations. Redis commands use Big O notation to define their time complexity. You can use Redis SLOWLOG to determine how long commands take to complete (see the SLOWLOG example after this list). A common problem is overuse of the Redis KEYS command, so avoid this command in a production environment.
- Beyond the time complexity of individual Redis commands, a non-optimal data model can also put unnecessary pressure on EngineCPUUtilization. For example, the cardinality of a set directly affects performance: the time complexity of SMEMBERS, SDIFF, SUNION, and other set commands is defined by the number of elements in the set. The size of a hash (its number of fields) and the type of operation run against it likewise affect EngineCPUUtilization.
- If you are running Redis in a node group with multiple nodes, it is recommended that you use replicas to create snapshots. When a snapshot is created from a replica, the primary node is not affected by the snapshot save task and can continue to process requests without slowing down. You can verify which node is creating a snapshot by checking the SaveInProgress metric. Note that in the case of a full synchronization, the snapshot is always created on the primary node.
- A large number of operations may also lead to higher EngineCPUUtilization. Identify which type of load drives the operations. If high EngineCPUUtilization is primarily due to heavy read operations, distribute reads across replicas: with cluster mode disabled, use the Amazon ElastiCache reader endpoint in your Redis client library; with cluster mode enabled, use the Redis READONLY command. If you are already reading from read replicas, you can add more nodes (up to five read replicas) to the replication group or to individual shards. If write operations cause the high EngineCPUUtilization, you need to give the primary node more compute capacity. You can upgrade to the latest-generation M5 and R5 node types, whose faster processors complete tasks more quickly. If you are already using the latest generation of nodes, consider switching from cluster mode disabled to cluster mode enabled. To complete the switchover, create a backup of the existing cluster and restore the data into a new cluster with cluster mode enabled. Once cluster mode is enabled, you can add shards and scale out: the more shards you have, the more primary nodes you can add, and the greater the total compute capacity.
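To inspect slow commands as suggested in the first item above, a minimal redis-cli session might look like the following sketch; the endpoint is illustrative:
$ redis-cli -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com SLOWLOG GET 10
$ redis-cli -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com SLOWLOG RESET
SLOWLOG GET returns the slowest recent commands together with their execution time in microseconds; SLOWLOG RESET clears the log so that a fresh measurement window can begin.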
In addition to monitoring the load of the Redis process with the EngineCPUUtilization metric, you should pay attention to the usage of the remaining CPU cores. The CPUUtilization metric reports the CPU utilization percentage of the entire host. For example, an influx of new connections can drive up CPUUtilization, because establishing connections carries its own processing cost.
For small nodes with two or fewer CPU cores, pay close attention to the CPUUtilization metric. Because the Redis process shares the node's CPU cores with operations such as snapshots and managed maintenance events, and some compute capacity must remain available for them, CPUUtilization is likely to reach 100% before EngineCPUUtilization does.
Finally, Amazon ElastiCache also supports T2 and T3 cache nodes. These burstable nodes provide a baseline level of CPU performance and can burst above it until their CPU credits are exhausted, after which performance gradually falls back to the baseline. If you use T2 or T3 cache nodes, monitor the CPUCreditUsage and CPUCreditBalance metrics.
Memory
Memory is a core element in Redis. Knowing your cluster's memory utilization can help you avoid data loss and keep adapting to growing dataset sizes.
The memory section of the Redis INFO command provides statistics about the current node memory utilization.
One of the most important metrics is used_memory, the amount of memory allocated by the Redis allocator. Amazon CloudWatch provides a metric called BytesUsedForCache, derived from used_memory, that you can use to determine the current cluster's memory utilization.
With the release of 18 new Amazon CloudWatch metrics, you can now use DatabaseMemoryUsagePercentage to view the memory utilization percentage, computed from current memory usage (BytesUsedForCache) and maxmemory. maxmemory sets the maximum amount of memory available for the dataset; you can find its value in the Redis node-type specific parameters and confirm it in the memory section of the Redis INFO command. Reserving memory reduces the memory effectively available to the cluster. For example, the cache.r5.large node type has a default maxmemory of 14037181030 bytes, but if you reserve the default 25% of memory, the memory available to the dataset is 10527885772.5 bytes (14037181030 x 0.75).
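To verify these values on a running node, you can read the memory section of INFO directly; a minimal sketch (the endpoint is illustrative):
$ redis-cli -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com INFO memory | grep -E '^(used_memory|maxmemory):'
Dividing used_memory by the effective maxmemory should roughly match the DatabaseMemoryUsagePercentage reported by Amazon CloudWatch.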
When your DatabaseMemoryUsagePercentage reaches 100%, the Redis maxmemory policy is triggered and data is evicted according to the selected policy (for example, volatile-lru). If there are no items in the cache eligible for eviction under that policy, write operations fail and the Redis primary returns the following message: (error) OOM command not allowed when used memory > 'maxmemory'
Eviction does not necessarily imply a problem or performance degradation. Some workloads are designed to take advantage of eviction. To monitor evictions within a cluster, you can use the Evictions metric. This metric is also available through the Redis INFO command. Note, however, that a large number of evictions will also drive up the EngineCPUUtilization metric.
If your workload is not designed with eviction in mind, we recommend setting up Amazon CloudWatch alarms on the DatabaseMemoryUsagePercentage metric so that you can proactively expand memory capacity before it runs out. With cluster mode disabled, you can scale up to a larger node type for more memory capacity. With cluster mode enabled, scaling out by adding shards is the most reasonable way to grow memory capacity incrementally.
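With cluster mode enabled, an online resharding can be started from the Amazon CLI; a sketch, assuming a replication group named my-cluster that should grow to four shards:
$ aws elasticache modify-replication-group-shard-configuration \
    --replication-group-id my-cluster \
    --node-group-count 4 \
    --apply-immediately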
Another way to control dataset growth is to set a TTL (time to live) on keys. Once the TTL expires, the key is no longer available and is deleted either passively (when a client attempts to access it) or actively (when Redis periodically samples random keys). You can use the Redis SCAN command to iterate over parts of the dataset, which as a side effect triggers the passive removal of expired keys. In Amazon ElastiCache, key expirations can be monitored through the Reclaimed Amazon CloudWatch metric.
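As a small illustration of TTLs, the commands below write a key that expires after one hour and check its remaining lifetime; the key name and value are arbitrary:
$ redis-cli SET session:42 "payload" EX 3600
OK
$ redis-cli TTL session:42
(integer) 3597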
During backup or failover, when writing cluster data to the .rdb file, Redis uses additional memory to record writes to the cluster. If this additional memory usage exceeds the available memory capacity in the node, excessive paging and SwapUsage can result in slower processing. Therefore, we recommend that you reserve a portion of the memory capacity. Reserved memory is a memory resource reserved to accommodate certain operations, such as backups or failovers.
Finally, this article recommends that you set up an Amazon CloudWatch alarm for SwapUsage. This metric should not exceed 50 MB. If the cluster is consuming swap space, verify that sufficient reserved memory is configured in the cluster's parameter group.
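One possible alarm configuration, using 52428800 bytes (50 MB) as the threshold and placeholder resource names:
$ aws cloudwatch put-metric-alarm \
    --alarm-name my-cluster-swap-usage \
    --namespace AWS/ElastiCache \
    --metric-name SwapUsage \
    --dimensions Name=CacheClusterId,Value=my-cluster-001 \
    --statistic Maximum --period 300 --evaluation-periods 1 \
    --threshold 52428800 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:my-alert-topic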
Network
The main factor determining your cluster's network bandwidth capacity is the node type you use.
We strongly recommend that you benchmark your cluster before actual production use to evaluate its performance and set the correct thresholds for monitoring. You should run benchmarks over a period of hours to reflect the frequency and potential need for temporary network burst capacity.
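One way to generate such load is the redis-benchmark tool that ships with Redis. The following sketch issues one million GET and SET operations with 1 KB values from 100 parallel clients; tune the numbers (and repeat the run) to approximate your own workload over a longer window:
$ redis-benchmark -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com \
    -c 100 -n 1000000 -d 1024 -t get,set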
Amazon ElastiCache and Amazon CloudWatch provide several host-level metrics to monitor network utilization, similar to Amazon Elastic Compute Cloud (Amazon EC2) instances. NetworkBytesIn and NetworkBytesOut represent the number of bytes the host has read from and sent to the network, respectively. NetworkPacketsIn and NetworkPacketsOut represent the number of packets received and sent on the network, respectively.
Once you have established the network capacity of your cluster, you can map it against your highest expected peak network utilization. This peak should not exceed the network capacity of the selected node type. Each node type provides some burst capacity, but we recommend reserving that extra headroom for unexpected traffic increases.
Based on your defined maximum utilization, you can create Amazon CloudWatch alarms to ensure that email notifications are sent when network utilization is higher than expected or close to this limit.
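To ground the alarm threshold in observed behavior, you can first query the recent peak of NetworkBytesOut; a sketch with a placeholder cluster ID and time window:
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache \
    --metric-name NetworkBytesOut \
    --dimensions Name=CacheClusterId,Value=my-cluster-001 \
    --start-time 2021-06-01T00:00:00Z --end-time 2021-06-08T00:00:00Z \
    --period 3600 --statistics Maximum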
If your network utilization keeps increasing and network alerts are triggered, take the necessary steps to add network capacity. To make sure the tuning is correct, first identify what is causing the increase. You can use Amazon CloudWatch metrics to detect changes in the intensity of operations and attribute the fluctuations to read or write operations.
If the increase in network utilization is due to read operations, first ensure that all existing read replicas are used to process the read operations. You can use the ElastiCache reader endpoint with the Redis client library with cluster mode disabled in the configuration, or the Redis READONLY command with cluster mode enabled. If you are already performing read operations in read replicas, you can add more nodes to the replica group or shard (up to five read replicas are supported).
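Replicas can be added online through the Amazon CLI; a sketch, assuming a replication group named my-cluster that should end up with three replicas per shard:
$ aws elasticache increase-replica-count \
    --replication-group-id my-cluster \
    --new-replica-count 3 \
    --apply-immediately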
If the increase in network utilization comes from write operations, more capacity needs to be added to the master node. In a scenario where cluster mode is disabled, you can scale the master node directly to a larger node type. This scale-up operation can also be performed when cluster mode is enabled.
In addition, when cluster mode is enabled, you can use another method of scaling for read and write use cases. This approach will add more shards and perform horizontal scaling, ensuring that each node is only responsible for a small subset of the dataset, thus keeping network utilization at each node low.
While horizontal scaling can solve most network-related problems, hot keys are a corner case worth discussing. A hot key is a specific key or small subset of keys accessed far more frequently than the rest of the dataset, so the shard holding them may have to sustain much more traffic than other shards. As long as these keys remain on the same shard, that shard's network utilization stays high. In rare cases like this, scaling up tends to be more appropriate because it does not alter the existing data distribution. Alternatively, you can refactor the data model to rebalance network utilization, for example by duplicating frequently read strings or splitting objects that store many elements.
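To locate hot keys, newer Redis versions include a built-in sampler in the CLI. Note that redis-cli --hotkeys only works when an LFU maxmemory policy is configured on the node; the endpoint is illustrative:
$ redis-cli -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com --hotkeys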
Connections
Amazon CloudWatch provides two metrics for monitoring cluster connections:
- CurrConnections – The number of concurrent and active connections registered by the Redis engine. This metric is derived from the connected_clients property in the Redis INFO command.
- NewConnections – The total number of connections accepted by Redis during the specified period, including all active and closed connections. This metric is also derived from the Redis INFO command.
To monitor connections, the main limit to watch is the Redis maxclients setting. Amazon ElastiCache sets this value to 65,000; in other words, each node supports up to 65,000 concurrent connections.
Both the CurrConnections and NewConnections metrics can help detect and prevent performance issues. For example, a continuously growing CurrConnections can quickly exhaust the 65,000 available connections. Such growth may indicate that the application is not closing connections properly and is leaving them open on the server side. In addition to investigating application behavior, you can use tcp-keepalive on the cluster to detect and terminate potentially dead peer connections. In Redis 3.2.4 and later, the default tcp-keepalive timer is 300 seconds; in older versions, tcp-keepalive is disabled by default. You can tune the tcp-keepalive timer in the cluster's parameter group, as shown below.
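A sketch of that parameter change, assuming a custom parameter group named my-redis-params is already attached to the cluster:
$ aws elasticache modify-cache-parameter-group \
    --cache-parameter-group-name my-redis-params \
    --parameter-name-values ParameterName=tcp-keepalive,ParameterValue=300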
Monitoring NewConnections is also important. Note that the 65,000 maxclients limit does not apply to this metric, because it measures the total number of connections created over a period, which do not necessarily exist simultaneously. During a one-minute sampling period, a node may receive 100k NewConnections yet never exceed 2,000 CurrConnections (simultaneous connections). In that example, the workload is not at risk of hitting the connection limit. However, rapidly opening and closing large numbers of connections during the period can still hurt node performance: creating a TCP connection takes a few milliseconds, which is extra overhead on top of every Redis operation the application runs.
As a best practice, applications should reuse existing connections whenever possible to avoid the cost of creating new ones. You can implement connection pooling through your Redis client library (if supported), with a framework suited to your application environment, or build one from scratch.
This matters even more when you use Amazon ElastiCache's in-transit encryption, because the TLS handshake adds extra time and CPU usage to every new connection.
Replication
If there is at least one read replica, the primary node must send a stream of replication commands to the replicas. The ReplicationBytes metric shows how much data is being replicated. Although this metric reflects the write load on the replication group, it does not provide insight into the health of the replicas. For that, you can use the ReplicationLag metric, which indicates how far a replica lags behind the primary node. Starting with Redis 5.0.6, this data is captured at millisecond granularity. Although spikes are rare, monitoring ReplicationLag helps detect potential problems: high replication lag indicates that the primary or the replica cannot process replication operations in a timely manner. When this happens, the replica may eventually need a full synchronization, a more involved process that requires a snapshot to be created on the primary node and can cause performance degradation. You can combine the ReplicationLag metric with the SaveInProgress metric to identify full synchronization attempts.
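A sketch of a ReplicationLag alarm; the one-second threshold and five-minute evaluation window are assumptions to adapt to your own tolerance, and the resource names are placeholders:
$ aws cloudwatch put-metric-alarm \
    --alarm-name my-cluster-replication-lag \
    --namespace AWS/ElastiCache \
    --metric-name ReplicationLag \
    --dimensions Name=CacheClusterId,Value=my-cluster-002 \
    --statistic Maximum --period 60 --evaluation-periods 5 \
    --threshold 1 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:my-alert-topic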
Severe replication lag trends often stem from issues such as excessive write operations, exhausted network capacity, or degraded underlying services.
With cluster mode disabled there is a single primary node; if write activity against it is too high, consider enabling cluster mode, which spreads write operations across multiple shards and their respective primary nodes. If the replication lag is caused by exhausted network capacity, follow the steps in the "Network" section of this article.
Latency
You can measure command latency using a set of Amazon CloudWatch metrics that report the aggregated latency for each data structure. These metrics are calculated from the commandstats section of the Redis INFO command.
For example, the StringBasedCmdsLatency metric represents the average latency, in microseconds, of string-based commands run within a specific time frame.
This latency does not include network and I/O time; it covers only the time Redis spends processing operations.
If Amazon CloudWatch metrics indicate an increase in latency for specific data structures, you can use Redis SLOWLOG to determine which specific commands are taking longer to run.
If your application experiences high latency but the Amazon CloudWatch metrics indicate low latency at the Redis engine level, the network is the first place to look. The Redis CLI provides a latency monitoring tool that helps isolate network or application problems (min, max, and avg are in milliseconds):
$ redis-cli --latency-history -h mycluster.6advcy.ng.0001.euw1.cache.amazonaws.com
min: 0, max: 8, avg: 0.46 (1429 samples) -- 15.01 seconds range
min: 0, max: 1, avg: 0.43 (1429 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.43 (1427 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.46 (1428 samples) -- 15.00 seconds range
min: 0, max: 9, avg: 0.44 (1428 samples) -- 15.01 seconds range
Finally, you can monitor clients for activity that could impact application performance or cause increased processing time.
Amazon ElastiCache Events and Amazon SNS
Amazon ElastiCache logs events related to your resources, such as failovers, scaling operations, and planned maintenance. Each event includes the date and time, the source name, the source type, and a description. You can easily access events in the Amazon ElastiCache console, through the Amazon Command Line Interface (Amazon CLI) describe-events command, or via the Amazon ElastiCache API.
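For example, the following Amazon CLI call lists the last 24 hours of events for one cluster; the identifier is a placeholder:
$ aws elasticache describe-events \
    --source-identifier my-cluster-001 \
    --source-type cache-cluster \
    --duration 1440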
Monitoring events can help you stay aware of the current state of your cluster and take necessary actions based on events. Although Amazon ElastiCache events can be obtained in several ways, we strongly recommend that you use Amazon Simple Notification Service (Amazon SNS) in your Amazon ElastiCache configuration to send notifications of important events.
When an Amazon SNS topic is added to an Amazon ElastiCache cluster, all important events related to the cluster are published to the Amazon SNS topic and can be emailed.
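Attaching a topic to an existing cluster is a one-line change; a sketch with placeholder identifiers:
$ aws elasticache modify-cache-cluster \
    --cache-cluster-id my-cluster-001 \
    --notification-topic-arn arn:aws:sns:eu-west-1:123456789012:elasticache-events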
When using Amazon SNS with your cluster, you can also take action on ElastiCache events programmatically. For example, an Amazon Lambda function can subscribe to an Amazon SNS topic and run automatically when certain events are detected.
Summary
In this article, we discussed common challenges and associated best practices for monitoring Amazon ElastiCache Redis resources. With the knowledge mentioned in this article, you can now easily detect, diagnose, and maintain Amazon ElastiCache Redis resources.
Author of this article
Yann Richard
Amazon ElastiCache Solutions Architect
His personal goal is to transfer data in sub-milliseconds and complete a marathon in under 4 hours.