
Online emergency rescue-AWS frequency limit

Published: 2025-04-21 15:26:38


Problem

On a hot afternoon, I was sipping a Coca-Cola and idly watching Cursor generate code when an urgent message landed in the group chat. My heart skipped a beat. I opened it: there was a problem with the online service, and multiple energy-statistics interfaces were reporting errors and returning no data.

Troubleshooting

First, we checked the online ES logs for the energy-statistics interface. The error message was as follows:

GetEnergyAnalysisRpc err:error: code = 1083 reason = err: code = 10083 reason = error: code = 10003 reason = ThrottlingException: Request rate limit exceeded

The log makes the cause clear: AWS Timestream is throttling our queries. Our energy data is time-series data, and when selecting the technology stack we chose Timestream as the time-series database. Timestream's query throughput is limited, billed and allocated according to the purchased TCUs (Timestream Compute Units); each TCU provides a fixed slice of query resources (CPU, memory, storage, and network bandwidth).

When the frequency or complexity of the queries an application issues exceeds the purchased TCU quota, Timestream returns TooManyRequests or ThrottlingException errors to throttle subsequent requests. Typically, large numbers of concurrent queries and complex queries (aggregations over wide time spans, high-cardinality data, or queries that miss the available indexes) can exhaust the TCU quota quickly.
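On the client side, throttled queries can be retried with exponential backoff instead of failing outright. A minimal sketch under stated assumptions: `run_query` is a generic callable, and the exception class is a stand-in for the throttling error raised by the real Timestream client.

```python
import random
import time

class ThrottlingException(Exception):
    """Stand-in for the throttling error raised by the real Timestream client."""

def query_with_backoff(run_query, max_retries=5, base_delay=0.1):
    """Retry a throttled query with exponential backoff and full jitter.

    run_query is any zero-argument callable that raises
    ThrottlingException while the TCU quota is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return run_query()
        except ThrottlingException:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Sleep a random 0 .. base_delay * 2^attempt seconds before retrying.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

With the real client, the `except` clause would catch the client's own throttling exception instead. Backoff only smooths bursts; it does not add TCU capacity.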

Solution

Emergency measures

The first priority was restoring normal service, so we decided to scale Timestream up. Each purchased TCU corresponds to 4 CPUs and 16 GB of memory. We urgently added 64 TCUs, providing a larger compute quota to support more concurrent, complex, and high-frequency queries. This temporarily resolved the problem, and the energy statistics displayed normally again.

Root cause

The online problem was resolved, but the investigation could not end there. According to our earlier capacity estimates there should have been no throttling, so something was likely wrong with how the data was being queried, and further investigation was needed. First, a look at the mainstream optimization options.

Mainstream optimization solutions

  1. SQL execution logging and AI analysis
  • Turn on SQL logging: enable detailed logging in the AWS Timestream configuration, recording SQL statements, execution times, etc., and store the logs in Amazon CloudWatch Logs.
  • AI statistical analysis: build models with AWS Glue or Amazon SageMaker to analyze query frequency and execution times, identify inefficient SQL, and classify it for optimization.

The Timestream console has a new feature called Query Insights that can assist with query optimization and tuning.

  2. High-frequency interface cache optimization

ElastiCache: for high-frequency query interfaces, cache data in Amazon ElastiCache (Redis or Memcached) to reduce direct query pressure on the database.

Cache strategy design: update the cache immediately after write operations, and set dynamic expiration times for reads (a short TTL for frequently changing data, a long TTL for low-frequency data).

  3. Data sharding and read/write optimization
  • Sharding strategy: shard by time, geography, or user ID to distribute the read/write load; for example, store time-series data in daily or weekly shards
  • Partition key optimization: set the partition key on a selective field such as user_id (it determines how data is distributed across shards) to improve query efficiency and reduce TCU consumption
  4. Resource expansion and configuration adjustment
  • Expand TCUs: temporarily raise the cluster's query limit based on monitoring, or upgrade the TCU configuration to handle high concurrency
  • Adjust hot-data retention: extend the memory-store retention period (e.g., from 12 hours to 30 hours) to widen the hot-data range and improve query performance
  5. SQL optimization
  • Avoid full-table scans: specify the time range precisely to reduce the amount of data scanned
  • Merge into wide tables: combine multiple records into wide tables to reduce I/O pressure and storage costs
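To illustrate the full-table-scan point, here is a sketch of building a time-bounded query. The table and column names (`device_id`, the measure) are hypothetical; the `ago()` predicate is what lets Timestream skip data outside the requested window.

```python
def build_energy_query(table, device_id, hours):
    """Build a Timestream query restricted to a recent time window.

    The table and column names here are hypothetical. Without the
    time predicate, Timestream must scan the whole table; ago()
    lets it prune everything outside the requested window.
    """
    return (
        f'SELECT time, measure_value::double AS energy '
        f'FROM "{table}" '
        f"WHERE device_id = '{device_id}' "
        f"AND time >= ago({hours}h) "
        f"ORDER BY time DESC"
    )
```

In production, parameters should of course be validated rather than interpolated from untrusted input.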

Final implementation plan

The original table had no partition key, so every query scanned the entire table and wasted a great deal of compute. We therefore decided to start the fix with data partitioning.

  1. Sharding strategy formulation

    Select the partition dimension according to the data characteristics:

    • Measure name (metric name): suitable for time-series data
    • Customer-defined partition key (a specific business field): suited to specific business scenarios

    Limitation: the partition key must be defined when the table is created and cannot be changed on an existing table, because changing it would require redistributing all data to new partitions, causing prolonged unavailability or severe performance degradation.

  2. Partition key optimization

    Set the partition key on a high-frequency query field (such as user_id) to improve query efficiency and reduce TCU consumption. After comparing several candidate fields, we chose the user ID as the partition key, making the user dimension the partitioned query field.

  3. Specific plan

    • Solution 1: double-write. New data is written to both the old and new tables. At query time, route by data distribution: if the requested range lies entirely in the old table or entirely in the new table, query that table alone; if the data spans both, use a UNION across the two tables.
    • Solution 2: double-write plus migration. New data is written to both tables while historical data is migrated in parallel.

    Solution comparison:

    | Solution | Code changes | Data migration | Impact on online services | Overall comparison | Verdict |
    | --- | --- | --- | --- | --- | --- |
    | Double write + UNION compatibility | Yes | No | UNION queries spanning the old and new tables have some impact, but the new table has a partition key, so the impact is controllable | 1. No data migration needed; 2. No need to confirm the data retention period; 3. Little impact on online data; 4. Matches the usage pattern: users generally query only the latest day's data, which lives in the new table | Adopted |
    | Double write + data migration | Yes | Yes | Migration affects online queries | 1. Data must be migrated; 2. The retention period must be confirmed; 3. Online queries are affected during migration; 4. After migration all queries hit the new table; migrating data and confirming retention makes this quite cumbersome | Not adopted |

    Finally, we adopted Solution 1 (double write + UNION-compatible queries) to resolve the problem.
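The adopted routing logic can be sketched as follows. The cutover time, table names, and WHERE clause are placeholders, and `UNION ALL` stands in for the cross-table union described above.

```python
from datetime import datetime

def route_query(start, end, cutover):
    """Pick the table(s) for a time range under the double-write scheme.

    cutover is the moment the new partitioned table started receiving
    writes; everything earlier exists only in the old table.
    """
    if start >= cutover:
        return "new"    # whole range after cutover: new table only
    if end < cutover:
        return "old"    # whole range before cutover: old table only
    return "union"      # range straddles the cutover: query both

def build_sql(start, end, cutover, old_table, new_table, where):
    """Emit the single-table or cross-table UNION query accordingly."""
    base = 'SELECT * FROM "{t}" WHERE {w}'
    route = route_query(start, end, cutover)
    if route == "new":
        return base.format(t=new_table, w=where)
    if route == "old":
        return base.format(t=old_table, w=where)
    return (base.format(t=old_table, w=where)
            + " UNION ALL "
            + base.format(t=new_table, w=where))
```

Since most users query only the latest day's data, the common path hits the partitioned new table alone.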

Outlook

The phase-one fix was delivered through partition keys. After testing and verification, queries over the most recent 48 hours of data became 48% more efficient, a significant improvement. To keep similar problems from recurring, we drew up a phase-two optimization plan: an AWS Timestream + Redis cache.

AI analysis for screening high-frequency interfaces

  • Data source:Extract the execution frequency, response time, resource consumption and other indicators of the query interface from the SQL execution log of Timestream.

  • AI analysis logic:

    1. Feature Extraction:Extract the types of query interfaces (such as SELECT, INSERT), time distribution, parameter patterns and other characteristics through log analysis tools (such as AWS CloudWatch Logs, Amazon Kinesis Data Analytics).

    2. Pattern recognition:Use machine learning models (such as AWS SageMaker's classification algorithm or sequence model) to identify patterns of high-frequency interfaces.

      • Interfaces whose query rate per second (QPS) exceeds the threshold (such as 100 QPS).

      • Interfaces whose response delay exceeds the service tolerance value (such as 200ms).

      • Frequent access to interfaces with the same or similar data (such as queries in fixed time windows).

    3. Priority sort:Generate a list of interfaces that require priority cache based on the interface's QPS, latency sensitivity and resource consumption.

  • Tool support:AWS provides CloudWatch Logs, Amazon Athena, or third-party log analysis tools such as Splunk, which can be used for data extraction and preliminary analysis.
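The screening logic above can be sketched without any ML machinery: aggregate parsed log records per interface, compute QPS and average latency, and rank by the stated thresholds. The record fields and thresholds here are illustrative stand-ins for what would be parsed out of CloudWatch Logs.

```python
from collections import defaultdict

def rank_cache_candidates(log_records, qps_threshold=100, latency_ms=200):
    """Rank query interfaces that exceed the QPS or latency thresholds.

    log_records is an iterable of dicts with 'interface',
    'timestamp' (epoch seconds), and 'latency_ms' keys -- a
    simplified stand-in for fields parsed from CloudWatch Logs.
    """
    stats = defaultdict(lambda: {"count": 0, "latency_sum": 0.0,
                                 "first": float("inf"), "last": 0.0})
    for rec in log_records:
        s = stats[rec["interface"]]
        s["count"] += 1
        s["latency_sum"] += rec["latency_ms"]
        s["first"] = min(s["first"], rec["timestamp"])
        s["last"] = max(s["last"], rec["timestamp"])

    candidates = []
    for name, s in stats.items():
        window = max(s["last"] - s["first"], 1.0)   # avoid divide-by-zero
        qps = s["count"] / window
        avg_latency = s["latency_sum"] / s["count"]
        if qps > qps_threshold or avg_latency > latency_ms:
            candidates.append((name, qps, avg_latency))
    # Highest QPS first: these interfaces benefit most from caching.
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

A SageMaker model could replace the fixed thresholds with learned patterns, but the ranking output is the same: a priority list of interfaces to cache.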

Cache Engine

Cache selection

AWS ElastiCache provides two engines: Redis and Memcached. The choice should follow the business scenario:

| Engine | Applicable scenarios | Advantages | Limitations |
| --- | --- | --- | --- |
| Redis | Scenarios needing complex data structures (hashes, lists, sets, etc.) or high read/write performance; e.g., structured data (user sessions, device status) and atomic operations (counters, queue management) | Rich data types and persistence; high concurrent read/write performance (million-level QPS) | Higher memory footprint; configuration slightly more complex than Memcached |
| Memcached | Simple key-value storage with extremely high concurrency (tens of thousands of requests per second); e.g., high-frequency static data (configuration, simple metadata) needing a minimally configured distributed cache | High memory efficiency; minimalist design supports very high concurrency | Only simple key-value storage; no extended data structures; no persistence by default |

Given the comparison above, Redis clearly fits our more complex business better, so we plan to use Redis as the cache engine.

Cache strategy design: write update mechanism

Core Objectives:Ensure strong consistency between cached data and data in Timestream.

Implementation mechanism

  • Synchronous update process:

    1. The application layer performs write operations (such as INSERT, UPDATE) to Timestream.
    2. After the write operation is successful, a cache update event is triggered (can be implemented through AWS Lambda or application layer code).
    3. Immediately update the corresponding key in Redis/Memcached, overwriting the old data.
  • Consistency Guarantee:

    • Avoid "dirty reading" problems caused by inconsistent cache and database data.

    • Ensure the order of cache update operations and Timestream write operations at the transaction level (such as through distributed locks or queue guarantees).

[!WARNING]

  • Write-latency risk: the synchronous update mechanism lengthens total write time; balance consistency against performance within the business's tolerance.
  • Failure handling: if a cache update fails (e.g., due to a network interruption), design a compensation mechanism (e.g., retries or an asynchronous queue).
  • Scope: only suitable for scenarios where writes are strongly coupled to reads (e.g., dashboards that must display the latest data in real time).
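A minimal sketch of the write-update flow with the compensation path, using plain dicts to stand in for Timestream and Redis and a queue to collect failed cache updates for asynchronous replay:

```python
import queue

def write_through(db, cache, key, value, retry_queue):
    """Write-through update: persist to the store first, then
    overwrite the cache; on cache failure, queue a compensation
    task instead of failing the whole write.

    db and cache are plain dicts standing in for Timestream and
    Redis; retry_queue collects failed cache updates for replay.
    """
    db[key] = value                      # 1. write to the database first
    try:
        cache[key] = value               # 2. then overwrite the cache entry
    except Exception:
        retry_queue.put((key, value))    # 3. compensate asynchronously
```

The queued entries would be replayed by a background worker; with the real stack, step 2 could equally be an AWS Lambda triggered by the write.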

Tiered expiration strategy

Background: access patterns for time-series data usually fluctuate. A tiered expiration strategy manages the cache life cycle dynamically, balancing performance against resource usage.

Tiered expiration strategy design

  • High-frequency-change data (TTL: 5-30 minutes):

    • Applicable scenarios: data that is updated frequently and needed in near-real time (such as live sensor readings).
    • Implementation: set a short time-to-live (TTL) for this data so cached values do not stay stale for long.
    • Advantages: keeps stale data from being served while limiting misses caused by expiration.
  • Low frequency change data (TTL: 24 hours):

    • Applicable scenarios: The data update cycle is long but the access frequency is high (such as user configuration, static metadata).

    • Implementation: Set a longer TTL to extend the survival time of data in the cache.

    • Advantages: Improve hit rate and reduce Timestream's query pressure.
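The two tiers can be reduced to a small TTL chooser. The 10-updates-per-hour boundary is an illustrative threshold, not one given above:

```python
def choose_ttl(updates_per_hour):
    """Pick a cache TTL from the change rate of the underlying data.

    Assumption: anything changing more than 10 times per hour counts
    as high-frequency data (an illustrative threshold).
    """
    if updates_per_hour > 10:
        return 5 * 60        # high-frequency tier: 5 minutes (5-30 min band)
    return 24 * 3600         # low-frequency tier: 24 hours
```

The returned value would be passed straight to Redis `SETEX`/`EXPIRE` when the key is written.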

Active preheating mechanism

Definition: before business peaks (such as promotions or the daily morning rush), load hotspot data into the cache in advance, guided by analysis of historical access patterns.

Implementation steps:

  • Data analysis: Analyze historical hotspot data (such as time ranges of high-frequency queries, device IDs, user IDs, etc.) through AWS CloudTrail or application logs.
  • Trigger: Start warm-up before peaks using AWS Lambda or scheduled tasks such as cron jobs.
  • Data loading: Execute batch queries (such as SELECT statements) to write data to the cache and set the appropriate TTL.
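The three steps can be sketched as a simple warm-up loop; `fetch` stands in for the batch Timestream query, `cache` for a Redis SETEX-style store, and `ttl_by_key` supplies the per-key TTL:

```python
def preheat(hot_keys, fetch, cache, ttl_by_key):
    """Load hotspot data into the cache ahead of a traffic peak.

    hot_keys comes from historical access analysis; fetch(key)
    stands in for the batch Timestream query; cache maps
    key -> (value, ttl) like a Redis SETEX-style store.
    """
    loaded = 0
    for key in hot_keys:
        value = fetch(key)
        if value is not None:       # skip keys with no backing data
            cache[key] = (value, ttl_by_key(key))
            loaded += 1
    return loaded
```

In practice this loop would run from a scheduled Lambda or cron job shortly before the expected peak.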

Things to note in actual optimization

Cache hit rate monitoring

  • Monitoring Metrics: Monitor the following metrics in AWS CloudWatch:

    • Cache Hit Ratio: Measures whether the cache effectively reduces the number of Timestream queries.

    • Cache request delay: Ensure that the performance of read cache meets business requirements.

    • Cache size and memory usage: Avoid memory overflow or swap (Swap) due to excessive cache.

  • Alarm threshold:

    • An alarm is triggered when the hit rate is below 85%, and possible reasons include:

      • Improper TTL settings: hotspot data expires prematurely.

      • Cache coverage gaps: the cache does not cover the high-frequency query interfaces.

      • Overly frequent data updates: cached entries are invalidated constantly.
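The hit-ratio alarm can be computed directly from the `keyspace_hits` and `keyspace_misses` counters that Redis reports via `INFO`; a minimal sketch:

```python
def hit_ratio(keyspace_hits, keyspace_misses):
    """Cache hit ratio from Redis INFO counters."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

def should_alarm(keyspace_hits, keyspace_misses, threshold=0.85):
    """True when the hit ratio drops below the alarm threshold (85%)."""
    return hit_ratio(keyspace_hits, keyspace_misses) < threshold
```

In the AWS setup, the same check would be expressed as a CloudWatch metric-math alarm rather than application code.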

The coordination of Redis cache and Timestream hot data time adjustment scheme

  • Timestream's storage hierarchy:

    • Timestream divides data into hot storage and cold storage by default; hot data is kept in memory, where query performance is higher.

    • The hot-data retention period is configurable (the default is 1 day). Extending it reduces cold-data access frequency but increases hot-storage cost.

  • Optimized combination:

    • Redis cache: further reduce Timestream query pressure by caching high-frequency query results.

    • Extended hot-data retention: keep low-frequency data that must stay available long-term in the hot storage tier, avoiding the high latency of cold-data queries.

  • Implementation requirements:

    • Stagger the changes. For example:

      1. First adjust the hot data retention time of Timestream and observe changes in indicators (such as query delay, cost).

      2. Redis cache optimization is implemented to evaluate its performance improvement effect separately.

    • Purpose: keep the two strategies' effects from interfering with each other, so each optimization's independent contribution can be measured.

Best Practices

  1. Cache key design:

    • Use meaningful key names (such as device:123:status) to facilitate the management of cache life cycles according to business dimensions.

    • For complex queries, use query caching (hash the SQL text as the key) or query-pattern caching (cache the query pattern rather than specific data).

  2. Distributed Cache:

    • Data consistency: Memcached's distributed cache is based on hash sharding, and it is necessary to ensure that write operations are synchronized to all nodes (if cluster mode is used).

    • Cache avalanche/breakdown: mitigate with randomized expiration times (e.g., add a random 0-5 minutes to the base TTL) and mutexes (e.g., Redis's Redlock algorithm).

  3. Cost Optimization:

    • Combined with Timestream’s pay-on-demand model, reducing query volume through caching can significantly reduce Timestream costs.

    • Redis persistence (such as RDB, AOF) needs to be configured with caution to avoid additional I/O overhead.
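A few of these practices in code form: business-dimension keys, query caching via a hash of the normalized SQL text, and jittered TTLs against cache avalanche. All names here are illustrative:

```python
import hashlib
import random

def entity_key(entity, entity_id, field):
    """Business-dimension cache key, e.g. device:123:status."""
    return f"{entity}:{entity_id}:{field}"

def query_cache_key(sql):
    """Query-caching key: hash the normalized SQL text so identical
    queries (up to whitespace and case) share one cache entry."""
    normalized = " ".join(sql.split()).lower()
    return "q:" + hashlib.sha256(normalized.encode()).hexdigest()[:16]

def jittered_ttl(base_ttl, max_jitter=300):
    """Base TTL plus 0-5 minutes of random jitter, so keys written
    together do not all expire together (cache avalanche)."""
    return base_ttl + random.randint(0, max_jitter)
```

The namespaced key format also makes it easy to expire or scan a whole business dimension (e.g. all keys under `device:123:*`).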