
System design: elimination of slow interfaces


1. Introduction

Some interfaces respond significantly more slowly than expected, or even time out. These interfaces drag down the overall throughput and availability of the system, and of course they also degrade the user experience.

Targeted optimization should be done for high-traffic core interfaces, for example:

  1. Asynchronous processing, or adding concurrent processing, to avoid synchronous blocking (see the sketch after this list)

  2. If the database is accessed frequently, add caching

  3. Batch access, to avoid the network overhead of calling the database in a for loop

  4. Avoid returning too much data at once
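
To illustrate point 1, here is a minimal sketch of concurrent processing with CompletableFuture; ProfilePageService, queryProfile and queryOrders are hypothetical names used only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Issue independent downstream calls concurrently instead of sequentially.
public class ProfilePageService {

    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public String buildPage(long userId) {
        CompletableFuture<String> profile =
                CompletableFuture.supplyAsync(() -> queryProfile(userId), pool);
        CompletableFuture<String> orders =
                CompletableFuture.supplyAsync(() -> queryOrders(userId), pool);

        // Both calls run in parallel; total latency is max(...) rather than sum(...).
        return profile.join() + orders.join();
    }

    // Hypothetical downstream calls, stubbed for illustration.
    private String queryProfile(long userId) { return "profile-" + userId; }

    private String queryOrders(long userId)  { return "orders-" + userId; }
}
```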

As for error-prone interfaces, there is little to discuss: eliminating them is a hard rule. If an interface has an error rate above 0.1%, or frequently prints error logs, it is a problem at the program level.

2. General approach

Observe interface throughput, response time, and error rate. These metrics can be collected through various middleware and instrumentation points.
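
As one possible way to collect these metrics in code, here is a minimal sketch using Micrometer; the in-memory registry and the metric names are illustrative assumptions, not a prescribed configuration.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Record latency and error counts for one interface.
public class InterfaceMetrics {

    private final MeterRegistry registry = new SimpleMeterRegistry();

    public String handleOrderQuery() {
        Timer timer = registry.timer("api.latency", "endpoint", "/orders");
        try {
            return timer.recordCallable(this::doQuery); // records elapsed time
        } catch (Exception e) {
            registry.counter("api.errors", "endpoint", "/orders").increment();
            throw new RuntimeException(e);
        }
    }

    private String doQuery() {
        return "ok"; // placeholder for the real handler
    }
}
```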

3. Slow interface optimization

3.1 Downstream issues/network jitter

The slowness results from slow downstream interfaces or network jitter rather than from this application itself.

3.2 Coding issues

  • Are unused fields being populated, i.e., fields the process does not need that add extra overhead to the query?
  • Is there call amplification (one request fanning out into many downstream calls)?
  • Can it be optimized into batch queries and batch fills? (see the sketch after this list)
  • Are there slow SQL statements?
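
For the batch-query item above, here is a minimal sketch of replacing a per-ID loop with one batch query; OrderDao, Order and findByIds are hypothetical types introduced only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical data-access interface: one round trip for all IDs,
// e.g. SELECT ... WHERE id IN (...), instead of one query per ID in a for loop.
interface OrderDao {
    List<Order> findByIds(List<Long> ids);
}

record Order(long id, String status) {}

public class OrderAssembler {

    private final OrderDao orderDao;

    public OrderAssembler(OrderDao orderDao) {
        this.orderDao = orderDao;
    }

    public Map<Long, Order> loadOrders(List<Long> ids) {
        // Single database round trip, then fill from the in-memory map.
        return orderDao.findByIds(ids).stream()
                .collect(Collectors.toMap(Order::id, Function.identity()));
    }
}
```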

3.3 Common approaches to asynchronous processing

First, look at the factors to consider when choosing an approach:

  • Ease of use of the programming interface
  • Execution environment: single JVM or cluster?
  • Performance and stability, and whether tasks are persisted: processing progress may be lost if the machine suddenly fails or restarts; supporting recovery or redo requires idempotent processing.
  • How to obtain the result of asynchronous execution.

Here are some common asynchronous processing options:

| Type | Access | Execution environment | Persistence | Performance and stability | Notes |
| --- | --- | --- | --- | --- | --- |
| Java threading model | Native support | Single JVM | Self-managed | Stand-alone | |
| Java concurrency utilities (JUC) | Native support | Single JVM | Self-managed | Stand-alone | |
| Spring thread pool | Simple API, usable via annotations | Single JVM | Self-managed | Stand-alone | Using the annotation without specifying a thread pool may lead to thread pools being mixed up |
| EventBus (Guava) | Simple API, event model | Single JVM | Interrupted tasks cannot be recovered; cluster scheduling is not supported | Stand-alone | Watch out for cross-module dependencies on event classes |
| Redis queue | Requires Redis (an external dependency); event delivery and consumption code must be written | Single JVM | Interrupted tasks cannot be recovered; cluster scheduling is not supported | High performance | Possible single point of failure; requires a high-availability design |
| mircotask | Feature-rich (monitoring, etc.), with some access cost | Cluster | To be supplemented | To be supplemented | |
| Asynchronous event model built on MQ | Event model; requires MQ middleware (an external dependency); an event-processing framework must be written | Cluster-deployable, supports cross-application processing | May need to be persisted | To be supplemented | |
| Asynchronous event model built on timed-task middleware | Event model; requires timed-task middleware (an external dependency); an event-processing framework must be written | Cluster-deployable, supports cross-application processing | Requires persistence | Controlled execution speed | |
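
To illustrate the note in the Spring thread pool row, here is a minimal sketch of @Async with an explicitly named thread pool, so that tasks do not fall back to a shared default executor; the pool name and sizes are illustrative assumptions.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.stereotype.Service;

import java.util.concurrent.CompletableFuture;

@Configuration
@EnableAsync
class AsyncConfig {

    // Hypothetical pool name used only for illustration.
    @Bean("reportExecutor")
    public ThreadPoolTaskExecutor reportExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(8);
        executor.setQueueCapacity(200);
        executor.setThreadNamePrefix("report-");
        executor.initialize();
        return executor;
    }
}

@Service
class ReportService {

    // Reference the pool explicitly instead of relying on the default executor.
    @Async("reportExecutor")
    public CompletableFuture<String> buildReport(long orderId) {
        // ... heavy work runs off the request thread ...
        return CompletableFuture.completedFuture("report-" + orderId);
    }
}
```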

3.4 Caching

"Everything in sight is cached. "

Explanation: for anything, when an observer observes it, the signal takes time to propagate, so what the observer receives is always a signal the object emitted in the past. While the signal is propagating, the observed object may already have changed. The signal the observer receives can be viewed as a snapshot of the object's past, and every judgment the observer makes based on that signal can be thought of as working from a cached snapshot.

3.4.1 Cache Selection

  • Local (near-side) cache
  • Remote cache
| Cache type | Examples / description | Application scenarios | Access cost | Limitations |
| --- | --- | --- | --- | --- |
| JVM cache | HashMap, BloomFilter, WeakReference, SoftReference | Broad | Easy to implement | Standalone; needs warm-up; limited by JVM memory |
| Distributed cache | Redis, Memcache | Broad | Introduces external dependencies | Reliability and degradation strategies need to be considered |
| Browser cache | Uses client resources to save server resources | Limited | - | Only covers part of the experience; not controllable by back-end developers |
| CDN cache | - | Limited | Additional cloud infrastructure costs | Significantly improves access speed for large objects |
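
As a concrete illustration of the distributed-cache row, here is a minimal cache-aside sketch using Jedis; the Redis address, TTL, and the loadUserFromDb stub are assumptions for illustration, not a production-ready setup.

```java
import redis.clients.jedis.Jedis;

// Cache-aside: try the cache, fall back to the database, then populate the cache with a TTL.
public class UserCache {

    private static final int TTL_SECONDS = 60;

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getUser(String userId) {
        String key = "user:" + userId;
        String cached = jedis.get(key);            // 1. try the cache first
        if (cached != null) {
            return cached;
        }
        String fromDb = loadUserFromDb(userId);    // 2. cache miss: hit the database
        if (fromDb != null) {
            jedis.setex(key, TTL_SECONDS, fromDb); // 3. populate the cache with a TTL
        }
        return fromDb;
    }

    // Hypothetical database lookup, stubbed for illustration.
    private String loadUserFromDb(String userId) {
        return "{\"id\":\"" + userId + "\"}";
    }
}
```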

3.4.2 Caching FAQs

(by chatgpt)

  • Consistency problem: data in the cache and the database can diverge, especially when data is updated frequently under high concurrency. An appropriate cache update strategy and invalidation mechanism are needed to keep the data consistent.
  • Penetration problem: a large number of requests go straight to the database, so the cache never plays its role. Cache penetration usually happens when the requested data does not exist in the cache at all; mechanisms such as Bloom filters can prevent it (see the sketch after this list).
  • Avalanche problem: a large amount of cached data expires at the same time, so requests hit the database directly and put it under excessive pressure. Distributed locking, rate limiting, and similar mechanisms help prevent cache avalanches.
  • Memory problem: if cached data is never accessed again and never evicted, memory usage keeps growing. A suitable eviction strategy is required.
  • Capacity problem: set an appropriate cache capacity according to the actual business requirements and hardware resources.
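
For the penetration problem, here is a minimal sketch of a Bloom filter guard using Guava; the expected number of IDs and the false-positive rate are illustrative assumptions.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

// Reject IDs that cannot exist before they ever reach the cache or the database.
public class PenetrationGuard {

    // Expect up to 1,000,000 known IDs with a ~1% false-positive rate.
    private final BloomFilter<String> knownIds = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // Called when data is written, e.g. while warming up from the database.
    public void register(String id) {
        knownIds.put(id);
    }

    public boolean mightExist(String id) {
        // false means the ID is definitely absent, so the request can be
        // rejected without touching the cache or the database.
        return knownIds.mightContain(id);
    }
}
```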

3.4.3 Scenarios where caching is not appropriate

(generated by chatgpt)

  • High real-time requirements for data: if data must be up to date at all times, caching cannot be used. Examples include online payment systems and stock trading systems.
  • High cache update cost: some data is updated very frequently and each update is expensive; in that case caching can actually reduce performance, because keeping the cache up to date is costly and the hit rate ends up low. Examples include live video systems and game ranking systems.
  • High business complexity: some businesses involve interactions between multiple systems, and layering cache consistency and update strategies on top further increases system complexity. Examples include distributed transaction systems and complex financial trading systems.
  • Low traffic: for applications with little traffic, caching does not significantly improve performance but adds extra system complexity and development cost. Examples include internal management systems and small portals.

During the design phase, be sure to think ahead about how you will use caching.

3.5 Avoid returning too much data at once

Adverse effects:

  • Higher JVM memory footprint when constructing the response
  • Longer response times, i.e., elevated RT
  • Higher network bandwidth usage
  • May exceed browser or server configuration limits (the HTTP protocol itself does not limit message size)
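
One common mitigation is pagination with a hard cap on page size. A minimal sketch, assuming hypothetical OrderService and OrderView types:

```java
import java.util.List;

// Hypothetical service and view types used only for illustration.
interface OrderService {
    List<OrderView> findOrders(int offset, int limit);
}

record OrderView(long id, String status) {}

public class OrderQueryController {

    private static final int MAX_PAGE_SIZE = 100;

    private final OrderService orderService;

    public OrderQueryController(OrderService orderService) {
        this.orderService = orderService;
    }

    // Cap the page size so a single response can never grow unbounded.
    public List<OrderView> listOrders(int page, int size) {
        int boundedSize = Math.min(Math.max(size, 1), MAX_PAGE_SIZE);
        int offset = Math.max(page, 0) * boundedSize;
        return orderService.findOrders(offset, boundedSize);
    }
}
```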

4. Elimination of erroneous interfaces

Error-rate statistics criterion: the interface returns a non-2XX HTTP status.

Therefore, you only need to handle exceptions inside the application and return HttpStatus=200 externally; the code and success fields in the Result class are set according to the actual situation, and such responses are not counted in the error rate.
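
A minimal sketch of this convention with Spring's @RestControllerAdvice, assuming a hypothetical Result wrapper and BusinessException; the field and method names are illustrative, not a fixed API.

```java
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Convert business failures into HTTP 200 + a business code in the Result body.
@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(BusinessException.class)
    public Result<Void> handleBusiness(BusinessException e) {
        // The HTTP status stays 200, so the failure is not counted in the
        // non-2XX error rate; it is expressed through the business code instead.
        return Result.fail(e.getCode(), e.getMessage());
    }
}

// Hypothetical wrapper and exception types for illustration only.
class Result<T> {
    public int code;          // business code, e.g. 0 for success
    public boolean success;
    public String message;
    public T data;

    static <T> Result<T> fail(int code, String message) {
        Result<T> r = new Result<>();
        r.code = code;
        r.success = false;
        r.message = message;
        return r;
    }
}

class BusinessException extends RuntimeException {
    private final int code;

    BusinessException(int code, String message) {
        super(message);
        this.code = code;
    }

    int getCode() {
        return code;
    }
}
```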

Special case: some SEO specifications require that an interface return 404 when the requested data cannot be found.