Don't be reckless: let's talk about the common life-saving tactics in software architecture

Hello everyone, I'm vzn. Good to see you again.

Something interesting came out not too long ago:

An uploader on a certain video platform released a crash review of the Xiaomi SU7, claiming that after the collision the car's small battery failed, so the doors could not be opened and the emergency call system stopped working. The video caused quite a stir. Just as onlookers were settling in to see how Xiaomi would respond, Xiaomi published an internal investigation report that boiled down to one point: we also hid a backup power supply in the car, and the whole process was reported to the national monitoring platform, so you are smearing us! The reversal felt like a scene from a rebirth novel: it was as if Xiaomi's designers had anticipated exactly this and tucked away a small battery in advance, waiting for the moment of a last-ditch counterattack.

Setting aside the fun of watching the drama and digging into the substance, it is not hard to see that the design thinking behind the Xiaomi SU7's backup power supply resonates widely in the software field. Software development and design face a world full of variables: not only the day-to-day operations of regular users, but also challenges such as network failures, hardware crashes, and even malicious attacks. According to Murphy's law, anything whose probability is greater than 0% is quite likely to happen (put plainly, whatever you fear will come). Handling abnormal scenarios well is an advanced required course for a mature programmer, and it is also the cornerstone of a software system running smoothly after it goes live.

This brings us to a deep philosophy of system architecture design: reverence for, and reflection on, unknown risks. An excellent system architecture should be able to foresee and cope with the various challenges that may arise in the future. It should objectively tolerate and accept the existence of local errors and strive to contain them within a certain range. And when the system suffers an irreversible disaster, it should protect the system as far as possible so that the core business remains available.

In this post, we'll talk about fault tolerance and disaster response capabilities in software development.

1. Fault-tolerant design with a second chance

The environment our software faces once it goes live is extremely complex, and it is almost impossible to claim that it will be error-free:

What if upstream passes in abnormal parameters?
What if an external interface hangs?
What if a dependency fails?
What if a request fails because of network jitter?
What if the hard disk breaks?
…

As I said earlier, anomalies are objective facts that are bound to exist. If we insist on pursuing absolute zero deviation, we are actually making things difficult for ourselves. Therefore, in order to improve the system's ability to cope with abnormal situations, it is a good idea to consider adding some fault-tolerant design. Allow a limited range of anomalies, try to accommodate these anomalies, and try to ensure that the final goal is achieved in line with expectations.

Fault tolerance can be implemented in a variety of ways, typically through a retry mechanism combined with a compensation strategy.

1.1 Retrying: The Return of the Prodigal Son

Retrying is a common way to minimize the impact of anomalies on business requests and to make sure requests are handled as expected wherever possible. So do all failure scenarios need a retry? Obviously not. For example, if a login fails because the wrong password was entered, the authentication will fail no matter how many times it is retried. In general, a retry strategy is worth considering only when the failure is caused by transient, accidental factors. For example:

An external network request fails because of network jitter or a processing timeout; a limited number of retries can improve the success rate of the external interaction.
In a distributed system, when a service node becomes abnormal, the gateway redistributes the request to another node for reprocessing, improving the fault tolerance of the whole cluster.
In a queue-processing scenario that competes for a distributed lock, if acquiring the lock fails, wait for a while and then try to acquire it again.

By trigger timing, retry strategies can be categorized into immediate retry and delayed retry:

Trigger timing	Applicable scenarios
Immediate retry	Suitable for failures caused by accidental factors. For example, if a request fails because of network jitter on the link, you can retry immediately.
Delayed retry	Suitable for failures triggered by resource constraints. For example, if an external request fails because heavy traffic triggered rate limiting on the downstream interface, an immediate retry will most likely fail too. In this scenario, wait for a period of time before retrying to improve the success rate.

According to the implementation logic of the retry operation, retries can also be categorized into same-path retry and differentiated retry:

Retry category	Example scenarios
Same-path retry	(1) When calling the HTTP interface of a third-party system, reissue the request on a response timeout, an unreachable network, or a similar anomaly. (2) When acquiring a distributed lock fails, request it again.
Differentiated retry	(1) In a distributed system, after a request fails on one node, the gateway dispatches it to another node for a retry. (2) When fetching data from Redis fails, try to fetch it from MySQL.

There are two other basic principles that cannot be ignored when putting a retry mechanism into practice:

Limit retries: ensure that even in extreme cases the system does not fall into an infinite retry loop.
Keep the number of retries reasonable: too many retries waste system resources, so don't retry for the sake of retrying.
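
To make the two principles and the immediate/delayed distinction concrete, here is a minimal Python sketch. The names `TransientError` and `call_with_retry` are hypothetical, and doubling the wait after each failure is just one illustrative way to delay; the article does not prescribe a specific policy.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, jitter)."""


def call_with_retry(operation, max_retries=3, base_delay=0.0, sleep=time.sleep):
    """Run operation, retrying transient failures a bounded number of times.

    base_delay=0 gives an immediate retry; a positive value gives a delayed
    retry whose wait doubles after each failed attempt (an assumption of
    this sketch, not a rule from the article).
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise  # retry budget exhausted: surface the failure
            if base_delay > 0:
                sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Only `TransientError` is caught, so a failure that retrying cannot fix (such as a wrong password) propagates on the first attempt.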

In addition, on a long processing link where many hops retry, you also need to consider the risk of triggering a request storm. For example, in the scenario below, assume that the maximum number of retries is limited to N:

So retrying is not free: it has side effects, especially on complex links. To avoid the request storms that cascading retries can cause, some auxiliary measures are needed.
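
To put rough numbers on the amplification, here is a small worked calculation. It assumes (my assumption, not a detail from the article) that every layer on the link makes one initial attempt plus up to N retries and that every attempt fails all the way down; the function name is hypothetical.

```python
def worst_case_requests(layers: int, max_retries: int) -> int:
    """Worst-case number of calls reaching the last hop of the link.

    Assumes each layer makes 1 initial attempt plus up to max_retries
    retries, and that every attempt fails.
    """
    attempts_per_layer = 1 + max_retries
    return attempts_per_layer ** layers
```

Under these assumptions, three layers with N = 3 can turn a single user request into 4 × 4 × 4 = 64 calls at the end of the link.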

Breaking the retry chain

A request storm forms because an anomaly at the very end of the link is passed to every upstream hop without restriction, and each upstream hop then retries repeatedly, amplifying the number of requests exponentially. In reality, if only the request between the last node and the DB has a problem, only that operation needs to be retried; the upstream nodes do not need to retry. To achieve this, plan for it at the request-interaction level: use return values, return codes, and the like to tell the upstream node whether it needs to retry, and confine retries to the place where the failure occurred rather than setting off a chain reaction along the whole link.

Combining with circuit breaking

With a circuit-breaking mechanism, the failure rate of requests on the link is monitored, and the circuit is broken as soon as it reaches a certain threshold. Afterwards, a probing mechanism lets a small amount of trial traffic through, and if the success rate reaches the configured threshold, normal processing on the link resumes.
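
One way to sketch this idea in Python is to have each layer return a result that carries a retry hint. The `Result` type, the `retryable` flag, and both layer functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    ok: bool
    retryable: bool = False  # hint for the caller: is retrying worthwhile?
    value: object = None


def db_layer(query, max_retries=2):
    """Lowest layer: retries its own DB call, then reports a final failure."""
    for _ in range(1 + max_retries):
        try:
            return Result(ok=True, value=query())
        except ConnectionError:
            continue
    # Retries were already exhausted here, so upstream should not retry again.
    return Result(ok=False, retryable=False)


def upstream_layer(query, max_retries=2):
    """Upstream retries only when the downstream result says it is worthwhile."""
    result = db_layer(query)
    attempts = 0
    while not result.ok and result.retryable and attempts < max_retries:
        result = db_layer(query)
        attempts += 1
    return result
```

With the hint in place, a dead database is hit three times in total instead of nine, because the retries stay at the layer where the failure occurred.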

1.2 Compensation: it's not too late to make amends

The main purpose of the retries described above is to raise the success rate of an operation as far as possible. However, some abnormal scenarios cannot be solved by retrying on the spot. For example, in a large distributed microservice system, a request is processed across multiple services, often asynchronously. If a problem occurs that retrying cannot solve, an additional compensation mechanism is needed to guarantee that the final results are consistent.

Compensation mechanisms are common in distributed systems. One of their core premises is to allow and accept temporary data problems along the way, and to guarantee eventual data consistency through compensating measures. So how do you know whether compensation is needed, and for which data? This is where "reconciliation" comes in.

The so-called "reconciliation" is a periodic inventory of business processing data over a period of time to identify data that does not meet expectations at the data level. Based on the abnormal records found in the reconciliation, the corresponding compensatory correction process is performed.

An example:

An e-commerce platform system with an order system is designed in such a way that buyer's orders and seller's orders are stored in separate repositories. After an order is created and payment is completed, the order information flows to the downstream consumer services to be processed by each and written to the buyer's order repository and seller's order repository respectively.

In a microservice scenario, even with distributed transactions and other safeguards in place, some extreme situations can still leave an order that was not successfully written to both the buyer's order database and the seller's order database, which may affect users. In this case, consider a scheduled task that regularly scans the order data from a recent period, checks the two sides for differences, and then corrects the abnormal data. As shown below:

In this way, the after-the-fact combination of reconciliation and compensation ensures that the system reaches its goal of eventual consistency.
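
The reconciliation and compensation steps for the order example might look roughly like the sketch below. Plain dictionaries stand in for the source order data and the two order repositories, and the function names are hypothetical; a real job would scan a time window in actual databases.

```python
def reconcile(source_orders, buyer_store, seller_store):
    """Return the order IDs missing from each store, compared with the source."""
    missing_buyer = [oid for oid in source_orders if oid not in buyer_store]
    missing_seller = [oid for oid in source_orders if oid not in seller_store]
    return missing_buyer, missing_seller


def compensate(source_orders, buyer_store, seller_store):
    """Re-write missing records and return how many were repaired.

    The operation is idempotent, so a scheduled task can rerun it safely.
    """
    missing_buyer, missing_seller = reconcile(source_orders, buyer_store, seller_store)
    for oid in missing_buyer:
        buyer_store[oid] = source_orders[oid]
    for oid in missing_seller:
        seller_store[oid] = source_orders[oid]
    return len(missing_buyer) + len(missing_seller)
```

Running `compensate` a second time repairs nothing, which is the property that makes it safe to schedule periodically.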

2. Seeing the big picture: sacrificing the small to save the big

Some business scenarios involve multiple parallel dependencies, with the ultimate goal of blending the results of those dependencies together. In such cases, a problem with one dependency has limited or no impact on the end-user experience. Clearly, failing the whole request because one part failed is not the best solution; sacrificing a pawn to save the rook makes more sense.

For example, in a news app, the content feed on the home page is aggregated from multiple data sources:

Breaking News

Top Current Affairs Articles

Posts from followed accounts

Content you may be interested in

Paid promotional content

xxx

Ultimately, data from the multiple sources is blended into a single content feed for the user. If one source (e.g. breaking news) fails to return data, the user will hardly notice, because they cannot tell whether something is wrong with the system or there is simply no breaking news. But if the failure of one source leads to an error being shown to the user, or to a blank screen, the impact on the user is amplified instead.

In real projects, when a fault can no longer be avoided or solved by retrying, it is also routine to keep the problem from spreading by making a certain degree of "compromise" and "sacrifice", minimizing the loss and keeping the impact of the failure from growing. There are many ways to do this; the mainstream ones are degradation, rate limiting, circuit breaking, pass-through, and isolation.
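
A minimal sketch of this "skip the failed source" behaviour follows. The function name and the shape of `sources` (a mapping from source name to a fetch function) are assumptions made for illustration.

```python
def build_feed(sources):
    """Merge items from every source, skipping any source that fails."""
    feed, failed = [], []
    for name, fetch in sources.items():
        try:
            feed.extend(fetch())
        except Exception:
            failed.append(name)  # record for monitoring, but keep going
    return feed, failed
```

The list of failed sources is returned so that the failure can still be logged or alerted on, even though the user never sees it.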

2.1 Degradation

Degradation, as a fallback strategy, is usually a compromise made at the business level in failure scenarios. Generally, it is a response to local malfunctions or resource load problems. When unexpected circumstances leave the system without enough resources to support the full range of business functions, some non-core functions are deliberately switched off so that the limited resources can be concentrated on keeping the core functions available.

There are many scenarios for the use of degradation, for example:

During big e-commerce promotions such as 618 or Double 11 each year, to protect the flash-sale ordering flow, non-core functions such as order reviews and order history queries are degraded and switched off first, so that all resources go to product browsing, ordering, payment, and similar operations.

On an interactive social platform, when a sudden piece of explosive celebrity news breaks, the update frequency of some non-core functions (promotion, followed-account feeds) is reduced, and more resources are devoted to browsing and interaction on the trending topic.

In instant messaging (IM) scenarios, if a network failure limits the bandwidth of the server room, video and voice services are degraded to unavailable, and every effort goes into keeping text messaging available.

Degradation is, by its very nature, a process of trade-offs: discarding the parts you care about less and preserving the parts you care about most. What to give up and what to preserve has to be judged according to the characteristics of your own business. In general, there are several dimensions:

Degradation dimension	Example scenarios
Reduced user experience	The interface is not refreshed in real time; animations, high-definition pictures, and system push notifications are not shown.
Dropped features	Viewing order history, exporting data, and uploading files are not allowed.
Security concessions	Skip complex secondary verification, skip risk-control checks, do not record operation logs.
Reduced accuracy	List data and statistical reports are not updated in real time.
Reduced consistency	The comment count shown in the list does not match the count shown on the article page; deleted articles still appear in the list.
Reduced data volume	The order center shows only the latest 100 records; only the last year of data can be queried.

Degradation has prerequisites. At the business-planning level, the SLA of each business function must be planned so that core and non-core functions are clearly delineated. At the architecture level, core and non-core functions must be decoupled and isolated from each other.

2.2 Rate limiting

During holidays such as the Spring Festival, May Day, or National Day, popular scenic spots usually cap the number of visitors allowed in, to protect both the visitor experience and personal safety. By the same token, a software system is constrained by its implementation, its business planning, the capacity of its hardware, and many other factors, so the load it can withstand also has an upper limit. If request traffic surges well beyond the range the system was planned for, it may cause downtime and other incidents. To keep the system safe and prevent unexpected traffic from disrupting normal operation, the traffic entering the system must be limited and controlled.

Rate limiting can generally be implemented along two dimensions:

Limiting dimension	Example scenarios
Limit concurrency	For example, limit the number of connections in a connection pool or the number of threads in a thread pool.
Limit QPS	Limit the number of incoming requests per second.

Rate limiting cannot be implemented without a rate-limiting algorithm; the mainstream ones are the leaky bucket algorithm and the token bucket algorithm.

Leaky bucket

The principle of the leaky bucket algorithm is very simple. It does not restrict the inflow of requests, but requests are taken out of the bucket for processing at a relatively fixed rate. If the outflow rate is lower than the inflow rate, requests back up in the bucket and wait their turn. Once the backlog fills the bucket to capacity, it overflows: requests that cannot get into the bucket are discarded.

As its name suggests, the leaky bucket works like the funnel we use in everyday life. It is one more example showing that many implementations and processing strategies in software architecture design come from the most ordinary parts of daily life.
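
The leaky bucket described above can be sketched in a few lines of Python. The class and method names are hypothetical, and time is modelled as explicit `tick()` calls rather than a real timer so that the behaviour is easy to follow.

```python
from collections import deque


class LeakyBucket:
    """Queue incoming requests, drain them at a fixed rate, drop on overflow."""

    def __init__(self, capacity, leak_per_tick):
        self.capacity = capacity
        self.leak_per_tick = leak_per_tick
        self.queue = deque()

    def offer(self, request):
        """Try to enqueue a request; return False if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def tick(self):
        """Called once per time unit: release up to leak_per_tick requests in order."""
        n = min(self.leak_per_tick, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]
```

Whatever the arrival pattern, at most `leak_per_tick` requests leave the bucket per time unit, which is what smooths the outflow.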

Token bucket

The logic of the token bucket differs slightly from that of the leaky bucket. A token issuing module generates tokens at a uniform rate and puts them into the token bucket. Each request tries to obtain a token before it is processed, and it is processed only if a token is obtained.

One point worth noting: although tokens are generated at a uniform rate and placed into the bucket, this does not guarantee that requests are always processed at a uniform rate. In extreme cases, the instantaneous request volume can break through the rate limit. For example, if the request volume stays below the token generation rate most of the time, the bucket fills up with tokens, and a sudden wave of heavy traffic can then consume all the stored tokens in one go. The rate-limiting thresholds therefore need to be set sensibly according to the load the system is designed to carry. This design also has an advantage: occasional brief bursts are absorbed as far as possible, while the overall long-term processing rate stays under control.

There is also a rudimentary counter-based "pseudo rate-limiting" scheme. The idea is very simple: maintain a counter for each counting cycle, add 1 to it for every incoming request, reject subsequent requests once the counter reaches the threshold, and start counting again in the next cycle. In essence, this can only control the total volume of traffic per cycle, not the rate at which it flows in, so in extreme cases a peak of requests may well crush the system. If you use it, set the counting cycle as short as possible, and try to avoid this scheme in core, important systems.

In addition, for clustered multi-node deployments, rate-limiting planning needs to distinguish between a per-node traffic limit and a cluster-wide traffic limit, and choose the implementation that suits your business.
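
Here is a minimal token bucket sketch that also shows the burst behaviour just described. Instead of a separate issuing module, it refills lazily from the elapsed time on each call, a common equivalent shortcut that I am swapping in here. The class name, the injectable `clock`, and starting with a full bucket are all assumptions of this sketch.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.clock = clock
        self.tokens = capacity    # start full (an assumption of this sketch)
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill lazily from elapsed time instead of running a producer thread.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=2` and `capacity=4`, a full bucket lets four requests through at once even though the long-term rate is two per second, so `capacity` is the knob that bounds the burst.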
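
The counter scheme and its weakness can be seen in a short sketch. The class name and the injectable `clock` are hypothetical, and the cycle is modelled as a fixed window computed from the clock value.

```python
import time


class FixedWindowCounter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_id = None
        self.count = 0

    def allow(self):
        """Count the request against the current cycle; reject once the limit is hit."""
        current = int(self.clock() // self.window)
        if current != self.window_id:
            self.window_id = current  # new cycle: reset the counter
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True
```

With a limit of 3 per second, three requests at t = 0.9 s and three more at t = 1.1 s are all accepted, so six requests pass within 0.2 seconds. That boundary burst is why the scheme limits volume per cycle rather than the actual rate.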

2.3 Circuit breaking

The most intuitive real-world example of circuit breaking is the fuse in your home's electrical panel. When the current is overloaded, the fuse blows, protecting the household wiring from burning out and the appliances from damage.

By the same token, software has a design idea similar to the electrical fuse. A circuit breaker is added where the service makes external calls, and when a predefined condition is met, the corresponding dependency is removed from the service's own request link. This keeps the node from spending a lot of resources waiting for responses that will most likely be errors.

Circuit breaking is a form of self-protection, with the goal of preventing anomalies in external nodes from dragging the service itself down. There are two general strategies for circuit breaking:

Circuit breaking by request failure rate

If the failure rate of requests to a target service exceeds a certain threshold within a short period, the circuit is broken and the target service is no longer invoked. A periodic heartbeat probe or a small amount of trial traffic then decides whether to keep the circuit broken or to resume requests.

Circuit breaking by request response time

In high-concurrency scenarios, if the latency of calls to the target service is too high, it will inevitably drag down the throughput of the whole system. In this case, to protect the processing performance of its own nodes, the service can also decide whether to trigger circuit breaking according to the request response time.

In addition, in a clustered deployment, gateway nodes often provide circuit breaking as a basic feature, with finer-grained control than breaking a whole service: when a node in the service cluster fails, that node is removed from the cluster directly and added back after it recovers.

For concrete applications, you can directly use mature open source solutions such as Hystrix or Sentinel. One point to emphasize is that circuit breaking generally targets non-core, non-essential dependencies; in essence, circuit breaking is also one way of implementing degradation.
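
A failure-rate breaker with a trial probe can be sketched as follows. This is a simplified, single-threaded illustration and not how Hystrix or Sentinel implement it; the class name, the sliding window of recent outcomes, and the `cooldown` parameter are assumptions.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the target."""


class CircuitBreaker:
    def __init__(self, window=10, failure_rate=0.5, cooldown=5.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # recent outcomes: True = failure
        self.failure_rate = failure_rate
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("target skipped while circuit is open")
            # Cooldown elapsed: let this one call through as a probe.
            try:
                value = func()
            except Exception:
                self.opened_at = self.clock()  # probe failed: stay open
                raise
            self.opened_at = None  # probe succeeded: close and start fresh
            self.results.clear()
            return value
        try:
            value = func()
        except Exception:
            self._record(True)
            raise
        self._record(False)
        return value

    def _record(self, failed):
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.failure_rate:
            self.opened_at = self.clock()
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a dependency that is likely to fail; after the cooldown, a single successful probe closes the circuit again.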

2.4 Isolation

Isolation is a means of fault control. The idea is to separate resources so that they do not interfere with one another; when the system fails, the fault is confined to a limited range, avoiding a snowball effect that affects the whole system. Common isolation measures include data isolation, machine isolation, thread pool isolation, and semaphore isolation.

Data isolation

The most intuitive manifestation of this is splitting databases and tables. For example, the system's data can be stored in separate databases by business dimension, or classified by business importance as key or non-key data, or as confidential or non-confidential data, with differentiated storage and security strategies applied to each class. For non-key data, a simple master-slave pair of copies may be enough, while for key data you may need to consider reliable multi-copy storage and backup across sites.

Machine isolation

Different businesses use different machines, which is isolation at the level of hardware resources. By grouping machines, dedicated machines can be reserved for key or high-risk services, while general services share a common set of machines, achieving differentiated isolation and handling.

Thread pool isolation

The idea of isolation applies not only at macro levels such as data or machine nodes, but also inside a single process. The same process handles many different pieces of logic, and if one of them creates threads without limit and takes up all the CPU resources, the rest of the logic in the process suffers.

To deal with this, an isolation design based on thread pools can assign a dedicated thread pool to each major business method, restrict that method to the thread resources of its assigned pool, and prevent it from occupying too many of the system's threads and CPU resources. That way, even if one business exhausts its own thread pool, the other thread pools keep working and the other businesses keep running normally.

Because maintaining thread pools also consumes extra resources, the granularity of isolation should be kept in moderation.
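
As a rough sketch of thread pool isolation, each business function below gets its own bounded `ThreadPoolExecutor` from Python's standard library. The pool names and sizes, and the `submit` helper, are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per business function (names and sizes are illustrative).
POOLS = {
    "order": ThreadPoolExecutor(max_workers=4, thread_name_prefix="order"),
    "report": ThreadPoolExecutor(max_workers=2, thread_name_prefix="report"),
}


def submit(business, func, *args):
    """Run func on the pool reserved for the given business function."""
    return POOLS[business].submit(func, *args)
```

If every worker in the "report" pool is stuck on slow tasks, further report tasks simply queue up behind them, while tasks submitted to the "order" pool still run on their own threads.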

3. Hardware Disaster Recovery: The Power of Money

The smooth operation of software is inseparable from solid hardware support. Clever fault-tolerant design, flexible degradation strategies, precise rate limiting, and similar software-level measures can significantly improve self-recovery and availability, but against a hard challenge like hardware failure, software alone is powerless. Therefore, when planning the overall architecture of a reliable software service, reliability-oriented hardware deployment planning is a topic that cannot be avoided.

Compared with the various fault-tolerance strategies at the software level, the response at the hardware level is simple and brute-force: pile on resources! In other words, use redundant deployment of resources to improve the fault tolerance of the system. Of course, this inevitably raises costs, so the concrete plan has to be weighed against the budget, maximizing reliability within what the cost allows.

There are two common redundancy practices at the hardware layer: one is the multi-active mechanism, which guarantees high availability for business applications; the other is the multi-replica storage mechanism, which guarantees reliable data storage.

3.1 Multi-active

As more and more of daily life moves online, the white-hot competition of the Internet era poses a serious challenge: the business must be continuously available 7×24 hours. But for a software service, no matter how perfect the architecture or how elegant the code, the program ultimately has to run on hardware, and risks at the hardware level are beyond the reach of code. So how do you deal with the risk of hardware damage or unavailability? Simple: spend money to ward off disaster! Spend more money, buy more hardware, and deploy several more sets of the service. But how those multiple sets are deployed also takes some thought.

A number of different ways of stacking hardware have also been derived to cope with different layers of risk:

Clustering

To cope with hardware damage on a single server, such as a failed hard disk or a burnt-out power supply, multiple nodes on different machines are deployed in one server room and together form a cluster. If one node fails, the remaining nodes can still handle the business normally, which effectively avoids a single point of failure and improves the reliability of the business.

Same-city dual-active

A cluster of nodes in one server room can cope with the failure of a single machine, but if the whole server room fails, for example through a power outage, a fire, or a fiber optic cable being dug up, everything is still lost. The natural answer to this risk is to build a second server room so that the two back each other up, which greatly reduces the risk. Data generally has to be synchronized between the two server rooms, which places very high demands on network speed and latency between them, so the two rooms cannot be too far apart and are preferably in the same city. This gives rise to what is often called the same-city dual-active architecture.

Two locations, three centers

The reliability of the same-city dual-active model already meets the requirements of most common business scenarios. However, if the business system is extremely important, especially in finance, social platforms, basic service providers, and other areas that bear on the national economy and people's livelihood, the requirements for system reliability and data security are even more demanding. In a same-city dual-active architecture, the two server rooms are kept fairly close together to control network latency, so a force majeure natural disaster (such as an earthquake) that destroys both rooms would still damage the business or its data. How do you cope? The answer is already obvious: build more server rooms across different cities! Hence architectures such as two locations with three centers, three locations with five centers, and so on.

See? The reliability of the system depends, in part, on the thickness of the stack of bills.

3.2 Multiple replicas

Redundant backup is also known as multi-replica (multi-copy) storage. In essence, to prevent data loss caused by a single point of failure, the same data is stored as multiple copies spread across multiple locations. This costs extra resources, but the data reliability and high availability it brings are beyond anything a single sole copy can offer.

The multi-replica strategy is widely used in all kinds of data storage components. For example:

Local cache multi-replica
Redis multi-replica
MySQL multi-replica
Kafka multi-replica

The most common and simplest multi-replica strategy is the Master-Slave architecture, such as MySQL's one-master, multiple-slave setup. In this architecture, the Master node is usually responsible for writes, and the built-in data synchronization mechanism propagates the changes to each Slave node so that the data is stored in multiple replicas. To improve hardware utilization, Slave nodes not only hold reliable replicas of the data but can also serve read-only queries, supporting read-write separation where the business needs it.

One fatal problem with the Master-Slave multi-replica strategy is that every node stores the full data set, so the total data volume is bottlenecked by the storage capacity of a single machine. For large data volumes, a more complex multi-replica scheme is needed: shard the overall data and keep multiple replicas of each shard, which allows capacity to scale horizontally. Redis Cluster and Kafka both use this strategy.

Sharded Storage Format 1: Scattered Storage on Different Machines

Sharded Storage Format 2: Multi-Cluster Hosted Sharded Mode

Sharding data and spreading it across multiple physical storage nodes breaks the single-machine capacity limit, but it also makes reads, writes, and data synchronization more complex. Because the data is scattered across nodes, every read or write request has to be routed to the node that holds the relevant shard; a consistent hashing algorithm is a common way to do this. In addition, keeping the copies of each shard synchronized and consistent requires more complex logic. Kafka, for example, has a specially designed ISR (in-sync replicas) mechanism to handle synchronization between copies.
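To make the routing idea concrete, here is a minimal sketch of a consistent hash ring with virtual nodes. It is a generic illustration, not the routing code of any specific product (Redis Cluster, for instance, maps keys to fixed hash slots instead). The class name, the node names, and the choice of MD5 as the hash function are assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Map keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node, to even out load
        self._ring = []        # sorted list of (hash value, node name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        # Find the first virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))  # always the same node for the same key
```

The property that matters for sharded storage is that removing a node only remaps the keys that lived on that node; keys on the other nodes stay where they were.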

4. Human intervention to ensure control of the system

When we implement features for a business scenario, we plan in advance how each case should be handled, and at the code level we also build in automatic handling for the abnormal cases we can foresee. However, some situations break through every reasonable plan we made for the system: the system may hit an unforeseen state it cannot recover from, or the impact of automatic recovery or rollback may be too large. In such cases human intervention may be needed. It is therefore well worth building some manual intervention mechanisms when the system is planned and implemented.

Manual intervention has many practical uses. It gives operation and maintenance personnel a high degree of control over the system so they can respond to unexpected situations, and it can also be reserved for operations staff as a high-privilege back-office handling capability.

4.1 Human intervention emergency response capacity

Let's look at an example first:

Background:
A business needs to fetch data from a remote data source and run its business logic on that data. The business itself is a core, critical service, whereas the remote data source is an external dependency whose data accuracy and service availability are outside our control.

Realization:

While the business is running, it periodically pulls data from the remote data source, and data from the remote source is used in preference.

To cope with the uncontrollable risk of the remote data source, each periodic update also writes a copy of the fetched data to the local disk as a backup, and the local disk keeps the last N backups.

If pulling data from the remote service fails, the service automatically reads the most recent local backup file so that its own business can keep running. If that file fails to process or its data is abnormal, the service loads the previous backup file, and so on. If processing still fails after every local backup file has been tried, the system is unavailable and gives up.
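The three steps above could be sketched roughly as follows. This is a simplified illustration, not the original system's code: the function names, the `backup-*.dat` file naming, and the callables `fetch_remote` and `parse` are all hypothetical, and here a remote response that fails to parse is also treated as a failed pull.

```python
from pathlib import Path


def _save_backup(backup_dir, raw, keep):
    """Write a new numbered backup and keep only the newest `keep` files."""
    existing = sorted(backup_dir.glob("backup-*.dat"))
    next_seq = int(existing[-1].stem.split("-")[1]) + 1 if existing else 1
    (backup_dir / f"backup-{next_seq:08d}.dat").write_text(raw, encoding="utf-8")
    for old in sorted(backup_dir.glob("backup-*.dat"))[:-keep]:
        old.unlink()


def load_with_fallback(fetch_remote, parse, backup_dir, keep=3):
    """Prefer the remote source; otherwise try local backups, newest first."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    try:
        raw = fetch_remote()
        data = parse(raw)
    except Exception:
        pass  # remote pull (or parsing of its result) failed: fall through to backups
    else:
        _save_backup(backup_dir, raw, keep)
        return data
    for path in sorted(backup_dir.glob("backup-*.dat"), reverse=True):
        try:
            return parse(path.read_text(encoding="utf-8"))
        except Exception:
            continue  # this backup is unusable, try the previous one
    raise RuntimeError("remote source and all local backups failed")
```

Zero-padded sequence numbers in the file names mean a plain lexical sort yields the backups in age order, which keeps the "newest first" loop simple.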

Given the background above, this strategy is quite thoughtful: it uses local backups as a fallback when a remote request fails, and it also accounts for data loading errors by automatically retrying with older backups until a usable one is found. But consider this scenario: the remote service interface is working and the response format is correct, but the remote data source's developers released a new version last night, and the content of the data it now delivers is seriously wrong, which damages the downstream business that uses it. In this case, all the planned self-recovery and self-protection measures are useless.

So what if, at the planning stage, we add a manually commanded intervention channel on top of the safeguards above? In an emergency, an operator can instruct the system to disconnect the real-time update logic and force-load the Xth local backup file, so that the service quickly escapes the fault in the remote data source. Once the fault has been repaired, another instruction resumes real-time updates from the remote data source. The modified design is shown in the accompanying schematic.
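In code, the essence of such a channel is a switch that an operator controls and that the data-loading path consults before doing anything else. The sketch below is a minimal illustration under assumptions: the names `DataSourceSwitch` and `current_data` are hypothetical, and how the instruction actually reaches the service (an admin interface, a configuration center, and so on) is left out.

```python
class DataSourceSwitch:
    """Holds the operator's instruction: follow the remote source, or pin a backup."""

    def __init__(self):
        self.pinned_backup = None  # None means "use real-time remote updates"

    def pin(self, backup_index):
        # Operator command: stop following the remote source, use this backup.
        self.pinned_backup = backup_index

    def resume_remote(self):
        # Operator command: the remote fault is fixed, follow it again.
        self.pinned_backup = None


def current_data(switch, fetch_remote, backups):
    """`backups` is a list of snapshots, newest first (index 0 = most recent)."""
    if switch.pinned_backup is not None:
        return backups[switch.pinned_backup]
    return fetch_remote()


# Hypothetical data, for illustration only.
switch = DataSourceSwitch()
backups = ["snapshot-from-yesterday", "snapshot-from-two-days-ago"]
switch.pin(1)  # the remote source is serving bad data: pin a known-good backup
print(current_data(switch, lambda: "bad-data-from-remote", backups))
switch.resume_remote()
```

The manual override is checked first on purpose: it has to win even when the remote source looks perfectly healthy to every automatic check.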

4.2 Manual handling authority for unexpected scenarios

Human intervention capability can also take the form of a "privileged function" in the management-side system. It gives back-office staff higher operating authority to resolve business-level problems that seem unreasonable but are very likely to occur, such as certain difficult customer complaints.

A simple example:

A securities company develops a stock trading app with a paid investment-advisor feature: users unlock the corresponding advanced functions after paying. The business rule is that no refund is allowed after purchase, and both the interface and the user agreement state this clearly.

After purchasing, User A insists on cancelling and getting a refund, keeps pestering customer service, and threatens to complain to the securities regulator and other authorities if the request is not handled.

Ideally, users buy according to the product rules, the duty to inform has been fulfilled, and cancellations are not supported. In reality, faced with an unreasonable customer, the customer service department will often quietly agree to refund the order through a back-office operation, to protect the company's image and settle the dispute quickly. If no back-office manual refund capability was planned and built when the system was designed, handling such a case becomes very awkward. As the saying goes: you may never need it, but you cannot be without it.

5. Monitoring and early warning to prevent problems before they occur

The fault-tolerant design and disaster recovery solutions discussed earlier are all about keeping the business available after a failure has already occurred. A more robust goal is to detect and defuse problems at the first sign of trouble. This requires the system to include the necessary instrumentation (tracking points) and metrics collection, and to send timely warnings to the responsible maintenance personnel so that they can step in early.

It is not true that "a problem you cannot see is not a problem, and a failure you cannot see is not a failure". The person in charge of a system should know its overall running state and health. Status monitoring, metrics monitoring, and similar means turn the running state of the online system from a black box into a white box.

5.1 Monitoring alarms

In general, monitoring platforms are built independently of the business and provide two mechanisms, Push and Pull, for obtaining metrics data. In terms of content, monitoring can cover resource utilization, system status, business operational data, and other dimensions.

Monitoring alarms are an important way for development and operations personnel to learn about abnormal states in the online system, but the alarm channel must not be abused. Alert delivery should ideally support group aggregation and message suppression, so that a bombardment of useless alarms does not numb the recipients and drown out the truly important calls for help. When building a monitoring and alerting platform, keep it as independent of the business as possible: decoupling alarm logic from business logic reduces the intrusion of monitoring into business code.
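As a rough illustration of aggregation and suppression, the sketch below groups alerts by a key and sends at most one notification per group within a time window, folding the swallowed alerts into the next message. The class name `AlertSuppressor`, the five-minute default window, and the message format are assumptions made for the example, not the behavior of any specific monitoring product.

```python
import time


class AlertSuppressor:
    """Group alerts by key and send at most one notification per window."""

    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock       # injectable so the logic can be tested without waiting
        self._last_sent = {}     # group key -> time of the last notification
        self._suppressed = {}    # group key -> alerts swallowed since then

    def report(self, group_key, message):
        """Return the text to send, or None if this alert is suppressed."""
        now = self.clock()
        last = self._last_sent.get(group_key)
        if last is not None and now - last < self.window:
            self._suppressed[group_key] = self._suppressed.get(group_key, 0) + 1
            return None
        swallowed = self._suppressed.pop(group_key, 0)
        self._last_sent[group_key] = now
        if swallowed:
            return f"{message} (+{swallowed} similar alerts suppressed)"
        return message
```

Each group has its own window, so a flood of one kind of alert never blocks the first occurrence of a different, possibly more serious one.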

For more on how to design and plan a monitoring and alerting platform, you can read one of my previous posts: "What are the architectural requirements for a common monitoring and alerting platform?"

5.2 Real-time Dashboard

Since the goal is prevention rather than cure, the first requirement is a clear picture of the system's overall health. This is where system health monitoring shows its value. It works like a medical check-up report for the system: based on it, potential pressure points, risky links, and suspicious trends can be identified, and early intervention can nip failures in the bud.

5.3 Disaster preparedness exercises

As mentioned earlier, a system can be equipped with a whole series of sophisticated anomaly response and disaster recovery measures, but how do we make sure they will work as intended when an anomaly actually occurs? That has to be verified through disaster recovery drills. Like military exercises in peacetime, disaster recovery drills are a regular activity in many large projects. By simulating possible disaster and failure scenarios, a drill verifies that the system's fault-tolerance and protection measures are effective, and exposes problems in the emergency plan so they can be fixed in time.

In some large systems, the business process involves upstream, downstream, and peripheral dependencies, and executing a disaster recovery plan requires coordinated action across them. Another purpose of regular drills is therefore to build the tacit coordination between development and operations personnel when carrying out emergency plans.

6. Awareness-raising to maintain sensitivity to risks

As mentioned earlier, there are many mature, practical solutions for implementing anomaly response and disaster recovery, but these are all specific "techniques": concrete responses to risks or requirements that we already know about. As IT practitioners, we not only turn business requirements into reality; we are also constantly locked in a contest with all kinds of anomalies. Sensitivity to risk is a quality that should be engraved in the bones of a good programmer. It shows not only in coding and architectural design but in every aspect of the work, until it becomes an instinctive conditioned reflex.

Maintaining a keen awareness of risk lets you spot potential risks, and only then do the various risk prevention techniques have a place to be applied.

Take a simple example that has nothing to do with implementation technique.

The online system has a temporary problem that must be fixed urgently by manually replacing the package and restarting the process.

The headstrong warrior goes all in

What a simple job: stop the process, delete the old package, upload the new package, start the process again. Problem solved.

Most of the time this really does go through without trouble. But to anyone with some experience, seeing this done in a production environment is rather frightening. For example:

If the uploaded package is faulty and startup fails, the old package has already been deleted, so the service cannot be started and the online service is simply down.

Humbled after a small loss

Given the risks above, the improved practice is naturally not to delete the old package but to rename it as a backup. If the new package has a problem, the old package can be used directly to roll back and restore the online service.

This improved procedure is much more reliable and leaves plenty of room for rollback. On closer examination, though, it can still be improved:

Uploading the new package depends on network transmission, which is sensitive to network fluctuations, so the upload itself may fail.

If the package is very large, or the upload has to pass through layers of VPNs, bastion hosts, and so on, the transfer may be slow and take a long time. The online process may then be down for too long.

Prudence after surviving a beating.

To optimize further, the whole operation can be split into two phases, a preparation phase and an online operation phase, and the time-consuming, risky steps are moved into the preparation phase and completed in advance.

This way, only highly deterministic actions remain in the actual online operation, so it finishes quickly and with less uncertainty. Moving the risk up front lowers the probability of problems during the whole operation.
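The two-phase procedure, together with the "rename instead of delete" rollback from the previous step, might look like the following sketch. The function names and the `start_service` callback are hypothetical; the sketch also assumes the staging file sits on the same filesystem as the live package, so that the switch is just a pair of quick renames, and that the caller has already stopped the old process before switching.

```python
import os
import shutil


def stage_package(source, staging_path):
    """Preparation phase: the slow, failure-prone copy happens before any downtime."""
    shutil.copyfile(source, staging_path)


def switch_package(staging_path, live_path, backup_path, start_service):
    """Online phase: quick renames, with rollback if the new package fails to start.

    `start_service` is a caller-supplied function that takes the package path
    and returns True when the service started successfully.
    """
    had_old = os.path.exists(live_path)
    if had_old:
        os.replace(live_path, backup_path)  # keep the old package instead of deleting it
    os.replace(staging_path, live_path)
    if start_service(live_path):
        return True
    # Startup failed: put the old package back and bring the old version up again.
    if had_old:
        os.replace(backup_path, live_path)
        start_service(live_path)
    return False
```

If the upload in `stage_package` fails or drags on, nothing has been touched in production yet, which is exactly the point of moving the risk up front.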

At this point some readers may object that their company's bandwidth is high and file transfers are fast, so there is no need for all this trouble: just go all in. This is really a question of shared awareness, and of which risk-handling strategy to choose. The same old saying applies: a risk you can anticipate is not the real risk; the real risk is often the one that seems impossible. In the end there is only one criterion for every decision: can you afford the consequences if the operation fails? If you can, go all in; otherwise, think twice.

7. Revisiting the Heart: The Story of Bian Que and His Two Brothers

Finally, a story.

Legend has it that when Bian Que traveled to the state of Wei, King Wen of Wei received him and asked: "All three of you brothers practice medicine. Which of you is the most skilled?" Bian Que replied: "My eldest brother is the most skilled, my second brother comes next, and I am the least skilled." King Wen was puzzled: "Then why does the world honor you as a miracle healer, while no one has heard of your two elder brothers?" Bian Que explained:

My eldest brother's medical skill is the best because he can tell that you are going to fall ill before you do. At that point the patients do not feel ill at all, and my eldest brother cures them before they notice anything. That is exactly why his skill has never been recognized and he has no fame.
My second brother is the second best healer in the family because he can spot and cure an illness in its early stages. As a result, patients think he is only good at treating minor ailments.
By the time patients come to me, they are already in the middle or late stages of illness and their condition is very serious. Curing such seriously ill patients made me all the more famous. But fundamentally, my medical skill is not as good as that of my two elder brothers.

Transplanted into today's increasingly competitive ("involuted") IT industry, people like Bian Que's eldest and second brothers are probably the highly skilled ones: they quietly guard their code and never give anomalies a chance to break out. And the result? Online services that are always stable make people gradually forget the developers behind them, who end up as marginalized, invisible figures. Those who really get the chance to stand out and win the leaders' favor are often the team's firefighters. They are constantly on the front line, solving problems that pop up one after another like gourds bobbing in water, and over time they become the pillars their leaders trust, with opportunities and resources tilting in their favor.

Matters outside of technology are thought-provoking but also seem to have no solution. As the saying goes: "The sage does not treat those who are already ill but those who are not yet ill; he does not govern what is already in disorder but what is not yet in disorder." As for ourselves, the choice is in our own hands, and following your heart matters most. Still, I believe time will prove everything, and all the persistence and technical pursuit will eventually be seen (a bit of chicken soup for the soul, I know). Either way, having some technical pursuit is always the right answer.

8. Summary

Well, that concludes the discussion of exception handling and disaster recovery in software development and design. The fault tolerance and disaster response capabilities discussed here are like the insurance you buy in life: in ordinary times they seem unremarkable, or even a waste of money, but at the critical moment they become a solid defense against risk, and their value is immeasurable.

Just as the Xiaomi SU7's backup battery gives users an extra layer of protection in an emergency, in software development and design, whether to build a similar backup plan or disaster recovery system depends on how much potential loss the business can tolerate. If the catastrophic consequences are more than the business can bear, it is well worth paying some extra cost to build a complete set of fallback, exception protection, and monitoring and early-warning mechanisms.

As the old saying goes: prepare for a rainy day.

I'm VZN. I talk about technology, not just technology.

If you found this useful, please follow me. You can also follow my WeChat official account [is vzn ah] for more timely updates.

I look forward to talking with you and growing into a better version of myself together.