Hello, I'm Fish Skin, a programmer. On the afternoon of August 19, NetEase Cloud Music suffered a sudden, serious outage and shot to the top of Weibo's trending searches, competing for attention with Black Myth: Wukong.
According to user reports, the symptoms were: users couldn't log in, playlists failed to load, playback information failed, songs couldn't be searched, and so on. The app was almost unusable, a textbook P0-level incident!
According to the official statement, the main cause of the outage was an infrastructure failure, which prevented NetEase Cloud Music from working properly on every client.
What is infrastructure? It refers to the basic services and resources that support the operation of the entire system: servers, network equipment, databases, storage systems, content delivery networks (CDN), various cloud services, caches, DNS, load balancers, and so on. Its importance can be seen from the fact that the large-scale outages at Bilibili and Xiaohongshu were both caused by problems with a cloud service provider's network.
I'm not an insider, so I don't know the exact cause of the failure. There is plenty of speculation online: "a developer deleted the database and ran away", "a migration to a new server room went wrong", "cost-cutting layoffs turned 'reduce costs and improve efficiency' into 'reduce costs and increase laughs'", and so on, but these claims have all been officially denied.
According to reports online, the failure may be related to NetEase's self-developed Curve storage system. NetEase officially stated at the time that it had been running for more than 400 days without any data inconsistency or data loss, with data reliability at 100% and service availability as high as four nines (99.99%).
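Just to put "four nines" in perspective, here is a quick back-of-the-envelope calculation (plain Python, nothing specific to Curve or NetEase) of how much downtime each availability level allows per year:

```python
# How much downtime per year does a given availability percentage allow?
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget(availability_pct: float) -> float:
    """Allowed downtime in minutes per year for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability -> about {downtime_budget(pct):.1f} minutes of downtime per year")

# 99.99% availability -> about 52.6 minutes of downtime per year,
# so a single 2-hour outage already blows the yearly budget for four nines.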
Reasonably speaking, a system that has run stably for that long shouldn't fail on its own. Rumor has it that an engineer performed an O&M operation by following documentation left by a predecessor, which brought down the storage system. Normally, releasing changes to such critical infrastructure requires a very strict process and would not be handled by someone who only knows it from a predecessor's documentation, unless that "predecessor" is no longer around. According to online sources, the department has gone through layoffs, and there are even whispers that very few people remain.
We don't know the truth, but it sounds plausible, because big companies usually have gray-scale releases and disaster recovery drills in place precisely so that changes don't hit all users at once.
- Gray-scale release (canary release): an incremental way of rolling out changes to IT infrastructure, where the change is first applied to a small subset of machines or users to see how it behaves. If all is well, the rollout is gradually expanded (see the sketch after this list).
- Disaster recovery drill: testing and validating the infrastructure's emergency response and recovery capabilities, to ensure that when critical infrastructure fails or a disaster strikes, the system can be restored quickly and the impact of the business interruption is minimized.
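To make the gray-scale idea concrete, here is a minimal sketch (hypothetical names, not NetEase's actual system) that routes a fixed percentage of users to the new version by hashing their user ID, so each user gets a stable decision and the percentage can be ramped up step by step:

```python
import hashlib

def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether a user sees the new version.

    Hashing the user ID keeps the decision stable across requests,
    so a user doesn't bounce between old and new behaviour.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100       # map the user into one of 100 buckets
    return bucket < rollout_percent      # e.g. 5 -> roughly 5% of users

# Ramp up gradually: 1% -> 5% -> 25% -> 100%, watching error rates at each step.
rollout_percent = 5
for uid in ("user-1001", "user-1002", "user-1003"):
    version = "new" if in_canary(uid, rollout_percent) else "old"
    print(f"{uid} -> {version} version")
```

The point is that a bad change only hurts the small canary bucket, and the rollout can be paused or rolled back before it reaches everyone.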
Architects at big companies, especially people on infrastructure teams, surely know about these practices, so why weren't they applied? Maybe a shortage of manpower, maybe laziness, maybe the people left behind lack experience, maybe the documentation they inherited was incomplete. In short, system stability has a lot to do with "people".
It reminds me of the recent Microsoft global blue-screen incident: serious failures really do often come down to one or two programmers, or a few seemingly minor actions.
The whole recovery took about 2 hours, which is relatively slow. Switching to a standby plan, shielding part of the failure (degrading gracefully), or rolling back the release shouldn't take that long, so my guess is that something went wrong with the data. If data was corrupted or became inconsistent during the failure, recovery becomes much harder: to guarantee data integrity you may need to restore data, rebuild indexes, resynchronize data, and so on, and all of that can stretch out the recovery time.
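As an illustration of "shielding part of the failure", here is a minimal degradation sketch (assumed function names, not NetEase's code): if the storage-backed call fails, serve a stale cached result or an empty list instead of failing the whole page:

```python
import logging

logger = logging.getLogger("playlist")

def load_playlist_from_storage(user_id: str) -> list[str]:
    # Pretend this hits the real storage system; simulate the outage here.
    raise TimeoutError("storage system unavailable")

# A stale local cache that survives the storage outage.
FALLBACK_CACHE = {"user-1001": ["cached song A", "cached song B"]}

def load_playlist(user_id: str) -> list[str]:
    """Try the real storage first; on failure degrade to the cache or an empty list."""
    try:
        return load_playlist_from_storage(user_id)
    except Exception as exc:                    # broad catch is deliberate during degradation
        logger.warning("storage call failed, degrading: %s", exc)
        return FALLBACK_CACHE.get(user_id, [])  # a partial experience beats a blank error page

print(load_playlist("user-1001"))   # -> ['cached song A', 'cached song B']
print(load_playlist("user-9999"))   # -> []
```

Degradation like this keeps the app partially usable, but it can't help if the underlying data itself is damaged, which is why data repair tends to dominate recovery time.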
I haven't seen an official incident report yet, so all of this is just speculation for now.
After the recovery, NetEase Cloud Music quickly rolled out a compensation measure: users can claim 7 days of free VIP membership! Note that it can only be claimed on August 20th!
Open NetEase Cloud Music and you'll see the entrance to claim the membership in the search bar. Seven days is mostly symbolic, but as a level-10 NetEase Cloud Music member, I absolutely have to claim it!
This incident also shows that when a failure happens, it's not just developers and ops engineers who get slammed: product managers have to quickly work out a compensation plan to keep users satisfied; operations and customer service have to respond urgently to user questions and complaints and calm things down; PR has to respond quickly to public pressure, keep the situation under control, and stop the negative impact from spreading; and management has to coordinate all departments to make sure the problem is handled end to end.
We've built quite a few products ourselves and had our share of failures; even at our small scale, handling them left us sweating. It's hard to imagine how much pressure the team behind NetEase Cloud Music, a product used nationwide, was under yesterday. To wear the crown, you must bear its weight!
Friends, what do you think of this outage? Did you suspect your own network or device was the problem at the time?