Hello, I'm Crooked.
Had a rant the other day:
It was meant to be a mere troll, but so many people were more interested in the details of it.
We are all engaged in technology, for "stepping on bugs" such a pleasant thing, it is normal to be interested.
In fact, this bug of mine can't really be called a bug in the strictest sense of the word, because it has nothing to do with the program, and it doesn't even have much to do with technology. As you can guess from the title, it's related to a business parameter.
But in the process, because I was a personal witness to the whole thing, I still have something to think about when I look back on it now.
I think this is a "typical event" for programmers.
Then use this article to review it together.
contexts
To contextualize this issue, you don't even need a specific business scenario, just the following very common interest calculation formula around which the problem started:
Interest = Interest accruing amount * Daily rate.
Daily rate = Annual rate/360
Since the calculation of the daily interest rate involves division, when the corresponding requirement was first developed, the business requirement was to keep 7 decimal places for the daily interest rate.
In the program, the annual interest rate and the daily interest rate is saved in two fields, the daily interest rate in the initialization of the time even if the good fall library, the subsequent procedures directly take this calculated daily interest rate on the line.
The system is online, and it's safe and sound.
After running for a while, the business came up with another requirement: the current precision is not enough and needs to be adjusted to 11 as a decimal.
You do not have to be curious about what business scenarios on the side of the crooked master, anyway, I went to look at the business data, the demand is reasonable, then the demand to take over the work on the line.
Saving 7-digit and 11-digit decimals, we are all developers, and surely we know that this is a small change and can be done very quickly.
It was also true that although I didn't write the code corresponding to the previous requirements, I had read the code and knew exactly where the changes were, so it was developed quickly.
As mentioned earlier, this requirement was previously run online for a while at 7 decimal places, so some stock configurations exist.
In response to this stock data, when I was in the requirements review meeting, I mentioned: what to do with the stock configuration?
Business Reply: When the requirement goes live this time, you should recalculate it according to 11 decimal places and then write SQL to update it.
I made a calculation in my mind: the formula was clear, and I had the APR, so I could do a handful of calculations without much problem.
Just say yes.
Then, unsurprisingly, there was an accident.
Assuming an annual interest rate of 2.5%, after dividing by 360 and retaining 11 decimal places, it should be 0.00006944444.
And I don't know why my hand was shaking, but I wrote 0.00069444444 in the SQL.
I'll give you a comparison:
0.00006944444
0.00069444444
The equivalent of the daily rate I wrote out was expanded tenfold.
Then go back and look at the formula:
Interest = Interest-bearing amount * Daily rate
The daily interest rate is expanded by a factor of ten, then the corresponding accrual amount is also expanded by a factor of ten.
That is the context of the question.
A simple human error has nothing to do with the program, so it is not strictly a program bug.
But the question is really low enough.
Why wasn't it discovered?
So how did this incorrect SQL make it past the code review and test validation hurdles to production?
First of all, this time the code was submitted with no review session at all.
I have code submission access and code review access. So I submitted it myself and reviewed it myself.
Actually, this requirement should have been done by another partner in the group, but he was transferred to another group at that time.
When he was still in our group, our collaborative model was that he submitted code and I reviewed it.
If there is this link, I think I have a 50% chance of finding the problem.
Why is it 50%?
Because it depends on whether or not I'm working on something else at the time I'm reviewing the code, if I'm working on something else, I might take a formalistic look at it a couple times. If there's nothing else going on and the amount of code in this commit is small, I'll basically go over the commit carefully.
After passing the code review, the next step should be the testing session.
The test focuses on whether the final interest calculated meets expectations after the precision is changed from 7 to 11 bits.
He tested it by going through the entire process of the entire business.
This 11-digit precision daily interest rate is automatically calculated by the program when you configure the APR on the page in "Full Process", so there is no mistake.
And when he was verifying the SQL statement, the test environment didn't have the configuration of the production environment, so when he took the SQL I provided, he could only make sure that the syntax of the writing was okay and it could be executed properly, but he couldn't make sure that the data in it was correct.
And I remember very clearly, I said to him at that time: just execute the SQL without reporting errors, I'll make sure the values are correct.
And dramatically, the test coworker looked at the value very carefully, and he went and counted it and it was indeed 11 decimals. But unfortunately, from his point of view, he couldn't realize that the value had been expanded by a factor of ten.
So, the testing session didn't find this problem either:
0.00006944444
0.00069444444
Just bring on the production.
A problem should not normally be taken to production, but we really can't guarantee that the testing session will be able to detect all the problems, so production validation of new projects and iterations is also very necessary.
This one we did too.
It is reasonable to assume that the data on the production is already wrong, and it is an obvious error of "10 times the amount of interest", which should have been detected if the data validation had been done voluntarily.
So why did you do a production validation and not find a problem?
Since there were three stock configurations at the time, I provided 3 SQLs, one of which was calculated correctly.
Each stock allocation corresponds to a large amount of interest data, and the one that is calculated correctly corresponds to even more data, in the proportion of more than 60%.
When I did the production validation, I randomly selected two out of a large amount of interest accrual data, and both of the selected ones happened to be the correct SQL counterparts.
So I found that it met expectations and concluded that the production verification passed.
Standing at this juncture and reviewing the entire event, this should be the time when problems are most likely to be identified.
But not found.
The fundamental reason is that the validation scheme is not rigorous, and the metaphysical reason is that luck is not on my side.
How was it exposed?
Think about it, can you monitor this kind of business parameter misconfiguration by any monitoring rules?
It's actually very difficult.
We generally do technical level monitoring to monitor whether the program is working properly as expected. For example, if there is an anomaly in the calculation process, then we can monitor it.
However, in the case where the value involved in the calculation is just incorrect, but a value can be calculated normally, no error is reported.
This kind of problem is difficult to monitor by technical means. If you insist on doing the monitoring, you can definitely do it, such as from the dimension of anomalous float, the dimension of horizontal data comparison, but the supporting development costs go up again.
How did I find out about this?
It was also pure luck.
It was a Friday night when I was doing another data validation in a scenario that had nothing to do with this problem, and I stumbled upon a data amount that was significantly larger than the previous days.
This is not business as usual.
Then further tracking, and finally localized to the previous problem SQL. by this time, three days have passed since the SQL was uploaded, and a batch of error data has already been generated.
If I hadn't stumbled across this problematic data, at what point would the problem have been exposed?
It's when the business uses this data to do reconciliation.
At that point, the nature of the whole problem changes. It is not just a question of whether it is too late to deal with it, but whether the problem is of a completely different nature, whether it is discovered by "development autonomy" or by "external feedback".
Generally speaking, no matter what the issue is, putting aside the severity for a moment, anything that the devs discover on their own can somewhat make things less embarrassing.
That is why we have emphasized the importance of "monitorability".
I then contacted the business and gave feedback on the situation. He indicated that it would be fine to fix the data before he uses the batch next time. In about a month's time, he will use this data.
This gives me close to a month to deal with the problem and prevent it from growing.
There was plenty of time, and standing on that, I lucked out pretty good.
The problem has been exposed, followed by the development of a fix for this erroneous data.
Repair program is related to business scenarios, belonging to a number of business scenarios superimposed on each other, so the repair program is in fact more complex, involving the "repair number" and "complementary number", there is no need to expand the description.
Just wanted to briefly mention that this fix was something I came up with over the weekend, and there were a lot of details I needed to take into account, even writing over the pseudo-code in my mind.
It's true that it's a waste of a weekend, but it's paying for your own mistakes and half blaming others for deserving it.
And for colleagues involved in subsequent program discussions, it is the time spent on this matter that falls into the category of unmitigated disaster.
That's how the whole thing went down, a decimal point that started a bloodbath.
look back
Now that the whole picture is in front of you, what lessons have you learned?
Because my hand shook and I wrote one decimal wrong, which is really the direct cause, so is the idea to be a little more careful next time when dealing with this kind of data?
I don't think so.
The lesson I learned is my title: developers, don't touch the damn business parameters!
If at the very beginning of the requirements review meeting, when we discussed the stock data.
The business says: when the requirement goes live this time, you recalculate it according to 11 decimal places, and then write SQL to update it.
I said: No, this belongs to the business parameters, I can not go to move. After the go-live is completed, it will have this function, you can go through the page configuration to modify.
I know their process for modifying business parameters, it's long and complicated.
First the business needs to initiate an OA process for a parameter change, which then goes to his department head for approval.
After the approval by the head of the business department is completed, it will go to the person who is specifically responsible for the configuration of the business parameters, and will also need to be reviewed by the head of the department corresponding to that person.
Only after the completion of the audit have the authority to modify this business parameters, and the modification of this parameter, there are two or even three levels of audit on the corresponding system functions.
After the entire process is complete, the person initiating the OA will need to confirm the changes to see if the page is configured the way they want it to be.
With this set of processes going on, do you think anything could go wrong?
It's hard to go wrong.
You can criticize the process for being too bloated, but you always end up recognizing that the process actually protects the hitters.
I know he process is more complex and it costs me almost nothing to write a SQL, but that's assuming the SQL is correct.
If he hadn't agreed to help him deal with the stock data via SQL, he would have actually had a more formalized process to deal with the data, and it wouldn't have gone wrong.
Afterwards, when we review, some colleagues also privately raised this question to me: why not go through the OA process to adjust this parameter?
Also, regarding process, let me give you an example from the programmer's side.
A core developer has operational access to the online database, and let's assume for a moment that this person is absolutely loyal, absolutely trustworthy, absolutely scrupulous, and absolutely not going to delete the database and run away.
One day, he received an alert message, and after troubleshooting, he realized that he needed to go and change the status of some data inside the database, which he did straight away.
This operation is very common, especially in small companies or in some companies that are in a rapid growth phase.
Later the company grew up, began to pay more attention to operational risk, recovered the database rights of all personnel, the past is not to be blamed, in the future you want to modify the database data, you have to initiate an approval process, after layers of approval before execution.
This process is a million times heavier than the "go straight to revision" action.
From a programmer's point of view, in previous years it was possible to directly manipulate production data, and suddenly this system came out and greatly affected the previous development inertia. So when the implementation just started, you might have cursed: xxx.
But in the long run, the process actually protects you.
When you have database privileges, operate right and no one will praise you. Get it wrong and you're the culprit.
Having an approval process reduces the cost of errors while compounding the cost of operations.
The length of time to deal with the problem may increase, the agility to deal with the problem may be reduced, but from the company's point of view, with the development of the company "stability" is the eternal theme, in the face of stability, agility can be sacrificed.
Crooked Master used to have this kind of access in the era of the first company's business savagery, a time when he had just joined the workforce for more than two years and felt that this was the way things should be, that this was the right thing to do, that he could be agile enough to deal with problems quickly enough.
Then the permissions were recalled and I cursed a few times in private at the time.
Then come back, with experience and have seen more and more things in the workplace, only gradually realized: the barbaric era is indeed out of the hero, but I did not grasp the opportunity to become a hero. After the barbaric era of process standardization, rules and regulations are in fact to protect the group of people who did not become heroes, including me.
Finally, a word of advice for you, and for myself: developers, you better know the business meaning behind every business parameter in your database, but don't touch the damn business parameters. It's not your turn to touch them, the people who should touch them will do so in the right process.
Keep that string taut in your mind no matter what.