
I was repeatedly tortured by a bug in the .NET8 JIT for half a year


It's been a long, long time since I wrote a blog post, but the blog garden has recently earned itself a ray of hope, so I have to do my part and post something. The story goes like this: earlier this year, part of the company's API servers was upgraded from .NET 6 to .NET 8, since .NET 6 was approaching end of support (.NET 6 LTS support ends on November 12, 2024). About three months after the upgrade, and without any changes to the code, a .NET 8 runtime JIT bug was triggered. As the saying goes, it was strange beyond strange. Below I'll walk through how the bug was found and what caused it.

What did I do?

As mentioned above, a portion of our API servers was upgraded from .NET 6 to .NET 8 earlier this year.

What's the problem?

Of the dozen or so API projects that were upgraded to .NET 8 (the projects are split up by service scope, traffic and so on, but common functionality is shipped via shared libs), only one of them would intermittently report a business error on one or a handful of its deployed instances (AWS ECS Tasks); the time of occurrence and the number of affected instances varied. The symptom was that the plaintext produced by AES decryption (this code is shared by all the projects) was always missing some characters. And once the bug occurred on an instance, every subsequent AES decryption on that instance would also lose characters.

Initial examination and suspicion?

After reviewing the code we found no problems, and the unit tests always passed, so doubts about the current implementation of the logic were initially ruled out. The simplified code is roughly as follows (some identifiers below are reconstructed):

        public int Decrypt(byte[] buffer, string key, string iv, out byte[] decryptedData)
        {
            int decryptedByteCount = 0;
            decryptedData = new byte[buffer.Length];

            // Cipher setup (reconstructed): CBC mode, 128-bit (16-byte) block, zero padding;
            // zero padding is implied by the trailing-zero trimming below.
            using var aes = Aes.Create();
            aes.Mode = CipherMode.CBC;
            aes.BlockSize = 128;
            aes.Padding = PaddingMode.Zeros;

            var instPwdArray = Encoding.UTF8.GetBytes(key);
            var instSaltArray = Encoding.UTF8.GetBytes(iv);

            using (var decryptor = aes.CreateDecryptor(instPwdArray, instSaltArray))
            {
                using (var memoryStream = new MemoryStream(buffer))
                {
                    using (var cryptoStream = new CryptoStream(memoryStream, decryptor, CryptoStreamMode.Read))
                    {
                        int read;
                        do
                        {
                            read = cryptoStream.Read(decryptedData, decryptedByteCount, decryptedData.Length - decryptedByteCount);
                            decryptedByteCount += read;
                        } while (read != 0);
                    }
                }
            }
            // Trim trailing zero-padding bytes from the reported plaintext length.
            while (decryptedData[decryptedByteCount - 1] == 0 && decryptedByteCount > 0)
            {
                decryptedByteCount--;
            }

            return decryptedByteCount;
        }

The bug is that the plaintext byte count returned by this method after decryption, i.e. decryptedByteCount in the code, is not what is expected. For example, when the plaintext is 10 characters, decryptedByteCount comes back as 8. The caller uses decryptedByteCount to read the plaintext, so the last characters are silently lost.
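For illustration only, a hypothetical caller looks roughly like this (the helper name and the UTF-8 encoding are assumptions, not taken from the actual code):

    int count = cryptoHelper.Decrypt(cipherBytes, key, iv, out byte[] plainBytes);
    // a too-small count silently truncates the plaintext the caller sees
    string plaintext = Encoding.UTF8.GetString(plainBytes, 0, count);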

At the same time, the host information collected the two times the bug appeared online shared a common feature: the CPUs were both AMD EPYC series. Not long before, we had hit a .NET 8 GC exception on AMD EPYC CPUs on another service, so I wondered whether this could be a similar problem. I was about to file a ticket with the dotnet/runtime team, but halfway through writing it the bug showed up online again, and this time the captured host information showed an Intel CPU. Oh well, so much for that theory.

That AMD EPYC bug took a couple of days of its own; I may write it up in a separate post, so I won't spend more space on it in this JIT bug write-up.

Action: Add debug code for online debugging?

We added debugging code to the outer call of the core method, and if we detected that the bug was triggered, we executed another method copied from the core method (with more debugging information, e.g. logging the value of each parameter/variable, etc.), and then we confidently released the debugging code version online, waiting for the bug to show up, and then catching it! We even assured the client that we had everything under control and that we would have the bug fixed by the end of the day.
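In rough outline the wrapper looked something like this (a sketch only; LooksTruncated and DecryptWithDiagnostics are invented names standing in for our business-specific check and the instrumented copy of the method):

    int count = cryptoHelper.Decrypt(buffer, key, iv, out var plain);
    if (LooksTruncated(plain, count))  // hypothetical check that the plaintext is shorter than expected
    {
        // re-run a verbatim copy of Decrypt that logs every parameter and intermediate value
        int debugCount = cryptoHelper.DecryptWithDiagnostics(buffer, key, iv, out var debugPlain, logger);
        logger.LogWarning("AES plaintext truncation detected: count={Count}, debugCount={DebugCount}", count, debugCount);
    }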
Result: after the debugging code went live, the bug was never triggered again. And I never left work again. In the end, since the impact was small and the bug didn't recur, we stopped chasing it (R&D resources are expensive, after all).

One day, 2 months later, I met the bug again!

Just as I was about to leave for the day, a colleague came to me and said: "My friend (* accent), a project on my side calls the base library's AES decryption and is losing plaintext characters. Can you help me take a look?" The behavior in his project was even more outrageous than on my side: my side only lost the last few characters of the plaintext, while his side lost everything but one character. That is, no matter what the ciphertext was (same rules, same length), decryption returned only a single character (for example, plaintext "AAA" encrypts to "BBB"; decrypting "BBB" was expected to return "AAA" with decryptedByteCount 3, but it always returned just "A" with decryptedByteCount 1). His project also triggered the bug far more often: it was triggered within 10 minutes of going live. That gave me an important clue: the bug is more likely to be triggered when the method is executed frequently. So I said to my colleague: "My friend (* accent), I'm off work now. Roll back to the previous version and we'll continue tomorrow."

And then, the next day, we went deeper, and finally found it.

Picking up from the day before: we had captured one important piece of information, namely that frequent execution frequently triggers the bug. So we wrote a local loop that calls the AES decryption method over and over, and as expected, within a short time (roughly 10,000+ executions) the bug could always be triggered. As testing progressed, more and more information surfaced. For example, the bug could only be triggered in release builds, and the core method's code had to remain unchanged for it to trigger. Clearly, "it" was looking more and more related to the runtime's JIT compilation optimizations.
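The local repro harness was roughly the following (a simplified sketch; Encrypt is an assumed counterpart of Decrypt, and any plaintext works as long as it doesn't end in zero bytes, which would be trimmed as padding):

    var plaintextBytes = Encoding.UTF8.GetBytes("some known plaintext");
    var cipher = Encrypt(plaintextBytes, key, iv);   // assumed counterpart of Decrypt

    for (int i = 0; i < 100_000; i++)
    {
        int count = Decrypt(cipher, key, iv, out var plain);
        if (count != plaintextBytes.Length)
        {
            Console.WriteLine($"Bug reproduced at iteration {i}: expected {plaintextBytes.Length} bytes, got {count}");
            break;
        }
    }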
Keep in mind that after a method has run many times, the runtime decides whether it is a hot method based on call counts, timing and other information, and recompiles and optimizes it one or more times (tiered compilation) to improve execution performance. We took dumps locally before and after triggering the bug, and analysis of those dumps confirmed that the bug was indeed caused by the JIT's second, optimizing compilation of the method.
Before the bug was triggered, the current version of the method in the dump was 00007ffe03122e70 (QuickJitted); after the bug was triggered, the current version was 00007ffe03134620 (OptimizedTier1).

> ip2md 00007FFE0312323A
MethodDesc:   00007ffe031de8d0
Method Name:          [......].Decrypt(Byte[], , , Byte[] ByRef)
Class:                00007ffe031ef318
MethodTable:          00007ffe031de918
mdToken:              000000000600036B
Module:               00007ffe031dd740
IsJitted:             yes
Current CodeAddr:     00007ffe03122e70
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            0000018f23a1c864
     CodeAddr:           00007ffe03122e70  (QuickJitted)
     NativeCodeVersion:  0000000000000000
> ip2md 00007FFE03134AE3
MethodDesc:   00007ffe031de8d0
Method Name:          [......].Decrypt(Byte[], , , Byte[] ByRef)
Class:                00007ffe031ef318
MethodTable:          00007ffe031de918
mdToken:              000000000600036B
Module:               00007ffe031dd740
IsJitted:             yes
Current CodeAddr:     00007ffe03134620
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            0000018f23a1c864
     CodeAddr:           00007ffe03129cb0  (QuickJitted + Instrumented)
     NativeCodeVersion:  0000018F0054A360
     CodeAddr:           00007ffe03134620  (OptimizedTier1)
     NativeCodeVersion:  0000018F02253050
     CodeAddr:           00007ffe03122e70  (QuickJitted)
     NativeCodeVersion:  0000000000000000

At this point we didn't hesitate to send everything we had found to the dotnet/runtime team and wait for them to dig deeper. Why did we stop there? Because we couldn't read the assembly code! About 1,000 lines before optimization, about 3,000 after. On top of that there isn't much public information about the JIT internals, so we had no realistic way to go further ourselves.
The runtime team was very helpful: within an hour they had reproduced the bug and found the root cause.
While waiting for the runtime team to backport the fix to .NET 8, we still had to change our code to avoid triggering the bug. The workaround was very simple: restructure the code shape that the faulty optimization latched onto. We ended up with code that looks like this:

    // a using block instead of a using declaration: the while loop now sits outside the aes scope
    using (var aes = Aes.Create())
    {
        aes.Mode = CipherMode.CBC;
        ......
    }
    while (decryptedData[decryptedByteCount - 1] == 0 && decryptedByteCount > 0)

Afterword

If we can capture/monitor enough information and analyze what changed before and after a bug occurs, we may be able to reach the root cause faster. In this case, for example, if we had been capturing the JIT's tiered compilation events, we might have cracked the problem much earlier in the timeline. AI could also be brought in to analyze the differences between events and make suggestions. I'm looking forward to that.
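As a sketch of what such monitoring could look like (our own idea, not something we have in place today), an in-process EventListener can subscribe to the runtime's JIT events (keyword 0x10 on the Microsoft-Windows-DotNETRuntime event source) and log method (re)compilation activity for later comparison:

    using System;
    using System.Diagnostics.Tracing;

    internal sealed class JitEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
            {
                // 0x10 is the JIT keyword; Verbose level includes method jitting/load events
                EnableEvents(eventSource, EventLevel.Verbose, (EventKeywords)0x10);
            }
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            // in production this would go to structured logging rather than the console
            string payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
            Console.WriteLine($"{eventData.EventName}: {payload}");
        }
    }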