A little reference counting, a lot of performance considerations

This article is based on Netty version 4.1.

In the previous post, A Chat About the Design and Implementation of the Netty Data Mover ByteBuf Architecture, I gave a detailed introduction to the design of the entire ByteBuf system. I find Netty's reference counting design particularly elegant, so I have split that part out into its own article.

Netty introduced a reference counting mechanism for ByteBuf. In the ByteBuf design, every ByteBuf inherits from the abstract class AbstractReferenceCountedByteBuf, which implements the ReferenceCounted interface.

public interface ReferenceCounted {
     int refCnt();
     ReferenceCounted retain();
     ReferenceCounted retain(int increment);
     boolean release();
     boolean release(int decrement);
}

Each ByteBuf maintains an internal reference count called refCnt, and the refCnt() method returns its current value. When a ByteBuf is referenced in another context, we call retain() to add 1 to its reference count. Alternatively, retain(int increment) lets us specify how much to add to refCnt.

Every reference to a ByteBuf must be paired with a release: whenever we are done using a ByteBuf, we need to call release() manually to decrement its reference count by 1. When refCnt reaches 0, Netty calls the deallocate method to free the memory resource referenced by the ByteBuf, and release() returns true; if refCnt has not yet reached 0, it returns false. Similarly, release(int decrement) lets us specify how much to subtract from refCnt.
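The contract above can be walked through with a small stand-in class. This is only a sketch: CountedResource and its freed flag are hypothetical names, synchronized is used purely for brevity, and none of this is how Netty actually implements ByteBuf.

```java
// Hypothetical stand-in for a reference-counted resource; not Netty's implementation.
final class CountedResource {
    private int refCnt = 1;   // a freshly created resource has one reference
    private boolean freed;    // set once the backing resource has been "deallocated"

    synchronized int refCnt() {
        return refCnt;
    }

    synchronized CountedResource retain() {
        if (refCnt == 0) {
            throw new IllegalStateException("refCnt: 0");
        }
        refCnt++;
        return this;
    }

    // Returns true only for the call that drops the count to 0.
    synchronized boolean release() {
        if (refCnt == 0) {
            throw new IllegalStateException("refCnt: 0");
        }
        refCnt--;
        if (refCnt == 0) {
            freed = true;     // stands in for deallocate()
            return true;
        }
        return false;
    }

    synchronized boolean isFreed() {
        return freed;
    }

    public static void main(String[] args) {
        CountedResource buf = new CountedResource(); // refCnt == 1
        buf.retain();                                // shared with a second context: refCnt == 2
        System.out.println(buf.release());           // false: one reference remains
        System.out.println(buf.release());           // true: last reference gone, resource freed
    }
}
```

Each retain() is matched by exactly one release(), and only the final release() reports that the resource was freed.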

1. Why introduce reference counting

What does it mean to reference a ByteBuf in another context? Suppose we create a ByteBuf in thread 1 and hand it to thread 2 for processing, which in turn may hand it to thread 3. Each thread has its own processing logic for the ByteBuf, such as handling it or releasing it. In effect, the ByteBuf is shared across multiple thread contexts.

In this situation, it is hard to tell from within a single thread whether a ByteBuf should be freed. For example, thread 1 may be ready to free the ByteBuf while another thread is still using it. That is why Netty introduces reference counting for ByteBuf: every time a ByteBuf is referenced, retain() adds 1 to the reference count, and every time it is released, release() subtracts 1. When the reference count reaches 0, no other context references the ByteBuf, so Netty can free it.

In addition, whereas the JDK DirectByteBuffer relies on GC to free the Native Memory behind it, Netty prefers to free a DirectByteBuf manually and promptly. A JDK DirectByteBuffer is only cleaned up when a GC occurs, but its object instance occupies very little JVM heap memory, so it rarely triggers a GC on its own. This delays the release of the referenced Native Memory; in serious cases the unreleased memory keeps accumulating and leads to an OOM. It can also add very large latency to requests for new DirectByteBuffers.

Netty avoids this by manually freeing Native Memory after each use. However, without relying on the JVM, memory leaks are always possible, for example when we forget to call release().

Detecting memory leaks is therefore another reason why Netty introduced reference counting for ByteBuf. When a ByteBuf is no longer reachable, i.e. there are no strong or soft references to it, and a GC occurs, the ByteBuf instance (which lives in the JVM heap) is about to be reclaimed. At that point Netty checks whether the reference count of the ByteBuf is 0. If it is not, we forgot to call release() on the ByteBuf, and a memory leak has been detected.

After detecting a memory leak, Netty calls reportLeak() to write the leak information to the log at the error level.
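To make the idea concrete, here is one way such a check could be arranged with the JDK's weak references and a ReferenceQueue. It is a simplified sketch, not Netty's leak detector: LeakTrackerSketch, track, close and reportLeaks are hypothetical names, and the demo simulates the garbage collector with Reference.enqueue() because real GC timing is not deterministic.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified leak tracker; Netty's real detector is more elaborate.
final class LeakTrackerSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Trackers of resources whose reference count has not reached 0 yet.
    private final Set<Reference<?>> unreleased = ConcurrentHashMap.newKeySet();

    // Call when the resource is created.
    Reference<Object> track(Object resource) {
        Reference<Object> ref = new WeakReference<>(resource, queue);
        unreleased.add(ref);
        return ref;
    }

    // Call when the reference count reaches 0 and the memory is freed.
    void close(Reference<Object> ref) {
        unreleased.remove(ref);
        ref.clear();
    }

    // Drains trackers of collected resources; any tracker that is still marked
    // as unreleased belongs to a resource that was never fully released.
    int reportLeaks() {
        int leaks = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            if (unreleased.remove(ref)) {
                leaks++;
                System.err.println("LEAK: resource was garbage collected before release()");
            }
        }
        return leaks;
    }

    public static void main(String[] args) {
        LeakTrackerSketch tracker = new LeakTrackerSketch();
        Reference<Object> released = tracker.track(new Object());
        Reference<Object> leaked = tracker.track(new Object());
        tracker.close(released);  // reference count reached 0 normally
        leaked.enqueue();         // simulate the GC reclaiming an unreleased resource
        System.out.println(tracker.reportLeaks()); // 1
    }
}
```

The point of the design is that the check costs nothing on the hot path: the tracker is only consulted after the garbage collector has already decided the object is unreachable.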

At this point you may wonder: isn't this just a small reference counter? Is it really worth discussing? Isn't it just a matter of initializing refCnt to 1 when the ByteBuf is created, adding 1 each time it is referenced in another context, subtracting 1 each time it is released, and freeing the Native Memory when the count reaches 0? That seems too simple.

In fact, Netty's reference counting design is very careful and even somewhat complex. Behind it lie extensive performance considerations, thorough handling of tricky concurrency issues, and repeated trade-offs between performance and thread safety.

2. Initial design of the reference count

To make sense of how the reference counting design evolved, we need to go back to the original starting point, version 4.1., and take a look at the original design.

public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {
    // Updater used to modify refCnt atomically
    private static final AtomicIntegerFieldUpdater<AbstractReferenceCountedByteBuf> refCntUpdater =
            AtomicIntegerFieldUpdater.newUpdater(AbstractReferenceCountedByteBuf.class, "refCnt");
    // Reference count, initialized to 1
    private volatile int refCnt;

    protected AbstractReferenceCountedByteBuf(int maxCapacity) {
        super(maxCapacity);
        // Initialize the reference count to 1
        refCntUpdater.set(this, 1);
    }

    // Increase the reference count by increment
    private ByteBuf retain0(int increment) {
        for (;;) {
            int refCnt = this.refCnt;
            // Each retain adds increment to the reference count
            final int nextCnt = refCnt + increment;

            // Ensure we not resurrect (which means the refCnt was 0) and also that we encountered an overflow.
            if (nextCnt <= increment) {
                // If refCnt was already 0 or the addition overflowed, throw an exception
                throw new IllegalReferenceCountException(refCnt, increment);
            }
            // Update refCnt via CAS
            if (refCntUpdater.compareAndSet(this, refCnt, nextCnt)) {
                break;
            }
        }
        return this;
    }

    // Decrease the reference count by decrement
    private boolean release0(int decrement) {
        for (;;) {
            int refCnt = this.refCnt;
            if (refCnt < decrement) {
                // The number of releases must match the number of references
                throw new IllegalReferenceCountException(refCnt, -decrement);
            }
            // Each release subtracts decrement from the reference count
            // Update refCnt via CAS
            if (refCntUpdater.compareAndSet(this, refCnt, refCnt - decrement)) {
                if (refCnt == decrement) {
                    // The reference count has reached 0: release the Native Memory and return true
                    deallocate();
                    return true;
                }
                // The reference count is not 0 yet: return false
                return false;
            }
        }
    }
}

In the design of versions prior to 4.1., things really are as simple as we imagined: refCnt is initialized to 1 when the ByteBuf is created, each retain increases the reference count by 1, and each release decreases it by 1, with the update performed via CAS in a for loop. When the reference count reaches 0, deallocate() is called to release the Native Memory.
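One detail of this design is easy to miss: the single comparison nextCnt <= increment covers both the "already 0" case and the overflow case. The sketch below replays that comparison on a few values; RetainGuardDemo and rejected are hypothetical names, and the example assumes increment is positive and refCnt is never negative.

```java
final class RetainGuardDemo {
    // Mirrors the guard in the CAS-based retain0: true means the retain is rejected.
    static boolean rejected(int refCnt, int increment) {
        int nextCnt = refCnt + increment;   // wraps around to a negative value on overflow
        return nextCnt <= increment;
    }

    public static void main(String[] args) {
        System.out.println(rejected(1, 1));                 // false: 2 > 1, a normal retain
        System.out.println(rejected(0, 1));                 // true: refCnt was already 0
        System.out.println(rejected(Integer.MAX_VALUE, 1)); // true: the sum wrapped around
    }
}
```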

3. Introduction of instruction-level optimization

The 4.1. design is clean and clear, and we see no problems with it at all, but Netty's performance concerns don't stop there. On the x86 architecture, the XADD instruction performs better than the CMPXCHG instruction; the compareAndSet method is implemented with CMPXCHG under the hood, while the getAndAdd method is implemented with XADD.

So in pursuit of maximum performance, Netty replaced the compareAndSet method with getAndAdd in version 4.1.
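The structural difference between the two primitives can be sketched with a plain AtomicInteger. The class and method names below are hypothetical, and the sketch only checks that both styles count correctly under contention; it is not a benchmark and makes no claim about how much faster one is than the other.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CasVersusXadd {
    // CAS style: read, compute, try to swap; retry if another thread got there first.
    static int addWithCas(AtomicInteger counter, int delta) {
        for (;;) {
            int current = counter.get();
            if (counter.compareAndSet(current, current + delta)) {
                return current;
            }
        }
    }

    // Fetch-and-add style: a single unconditional atomic add, no retry loop.
    static int addWithGetAndAdd(AtomicInteger counter, int delta) {
        return counter.getAndAdd(delta);
    }

    // Runs the chosen style from several threads and returns the final count.
    static int hammer(boolean useCas, int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    if (useCas) {
                        addWithCas(counter, 1);
                    } else {
                        addWithGetAndAdd(counter, 1);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(hammer(true, 4, 100_000));   // 400000
        System.out.println(hammer(false, 4, 100_000));  // 400000
    }
}
```

Both variants are correct; the difference is that the CAS loop may have to retry when threads collide, whereas the fetch-and-add always completes in one atomic step.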

public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {
    // refCntUpdater is the same AtomicIntegerFieldUpdater as in the previous listing

    private volatile int refCnt;

    protected AbstractReferenceCountedByteBuf(int maxCapacity) {
        super(maxCapacity);
        // The reference count is still initially 1
        refCntUpdater.set(this, 1);
    }

    private ByteBuf retain0(final int increment) {
        // Compared with the compareAndSet implementation, the for loop is gone,
        // and refCnt is increased by increment up front
        int oldRef = refCntUpdater.getAndAdd(this, increment);
        // Only after the update do we check for abnormal conditions
        if (oldRef <= 0 || oldRef + increment < oldRef) {
            // Ensure we don't resurrect (which means the refCnt was 0) and also that we encountered an overflow.
            // If the original refCnt was already 0 or refCnt overflowed, roll refCnt back and throw an exception
            refCntUpdater.getAndAdd(this, -increment);
            throw new IllegalReferenceCountException(oldRef, increment);
        }
        return this;
    }

    private boolean release0(int decrement) {
        // First decrease refCnt by decrement
        int oldRef = refCntUpdater.getAndAdd(this, -decrement);
        // If refCnt has reached 0, release the Native Memory
        if (oldRef == decrement) {
            deallocate();
            return true;
        } else if (oldRef < decrement || oldRef - decrement > oldRef) {
            // If there are more releases than retains, or refCnt underflowed,
            // roll refCnt back and throw an exception
            refCntUpdater.getAndAdd(this, decrement);
            throw new IllegalReferenceCountException(oldRef, decrement);
        }
        return false;
    }
}

In the earlier 4.1. implementation, Netty first checks for abnormal retain and release conditions inside a for loop, throwing an IllegalReferenceCountException if it finds one, and only then updates refCnt via CAS. This is a pessimistic strategy for updating the reference count.

In the newer 4.1. implementation, Netty removes the for loop and does the opposite of the compareAndSet approach: it first updates refCnt via getAndAdd, then checks for abnormal conditions, and if it finds one, rolls the update back and throws an IllegalReferenceCountException. This is an optimistic strategy for updating the reference count.

For example, when retain increases the reference count, it first adds to refCnt and then checks whether the original reference count oldRef was already 0 or whether refCnt has overflowed; if so, it rolls refCnt back and throws an exception.

When release decreases the reference count, it first subtracts from refCnt and then checks whether there are more releases than retains (to prevent over-release) and whether refCnt has underflowed; if so, it rolls refCnt back and throws an exception.

4. Introduction of concurrency safety issues

The getAndAdd-based retain and release operations introduced in 4.1. perform better than the CAS-based ones of the earlier 4.1. design, but they also introduce a new concurrency issue.

Let's assume a scenario where we have a ByteBuf with refCnt = 1 and thread 1 calls release() on it.

In the 4.1. implementation, Netty first updates refCnt to 0 with getAndAdd (step 1) and then calls deallocate() to free the Native Memory (step 2). Simple and clear, right? Let's add a little concurrency to it.

Now suppose thread 2 slips in between step 1 (the getAndAdd in thread 1) and step 2 (the deallocate() call) and concurrently calls retain() on this ByteBuf.

In the 4.1. implementation, thread 2 first updates refCnt from 0 to 1 via getAndAdd (step 1.1) and then finds that the original value of refCnt, oldRef, is 0. In other words, by the time thread 2 calls retain(), the reference count of the ByteBuf is already 0 and thread 1 is about to release the Native Memory.

So thread 2 needs to call getAndAdd again to roll refCnt back from 1 to 0 (step 1.2) and finally throws an IllegalReferenceCountException. This is clearly the correct result and matches the semantics: after all, you cannot call retain() on a ByteBuf whose reference count is already 0.

Everything so far seems calm and orderly, just as we envisioned, so let's add a little more concurrency. Suppose thread 3 slips in between step 1.1 (thread 2's update) and step 1.2 (thread 2's rollback) and also calls retain() on the ByteBuf concurrently.

The reference count update (step 1.1) and its rollback (step 1.2) are not a single atomic operation. If thread 3 runs retain() between these two operations, it first increases refCnt from 1 to 2 via getAndAdd.

Note that at this point thread 2 has not yet had a chance to roll refCnt back, so thread 3 sees a refCnt of 1 instead of 0.

Because the oldRef seen by thread 3 is 1, its retain() call succeeds: it increases the reference count of the ByteBuf to 2 without rolling back or throwing an exception. From thread 3's point of view, this is a perfectly healthy ByteBuf.

Immediately afterwards, thread 1 executes step 2 and calls deallocate() to release the Native Memory. From then on, any access to the ByteBuf by thread 3 is a problem, because the Native Memory has already been released by thread 1.
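The interleaving described above can be replayed deterministically on a single thread, which makes the bug easier to see than a real race would. The sketch below uses a plain AtomicInteger in place of the refCnt field; XaddRaceReplay and its variable names are hypothetical, and the checks are simplified versions of the ones in the getAndAdd-based retain0 and release0.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class XaddRaceReplay {
    static String replay() {
        AtomicInteger refCnt = new AtomicInteger(1);

        // Step 1 (thread 1, release): 1 -> 0. oldRef == decrement, so thread 1 will deallocate.
        boolean thread1WillDeallocate = refCnt.getAndAdd(-1) == 1;

        // Step 1.1 (thread 2, retain): 0 -> 1. oldRef <= 0, so thread 2 has to roll back.
        boolean thread2MustRollBack = refCnt.getAndAdd(1) <= 0;

        // Thread 3 (retain) runs before the rollback: 1 -> 2. oldRef == 1 looks valid.
        boolean thread3Retained = refCnt.getAndAdd(1) > 0;

        // Step 1.2 (thread 2): roll back 2 -> 1, after which thread 2 would throw.
        if (thread2MustRollBack) {
            refCnt.getAndAdd(-1);
        }

        // Step 2 (thread 1): deallocate() frees the memory.
        boolean memoryFreed = thread1WillDeallocate;

        return "memoryFreed=" + memoryFreed
                + ", thread3Retained=" + thread3Retained
                + ", refCnt=" + refCnt.get();
    }

    public static void main(String[] args) {
        System.out.println(replay()); // memoryFreed=true, thread3Retained=true, refCnt=1
    }
}
```

The final state is self-contradictory: the memory has been freed, yet refCnt is 1 and thread 3 believes it holds a valid reference.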

5. Trade-offs between performance and concurrency safety

Netty now has two choices. The first is to roll back to the earlier 4.1. design and forgo the performance gains of the XADD instruction. The CMPXCHG instruction used in that design is slower, but it does not suffer from the concurrency safety issue described above.

Because that design updates the reference count pessimistically in a for loop, checking for abnormal conditions first and then updating refCnt via CAS, it does not matter if several threads see an intermediate state of refCnt: the CAS that follows will simply fail.

Take the example above. When thread 1 releases the ByteBuf, refCnt is still 1 in the gap before thread 1 executes the CAS that replaces refCnt with 0. If thread 2 calls retain concurrently in this gap, the refCnt it sees is indeed 1, an intermediate state, and its CAS replaces refCnt with 2.

At this point, thread 1's CAS fails, but in the next iteration of the for loop it replaces refCnt with 1, which is entirely consistent with reference counting semantics.

In the other case, thread 1 has already executed the CAS that replaces refCnt with 0 before thread 2 calls retain. Since the earlier 4.1. design checks for abnormal conditions first and only then performs the CAS, thread 2 first sees in the retain method that the ByteBuf's refCnt is 0, throws an IllegalReferenceCountException, and never performs the CAS. This is also consistent with reference counting semantics; after all, you cannot access a ByteBuf whose reference count is already 0.

The second choice is to keep the performance gains of the XADD instruction while also fixing the concurrency safety issue introduced in the 4.1. release. Without a doubt, Netty chose this option.

Before we get into Netty's elegant design, let's review the root cause of this concurrency safety issue.

In the 4.1. design, Netty first updates the value of refCnt via the getAndAdd method and then rolls back if it detects an abnormal condition. The update and the rollback are not atomic, and the intermediate state between them is visible to other threads.

For example, thread 2 sees the intermediate state left by thread 1 (refCnt = 0) and increases the reference count to 1. Before thread 2 rolls back, its own intermediate state (refCnt = 1, oldRef = 0) is in turn seen by thread 3, which increases the reference count to 2 (refCnt = 2, oldRef = 1). Thread 3 believes this is a normal state, but thread 1 believes refCnt is already 0 and goes on to release the Native Memory, and that is the problem.

The root cause is that different values of refCnt carry different semantics. For thread 1, the release reduces refCnt to 0, which means the ByteBuf is no longer referenced and the Native Memory can be released.

Thread 2 then raises refCnt to 1 via retain, which changes the semantics of the ByteBuf: it now appears to be referenced once, by thread 2. Finally, thread 3 raises refCnt to 2 via retain, changing the semantics of the ByteBuf yet again.

As long as the XADD instruction is used to update the reference count, such concurrent updates of refCnt are unavoidable. The crux is that every concurrent modification of refCnt by another thread changes the semantics of the ByteBuf. This is the key issue in version 4.1.

If Netty wants to enjoy the performance gains of the XADD instruction while also fixing the concurrency safety issue above, it has to redesign the reference count. The first requirement is to keep using the XADD instruction for reference count updates, but that means concurrent modifications by multiple threads would still change the semantics of the ByteBuf.

Since concurrent modification by multiple threads is unavoidable, can we redesign the reference count so that the semantics of the ByteBuf stay the same no matter how many threads modify it? That is, once thread 1 has reduced refCnt to 0, the meaning "the reference count is 0" should hold no matter how threads 2 and 3 concurrently modify or increase the value of refCnt.

6. Introduction of parity design

Here is one of Netty's most ingenious designs: the reference count is no longer interpreted as the logical sequence 0, 1, 2, 3, ...; instead, its values fall into two broad categories, even and odd.

An even value means that the reference count of the ByteBuf is not 0: as long as a ByteBuf is still referenced, its refCnt is even, and the exact number of references can be obtained via refCnt >>> 1.
An odd value means that the reference count of the ByteBuf is 0: as soon as a ByteBuf is no longer referenced anywhere, its refCnt is odd, and the Native Memory behind it is freed.

When a ByteBuf is initialized, refCnt is set to 2 (even) instead of 1. Each retain adds 2 to refCnt instead of 1, and each release subtracts 2 instead of 1, so the step is always even. This way, as long as the reference count of a ByteBuf is even, it stays even no matter how many threads call retain concurrently, and the semantics remain the same.

    public final int initialValue() {
        return 2;
    }

When the last reference to a ByteBuf is released, Netty does not set refCnt to 0 but to 1 (odd). For an odd refCnt, no matter how many threads call retain and release concurrently, the value stays odd, so the meaning "the reference count of this ByteBuf is 0" never changes.

Let's revisit the concurrency safety problem shown above. In the new design, thread 1 first calls release on the ByteBuf, and Netty sets refCnt to 1 (odd).

Thread 2 calls retain concurrently and raises refCnt from 1 to 3 via getAndAdd. refCnt is still odd, and an odd value means the reference count of the ByteBuf is already 0, so thread 2 throws an IllegalReferenceCountException in the retain method.

Thread 3 also calls retain concurrently and raises refCnt from 3 to 5 via getAndAdd. As you can see, in the new design the value of refCnt stays odd no matter how many threads call retain concurrently, so thread 3 throws an IllegalReferenceCountException as well. This is entirely consistent with the concurrency semantics of reference counting.
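The same interleaving can be replayed on a single thread under the parity scheme. The sketch below is a simplification with hypothetical names (ParityReplay, retain, replay): it omits the overflow checks, and it models the final release with compareAndSet(2, 1), which is an assumption of this sketch rather than something shown so far.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ParityReplay {
    // Simplified retain under the parity scheme: add 2, then accept only
    // if the previous raw value was even (odd means "already released").
    static boolean retain(AtomicInteger rawCnt) {
        int oldRef = rawCnt.getAndAdd(2);
        return (oldRef & 1) == 0;
    }

    static String replay() {
        AtomicInteger rawCnt = new AtomicInteger(2);     // raw 2 = logical count 1

        // Thread 1 releases the last reference: raw 2 becomes 1 (odd = released).
        boolean released = rawCnt.compareAndSet(2, 1);

        boolean thread2Retained = retain(rawCnt);        // 1 -> 3, old value odd: rejected
        boolean thread3Retained = retain(rawCnt);        // 3 -> 5, old value odd: rejected

        return "released=" + released
                + ", thread2Retained=" + thread2Retained
                + ", thread3Retained=" + thread3Retained
                + ", rawCnt=" + rawCnt.get();
    }

    public static void main(String[] args) {
        System.out.println(replay());
        // released=true, thread2Retained=false, thread3Retained=false, rawCnt=5
    }
}
```

No rollback is needed for the rejected retains: the raw value drifts from 1 to 5, but it stays odd, so every later retain is rejected as well.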

This new reference counting design was introduced in version 4.1. With nothing more than a parity scheme, it neatly solves the concurrency safety issue of the earlier 4.1. version. Now that the core idea is clear, let's walk through the implementation details, using version 4.1. as the reference.

All ByteBufs in Netty inherit from AbstractReferenceCountedByteBuf, which implements every reference counting operation for ByteBuf; this is where the ReferenceCounted interface is implemented.

public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {
    // Offset of the refCnt field within the ByteBuf object's memory,
    // used later to operate on refCnt through Unsafe
    private static final long REFCNT_FIELD_OFFSET =
            ReferenceCountUpdater.getUnsafeOffset(AbstractReferenceCountedByteBuf.class, "refCnt");

    // AtomicIntegerFieldUpdater for the refCnt field,
    // used later to operate on refCnt through the updater
    private static final AtomicIntegerFieldUpdater<AbstractReferenceCountedByteBuf> AIF_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(AbstractReferenceCountedByteBuf.class, "refCnt");

    // Create the ReferenceCountUpdater; all reference counting operations are ultimately delegated to it
    private static final ReferenceCountUpdater<AbstractReferenceCountedByteBuf> updater =
            new ReferenceCountUpdater<AbstractReferenceCountedByteBuf>() {
        @Override
        protected AtomicIntegerFieldUpdater<AbstractReferenceCountedByteBuf> updater() {
            // Operate on the refCnt field through the AtomicIntegerFieldUpdater
            return AIF_UPDATER;
        }
        @Override
        protected long unsafeOffset() {
            // Operate on the refCnt field through Unsafe
            return REFCNT_FIELD_OFFSET;
        }
    };
    // Reference count of the ByteBuf, initialized to 2 (even)
    private volatile int refCnt = updater.initialValue();
}

A refCnt field records how many times a ByteBuf is referenced. Because of the parity design, Netty initializes refCnt to 2 (even) when creating a ByteBuf, which logically means the ByteBuf is referenced once. Each subsequent retain adds 2 to refCnt and each release subtracts 2, so every single reference counting operation works in steps of 2.

Besides AbstractReferenceCountedByteBuf, which implements reference counting specifically for ByteBuf, Netty has a more general abstract class, AbstractReferenceCounted, which implements reference counting for all kinds of system resources (ByteBuf being just one of them, a memory resource).

Since both classes implement reference counting, earlier versions contained a lot of duplicated logic between them, so in version 4.1. Netty introduced the ReferenceCountUpdater class to gather all reference counting logic in one place.

ReferenceCountUpdater can manipulate the reference count refCnt in two ways. One is through an AtomicIntegerFieldUpdater; the updater() method returns the AtomicIntegerFieldUpdater for the refCnt field.

The other is through Unsafe; the unsafeOffset() method returns the offset of the refCnt field within the ByteBuf instance's memory.

Why does Netty provide two ways to access or update refCnt when one would logically suffice? Isn't that redundant? Keep this question in mind; I will answer it when we analyze the source code in detail.

With that, we can start on the implementation details of the new reference counting design. The first question: in the new design, how do we obtain the logical reference count of a ByteBuf?

public abstract class ReferenceCountUpdater<T extends ReferenceCounted> {
    public final int initialValue() {
        // The reference count of a ByteBuf is initialized to 2
        return 2;
    }

    public final int refCnt(T instance) {
        // Read the raw refCnt through the updater,
        // then derive the logical reference count from it in realRefCnt
        return realRefCnt(updater().get(instance));
    }
    // Get the logical reference count of the ByteBuf
    private static int realRefCnt(int rawCnt) {
        // Parity check
        return rawCnt != 2 && rawCnt != 4 && (rawCnt & 1) != 0 ? 0 : rawCnt >>> 1;
    }
}

Because of the parity design, when we read the logical reference count we need to determine whether the current rawCnt (the raw refCnt) is odd or even, since the two carry different semantics.

If rawCnt is odd, the ByteBuf is not referenced anywhere, and the logical reference count is 0.
If rawCnt is even, the ByteBuf is still referenced somewhere, and the logical reference count is rawCnt >>> 1.

The realRefCnt function is really just a parity check, but its implementation reflects Netty's relentless pursuit of performance. Normally, checking whether a number is odd or even only takes rawCnt & 1: if the result is 0, rawCnt is even, and if it is 1, rawCnt is odd.

But Netty prefixes the parity check with rawCnt != 2 && rawCnt != 4. What is that for?

Netty is trying to replace the & operation with the cheaper equality comparison where it can. It is impossible (and unnecessary) to enumerate every even value with comparisons, so they are only used for the reference counts that occur most often in practice. The most frequent raw values are 2 and 4, meaning that in most scenarios a ByteBuf is referenced only once or twice. Netty optimizes these hot cases with the comparison and falls back to the & operation for the rarer ones.
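A few concrete values make the mapping easier to follow. The sketch below copies the realRefCnt expression from the listing above into a hypothetical RealRefCntDemo class and evaluates it for several raw counts.

```java
final class RealRefCntDemo {
    // Same expression as realRefCnt in the listing above.
    static int realRefCnt(int rawCnt) {
        return rawCnt != 2 && rawCnt != 4 && (rawCnt & 1) != 0 ? 0 : rawCnt >>> 1;
    }

    public static void main(String[] args) {
        System.out.println(realRefCnt(2)); // 1: fast path, first comparison short-circuits
        System.out.println(realRefCnt(4)); // 2: fast path, second comparison short-circuits
        System.out.println(realRefCnt(6)); // 3: even, found via the & check
        System.out.println(realRefCnt(1)); // 0: odd, the buffer has been released
        System.out.println(realRefCnt(5)); // 0: still odd after concurrent retains
    }
}
```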

Most performance optimization works this way. We usually cannot come up with one big, global optimization; that is rarely possible and rarely efficient. The most effective optimizations, the ones with immediate results, tend to target specific local hotspots.

Setting the reference count also has to account for the parity conversion. The refCnt parameter of the setRefCnt method is the logical reference count (0, 1, 2, 3, ...), but the value stored in the ByteBuf must be the logical count multiplied by 2, so that it is always even; a logical count of 0 is stored as 1 (odd).

    public final void setRefCnt(T instance, int refCnt) {
        updater().set(instance, refCnt > 0 ? refCnt << 1 : 1); // overflow OK here
    }
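The conversion in setRefCnt can be checked on a few values. The sketch below copies the expression into a hypothetical toRaw helper; RawCountMapping is not a Netty class.

```java
final class RawCountMapping {
    // Same expression as in setRefCnt: logical count -> raw value stored in refCnt.
    static int toRaw(int logicalCnt) {
        return logicalCnt > 0 ? logicalCnt << 1 : 1;
    }

    public static void main(String[] args) {
        System.out.println(toRaw(0)); // 1: "no references" is stored as an odd value
        System.out.println(toRaw(1)); // 2
        System.out.println(toRaw(3)); // 6
    }
}
```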

With these foundations in place, let's look at how the new retain method addresses the concurrency safety issue that existed in version 4.1. First, Netty's parity design is transparent to the user: from the user's perspective, the reference count is still an ordinary natural number (0, 1, 2, 3, ...).

So whenever the user calls retain to increase the reference count of a ByteBuf, the step they specify, increment, is a logical one (the user's point of view), whereas internally Netty adds twice that amount (rawIncrement) to the refCnt field.

    public final T retain(T instance) {
        // Logically add 1 to the reference count, but actually add 2 to refCnt
        return retain0(instance, 1, 2);
    }

    public final T retain(T instance, int increment) {
        // all changes to the raw count are 2x the "real" change - overflow is OK
        // rawIncrement is always twice the logical increment
        int rawIncrement = checkPositive(increment, "increment") << 1;
        // Add rawIncrement to the ByteBuf's refCnt field
        return retain0(instance, increment, rawIncrement);
    }

    // rawIncrement = increment << 1
    // increment is the logical step of the reference count
    // rawIncrement is the actual step applied to refCnt
    private T retain0(T instance, final int increment, final int rawIncrement) {
        // First add rawIncrement to refCnt with the XADD instruction
        int oldRef = updater().getAndAdd(instance, rawIncrement);
        // If oldRef is odd, the ByteBuf is no longer referenced, so throw an exception
        if (oldRef != 2 && oldRef != 4 && (oldRef & 1) != 0) {
            // Once oldRef is odd it stays odd no matter how many threads retain concurrently,
            // so every one of them ends up throwing here
            throw new IllegalReferenceCountException(0, increment);
        }
        // don't pass 0!
        // refCnt must never pass through 0; a released ByteBuf is represented by 1
        if ((oldRef <= 0 && oldRef + rawIncrement >= 0)
                || (oldRef >= 0 && oldRef + rawIncrement < oldRef)) {
            // If refCnt crossed 0 or overflowed, roll back and throw an exception
            updater().getAndAdd(instance, -rawIncrement);
            throw new IllegalReferenceCountException(realRefCnt(oldRef), increment);
        }
        return instance;
    }

First of all, the new version of the retain0 method still retains the performance benefits of the XADD instruction introduced in version 4.1. The general processing logic is also similar, where the refCnt is first added to the rawIncrement via the getAndAdd method.retain(T instance) For example, just add 2 here.

Then determine whether the original reference count oldRef is an odd number, if it is an odd number, then it means that ByteBuf does not have any reference, the logical reference count is already 0, then throw IllegalReferenceCountException.

In the case of an odd reference count, no matter how many threads concurrently add 2 to refCnt, refCnt will always be an odd number, and will eventually throw an exception. The point of solving the concurrency safety problem is to ensure that concurrent execution of the retain method does not change the original semantics.

Finally, it will determine whether the refCnt field is overflowed or not, and if it is overflowed, it will be backed out and an exception will be thrown. Let's take the previous concurrency scenario as an example, and use a concrete example to recall the subtleties of the parity design.

Thread 1 executes the release method on a ByteBuf whose refCnt is 2, which brings the logical reference count of the ByteBuf to 0. In the new design, the refCnt of a ByteBuf without any references can only be odd, never 0, so Netty sets refCnt to 1 here and then calls the deallocate method in step 2 to free the Native Memory.

Thread 2 cuts in between step 1 and step 2 and concurrently executes the retain method on the ByteBuf. At this point thread 2 sees a refCnt of 1, raises it to 3 through getAndAdd, which is still odd, and then throws an IllegalReferenceCountException.

Thread 3 cuts in between steps 1.1 and 1.2 and again concurrently executes the retain method on the ByteBuf. At this point thread 3 sees a refCnt of 3, raises it to 5 through getAndAdd, which is still odd, and then throws an IllegalReferenceCountException.

This guarantees the concurrency semantics of reference counting: once a ByteBuf has no references left (its refCnt is odd, starting at 1), other threads get an exception no matter how they concurrently execute the retain method.
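The scenario above can be replayed with a small sketch. This is not Netty's code: the class name ParityRetainSketch is made up for illustration, an AtomicInteger stands in for the updater, IllegalStateException stands in for IllegalReferenceCountException, and the overflow check is left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for the retain path; not Netty's class.
final class ParityRetainSketch {
    // Raw counter: even means live, odd means released.
    private final AtomicInteger rawCnt;

    ParityRetainSketch(int initialRaw) {
        rawCnt = new AtomicInteger(initialRaw);
    }

    int raw() {
        return rawCnt.get();
    }

    // Same shape as retain0: add first (XADD), then inspect the old value.
    void retain() {
        int oldRaw = rawCnt.getAndAdd(2);
        if ((oldRaw & 1) != 0) {
            throw new IllegalStateException("already released, raw was " + oldRaw);
        }
    }

    public static void main(String[] args) {
        // Raw value 1 is the state right after the final release.
        ParityRetainSketch released = new ParityRetainSketch(1);
        for (int i = 0; i < 2; i++) {
            try {
                released.retain();
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage() + ", raw is now " + released.raw());
            }
        }
    }
}
```

Each failed retain still moves the raw value (1, then 3, then 5), but since it never becomes even, every later caller keeps seeing a released buffer.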

However, the retain method alone cannot guarantee the concurrency semantics of reference counting; it needs the cooperation of the release method. To preserve those semantics, release cannot use the faster XADD instruction and has to fall back to the CMPXCHG instruction.

Why? Because the new reference counting design relies on parity: an even refCnt means the ByteBuf still has references, and an odd refCnt means it has none, so the Native Memory can be released safely. For a ByteBuf with an odd refCnt, no matter how many threads concurrently execute the retain method, refCnt remains odd and an IllegalReferenceCountException is thrown. This is the concurrency semantics of reference counting.

To ensure this, every call to retain and release must change refCnt by an even amount: for example, each call to retain adds 2 to refCnt, and each call to release subtracts 2 from it.
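The relationship between the stored value and the logical count can be written down as two tiny helpers. The names toRaw and toReal are assumptions made for this sketch; Netty's own helpers are named and structured differently.

```java
// Hypothetical helpers illustrating the parity encoding; not Netty's API.
final class ParityEncodingSketch {
    // Logical count -> raw value stored in the field (always even while live).
    static int toRaw(int realCnt) {
        return realCnt << 1;
    }

    // Raw value -> logical count; any odd raw value means "released", i.e. 0.
    static int toReal(int rawCnt) {
        return (rawCnt & 1) != 0 ? 0 : rawCnt >>> 1;
    }

    public static void main(String[] args) {
        int raw = toRaw(1);   // a fresh buffer: logical 1, raw 2
        raw += toRaw(1);      // retain(): raw grows by 2
        System.out.println("raw=" + raw + " logical=" + toReal(raw));
        raw -= toRaw(1);      // release(): raw shrinks by 2
        System.out.println("raw=" + raw + " logical=" + toReal(raw));
        System.out.println("raw=5 logical=" + toReal(5)); // odd: released
    }
}
```

Because every step is a multiple of 2, a live counter stays even and a released counter stays odd, whatever order the updates arrive in.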

But there is always a moment when refCnt would be reduced to 0, right? In the new parity design, refCnt is not allowed to be 0, because once it reaches 0, concurrent retain calls from multiple threads would raise refCnt to an even number again, which leads to concurrency problems.

And every call to the release method subtracts 2 from refCnt. If release were implemented with the XADD instruction, as in the design of version 4.1. discussed earlier, the first step would be to subtract 2 from refCnt via the getAndAdd method. At that point refCnt becomes 0, which is exactly the concurrency safety issue. So refCnt has to be updated to 1 with the CMPXCHG instruction instead.

Some readers may ask: couldn't we add a small if check, so that when subtracting 2 would bring refCnt to 0, we use getAndAdd to update refCnt to 1 instead (subtracting an odd number)? Wouldn't that also keep the performance advantage of the XADD instruction?

The answer is no, because the two operations, the if check and the getAndAdd update, are still not atomic together, and multiple threads can still execute the retain method concurrently in the gap between them, as shown in the following figure:

Between thread 1's if check and its getAndAdd update, thread 2 sees a refCnt of 2 and raises it to 4, and thread 3 immediately raises it to 6. To both thread 2 and thread 3 the ByteBuf looks completely normal, yet thread 1 goes on to release the Native Memory.

Moreover, with this design getAndAdd would sometimes subtract an odd number from refCnt and sometimes add an even number to it, which breaks the parity design.
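To make the race concrete, the interleaving can be replayed by hand on a single thread. The sketch below models the hypothetical "if check plus getAndAdd" release that the text rejects; it is not something Netty ever shipped, and the class and variable names are invented for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Replays the rejected design step by step; purely illustrative.
final class BrokenXaddReleaseSketch {
    // Returns {value seen by thread 2, value seen by thread 3, final raw value}.
    static int[] simulate() {
        AtomicInteger rawCnt = new AtomicInteger(2); // logical count 1

        // Thread 1, step 1: decides this release will be the last one.
        boolean lastReference = rawCnt.get() == 2;

        // Threads 2 and 3 slip in with retain(): both see an even old value,
        // so neither of them throws.
        int seenByThread2 = rawCnt.getAndAdd(2); // 2 -> 4
        int seenByThread3 = rawCnt.getAndAdd(2); // 4 -> 6

        // Thread 1, step 2: acts on its stale decision, subtracts an odd
        // number, and would now go on to deallocate the memory.
        if (lastReference) {
            rawCnt.getAndAdd(-1); // 6 -> 5
        }
        return new int[] {seenByThread2, seenByThread3, rawCnt.get()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate()));
    }
}
```

Threads 2 and 3 both believe they hold a valid reference, while thread 1 frees the buffer underneath them.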

So the design goal is for the release method to atomically update refCnt to 1 when the ByteBuf has no references left. This must be done with the CMPXCHG instruction and not the XADD instruction.

Furthermore, the CMPXCHG instruction can atomically detect whether a concurrent update has happened: if it has, the CAS fails and we can simply retry. The XADD instruction cannot do this, because it always updates first and only checks afterwards, and that pair of steps is not atomic. This is particularly evident in the source code below.
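As a rough illustration of that difference, here is the same interleaving with the final release done by CAS. Again this is only a sketch over AtomicInteger with invented names, not Netty's implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a CAS-based final release refuses to act on a stale read.
final class CasFinalReleaseSketch {
    // Thread 1's final-release attempt, based on a value it read earlier.
    static boolean tryFinalRelease(AtomicInteger rawCnt, int expectedRaw) {
        return rawCnt.compareAndSet(expectedRaw, 1);
    }

    public static void main(String[] args) {
        AtomicInteger rawCnt = new AtomicInteger(2); // logical count 1
        int seenByThread1 = rawCnt.get();            // thread 1 reads 2

        rawCnt.getAndAdd(2);                         // thread 2 retains: 2 -> 4

        // The CAS compares against the stale 2, finds 4, and changes nothing.
        boolean freed = tryFinalRelease(rawCnt, seenByThread1);
        System.out.println("freed=" + freed + ", raw=" + rawCnt.get());
    }
}
```

The failed CAS leaves the counter untouched, so thread 1 can re-read it and decide again with up-to-date information.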

7. Minimize memory barrier overheads

    public final boolean release(T instance) {
        // First read refCnt in a non-volatile way via Unsafe
        int rawCnt = nonVolatileRawCnt(instance);
        // If this release brings the logical reference count to 0, use CAS to update refCnt to 1 via tryFinalRelease0
        // If the CAS fails, retry with retryRelease0
        // If the logical reference count does not reach 0, subtract 2 from refCnt with nonFinalRelease0
        return rawCnt == 2 ? tryFinalRelease0(instance, 2) || retryRelease0(instance, 1)
                : nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
    }

One small detail again shows Netty's dedication to performance: the refCnt field in ByteBuf is declared volatile.

    private volatile int refCnt = updater.initialValue();

Normal reads and writes of refCnt have to go through a memory barrier, but the first read of refCnt in the release method is done in a non-volatile way. It skips the memory barrier and reads the cache line directly, avoiding the barrier overhead.

    private int nonVolatileRawCnt(T instance) {
        // Get REFCNT_FIELD_OFFSET
        final long offset = unsafeOffset();
        // Access refCnt through Unsafe to avoid the memory barrier overhead
        return offset != -1 ? PlatformDependent.getInt(instance, offset) : updater().get(instance);
    }

Some readers may ask: if refCnt is read without going through the memory barrier, couldn't the value read be incorrect?

It could, but Netty doesn't care. Reading a stale value is harmless because reference counting uses the parity design: the first read of the reference count doesn't need to be exact, so it can go straight through Unsafe and save one memory barrier.

Why don't we need an exact value? Because if the original refCnt is odd, then no matter how many threads concurrently call retain, the result is still odd. We only need to know that refCnt is odd in order to throw an IllegalReferenceCountException. It doesn't really matter whether we read a 3 or a 5.

What if the original refCnt is even? That doesn't matter either. We may read the right value or a stale one, and if we happen to read the right value, so much the better. If we read a stale value, the later update is done with CAS, so the CAS simply fails and the update is done correctly in the next iteration of the for loop.
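The idea of "cheap read as a guess, CAS as the judge" can be sketched with a VarHandle, whose plain get reads a volatile field as if it were not volatile. This is an assumption-laden stand-in: Netty itself goes through Unsafe, the class name PlainReadSketch is invented, and the code needs Java 9 or later.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative stand-in for the Unsafe-based read; not Netty's code.
final class PlainReadSketch {
    private volatile int refCnt = 2; // raw value, logical count 1

    private static final VarHandle REF_CNT;

    static {
        try {
            REF_CNT = MethodHandles.lookup()
                    .findVarHandle(PlainReadSketch.class, "refCnt", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain read: no volatile semantics, so the value may be stale.
    int plainRawCnt() {
        return (int) REF_CNT.get(this);
    }

    // Volatile read, for comparison.
    int volatileRawCnt() {
        return (int) REF_CNT.getVolatile(this);
    }

    // The plain read is only a guess; compareAndSet rejects a stale guess.
    boolean tryFinalRelease() {
        int guess = plainRawCnt();
        return guess == 2 && REF_CNT.compareAndSet(this, guess, 1);
    }

    public static void main(String[] args) {
        PlainReadSketch buf = new PlainReadSketch();
        System.out.println("released=" + buf.tryFinalRelease()
                + ", raw=" + buf.volatileRawCnt());
    }
}
```

A wrong guess costs one failed CAS at worst, while a right guess saves the barrier on the common path.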

If the refCnt read happens to be 2, that means the logical reference count of the ByteBuf is 0 after this release, and Netty will update the refCnt to 1 via CAS.

    private boolean tryFinalRelease0(T instance, int expectRawCnt) {
        return updater().compareAndSet(instance, expectRawCnt, 1); // any odd number will work
    }

If the CAS update fails, other threads may have concurrently executed the retain method on the ByteBuf, so the logical reference count may no longer reach 0 with this release. For this concurrent case, Netty retries in the retryRelease0 method and subtracts 2 from refCnt.

    private boolean retryRelease0(T instance, int decrement) {
        for (;;) {
            // Read refCnt in a volatile way
            int rawCnt = updater().get(instance),
            // Get the logical reference count; throws if refCnt has already become odd
            realCnt = toLiveRealRefCnt(rawCnt, decrement);
            // This release brings the logical reference count to 0
            if (decrement == realCnt) {
                // CAS refCnt to 1
                if (tryFinalRelease0(instance, rawCnt)) {
                    return true;
                }
            } else if (decrement < realCnt) {
                // The original logical reference count realCnt is greater than decrement,
                // so reduce refCnt by 2 * decrement via CAS
                if (updater().compareAndSet(instance, rawCnt, rawCnt - (decrement << 1))) {
                    return false;
                }
            } else {
                // decrement is larger than the logical reference count, so throw an exception
                throw new IllegalReferenceCountException(realCnt, -decrement);
            }
            // Yield after a failed CAS to reduce pointless contention;
            // otherwise every thread keeps failing its CAS here under high concurrency
            Thread.yield();
        }
    }

As the implementation of the retryRelease0 method shows, CAS can atomically detect whether a concurrent update has happened. If it has, the CAS here fails, and the correct value is written to refCnt in the next iteration of the for loop. This is something the XADD instruction cannot do.

If the refCnt read on entering the release method is not 2, the tryFinalRelease0 logic above is skipped, and nonFinalRelease0 subtracts 2 from refCnt via CAS instead.
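A compact version of this retry loop, rebuilt over a plain AtomicInteger, might look like the sketch below. It is a simplification rather than Netty's code: the class name is invented, IllegalStateException replaces IllegalReferenceCountException, and negative (overflowed) raw values are not handled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of a parity-encoded counter with a CAS release loop.
final class CasReleaseLoopSketch {
    private final AtomicInteger rawCnt = new AtomicInteger(2); // logical count 1

    int raw() {
        return rawCnt.get();
    }

    void retain() {
        int old = rawCnt.getAndAdd(2);
        if ((old & 1) != 0) {
            throw new IllegalStateException("already released");
        }
    }

    // Returns true only for the call that drops the logical count to 0.
    boolean release(int decrement) {
        for (;;) {
            int raw = rawCnt.get();
            if ((raw & 1) != 0) {
                throw new IllegalStateException("already released");
            }
            int real = raw >>> 1;
            if (decrement == real) {
                if (rawCnt.compareAndSet(raw, 1)) {
                    return true;
                }
            } else if (decrement < real) {
                if (rawCnt.compareAndSet(raw, raw - (decrement << 1))) {
                    return false;
                }
            } else {
                throw new IllegalStateException(
                        "releasing " + decrement + " but only " + real + " held");
            }
            Thread.yield(); // a CAS failed: back off briefly, then re-read
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CasReleaseLoopSketch buf = new CasReleaseLoopSketch();
        int threads = 8;
        for (int i = 1; i < threads; i++) {
            buf.retain(); // logical count now equals the number of threads
        }
        AtomicInteger finalReleases = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                if (buf.release(1)) {
                    finalReleases.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("final releases: " + finalReleases.get());
    }
}
```

However the eight releases interleave, only the one whose CAS moves the counter to 1 reports the final release, which is the property deallocate depends on.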

    private boolean nonFinalRelease0(T instance, int decrement, int rawCnt, int realCnt) {
        if (decrement < realCnt
                && updater().compareAndSet(instance, rawCnt, rawCnt - (decrement << 1))) {
            // The ByteBuf's rawCnt has been reduced by 2 * decrement
            return false;
        }
        // A CAS failure always leads to a retry; if the reference count is already 0,
        // retryRelease0 throws, because a released ByteBuf cannot be released again
        return retryRelease0(instance, decrement);
    }

Summary

That concludes the full analysis of Netty's elegant reference counting design. It contains four notable optimizations, summarized below:

Replace the CMPXCHG instruction with the better performing XADD instruction.
Reference counting uses a parity design to ensure concurrency semantics.
Use the faster == comparison in place of the & operation.
Avoid going through a memory barrier whenever possible.