
An inference about how set implements automatic de-duplication for structs


From my blog. Original link

Let's start with the conclusion

Because each of its operations runs in logarithmic time, set cannot reliably de-duplicate struct elements when the key used to judge duplicates differs from the key used to judge order.

 

First, let's look at the following struct definition. The intent is to treat two elements as duplicates when their num values are equal, and otherwise to sort them by key in descending order.

struct node{
    int num;
    int key;
    bool operator < (const node &b) const{
        if(num == b.num)           // equal num: treat as duplicates
            return false;
        else{
            if(key != b.key)       // otherwise sort by key, descending
                return key > b.key;
            return num > b.num;    // tie-break on num, descending
        }
    }
};

We then test four sets of input data; the results are discussed below.
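As an illustrative sketch of this kind of test (the exact contents of the four input sets vary; the sequence below is pieced together from the observations discussed next, so treat it as illustrative rather than the exact original input):

#include <cstdio>
#include <set>

// Uses struct node as defined above. Inserting (3, 7) immediately after (3, 1)
// is de-duplicated as expected, but inserting it after (3, 1), (1, 2), (2, 4) is not.
int main(){
    std::set<node> s;
    s.insert({3, 1});
    s.insert({1, 2});
    s.insert({2, 4});
    bool inserted = s.insert({3, 7}).second;   // num 3 already exists, yet this can succeed
    printf("inserted: %d, size = %zu\n", (int)inserted, s.size());
    for(const node &x : s)
        printf("(%d, %d)\n", x.num, x.key);
    return 0;
}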




In the first set of data our de-duplication requirement is indeed satisfied: the (3, 7) inserted second does not enter the set. In the second set of data, however, (3, 7) is inserted later and the set does not perform the de-duplication we expected. The first thing to know is what the set container does when deciding whether an existing element a and a newly inserted element b are equal:

1) Call the comparison function with a as the left operand and b as the right operand, and record the return value.

2) Call the comparison function again with b as the left operand and a as the right operand, and record the return value.

If both calls return false, a and b are considered equivalent, and b is not inserted into the set container.

If both calls return true, the comparator is not a valid strict weak ordering and the behavior is undefined.
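In other words, set treats a and b as duplicates exactly when neither compares less than the other; a minimal illustration:

// std::set considers two keys equivalent (i.e. duplicates) when neither is
// "less than" the other under the supplied comparator.
bool equivalent(const node &a, const node &b){
    return !(a < b) && !(b < a);
}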

The fact that (3, 7) was not de-duplicated according to the priority defined in our struct can only mean that it was never compared with (3, 1) during insertion: it was compared with the other elements, its position was decided directly, and it was inserted. Combined with the third set of data, could it be that set simply compares a new element against the existing elements in insertion order, starting from the first one? Obviously not, because in the fourth set of data, where (2, 4) is replaced with (3, 4), the de-duplication succeeds. So inserting (2, 4) must affect the structure in which set stores its data.
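One way to see which comparisons actually happen is to log every call to the comparator. A small sketch (the name tracked_node is just for illustration, and the exact sequence of comparisons depends on the standard library's red-black tree implementation):

#include <cstdio>
#include <set>

// Same fields and ordering as the struct above, but every comparison is printed,
// so we can see which existing elements a newly inserted (3, 7) is compared with.
struct tracked_node{
    int num, key;
    bool operator < (const tracked_node &b) const{
        printf("compare (%d, %d) with (%d, %d)\n", num, key, b.num, b.key);
        if(num == b.num)
            return false;
        if(key != b.key)
            return key > b.key;
        return num > b.num;
    }
};

int main(){
    std::set<tracked_node> s{{3, 1}, {1, 2}, {2, 4}};
    puts("inserting (3, 7):");
    s.insert({3, 7});   // watch which pairs are printed: (3, 7) is never compared with (3, 1)
    return 0;
}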

We also know that set is implemented with a red-black tree underneath, so we make the following conjecture to explain the results of the third and fourth sets of data.

For the third set of data: (3, 1) is inserted first and becomes the root node; when (3, 7) is inserted it is compared with the root, found to be a duplicate, and not inserted. Next (1, 2) is inserted; since we sort by the size of the key, (1, 2) > (3, 1), so it becomes the right child. (2, 4) is likewise placed as the right child of (1, 2). At this point the tree loses its balance, so it is left-rotated to restore balance. After the rotation the root is (1, 2), its left child is (3, 1), and its right child is (2, 4). Now (3, 7) is inserted again: it is first compared with the root, found to be greater, and sent into the right subtree, where it is compared with (2, 4) and its position is decided directly. It is never compared with (3, 1), which is why it cannot be de-duplicated correctly.

For the fourth set of data, the tree never loses its balance after (1, 2) is inserted (the replacement element (3, 4) is itself rejected as a duplicate of (3, 1), so no rebalancing is triggered), so the root is still (3, 1). The elements inserted later are therefore still compared with the root and can be de-duplicated correctly.

For further verification, we can replace the later-inserted (3, 7) with (3, 0). After being compared with the rotated root (1, 2), it should go into the left subtree and be compared with (3, 1), and therefore be de-duplicated correctly. The experimental result is as follows:
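A sketch of that check (reusing struct node from the top of the post):

#include <cstdio>
#include <set>

int main(){
    std::set<node> s;
    s.insert({3, 1});
    s.insert({1, 2});
    s.insert({2, 4});
    // (3, 0) compares against the rotated root (1, 2), descends into the subtree
    // that contains (3, 1), and is rejected there because num matches.
    bool inserted = s.insert({3, 0}).second;
    printf("inserted: %d, size = %zu\n", (int)inserted, s.size());
    return 0;
}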

We find that (3, 0) is indeed not inserted, which confirms our conjecture.


In fact, the root cause of this situation is that the key we use to judge duplicates differs from the key we use to sort. Elements with the same num can end up in two completely different subtrees because their keys compare differently against the key of the root node. If instead we make the duplicate key and the sort key the same, as below:

struct node{
    int num;
    int key;
    bool operator < (const node &b) const{
        if(num == b.num)        // equal num: treat as duplicates
            return false;
        return num > b.num;     // sort by num, descending; key no longer affects ordering
    }
};

then the failure to de-duplicate no longer occurs. This is because, in this version, as long as two elements share the same num they also have the same ordering relationship with every other element, so they are routed to the same position in the red-black tree.
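As a quick check (a sketch using the num-only comparator above and the same insertion sequence as before), the late-inserted (3, 7) is now rejected regardless of how the tree has been rebalanced:

#include <cstdio>
#include <set>

int main(){
    std::set<node> s;   // node with the num-only comparator defined just above
    s.insert({3, 1});
    s.insert({1, 2});
    s.insert({2, 4});
    bool inserted = s.insert({3, 7}).second;   // num 3 already present, so not inserted
    printf("inserted: %d, size = %zu\n", (int)inserted, s.size());
    return 0;
}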

Looking at it from the other side, we chose set in the first place because every one of its operations has excellent logarithmic complexity, and a logarithmic-time insertion cannot possibly compare the new element against every existing element.