
Data Structures - Hashtables, Revisited


This article picks up where the previous one left off; we continue our discussion of hash tables.

From the previous chapter it is not hard to see that no matter how the hash function is constructed, collisions will always occur; at best we can only reduce the probability of collisions, never eliminate them entirely. How to resolve collisions has therefore become the central problem of hashing.

01 Collision Solutions

Let's take a look at a few collision handling strategies.

1. Chaining method

The core idea of the chaining method is to put all colliding keys together into one collection.

Let's first review what a collision is: a collision occurs when two different keys are mapped to the same hash value by the hash function.

Why is a collision a problem? Because a hash table is essentially an array, and one array subscript can hold only one value. If key1 and key2 collide, the value2 corresponding to key2 will overwrite the value1 corresponding to key1, which leads to data loss.

An intuitive idea for solving this: if, when a collision occurs, value2 does not overwrite value1 but both are stored at the same time, wouldn't the problem be solved?

The answer is yes, but there are many details to handle. For example, how do you store more than one element at a single array position? And how do you then find the key you want among the colliding ones?

First of all, how can a single array position store multiple elements? There are many options: each array element could itself be an array, or each array element could hold the head pointer of a linked list, and so on.

In a hash table, each array position is called a "bucket" or "slot". With chaining, each bucket (slot) corresponds to a linked list, and all colliding keys are placed into the linked list of the same bucket; this collision-handling strategy is called the chaining method.

Accessing an element then takes two steps: first find the bucket the key maps to, then search for the element in the linked list corresponding to that bucket.

This scheme is simple and practical, and because linked-list nodes are allocated dynamically, it also greatly improves space utilization.
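To make the two-step access concrete, here is a minimal Python sketch of a chained hash table. The class and method names are hypothetical, and plain Python lists stand in for the linked lists described above.

```python
# Minimal sketch of a chained hash table (illustrative, not production code).
class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        # One bucket per array slot; each bucket holds (key, value) pairs.
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # Step 1: locate the bucket the key maps to.
        return self.buckets[hash(key) % self.size]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # same key: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # colliding keys simply coexist in the bucket

    def get(self, key):
        # Step 2: scan the bucket's list for the exact key.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```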

2. Open addressing method

The core idea of open addressing is that when a collision occurs and the current hash position is already occupied, we keep probing, in some fashion, for the next available position.

Specifically, when a collision occurs, a step value is added on top of the current hash value, and the resulting position is checked for availability; if it is not available, this step is repeated.

Thus open addressing can be summarized as: hash_value = (hash(key) + step(i)) % m, where 1 ≤ i ≤ m-1.

In the figure above, light green represents an empty slot and light orange an occupied one. When key4 is inserted, the hash function computes array index 6, which is already occupied, so probing continues: index 8 is occupied, index 0 is occupied, and index 1 is free, so value4 is stored there.
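As a sketch of the general scheme, the following probe loop implements the formula above with a pluggable step function. The names probe and step are hypothetical; each strategy described below amounts to a different choice of step.

```python
# Generic open-addressing probe loop (sketch). `table` is a list where None
# marks a free slot; `step(i, key)` supplies the i-th increment.
def probe(table, key, step):
    m = len(table)
    base = hash(key) % m
    if table[base] is None:
        return base                       # home slot is free: no collision
    for i in range(1, m):
        idx = (base + step(i, key)) % m   # hash_value = (hash(key) + step(i)) % m
        if table[idx] is None:
            return idx                    # first free slot on the probe path
    raise RuntimeError("no free slot found")
```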

Depending on how the step is generated, open addressing divides into the following probing methods: linear probing, quadratic probing, double hashing, random probing, and so on.

(1) Linear probing

Linear probing resolves collisions by scanning linearly for the next available position in the hash table.

When a collision occurs, successive positions are probed one by one until an available address is found or the whole table has been examined. Probing does not stop at the end of the table; it wraps around to the beginning and continues until it returns to the position where probing started.

The formula can be expressed as: hash_value = (hash(key) + step) % m, where 1 ≤ step ≤ m-1.

Advantages: 1. Simple to implement. 2. Efficient, especially when the load factor is low.

Drawbacks: It can lead to clustering, i.e., colliding elements pile up next to each other, occupying neighboring positions and increasing probe time.
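In terms of the generic probe loop sketched earlier, linear probing is simply a step that grows by one each time (a hypothetical helper for illustration):

```python
def linear_step(i, key):
    # Probe base+1, base+2, ... wrapping around via the modulo in the loop.
    return i

# e.g. probe(table, key, linear_step)
```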

(2) Quadratic probing

Quadratic probing resolves hash collisions by using squared increments, which spreads elements out more effectively.

If each probe in linear probing advances by STEP, then quadratic probing advances by the square of STEP.

The formula can be expressed as: hash_value = (hash(key) + step^2) % m, where 1 ≤ step ≤ m-1.

Compared with linear probing, the main advantage of quadratic probing is that the squared increments spread the probe positions more widely and disperse them more, which reduces clustering and thus improves lookup efficiency. Quadratic probing is therefore usually better than linear probing at handling collisions, though which method to choose should of course depend on the specific application scenario and performance requirements.

Advantages: Reduced clustering and a more uniform distribution of elements.

Drawbacks: 1. Secondary clustering may still occur (keys with the same home slot follow the same probe sequence, so certain positions are visited repeatedly). 2. It may not be able to probe every position, especially when the load factor is high.
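Again in terms of the generic probe loop, quadratic probing just squares the step counter (a hypothetical helper for illustration):

```python
def quadratic_step(i, key):
    # Probe base+1, base+4, base+9, ... spreading probes further apart.
    return i * i

# e.g. probe(table, key, quadratic_step)
```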

(3) Double hashing

Double hashing determines the probe sequence with two hash functions, dispersing collisions more effectively. That is, when a collision occurs, a second hash function generates the increment that determines the next position to examine.

The formula can be expressed as: hash_value = (hash(key) + i * hash2(key)) % m, where 1 ≤ i ≤ m-1.

Advantages: 1. Reduced clustering, high efficiency, and a more random probe path. 2. It can reach all positions, theoretically avoiding the clustering problem.

Drawbacks: The implementation is relatively complex and requires choosing a suitable second hash function.
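A sketch of a step function for double hashing that plugs into the probe loop above; the particular second hash function here is a common textbook choice and an assumption on my part, not prescribed by the article:

```python
def make_double_hash_step(m):
    def step(i, key):
        # hash2 must never be 0, otherwise probing would stay in place;
        # 1 + (hash(key) % (m - 1)) guarantees an increment in 1..m-1.
        h2 = 1 + (hash(key) % (m - 1))
        return i * h2
    return step

# e.g. probe(table, key, make_double_hash_step(len(table)))
```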

(4) Random probing

Random probing resolves collisions by choosing the next probe position at random. That is, when a collision occurs, the increment that determines the next position to examine is a random number.

The formula can be expressed as: hash_value = (hash(key) + r) % m, where 1 ≤ r ≤ m-1.

Advantages: 1. Significantly reduced clustering, with randomized probe positions. 2. Elements are dispersed effectively, reducing the impact of collisions.

Drawbacks: 1. Randomness may lead to erratic performance, especially at high load factors. 2. Behavior is harder to predict and debug.
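One detail the formula leaves implicit is that lookups must retrace the same "random" increments used at insertion time. A common way to achieve that, and an assumption in this sketch, is to seed a pseudo-random generator with the key:

```python
import random

def make_random_step(m):
    def step(i, key):
        # Seeding with the key makes the sequence reproducible, so a later
        # lookup follows exactly the same probe path as the original insert.
        rng = random.Random(hash(key))
        r = 0
        for _ in range(i):
            r = rng.randrange(1, m)   # i-th random increment in 1..m-1
        return r
    return step

# e.g. probe(table, key, make_random_step(len(table)))
```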

To summarize: linear probing is the simplest but clusters the most; quadratic probing spreads probes more widely but can still suffer secondary clustering; double hashing gives the most random probe paths at the cost of a second hash function; and random probing disperses elements well but its performance is harder to predict.

3. Re-hashing

The core idea of re-hashing is to expand the existing hash table and recompute the positions of the existing elements.

Let's recall the concept of the load factor: it measures how full a hash table is and is calculated by dividing the number of elements stored in the table by the size of the table.

Too large a load factor leads to a series of problems: first the probability of collision rises, which raises the cost of handling collisions, which in turn degrades performance. It can also cause memory fragmentation, poor utilization, and other problems.

The load factor is what triggers re-hashing: when it exceeds a certain threshold (typically 0.75), re-hashing is performed.

The difficulty of re-hashing lies in how to handle old and new data: should all the old data be migrated to the new hash table at once when re-hashing is triggered, or should it be migrated in batches until it has all been moved?

A one-time migration may be simpler to implement, but it raises many questions. What if other operations arrive during the migration? What if the data volume is so large that the migration takes too long?

Batch migration is a little more complex to implement, since lookups, inserts, deletes, and other operations must cope with old and new data coexisting.

The overall idea is as follows (see the sketch after these steps):

(1) When the load factor reaches the set threshold, new memory space is allocated, but no old data is migrated yet;

(2) When new data is to be inserted, insert it into the new memory space, and at the same time take one piece of old data and move it into the new memory space;

(3) Repeat step (2) until all the old data is migrated to the new memory space;

(4) While old and new data coexist, a lookup can first search the old space and, if the key is not found there, search the new space.
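Here is a minimal sketch of these four steps. The class name is hypothetical, the 0.75 threshold follows the text, Python dicts stand in for the old and new memory spaces, and duplicate-key bookkeeping is omitted for brevity.

```python
# Incremental re-hashing sketch: old and new spaces coexist, and every
# insert migrates one old entry, as in steps (1)-(4) above.
class IncrementalRehashTable:
    THRESHOLD = 0.75

    def __init__(self, size=8):
        self.size = size
        self.old = {}          # dicts stand in for the underlying arrays
        self.new = None        # allocated lazily when re-hashing starts
        self.count = 0

    def put(self, key, value):
        # Step (1): allocate the new space once the load factor crosses the threshold.
        if self.new is None and self.count / self.size > self.THRESHOLD:
            self.size *= 2
            self.new = {}
        if self.new is not None:
            self.new[key] = value       # step (2): new data goes to the new space...
            if self.old:
                k, v = self.old.popitem()
                self.new[k] = v         # ...and one old entry migrates with it
            if not self.old:            # step (3): migration finished
                self.old, self.new = self.new, None
        else:
            self.old[key] = value
        self.count += 1

    def get(self, key):
        if key in self.old:             # step (4): search the old space first...
            return self.old[key]
        if self.new is not None and key in self.new:
            return self.new[key]        # ...then the new space
        raise KeyError(key)
```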

Note: the test code and sample source code have been uploaded to the code repository for those who are interested: /hugogoos/Planner