Data Structures - Hashtables, A First Look

Today we continue our study of the new data structure, the hash table.

01Definitions

Let's start with some common conceptual terminology explanations.

hashing (computing): A hash table implementation called hashing is a technique that implements techniques for performing lookups, insertions, and deletions with constant-level time complexity;

hash: The result computed by the hash function on the input value (key) to indicate the uniqueness of the key, which is usually a value of fixed length, i.e., no matter how big the key is, the length of the hash value is fixed;

hash address: The storage location calculated from the hash value, i.e., the location where the value is stored, usually using the hash value and the size of the hash table modulo the hash address.

collisions: also known ascollision (of interests), means two different keys, after the hash function calculation to get the same hash value;

load factor: also known asload factor, which is used to measure the extent to which a hash table is populated, is calculated by dividing the number of elements already stored in the hash table by the size of the hash table;

hash function: also known ashash function, which refers to a function that maps a key to a hash value;

A hash table, also known as a hash table, is a data structure that stores data in the form of "key-value". It can be understood that any key can be uniquely corresponded to an address in memory, and this address is stored in the value, through the key to quickly find the value. you can also understand the hash table as a kind of high-level array, in which the key is equivalent to the array subscripts, the array subscripts can not only be an integer, but also can be a floating-point number, string, structure, etc., which is equivalent to the value of the array element value. is equivalent to the value of the array element.

Through the previous data structure learning, we know that the array is easy to find, insertion and deletion of difficult; chained table is difficult to find, insertion and deletion of easy; and the hash table is a collection of the two into one, do a large find, insertion, deletion are easy to a data structure.

Essentially a hash table is an array. Although the key is compared to an array subscript, the key is not really an array subscript. So you need a conversion tool to convert the key to an array subscript, and that tool is the hash function.

The following figure shows how to generalize the key to get to the value in the hash table. Input key4 is calculated by hash function to get array index 3 and finally value4 is taken out by array subscript.

02Hash Functions

One of the core difficulties of hashing technology is how to design an accurate, fast, uniform and collision-resistant hash function. And a good hash function is crucial for hash table, which affects the storage and retrieval efficiency of hash table.

1、Four major characteristics

（1）deterministic: means that the same input key always outputs the same result;

（2）quick calculation: means that the hash is computed as fast as possible;

（3）uniform distribution: means that the input key is mapped to the hash table as evenly as possible to minimize collisions;

（4）crashworthiness: means minimizing the probability that different input keys generate the same hash value;

2, common hash function algorithm

(1) Algorithm for taking modes

This algorithm is one of the simplest and commonly used methods for constructing hash functions.

principle: Computes the hash value by taking the key to be the size of the hash table (usually prime);

formulas：hash(key) = key % m；

vantage: Simple and easy to implement and fast to compute;

drawbacks: If m is not prime, it will increase the collision probability, so m is better to choose a prime number to reduce the collision probability.

typical example: Suppose we have a hash table of size 10 with the following keys: 11, 23, 4, 19.

hash(11) = 11 % 10 = 1

hash(23) = 23 % 10 = 3

hash(4) = 4 % 10 = 4

hash(17) = 17 % 10 = 7

Then the final hash values generated are 1, 3, 4, and 7, and these values can then be used as the index of the hash table.

(2) Multiplication algorithm

The algorithm is used to reduce the collision probability by multiplication in order to make the hash value uniformly distributed;

principle: By multiplying the key by a fixed constant A (usually the inverse of the golden ratio), taking the fractional part of the product, and multiplying by the size of the hash table (usually a prime number), the integer part of the result is the hash value;

formulas：hash(key) = ⌊m * (key * A % 1)⌋；

vantage: Simple and easy to implement, with a more even distribution of hash values;

drawbacks: The choice of constant A has a great influence on the calculation of the resultant hash value, and a bad choice can easily increase the collision probability;

typical example: Suppose we have a hash table of size 10, with constant A of 0.618 and the following keys: 11, 23, 4, 19.

hash(11) = ⌊10 * (11 * 0.618 % 1)⌋ = 7

hash(23) = ⌊10 * (23 * 0.618 % 1)⌋ = 2

hash(4) = ⌊10 * (4 * 0.618 % 1)⌋ = 4

hash(17) = ⌊10 * (17 * 0.618 % 1)⌋ = 5

Then the final hash values generated are 7, 2, 4, and 5, and these values can then be used as the index of the hash table.

(3) DJB2 algorithm

DJB2 is a widely used string hashing algorithm that is simple and efficient, proposed by Daniel and J. Bernstein. Its core formula: hash = hash * 33 + c, where 33 is a common base.

principle: An initial hash value (usually 5381) is used, and then the hash value is finally obtained by progressively weighting the ASCII value of each character of the key;

formulas：hash(key) = {

hash = 5381

for character c in string:

hash = ((hash << 5) + hash) + c

return hash & 0xFFFFFFFF

}

vantage: Easy to understand, fast, bottom crash;

drawbacks: The choice of base (33) has an impact on the hashing performance, and although it generally performs well, it is not optimal; the output of the hash value usually requires a masking operation (e.g., & 0xFFFFFFFF), which can lead to partial loss of information, especially when dealing with larger data.

Instead of using hash * 33 + c, the above formula uses ((hash << 5) + hash) + c because bitwise operations are more efficient.

(4) Other algorithms

In addition to the above three algorithms there are many other algorithms, such as DJB2 algorithm similar to the BKDR algorithm, PJW algorithm, for example, there are also such as: direct fixing algorithm, numerical analysis algorithm, squaring algorithms, collapsing algorithms, random number algorithms, etc., here do not go into detail, interested in their own research study.

classifier for sums of money: The test method code as well as the sample source code have been uploaded to the code repository for those who are interested./hugogoos/Planner