3. Hash tables

hash table

1. Introduction

A hash table, also known as a hash table, is a data structure that stores data in the form of "key-value".
Storing data in the form of "key-value" means that any key corresponds uniquely to a location in memory. You can quickly find the corresponding value by simply typing in the key you are looking for.
A hash table can be thought of as a high-level array whose subscripts can be very large integers, floats, strings or even structures.

If the elementstorage locationrather than...keycodeA deterministic correspondence function, Hash(), is established between the keycodes such that each keycode corresponds to a unique storage location:

Address ＝ Hash(key)

This function calculates the storage position during insertion and stores it in that position;

The same calculation is performed on the keycode of the element during the search, and the value of the derived function is treated as the element storage location, and then the search is performed according to this location. This is thehashing method。

The conversion function used in the hashing method is called the hash function (also known as thehash function). The table constructed by this method is called a hash table (also called ahash table)。

hash functionis a compressed image function with a much larger set of keycodes than the set of hash table addresses.

It is therefore possible to map different keycodes to the same hash address after the calculation of the hash function, which produces thecollision (of interests)。

Example: There is a set of table entries with keycodes 12361, 07251, 03309, 30976 , and the hash function used is hash(x) = x % 73 + 13420.

Then we have hash(12361) = hash(07251) = hash(03309) = hash(30976) = 13444.

That is, for different keycodes, the same hash address is obtained by the calculation of the hash function.

In summary, the following two issues need to be discussed with respect to the hashing method:

For a given set of keycodes, choose asimple calculationboth (... and...)Addresses are more evenly distributedof hash functions to avoid or minimize conflicts;
Designing programs for conflict resolution.

2.* Hash functions

A few requirements when constructing hash functions (hash functions):

The hash function should be simple and able to compute the result in a short period of time.
The definition domain of the hash function must include all the keycodes to be stored, and its value domain must be between 0 and m-1 if the hash table allows m addresses.
The address computed by the hash function should be uniformly distributed throughout the address space: if the key is a randomly selected key code from the set of key codes, the hash function should be able to take each value from 0 to m-1 with equal probability.

2.1 Direct addressing method

This type of function takes the value of some linear function of the keycode as the hash address:

Hash(key) = a * key + b (a, b are constants)

This type of hash function is a one-to-one mapping and generally does not create conflicts. However, it requiresThe size of the hash address space is the same as the size of the keycode collection。

Example: There is a set of keycodes as follows: { 942148, 941269, 940527, 941630, 941805, 941558, 942047, 940001 }.

The hash function is: Hash(key) = key-940000

Hash (942148) = 2148 Hash (941269) = 1269 Hash (940527) = 527 Hash (941630) = 1630 Hash (941805) = 1805

Hash (941558) = 1558 Hash (942047) = 2047 Hash (940001) = 1

2.2 Numerical analysis

There are n d-bits, each of which may have r different symbols. These r different symbols in the frequency of occurrence of each may not be the same. According to the size of the hash table, select a variety of symbols evenly distributed in a number of bits as the hash address.

Formula for calculating the uniformity of the distribution of symbols λ(k) in each digit:

Alt Text

where α(i,k) denotes the number of times the i-th symbol appears in the k-th position, and n/r denotes the expected value of the symbol appearing in n numbers.

The smaller the calculated value of λ(k), the more evenly the symbols are distributed in that bit (kth bit).

Example:

9 4 2 1 4 8 ① position, λ(1) = 57.60; ② position, λ(2) = 57.60 ;

9 4 1 2 6 9 ③ position, λ(3) = 17.60; ④ position, λ(4) = 5.60 ;

9 4 0 5 2 7 ⑤ position, λ(5) = 5.60; ⑥ position, λ(6) = 5.60 ;

9 4 1 6 3 0

9 4 1 8 0 5

9 4 1 5 5 8

9 4 2 0 4 7

9 4 0 0 0 1

① ② ③ ④ ⑤ ⑥

If the address range of the hash table has 3 digits, take ④, ⑤ and ⑥ digits of each key code as the hash address of the record.

The numerical analysis method is only applicable to ex anteClearly know the distribution of the values of each bit of all the key codes in the table, which is completely dependent on the keycode set. If the keycode set is changed, which bits are chosen have to be decided again.

2.3 Dividing by the remainder method (most commonly used)

Let the number of allowable addresses in the hash table be m. Take an address that is not larger than m, but the mostNear or equal to The prime p of m is used as the divisor to convert the keycode to a hash address using the following function:

hash (key) = key % p (p <= m)

where "%" is the operation of taking the remainder of an integer division, which requires that the prime number p is not a number close to a power of two.

Example: We have a key = 962148 and a hash table of size m = 25, HT[25]. Take the prime number p = 23. hash(key) = key % p.

The hash address is then hash(962148) = 962148 % 23 = 12.

Can be calculated according to the address to store records. Note that the range of addresses calculated using the hash function is from 0 to 22, while the addresses 23 and 24 can not actually be calculated using the hash function, and can only be stored in theConflict managementwhen reaching these addresses.

2.4 Squared center method

It first calculates the square of the inner code of the identifier that constitutes the key code, and then takes the middle bits as the hash address according to the size of the hash table.

An identifier can be represented by an internal code of a computer word length. Because the middle bits of the square number of the internal code is generally determined by all the characters of the identifier, so the hash address calculated for different identifiers are mostly different.
In the squaring method, the hash address is usually taken as a power of 8. For example, if the total number of hash addresses is taken as m = 8^r, then the square of the internal code is taken to the middle r bits.

Example:

identifiers	internal code	internal code square (i.e. the square of a Chinese character)	hash address
A	01	01	001
A1	0134	20420	042
A9	0144	23420	342
B	02	04	004
DMAX	04150130	21526443617100	443
DMAX1	0415013034	5264473522151420	352
AMAX	01150130	135423617100	236
AMAX1	0115013034	3454246522151420	652

The octal endocode representation of an identifier and its squared value and hash address

2.5 Folding method

This method divides the keycode from left to right intobitwise equalityand the several parts of theNumber of bits per partresemblehash table address bitsThe same, only the last part of the digit can be shorter.

By superimposing these parts of the data, the hash address of the record with the keycode can be obtained.

There are two stacking methods:

Shift method: add the last digits of the parts in alignment;
Boundary method: each part is not broken, folded back and forth along the boundaries of each part, and then aligned and added.

Example: Given a key code of key = 23938587841, if the storage space is limited to 3 bits, the result will be 3 bits per segment.

The above keycode can be divided into 4 segments:239 385 878 41

Stacked, the sum is then deleted from the highest bit that exceeds the number of address bits, leaving only the lowest 3 bits as the usable hash address.

This method is generally used to obtain a hashed address when the key code has a large number of bits and the distribution of digits on each bit of the key code is roughly uniform.

Assuming that the address space is HT[400], using the above function to calculate, take 3 of the bits, the value of the range of 0 to 999, may be beyond the range of the address space, for this reason it is necessary to 0 to 999 compressed to 0 to 399. can be calculated by multiplying the address by a compression factor of 0.4, to compressed to the permissible range of the calculated address.

The advantage of the folding method is that it can cope with inconsistent keyword lengths, while increasing the randomness of the hash.
However, there may also be some problems with the approach, such as the partial folding approach may lead to hash conflicts, or additional boundary cases may need to be handled when splitting keywords.

3. Conflict management

3.1 Closed Hash Methods

Since either hash function cannot avoid generating conflicts, it is important to choose a good conflict resolution method.

The closed hash method stores all records directly in a hash table and continues probing based on some way if a conflict occurs.

Linear Probing

If a conflict occurs at d, check d + 1, d + 2 in turn ......

EXAMPLE: Suppose a set of table entries is given with keycodes Burke, Ekers, Broad, Blum, Attlee, Alton, Hecht, Ederly.

The hash function used is: take the position of its first letter in the alphabet.Hash (x) = ord (x) - ord (‘A’) ord ( ) is a function that finds the code of a character.

This yields Hash (Burke) = 1 Hash (Ekers) = 4 Hash (Broad) = 1 Hash (Blum) = 1

Hash (Attlee) = 0 Hash (Hecht) = 7 Hash (Alton) = 0 Hash (Ederly) = 4

Let the hash table HT[26], m = 26. Using linear probing method to deal with conflicts, the hash result is shown in Fig.

0	1	2	3	4
Attlee	Burke	Broad	Blum	Ekers
5	6	7	8	9
Alton	Ederly	Hecht

When you need to search or join a table entry, use the hash function to calculate the position of the keyword in the table:H0 = hash ( key )
Once a conflict occurs, look for the next position Hi in the table in descending order:Hi = ( H(i-1) +1 ) % m， i =1, 2, …, m-1
This means that the following linear probe sequence is used to find the next position in the table:H0 + 1, H0 + 2, …, m-1, 0, 1, 2, …, H(0-1)
It can also be written as the following generalized formula:Hi = ( H0 + i ) % m， i =1, 2, …, m-1

Linear probing method realization:

const int N = 360007; // N is the maximum number of elements that can be stored

class Hash
N is the maximum number of elements that can be stored.
    int keys[N]; int values[N]; // N is the maximum number of elements that can be stored.
    int values[N]; // N is the maximum number of elements that can be stored.
public.
    Hash() { memset(values, 0, sizeof(values)); /* used to set the contents of an area of memory to a specified value */}

    int& operator[](int n)
    {
        // Returns a reference to the corresponding Hash[Key].
        // value equal to 0 is considered null
        int idx = (n % N + N) % N, cnt = 1; // here (n % N + N) % N is used.
        // (n % N + N) % N is used here instead of n % N because the % operation in C++ can't convert negative numbers to positive numbers.
        while (keys[idx] ! = n && values[idx] ! values[idx] ! = 0)
        {
            idx = (idx + cnt) % N; cnt += 1; }
        }
        keys[idx] = n; }
        return values[idx]; }
    }
}; }

3.2 Open Hash Method (Chain Address Method)

If we set the position of the address space of the hash table from 0 to m-1, all the keycodes in the keycode set are divided into m subsets, and the keycodes with the same address are grouped into the same subset.
Each subset is called a bucket. Usually the table entries in each bucket are linked by a single linked table.
The query involves sweeping through the entire linked table at the corresponding position and comparing each piece of data in it to see if its key value matches the query's key value.

Chain address method implementation:

const int M = 10;

class HashTable 
{
    struct Node
    {
        Node* next;
        int value;
        int key;
        Node(Node* n = nullptr,int v = 0,int k = 0) 
            :next(n), value(v),key(k){}
    };

    Node* head[M];
   
    int f(int key) { return (key % M + M) % M; }

public:
    HashTable() { for (auto& h : head) { h = nullptr; } }

    int get(int key)
    {
        for (Node* p = head[f(key)]; p; p = p->next)
        {
            if (p->key = key)return p->value;
        }
        return -1;
    }

    int add(int key, int value)
    {
        if (get(key) != -1) return -1;
        Node* buf = new Node(nullptr, value,key);
        Node* p = nullptr;
        for (p = head[f(key)]; p && p->next; p = p->next){}
        if (p) { p->next = buf; }
        else { head[f(key)] = buf; }
        return value;
    }
};

practice

state in advancen numbers, requiring that duplicates of them be removed and only the first occurrence be retained.

Input Format:

There are multiple sets of data in this question.

First line an integerT, indicates the number of data sets.

For each set of data:

First line an integern。

second linen A number that represents the given number.

Output format:

For each set of data, output a line for the number left after de-emphasis, with a single space separating the two numbers.

Sample:

importation

2
11
1 2 18 3 3 19 2 3 6 5 4
6
1 2 3 4 5 6

exports

1 2 18 3 19 6 5 4
1 2 3 4 5 6