[Data structure] Hash table
A hash table (also called a hash map) is a linear data structure that, in the typical case, supports adding, deleting, updating, and looking up data in O(1) time. In Java, the underlying implementation of HashMap is a hash table.
1. What is a Hash table
A hash table is a linear data structure whose underlying storage is usually an array. To add, delete, update, or look up data, the table first passes the key through a hash function, which maps the key to an index in the array. Once the index is known, the data in that array slot can be manipulated directly. In theory, the time complexity of hash table operations is O(1).
Because the underlying storage is an array, its length must be specified at initialization. Once the hash table fills up, inserting more data requires allocating a larger array and copying every element from the old array into the new one. This resize-and-copy is an expensive operation, so it is best to estimate the data volume in advance and size the table up front, as the sketch below illustrates.
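A minimal sketch (the class name and sizes are invented for the example), using the initial-capacity constructor that Java's HashMap provides for exactly this purpose:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        // HashMap grows its backing array once size exceeds
        // capacity * loadFactor (0.75 by default), so for an expected
        // 10,000 entries we request expected / 0.75 slots up front.
        int expected = 10_000;
        Map<String, Integer> map = new HashMap<>((int) (expected / 0.75f) + 1);

        for (int i = 0; i < expected; i++) {
            map.put("key-" + i, i); // no resize is triggered by these inserts
        }
    }
}
```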
2. Hash function
A hash function (also known as a hash algorithm) takes an input of arbitrary length (the pre-image) and transforms it into a fixed-length output, the hash value. This transformation is a compressive mapping: the space of hash values is typically much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Assuming the output range is S, a hash function has the following properties (two of which are demonstrated in the sketch after this list):
- A typical hash function accepts an unbounded range of input values;
- For the same input, the output is always the same;
- For different inputs, the outputs may or may not be the same;
- The outputs obtained for different inputs are evenly distributed over the output range.
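Two of these properties can be demonstrated with Java's built-in String.hashCode(), a real (though non-cryptographic) hash function:

```java
public class HashPropertiesDemo {
    public static void main(String[] args) {
        // Same input, same output: hashCode() is deterministic.
        System.out.println("hello".hashCode() == "hello".hashCode()); // true

        // Different inputs may produce the same output: "Aa" and "BB"
        // are a classic collision, both hashing to 2112.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
    }
}
```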
In addition, a cryptographic hash function has the following two properties:
- Collision resistance: there should be no pair of inputs x ≠ y with H(x) = H(y). Strictly speaking, this cannot hold in theory: SHA-256, currently used by Bitcoin, has 2^256 possible outputs, so after 2^256 + 1 distinct inputs a collision is inevitable, and by the birthday bound about 2^130 inputs already give a 99% probability of a collision. Even so, if every computer humanity has ever built had been computing since the birth of the universe until today, the probability of having produced a collision would still be vanishingly small.
- Hiding: given an output H(x), it is computationally infeasible to recover the input x. There is no better way to find a possible pre-image of H(x) than exhaustive search.
Commonly used hash functions include MD5, SHA-1, and SHA-2.
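These digests are available in Java through the standard java.security.MessageDigest API; the sketch below (input string chosen arbitrarily) shows the fixed-length property of SHA-256:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Demo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // However long the input, SHA-256 always emits 32 bytes (256 bits).
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest("an input of any length".getBytes());

        System.out.println(digest.length); // 32
        System.out.printf("%064x%n", new BigInteger(1, digest)); // hex form
    }
}
```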
3. Hash collisions
For different inputs, a hash function may produce the same output; this situation is called a hash collision.
Hash collisions are inevitable. The two common strategies for resolving them are **open addressing** and the **zipper method** (separate chaining).
3.1 Zipper method
The core idea of the zipper method is: if a hash collision occurs at some position in the table (that is, an element is mapped to an array slot that another element already occupies), the colliding elements are stored at that slot in the form of a linked list.
Linked-list lookup is relatively slow, so if too many collisions pile up at one position, that slot becomes a very long list and queries degrade. Java 8's HashMap adds an optimization: when a bucket's list reaches length 8, it is automatically converted into a red-black tree (a self-balancing binary search tree), which is faster to search. A minimal chaining sketch follows.
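This sketch assumes a fixed capacity of 16 and omits resizing and the red-black-tree conversion; the class and its layout are illustrative, not HashMap's actual internals:

```java
// Each array slot ("bucket") holds the head of a singly linked list of
// all entries whose keys hash to that slot.
public class ChainedHashTable<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = new Node[16];

    // Map the key's hash code onto a bucket index.
    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // update in place
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // prepend on collision
    }

    public V get(K key) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null; // key absent
    }
}
```

put() scans the chain first so an existing key is updated rather than duplicated; as long as chains stay short, both operations remain close to O(1).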
3.2 Open addressing
With open addressing, if the data cannot be stored directly at the array index computed by the hash function, another position must be found for it. There are three ways to find that position: linear probing, quadratic probing, and double hashing.
3.2.1 Linear probing
Insertion under linear probing is straightforward: first hash the element to its home position; if no other element is there, insert the data directly. If the slot is occupied, check the next slot, inserting if it is free, and otherwise continuing slot by slot until a vacant position is found.
Search under linear probing: locate the home index from the key, then compare the data stored at that position with the data being sought. If they match, the search succeeds; if not, move on and compare the next element, and so on. If the whole probe sequence is traversed without a match, the key is not present.
Deletion under linear probing: map the key to its home index, then probe forward comparing elements until the one to delete is found. Delete the element at that position, then set a flag indicating that this slot once held data (think about why this step is necessary). A sketch with tombstones follows.
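This sketch makes simplifying assumptions: keys are non-negative ints, the capacity is fixed at 16, the table is never allowed to fill up, and the sentinels EMPTY and DELETED are invented for the example:

```java
import java.util.Arrays;

public class LinearProbingTable {
    private static final int EMPTY = -1, DELETED = -2; // sentinels; keys must be >= 0
    private final int[] keys = new int[16];
    private final int[] values = new int[16];

    public LinearProbingTable() {
        Arrays.fill(keys, EMPTY);
    }

    public void put(int key, int value) {
        int i = key % keys.length;
        // Probe forward past occupied and DELETED slots until we reach
        // an EMPTY slot or find the key itself.
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
    }

    public Integer get(int key) {
        int i = key % keys.length;
        // Stop at EMPTY (key absent) but keep probing past DELETED,
        // since the sought key may live beyond a removed slot.
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }

    public void remove(int key) {
        int i = key % keys.length;
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                keys[i] = DELETED; // tombstone, not EMPTY
                return;
            }
            i = (i + 1) % keys.length;
        }
    }
}
```

Note that get() stops at EMPTY but probes past DELETED: that behavior is exactly what the flag set during deletion is for.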
3.2.2 Quadratic probing
In a linear-probing hash table, data tends to cluster. Once a cluster forms it grows larger and larger: any item whose hash lands inside the cluster must be probed forward step by step and inserted at the cluster's end, so the bigger the cluster, the faster it grows. It is like a crowd in a supermarket: once many people gather in one spot, more and more keep joining just to see what is going on.
Quadratic probing is an attempt to prevent clustering. The idea is to probe slots far apart rather than adjacent to the original position. Under linear probing, if the home index computed by the hash function is x, the probes go to x+1, x+2, x+3, and so on. Under quadratic probing, the sequence is x+1, x+4, x+9, x+16, x+25, ...: the offset from the home index is the square of the step number, as in the snippet below.
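The probe-index calculation, with invented names:

```java
public class QuadraticProbe {
    // i-th probe index: offsets grow as 0, 1, 4, 9, 16, ...
    static int probe(int home, int attempt, int capacity) {
        return (home + attempt * attempt) % capacity;
    }

    public static void main(String[] args) {
        // For home index 3 in a table of capacity 16: prints 3 4 7 12 3
        for (int i = 0; i < 5; i++) {
            System.out.print(probe(3, i, 16) + " ");
        }
    }
}
```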
3.2.3 Double hashing
Double hashing aims to eliminate both primary and secondary clustering. Under linear and quadratic probing alike, the probe step sequence is fixed for every key. Double hashing adds a second hash function that derives the probe step size from the key itself, so that even when two keys are mapped to the same array index by the first hash function, their probe steps differ, and the clustering problem is avoided.
The second hash function must have the following characteristics:

- it must be different from the first hash function;
- it must never output 0, because a step size of 0 would probe the same position forever, producing an infinite loop. Experiments show that a function of the form stepSize = constant - (key % constant) works very well, where constant is a prime smaller than the array capacity.
The core idea of double hashing is that the second hash function generates a key-dependent probe step, as in the sketch below.
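This sketch assumes non-negative integer keys and uses constant = 5 for a table of capacity 16:

```java
public class DoubleHashStep {
    // Second hash: differs from the first and can never return 0.
    static int stepSize(int key) {
        final int constant = 5; // a prime smaller than the capacity
        return constant - (key % constant); // always in 1..5
    }

    public static void main(String[] args) {
        // Keys 21 and 37 collide at home index 5 (mod 16),
        // but they probe with different step sizes:
        System.out.println(stepSize(21)); // 5 - (21 % 5) = 4
        System.out.println(stepSize(37)); // 5 - (37 % 5) = 3
        // Probing visits home, home + step, home + 2*step, ... (mod capacity).
    }
}
```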
4. Applications of hash tables
Given 2 billion integers and a machine with only 2 GB of memory, how do you find the integer that appears most often?
First, consider the range of the counts: all 2 billion numbers might be the same value, in which case the count reaches 2 billion, so at minimum an int is needed to store each count (an int occupies 4 bytes in Java).
We also need to consider the value range of the 2 billion integers themselves. If they fall within 1 to 2 billion, the keys can be stored as ints too; for a larger range, a long is required. Consider the worst case: all 2 billion values are distinct, and they exceed the int range. Then each key needs a long (8 bytes) and each count an int (4 bytes), so 2 billion records take roughly 26 GB of memory. That is far more than we have, so counting all 2 billion numbers in one pass is clearly too risky.
Solution: use a hash function to split the large file of 2 billion numbers into 16 small files. Copies of the same number always hash to the same file, so duplicates can never be split across files; and if the hash function distributes well, no file holds more than about 125 million distinct numbers (2 billion / 16). We can then count each of the 16 files in turn with an in-memory hash table and finally combine the results: take the most frequent number from each small file and compare those 16 candidates. A sketch follows.
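This sketch of the two passes assumes the numbers are stored one per line in plain-text files; the file names and class name are invented:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class TopNumberByPartition {
    static final int FILES = 16;

    // Pass 1: route each number to one of 16 small files by its hash,
    // so every copy of the same number lands in the same file.
    static void partition(String bigFile) throws IOException {
        PrintWriter[] out = new PrintWriter[FILES];
        for (int i = 0; i < FILES; i++) {
            out[i] = new PrintWriter(new FileWriter("part-" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(bigFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                long x = Long.parseLong(line);
                out[(Long.hashCode(x) & 0x7fffffff) % FILES].println(x);
            }
        }
        for (PrintWriter w : out) {
            w.close();
        }
    }

    // Pass 2: count one small file at a time, so only that file's distinct
    // numbers must fit in memory; return its most frequent number and the
    // count, for the final 16-way comparison.
    static long[] topOfFile(String file) throws IOException {
        Map<Long, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(Long.parseLong(line), 1, Integer::sum);
            }
        }
        long best = 0;
        int bestCount = 0;
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return new long[] { best, bestCount };
    }
}
```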
Follow-up question: how would you determine whether all 2 billion numbers are the same?