Summary
In practice, once a single table in a relational database (MySQL, PostgreSQL) grows to hundreds of millions of rows, queries and analysis slow down, and statistical analysis may become infeasible altogether. At that point the large table needs to be split into multiple smaller tables, with the small tables distributed over multiple databases to form a database cluster. A single statistical SQL statement can then be executed concurrently on multiple servers and the partial results merged, making analysis of large data volumes in a relational database achievable.
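The following is a minimal sketch of that scatter-gather pattern, with in-memory lists standing in for the shard databases (the shard contents and the `count_rows` helper are illustrative stand-ins, not real driver calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for four shard databases, each holding part of one large table.
shards = [
    [("order-1", 120), ("order-2", 80)],
    [("order-3", 30)],
    [("order-4", 55), ("order-5", 70)],
    [("order-6", 10)],
]

def count_rows(shard):
    # Per-shard equivalent of "SELECT COUNT(*) FROM orders".
    return len(shard)

# Fan the query out to all shards concurrently, then merge the partial results.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_rows, shards))

total = sum(partial_counts)
print(partial_counts, total)  # [2, 1, 2, 1] 6
```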
The Three Normal Forms
A normal form is a table design with minimal redundancy. The three normal forms are described below.
First Normal Form: a table satisfies the first normal form if every column is the smallest unit of data and cannot be subdivided; the goal of the first normal form is to ensure the atomicity of each column. For example, an Address column storing the value "Beijing, China" violates the first normal form's principle that columns cannot be subdivided; to satisfy it, the Address column is split into Country and City columns, which store "China" and "Beijing" respectively.
Second Normal Form: building on the first normal form, the second normal form states that non-primary-key columns in a table must have no partial dependency on the primary key, i.e., each table should describe only one thing. For example, an Orders table with the columns "Order No.", "Product No.", "Order Date", and "Product Price" contains both order information and product information and needs to be split into an orders table and a products table.
Third Normal Form: the first and second normal forms are satisfied, and non-primary-key columns have no transitive dependency on other non-primary-key columns. For example, in an Orders table with the columns "Order No.", "Customer No.", "Order Date", and "Customer Name", the column "Customer Name" depends on "Customer No." rather than on the primary key "Order No.", so "Customer Name" needs to be removed and kept in a separate customers table; a sketch of the resulting schema follows below.
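As an illustration, here is what the split schema might look like, expressed as hypothetical SQL run through SQLite (the table and column names follow the examples above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema after applying 2NF/3NF to the Orders example:
# product and customer data live in their own tables, and the orders
# table keeps only foreign keys to them.
conn.executescript("""
CREATE TABLE products (
    product_no    INTEGER PRIMARY KEY,
    product_price REAL
);
CREATE TABLE customers (
    customer_no   INTEGER PRIMARY KEY,
    customer_name TEXT
);
CREATE TABLE orders (
    order_no    INTEGER PRIMARY KEY,
    order_date  TEXT,
    product_no  INTEGER REFERENCES products(product_no),
    customer_no INTEGER REFERENCES customers(customer_no)
);
""")
```

The orders table now carries only foreign keys; each product price and customer name is stored in exactly one place.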
Splitting Tables by Range
Splitting tables by range means partitioning data according to the range of values in a certain field, for example routing user IDs 0-100,000, 100,000-200,000, and 200,000-300,000 to different databases. The advantage of this method is that expansion is simple: databases and tables can be created in advance according to the plan. The disadvantage is that most reads and writes hit the most recent range, putting excessive pressure on the newest database.
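A minimal sketch of range-based routing, assuming the 100,000-wide buckets from the example (the `db_N` naming is hypothetical):

```python
# Route a user ID to a database by range; the boundaries follow the
# example's 0-100,000 / 100,000-200,000 / 200,000-300,000 buckets.
RANGE_SIZE = 100_000

def route_by_range(user_id: int) -> str:
    return f"db_{user_id // RANGE_SIZE}"

print(route_by_range(42))        # db_0
print(route_by_range(150_000))   # db_1
print(route_by_range(250_000))   # db_2
```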
Hash Modulo
Hash modulo means computing a hash value over some field and splitting the data according to that value. Concretely, number the N servers from 0 to N-1, compute the hash value of each request's key with a chosen hash function, and take it modulo N; the remainder is the number of the server holding the data. The advantage of this method is that data is distributed evenly and the overall pressure on the databases is low; the disadvantage is that scaling up or down is troublesome, because the entire data set must be re-hashed and migrated.
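A sketch of the scheme, using MD5 as a stand-in for whatever hash function is chosen (the key format and N are illustrative):

```python
import hashlib

N = 3  # number of servers, numbered 0 .. N-1

def route_by_hash(key: str) -> int:
    # Stable hash of the sharding key, then modulo N; the remainder
    # is the number of the server that stores the record.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % N

print(route_by_hash("user:1001"))  # some server in 0..2
# The pain point: changing N changes almost every key's remainder,
# so scaling from N to N+1 servers forces a near-total data migration.
```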
Consistent Hashing
The consistent hashing algorithm replaces traditional hash modulo and avoids the problem that a change in the number of servers in the cluster invalidates the existing hash mapping and forces all of the cluster's data to be redistributed.
The consistent hashing algorithm treats the entire hash space as a virtual ring covering 0 to 2^32 - 1, maps server nodes and data onto the ring separately, and then maps each object to a server node, achieving a hash-based distribution of data across servers. The process is described below:
Constructing the hash ring: organize the entire hash space into a virtual circle covering 0 to 2^32 - 1, i.e., a hash ring, as shown in the following figure.
Mapping server nodes onto the hash ring: use a hash function to map the servers onto the virtual hash ring; the server node's IP address or hostname is generally used as the input to the hash function. Suppose there are 3 server nodes cs1, cs2, and cs3; the hash function computes the hash of each server's IP address and places the node on the ring.
Mapping data onto the hash ring: use the same hash function to compute the hash value of each piece of data to be stored and map it onto the ring. Suppose there are 4 objects o1, o2, o3, and o4; their hash values are computed by the hash function and placed on the ring.
Mapping an object to a server node: locate the object's hash value on the ring, then walk clockwise from that position; the first server encountered is the object's storage node, and the object is mapped to that server. As shown in the figure, object o1 is mapped to cs1, o2 to cs2, o3 to cs1, and o4 to cs3. A runnable sketch of the whole procedure follows below.
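Here is a minimal Python sketch of such a ring, assuming MD5 as the hash function and using bisect for the clockwise lookup (all names are illustrative):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a key into the ring's 0 .. 2**32 - 1 hash space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

class HashRing:
    def __init__(self, nodes=()):
        self._hashes = []   # sorted positions of nodes on the ring
        self._nodes = {}    # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # A real deployment would hash the server's IP address or hostname.
        h = ring_hash(node)
        bisect.insort(self._hashes, h)
        self._nodes[h] = node

    def remove_node(self, node: str) -> None:
        h = ring_hash(node)
        self._hashes.remove(h)
        del self._nodes[h]

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position: the first node at or
        # past it owns the key, wrapping around the ring if necessary.
        h = ring_hash(key)
        i = bisect.bisect_right(self._hashes, h) % len(self._hashes)
        return self._nodes[self._hashes[i]]
```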
With traditional hash modulo, any change in the servers (adding or removing a node) invalidates the hash mapping of the whole system, because the number of servers, i.e., the divisor, has changed; all hash values must be recomputed and the data redistributed. With consistent hashing, by contrast, a server change only affects the data adjacent to the changed node, because an object's placement depends only on the next server in the clockwise direction.
Removing a node: suppose a server goes down; the only affected objects are those originally mapped to it. Following the clockwise mapping rule of consistent hashing, those objects simply need to be remapped to the next healthy server. For example, if cs1 goes down, o1 and o3 just need to be remapped to the next server, cs3.
Adding a node: when a node is added, the only affected data are the objects between the new node and the first node in the counterclockwise direction; these objects simply need to be remapped to the new node. For example, if a new node cs4 is added between cs1 and cs2 so that cs4 falls between o1 and o3 on the ring, only o3 needs to be remapped to cs4.
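Continuing the HashRing sketch above, here is a quick check that removing a node only remaps the objects that lived on it (actual placements depend on the hash values, so they will not necessarily match the figure):

```python
ring = HashRing(["cs1", "cs2", "cs3"])
objects = ["o1", "o2", "o3", "o4"]

before = {o: ring.get_node(o) for o in objects}
ring.remove_node("cs1")                   # simulate cs1 going down
after = {o: ring.get_node(o) for o in objects}

# Only the objects that lived on cs1 change owner; the rest stay put.
moved = [o for o in objects if before[o] != after[o]]
print(before, after, moved)
```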
Consistent hashing cannot guarantee perfectly balanced data: when the cluster holds relatively few objects, they may not map uniformly across the nodes. To solve this uneven distribution, the consistent hashing algorithm introduces the concept of "virtual nodes". Virtual nodes are replicas of an actual node in the hash space; each actual node corresponds to several virtual nodes (the number is also called the replica count), and the virtual nodes are arranged on the ring by their hash values. With virtual nodes introduced, the mapping is converted from object-to-node into object-to-virtual-node.
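Extending the HashRing sketch above, virtual nodes can be added by hashing several labeled copies of each real node onto the ring (the "node#i" labeling and the replica count of 100 are arbitrary choices for illustration):

```python
import bisect

class VirtualHashRing(HashRing):
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas          # virtual nodes per real node
        super().__init__(nodes)

    def add_node(self, node: str) -> None:
        # Place `replicas` copies of the node on the ring, each at the
        # hash of a distinct label such as "cs1#0", "cs1#1", ...
        for i in range(self.replicas):
            h = ring_hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._nodes[h] = node         # each virtual node resolves to its real node

    def remove_node(self, node: str) -> None:
        for i in range(self.replicas):
            h = ring_hash(f"{node}#{i}")
            self._hashes.remove(h)
            del self._nodes[h]
```

Lookups are unchanged: get_node still walks clockwise to the nearest virtual node, which resolves to its real node, so the many scattered replicas smooth out the distribution.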