MySQL Indexing Principles and Usage

Index Models in MySQL

Mysql indexes in the use of the data structure is generally a search tree, the search tree here, the general use of B-tree, here to fill in the data structure of the B-tree structure; say B-tree, before the first smooth a pre-existing knowledge, balanced bifurcation tree;

balanced binary tree

No one should be unfamiliar with binary trees, the basic introduction to university data structures thatbinary sorting treeis based on the binary tree with an extra "successive"The notion that, in simple terms, the Left < RightmaybeRight<LeftAnyway, the tree is built in order .

Compared to ordinary binary trees, binary sorted tree has the characteristics of "order", but when the extreme case, that is, unilateral order down, binary sorted tree becomes a single chained table, losing the significance of the tree, so in the binary sorted tree on the basis of further strengthening, that is:A balanced binary tree is a tree that satisfies the characteristics of a binary sorted tree and at the same time has a height difference of less than or equal to 1 between the left and right subtrees.。Balanced binary trees are characterized as follows:

binary sorting tree
The left or right subtree of any node is a "balanced binary tree" (the difference between left and right heights is less than or equal to 1)

B-tree

Balanced binary tree is already very good, because of its order, and limit the saturation of the tree, which makes the balanced binary tree in the retrieval, has a very high performance, complexity only O (logN), at this point affect the performance of the query of the bottleneck evolved to the number of nodes N; according to the characteristics of the tree, the number of times than the number of calculations depends on the height of the tree, if the number of nodes is fixed, we can control the height of the tree through the control of the number of nodes at each level, without being bound to the "binary" feature, into a multifork balanced tree (B-tree). If the number of nodes is fixed, we can control the height of the tree by controlling the number of nodes at each level, and turn it into a multinomial balanced tree (B-tree) without sticking to the characteristic of "bifurcation".

B-tree is an absolutely balanced tree with all leaf nodes at the same height. In each node to store multiple elements, in each node in addition to the pointer node, but also store the corresponding data; compared to the binary balanced lookup tree, in the whole lookup process, although the number of comparisons of data is not significantly reduced, but the number of disk IO times will be greatly reduced; B-tree height is generally 2 to 3 layers to meet the majority of application scenarios, so the use of B-tree to build the index can be very good to improve the efficiency of the query .

B-tree characteristics

Each node can have at most m subtrees (of order m).
Root node with at least 2 nodes.
Ceil(m/2) subtrees that a non-root, non-leaf node has at least (Ceil denotes upward rounding; the 5th-order B-tree in the figure has at least 3 subtrees per node, i.e., at least 3 forks).
The information in the non-leaf node includes [n,A0,K1,A1,K2,A2,...,Kn,An], where n denotes the number of keywords stored in the node, K is the keyword and Ki<Ki+1, and A is a pointer to the root node of the subtree.
Each path from the root to the leaves has the same length, i.e., the leaf nodes are in the same layer and these nodes do not carry information, in fact, these nodes indicate that the specified value cannot be found, i.e., the pointers to these nodes are null.

B+ Tree

Although the B-tree can be more obvious to improve efficiency, but it nodes carry data this feature, with the growth of stored data, will inevitably lead to the occupation of space, if the use of database indexing, means that the tree will be correspondingly high, the amount of data that can be stored in a page will become less, the number of disk IO will become larger.

So another enhancement of B-tree, B+-tree, is introduced, which is characterized as compared to B-tree:

Internal nodes store only keys and values, not data. Data is stored only in leaf nodes and leaf nodes contain all the keys and pointers to the data.
Leaf nodes are linked to each other with bi-directional linked lists, which makes range queries more efficient because leaf nodes can be accessed sequentially through the linked list.

The bottom leaf node of the B+ tree contains all the index entries. As you can see from the figure, when the B+ tree is looking for data, since the data are stored in the bottom leaf nodes, each lookup needs to retrieve the leaf nodes to query the data. So in the case of needing to query the data each time the disk IO is directly related to the height of the tree, but on the other hand, because the data are placed in the leaf nodes, so the number of indexes stored in the index of the disk block locks will be increased with the number of indexes, so compared to the B-tree, B + tree tree height is theoretically shorter than the B-tree. There are also indexes to cover the query, the data in the index to meet the current query statement needs all the data, this time only need to find the index can be returned immediately, do not need to retrieve the bottom of the leaf nodes.

indexing

logical division

By function

primary key index

A table can only have one primary key index, no duplicates, no NULLs, use the keywordPRIMARY KEYDefinition;PRIMARY KEYs are usually based on B-Tree implementations, which are essentially special unique indexes.

ALTER TABLE TableName ADD PRIMARY KEY(column_list);

unique index

Data columns are not allowed to be duplicated, NULL values are allowed; a table can have more than one unique index, and the values of indexed columns must be unique, but null values are allowed. If it is a combination index, the combination of column values must be unique; generally use theUNIQUE INDEXdefine

CREATE UNIQUE INDEX IndexName ON `TableName`(`FieldName`(length)).
# or
ALTER TABLE TableName ADD UNIQUE (column_list);

ordinary index

Multiple general indexes can be created for a table, and a single general index can contain multiple fields, theAllow data duplicationThe following table describes how to insert a NULL value into the NULL file;

CREATE INDEX IndexName ON `TableName`(`FieldName`(length)).
# or
ALTER TABLE TableName ADD INDEX IndexName(`field name`(length));

By use

The breakdown by usage is really whether a single column is indexed or multiple columns are indexed in combination.

Single instance index: an index contains only one column, a table can have more than one single instance index.
Combined indexes: A combined index containsTwo or moreThe columns of the indexes are not used in the query. The query follows the "leftmost prefix" principle of mysql combined indexes, i.e., the condition when using where will not take effect until the index is placed in accordance with the arrangement of the fields at the time the index was created.You can use theunique indexcap (a poem)ordinary indexPerform combinatorial indexing

-- Multi-column combinations (common indexes)
CREATE INDEX <index_name> ON <table_name > (column1, column2, column3).
or
ALTER TABLE <table_name> ADD INDEX <index_name> (column1, column2,column3);

-- Multi-column combinations (unique indexes)
CREATE UNIQUE INDEX <index_name> ON <table_name> (column1, column2,column3).
or
ALTER TABLE <table_name> ADD UNIQUE INDEX <index_name> (column1, column2,column3));

For more information on performance and usage principles in the use of combinatorial indexes, you can skip directly to Principles of Combinatorial Index Usage

Physical division

The reason why it is said to be based on physical division is that indexes are characterized by physical storage principles. And it is mainly for the B+ tree index structure to say

Meaning of cluster:In order to increase the query speed of a particular attribute (or group of attributes), tuples with the same value on this or these attributes (called cluster codes) are stored together in contiguous physical blocks.

clustered index

Clustered index (clustered index) is not a separate type of index, but a data storage method. This storage method is based on the primary key of the table to construct a B + tree, and B + tree leaf nodes are stored in the table row record data, the primary key index is clustered index.

In a clustered index, its data file is itself an index file. The leaf node data field of the tree holds the complete data record. The key of this index is the primary key of the data table, and when a query is performed, according to the search algorithm, after the leaf node is found in the tree, the data is also found. This situation is also often referred to asCoverage Index。

Note: Each table can have at most one clustered index

unclustered index

Non-clustered indexes are also called secondary indexes or secondary indexes (in Mysql primary key other than are secondary indexes), and clustered indexes, on the contrary, the data and indexes are separate, B + Tree leaf node data field is stored in the primary key and the value of the field. In the index search, first according to the B+Tree search algorithm to search the index, if the specified content exists, then take out the primary key in its data field, and then in the clustered index (Primary Key index tree search again).

Suppose we query the users table

SELECT sex,age FROM user WHERE name = 'Chen Fang (1885-1935), female revolutionary and martyr';

At this point, the user table has a clustered index based on the primary key ID, and a non-clustered index based on the name, so the query looks like this:

First of all, because of the use of the name column for the equivalent query, at this time will first use the Name index of the B + tree, the search, when the name is found for Chen Fang's leaf node, will get its ID, and then back to the ID-based clustered index B + tree based on ID = 1 search, and ultimately to find its leaf node, to get the above gender and age, which such a need to use the index of the two processes is often called back to the table query.

even thoughInnoDBcap (a poem)MyISAMThe storage engines all use a B+ tree structure to store indexes by default, but only InnoDB's primary key indexes are clustered, theAuxiliary indexes in InnoDB as well as MyISAM use non-clustered indexes。

Use of Indexes

Principles of Index Creation

existFrequently searched columnson it, which can speed up the search
asprimary keyto enforce the uniqueness of the column and organize the structure of the data arrangement in the table
existColumns often used in joins (JOIN)On the top, these columns are mainly a foreign key that speeds up the connection
existIndexes are often created on columns that need to be searched based on ranges (<, <=, =, >, >=, BETWEEN, IN), because the index is sorted and its specified range is contiguous
existColumns that often need to be sorted (order by)Create an index on it because the index is already sorted, so that queries can take advantage of the index's sorting to speed up sorted query times;
Create indexes on top of columns that are frequently used in WHERE clauses to speed up the determination of conditions.

Data types behave differently for indexing

VARCHAR type

VARCHARtype is used to store variable length strings.

prefix index: As a result ofVARCHARColumns of type can be very long, and to save space and improve efficiency, indexes can be created for only the first few characters of each value, called prefix indexes. However, prefix indexes reduce the selectivity of the index and more rows may need to be scanned to find matching records.
distinctiveness: When choosing the prefix length, you should consider how differentiated the columns are. If most rows have the same prefix, then prefix indexing may not be effective.
Indexing efficiency：Equivalent Queries,INclause queryetc. are more efficient.VARCHARtype columns are in dictionary order when sorting and comparing, which can affect theScope QueriesThe performance of theVARCHARtype indexes perform better when no function or expression is used, but if a function is used in the query for a string column (such as theCONCAT、SUBSTRINGetc.), the index may fail, resulting in a full table scan.

numeric type

Numeric types refer generically to numeric related types, such asINT, FLOAT, Double, etc.。

selectiveness: Columns of integer type are usually highly selective, which means that the values in the index are evenly distributed, effectively reducing the number of rows that need to be checked in a query.
Indexing efficiency：Equivalent query, range query (>、<、BETWEENetc.)More efficient, indexes of numeric types usually outperform string types in query performance when performing range queries.

Date type

DATETIMEtype is used to store dates and times.

Time zone effects: When working with data across time zones, you need to be aware of the impact that time zone conversion may have on index usage and query results.
function impact: If a date function is used in the query (such as theDATE_FORMAT), which may cause the index to fail
Indexing efficiency：DATETIMEThe columns of typeequivalence querycap (a poem)Scope QueriesIt is more efficient when the date and time comparison operation is very fast.

(sth. or sb) else

NULL values for columns: The behavior of an index may be affected if the column contains a NULL value. For example.VARCHARTypes of columns that allow NULL values may take up extra space in the index.
Default values for columns: The default value may affect the selectivity of the index. For example, if aINTtype has a default value for the column, then many rows may have the same default value, which can reduce the selectivity of the index.
Frequency of column updates: Frequently updated columns may fragment the index, thus affecting its performance. Therefore, it may be necessary to rebuild the index periodically for frequently updated columns.

Principles of Using Combined Indexes

When using combinatorial indexes in MySQL, "leftmost prefix"The principle determines how indexes are used and how well queries are optimized. This principle refers to the fact thatMySQL uses the columns in the index in the order of the columns in the combined index, from left to right. The index will only be used effectively if the leftmost column of the index is included in the query condition. If a column in the index is skipped in the query condition, that column and all columns to its right are not used in the index query.

For example, the tb_flow_visit table'sSource ip, destination ip, access time The three fields are set to a combined index (note the order in which they are set)

ALTER TABLE tb_flow_visit ADD INDEX commIndies (sip, dip,visit_time);

sequential priority

In a combinatorial index, MySQL follows the (Source ip, destination ip, access time) this order to match the conditions. IfWHEREThe first thing that appears in the clause is thesip field matching relatedcondition, then the index will be used.

-- Valid index query (match all)
SELECT * FROM tb_flow_visit WHERE sip = '192.168.1.1' AND dip = '192.168.1.2' AND visit_time = '2023-08-01';

-- Valid indexed queries (partial matches)
SELECT * FROM tb_flow_visit WHERE sip = '192.168.1.1' AND dip = '192.168.1.2';

Can't skip

If your query condition first uses thedip (or visit_time)and does not containsip, then this combined index will not be used because the leftmost prefix principle is violated.

-- Invalid index, can't skip sip
SELECT * FROM tb_flow_visit WHERE  dip = '192.168.1.2' AND visit_time = '2024-08-01'; 

-- Invalid indexes, combining indexes that do not follow the order of the indexes, even if any of the columns are used individually, is invalid.
SELECT * FROM tb_flow_visit WHERE dip = '192.168.1.2';or
or
SELECT * FROM tb_flow_visit WHERE visit_time > '2024-08-01';;

partial match

in the event thatWHEREIf the clause contains consecutive columns starting from the leftmost column of the index, then the index can be partially used. Example:The leftmost column is usedsipfields andvisit_timefield, so it will effectively use thesipcolumns are indexed, but will not use thedipcap (a poem)visit_timeIndexing of columns。

SELECT * FROM tb_flow_visit WHERE sip = '192.168.1.1' AND visit_time > '2024-08-01';

Range Query Limits

In MySQL, the leftmost principle's range query restriction means that when using a combinatorial index, if the query condition includes a range query (e.g., the>、<、BETWEEN、LIKEetc.), then this range query must be applied to the leftmost column of the combined index or the column immediately next to the leftmost column (if the leftmost column is a constant condition). If the range query skips the leftmost column or is used on a non-leftmost column, then the index will not be used effectively; for example, if a range query is used on the leftmost column (sip) (as in sip > '192.168.1.2'), then subsequent columns can still be used by the index, but the columns after the leftmost column must be equal-value queries, not range queries.

-- Invalid Index
SELECT * FROM tb_flow_visit WHERE  sip > '192.168.1.2' AND dip > '192.168.2.1';
AND  visit_time = '2024-08-01';
-- Effective Indexing
SELECT * FROM tb_flow_visit WHERE  sip > '192.168.1.2' AND dip = '192.168.2.1' AND visit_time = '2024-08-01';