The Beauty of Data Structures - A Deeper Understanding of Tree Structures

The Beauty of Data Structures - Tree Structures

I. Recognizing the tree structure

Tree structure is a widely used nonlinear data structure, which has a wide range of applications in both computer science and everyday life. For example, tree data structures can be found in file systems, email systems, compiler syntax trees, decision trees, network communications, and even machine learning. The purpose of this article is to summarize the various types of tree structures used in daily life, as well as their advantages and disadvantages, so as to provide an in-depth knowledge and understanding of tree structures. Below is a list of some common application scenarios of tree structures.

1.1 Documentation system

A system used in computers to store and manage files that uses a tree structure to organize files and directories. Each file and directory is stored on a node of the tree, with the parent node representing a higher-level directory and the child nodes representing lower-level directories or files.

1.2 Domain name resolution

The Internet's Domain Name Resolution System (DNS) utilizes a hierarchical tree structure that consists of multiple levels, with the global root servers at the top. Each country, region, and larger organization has its own domain name servers. These servers are interconnected through a complex tree structure to form a distributed network of databases.

Let's observe a domain name for an online typing practice site, The Artful Typist, for example:, you will see that it is split into three parts by two dots. Where com is the top level domain, laidazi is the second level domain and www is the third level domain.

The adoption of this hierarchical tree structure not only improves the efficiency of domain name resolution, but also ensures the stability of domain name resolution. Even if a domain name server fails, the normal operation of the whole DNS system will not be affected too much because the data is distributed on multiple servers and the resolution process can be completed by other available servers.

1.3 Mail system

Emails are usually organized and managed through a tree structure. Each mail can be regarded as a node of the tree, and replies or forwarded mails can be regarded as child nodes of that mail node, forming a tree structure.

1.4 Syntax tree

When program code is compiled, the compiler uses a tree structure (called a syntax tree) to represent the structure and semantics of the source code. Each syntax tree node represents a specific syntax structure, such as a function, a loop, or a conditional statement.

1.5 Decision trees

In a decision tree, each node represents a decision point, each branch represents a possible decision outcome, and leaf nodes represent the final result of the decision. Decision trees are commonly used in the fields of classification, prediction and optimization.

II Evolution of the tree structure

2.1 Linear structures

In engineering practice, data structures such as arrays, linked lists, stacks, queues, etc. should be the most commonly used, they all belong to the combination of ordered data elements, also known as linear structures. They play an important role in storing, traversing, sorting, analyzing and calculating data.

However, this linear structure is not adequate when dealing with data with complex hierarchies, categorizations, and relationships. Tree structures have significant advantages in this regard.

We can also visualize the analogy of certain linear structures to special tree structures consisting of individual child nodes strung together.

2.2 Binary trees

A binary tree is one in which each node has at most two children. It is characterized by orderliness, i.e., the left child node always precedes the right child node. This property can be used to solve various problems like sorting, searching, data compression etc. Binary trees are further subdivided into complete binary trees and balanced binary trees based on the distribution of nodes.

2.2.1 Complete binary trees

Complete binary tree, because its nodes are arranged in a hierarchical order, any node of the left child node of the key value is less than the key value of the node, the right child node of the key value is greater than the key value of the node, this is based on the idea of dichotomous lookup, so that the lookup operation becomes relatively simple and efficient. The time complexity of this lookup is O(log n), which is highly efficient when the data volume is large, so it is mainly used to optimize some specific lookup operations, which are often used in daily life.heap sortIt is this fully binary tree structure that is used.

Complete binary trees suffer from the problem of uneven depths of individual link nodes, which in extreme cases degenerates into a chain-list-like data structure. This makes the time complexity of data lookup O(n).

2.2.2 Balanced binary trees

A balanced binary tree is a self-balancing binary lookup tree that mainly compensates for the node depth difference problem that exists in full binary trees. This maintains O(log n) worst-case lookup performance in scenarios that require frequent lookup and insertion operations, thered and black treesis a self-balancing binary lookup tree in which each node has a color attribute (red or black). The balance of the tree is maintained by color and rotation rules that ensure that the longest path from the root to a leaf is not more than twice the shortest path.

Balanced binary trees require a lot of left-rotation, right-rotation, height calculation and other operations in order to ensure that the height of the tree is balanced at any moment, which can lead to relatively slow insertion and deletion of nodes, and is therefore not well suited for use in large-data-volume scenarios.

2.3 Multiplexed search tree

In real-world application scenarios, the number of children nodes in a binary tree is limited, which results in a relatively high tree height, and the accompanying business data is stored in the middle nodes of the tree, resulting in a long node path for reading and writing data. In big data scenarios, it is not possible to load all data into memory, so each tree node access requires disk IO, resulting in a high number of disk IO reads and writes, and the read and write speed of disk IO is much slower than the speed of memory access, so the binary tree is not suitable for handling large amounts of large-scale data.

2.3.1 B-Tree

B-Tree by increasing the number of child nodes, reducing the level of the tree, can greatly reduce the number of disk IO times, to solve the problem of balanced binary tree disk access efficiency is low, B-tree itself is also a kind of self-balancing tree, can adaptively and dynamically update the data, through the node fission to maintain self-balancing.

Since the B-Tree intermediate nodes (i.e., non-leaf nodes) store both keywords, data records and index pointer data of child nodes, the volume is relatively large, which will make the memory utilization inefficient if they are all tuned into the memory. At the level of query efficiency, due to the use of intermediate order traversal strategy, each query will change with the data close to the root node, is not very stable, for the range of query efficiency is also not high.

2.3.2 B+Tree

B+Tree (B+Tree) is an improvement on B-tree, and its differences from B-tree mainly include:

-All data records of a B+ tree are stored on the leaf nodes, and the intermediate nodes store only keywords and child node index pointer data;

-All the leaf nodes of the B+ tree are connected by a chain table in the order of the data from smallest to largest. Therefore the access to the result set of the data on the B+ tree can be collected directly through the chained list of leaf nodes without going back to the non-leaf nodes, which is more efficient.

Due to the data structure adjustment, B+Tree outperforms B-Tree in all aspects such as disk read/write efficiency, range query efficiency and memory utilization.

Multi-way search tree for insertion, deletion and modification operations have more efficient node adjustment or fission and other ways to ensure the balance of the tree, so as to ensure O (lgN) efficient query time complexity. However, when the application scenario is extended to multi-dimensional spatial index query, due to the parent-child node brother nodes of the size of the ordered relationship between the performance of multiple dimensions, node adjustment or fission operations also need to modify a larger number of farther away from the leaf nodes or ancestor nodes (the limit case may be affected by the entire tree), will lead to the dynamic updating of the data efficiency becomes low.

2.4 Multidimensional trees

2.4.1 KD-Tree

KD-Tree (K-Dimensional Tree) is a high-dimensional indexed tree data structure for storing and retrieving instance points in k-dimensional space.

The above figure shows the storage of a collection of data points for six 2D planar axes: (2,3), (5,4), (9,6), (4,7), (8,1), (7,2).

1. First, according to the X dimension value 7 to divide the space into two to get x<=7 and x>7 two parts of the left subtree and the right subtree.

2. and then bisect the left subtree again in Y dimension 4;

3. According to the above two steps in order to recursive analogies, that is: the first layer of the first X-dimensional division, the second layer of the second Y-dimensional division, the third layer of the third layer of the X-dimensional division, the fourth layer of the fourth layer and then divided into Y-dimensional division, and the cycle repeats itself.

KD-Tree is suitable for handling multidimensional data with high query efficiency. The query time complexity of a static multidimensional data collection is O(lgN) after passing through KD-Tree.

2.4.2 KD-B-Tree

KD-B-Tree (K-Dimension-Balanced-Tree), as the name suggests, combines KD-Tree and B+Tree. it mainly solves the shortcomings of the binary tree form of KD-Tree, which has a higher tree height and is not friendly enough to disk IO, and introduces the multinomial tree form of B+Tree, which not only reduces the height of the tree, but also stores all the data in the leaf nodes and increases the disk space utilization. nodes, increasing the utilization of disk space storage.

Because the hierarchical division of KD-Tree and KD-B-Tree is based on dimensionality in turn, when adjusting an intermediate node after a dynamic update, the current dimension of the node needs to be adjusted at the same time as the current dimension values of all of its descendant nodes, which leads to an increase in the number of accesses to the tree nodes and the increase in the operation time consuming. Therefore, KD-Tree is more suitable for querying multi-dimensional massive data in static scenarios.

2.4.3 BKD-Tree

BKD-Tree (Block-K-Dimension-Tree ), which is a family of (Block) KD-Trees. i.e., it is actually a forest, no longer a tree.

The data structure was designed by several professors at Duke University based on the KD-B-Tree and also incorporates a method known as the "logarithmic method" which makes static data dynamic.

1. Unlike the KD-B-Tree which is a multinomial tree, the BKD-Tree is a fully binary tree. Although binary trees are not as disk IO friendly as multinomial trees, the main reason why BKD-Tree still takes the form of a binary tree may be:

-Takes the form of a fully binary tree, similar to a heap or priority queue, with the ability to access parent or child nodes directly via subscripts;

-Reduces storage space for indexes stored in intermediate nodes, minimalist and compact intermediate index nodes take up less space, and even intermediate nodes can all be called into memory storage at once, and calls to memory access index nodes are more efficient.

1. A method called "logarithmic method" is used to make static data dynamic, which greatly improves the efficiency of dynamic data smaller.

-A forest of bifurcated KD-B-Trees is used. The newly added data is stored on a small-scale data structure M that initially supports efficient querying and dynamic updating, and then the data is updated by replacing the original structure through M and multiple small bifurcated KD-B-Trees, which are merged into a large bifurcated KD-B-Tree with a certain strategy and timing;

-Data deletion can be realized by marking deletion, thus realizing efficient dynamic update operations for multidimensional numerical data;

In Lucene, BKD-Tree is used to address the efficiency of search and filtering operations on large-scale datasets. To support efficient numeric class or multidimensional queries, Lucene introduces Bkd-Tree.

III Application of tree structure

3.1 Dictionary tree

Trie tree by the string of characters as nodes for storage, the underlying use of the multinomial tree data structure. Often used for statistics and sorting a large number of strings (but not limited to strings), so often used by search engine systems for text word frequency statistics. Its main advantage is that is to minimize the unnecessary string comparison, query efficiency than the hash table.

Trie tree prefix matching is often used for search hints. For example, when a keyword is entered, it can be automatically searched for possible choices. When there is no exact match, the most similar possible prefixes can be returned.

3.2 Random Forests

Random Forest is a supervised algorithm that consists of a number of decision trees, which are self-sampled to generate a number of parallel classifiers (decision trees), and the final result is determined by the principle of "majority rule". Random forests use multiple trees to avoid and prevent the overfitting problem of simple decision trees. Its main features are the following two:

1. random: random selection of samples, random selection of features;

2. Integrated learning: voting for election strategies;

Here is a practical example, for example, we want to predict the income level (high, medium, low) of a person based on a total of five characteristics attributes: age, gender, education, industry, and city of residence.

Each tree in a random forest can be viewed as a Categorical Regression Tree (CART), assuming that there are 5 CARTable trees in the forest, with a total feature tree N = 5, and taking M = 1 (indicating that each CART tree corresponds to a different feature).

Based on different features, each decision tree gives a probability of the income level (the feature probability value is the result of our model training), taking the example in the figure, the prediction is as follows:

（20-40）：7%，23%，70%；

（male）：3%，27%，70%；

(University): 6%, 14%, 80%;

(Beijing): 15,%, 20%, 65%;

(Internet): 5%, 30%, 65%;

The combined average result is -7.2% probability of high income, -22.8% probability of moderate income, and -70% probability of low income.

Random forest adopts integrated learning method with high model prediction accuracy, anti-interference ability, and strong interpretability, which plays a great role in various classification scenarios such as wind control, speech recognition, and image classification.

3.3 Database indexing

Database indexes are usually used to accelerate access to data, according to the characteristics of the data distribution will have different data structures, common B-Tree, B+Tree, R-Tree, Hash Index, Bitmap Index, Bloom Filter and so on. The following is an example of the B+Tree index structure used in the Innodb engine of mysql, to explore how the data structure guides our upper-level applications.

Let's start with a diagram of a typical B+Tree index data structure:

3.3.1 Index Tree Depth

The depth of the index tree should be the less the better, it will affect the occupation of disk space as well as the index query efficiency. The main factors affecting the index depth are as follows:

1. Indexed data duplication, the higher the duplication, the relative depth of the index tree will increase;

2. The smaller the data in the index field, the better. This also affects the index depth;

3. The magnitude of the indexed data, the larger the amount of data, the index tree depth will increase;

The smallest storage unit of InnoDB data is Page (the default size is 16k), assuming that the primary key accounts for 8 bytes, the pointer accounts for 6 bytes, and the size of a row of data records is 1k, then the number of data can be stored in a single non-leaf node:16384/14=1170, a single leaf node can hold 16K/1K = 16 records.

Total data capacity = number of node pointers to the root node * amount of data saved per page.

Assume a B+ tree of height 3 and the number of records it can hold:1170*1170*16=21902400Article.

This leads to the concluding recommendations, which are set out below:

1. It is not appropriate to build independent indexes on fields with high repetition (such as gender, status, etc.), which often leads to an increase in the depth of the tree. If there is a real need, you can consider combining other attribute fields with a low degree of repetition, together with a combination of composite indexes;

2. For the problem of indexed field data is too large, you can consider using a prefix index to reduce the size of the index;

3. Ensure that the primary key index is unique and does not contain NULL values. This helps improve query speed and data integrity;

4. The greater the length of the data record, the less data capacity will be available for the same tree depth;

5. The amount of data is large, you can consider splitting the table to alleviate the problem of excessive tree depth;

3.3.2 Data storage formats

In B+Tree, the main storage structure information, is as follows:

1. Intermediate nodes store key-value information, child node pointers, and reference counts;

2. Leaf nodes store key-value information, data record pointers, and delete markers;

3. The leaf nodes are also connected to each other by pointers to form a chained list structure for easy sequential access;

Intermediate nodes and leaf nodes play different roles in InnoDB indexes. Intermediate nodes are used to organize and maintain the structure of the index, while leaf nodes store all data records and provide direct access to the data. The performance and efficiency of database queries can be improved through proper design and use of indexes.

This leads to the concluding recommendations, which are set out below:

1.Using Covered Indexes: If an index contains all the data needed for a query, the query can fetch the data directly through the index without going back to the table. This is called a covering index and can significantly improve query efficiency;

2.Avoid using functions or expressions:Database query optimizers often select and use indexes based on column values. When functions or expressions are applied on columns, the optimizer may not recognize or utilize these indexes, resulting in index failure;

3.Range finding is efficient: Range lookup is very efficient due to the nature of sequential linking between B+Tree leaf nodes. It can quickly locate the start and end of a specified range and then linearly traverse to get all the values in the range, thus avoiding full tree traversal.

4.Leftmost prefix principle:When we create a union index, such as (a,b,c), we actually create a B+ tree consisting of columns a, b and c. In this tree, each node contains the values a, b, and c, and is sorted in the order of a, b, and c. If we want to find the values of columns a and b, and skip column a to find the value of column b, then the range lookup of the B+ tree cannot be used. Because B+tree range lookup is based on the order of the columns, if the leftmost column is skipped, then the range of the lookup cannot be determined.

5.Number of indexes in the control table: Too many indexes have a not insignificant negative impact both in terms of storage footprint and performance overhead of index maintenance (lock contention). Indexes should be designed with reuse in mind whenever possible;

IV. Summary

The move from linear structures to binary trees is a quantum leap in tree structures. The binary tree allows the data structure to present hierarchical relationships, greatly simplifying the complexity of data organization. Balanced trees build on this foundation by keeping the depth of the tree within an acceptable range through balancing operations, thus maintaining better performance during insertion, deletion, and other operations.

Multiplexed search trees such as B-trees and B+-trees further extend the application of tree structures. They allow nodes to have more than one child node, thus providing more flexibility in maintaining the balance of the tree. Multi-dimensional trees, on the other hand, extend the tree structure from one to multiple dimensions, allowing the data structure to better accommodate complex data types and multi-dimensional data relationships.

In terms of applications, tree structures are widely used in many fields such as artificial intelligence, graphics, network protocols and so on. After mastering its underlying realization principle, it can better solve the pain point problem of engineering measurement. For example: in the database system, the index often uses tree structure to improve query efficiency, after understanding its internal implementation, it can have a better performance in terms of resource utilization and performance.