Imagine you have a library (single library and single table) with all books placed on the bookshelf in order. When you are looking for books 100-110, you can get it by counting directly to the 100th. But after the library's books exploded, the curator decided:
-
Distribution: Divide the book into 10 rooms (10 databases)
-
Table: Each room is divided into 20 bookshelves (20 tables)
Each bookshelves only have books with specific rules (such as taking the model by ID:ID % 200
)
The problem is here:
When the user asks to "arrange in reverse order of time and display the 1000-1010th data":
-
Bookshelf in each room are sorted independently
-
Can't directly know where the 1000th data in the global area is
Solution:
Imagine you manage 10 express sorting stations (distribution and table), each site has its own parcel number. Now we need to find out the 1000-1010 packages issued in the country:
-
Stupid method: Let all sites send the first 1010 packages to the headquarters, and select the required ones after the headquarters is sorted (global sorting method)
-
Smart way:
-
First ask the shipping time of the 1010th parcel on each site
-
Time to find the 1010th parcel in the country
-
Let each site send only packages later than this time and sort them again (quadratic query method)
-
-
The lazy but effective way: Copy all the package information to the headquarters’ smart blackboard (Elasticsearch) and check the blackboard at any time!
The following are the solutions, advantages, disadvantages and practices:
plan | Applicable scenarios | Difficulty in technology implementation | shortcoming |
---|---|---|---|
Global sorting method | Small data volume, few shards | Low | Poor performance, deep paging explosion |
Quadratic query method | There are indexes in the sorting field, and there are many shards | middle | Need two inquiries |
Cursor paging method | There are globally ordered fields (such as ID, timestamp) | Low | Random page jump is not supported |
Search engine assistance | High-frequency complex query, tolerating second-level delay | high |
High maintenance costs and data delays |
(1) Global sorting method (simple but inefficient)
Implementation steps:
-
Each shard is executed independently
ORDER BY create_time DESC LIMIT 0, 1010
-
Summarize the results of all shards into a central node
-
The central node sorts all data globally, and then selects items 1000-1010.
shortcoming:
-
The more shards, the more exponentially the data transmission volume increases
-
Performance disasters when deep paging (offset is large) (for example: offset=1 million)
(2) Quadratic query method (optimization performance)
Core idea: Avoid full data transmission, narrow the scope in two queries
Step Example(Query Articles 1000-1010):
-
First query: Each shard returns the first 1010 data piecesMinimum time stamp
-
For example, the minimum timestamp of shard 1 is T1, and shard 2 is T2...
-
-
Take the 1010th smallest global timestamp in all shards T_global
-
Second query: Query of each fragment
create_time >= T_global
Data of -
After summarizing, sort the final result again
advantage:
-
Significantly reduce data transmission
-
Suitable for the case where the sorting field has an index (such as a timestamp)
(3) Cursor paging method (avoid deep paging)
Core idea: Use continuous and unique fields (such as self-increment ID, timestamp) as cursors
Advantages:
-
Avoid completely
offset
Performance issues -
Naturally suitable for library and table division, each fragment needs to be filtered according to the cursor conditions
limit:
-
There must be a globally ordered and unique field
-
Users cannot jump directly to any page number (such as page 1000)
(4)Search Engine Assistance (Ultimate Weapon)
Applicable scenarios: High frequency complex paging query
Implementation method:
-
Synchronize the data of the database and tables to Elasticsearch/Solr in real time
-
All paging queries go directly to the search engine
-
Search engine internal maintenance global sorting
Advantages:
-
Completely block the complexity of library and tables
-
Support complex condition filtering + high concurrent query
cost:
-
Additional systems are required to be maintained, and data consistency delays are increased
Suggestions in actual development
-
Priority to cursor paging:
-
For example, the waterfall flow page of the App will only continue to decline and do not need to jump to pages
-
-
Limit maximum paging depth:
-
For example, you can view the first 100 pages at most to prevent malicious deep paging attacks
-
-
Combined with business optimization:
-
When e-commerce sorts by price, the price can be segmented in advance (0-100 yuan, 100-200 yuan...)
-
-
The ultimate solution:
-
The combination of library and table + Elasticsearch, sacrifices certain real-time quality for high-performance query
-