Cross-pagination query after challenging database and table division

Imagine you have a library (single library and single table) with all books placed on the bookshelf in order. When you are looking for books 100-110, you can get it by counting directly to the 100th. But after the library's books exploded, the curator decided:

Distribution: Divide the book into 10 rooms (10 databases)
Table: Each room is divided into 20 bookshelves (20 tables)
Each bookshelves only have books with specific rules (such as taking the model by ID:ID % 200）

The problem is here：
When the user asks to "arrange in reverse order of time and display the 1000-1010th data":

Bookshelf in each room are sorted independently
Can't directly know where the 1000th data in the global area is

Solution:

Imagine you manage 10 express sorting stations (distribution and table), each site has its own parcel number. Now we need to find out the 1000-1010 packages issued in the country:

Stupid method: Let all sites send the first 1010 packages to the headquarters, and select the required ones after the headquarters is sorted (global sorting method)
Smart way：
1. First ask the shipping time of the 1010th parcel on each site
2. Time to find the 1010th parcel in the country
3. Let each site send only packages later than this time and sort them again (quadratic query method)
The lazy but effective way: Copy all the package information to the headquarters’ smart blackboard (Elasticsearch) and check the blackboard at any time!

The following are the solutions, advantages, disadvantages and practices:

plan	Applicable scenarios	Difficulty in technology implementation	shortcoming
Global sorting method	Small data volume, few shards	Low	Poor performance, deep paging explosion
Quadratic query method	There are indexes in the sorting field, and there are many shards	middle	Need two inquiries
Cursor paging method	There are globally ordered fields (such as ID, timestamp)	Low	Random page jump is not supported
Search engine assistance	High-frequency complex query, tolerating second-level delay	high	High maintenance costs and data delays

(1) Global sorting method (simple but inefficient)

Implementation steps：

Each shard is executed independentlyORDER BY create_time DESC LIMIT 0, 1010
Summarize the results of all shards into a central node
The central node sorts all data globally, and then selects items 1000-1010.

shortcoming：

The more shards, the more exponentially the data transmission volume increases
Performance disasters when deep paging (offset is large) (for example: offset=1 million)

(2) Quadratic query method (optimization performance)

Core idea: Avoid full data transmission, narrow the scope in two queries

Step Example(Query Articles 1000-1010):

First query: Each shard returns the first 1010 data piecesMinimum time stamp
- For example, the minimum timestamp of shard 1 is T1, and shard 2 is T2...
Take the 1010th smallest global timestamp in all shards T_global
Second query: Query of each fragmentcreate_time >= T_globalData of
After summarizing, sort the final result again

advantage：

Significantly reduce data transmission
Suitable for the case where the sorting field has an index (such as a timestamp)

(3) Cursor paging method (avoid deep paging)

Core idea: Use continuous and unique fields (such as self-increment ID, timestamp) as cursors

Advantages：

Avoid completelyoffsetPerformance issues
Naturally suitable for library and table division, each fragment needs to be filtered according to the cursor conditions

limit：

There must be a globally ordered and unique field
Users cannot jump directly to any page number (such as page 1000)

（4）Search Engine Assistance (Ultimate Weapon)

Applicable scenarios: High frequency complex paging query
Implementation method：

Synchronize the data of the database and tables to Elasticsearch/Solr in real time
All paging queries go directly to the search engine
Search engine internal maintenance global sorting

Advantages：

Completely block the complexity of library and tables
Support complex condition filtering + high concurrent query

cost：

Additional systems are required to be maintained, and data consistency delays are increased

Suggestions in actual development

Priority to cursor paging：
- For example, the waterfall flow page of the App will only continue to decline and do not need to jump to pages
Limit maximum paging depth：
- For example, you can view the first 100 pages at most to prevent malicious deep paging attacks
Combined with business optimization：
- When e-commerce sorts by price, the price can be segmented in advance (0-100 yuan, 100-200 yuan...)
The ultimate solution：
- The combination of library and table + Elasticsearch, sacrifices certain real-time quality for high-performance query