Distributed SQL Server on the cloud, you deserve it!

present (sb for a job etc)
Microsoft SQL Azure is Microsoft's cloud relational database with back-end storage also known as Cloud SQL Server (Cloud SQL Server).
It is built on top of SQL Server and improves the scalability and fault tolerance of traditional relational databases through distributed technology.

data model

(1) Logic model
Cloud SQL Server divides data into multiple partitions and circumvents distributed transactions by restricting transactions to be executed in only one partition. In addition, it ensures high availability by replicating data to multiple copies through the Primary-Copy protocol.
A logical database in Cloud SQL Server is called a table group, which can have either a primary key or no primary key.
Only table groups with primary keys are discussed here.
If a table group is with a primary key, it is required that all tables in the table group have an identical column, called the partitioning primary key (partitioning key).

Cloud SQL Server Data Model

The table group in the figure contains two tables, the Customers table and the Orders table, divided by the primary key of the Customer ID (the Id column in the Customers table).

The division primary key does not need to be a common unique primary key for each table in the table group. In the figure, the Customer ID is the unique primary key for the Customers table, but not for the Orders table.

Similarly, the dividing primary key does not need to be an aggregated index for each table; the aggregated index for the orders table is the combined primary key <Customer ID, Order ID> (<Id, Oid>).

The set of rows in a table group that have the same dividing primary key is called a row group. The first row of the Customers table and the first two rows of the Orders table have a primary key of 30 and form a row group.

Cloud SQL Server only supports transactions within the same row group, which means that data from the same row group is logically distributed to the same server.

If the table group is primary keyed, Cloud SQL Server supports automatic horizontal splitting of tables in the table group and dispersing them across the cluster. The same row group is always distributed to the same physical SQL Server server, thus avoiding distributed transactions.

This approach is avoiding two problems with distributed transactions:blockageup toperformances. Of course, it also limits the user's usage patterns. Read-only transactions can span multiple row groups, but the transaction isolation level supports at most read-committed.

(2) Physical modeling
At the physical level, each table group with a primary key is ordered into multiple data partitions based on the partitioned primary key columns. These partitions do not overlap each other and cover all partitioned primary key values. This means that each row group belongs to a unique partition.
A partition is the basic unit of cloud SQL Server replication, migration, and load balancing. Each partition contains multiple replicas (3 by default), and each replica is stored on a physical SQL Server.
Since each row group belongs to a partition, this also means that the amount of data in each row group cannot exceed the upper limit of storage allowed by the partition, that is, the upper limit of capacity of a single SQL Server.
Generally, machines in the same switch or the same rack are more likely to fail at the same time, and thus they belong to the same failure domain.
Cloud SQL Server ensures that multiple copies of each partition are distributed to different failure domains. One replica per partition is the primary and the others are secondary.
The primary replica handles all queries, updates the transaction and synchronizes the transaction to the backup replica in the form of a transaction log (in a manner similar to database mirroring). Each replica receives the transaction log sent by the primary replica and applies it to the local database. The backup replica supports read operations, which can reduce the pressure on the primary replica.

As shown in the figure, there are four logical partitions PA, PB, PC, and PD, each of which has one primary and two backup copies. For example, PA has one primary copy, PA_P, and two backup copies, PA_S1 and PA_S2.

Each physical SQL Server database houses a mix of primary and backup copies. If one machine fails, the partitions on it can be quickly dispersed to other living machines.
Partitioning is dynamic. If a partition exceeds the maximum allowed partition size or the load is too high, the partition will split into two.
Assume that the primary copy of partition A is on machine X and its backup copies are on machines Y and Z. If partition A splits into A1 and A2, each copy needs to be split into two segments accordingly.
For better load balancing, each replica may have a different role before and after the split. For example, A1's primary replica is still at machine X and the backup replicas are at machines Y and Z, while A2's primary replica may be at machine Y and the backup replicas are at machines X and Z.

build
Cloud SQL Server is divided into four main parts: the SQL Server instance, the Global Partition Manager, the Protocol Gateway, and the Distributed Base Components, as shown in the figure.

The function of each part is as follows:
Each SQL Server instance is a physical process running SQL Server. Each physical database contains multiple sub-databases that are isolated from each other. A subdatabase is a partition that contains user data and schema information.
The Global Partition Manager maintains partition mapping table information, including the primary key range to which each partition belongs, the server on which each replica resides, and the current status of each replica, which includes whether the replica is currently master or standby, was previously master or standby, is becoming master, is being copied, or is being chased.
When a server fails, the distributed infrastructure component detects it and synchronizes this information to the global partition manager. The Global Partition Manager then performs reconfiguration operations. In addition, Global Partition Manager monitors the working status of SQL Server in the cluster and performs management operations such as load balancing and replica copying.
The Protocol Gateway is responsible for forwarding the user's database connection request to the corresponding primary partition. The Protocol Gateway obtains the SQL Server instance where the partition resides through the Global Partition Manager, and subsequent read and write transactions take place between the gateway and the SQL Server instance.
Distributed Fabric is used to maintain machine up and down state, detect server failures and perform election of master node operations for various roles in the cluster. It runs a daemon on each server.

Replication and consistency
Cloud SQL Server uses the "Quorum Commit" replication protocol, user data is stored in three copies, at least two copies can be written successfully before returning to the client successfully. As shown in the figure, the primary replica of transaction T is partitioned to generate transaction logs and send them to the backup replica.

If transaction T rolls back, the primary replica sends an ABORT message to the backup replica, which deletes the received modification operation contained in transaction T. The backup replica is not allowed to use this message. If transaction T commits, the primary replica sends a COMMIT message to the backup replica with theTransaction Commit Sequence Number (CSN)，

Each standby replica applies the modified operation of transaction T to the local database and sends an ACK message back to the primary replica. If the master replica receives successful ACKs from more than half of the nodes (including the master replica itself), it will commit the transaction locally and return to the client successfully.
Some standby replicas may fail, and after recovery, they will send the commit sequence number CSN of the last locally committed transaction to the primary replica. if the two are similar, the primary replica will send the operation log directly to the standby replica; if there is too much difference between the two, the primary replica will first pass a snapshot of the database to the standby replica, and then pass the operation log to the standby replica after the snapshot.
Logs of logical operations are sent between the primary and backup copies, rather than redo and rollback logs for physical pages on disk. Database indexing and schema-related operations (such as creating and deleting tables) are also sent through the transaction log.
Between the replicas to send transaction logs / logical operation logs to ensure data consistency of each replica is currently the mainstream program, including TiDB, OceanBase is also using the same program.
Some hardware issues were found during practice, such as certain NICs exhibiting incorrect behavior, so checksums are done on all messages between the master and the backup.
Similarly, some disks can have "bit-flip" errors, so a checksum is applied to the data written to the disk.

error tolerance
If a data node fails, you need to start the downtime recovery process. Each SQL Server instance serves up to 650 logical partitions, which may be primary or backup copies.
The global partition manager unifies scheduling and selects one partition at a time to perform reconfiguration (Reconfiguration).
If the failed partition is a spare copy, the Global Partition Manager first selects a lightly loaded server, and then copies data from the corresponding primary copy partition to increase the replica;

If the partition that failed is the primary replica, you first need to select an up-to-date spare replica from one of the other replicas as the new primary replica, followed by selecting a less loaded machine to add the spare replica.
Since Cloud SQL Server uses the "Quorum Commit" replication protocol, if there are three replicas per partition, at least two replicas are guaranteed to write successfully, and selecting the latest backup replica after the primary replica fails ensures that no data is lost.
The global partition manager controls the priority of reconfiguration tasks; otherwise, users' services are affected. For example, if the primary copy of a data partition fails, you need to switch to the primary copy as soon as possible by selecting the latest backup copy from the other backup copies;

A certain data slice has only one primary copy, which needs to be replicated out of the backup copy as a priority. In addition, some servers may go offline for a short period of time and then come back online, and in order to avoid too many useless data copies, the
Here you also need to configure some policies, such as starting to replicate the third replica only if the state of two replicas lasts for a longer period of time (SQL Azure is configured for two hours by default).
The global partition manager also uses "Quorum Commit" honestHigh availability is now available. It contains seven replicas (an odd number), with only one replica being primary at the same moment, and partition-related metadata operations need to succeed on at least four replicas.
If the global partition manager primary replica fails, the distributed infrastructure component will be responsible for selecting an up-to-date replica from the other replicas as the new primary replica

load balancing
Load balancing-related operations consist of two types: replica migration and primary and backup replica switching. When a new server node is added, partitions within the system are gradually migrated to the new node, the
It is important to note here that in order to avoid too many partitions being moved to new nodes at the same time, the global partition manager needs to control the frequency of migrations, otherwise the overall system performance will be degraded.
In addition, if the load on the server where the master copy is located is too high, you can choose the backup copy with lower load to upgrade to provide read and write services for the master copy. This process is called master and backup copy switching and does not involve data copying. The following are some examples of how to do this,
Factors that affect the load on a server node include the number of reads and writes, disk/memory/CPU/IO usage, and so on. Global Partition Manager calculates the load for each partition and each SQL Server instance based on these factors.

multi-tenant
The operations of multiple users in a cloud storage system interfere with each other, so you need to limit the system resources used by each SQL Azure logical instance:

Operating system resource limits, such as CPU, memory, write speeds, etc. If the limit is exceeded, the corresponding user request will be rejected within 10 seconds;
SQL Azure logical database capacity limits. Each logical database is pre-set with a maximum capacity, and update requests are denied when the limit is exceeded, but delete operations are allowed;
SQL Server physical database data size limit. A client system error is returned when this limit is exceeded.

summarize
Microsoft SQL Azure is based on SQL Server and improves database scalability and fault tolerance through distributed technology. Adopting master and backup replication and partitioning mechanism to ensure high availability and consistency of data.
The system optimizes performance and ensures stable operation in multi-tenant environments through global partition management, load balancing and resource limiting.
SQL Server is currently a mainstream and competitive product, according to the latest reliable news, SQL Server 2025 version will be built-in SQL Azure distributed functionality, coupled with vector databases and AI capabilities, will be more competitive on the world stage.

reference article
/en-us/products/azure-sql/
/chapter/10.1007/978-1-4842-9225-9_2
/azure-sql-database-connectivity-architecture/
/en-us/azure/architecture/reference-architectures/n-tier/multi-region-sql-server
/book/data/9781789538854/1/ch01lvl1sec08/azure-sql-database-architecture

Join our weibo group to discuss database technology and topics related to SQL Server, MySQL, PostgreSQL, and MongoDB with us.
WeChat is for learning and communication purposes only, without any advertising or commercial activities.

This article is copyrighted and may not be reproduced without the author's permission.