Course 9: Distributed Database Architecture Evolution Case Study

Goal: Databases are the lifeline of many applications, and their scalability and consistency are crucial. Using the evolution of a typical distributed database system as an example, this course walks through core technical challenges and solutions such as read/write splitting, sharding (splitting databases and tables), distributed transactions, multi-active deployment, and elastic scaling, and examines the engineering trade-off between ACID and CAP in the database field.


Phase 0: The Reign of the Single Node (Classic MySQL Single Instance)

System Description

  • Scenario: A typical business system needs to persist data like user information, orders, products, etc.
  • Functionality: Supports standard SQL operations (INSERT/SELECT/UPDATE/DELETE), guarantees transaction ACID properties.
  • Tech Stack:
    • Database: MySQL 5.7 or higher (using the InnoDB storage engine)
    • Application Connection: Java (JDBC), Python (mysqlclient), etc., connecting directly to the database instance.
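
To make the ACID guarantee concrete, here is a minimal JDBC sketch of a transfer running as a single local InnoDB transaction: both updates commit together or roll back together. The table name, connection URL, and credentials are placeholders for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SingleInstanceTransfer {

    // Placeholder connection details for a local MySQL 5.7+ instance.
    private static final String URL = "jdbc:mysql://localhost:3306/demo";
    private static final String USER = "app";
    private static final String PASS = "secret";

    /** Moves `amount` between two accounts inside one local InnoDB transaction. */
    public static void transfer(long fromId, long toId, long amount) throws SQLException {
        try (Connection conn = DriverManager.getConnection(URL, USER, PASS)) {
            conn.setAutoCommit(false);                 // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE account SET balance = balance + ? WHERE id = ?")) {

                debit.setLong(1, amount);
                debit.setLong(2, fromId);
                debit.setLong(3, amount);
                if (debit.executeUpdate() != 1) {      // insufficient balance or missing row
                    conn.rollback();
                    throw new SQLException("debit failed, transaction rolled back");
                }

                credit.setLong(1, amount);
                credit.setLong(2, toId);
                credit.executeUpdate();

                conn.commit();                          // both updates become visible atomically
            } catch (SQLException e) {
                conn.rollback();                        // any error undoes both statements
                throw e;
            }
        }
    }
}
```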

Current Architecture Diagram

graph LR
    App --> MySQL("MySQL Master Single Instance");
Pain Points at this moment:

  • Read Performance Bottleneck: As user volume and data grow, numerous read requests (often over 80%) can exhaust single-machine CPU and IO resources, slowing down queries.
  • Write Performance Bottleneck: In high-concurrency write scenarios (like flash sales), lock contention is severe, limiting write QPS.
  • Capacity Limit: Single-machine disk capacity has an upper limit; performance degrades significantly with TB-level data.
  • Single Point of Failure: If the database server goes down, the entire business is paralyzed.


Phase 1: Overwhelming Read Pressure → Read/Write Splitting First

Challenge Emerges

  • Application's read QPS continuously increases, database response time grows, user experience declines.
  • Write requests are relatively few but are slowed down by numerous read requests.

❓ Architect's Thinking Moment: How to first alleviate the most prominent read pressure?

(Can't just add machines, what about writes? Is master-slave replication the standard solution? Does the application layer need changes?)

✅ Evolution Direction: Introduce Master-Slave Replication and Read/Write Splitting

  1. Enable MySQL Master-Slave Replication:
    • Configure a Master database responsible for handling all write operations (INSERT, UPDATE, DELETE).
    • Configure one or more Slave databases that asynchronously replicate data changes from the master (via Binlog). Slaves handle most read operations (SELECT).
  2. Implement Read/Write Splitting:
    • Option 1: Application Layer Modification. Determine the SQL type in application code and send write SQL to the master and read SQL to the slaves (load balancing across slaves may be needed); a minimal routing sketch follows this list. Highly intrusive.
    • Option 2: Introduce Database Middleware/Proxy. Use tools like ProxySQL, MySQL Router, or ShardingSphere-Proxy. Middleware is transparent to the application, automatically parsing SQL and routing read/write requests to the appropriate master or slave. This is the more common approach.
  3. Address Master-Slave Latency Issue:
    • Since master-slave replication is usually asynchronous, data latency exists. If the business requires very high real-time consistency for reads (e.g., needing to read immediately after writing), possibilities include:
      • Forcing these read requests to the master (e.g., using SQL Hint /*FORCE_MASTER*/).
      • Designing application logic to tolerate brief latency.
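
A minimal sketch of Option 1 (application-layer routing), assuming hypothetical master/slave JDBC URLs and credentials: write SQL and consistency-sensitive reads go to the master, ordinary reads are load-balanced across slaves. A middleware such as ProxySQL performs the same routing transparently.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Minimal application-layer read/write splitting (Option 1). URLs and credentials are placeholders. */
public class ReadWriteRouter {

    private static final String MASTER_URL = "jdbc:mysql://master:3306/demo";
    private static final List<String> SLAVE_URLS = List.of(
            "jdbc:mysql://slave1:3306/demo",
            "jdbc:mysql://slave2:3306/demo");

    /**
     * @param sql         the statement about to be executed
     * @param forceMaster true for read-your-own-write queries that cannot tolerate replication lag
     */
    public Connection route(String sql, boolean forceMaster) throws SQLException {
        boolean isRead = sql.trim().toUpperCase().startsWith("SELECT");
        if (!isRead || forceMaster) {
            // Writes (and forced reads) always go to the master.
            return DriverManager.getConnection(MASTER_URL, "app", "secret");
        }
        // Simple random load balancing across slaves for ordinary reads.
        String slave = SLAVE_URLS.get(ThreadLocalRandom.current().nextInt(SLAVE_URLS.size()));
        return DriverManager.getConnection(slave, "app", "secret");
    }
}
```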

Architecture Adjustment (Introducing Master-Slave and Middleware):

graph TD
    App --> Proxy("Database Middleware ProxySQL / ShardingSphere-Proxy");
    Proxy -- Write Requests --> Master("MySQL Master");
    Proxy -- Read Requests --> Slave1("MySQL Slave 1");
    Proxy -- Read Requests --> Slave2("MySQL Slave 2");
    Master -- Binlog --> Slave1;
    Master -- Binlog --> Slave2;
Significantly improved read performance and availability (read requests distributed across slaves), but write performance and single-database capacity bottlenecks still exist.


Phase 2: Writes Can't Keep Up Either → Sharding is the Endgame

Challenge Escalates: Write Bottleneck and Capacity Limit

  • With continued business growth, order and user volumes explode. Even with writes going only to the master, the master's write QPS hits its bottleneck.
  • Single table data volume becomes excessive (e.g., order table reaching billions of rows), increasing index maintenance costs, severely degrading query performance (even with indexes), and making DDL operations (like adding fields) a nightmare.
  • The master's disk storage capacity also approaches its limit.

❓ Architect's Thinking Moment: Vertical splitting has reached its limit; what is the ultimate horizontal scaling solution for databases?

(Read/write splitting can't solve write bottlenecks and capacity issues. Horizontal splitting is necessary. By what dimension? User ID? Order ID? Time? How to ensure globally unique IDs? Does the application layer need to know the sharding details?)

✅ Evolution Direction: Database Horizontal Sharding + Distributed ID

  1. Sharding Strategy:
    • Choose an appropriate Sharding Key to distribute data across multiple database instances (database sharding) or multiple tables (table sharding).
    • Common Strategies:
      • By Range: E.g., sharding by user ID ranges or time ranges (one order table per month). Relatively simple to expand, but prone to data hotspots (e.g., the most recent month's data is accessed far more often than older data).
      • By Hash: Calculate a hash value for the sharding key (e.g., user_id or order_id), then take the modulo of the number of databases/tables to determine which one the data belongs to. More even data distribution, but range queries can be difficult.
    • Choosing the Sharding Key is Crucial: Usually select the most core field used frequently in queries. Carefully evaluate query patterns to avoid cross-shard queries (especially JOINs). A common practice is to first shard by the core entity (like User ID) into databases, then shard tables within the database by related entities (like Order ID) or time.
  2. Introduce Sharding Middleware:
    • Use ShardingSphere-JDBC (integrated within the application) or ShardingSphere-Proxy (independent proxy layer), or MyCAT, TDDL, etc.
    • The middleware is responsible for parsing SQL, routing requests to the correct physical database/table based on configured sharding rules, and handling result aggregation (like ORDER BY, GROUP BY), trying to remain transparent to the application layer.
  3. Introduce Distributed ID Generator:
    • After sharding, database auto-increment primary keys are no longer globally unique, so a Distributed ID Generation Service is needed to guarantee globally unique primary keys.
    • Common solutions: the Snowflake algorithm, database segment (number-range) mode, Redis INCR, and UUID (random and long, which hurts InnoDB index locality; not recommended as a primary key). A sketch of hash shard routing and a Snowflake-style generator follows this list.
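
A minimal sketch of the two building blocks above, assuming 16 logical shards and the standard Snowflake bit layout (41-bit timestamp, 10-bit worker ID, 12-bit sequence); class and field names are illustrative.

```java
/** Sketch of hash-based shard routing and a Snowflake-style distributed ID generator. */
public class ShardingSupport {

    /** Hash sharding: map a sharding key (e.g. user_id) to one of N logical shards. */
    public static int shardFor(long userId, int shardCount) {
        // Long.hashCode spreads the key; Math.floorMod avoids negative results.
        return Math.floorMod(Long.hashCode(userId), shardCount);
    }

    /** Minimal Snowflake-style ID generator: time-ordered, globally unique per worker. */
    public static class SnowflakeId {
        private static final long EPOCH = 1577836800000L; // 2020-01-01, an arbitrary custom epoch
        private final long workerId;                      // must be unique per process (0..1023)
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public SnowflakeId(long workerId) {
            if (workerId < 0 || workerId > 1023) {
                throw new IllegalArgumentException("workerId out of range");
            }
            this.workerId = workerId;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now < lastTimestamp) {
                throw new IllegalStateException("clock moved backwards");
            }
            if (now == lastTimestamp) {
                sequence = (sequence + 1) & 0xFFF;        // 12-bit sequence within the same millisecond
                if (sequence == 0) {                      // sequence exhausted: spin to the next millisecond
                    while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* busy wait */ }
                }
            } else {
                sequence = 0;
            }
            lastTimestamp = now;
            // 41-bit timestamp | 10-bit worker | 12-bit sequence
            return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
        }
    }
}
```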

Architecture Adjustment (Deep Database Layer Transformation):

graph TD
    App --> ShardingMiddleware("Sharding Middleware ShardingSphere / MyCAT");
    subgraph "Distributed ID Service"
        IDService("ID Generation Service Snowflake / Segment");
    end
    App -- Get Globally Unique ID --> IDService;
    subgraph "Sharded Database Cluster"
        DBInstance1("MySQL Instance 1");
        DBInstance2("MySQL Instance 2");
        DBInstance3("MySQL Instance 3");
        DBInstance4("MySQL Instance 4");
        DBInstance1 -- Contains Shard 0, 1, 2, 3 --> T0("Table_0..3");
        DBInstance2 -- Contains Shard 4, 5, 6, 7 --> T1("Table_4..7");
        DBInstance3 -- Contains Shard 8, 9, 10, 11 --> T2("Table_8..11");
        DBInstance4 -- Contains Shard 12, 13, 14, 15 --> T3("Table_12..15");
    end
    ShardingMiddleware -- SQL Routing/Result Aggregation --> DBInstance1;
    ShardingMiddleware -- SQL Routing/Result Aggregation --> DBInstance2;
    ShardingMiddleware -- SQL Routing/Result Aggregation --> DBInstance3;
    ShardingMiddleware -- SQL Routing/Result Aggregation --> DBInstance4;
Solved write performance and capacity bottlenecks, but introduced the challenge of distributed transactions.


Phase 3: What About Cross-Database Operations? → The Choice of Distributed Transactions

New Challenge: Ensuring Data Consistency Across Shards

  • After sharding, a single business operation might need to modify data distributed across different database instances. For example, a transfer operation: User A's balance (in DB instance 1) decreases, User B's balance (in DB instance 2) increases. How to ensure these two operations either both succeed or both fail?
  • Traditional local database transactions cannot span multiple instances.

❓ Architect's Thinking Moment: How to guarantee atomicity for cross-database operations?

(Simple rollback won't work. Two-phase commit (2PC/XA)? Performance is too poor! Other options? Flexible transactions? Is eventual consistency acceptable?)

✅ Evolution Direction: Adopt Flexible Transaction Solutions (Saga/TCC/Transactional Messages) or Seata Framework

Internet scenarios typically cannot tolerate the performance and locking overhead of 2PC/XA, preferring flexible transaction solutions that pursue eventual consistency:

  1. Reliable Message Eventual Consistency (Based on Message Queue):
    • Core idea: Break down the cross-database operation into multiple local transactions, driven by a message queue (MQ) (like Kafka, RocketMQ).
    • Example: Transfer - Service A executes a local transaction (debit), then sends a "debited" message to MQ. Service B subscribes to the message, consumes it successfully, then executes its local transaction (credit).
    • Requires at-least-once delivery from the MQ and idempotent consumers. RocketMQ's transactional messages additionally guarantee the atomicity of "execute the local transaction" and "send the message".
  2. TCC (Try-Confirm-Cancel):
    • For each service participating in the distributed transaction, implement Try, Confirm, and Cancel interfaces.
    • Try phase: Attempt to execute the business logic, reserve resources (e.g., freeze amount).
    • Confirm phase: If all services' Try succeed, call their respective Confirm interfaces to actually execute the business.
    • Cancel phase: If any service's Try fails, call the Cancel interfaces of all services that successfully completed Try to release reserved resources.
    • Highly intrusive to business code (a minimal interface sketch follows this list).
  3. Saga Pattern:
    • Break down a long business process into multiple sub-transactions (local transactions), each with a corresponding compensation operation.
    • Execute sub-transactions sequentially. If a sub-transaction fails, execute the compensation operations of the preceding successful sub-transactions in reverse order.
    • Relatively simple to implement, but weaker consistency guarantees (doesn't guarantee isolation).
  4. Distributed Transaction Framework Seata:
    • Alibaba's open-source distributed transaction solution, offering AT (Automatic Compensation, similar to TCC but less intrusive), TCC, Saga, and XA modes.
    • The AT mode is its highlight: by proxying the data source, it records before and after images of the SQL it executes and automatically generates the compensating rollback on failure, requiring minimal changes to business code.
    • Introducing Seata increases architectural complexity and dependencies (requires deploying TC - Transaction Coordinator).
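
A minimal sketch of the TCC contract for the transfer example; the interface and method names are illustrative and not tied to any specific framework.

```java
/**
 * TCC participant contract for the debit side of a transfer.
 * Try reserves resources, Confirm applies them, Cancel releases them.
 */
public interface AccountTccService {

    /** Try: reserve resources only, e.g. move the amount from `balance` to `frozen`. */
    boolean tryDebit(String txId, long accountId, long amount);

    /** Confirm: all participants' Try succeeded, so deduct the frozen amount for real. Must be idempotent. */
    void confirmDebit(String txId, long accountId, long amount);

    /**
     * Cancel: some participant's Try failed, so release the frozen amount.
     * Must be idempotent and tolerate an "empty rollback" (Cancel arriving without a prior Try).
     */
    void cancelDebit(String txId, long accountId, long amount);
}
```

The idempotency and empty-rollback handling must be implemented by the business code itself, which is exactly the intrusiveness the list item above refers to.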

Architecture Adjustment (Using Seata AT Mode as an Example):

graph TD
    App --> SeataProxy("Seata DataSource Proxy");
    SeataProxy --> ShardingMiddleware;
    App -- Initiate Global Transaction --> TC("Seata TC Transaction Coordinator");
    subgraph "Business Execution"
        ShardingMiddleware -- "(Branch Transaction 1)" --> DBInstance1("Involving Shard A");
        ShardingMiddleware -- "(Branch Transaction 2)" --> DBInstance2("Involving Shard B");
    end
    DBInstance1 -- Register Branch/Report Status --> TC;
    DBInstance2 -- Register Branch/Report Status --> TC;
    TC -- Drive Global Commit/Rollback --> App & DBInstance1 & DBInstance2;
Solved cross-database transaction consistency issues, but raised higher requirements for availability and performance, especially for online scaling.


Phase 4: Business Growth Can't Stop → Online DDL and Elastic Scaling

Challenge Revisited: Smooth Scaling and Changes

  • Business continues to grow, requiring more database shards to support larger data volumes and concurrency. Traditional downtime maintenance windows are increasingly unacceptable.
  • How to perform database scaling (increasing shard count) and schema changes (like adding tables, DDL operations for adding fields) without service interruption?

❓ Architect's Thinking Moment: How to make the database cluster elastic like applications?

(Downtime scaling is not an option. Are there online data migration solutions? How to perform DDL changes smoothly?)

✅ Evolution Direction: Explore NewSQL or Adopt Mature Online DDL/Migration Tools

  1. NewSQL Databases:
    • Some emerging distributed databases (often called NewSQL), like TiDB, CockroachDB, Google Spanner, inherently possess online elastic scaling capabilities.
    • They typically use underlying storage engines like RocksDB and consensus protocols like Raft/Paxos, enabling automatic data rebalancing and region splitting/merging, supporting online node addition/removal.
    • Most are compatible with the MySQL protocol, making migration relatively manageable.
  2. Online DDL Tools (for MySQL Sharding):
    • For scenarios continuing with MySQL sharding clusters, tools like gh-ost or pt-online-schema-change can perform online DDL operations with minimal business impact.
    • Both work by creating a "shadow table" and copying data across; pt-online-schema-change synchronizes incremental changes with triggers, while gh-ost reads them from the Binlog.
  3. Online Data Migration and Scaling Solutions (for MySQL Sharding):
    • Database scaling (e.g., expanding from 16 to 32 shards) usually involves complex data migration processes.
    • Common strategies include:
      • Dual Write: During migration, the application writes to both the old and new shard clusters simultaneously (a minimal sketch follows this list).
      • Data Synchronization: Use tools like Canal to subscribe to the old cluster's Binlog and apply incremental changes to the new cluster in real-time.
      • Full Validation and Cutover: After completing full data migration and incremental synchronization, perform data validation and finally switch traffic to the new cluster.
    • Some database middleware (e.g., ShardingSphere's built-in data migration/scaling capability) or cloud provider services offer features to assist with this migration.
  4. (Optional) Embrace Cloud Databases:
    • Distributed database services provided by cloud vendors (like AWS Aurora, Alibaba Cloud PolarDB-X) often have built-in elastic scaling and online change capabilities, significantly simplifying operational complexity but leading to vendor lock-in.
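
A minimal sketch of the dual-write step, assuming two DataSources already routed by the old and new sharding rules (names and SQL are illustrative): the old cluster stays authoritative, the new cluster is written best-effort, and any discrepancy is repaired later by the Binlog-based synchronization and validation steps.

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Dual-write step during online resharding (e.g. 16 -> 32 shards). */
public class DualWriteRepository {

    private final DataSource oldCluster;   // routed by the current sharding rules (source of truth)
    private final DataSource newCluster;   // routed by the target sharding rules

    public DualWriteRepository(DataSource oldCluster, DataSource newCluster) {
        this.oldCluster = oldCluster;
        this.newCluster = newCluster;
    }

    public void updateBalance(long userId, long newBalance) throws SQLException {
        String sql = "UPDATE account SET balance = ? WHERE user_id = ?";
        // 1. The old cluster remains authoritative: any failure here fails the request.
        execute(oldCluster, sql, newBalance, userId);
        // 2. Best-effort write to the new cluster; gaps are fixed by incremental sync + validation.
        try {
            execute(newCluster, sql, newBalance, userId);
        } catch (SQLException e) {
            System.err.println("dual-write to new cluster failed, will be repaired by sync: " + e.getMessage());
        }
    }

    private static void execute(DataSource ds, String sql, long balance, long userId) throws SQLException {
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setLong(1, balance);
            ps.setLong(2, userId);
            ps.executeUpdate();
        }
    }
}
```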

Architecture Adjustment (Using TiDB as an Example):

graph TD
    App --> LoadBalancer("Load Balancer e.g., Nginx/HAProxy");
    LoadBalancer --> TiDBServer1("TiDB Server Stateless SQL Layer Instance 1");
    LoadBalancer --> TiDBServer2("TiDB Server Stateless SQL Layer Instance 2");
    LoadBalancer --> TiDBServerN("TiDB Server Stateless SQL Layer Instance N");
    TiDBServer1 --> PDCluster("PD Cluster - Placement Driver Metadata/Scheduling");
    TiDBServer2 --> PDCluster;
    TiDBServerN --> PDCluster;
    PDCluster -- Raft --> PDC1("PD Instance 1");
    PDCluster -- Raft --> PDC2("PD Instance 2");
    PDCluster -- Raft --> PDC3("PD Instance 3");
    subgraph "TiKV Cluster (Distributed KV Storage Layer)"
        TiKVServer1("TiKV Server Instance 1") -- "Stores Region 1, 4, 7 ..." --> Store1("Store1");
        TiKVServer2("TiKV Server Instance 2") -- "Stores Region 2, 5, 8 ..." --> Store2("Store2");
        TiKVServer3("TiKV Server Instance 3") -- "Stores Region 3, 6, 9 ..." --> Store3("Store3");
        TiKVServer1 <-- Raft --> TiKVServer2;
        TiKVServer2 <-- Raft --> TiKVServer3;
        TiKVServer1 <-- Raft --> TiKVServer3;
    end
    TiDBServer1 -- Read/Write --> TiKVCluster;
    TiDBServer2 -- Read/Write --> TiKVCluster;
    TiDBServerN -- Read/Write --> TiKVCluster;
    PDCluster -- Region Scheduling/Heartbeat --> TiKVCluster;
NewSQL databases like TiDB offer more native distributed capabilities and elasticity.


Phase 5: Single Model Isn't Enough → Exploring HTAP and Multi-Model Databases

New Challenge: The Demand for Hybrid Workloads

  • Business systems need not only to handle high-concurrency online transaction processing (OLTP) but also complex online analytical processing (OLAP), such as real-time reports, user behavior analysis, BI queries, etc.
  • The traditional approach is to ETL (Extract, Transform, Load) OLTP data into a data warehouse (like Hive, ClickHouse) for analysis, but this involves data latency and architectural complexity.
  • Can OLTP and OLAP workloads be efficiently supported within the same database system? This is the goal of HTAP (Hybrid Transactional/Analytical Processing).

❓ Architect's Thinking Moment: How to perform real-time data analysis without affecting online transactions?

(ETL is too slow. Can analysis run directly on the production database? How to isolate resources? Is columnar storage better for analysis?)

✅ Evolution Direction: Adopt a Native HTAP Database or Build a Real-Time CDC + OLAP Pipeline

  1. Adopt a Native HTAP Database:
    • Some distributed databases (like TiDB) achieve HTAP architecture by introducing a columnar storage engine (like TiFlash) as a special replica.
    • OLTP requests are routed to replicas based on row storage (TiKV) to ensure transaction performance.
    • OLAP requests are routed to replicas based on columnar storage (TiFlash), leveraging columnar advantages (high compression, column scanning) to accelerate analytical queries.
    • Data is automatically synchronized from row storage to column storage via the Raft protocol, ensuring real-time consistency.
  2. Build Real-time Data Synchronization Link (CDC + OLAP Database):
    • If continuing to use MySQL or other OLTP databases, CDC (Change Data Capture) tools (like Canal, Debezium) can capture database change logs (Binlog) in real-time.
    • Synchronize the change data in real-time to a dedicated OLAP database (like ClickHouse, Doris, StarRocks).
    • OLAP queries execute directly on the analytical database, separate from the OLTP workload.
    • This approach offers more flexibility in choosing the optimal OLAP engine but requires maintaining the data synchronization link (a generic sketch follows this list).
  3. Resource Isolation:
    • Regardless of the chosen solution, ensure OLAP queries do not impact OLTP performance. Resource isolation can be achieved through resource groups, priority queues, physical isolation, etc.
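
A minimal, connector-agnostic sketch of the apply side of option 2: change events captured from the OLTP Binlog (by Canal, Debezium, or similar) are batched into an append-only fact table in the analytical store via plain JDBC. The event shape, JDBC URL, and table name are assumptions; a ClickHouse JDBC driver or an equivalent for the chosen OLAP database would be required in practice.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Applies micro-batches of CDC change events to an analytical store, keeping OLAP queries off the OLTP cluster. */
public class ChangeEventApplier {

    /** Simplified change event as it might arrive from a CDC pipeline (hypothetical shape). */
    public record OrderChange(long orderId, long userId, long amountCents, String op, long binlogTs) {}

    private static final String OLAP_URL = "jdbc:clickhouse://olap:8123/analytics"; // placeholder URL

    /** Apply a micro-batch of change events in a single round trip. */
    public static void applyBatch(List<OrderChange> events) throws SQLException {
        String sql = "INSERT INTO order_events (order_id, user_id, amount_cents, op, binlog_ts) VALUES (?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(OLAP_URL);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (OrderChange e : events) {
                ps.setLong(1, e.orderId());
                ps.setLong(2, e.userId());
                ps.setLong(3, e.amountCents());
                ps.setString(4, e.op());
                ps.setLong(5, e.binlogTs());
                ps.addBatch();
            }
            ps.executeBatch();   // batching keeps write amplification on the OLAP side low
        }
    }
}
```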

Architecture Adjustment (Using TiDB HTAP as an Example):

graph TD
    subgraph "Application Layer"
        OLTP_App("Online Transaction App") --> TiDB_SQL("TiDB SQL Layer");
        OLAP_App("Data Analysis/Reporting App") --> TiDB_SQL;
    end
    subgraph "TiDB Cluster"
        TiDB_SQL --> PDCluster("PD Cluster");
        TiDB_SQL --> TiKVCluster("TiKV Cluster (Row Store - OLTP)");
        TiDB_SQL --> TiFlashCluster("TiFlash Cluster (Column Store - OLAP)");
        TiKVCluster -- Raft Learner Sync --> TiFlashCluster;
        PDCluster -- Scheduling --> TiKVCluster;
        PDCluster -- Scheduling --> TiFlashCluster;
    end

Summary: The Evolutionary Path of Distributed Database Systems

| Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
|---|---|---|---|
| 0. Single Node | Perf Bottleneck / SPOF / Capacity | Single MySQL Instance | MySQL, InnoDB |
| 1. Read Split | Read Performance | Master-Slave Replication + Read/Write Splitting Proxy | Binlog, ProxySQL, ShardingSphere-Proxy |
| 2. Sharding | Write Perf / Capacity Limit | Horizontal Sharding + Distributed ID | ShardingSphere, MyCAT, Snowflake, Segment |
| 3. Dist. Tx | Cross-Shard Consistency | Flexible Transactions (MQ, TCC, Saga) or Seata | RocketMQ Tx Msg, Seata (AT/TCC/Saga) |
| 4. Elasticity | Online Scaling / DDL | NewSQL DB or Online Migration/DDL Tools | TiDB, CockroachDB, gh-ost, pt-online-schema-change, Canal |
| 5. HTAP | Hybrid Workloads (OLTP+OLAP) | Native HTAP DB (Row+Column Store) or CDC+OLAP DB | TiDB+TiFlash, CDC (Debezium) + ClickHouse/Doris |