Case 6: E-commerce Platform
title: Building High-Availability Systems: Fault Tolerance, Redundancy & Recovery - System Design Course 6
description: Deep dive into system reliability design, learning core fault tolerance mechanisms like fault domain isolation, redundancy design, automatic failover, graceful degradation, and circuit breakers.
Course 6: Large E-commerce Platform Architecture Evolution Case Study
Goal: E-commerce platforms represent the pinnacle of internet architecture complexity. This course walks through how a typical e-commerce platform starts from scratch, gradually copes with surging user numbers, expanding business functions, high-concurrency challenges, and data-consistency problems, and finally evolves into a complex distributed system. You will gain an in-depth understanding of core architectural designs such as microservice decomposition, service governance, distributed transactions, caching, database scaling, message-queue decoupling, and multi-active deployment.
Phase 0: Humble Beginnings (Monolithic Application + MySQL)
System Description
- A startup e-commerce website with core functions: user registration/login, product browsing, shopping cart, ordering, simple backend management.
- Tech Stack: LAMP/LNMP (Linux + Apache/Nginx + MySQL + PHP/Python/Java)
- Web Server: Apache or Nginx
- Application Framework: e.g., PHP's Laravel/ThinkPHP, Java's Spring Boot
- Database: Single MySQL instance holding all business data.
- Deployment: All code packaged into one WAR/JAR or PHP files deployed directly to a single server.
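As a rough illustration of this starting point, here is a minimal sketch of the monolith as a single Spring Boot application in which user, product, and order handling all share one codebase, one process, and one database. The endpoints and return values are hypothetical placeholders, not part of the course's reference design.

```java
// Minimal sketch of the Phase 0 monolith: every business capability lives in
// one Spring Boot application (names and payloads here are hypothetical).
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.*;

import java.util.Map;

@SpringBootApplication
@RestController
public class ShopMonolithApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShopMonolithApplication.class, args);
    }

    // User, product and order logic are all handled in the same process,
    // backed by the same single MySQL instance.
    @PostMapping("/users/register")
    public Map<String, Object> register(@RequestBody Map<String, String> form) {
        return Map.of("userId", 1L, "name", form.get("name"));
    }

    @GetMapping("/products/{id}")
    public Map<String, Object> product(@PathVariable long id) {
        return Map.of("productId", id, "title", "demo product");
    }

    @PostMapping("/orders")
    public Map<String, Object> placeOrder(@RequestBody Map<String, Object> cart) {
        // In the real monolith this would touch the users, inventory and orders
        // tables in one local transaction on the single database.
        return Map.of("orderId", 1001L, "status", "CREATED");
    }
}
```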
Current Architecture Diagram
graph TD
User[User] --> Browser[Browser];
Browser --> LB("Load Balancer, Optional");
LB --> AppServer("App Server: Nginx + PHP/Java App");
AppServer --> DBMaster("MySQL Master");
Admin[Admin] --> AppServer;
Pain Points at This Stage:
- Low Development Efficiency: All functions coupled in one codebase, conflicts arise easily when multiple people modify the same file, code merging is difficult. Any small change requires recompiling, testing, and deploying the entire application.
- Single Tech Stack: Cannot choose the most suitable technology for different business modules (e.g., Elasticsearch might be better for the search module).
- Poor Reliability: A bug in even a minor feature can bring down the entire website.
- Limited Scalability: Cannot scale specific bottleneck modules (like product queries) independently; the only option is to add machines running the entire application, which is costly.
Phase 1: Can't Handle It Anymore → Application and Data Splitting
Challenge Emerges
- Rapid growth in user volume and order numbers overwhelms the monolithic application, causing slow responses.
- The single MySQL database is under immense pressure: master-slave lag increases and the write bottleneck becomes apparent.
- Development teams for different business lines (e.g., product team, order team, user team) interfere with each other, making release coordination difficult.
❓ Architect's Thinking Moment: How to decompose this behemoth to improve overall capacity and development efficiency?
(Vertical or horizontal splitting? How to split the database? How should services communicate after splitting?)
✅ Evolution Direction: Vertically Split Application + Database Master-Slave/Sharding
- Vertical Application Splitting (By Business Line):
- Split the monolithic application into multiple independent subsystems (or services) based on business boundaries, such as: User Service, Product Service, Order Service, Payment Service.
- Each service has its own independent codebase, development team, and deployment process.
- Initially, services can communicate via HTTP API (RESTful) or RPC (like Dubbo, gRPC, Thrift), as sketched after this list.
- Database Master-Slave Separation:
- Configure MySQL master-slave replication. Route read requests (the majority) to Slave(s), and write requests to the Master.
- Introduce database middleware (like MyCAT, ShardingSphere-Proxy) or implement read-write splitting logic at the application layer to shield master-slave details from the application; a minimal application-layer routing sketch follows the architecture diagram below.
- (Optional) Vertical Database Sharding:
- If a single business's data volume or QPS is still too large, further vertically split its database. For example, put user-related tables (basic info, address, points) into `user_db`, and product-related tables into `product_db`.
- Note: Vertical sharding introduces the problem of cross-database JOINs, which is usually solved by aggregating data at the service layer. Avoid direct joins at the database layer if possible.
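To make the inter-service communication above concrete, here is a minimal sketch of the Order Service calling the Product Service over plain HTTP, using the JDK's built-in HttpClient (Java 11+). The hostname, port, endpoint, and error handling are hypothetical assumptions; in practice an RPC framework or a generated/typed client would replace this hand-written call.

```java
// Minimal sketch: the Order Service fetches product data from the Product
// Service over plain HTTP (RESTful) before creating an order. The URL and
// response handling below are illustrative, not a fixed contract.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ProductClient {

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    /** Fetch raw product JSON from the (hypothetical) Product Service endpoint. */
    public String getProductJson(long productId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://product-service:8080/products/" + productId))
                .timeout(Duration.ofSeconds(1))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("product-service returned " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new ProductClient().getProductJson(42L));
    }
}
```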
Architecture Adjustment:
graph TD
User --> LB("Load Balancer");
LB --> Gateway("API Gateway / BFF");
Gateway --> UserService("User Service");
Gateway --> ProductService("Product Service");
Gateway --> OrderService("Order Service");
UserDBMaster("User DB Master");
UserDBSlave("User DB Slave");
ProductDBMaster("Product DB Master");
ProductDBSlave("Product DB Slave");
OrderDBMaster("Order DB Master");
OrderDBSlave("Order DB Slave");
%% Read-Write Split
UserService -- Write --> UserDBMaster;
UserService -- Read --> UserDBSlave;
ProductService -- Write --> ProductDBMaster;
ProductService -- Read --> ProductDBSlave;
OrderService -- Write --> OrderDBMaster;
OrderService -- Read --> OrderDBSlave;
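The read/write split shown in the diagram can live in middleware or in the application itself. Below is a minimal application-layer sketch, assuming the master and replica DataSources are already configured elsewhere; the class and method names are made up for illustration.

```java
// Minimal application-layer read/write routing sketch: writes go to the master
// DataSource, reads are spread round-robin across replicas. In practice this is
// usually provided by middleware (MyCAT, ShardingSphere) or a framework hook.
import javax.sql.DataSource;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadWriteRouter {

    private final DataSource master;
    private final List<DataSource> replicas;
    private final AtomicInteger counter = new AtomicInteger();

    public ReadWriteRouter(DataSource master, List<DataSource> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    /** All INSERT/UPDATE/DELETE statements must hit the master. */
    public DataSource forWrite() {
        return master;
    }

    /** SELECTs are load-balanced over the replicas (fall back to master if none). */
    public DataSource forRead() {
        if (replicas.isEmpty()) {
            return master;
        }
        int idx = Math.floorMod(counter.getAndIncrement(), replicas.size());
        return replicas.get(idx);
    }
}
```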
Phase 2: Too Many Services to Manage → Introduce Service Governance & Distributed Configuration
New Challenge: Early Signs of Microservice Pain
- The number of services increases, and dependencies between services become complex.
- How to effectively discover newly launched service instances?
- How to perform load balancing for service calls?
- What if a service dies? How to implement circuit breaking and degradation to prevent cascading failures?
- Configurations for each service (database addresses, third-party keys, etc.) are scattered everywhere, leading to management chaos.
❓ Architect's Thinking Moment: How to manage this pile of "small services"? How to centralize configuration?
(Need a registry? Client-side or server-side load balancing? How to do circuit breaking and rate limiting? How do configuration changes take effect in real-time?)
✅ Evolution Direction: Introduce Service Registry/Discovery + Config Center + Service Governance Framework
- Service Registration & Discovery:
- Introduce a Service Registry (like Consul, Nacos, ZooKeeper, Eureka).
- Service providers register their addresses with the registry upon startup.
- Service consumers obtain the list of provider addresses from the registry.
- Service Invocation & Load Balancing:
- After obtaining the address list, service consumers use a load balancing algorithm (like round-robin, random, weighted) to select an instance for invocation. This is often handled by RPC frameworks (like Dubbo) or specialized client libraries (like Ribbon).
- Circuit Breaking, Degradation, Rate Limiting:
- Introduce a service governance framework (like Sentinel, Hystrix) or capabilities built into the RPC framework.
- Circuit Breaking: When calls to a specific service keep failing and the failure count reaches a threshold, temporarily "break the circuit" to that service and fail fast to avoid resource exhaustion (a minimal sketch follows this list).
- Degradation: During high concurrency or when non-core services fail, proactively disable or simplify certain non-core functions to ensure core process availability.
- Rate Limiting: Control the number of requests to a service per unit of time to prevent the system from being overwhelmed by sudden traffic bursts.
- Distributed Configuration Center:
- Introduce a Config Center (like Nacos, Apollo, Spring Cloud Config).
- Store configurations for all services centrally.
- Applications pull configurations from the center on startup and can listen for changes, enabling dynamic configuration updates.
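Purely to illustrate the circuit-breaking behavior described above, here is a minimal hand-rolled breaker with the usual CLOSED → OPEN → HALF_OPEN transitions. The thresholds and timeout are arbitrary illustrative values; a real system would use Sentinel, Hystrix, or Resilience4j rather than rolling its own.

```java
// Minimal circuit breaker sketch (CLOSED -> OPEN -> HALF_OPEN). This only
// illustrates the mechanism; use a governance framework in production.
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;      // consecutive failures before opening
    private final long openTimeoutMillis;    // how long to stay open before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;   // allow one probe request through
            } else {
                return fallback.get();     // fail fast: degrade instead of waiting
            }
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;       // success closes the breaker again
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

A caller would wrap each downstream invocation, e.g. something like `breaker.call(() -> userClient.getUser(id), () -> FALLBACK_USER)` (with a hypothetical client and fallback), so failures degrade gracefully instead of piling up threads.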
Architecture Adjustment (Adding Governance & Config Layer):
graph TD
subgraph "Service Governance & Configuration"
Registry("Nacos/Consul Registry");
ConfigCenter("Nacos/Apollo Config Center");
end
subgraph "Service Layer"
UserService -- Register/Discover --> Registry;
UserService -- Pull Config --> ConfigCenter;
ProductService -- Register/Discover --> Registry;
ProductService -- Pull Config --> ConfigCenter;
OrderService -- Register/Discover --> Registry;
OrderService -- Pull Config --> ConfigCenter;
%% Inter-service calls
OrderService -- "RPC (Load Balance/Circuit Break)" --> UserService;
OrderService -- "RPC (Load Balance/Circuit Break)" --> ProductService;
end
subgraph "Data Layer"
UserDB("User DB");
ProductDB("Product DB");
OrderDB("Order DB");
end
UserService --> UserDB;
ProductService --> ProductDB;
OrderService --> OrderDB;
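As a concrete example of registration, discovery, and dynamic configuration, here is a sketch using the Nacos Java client (nacos-client). The server address, service names, data ID, and group are placeholders; with Spring Cloud Alibaba most of this is wired up automatically, but the raw API makes the mechanics visible.

```java
// Sketch of service registration/discovery and dynamic configuration with Nacos.
// Assumes the nacos-client dependency; addresses, service names, data IDs and
// groups below are placeholders, not values prescribed by this course.
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

import java.util.List;
import java.util.Properties;
import java.util.concurrent.Executor;

public class NacosBootstrap {

    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        props.put("serverAddr", "127.0.0.1:8848");   // Nacos server address (placeholder)

        // 1. Register this instance so consumers can discover it.
        NamingService naming = NacosFactory.createNamingService(props);
        naming.registerInstance("order-service", "192.168.0.10", 8080);

        // 2. Discover healthy instances of a dependency; client-side load balancing
        //    would then pick one (e.g. round-robin).
        List<Instance> userInstances = naming.selectInstances("user-service", true);
        userInstances.forEach(i -> System.out.println(i.getIp() + ":" + i.getPort()));

        // 3. Pull configuration once at startup and listen for later changes.
        ConfigService config = NacosFactory.createConfigService(props);
        String initial = config.getConfig("order-service.yaml", "DEFAULT_GROUP", 3000);
        System.out.println("initial config:\n" + initial);

        config.addListener("order-service.yaml", "DEFAULT_GROUP", new Listener() {
            @Override public Executor getExecutor() { return null; } // run on the client's thread
            @Override public void receiveConfigInfo(String newConfig) {
                System.out.println("config changed:\n" + newConfig); // hot-reload here
            }
        });
    }
}
```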
Phase 3: Overwhelming Read Pressure → Introduce Distributed Cache
Challenge Escalates: Database Read Bottleneck
- As user volume grows further and major promotions like "Double Eleven" arrive, QPS for high-frequency read scenarios like product detail pages and homepage recommendations skyrockets.
- Even with master-slave separation, a large number of read requests overwhelm the database slaves.
❓ Architect's Thinking Moment: How to significantly improve read performance without drastically increasing database costs?
(Cache is king. Which cache to use? Redis or Memcached? How to handle cache avalanche, penetration, breakdown?)
✅ Evolution Direction: Introduce Distributed Cache Cluster (Redis/Memcached)
- Cache Selection:
- Redis is often chosen due to its richer data structures (String, Hash, List, Set, Sorted Set) and persistence capabilities, allowing for more diverse use cases.
- Memcached is simpler, potentially slightly more efficient in memory management, suitable for pure KV caching.
- Caching Strategy:
- Cache-Aside Pattern: Most common. Read: read the cache first; on a miss, read the DB, then write the value back to the cache. Write: update the DB first, then invalidate the cache (delete rather than update it, to avoid concurrent-update issues). A sketch follows this list.
- Set reasonable Time-To-Live (TTL) for cache entries.
- Cache Location:
- Local Cache: e.g., Guava Cache, Caffeine. Caches within the application JVM, fastest speed but limited capacity and higher risk of data inconsistency.
- Distributed Cache: e.g., Redis Cluster, Memcached Cluster. Deployed independently, large capacity, shared by all service instances, the primary caching layer used.
- Handling Cache Issues:
- Cache Penetration (querying non-existent data): Use a Bloom Filter for pre-check, or cache null objects.
- Cache Breakdown (hot key expires): Use distributed locks (like Redisson) or mutex locks, allowing only one request to load data and write back to the cache.
- Cache Avalanche (many keys expire simultaneously): Set randomized expiration times, use cache pre-warming, or implement multi-level caching (local + distributed).
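Here is a minimal sketch of the Cache-Aside pattern just described, using the Jedis client. The key format, TTL values, null-object marker, and stubbed database calls are illustrative assumptions; the sketch also folds in two of the defenses listed above (null-object caching and randomized TTLs).

```java
// Cache-Aside sketch with Redis (Jedis): read-through on miss, delete on write,
// null-object caching against penetration, randomized TTL against avalanche.
// Key format, TTLs and the database lookup below are illustrative assumptions.
import redis.clients.jedis.Jedis;

import java.util.concurrent.ThreadLocalRandom;

public class ProductCache {

    private static final String NULL_MARKER = "__NULL__"; // cached "not found" sentinel
    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    public String getProduct(long id) {
        String key = "product:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            return NULL_MARKER.equals(cached) ? null : cached;   // cache hit
        }
        String fromDb = loadFromDatabase(id);                    // cache miss -> DB
        if (fromDb == null) {
            jedis.setex(key, 60, NULL_MARKER);                   // cache the miss briefly
            return null;
        }
        // Randomize TTL (10-13 minutes here) so hot keys don't all expire together.
        int ttl = 600 + ThreadLocalRandom.current().nextInt(180);
        jedis.setex(key, ttl, fromDb);
        return fromDb;
    }

    public void updateProduct(long id, String newValue) {
        saveToDatabase(id, newValue);          // 1. update the DB first
        jedis.del("product:" + id);            // 2. then invalidate (not update) the cache
    }

    // Stubbed persistence layer, purely for illustration.
    private String loadFromDatabase(long id) { return id % 2 == 0 ? "{\"id\":" + id + "}" : null; }
    private void saveToDatabase(long id, String v) { /* UPDATE product SET ... */ }
}
```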
Architecture Adjustment (Adding Cache Layer):
graph TD
subgraph "Application Layer"
UserService;
ProductService;
OrderService;
end
subgraph "Cache Layer"
RedisCluster("Redis Cluster");
end
subgraph "Data Layer"
UserDB("User DB");
ProductDB("Product DB");
OrderDB("Order DB");
end
%% Read/Write Paths
UserService -- Read --> RedisCluster -- Cache Miss --> UserDB;
UserService -- "Write DB & Invalidate Cache" --> UserDB;
ProductService -- Read --> RedisCluster -- Cache Miss --> ProductDB;
ProductService -- "Write DB & Invalidate Cache" --> ProductDB;
OrderService -- Read --> RedisCluster -- Cache Miss --> OrderDB;
OrderService -- "Write DB & Invalidate Cache" --> OrderDB;
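For cache breakdown specifically, the mutex approach mentioned above can be sketched with a plain Redis `SET NX EX` lock: only the request that wins the lock reloads the database and refills the cache. The lock key, TTLs, and retry policy are illustrative; a production system would typically use Redisson's locks and verify lock ownership before releasing.

```java
// Sketch of a simple Redis mutex to mitigate cache breakdown for a hot key.
// Lock key, TTLs and retry policy are illustrative; prefer Redisson in practice
// and check the lock owner before deleting it.
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class HotKeyLoader {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    public String getHotProduct(long id) throws InterruptedException {
        String key = "product:" + id;
        String lockKey = "lock:" + key;
        for (int attempt = 0; attempt < 50; attempt++) {
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                                        // normal cache hit
            }
            // SET lock NX EX 10: only one caller gets "OK" and rebuilds the cache.
            if ("OK".equals(jedis.set(lockKey, "1", SetParams.setParams().nx().ex(10)))) {
                try {
                    String fromDb = loadFromDatabase(id);             // single DB reload
                    jedis.setex(key, 600, fromDb);
                    return fromDb;
                } finally {
                    jedis.del(lockKey);                               // release the mutex
                }
            }
            Thread.sleep(50);                                         // lost the race: wait, then re-check cache
        }
        return loadFromDatabase(id);                                  // fallback after too many retries
    }

    // Stubbed persistence layer, purely for illustration.
    private String loadFromDatabase(long id) { return "{\"id\":" + id + "}"; }
}
```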
Phase 4: Writes Also Can't Keep Up → Database Horizontal Sharding
Ultimate Challenge: Database Write Bottleneck & Capacity Limit
- Order volume and user numbers continue explosive growth. The write QPS of single business databases (even with master-slave) hits bottlenecks.
- Single-table data volume becomes excessively large (e.g., hundreds of millions of rows in the orders table), making queries, index maintenance, and DDL operations extremely slow or impossible.
- Database storage capacity also approaches limits.
❓ Architect's Thinking Moment: What's the ultimate scaling solution for the database?
(Vertical splitting is maxed out. How to do horizontal splitting? By what dimension? How to choose the sharding key? How to generate distributed IDs? How to migrate smoothly?)
✅ Evolution Direction: Database Horizontal Sharding + Distributed ID Generator
- Horizontal Sharding Strategy:
- By Range: e.g., by time range (one order table per month) or ID range. Pro: simple scaling. Con: potential data hotspots (e.g., most recent month's data accessed most frequently).
- By Hash: Choose a Sharding Key (like `user_id` or `order_id`), calculate its hash, then take it modulo the number of databases/tables to determine where the data lands (a minimal routing sketch follows this list).
- Choosing the right sharding key is crucial: usually select the field most frequently used in queries (User ID, Order ID, Product ID), and avoid hotspots (e.g., all orders from a big seller landing on one shard).
- Common combination: shard databases by `user_id` hash first, then shard tables within each database by `order_id` hash or time range.
- Database Sharding Middleware:
- Introduce database sharding middleware (like ShardingSphere-JDBC/Proxy, TDDL, MyCAT).
- The middleware parses SQL, routes requests to the correct physical databases/tables based on sharding rules, and aggregates results, aiming for transparency to the application layer.
- Distributed ID Generator:
- After sharding, database auto-increment primary keys are no longer globally unique. Need a distributed ID generation service.
- Common solutions: Snowflake algorithm, UUID (too long, not recommended as primary key), database sequence segment pattern, Redis Incr.
- Data Migration & Resharding:
- Requires detailed data migration plans (like dual writes, incremental sync + full verification) and smooth resharding plans (like doubling capacity). This is a very complex and high-risk operation.
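The hash-based routing referenced in the list above boils down to a couple of modulo operations once a sharding key is chosen. Below is a minimal sketch; the shard counts (2 databases × 4 tables) and naming scheme are arbitrary illustrative choices, and middleware such as ShardingSphere performs this routing plus SQL rewriting and result merging for you.

```java
// Minimal hash-sharding routing sketch: user_id picks the database, order_id
// picks the table inside that database (for numeric IDs the ID itself commonly
// serves as the hash). Shard counts and names are illustrative only.
public class ShardRouter {

    private static final int DB_COUNT = 2;      // order_db_0, order_db_1
    private static final int TABLE_COUNT = 4;   // t_order_0 .. t_order_3 per database

    /** Databases are sharded by user_id so one user's orders stay together. */
    public static int databaseIndex(long userId) {
        return (int) Math.floorMod(userId, (long) DB_COUNT);
    }

    /** Tables inside a database are sharded by order_id to spread writes. */
    public static int tableIndex(long orderId) {
        return (int) Math.floorMod(orderId, (long) TABLE_COUNT);
    }

    /** Resolve the physical table an order row lives in, e.g. order_db_1.t_order_3. */
    public static String physicalTable(long userId, long orderId) {
        return "order_db_" + databaseIndex(userId) + ".t_order_" + tableIndex(orderId);
    }

    public static void main(String[] args) {
        System.out.println(physicalTable(10086L, 202400001234L));
    }
}
```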
Architecture Adjustment (Deep Database Layer Transformation):
graph TD
subgraph "Application Layer"
UserService --> ShardingMiddleware("Sharding Middleware ShardingSphere/TDDL");
OrderService --> ShardingMiddleware;
end
subgraph "Distributed ID"
IDGenerator("Distributed ID Generator Service");
UserService -- Get ID --> IDGenerator;
OrderService -- Get ID --> IDGenerator;
end
subgraph "Database Cluster (Sharded)"
UserDB_0("UserDB_0");
UserDB_1("UserDB_1");
UserDB_etc("...");
UserDB_N("UserDB_N");
OrderDB_0("OrderDB_0");
OrderDB_1("OrderDB_1");
OrderDB_etc("...");
OrderDB_M("OrderDB_M");
end
ShardingMiddleware -- Route R/W --> UserDB_0;
ShardingMiddleware -- Route R/W --> UserDB_1;
ShardingMiddleware -- Route R/W --> UserDB_N;
ShardingMiddleware -- Route R/W --> OrderDB_0;
ShardingMiddleware -- Route R/W --> OrderDB_1;
ShardingMiddleware -- Route R/W --> OrderDB_M;
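To make the distributed ID requirement concrete, here is a compact Snowflake-style generator sketch using the common 41/10/12 bit layout and a custom epoch. These parameters are illustrative; production systems usually rely on a hardened implementation or on the segment/Redis approaches mentioned above.

```java
// Compact Snowflake-style ID sketch: 41 bits of timestamp + 10 bits of worker id
// + 12 bits of per-millisecond sequence, giving roughly time-ordered, globally
// unique 64-bit IDs without a central database. Epoch and bit widths are the
// usual illustrative choices; use a hardened implementation in production.
public class SnowflakeIdGenerator {

    private static final long EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z (custom epoch)
    private static final long WORKER_BITS = 10L;      // up to 1024 workers
    private static final long SEQUENCE_BITS = 12L;    // up to 4096 IDs per worker per ms
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards, refusing to generate IDs");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                  // sequence exhausted in this millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* spin */ }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SnowflakeIdGenerator gen = new SnowflakeIdGenerator(1);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId());
    }
}
```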
Phase 5: Inter-Service Dependencies & Transaction Dilemmas → Introduce Message Queue & Distributed Transactions
(Content continues in the original file)