Case 6: E-commerce Platform
title: Building High-Availability Systems: Fault Tolerance, Redundancy & Recovery - System Design Course 6
description: Deep dive into system reliability design, learning core fault tolerance mechanisms like fault domain isolation, redundancy design, automatic failover, graceful degradation, and circuit breakers.
Course 6: Large E-commerce Platform Architecture Evolution Case Study
Goal: E-commerce platforms represent the pinnacle of internet architecture complexity. This course walks through how a typical e-commerce platform starts from scratch, gradually copes with surging user numbers, expanding business functions, high-concurrency challenges, and data-consistency problems, and finally evolves into a complex distributed system. You will gain an in-depth understanding of core architectural designs such as microservice decomposition, service governance, distributed transactions, caching, database scaling, message-queue decoupling, and multi-active deployment.
Phase 0: Humble Beginnings (Monolithic Application + MySQL)
System Description
- A startup e-commerce website with core functions: user registration/login, product browsing, shopping cart, ordering, simple backend management.
- Tech Stack: LAMP/LNMP (Linux + Apache/Nginx + MySQL + PHP/Python/Java)
- Web Server: Apache or Nginx
- Application Framework: e.g., PHP's Laravel/ThinkPHP, Java's Spring Boot
- Database: Single MySQL instance holding all business data.
- Deployment: All code packaged into one WAR/JAR or PHP files deployed directly to a single server.
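As a rough illustration of this starting point, here is a minimal sketch of the monolith as a single Spring Boot application in which user, product, and order handling all share one codebase, one process, and one database. The endpoints and return values are hypothetical placeholders, not part of the course's reference design.

```java
// Minimal sketch of the Phase 0 monolith: every business capability lives in
// one Spring Boot application (names and payloads here are hypothetical).
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.*;

import java.util.Map;

@SpringBootApplication
@RestController
public class ShopMonolithApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShopMonolithApplication.class, args);
    }

    // User, product and order logic are all handled in the same process,
    // backed by the same single MySQL instance.
    @PostMapping("/users/register")
    public Map<String, Object> register(@RequestBody Map<String, String> form) {
        return Map.of("userId", 1L, "name", form.get("name"));
    }

    @GetMapping("/products/{id}")
    public Map<String, Object> product(@PathVariable long id) {
        return Map.of("productId", id, "title", "demo product");
    }

    @PostMapping("/orders")
    public Map<String, Object> placeOrder(@RequestBody Map<String, Object> cart) {
        // In the real monolith this would touch the users, inventory and orders
        // tables in one local transaction on the single database.
        return Map.of("orderId", 1001L, "status", "CREATED");
    }
}
```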
Current Architecture Diagram
graph TD
User[User] --> Browser[Browser];
Browser --> LB("Load Balancer, Optional");
LB --> AppServer("App Server: Nginx + PHP/Java App");
AppServer --> DBMaster("MySQL Master");
Admin[Admin] --> AppServer;
Pain Points at This Stage:
- Low Development Efficiency: All functions coupled in one codebase, conflicts arise easily when multiple people modify the same file, code merging is difficult. Any small change requires recompiling, testing, and deploying the entire application.
- Single Tech Stack: Cannot choose the most suitable technology for different business modules (e.g., Elasticsearch might be better for the search module).
- Poor Reliability: A bug in even a minor feature can bring down the entire website.
- Limited Scalability: Cannot scale specific bottleneck modules (like product queries) independently; the only option is to add machines running the entire application, which is costly.
Phase 1: Can't Handle It Anymore → Application and Data Splitting
Challenge Emerges
- Rapid growth in user volume and order numbers overwhelms the monolithic application, causing slow responses.
- The single MySQL database is under immense pressure: master-slave lag increases and the write bottleneck becomes apparent.
- Development teams for different business lines (e.g., product team, order team, user team) interfere with each other, making release coordination difficult.
❓ Architect's Thinking Moment: How to decompose this behemoth to improve overall capacity and development efficiency?
(Vertical or horizontal splitting? How to split the database? How should services communicate after splitting?)
✅ Evolution Direction: Vertically Split Application + Database Master-Slave/Sharding
- Vertical Application Splitting (By Business Line):
- Split the monolithic application into multiple independent subsystems (or services) based on business boundaries, such as: User Service, Product Service, Order Service, Payment Service.
- Each service has its own independent codebase, development team, and deployment process.
- Initially, services can communicate via HTTP API (RESTful) or RPC (like Dubbo, gRPC, Thrift), as sketched after this list.
- Database Master-Slave Separation:
- Configure MySQL master-slave replication. Route read requests (the majority) to Slave(s), and write requests to the Master.
- Introduce database middleware (like MyCAT, ShardingSphere-Proxy) or implement read-write splitting logic at the application layer to shield master-slave details from the application; a minimal application-layer routing sketch follows the architecture diagram below.
- (Optional) Vertical Database Sharding:
- If a single business's data volume or QPS is still too large, further vertically split its database. For example, put user-related tables (basic info, address, points) into `user_db`, and product-related tables into `product_db`.
- Note: Vertical sharding introduces the problem of cross-database JOINs, which is usually solved by aggregating data at the service layer. Avoid direct joins at the database layer if possible.
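To make the inter-service communication above concrete, here is a minimal sketch of the Order Service calling the Product Service over plain HTTP, using the JDK's built-in HttpClient (Java 11+). The hostname, port, endpoint, and error handling are hypothetical assumptions; in practice an RPC framework or a generated/typed client would replace this hand-written call.

```java
// Minimal sketch: the Order Service fetches product data from the Product
// Service over plain HTTP (RESTful) before creating an order. The URL and
// response handling below are illustrative, not a fixed contract.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ProductClient {

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    /** Fetch raw product JSON from the (hypothetical) Product Service endpoint. */
    public String getProductJson(long productId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://product-service:8080/products/" + productId))
                .timeout(Duration.ofSeconds(1))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("product-service returned " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new ProductClient().getProductJson(42L));
    }
}
```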
Architecture Adjustment:
graph TD
User --> LB("Load Balancer");
LB --> Gateway("API Gateway / BFF");
Gateway --> UserService("User Service");
Gateway --> ProductService("Product Service");
Gateway --> OrderService("Order Service");
UserDBMaster("User DB Master");
UserDBSlave("User DB Slave");
ProductDBMaster("Product DB Master");
ProductDBSlave("Product DB Slave");
OrderDBMaster("Order DB Master");
OrderDBSlave("Order DB Slave");
%% Read-Write Split
UserService -- Write --> UserDBMaster;
UserService -- Read --> UserDBSlave;
ProductService -- Write --> ProductDBMaster;
ProductService -- Read --> ProductDBSlave;
OrderService -- Write --> OrderDBMaster;
OrderService -- Read --> OrderDBSlave;
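The read/write split shown in the diagram can live in middleware or in the application itself. Below is a minimal application-layer sketch, assuming the master and replica DataSources are already configured elsewhere; the class and method names are made up for illustration.

```java
// Minimal application-layer read/write routing sketch: writes go to the master
// DataSource, reads are spread round-robin across replicas. In practice this is
// usually provided by middleware (MyCAT, ShardingSphere) or a framework hook.
import javax.sql.DataSource;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadWriteRouter {

    private final DataSource master;
    private final List<DataSource> replicas;
    private final AtomicInteger counter = new AtomicInteger();

    public ReadWriteRouter(DataSource master, List<DataSource> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    /** All INSERT/UPDATE/DELETE statements must hit the master. */
    public DataSource forWrite() {
        return master;
    }

    /** SELECTs are load-balanced over the replicas (fall back to master if none). */
    public DataSource forRead() {
        if (replicas.isEmpty()) {
            return master;
        }
        int idx = Math.floorMod(counter.getAndIncrement(), replicas.size());
        return replicas.get(idx);
    }
}
```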
Phase 2: Too Many Services to Manage → Introduce Service Governance & Distributed Configuration
New Challenge: Early Signs of Microservice Pain
- The number of services increases, and dependencies between services become complex.
- How to effectively discover newly launched service instances?
- How to perform load balancing for service calls?
- What if a service dies? How to implement circuit breaking and degradation to prevent cascading failures?
- Configurations for each service (database addresses, third-party keys, etc.) are scattered everywhere, leading to management chaos.
❓ Architect's Thinking Moment: How to manage this pile of "small services"? How to centralize configuration?
(Need a registry? Client-side or server-side load balancing? How to do circuit breaking and rate limiting? How do configuration changes take effect in real-time?)
✅ Evolution Direction: Introduce Service Registry/Discovery + Config Center + Service Governance Framework
- Service Registration & Discovery:
- Introduce a Service Registry (like Consul, Nacos, ZooKeeper, Eureka).
- Service providers register their addresses with the registry upon startup.
- Service consumers obtain the list of provider addresses from the registry.
- Service Invocation & Load Balancing:
- After obtaining the address list, service consumers use a load balancing algorithm (like round-robin, random, weighted) to select an instance for invocation. This is often handled by RPC frameworks (like Dubbo) or specialized client libraries (like Ribbon).
- Circuit Breaking, Degradation, Rate Limiting:
- Introduce a service governance framework (like Sentinel, Hystrix) or capabilities built into the RPC framework.
- Circuit Breaking: When calls to a specific service keep failing and the failure count reaches a threshold, temporarily "break the circuit" to that service and fail fast to avoid resource exhaustion (a minimal sketch follows this list).
- Degradation: During high concurrency or when non-core services fail, proactively disable or simplify certain non-core functions to ensure core process availability.
- Rate Limiting: Control the number of requests to a service per unit of time to prevent the system from being overwhelmed by sudden traffic bursts.
- Distributed Configuration Center:
- Introduce a Config Center (like Nacos, Apollo, Spring Cloud Config).
- Store configurations for all services centrally.
- Applications pull configurations from the center on startup and can listen for changes, enabling dynamic configuration updates.
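Purely to illustrate the circuit-breaking behavior described above, here is a minimal hand-rolled breaker with the usual CLOSED → OPEN → HALF_OPEN transitions. The thresholds and timeout are arbitrary illustrative values; a real system would use Sentinel, Hystrix, or Resilience4j rather than rolling its own.

```java
// Minimal circuit breaker sketch (CLOSED -> OPEN -> HALF_OPEN). This only
// illustrates the mechanism; use a governance framework in production.
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;      // consecutive failures before opening
    private final long openTimeoutMillis;    // how long to stay open before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;   // allow one probe request through
            } else {
                return fallback.get();     // fail fast: degrade instead of waiting
            }
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;       // success closes the breaker again
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

A caller would wrap each downstream invocation, e.g. something like `breaker.call(() -> userClient.getUser(id), () -> FALLBACK_USER)` (with a hypothetical client and fallback), so failures degrade gracefully instead of piling up threads.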
Architecture Adjustment (Adding Governance & Config Layer):
graph TD
subgraph "Service Governance & Configuration"
Registry("Nacos/Consul Registry");
ConfigCenter("Nacos/Apollo Config Center");
end
subgraph "Service Layer"
UserService -- Register/Discover --> Registry;
UserService -- Pull Config --> ConfigCenter;
ProductService -- Register/Discover --> Registry;
ProductService -- Pull Config --> ConfigCenter;
OrderService -- Register/Discover --> Registry;
OrderService -- Pull Config --> ConfigCenter;
%% Inter-service calls
OrderService -- "RPC (Load Balance/Circuit Break)" --> UserService;
OrderService -- "RPC (Load Balance/Circuit Break)" --> ProductService;
end
subgraph "Data Layer"
UserDB("User DB");
ProductDB("Product DB");
OrderDB("Order DB");
end
UserService --> UserDB;
ProductService --> ProductDB;
OrderService --> OrderDB;
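As a concrete example of registration, discovery, and dynamic configuration, here is a sketch using the Nacos Java client (nacos-client). The server address, service names, data ID, and group are placeholders; with Spring Cloud Alibaba most of this is wired up automatically, but the raw API makes the mechanics visible.

```java
// Sketch of service registration/discovery and dynamic configuration with Nacos.
// Assumes the nacos-client dependency; addresses, service names, data IDs and
// groups below are placeholders, not values prescribed by this course.
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

import java.util.List;
import java.util.Properties;
import java.util.concurrent.Executor;

public class NacosBootstrap {

    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        props.put("serverAddr", "127.0.0.1:8848");   // Nacos server address (placeholder)

        // 1. Register this instance so consumers can discover it.
        NamingService naming = NacosFactory.createNamingService(props);
        naming.registerInstance("order-service", "192.168.0.10", 8080);

        // 2. Discover healthy instances of a dependency; client-side load balancing
        //    would then pick one (e.g. round-robin).
        List<Instance> userInstances = naming.selectInstances("user-service", true);
        userInstances.forEach(i -> System.out.println(i.getIp() + ":" + i.getPort()));

        // 3. Pull configuration once at startup and listen for later changes.
        ConfigService config = NacosFactory.createConfigService(props);
        String initial = config.getConfig("order-service.yaml", "DEFAULT_GROUP", 3000);
        System.out.println("initial config:\n" + initial);

        config.addListener("order-service.yaml", "DEFAULT_GROUP", new Listener() {
            @Override public Executor getExecutor() { return null; } // run on the client's thread
            @Override public void receiveConfigInfo(String newConfig) {
                System.out.println("config changed:\n" + newConfig); // hot-reload here
            }
        });
    }
}
```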
Phase 3: Overwhelming Read Pressure → Introduce Distributed Cache
Challenge Escalates: Database Read Bottleneck
- As user volume grows further and major promotions like "Double Eleven" arrive, QPS for high-frequency read scenarios like product detail pages and homepage recommendations skyrockets.
- Even with master-slave separation, a large number of read requests overwhelm the database slaves.
❓ Architect's Thinking Moment: How to significantly improve read performance without drastically increasing database costs?
(Cache is king. Which cache to use? Redis or Memcached? How to handle cache avalanche, penetration, breakdown?)
✅ Evolution Direction: Introduce Distributed Cache Cluster (Redis/Memcached)
- Cache Selection:
- Redis is often chosen due to its richer data structures (String, Hash, List, Set, Sorted Set) and persistence capabilities, allowing for more diverse use cases.
- Memcached is simpler, potentially slightly more efficient in memory management, suitable for pure KV caching.
- Caching Strategy:
- Cache-Aside Pattern: Most common. Read: read the cache first; on a miss, read the DB, then write the value back to the cache. Write: update the DB first, then invalidate the cache (delete rather than update it, to avoid concurrent-update issues). A sketch follows this list.
- Set reasonable Time-To-Live (TTL) for cache entries.
- Cache Location:
- Local Cache: e.g., Guava Cache, Caffeine. Caches within the application JVM, fastest speed but limited capacity and higher risk of data inconsistency.
- Distributed Cache: e.g., Redis Cluster, Memcached Cluster. Deployed independently, large capacity, shared by all service instances, the primary caching layer used.
- Handling Cache Issues:
- Cache Penetration (querying non-existent data): Use a Bloom Filter for pre-check, or cache null objects.
- Cache Breakdown (hot key expires): Use distributed locks (like Redisson) or mutex locks, allowing only one request to load data and write back to the cache.
- Cache Avalanche (many keys expire simultaneously): Set randomized expiration times, use cache pre-warming, or implement multi-level caching (local + distributed).
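Here is a minimal sketch of the Cache-Aside pattern just described, using the Jedis client. The key format, TTL values, null-object marker, and stubbed database calls are illustrative assumptions; the sketch also folds in two of the defenses listed above (null-object caching and randomized TTLs).

```java
// Cache-Aside sketch with Redis (Jedis): read-through on miss, delete on write,
// null-object caching against penetration, randomized TTL against avalanche.
// Key format, TTLs and the database lookup below are illustrative assumptions.
import redis.clients.jedis.Jedis;

import java.util.concurrent.ThreadLocalRandom;

public class ProductCache {

    private static final String NULL_MARKER = "__NULL__"; // cached "not found" sentinel
    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    public String getProduct(long id) {
        String key = "product:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            return NULL_MARKER.equals(cached) ? null : cached;   // cache hit
        }
        String fromDb = loadFromDatabase(id);                    // cache miss -> DB
        if (fromDb == null) {
            jedis.setex(key, 60, NULL_MARKER);                   // cache the miss briefly
            return null;
        }
        // Randomize TTL (10-13 minutes here) so hot keys don't all expire together.
        int ttl = 600 + ThreadLocalRandom.current().nextInt(180);
        jedis.setex(key, ttl, fromDb);
        return fromDb;
    }

    public void updateProduct(long id, String newValue) {
        saveToDatabase(id, newValue);          // 1. update the DB first
        jedis.del("product:" + id);            // 2. then invalidate (not update) the cache
    }

    // Stubbed persistence layer, purely for illustration.
    private String loadFromDatabase(long id) { return id % 2 == 0 ? "{\"id\":" + id + "}" : null; }
    private void saveToDatabase(long id, String v) { /* UPDATE product SET ... */ }
}
```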
Architecture Adjustment (Adding Cache Layer):
graph TD
subgraph "Application Layer"
UserService;
ProductService;
OrderService;
end
subgraph "Cache Layer"
RedisCluster("Redis Cluster");
end
subgraph "Data Layer"
UserDB("User DB");
ProductDB("Product DB");
OrderDB("Order DB");
end
%% Read/Write Paths
UserService -- Read --> RedisCluster -- Cache Miss --> UserDB;
UserService -- "Write DB & Invalidate Cache" --> UserDB;
ProductService -- Read --> RedisCluster -- Cache Miss --> ProductDB;
ProductService -- "Write DB & Invalidate Cache" --> ProductDB;
OrderService -- Read --> RedisCluster -- Cache Miss --> OrderDB;
OrderService -- "Write DB & Invalidate Cache" --> OrderDB;
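For cache breakdown specifically, the mutex approach mentioned above can be sketched with a plain Redis `SET NX EX` lock: only the request that wins the lock reloads the database and refills the cache. The lock key, TTLs, and retry policy are illustrative; a production system would typically use Redisson's locks and verify lock ownership before releasing.

```java
// Sketch of a simple Redis mutex to mitigate cache breakdown for a hot key.
// Lock key, TTLs and retry policy are illustrative; prefer Redisson in practice
// and check the lock owner before deleting it.
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class HotKeyLoader {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    public String getHotProduct(long id) throws InterruptedException {
        String key = "product:" + id;
        String lockKey = "lock:" + key;
        for (int attempt = 0; attempt < 50; attempt++) {
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                                        // normal cache hit
            }
            // SET lock NX EX 10: only one caller gets "OK" and rebuilds the cache.
            if ("OK".equals(jedis.set(lockKey, "1", SetParams.setParams().nx().ex(10)))) {
                try {
                    String fromDb = loadFromDatabase(id);             // single DB reload
                    jedis.setex(key, 600, fromDb);
                    return fromDb;
                } finally {
                    jedis.del(lockKey);                               // release the mutex
                }
            }
            Thread.sleep(50);                                         // lost the race: wait, then re-check cache
        }
        return loadFromDatabase(id);                                  // fallback after too many retries
    }

    // Stubbed persistence layer, purely for illustration.
    private String loadFromDatabase(long id) { return "{\"id\":" + id + "}"; }
}
```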
Phase 4: Writes Also Can't Keep Up → Database Horizontal Sharding
Ultimate Challenge: Database Write Bottleneck & Capacity Limit
- Order volume and user numbers continue explosive growth. The write QPS of single business databases (even with master-slave) hits bottlenecks.
- Single-table data volume becomes excessively large (e.g., hundreds of millions of rows in the orders table), making queries, index maintenance, and DDL operations extremely slow or impossible.
- Database storage capacity also approaches limits.
❓ Architect's Thinking Moment: What's the ultimate scaling solution for the database?
(Vertical splitting is maxed out. How to do horizontal splitting? By what dimension? How to choose the sharding key? How to generate distributed IDs? How to migrate smoothly?)
✅ Evolution Direction: Database Horizontal Sharding + Distributed ID Generator
- Horizontal Sharding Strategy:
- By Range: e.g., by time range (one order table per month) or ID range. Pro: simple scaling. Con: potential data hotspots (e.g., most recent month's data accessed most frequently).
- By Hash: Choose a Sharding Key (like `user_id` or `order_id`), calculate its hash, then take it modulo the number of databases/tables to determine where the data lands (a minimal routing sketch follows this list).
- Choosing the right sharding key is crucial: usually select the field most frequently used in queries (User ID, Order ID, Product ID), and avoid hotspots (e.g., all orders from a big seller landing on one shard).
- Common combination: shard databases by `user_id` hash first, then shard tables within each database by `order_id` hash or time range.
- Database Sharding Middleware:
- Introduce database sharding middleware (like ShardingSphere-JDBC/Proxy, TDDL, MyCAT).
- The middleware parses SQL, routes requests to the correct physical databases/tables based on sharding rules, and aggregates results, aiming for transparency to the application layer.
- Distributed ID Generator:
- After sharding, database auto-increment primary keys are no longer globally unique. Need a distributed ID generation service.
- Common solutions: Snowflake algorithm, UUID (too long, not recommended as primary key), database sequence segment pattern, Redis Incr.
- Data Migration & Resharding:
- Requires detailed data migration plans (like dual writes, incremental sync + full verification) and smooth resharding plans (like doubling capacity). This is a very complex and high-risk operation.
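The hash-based routing referenced in the list above boils down to a couple of modulo operations once a sharding key is chosen. Below is a minimal sketch; the shard counts (2 databases × 4 tables) and naming scheme are arbitrary illustrative choices, and middleware such as ShardingSphere performs this routing plus SQL rewriting and result merging for you.

```java
// Minimal hash-sharding routing sketch: user_id picks the database, order_id
// picks the table inside that database (for numeric IDs the ID itself commonly
// serves as the hash). Shard counts and names are illustrative only.
public class ShardRouter {

    private static final int DB_COUNT = 2;      // order_db_0, order_db_1
    private static final int TABLE_COUNT = 4;   // t_order_0 .. t_order_3 per database

    /** Databases are sharded by user_id so one user's orders stay together. */
    public static int databaseIndex(long userId) {
        return (int) Math.floorMod(userId, (long) DB_COUNT);
    }

    /** Tables inside a database are sharded by order_id to spread writes. */
    public static int tableIndex(long orderId) {
        return (int) Math.floorMod(orderId, (long) TABLE_COUNT);
    }

    /** Resolve the physical table an order row lives in, e.g. order_db_1.t_order_3. */
    public static String physicalTable(long userId, long orderId) {
        return "order_db_" + databaseIndex(userId) + ".t_order_" + tableIndex(orderId);
    }

    public static void main(String[] args) {
        System.out.println(physicalTable(10086L, 202400001234L));
    }
}
```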
Architecture Adjustment (Deep Database Layer Transformation):
graph TD
subgraph "Application Layer"
UserService --> ShardingMiddleware("Sharding Middleware ShardingSphere/TDDL");
OrderService --> ShardingMiddleware;
end
subgraph "Distributed ID"
IDGenerator("Distributed ID Generator Service");
UserService -- Get ID --> IDGenerator;
OrderService -- Get ID --> IDGenerator;
end
subgraph "Database Cluster (Sharded)"
UserDB_0("UserDB_0");
UserDB_1("UserDB_1");
UserDB_etc("...");
UserDB_N("UserDB_N");
OrderDB_0("OrderDB_0");
OrderDB_1("OrderDB_1");
OrderDB_etc("...");
OrderDB_M("OrderDB_M");
end
ShardingMiddleware -- Route R/W --> UserDB_0;
ShardingMiddleware -- Route R/W --> UserDB_1;
ShardingMiddleware -- Route R/W --> UserDB_N;
ShardingMiddleware -- Route R/W --> OrderDB_0;
ShardingMiddleware -- Route R/W --> OrderDB_1;
ShardingMiddleware -- Route R/W --> OrderDB_M;
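To make the distributed ID requirement concrete, here is a compact Snowflake-style generator sketch using the common 41/10/12 bit layout and a custom epoch. These parameters are illustrative; production systems usually rely on a hardened implementation or on the segment/Redis approaches mentioned above.

```java
// Compact Snowflake-style ID sketch: 41 bits of timestamp + 10 bits of worker id
// + 12 bits of per-millisecond sequence, giving roughly time-ordered, globally
// unique 64-bit IDs without a central database. Epoch and bit widths are the
// usual illustrative choices; use a hardened implementation in production.
public class SnowflakeIdGenerator {

    private static final long EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z (custom epoch)
    private static final long WORKER_BITS = 10L;      // up to 1024 workers
    private static final long SEQUENCE_BITS = 12L;    // up to 4096 IDs per worker per ms
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards, refusing to generate IDs");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                  // sequence exhausted in this millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* spin */ }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SnowflakeIdGenerator gen = new SnowflakeIdGenerator(1);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId());
    }
}
```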
Phase 5: Inter-Service Dependencies & Transaction Dilemmas → Introduce Message Queue & Distributed Transactions
(Content continues in the original file)