

Course 8: Distributed File Storage System Architecture Evolution Case Study

Goal: Data is the cornerstone of modern applications, and distributed file storage is the core infrastructure for handling massive volumes of it. Using the step-by-step construction and optimization of a typical distributed file system as an example, this course gives you an in-depth understanding of core design challenges and their solutions, such as metadata management, data sharding and replication, consistency protocols, and fault tolerance and recovery mechanisms, and a deeper appreciation of the art of CAP-theorem trade-offs in engineering practice.


Phase 0: The Single Machine Era (Simple but Sufficient: NFS/Local File System)

System Description

  • Scenario: An initial application needs to store some user-uploaded files or generated logs.
  • Functionality: Provides basic file operation interfaces (create, read, write, delete). All files are stored on the local disk of a single server (e.g., ext4 file system), possibly shared via NFS for access by other applications.
  • Tech Stack:
  • Storage: Local disk (ext4, XFS) or Network File System (NFS)
  • Access Interface: Standard POSIX file system API

Current Architecture Diagram

graph LR
    Client --> NFSServer("NFS Server");
    NFSServer --> LocalDisk("Local Disk");
Pain Points at this moment:

  • Single Point of Failure (SPOF): If this server goes down, all files become inaccessible, and data may be lost.
  • Capacity Bottleneck: A single machine's disk capacity is limited and cannot cope with data growth to TB or PB scale.
  • Performance Bottleneck: A single machine's IOPS and throughput are limited, making it difficult to support high-concurrency read/write requests.


Phase 1: Divide and Conquer → Data Sharding, the Embryonic Form of Distributed Storage

Challenge Emerges

  • Total file volume grows rapidly, straining single machine disk capacity.
  • Applications require higher read/write throughput; single machine IO performance is insufficient.

❓ Architect's Thinking Moment: How to break the capacity and performance limits of a single machine?

(If one machine can't store enough or serve fast enough, use several? How should data be distributed across multiple machines? How do we track which files live on which storage nodes?)

✅ Evolution Direction: Introduce Data Sharding + Independent Metadata Management

  1. Data Sharding / Chunking:
    • Split large files into fixed-size data blocks (Chunks / Blocks), e.g., 64MB (the classic GFS chunk size) or 128MB (the HDFS default).
    • Distribute these data blocks across multiple DataNode / ChunkServer servers.
    • The strategy for deciding which DataNode stores a file block can be simple, like hashing the filename and taking the modulo: hash(filename + chunk_index) % num_datanodes.
  2. Independent Metadata Service:
    • Need a place to record file Metadata, including: filename, directory structure, file attributes (size, permissions, creation time, etc.), which data blocks the file is split into, and on which DataNodes each block is stored.
    • Initially, a relational database (like MySQL) can be used to store this metadata information.
  3. Client Access Flow:
    • When a client reads/writes a file, it first contacts the Metadata Service to get the file's data block list and location information.
    • Then, the client directly communicates with the corresponding DataNodes to read/write the data blocks (see the sketch after this list).
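Below is a minimal Python sketch of this phase's placement rule and client read flow. The `MetadataService.lookup` and `DataNode.read_chunk` interfaces are hypothetical stand-ins introduced only for illustration, not part of any real system.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB fixed-size chunks

def place_chunk(filename: str, chunk_index: int, num_datanodes: int) -> int:
    """Pick a DataNode with the simple modulo rule: hash(filename + chunk_index) % num_datanodes."""
    key = f"{filename}:{chunk_index}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_datanodes

def read_file(metadata_service, datanodes, filename: str) -> bytes:
    """Client read flow: ask the metadata service first, then talk to DataNodes directly."""
    # 1. The metadata service returns the file's chunk list and chunk locations.
    chunk_locations = metadata_service.lookup(filename)  # [(chunk_id, datanode_id), ...]
    # 2. The client reads each chunk directly from the DataNode that stores it.
    data = bytearray()
    for chunk_id, datanode_id in chunk_locations:
        data += datanodes[datanode_id].read_chunk(chunk_id)
    return bytes(data)
```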

Architecture Adjustment (Separating Metadata and Data Nodes):

graph TD
    Client --> MetadataDB("Metadata Service MySQL");
    MetadataDB -- File block locations --> Client;
    Client -- Read/Write data blocks --> DataNode1("DataNode 1");
    Client -- Read/Write data blocks --> DataNode2("DataNode 2");
    Client -- Read/Write data blocks --> DataNode3("DataNode 3");
This solves the capacity and initial performance problems but introduces new complexity: the metadata service becomes a new performance bottleneck and single point of failure, and a DataNode failure leads to partial data loss.


Phase 2: Fear No Downtime → Introduce Replication Mechanism for High Availability

New Challenge: Risk of Data Loss and Service Interruption

  • If a DataNode storing a specific file block crashes, that part of the data is permanently lost.
  • System reliability and availability cannot be guaranteed.
  • Also, for popular files, having only one copy might limit read performance.

❓ Architect's Thinking Moment: How to prevent data loss? How to keep the system running when some nodes fail?

(Don't put all eggs in one basket. Create backups? Where to put backups? How to ensure consistency between backup and original data?)

✅ Evolution Direction: Implement Data Replication + Read/Write Consistency Strategy

  1. Data Replication:
    • Create multiple replicas (usually 3) of each data block and distribute them across DataNodes on different physical nodes, different racks, or even different availability zones to maximize disaster-recovery capability.
    • The metadata service needs to record the locations of all replicas for each data block.
  2. Write Consistency Strategy:
    • When a client writes data, it needs to write to the primary replica, which then synchronously replicates to other replicas.
    • Often uses a Quorum mechanism to balance consistency, availability, and performance: e.g., for 3 replicas, configure W=2 (write considered successful only if at least 2 replicas succeed), R=2 (read from at least 2 replicas to ensure reading the latest data or perform repair).
    • Write flow: Client → Primary DataNode → Secondary DataNode 1 & Secondary DataNode 2. The Primary returns success to the Client only after receiving acknowledgments from at least W-1 secondaries (see the sketch after this list).
  3. Fault Detection and Automatic Recovery:
    • The metadata service (or a dedicated Master node) needs to continuously monitor the health status of all DataNodes via heartbeat mechanisms.
    • When a DataNode failure is detected, causing some data blocks to have insufficient replicas, the system must automatically replicate data from other healthy replicas to new DataNodes to restore the replica count (see the recovery sketch at the end of this phase).
    • If inconsistent replica data is found during reads (e.g., via Checksum validation), a Read Repair mechanism needs to be triggered.
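A minimal sketch of the synchronous write path with a write quorum of W=2 out of 3 replicas, as described above. The `write_chunk` method on the primary and secondaries is a hypothetical interface assumed for illustration.

```python
W = 2  # write quorum for 3 replicas

def quorum_write(primary, secondaries, chunk_id: str, data: bytes) -> bool:
    """Primary writes locally, replicates to secondaries, and succeeds once W replicas ack."""
    acks = 0
    if primary.write_chunk(chunk_id, data):        # write the primary replica
        acks += 1
    for node in secondaries:                       # synchronous replication to secondaries
        try:
            if node.write_chunk(chunk_id, data):
                acks += 1
        except ConnectionError:
            continue                               # an unreachable replica is repaired later
        if acks >= W:
            return True                            # quorum reached: report success to the client
    return acks >= W                               # not enough acks: the write fails and is retried
```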

Architecture Adjustment (Adding Replication and Consistency Control):

graph TD
    Client --> MetadataService("Metadata Service");
    MetadataService -- Replica locations --> Client;
    subgraph "Data Write Flow"
        Client -- "1. Write Request" --> Primary("Primary Replica DataNode");
        Primary -- "2. Synchronous Replication" --> Replica1("Replica 1 DataNode");
        Primary -- "2. Synchronous Replication" --> Replica2("Replica 2 DataNode");
        Replica1 -- "3. Ack" --> Primary;
        Replica2 -- "3. Ack" --> Primary;
        Primary -- "4. Return Success (W-1 Acks Received)" --> Client;
    end
    subgraph "Fault Recovery"
        MetadataService -- Heartbeat Detection --> DataNodes("All DataNodes");
        MetadataService -- "Replica insufficient, Trigger Replication" --> HealthyReplica("Healthy Replica");
        HealthyReplica -- Replicate Data --> NewReplica("New Replica Node");
    end
Availability and data reliability improve, but the load on the metadata service and its single-point-of-failure problem become even more prominent.
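To make the recovery loop in step 3 above concrete, here is a minimal sketch of heartbeat-based failure detection and re-replication. The timeout, the replication factor, and the `pick_target` / `copy_block` callbacks are all hypothetical assumptions for illustration.

```python
HEARTBEAT_TIMEOUT = 30   # seconds without a heartbeat before a DataNode is declared dead
REPLICATION_FACTOR = 3

def find_dead_nodes(last_heartbeat: dict[str, float], now: float) -> set[str]:
    """A node that has not sent a heartbeat within the timeout is considered failed."""
    return {node for node, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT}

def re_replicate(block_locations: dict[str, set[str]], dead_nodes: set[str],
                 pick_target, copy_block) -> None:
    """Copy under-replicated blocks from a healthy replica to new DataNodes."""
    for block_id, nodes in block_locations.items():
        healthy = nodes - dead_nodes
        for _ in range(REPLICATION_FACTOR - len(healthy)):
            if not healthy:
                break                                   # every replica is gone: data is lost
            target = pick_target(exclude=healthy)       # choose a node that lacks this block
            copy_block(block_id, source=next(iter(healthy)), target=target)
            healthy.add(target)
```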


Phase 3: Metadata Can't Keep Up → Dedicated Metadata Clustering

Challenge Escalates: Metadata Service Performance and Scalability Bottleneck

  • As the number of files grows to hundreds of millions or even billions, the performance of MySQL storing metadata drops sharply, becoming the bottleneck for the entire system.
  • Metadata queries (like ls on a large directory) become very slow.
  • A single-point metadata service is also a huge availability risk.

❓ Architect's Thinking Moment: How to make the metadata service horizontally scalable and highly available?

(Relational databases are no longer suitable for this scenario. What technology should replace them? How do we keep metadata consistent? Can memory help speed things up?)

✅ Evolution Direction: Adopt Distributed KV or Dedicated Metadata System

  1. Distributed, Highly Available Metadata Storage:
    • Migrate metadata from MySQL to systems more suitable for large-scale metadata management.
    • Option 1: Systems based on distributed consensus protocols, like using ZooKeeper or etcd to store the directory tree structure and critical metadata. They provide strong consistency guarantees and high availability.
    • Option 2: Use metadata services specifically designed for file systems, like HDFS NameNode (achieves high availability via Standby NameNode and JournalNodes), or build a custom metadata service based on KV stores like RocksDB/LevelDB.
  2. Accelerate Metadata Access with Memory:
    • Classic HDFS NameNode approach: Load the entire file system Namespace and file-to-block mapping into memory, greatly accelerating metadata access. In-memory state is persisted via EditLog and FsImage.
    • For other solutions, Redis or other in-memory databases can also be used to cache metadata for hot directories or files.
  3. Metadata Federation:
    • When the memory or processing capacity of a single metadata cluster itself becomes a bottleneck (e.g., the file count reaches hundreds of billions), consider horizontally partitioning (federating) the metadata, for example by assigning different directory subtrees to different metadata clusters based on mount points, as HDFS NameNode Federation does (see the routing sketch after this list).
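A minimal sketch of mount-point-based metadata federation. The mount table and cluster names are hypothetical; the routing rule simply picks the metadata cluster that owns the longest matching path prefix.

```python
# Hypothetical mount table: directory subtree -> metadata cluster responsible for it.
MOUNT_TABLE = {
    "/user": "meta-cluster-1",
    "/logs": "meta-cluster-2",
    "/tmp":  "meta-cluster-3",
}

def route_metadata_request(path: str) -> str:
    """Route a namespace operation to the cluster owning the longest matching mount point."""
    matches = [(mount, cluster) for mount, cluster in MOUNT_TABLE.items()
               if path == mount or path.startswith(mount + "/")]
    if not matches:
        raise KeyError(f"no metadata cluster mounted for {path}")
    _, cluster = max(matches, key=lambda mc: len(mc[0]))   # longest prefix wins
    return cluster

# Example: route_metadata_request("/logs/2024/app.log") returns "meta-cluster-2".
```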

Architecture Adjustment (Upgrading Metadata Service):

graph TD
    Client --> MetadataCluster("Distributed Metadata Cluster");
    subgraph "Metadata Service Cluster (e.g., based on Raft/Paxos)"
        MetadataNode1("Metadata Node 1");
        MetadataNode2("Metadata Node 2");
        MetadataNode3("Metadata Node 3");
        MetadataNode1 <--> MetadataNode2;
        MetadataNode2 <--> MetadataNode3;
        MetadataNode1 <--> MetadataNode3;
        %% "(Optional) In-memory Cache Layer"
        RedisCache("In-memory Metadata Cache Redis") <--> MetadataCluster;
        Client --> RedisCache; 
    end
    MetadataCluster -- Replica locations --> Client;
    Client -- Read/Write --> DataNodes;
This solves the metadata service's performance and single-point problems, but the storage-efficiency problem of massive numbers of small files begins to emerge.


Phase 4: The Pain of Small Files → Optimizing Storage Efficiency and Metadata Overhead

New Pain Point: The Trouble with Massive Small Files

  • The system contains a large number of small files (e.g., images, log snippets, user configurations from a few KB to a few MB).
  • Metadata Pressure: Each small file requires an independent metadata record, causing immense pressure on the metadata service and high memory consumption (if metadata is in memory).
  • Low Storage Efficiency: File system data blocks are usually large (e.g., 64MB); storing many small files wastes significant disk space (internal fragmentation), and DataNodes that manage too many small file blocks carry extra management overhead.

❓ Architect's Thinking Moment: How to store and manage massive small files economically and efficiently?

(Can small files be merged for storage? Can the metadata structure be optimized?)

✅ Evolution Direction: Merge Small Files for Storage + Optimize Index Structure

  1. Merging Small Files for Storage:
    • Instead of allocating separate data blocks for each small file, merge multiple small files into larger logical blocks or Segment Files.
    • Metadata needs to record the offset and length of each small file within its corresponding Segment File (see the sketch after this list).
    • DataNodes only need to manage a significantly reduced number of Segment Files.
    • For example, Facebook's Haystack system was specifically designed to store massive numbers of photos (typical small files) using a similar merged storage strategy.
  2. Optimize Metadata Index Structure:
    • If metadata is stored in a KV system (like RocksDB), use data structures like LSM-Tree (Log-Structured Merge-Tree), which is very write-friendly and suitable for high-concurrency metadata updates.
    • For in-memory metadata, design more compact data structures to reduce memory footprint.
  3. Introduce Object Storage API:
    • For many small file scenarios (like images, videos), the application layer might prefer using an Object Storage interface (like S3 API) rather than traditional file system interfaces.
    • Build an S3 API compatible gateway layer, potentially still using our evolved distributed file system (or its variant) underneath, shielding users from underlying details like file blocks and metadata. Open-source MinIO is a typical example.
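A minimal sketch of the merged-storage idea: small files are appended to one segment file, and an index records each file's (offset, length). The class and method names are hypothetical, and the index lives in memory purely for illustration; a real system would persist it in the metadata service.

```python
class SegmentWriter:
    """Append small files into one large segment file and index them by offset/length."""

    def __init__(self, segment_path: str):
        self.segment_path = segment_path
        self.index: dict[str, tuple[int, int]] = {}   # filename -> (offset, length)

    def append(self, filename: str, data: bytes) -> None:
        """Append a small file to the end of the segment and record where it landed."""
        with open(self.segment_path, "ab") as seg:
            offset = seg.tell()
            seg.write(data)
        self.index[filename] = (offset, len(data))

    def read(self, filename: str) -> bytes:
        """Read a small file back by seeking to its recorded offset."""
        offset, length = self.index[filename]
        with open(self.segment_path, "rb") as seg:
            seg.seek(offset)
            return seg.read(length)
```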

Architecture Adjustment (Introducing File Merging and Object Storage Gateway):

graph TD
    subgraph "Access Layer"
        ClientFS("File System Client") --> MetadataService;
        ClientS3("S3 API Client") --> S3Gateway("S3 Gateway");
        S3Gateway --> MetadataService;
    end
    subgraph "Metadata Service"
        MetadataService("Metadata Service - Supports File Offsets");
    end
    subgraph "Data Storage Layer"
        DataNode1("DataNode 1") -- Stores --> SegFile1("Segment File 1");
        DataNode2("DataNode 2") -- Stores --> SegFile2("Segment File 2");
        %% File block merging
        SmallFile1 & SmallFile2 & SmallFile3 --> SegFile1;
        SmallFile4 & SmallFile5 --> SegFile2;
        %% Client direct read/write
        ClientFS -- Read/Write offset --> DataNode1;
        ClientFS -- Read/Write offset --> DataNode2;
        %% S3 Gateway proxies read/write
        S3Gateway -- Read/Write Request --> DataNode1;
        S3Gateway -- Read/Write Request --> DataNode2;
    end
This solves the small-file storage-efficiency problem, but cross-region deployment and strong-consistency requirements bring new challenges.


Phase 5: Going Global → Cross-Regional Consistency and Disaster Recovery

Ultimate Challenge: Global Distribution and Data Consistency

  • Business requires global deployment, users are distributed across different continents, necessitating data centers in different regions.
  • Cross-Regional Consistency: How to ensure a file uploaded by a user in the US can be accessed "immediately" (or within acceptable latency) in Europe or Asia, with consistent data?
  • Cross-Regional Disaster Recovery: How to ensure business continuity and data integrity if one data center suffers a catastrophic failure?
  • CAP Trade-off: Cross-regional network latency is an objective reality. How to choose between Consistency, Availability, and Partition Tolerance?

❓ Architect's Thinking Moment: How to build a reliable, consistent distributed file system over a wide area network?

(How to synchronize data across regions? What consistency protocol guarantees correctness? How to balance performance and consistency?)

✅ Evolution Direction: Introduce Strong Consistency Protocol + Optimize Cross-Regional Replication

  1. Strongly Consistent Metadata Service:
    • For file system namespace operations (like creating files, deleting directories, renaming) and critical metadata updates, strong consistency must be guaranteed.
    • Requires running a distributed consensus protocol (like Paxos or Raft) among metadata service nodes across multiple data centers. An operation is considered successful only after a quorum of nodes agrees.
    • Google's Spanner (a database, but its consistency mechanism is instructive here) and etcd (based on Raft) are representative examples.
  2. Optimize Cross-Regional Data Replication:
    • Data block replication can choose different consistency levels based on business needs (a sketch of both modes follows this list):
      • Strong Consistency: Write operations wait for data to be synchronously replicated to remote region replicas and acknowledged before returning success. High latency, but guarantees data consistency.
      • Eventual Consistency: Write operations return success after succeeding in the local region; data is asynchronously replicated to other regions. Low latency, good availability, but may have a brief window of data inconsistency.
    • More intelligent replication strategies can be employed, such as replicating only hot data or using dedicated network lines to optimize transfer speed.
  3. Design for Partition Tolerance:
    • Distributed systems must assume network partitions can occur.
    • During a network partition, the system needs to choose based on the CAP principle:
      • Choose CP (Consistency + Partition Tolerance): To ensure data consistency during a partition, might need to sacrifice some availability, e.g., disallow writes, or only allow reads/writes in the partition with the majority of replicas.
      • Choose AP (Availability + Partition Tolerance): To ensure high availability during a partition, allow reads/writes in different partitions, but this may lead to data conflicts, requiring subsequent mechanisms to resolve conflicts and ensure eventual consistency.
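A minimal sketch of the two replication modes described above, using hypothetical `write_chunk` interfaces: SYNC blocks until the remote region acknowledges (strong consistency, higher latency), while ASYNC acknowledges after the local write and replicates in the background (eventual consistency, lower latency).

```python
import threading
from enum import Enum

class ReplicationMode(Enum):
    SYNC = "sync"     # strong consistency: wait for the remote region
    ASYNC = "async"   # eventual consistency: replicate in the background

def write_cross_region(local, remote, chunk_id: str, data: bytes,
                       mode: ReplicationMode) -> bool:
    """Write locally, then replicate to the remote region according to the chosen mode."""
    local.write_chunk(chunk_id, data)                  # the local region is always written first
    if mode is ReplicationMode.SYNC:
        return remote.write_chunk(chunk_id, data)      # block until the remote replica acks
    threading.Thread(                                  # fire-and-forget background replication
        target=remote.write_chunk, args=(chunk_id, data), daemon=True
    ).start()
    return True                                        # acknowledge immediately to the client
```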

Architecture Adjustment (Cross-Regional Deployment and Consistency Control):

graph TD
    subgraph "Client Access"
        ClientAsia["Asia User"] --> RegionAsia("Asia Datacenter");
        ClientUS["US User"] --> RegionUS("US Datacenter");
    end
    subgraph "Metadata Layer (Global Strong Consistency)"
        MetadataAsia("Metadata Service Asia") <--> Consensus("Paxos/Raft Consensus Protocol") <--> MetadataUS("Metadata Service US");
        RegionAsia --> MetadataAsia;
        RegionUS --> MetadataUS;
    end
    subgraph "Data Layer (Configurable Consistency)"
        DataAsia("Data Nodes Asia") <-- Async/Sync Replication --> DataUS("Data Nodes US");
        RegionAsia -- Read/Write --> DataAsia;
        RegionUS -- Read/Write --> DataUS;
    end

Summary: The Evolutionary Path of Distributed File Storage Systems

| Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
| --- | --- | --- | --- |
| 0. Single Node | SPOF / Capacity / Performance | NFS / Local File System | ext4, NFS |
| 1. Sharding | Capacity / Initial Performance | Data Sharding + Metadata Service | Chunking, MySQL Metadata |
| 2. Replication | Data Loss / Availability | Data Replication + Write Quorum | 3 Replicas, W=2/R=2 Quorum |
| 3. Metadata Scale | Metadata Perf / SPOF | Distributed Metadata + In-memory Cache | ZooKeeper/etcd/HDFS NameNode, Redis Cache, Federation |
| 4. Small Files | Storage Efficiency / Metadata Load | File Merging + Optimized Index / Object API | Haystack, LSM-Tree, S3 Gateway (MinIO) |
| 5. Global | Cross-Region Consistency / Latency | Strong Consensus Protocol + Optimized Replication | Paxos/Raft, Eventual/Strong Consistency, CAP Trade-off |