Case 10: Distributed Computing

Course 10: Distributed Computing Framework Architecture Evolution Case Study

Goal: Big data processing depends on distributed computing frameworks. This course traces their evolution from classic MapReduce to the stream-batch unified Flink. Along the way it covers the core technical shifts (batch processing, stream computing, in-memory computation, resource scheduling, and unified programming models) so that you understand the design philosophy of each computing paradigm and the scenarios it suits.

Phase 0: The Single Machine Era (Scripting Workshop)

System Description

Scenario: Initial data analysis needs, processing moderately sized log files or datasets.
Functionality: For example, using Python scripts with the Pandas library to analyze Nginx access logs on a single machine, calculating daily PV/UV, and outputting results to a CSV file.
Tech Stack:
Language/Library: Python + Pandas / Shell (AWK, grep)
Scheduling: Cron jobs
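
As a rough illustration of this single-machine workflow, the sketch below computes daily PV/UV using only the Python standard library. The sample log lines, the regular expression, and the choice of "distinct client IPs" as the UV definition are assumptions made for the example, not part of the original system.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the start of an Nginx "combined" log line: client IP and timestamp.
LOG_PATTERN = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')

def daily_pv_uv(lines):
    """Return {date: (pv, uv)}; UV is approximated by distinct client IPs."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip malformed lines
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        day = ts.date().isoformat()
        pv[day] += 1
        visitors[day].add(match.group("ip"))
    return {day: (pv[day], len(visitors[day])) for day in pv}

# Hypothetical sample lines, invented for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Mar/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Mar/2024:10:05:00 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.2 - - [01/Mar/2024:11:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [02/Mar/2024:09:00:00 +0000] "GET / HTTP/1.1" 200 512',
]

stats = daily_pv_uv(SAMPLE_LOG)
```

Note that the per-day visitor sets live entirely in one process's memory, which is exactly where this approach stops scaling.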

Current Architecture Diagram

graph LR
    LogFile("Local Log File") -- Read --> Script("Python/Shell Script");
    Script -- Output --> ResultCSV("Result CSV File");

Pain Points at this moment:
- Processing Capacity Bottleneck: Script crashes if data volume exceeds single machine memory.
- Processing Speed Bottleneck: Single-machine, single-threaded processing is extremely slow, unable to handle TB/PB level data.
- Zero Scalability: Cannot increase processing power by adding machines.

Phase 1: Embracing Distributed → MapReduce Paradigm and Hadoop Ecosystem

Challenge Emerges

Log data volume grows to TB or even PB levels, making single-machine processing completely infeasible.
Need a horizontally scalable, fault-tolerant distributed computing model to handle massive data.

❓ Architect's Thinking Moment:

How to make hundreds or thousands of machines collaborate on a large task? How to handle node failures? (Divide and conquer is the core idea. How to divide? How to combine? What if a machine fails? What insights did Google's three landmark papers on GFS, MapReduce, and Bigtable offer?)

✅ Evolution Direction: Introduce MapReduce Computing Model + HDFS Distributed Storage

Distributed Storage Foundation (HDFS):
First, need a distributed file system capable of storing massive data and providing high-throughput access. HDFS (Hadoop Distributed File System) was created for this.
It splits large files into fixed-size data blocks (Blocks), distributes them across multiple DataNodes, and uses a NameNode to manage metadata.
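
To make the block idea concrete, here is a minimal sketch of how a file could be cut into blocks and mapped to DataNodes. The function name `plan_blocks`, the node names, and the round-robin placement are assumptions for illustration; real HDFS placement also considers racks and free space. The returned list plays the role of the metadata a NameNode would keep.

```python
import math

def plan_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign replicas round-robin.

    Returns a list of (block_index, block_length, [datanode, ...]).
    Round-robin placement is a simplification for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    block_count = math.ceil(file_size / block_size)
    plan = []
    for index in range(block_count):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - index * block_size)
        replicas = [datanodes[(index + r) % len(datanodes)]
                    for r in range(replication)]
        plan.append((index, length, replicas))
    return plan

MB = 1024 * 1024
# A 300 MB file with 128 MB blocks (a common HDFS default) on four nodes.
block_map = plan_blocks(300 * MB, 128 * MB, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block has several replicas on different nodes, losing one DataNode does not lose data, and computation can be scheduled close to any replica.
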
MapReduce Programming Model:
Google's MapReduce model abstracts distributed computation into two core phases:
- Map Phase: Input data is split into multiple Splits. Each Map Task processes one split, performs mapping transformations, and outputs <Key, Value> pairs.
- Reduce Phase: <Key, Value> pairs with the same Key from the Map output are aggregated and sent to a Reduce Task for processing, outputting the final result.
- The framework handles complex details like data distribution, task scheduling, inter-node communication (Shuffle), and fault tolerance.
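
The three steps above can be imitated in a single process. The sketch below is not Hadoop code; it only mirrors the data flow (map, shuffle by key, reduce) for a page-view count, and all function names are made up for this example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (url, 1) pair for every record in one input split."""
    return [(url, 1) for url in split]

def shuffle(mapped_outputs, num_reducers):
    """Shuffle: route each key to one reducer by hash and group values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reduce: aggregate all values that share a key (keys visited in sorted order)."""
    return {key: sum(values) for key, values in sorted(partition.items())}

def run_job(splits, num_reducers=2):
    mapped = [map_phase(split) for split in splits]   # one map task per split
    partitions = shuffle(mapped, num_reducers)
    result = {}
    for partition in partitions:                      # one reduce task per partition
        result.update(reduce_phase(partition))
    return result

splits = [["/home", "/cart", "/home"], ["/home", "/pay"]]
page_views = run_job(splits)
```

In a real cluster each map and reduce task runs on a different machine, and the shuffle is the step that moves data across the network.
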
Fault Tolerance Mechanism:
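MapReduce is highly fault-tolerant. If a Task (Mapper or Reducer) fails, the framework automatically reschedules it on another node. A re-run Map Task re-reads its input split from HDFS, where block replication keeps the data available, and a re-run Reduce Task fetches the map outputs again.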
Hadoop Ecosystem:
A vast Hadoop ecosystem developed around HDFS and MapReduce, including Hive (SQL on Hadoop), Pig (Data Flow Language), HBase (NoSQL Database), etc.

Architecture Adjustment (Introducing Hadoop MapReduce):

graph TD
    subgraph "Data Storage"
        Logs --> HDFS("HDFS Distributed Storage");
    end
    subgraph "Computation Flow (MapReduce Job)"
        HDFS -- Input Splits --> MapTasks("Map Tasks Parallel Execution");
        MapTasks -- "K/V Pairs" --> Shuffle("Shuffle & Sort - Framework Managed");
        Shuffle -- Grouped Data --> ReduceTasks("Reduce Tasks Parallel Execution");
        ReduceTasks -- Output Results --> HDFS;
    end
    subgraph "Resource Scheduling"
        JobTracker("JobTracker/ResourceManager - Master") -- Schedules --> TaskTrackers("TaskTracker/NodeManager - Worker");
    end

Achieved distributed batch processing of massive data, but had significant performance bottlenecks: MapReduce is a disk-based model. Map output is written to local disk before the Shuffle, and each job writes its final output to HDFS, so a pipeline of chained jobs pays heavy disk I/O at every step. Performance is poor for algorithms requiring multiple iterations (like machine learning, graph computing), because each iteration typically runs as a separate job.

Phase 2: Pursuing Ultimate Performance → Spark In-Memory Computing

Challenge Escalates: Disk I/O Becomes Bottleneck, Iterative Computation is Inefficient

MapReduce's disk I/O overhead limits its processing speed, especially in scenarios requiring multiple data scans (like machine learning algorithm training).
Users need faster interactive queries and more efficient iterative computation capabilities.

❓ Architect's Thinking Moment:

How to get rid of frequent disk reads/writes? Can intermediate results reside in memory? (Memory is orders of magnitude faster than disk. Can we keep data in memory as much as possible during computation? How to optimize the computation flow to reduce unnecessary steps?)

✅ Evolution Direction: Introduce Spark and its Core RDD/DAG Model

Core Abstraction: RDD (Resilient Distributed Dataset):
Spark's core abstraction is the RDD.
An RDD is an immutable, partitioned distributed dataset whose partitions can be processed in parallel.
RDDs support rich Transformation operations (e.g., map, filter, groupByKey) and Action operations (e.g., count, collect, saveAsTextFile). Transformations are lazily evaluated: they only record lineage, and nothing runs until an Action is called.
Crucially, RDDs can be cached in memory (RDD.persist() or RDD.cache()), allowing subsequent computations to read directly from memory, greatly reducing disk I/O. If a cached partition is lost, Spark recomputes it from its lineage, which is what makes the dataset "resilient".
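
The toy class below sketches lazy transformations and caching in a single process. `MiniRDD` is a hypothetical stand-in written for this example and is not the Spark API; it only shows why a cached dataset is computed once even when several actions use it.

```python
class MiniRDD:
    """A toy, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing the data
        self._cached = None
        self._use_cache = False

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        # Transformation: only records how to derive the child from the parent.
        return MiniRDD(lambda: [fn(x) for x in self._materialize()])

    def filter(self, predicate):
        return MiniRDD(lambda: [x for x in self._materialize() if predicate(x)])

    def cache(self):
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # Action: triggers evaluation of the whole lineage.
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())

calls = {"parse": 0}

def parse(x):
    calls["parse"] += 1
    return x * 2

parsed = MiniRDD.from_list([1, 2, 3, 4]).map(parse).cache()
calls_before_action = calls["parse"]     # nothing has run yet
total = parsed.count()                   # first action computes and caches
large_values = parsed.filter(lambda x: x > 4).collect()
calls_after_actions = calls["parse"]     # second action reused the cache
```

Without the `cache()` call, `parse` would run again for the second action, which is the single-process analogue of re-reading and re-processing input from disk.
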
DAG (Directed Acyclic Graph)-Based Execution Engine:
Spark builds a DAG from a series of RDD transformations.
The execution engine divides the computation into multiple Stages at Shuffle boundaries. Operations within the same Stage are pipelined, so data is transferred between nodes and written to disk only between Stages.
Optimizing Shuffle with Narrow/Wide Dependencies:
Narrow Dependency: Each partition of the parent RDD is depended on by at most one partition of the child RDD (e.g., map, filter). Narrow dependencies can be pipelined within the same node.
Wide Dependency / Shuffle Dependency: Each partition of the parent RDD may be depended on by multiple partitions of the child RDD (e.g., groupByKey, reduceByKey, and join when the inputs are not co-partitioned). Wide dependencies usually require Shuffle operations and are key optimization points.
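
A simplified way to picture stage division: walk a linear chain of operations and start a new stage whenever a wide dependency appears. The classification table and function below are assumptions for this sketch; Spark itself works on a general DAG rather than a flat list.

```python
# Hypothetical classification table used only by this sketch.
WIDE_OPERATIONS = {"groupByKey", "reduceByKey", "join"}

def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each wide dependency."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])   # a shuffle boundary starts a new stage
        stages[-1].append(op)
    return stages

lineage = ["textFile", "map", "filter", "reduceByKey", "map", "groupByKey", "count"]
stages = split_into_stages(lineage)
```

Here `textFile`, `map`, and `filter` form one pipelined stage, and each shuffle-producing operation opens the next one. Fewer wide dependencies means fewer stages and less data movement.
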

Architecture Adjustment (Introducing Spark):

graph TD
    DataSource("Data Source HDFS/Kafka/...") --> SparkDriver("Spark Driver Program");
    SparkDriver -- Build DAG & Schedule Tasks --> ClusterManager("Cluster Manager YARN/Mesos/Standalone");
    ClusterManager -- Allocate Resources --> SparkExecutors("Spark Executor Processes");
    subgraph "Inside Spark Executor"
        ExecutorCore("Execute Task");
        CacheMemory("Memory Cache Block Manager");
        DiskStorage("Disk Storage Optional");
        ExecutorCore -- "Read/Write RDD Partitions" --> CacheMemory;
        CacheMemory -- "Spill to Disk if Memory Full" --> DiskStorage;
        ExecutorCore -- "Shuffle Data Exchange" --> OtherExecutors("Other Executors");
    end
    SparkExecutors -- Execution Results --> SparkDriver;
    SparkDriver -- Output --> ResultDestination("Result Destination");

Dramatically improved batch processing and iterative computation performance, but still insufficient for real-time scenarios requiring millisecond/second latency.

Phase 3: Embracing Real-time → Flink Stream Computing Engine

New Challenge: Low-Latency Real-time Data Processing Needs

Business requires low-latency processing and analysis of real-time data streams, such as: real-time monitoring of website anomaly traffic, real-time calculation of user current location, real-time updates of online ad CTR, etc.
Spark Streaming approximates stream processing with micro-batches, so it is fundamentally batch-based. Its latency is typically on the order of seconds, which does not meet true low-latency requirements.

❓ Architect's Thinking Moment:

How to achieve true event-by-event processing with millisecond latency? How to handle out-of-order events and ensure state consistency? (The batch processing mindset doesn't apply. Need a native stream processing engine. Time semantics are crucial: event time or processing time? How to save and restore state during computation?)

✅ Evolution Direction: Adopt Flink as a Unified Stream-Batch Processing Engine

Flink was designed for stream processing but also possesses excellent batch processing capabilities, and is therefore known as a unified stream-batch engine.

True Native Streaming Model:
Flink treats data as Unbounded Streams, processing incoming events one by one, achieving millisecond-level latency.
Batch processing is considered a special case of stream processing (bounded stream).

Powerful Time Semantics Support:
Event Time: Process based on the actual time events occurred, correctly handling out-of-order events, key for result accuracy.
Processing Time: Process based on the time events arrive at the processing node; simple but results may be inaccurate.
Ingestion Time: Process based on the time events enter the Flink system.
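Watermark Mechanism: A watermark with timestamp t asserts that no more events with an event time earlier than t are expected. It tracks the progress of event time and tells operators when a window can be closed despite out-of-order or delayed events.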
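
The following single-process sketch shows how event time and a watermark interact. It assumes a simple "highest event time seen minus a fixed delay" watermark and tumbling windows; the class and function names are invented for the example, and Flink's actual timing rules differ in detail.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = highest event time seen so far minus an allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.max_event_time - self.max_delay

def tumbling_counts(event_times, window_size, max_delay):
    """Count events per event-time window; fire a window once the watermark passes its end."""
    generator = BoundedOutOfOrdernessWatermark(max_delay)
    open_windows = {}            # window start -> event count
    fired, dropped = [], []
    watermark = None
    for t in event_times:
        start = t - (t % window_size)
        if watermark is not None and start + window_size <= watermark:
            dropped.append(t)    # the window already fired: event is too late
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        watermark = generator.observe(t)
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                fired.append((s, s + window_size, open_windows.pop(s)))
    return fired, dropped, open_windows

# Event times arrive out of order: 9 is late but tolerated, 2 is too late.
fired, dropped, still_open = tumbling_counts(
    [1, 4, 12, 9, 14, 2, 25], window_size=10, max_delay=3)
```

The event with time 9 arrives after 12 but is still counted in window [0, 10), because the watermark had not yet passed 10. The event with time 2 arrives after that window fired and is dropped.
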
Stateful Computation and Checkpoint Mechanism:
Stream processing often requires maintaining state (e.g., aggregates within a window). Flink provides powerful stateful computation capabilities.
Through the distributed snapshot (Checkpoint) mechanism, operator state is periodically persisted to external storage (e.g., HDFS, S3). In case of failure, state is restored from the latest completed Checkpoint and the sources are rewound to the positions recorded in it, which gives Exactly-once semantics for the state. End-to-end Exactly-once additionally requires a replayable Source and a transactional or idempotent Sink.
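
The core idea, snapshotting state together with the source position, can be sketched in a few lines. This is a deliberately simplified single-operator model with hypothetical names; it does not reproduce Flink's distributed, barrier-based snapshot algorithm.

```python
import copy

class CountingJob:
    """Keyed counter whose state and source offset are snapshotted together."""

    def __init__(self):
        self.counts = {}
        self.offset = 0          # position of the next record to read

    def process(self, source, up_to):
        while self.offset < up_to:
            key = source[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # State and offset are captured in one snapshot so they always match.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snapshot):
        self.counts = copy.deepcopy(snapshot["counts"])
        self.offset = snapshot["offset"]

source = ["a", "b", "a", "c", "a", "b"]   # stands in for a replayable source

job = CountingJob()
job.process(source, up_to=3)
snapshot = job.checkpoint()               # would be persisted to durable storage
job.process(source, up_to=5)              # progress after the checkpoint...

recovered = CountingJob()                 # ...is lost when the job crashes
recovered.restore(snapshot)
recovered.process(source, up_to=len(source))
```

After recovery every record is reflected in the state exactly once, because the records processed after the checkpoint are replayed from the saved offset rather than counted twice or skipped.
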
Flexible Windowing Operations:
Offers rich window types for processing unbounded streams, such as Tumbling Window, Sliding Window, Session Window, etc.
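
As a rough guide to how the three window types differ, the helper functions below assign integer timestamps to windows. They are illustrative assumptions rather than Flink's window assigners, and windows are written as half-open intervals [start, end).

```python
def tumbling_window(t, size):
    """The single fixed-size, non-overlapping window that contains t."""
    start = t - (t % size)
    return (start, start + size)

def sliding_windows(t, size, slide):
    """All windows of length `size`, starting at multiples of `slide`, that contain t."""
    windows = []
    start = t - (t % slide)
    while start > t - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group events into sessions; a session ends after `gap` of inactivity."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Event falls inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], t + gap)
        else:
            sessions.append((t, t + gap))
    return sessions

one_window = tumbling_window(17, size=10)
overlapping = sliding_windows(17, size=10, slide=5)
sessions = session_windows([1, 3, 4, 20, 22], gap=5)
```

A tumbling window assigns each event to exactly one window, a sliding window assigns it to several overlapping ones, and a session window has no fixed length because its boundaries depend on the data.
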

Architecture Adjustment (Introducing Flink):

graph TD
    subgraph "Data Sources"
        KafkaStream("Kafka Real-time Stream");
        HDFSBatch("HDFS Batch Data");
    end
    subgraph "Flink Cluster"
        JobManager("JobManager - Master") -- Schedules --> TaskManagers("TaskManager - Worker");
        KafkaStream -- Input --> TaskManagers;
        HDFSBatch -- Input --> TaskManagers;
        TaskManagers -- "Execute Operators (map, filter, window, aggregate)" --> TaskManagers;
        %% State Persistence
        TaskManagers -- Checkpoint State --> ExternalStorage("External Storage HDFS/S3");
        ExternalStorage -- Restore State --> TaskManagers;
        %% Output
        TaskManagers -- Output --> Sink("Kafka/DB/...");
    end

Provides low-latency real-time stream processing and the potential for unified stream-batch, but resource management and scheduling remain challenges.

Phase 4: Resource Scheduling and Elasticity → Embracing Kubernetes

Challenge Revisited: Resource Utilization and Operational Efficiency

Traditional computing clusters (like YARN) have relatively static resource allocation, making rapid elastic scaling based on load changes difficult, leading to low resource utilization.
Maintaining multiple clusters (Hadoop/YARN, Spark, Flink, etc.) simultaneously increases operational complexity and cost.
Need a more unified, cloud-native resource management and scheduling platform.

❓ Architect's Thinking Moment:

How to make computing tasks elastic like microservices? How to unify resource management? (Can Spark/Flink jobs run on K8s? Containerization is the trend. How to achieve on-demand allocation and auto-scaling?)

✅ Evolution Direction: Containerize Computing Frameworks and Run on Kubernetes

Containerize Computing Tasks:
Package the Spark Driver/Executor and Flink JobManager/TaskManager programs into container (Docker) images.
Kubernetes as Unified Resource Scheduler:
Use Kubernetes (K8s) as the underlying resource management and scheduling platform, replacing YARN/Mesos.
Both Spark and Flink offer native Kubernetes integration and can also be run on K8s via the Operator pattern.
Elastic Scaling:
Use the K8s HPA (Horizontal Pod Autoscaler), or the auto-scaling features that Spark and Flink provide on K8s, to automatically adjust the number of Executor/TaskManager Pods based on metrics such as CPU, memory, or queue backlog, achieving resource elasticity.
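
The scaling rule documented for the HPA is, in essence, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with no change while the ratio stays within a small tolerance of 1. The sketch below applies that rule; the function name and the replica bounds are assumptions for this example, not values from the original setup.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by an HPA-style proportional scaling rule."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas           # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults in this sketch).
    return max(min_replicas, min(max_replicas, desired))

scale_out = desired_replicas(4, current_metric=90, target_metric=60)
scale_in = desired_replicas(4, current_metric=30, target_metric=60)
steady = desired_replicas(4, current_metric=63, target_metric=60)
capped = desired_replicas(4, current_metric=300, target_metric=60)
```

With four TaskManager Pods at 90% average CPU against a 60% target, the rule asks for six Pods. At 30% it shrinks to two, and at 63% it leaves the deployment alone.
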
Simplify Operations:
Use K8s uniformly to manage all types of computing tasks (batch, stream, even machine learning), reducing operational complexity.
Leverage the K8s ecosystem (e.g., Helm for deployment, Prometheus for monitoring).
(Optional) Reduce Costs with Cloud Spot Instances:
For fault-tolerant batch jobs, schedule them on cloud provider Spot Instances to significantly reduce computing costs.

Architecture Adjustment (Compute Layer Running on K8s):

graph TD
    subgraph "Submission & Scheduling"
        SparkSubmit("Spark Submit / Flink CLI") -- Submit Job --> K8sAPIServer("Kubernetes API Server");
        K8sAPIServer -- Create/Manage Pods --> K8sCluster("Kubernetes Cluster");
    end
    subgraph "Deployment within K8s Cluster"
        DriverPod("Driver/JobManager Pod");
        ExecutorPods("Executor/TaskManager Pods");
        Operator("Spark/Flink Operator Optional");
        HPA("Horizontal Pod Autoscaler Optional");
        K8sCluster -- Runs --> DriverPod;
        K8sCluster -- Runs --> ExecutorPods;
        Operator -- "Manages" --> DriverPod & ExecutorPods;
        HPA -- "Monitors Metrics & Scales" --> ExecutorPods;
    end
    DriverPod -- Coordinates --> ExecutorPods;
    ExecutorPods -- Read/Write --> Storage("HDFS/S3/Kafka...");

Achieved elastic resource management and unified operations, but stream and batch processing logic remain separate.

Phase 5: Pursuing Unification → Flink SQL Stream-Batch Unity

Final Challenge: Development Efficiency and Logic Reuse

Many business scenarios require processing both historical data (batch) and real-time data (stream), e.g., calculating user profiles based on full behavior history (batch) and updating user current status in real-time (stream).
Developing and maintaining two separate codebases (one Spark/MapReduce for Batch, one Flink for Streaming) is inefficient and makes ensuring logical consistency difficult.
Can a single API or language be used to describe both batch and stream processing logic?

❓ Architect's Thinking Moment:

How to unify stream and batch processing with one codebase? (Need a higher-level abstraction. Is SQL a good choice? How to convert between streams and tables?)

✅ Evolution Direction: Utilize Flink SQL for Unified Stream-Batch Programming Model

The Flink community strives to provide a unified stream-batch programming experience through Flink SQL.

Unified SQL API:
Users can write data processing logic using standard SQL syntax, whether for bounded batch data or unbounded streaming data.

Stream-Table Duality:
Flink SQL treats Streams and Tables as two different representations of the same concept.
A stream can be viewed as a continuously appended Dynamic Table.
A table can be viewed as a snapshot of a stream at a specific point in time.
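
The duality can be sketched by converting in both directions. Below, a continuous count query turns an event stream into a changelog, and replaying that changelog rebuilds the table. The row-kind tags (+I, -U, +U, -D) follow the notation Flink uses for changelog rows; the functions themselves are hypothetical and written only for this example.

```python
def count_by_key(events):
    """Continuous query: each input event produces changes to the result table."""
    counts, changelog = {}, []
    for key in events:
        if key in counts:
            changelog.append(("-U", key, counts[key]))   # retract the old row
            counts[key] += 1
            changelog.append(("+U", key, counts[key]))   # emit the updated row
        else:
            counts[key] = 1
            changelog.append(("+I", key, 1))             # insert a new row
    return changelog

def apply_changelog(changelog):
    """Stream -> table: fold a changelog into the table contents at that point."""
    table = {}
    for kind, key, value in changelog:
        if kind in ("+I", "+U"):
            table[key] = value
        elif kind in ("-U", "-D"):
            table.pop(key, None)
    return table

clicks = ["alice", "bob", "alice"]
changelog = count_by_key(clicks)
table_now = apply_changelog(changelog)
table_earlier = apply_changelog(changelog[:2])   # snapshot at an earlier point
```

Replaying a prefix of the changelog yields the table as it looked at that moment, which is the sense in which a table is a snapshot of a stream.
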
Rich Connectors:
Flink SQL provides numerous Connectors to easily connect to various external systems as Sources or Sinks, such as Kafka, HDFS, MySQL, Elasticsearch, Redis, etc.
Unified Execution Engine:
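The same SQL job runs on a single Flink Runtime in either batch or streaming execution mode. The mode is typically chosen through configuration (the execution.runtime-mode option) to match the input: bounded sources can run in batch mode, while unbounded sources require streaming mode.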

Architecture Adjustment (Centering on Flink SQL):

graph TD
    subgraph "Data Sources & Sinks"
        KafkaSource("Kafka Stream Source");
        HDFSSource("HDFS Batch Source");
        MySQLSource("MySQL Lookup Source");
        ElasticSink("Elasticsearch Sink");
        KafkaSink("Kafka Sink");
    end
    subgraph "Flink SQL Platform"
        SQLClient("SQL Client / API / UI") -- Submit SQL Query --> FlinkSQLGateway("Flink SQL Gateway");
        FlinkSQLGateway -- Parse & Optimize --> FlinkCluster("Flink Cluster (JobManager/TaskManagers)");
        FlinkCluster -- Use Connector --> KafkaSource;
        FlinkCluster -- Use Connector --> HDFSSource;
        FlinkCluster -- Use Connector --> MySQLSource;
        FlinkCluster -- Use Connector --> ElasticSink;
        FlinkCluster -- Use Connector --> KafkaSink;
    end

Summary: The Evolutionary Path of Distributed Computing Frameworks

Phase	Core Challenge	Key Solution	Representative Tech/Pattern
0. Single Node	Capacity/Speed Limit	Local Scripting	Python/Pandas, Shell, Cron
1. MapReduce	Massive Data Batch Processing	MapReduce Model + HDFS	Hadoop, HDFS, MapReduce, Hive, Pig
2. Spark	Disk I/O Bottleneck, Iteration Perf	In-Memory Computing (RDD/DAG)	Spark, RDD, DAG, Cache/Persist, Narrow/Wide Dependency
3. Flink	Real-time Low Latency, State Mgmt	Native Streaming, Event Time, Checkpoint	Flink Streaming, Watermark, Stateful Computation
4. K8s	Resource Utilization, Ops Efficiency	Containerization + K8s Scheduler + Elasticity	Docker, Kubernetes, Native K8s Support, Operator, HPA
5. Flink SQL	Development Efficiency, Logic Reuse	Unified Stream-Batch SQL API	Flink SQL, Stream-Table Duality, Dynamic Tables