

Course 12: AI Platform Architecture Evolution Case Study

Goal: Artificial intelligence is reshaping industries, and building efficient, scalable AI platforms is key to unlocking AI productivity. Using the evolution of a typical AI platform as an example, this course helps you master core capabilities such as data management and feature engineering, distributed training, model serving, the full MLOps lifecycle, and large-model-specific architectures, and understand the architectural challenges and practices involved in going from algorithm experimentation to large-scale production.


Phase 0: The Wild West Era (Data Scientist's Local Experiments: Jupyter Notebook)

Scenario Description

  • Workflow: Data scientists conduct algorithm exploration and model training on their laptops or single workstations using Jupyter Notebook.
  • Data Processing: Directly read local files (CSV, Parquet) or connect to databases for small-scale data.
  • Environment Management: Rely on conda or virtualenv to manage local Python environments.
  • Model Sharing: Trained models are shared as serialized files (e.g., .pkl, .h5) via email, cloud storage, etc.
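
Concretely, a Phase 0 experiment is often nothing more than the kind of script sketched below, run inside a notebook (a minimal sketch; the CSV file, the "churned" label column, and the model choice are purely illustrative):

```python
# Minimal sketch of a Phase 0 workflow: local data in, pickled model out.
# "customers.csv" and the "churned" column are hypothetical placeholders.
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")                 # small, local dataset
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

with open("model.pkl", "wb") as f:                # this .pkl then travels by email or cloud drive
    pickle.dump(model, f)
```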

Architecture Diagram:

graph LR
    DS["Data Scientist"];
    Laptop["Laptop/Workstation"];
    LocalData["Local Data Files"];
    Jupyter["Jupyter Notebook"];
    ModelFile["Model File .pkl"];

    DS -- "Local Operations" --> Laptop;
    Laptop -- "Loads" --> LocalData;
    Laptop -- "Runs" --> Jupyter;
    Jupyter -- "Outputs" --> ModelFile;
Pain Points at this moment:

  • "Works on My Machine" Syndrome: Inconsistent environments make experimental results difficult to reproduce.
  • Resource Bottleneck: Limited local compute resources (CPU, GPU, memory) cannot handle large datasets or complex model training.
  • Collaboration Difficulty: Chaotic management of models, code, and data hinders team efficiency.
  • Deployment Impossibility: Experimental results are hard to translate into production-ready services.


Phase 1: Centralized Compute Resources (GPU Server Cluster + Shared Storage)

Challenge Emerges

  • Data volume grows (e.g., beyond 10 GB), exceeding local memory capacity.
  • Model complexity rises, requiring powerful GPU resources for training acceleration.
  • Multi-person collaboration needs unified development environments and data storage.

❓ Architect's Thinking Moment: How to provide data scientists with more powerful compute resources and a collaborative environment?

(Local is insufficient, use servers. How to share GPU servers? Where to store data conveniently for everyone? How to unify environments?)

✅ Evolution Direction: Build GPU Resource Pool + Shared Storage + Environment Containerization

  1. Centralized GPU Resource Pool:
    • Purchase or rent servers equipped with high-performance GPUs.
    • Use a job scheduling system (like Slurm, PBS) to manage and schedule GPU resources, allowing multiple users to share servers.
  2. Shared Storage:
    • Set up a Network File System (NFS) or use a distributed file system (like HDFS, Ceph) to store training data, code, and models, making them accessible to all users and servers.
  3. Environment Containerization (Docker):
    • Use Docker to package the training environment, including required libraries (TensorFlow, PyTorch, Scikit-learn, etc.) and drivers, into standard images.
    • Data scientists specify which Docker image to use when submitting training jobs, ensuring environment consistency and reproducibility.
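
To make this workflow concrete, here is a minimal sketch of submitting a containerized training job from Python, assuming a Slurm cluster whose nodes can run Docker with GPU support (many sites use Singularity/Apptainer instead); the image name, shared-storage path, and partition are hypothetical placeholders:

```python
# Sketch: submit a containerized training job to Slurm via sbatch.
# Assumes compute nodes can run "docker run --gpus" (nvidia-container-toolkit installed);
# the image, mount path, and partition below are placeholders.
import subprocess

def submit_training_job(script: str, image: str, gpus: int = 1, partition: str = "gpu") -> None:
    train_cmd = (
        "docker run --rm --gpus all "
        "-v /mnt/shared:/workspace "            # shared NFS storage mounted into the container
        f"{image} python /workspace/{script}"
    )
    subprocess.run(
        [
            "sbatch",
            "--job-name=train",
            f"--partition={partition}",
            f"--gres=gpu:{gpus}",               # ask Slurm for N GPUs on one node
            f"--wrap={train_cmd}",              # wrap the shell command in a batch job
        ],
        check=True,
    )

submit_training_job("train_resnet.py", image="registry.internal/ml/pytorch:2.3-cuda12", gpus=2)
```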

Architecture Adjustment (Introducing Compute Cluster and Shared Environment):

graph TD
    DataScientist(Data Scientist) -- Submits Job --> Scheduler(Job Scheduling System Slurm/PBS);
    Scheduler -- Allocates Resource & Runs --> GPUNode1(GPU Server 1 - Docker Container);
    Scheduler -- Allocates Resource & Runs --> GPUNode2(GPU Server 2 - Docker Container);
    Scheduler -- Allocates Resource & Runs --> GPUNodeN(GPU Server N - Docker Container);
    subgraph "Shared Resources"
        SharedStorage(Shared Storage NFS/HDFS);
        DockerRegistry(Docker Image Registry);
    end
    GPUNode1 -- Reads/Writes Data/Models --> SharedStorage;
    GPUNode2 -- Reads/Writes Data/Models --> SharedStorage;
    GPUNodeN -- Reads/Writes Data/Models --> SharedStorage;
    GPUNode1 -- Pulls Image --> DockerRegistry;
    GPUNode2 -- Pulls Image --> DockerRegistry;
    GPUNodeN -- Pulls Image --> DockerRegistry;
Solved: Compute resource shortage and basic environment consistency issues, supporting larger-scale model training. Still Exists: No distributed training capability (training is still confined to a single node, at most multi-GPU within it), repetitive feature engineering, and difficulty deploying models online.


Phase 2: Model Too Big, Single Card Not Enough → Distributed Training Frameworks

Challenge Escalates: Training Efficiency and Model Scale Bottleneck

  • Model scale (parameter count) continues to grow (e.g., BERT, GPT), exceeding single-GPU memory capacity or making single-node training prohibitively long (days or weeks).
  • Need to utilize multiple machines and multiple GPUs in parallel to accelerate training and support larger models.

❓ Architect's Thinking Moment: How to make multiple machines and GPUs collaboratively train a single model?

(How to distribute data and computation? How do nodes synchronize model parameters? What are the mature distributed training frameworks?)

✅ Evolution Direction: Introduce Distributed Training Frameworks

Based on different parallel strategies, mainstream frameworks include:

  1. Data Parallelism: Most common.
    • Principle: Split the training data into shards; each GPU holds a full model replica, processes its shard, and computes gradients; gradients are aggregated across GPUs (e.g., via All-Reduce communication); each replica then applies the same aggregated gradients, keeping model parameters synchronized.
    • Frameworks: Horovod (Uber open-source, good cross-framework support), PyTorch DistributedDataParallel (DDP) (PyTorch native, easy to use), TensorFlow MirroredStrategy/MultiWorkerMirroredStrategy.
    • Applicable: When the model fits in single-card memory, but training needs acceleration.
  2. Model Parallelism:
    • Principle: Split different parts of the model itself (e.g., different layers) across different GPUs. Data flows through different GPUs to complete forward and backward computations.
    • Challenges: Complex splitting strategy, high communication overhead.
    • Frameworks: Often requires manual implementation or specific libraries (e.g., PyTorch Pipeline Parallelism, Megatron-LM).
    • Applicable: When the model is too large to fit in single-card memory.
  3. Parameter Server (PS) Architecture:
    • Principle: Store model parameters on dedicated parameter servers. Worker nodes pull the latest parameters from PS, compute gradients, and push gradients back to PS for updates.
    • Framework: TensorFlow ParameterServerStrategy.
    • Applicable: Ultra-large-scale sparse feature models (e.g., recommendation, advertising), where communication can be asynchronous.
  4. Hybrid Parallelism: Combines data parallelism, model parallelism, pipeline parallelism, etc., to handle extremely large models (e.g., GPT-3 level). E.g., DeepSpeed (Microsoft).

Optimization Techniques:

  • Automatic Mixed Precision (AMP): Use FP16 for most computations to reduce memory usage and compute time, while keeping critical parts in FP32 for accuracy.
  • Gradient Accumulation: Simulate larger batch sizes when memory is limited.
  • Gradient Checkpointing: Trade compute time for memory by reducing storage of intermediate activations.
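
The sketch below shows what data parallelism plus AMP looks like in practice with PyTorch DDP, assuming the script is launched with torchrun (e.g., `torchrun --nproc_per_node=4 train.py`); the model, dataset, and hyperparameters are placeholders:

```python
# Sketch: PyTorch DistributedDataParallel (data parallelism) with automatic mixed precision.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model: torch.nn.Module, dataset, epochs: int = 1) -> None:
    dist.init_process_group(backend="nccl")          # torchrun provides RANK/WORLD_SIZE env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)            # each rank sees a distinct data shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()             # AMP: scale FP16 gradients to avoid underflow

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                     # reshuffle shards every epoch
        for inputs, labels in loader:
            inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():          # run forward/loss mostly in FP16
                loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            scaler.scale(loss).backward()            # DDP all-reduces gradients during backward
            scaler.step(optimizer)
            scaler.update()

    dist.destroy_process_group()
```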

Architecture Adjustment (Data Parallelism Horovod/DDP Example):

graph TD
    Scheduler(Job Scheduler Slurm/K8s) -- Launches N Worker Processes --> Worker1(Worker 1 - Rank 0);
    Scheduler -- Launches N Worker Processes --> Worker2(Worker 2 - Rank 1);
    Scheduler -- Launches N Worker Processes --> WorkerN(Worker N - Rank N-1);
    subgraph "Inside Worker (Each runs on a GPU)"
        DataShard(Process local data shard);
        ModelReplica(Hold model replica);
        ComputeGradient(Compute local gradients);
    end
    Worker1 --> ComputeGradient;
    Worker2 --> ComputeGradient;
    WorkerN --> ComputeGradient;
    ComputeGradient -- Local Gradients --> AllReduce(All-Reduce Communication NCCL/Gloo);
    AllReduce -- Global Gradients --> UpdateParams(Update Model Parameters);
    UpdateParams -- Sync Parameters --> Worker1 & Worker2 & WorkerN;
    Worker1 & Worker2 & WorkerN -- Read/Write --> SharedStorage;
Solved: Slow training and single-card memory bottlenecks, enabling larger models to be trained. Introduced New Problems: Siloed feature engineering, difficulty reusing features, and inconsistency between online and offline features.


Phase 3: Chaotic Feature Production → Building a Feature Store

Challenge Emerges: The "Spaghetti" State of Feature Production

  • Different algorithm teams and models might repeatedly develop similar features, wasting effort.
  • Feature logic used for online prediction might differ from that used during offline training, degrading model performance (Train-Serve Skew).
  • Lack of unified feature management, version control, and monitoring mechanisms.
  • Difficulty accessing and using real-time features.

❓ Architect's Thinking Moment: How to centrally manage feature production and usage? How to ensure online-offline consistency? How to easily use real-time features?

(Need a central platform for features. How to define features? How to store them? How to version them? How to ingest real-time feature streams?)

✅ Evolution Direction: Build a Feature Store

A Feature Store aims to solve problems in feature production and consumption, providing unified feature management and serving capabilities:

  1. Unified Feature Definition & Registration:
    • Provide a UI or SDK for data scientists/engineers to define features (name, type, source, compute logic, metadata) and register them in the platform.
  2. Offline Feature Storage & Computation:
    • Store historical feature data in offline storage (e.g., Hive, Delta Lake, Snowflake).
    • Provide or integrate with batch processing engines (like Spark) for batch computation and updates of offline features.
  3. Online Feature Storage & Serving:
    • Load features needed for online prediction (usually latest values or precomputed ones) into online storage (typically a low-latency KV store such as Redis, DynamoDB, or Cassandra).
    • Provide a low-latency online feature serving API for real-time feature lookup during online model inference.
  4. Real-time Feature Computation & Ingestion:
    • Integrate with stream processing engines (like Flink, Spark Streaming) to consume real-time event streams (Kafka), compute features in real-time (e.g., user behavior stats in the last N minutes), and write to online storage.
  5. Feature Discovery & Versioning:
    • Provide a feature catalog for users to discover and reuse existing features.
    • Support feature versioning to ensure correct feature versions are used during training and inference.
  6. Feature Monitoring:
    • Monitor feature data distribution, quality, and drift (e.g., using Great Expectations) to promptly detect data issues.

Feature Store Solutions: Feast (open-source), Tecton (commercial).
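
As a concrete illustration of offline versus online consumption, here is a minimal sketch using Feast's Python SDK (API names follow recent Feast releases and may differ between versions; the feature view `user_stats`, its features, and the entity `user_id` are hypothetical):

```python
# Sketch: consuming the same features offline (training) and online (inference) via Feast.
from datetime import datetime, timedelta

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # directory containing the registered feature definitions

# Offline: point-in-time-correct training data joined from the offline store.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": [datetime.utcnow() - timedelta(days=1)] * 2,
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:purchase_count_7d", "user_stats:avg_order_value"],
).to_df()

# Online: low-latency lookup of the same feature definitions at inference time.
online_features = store.get_online_features(
    features=["user_stats:purchase_count_7d", "user_stats:avg_order_value"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```

Because both paths read from the same registered feature definitions, training and serving code cannot silently diverge, which is what eliminates train-serve skew.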

Architecture Adjustment (Introducing Feature Store):

graph TD
    subgraph "Feature Production"
        BatchSource(Offline Data Source Hive/DB) --> SparkBatch(Spark Batch Compute Offline Features);
        StreamSource(Real-time Event Stream Kafka) --> FlinkStream(Flink Stream Compute Real-time Features);
        SparkBatch -- Write --> OfflineStore(Offline Feature Store Hive/Delta);
        SparkBatch -- Write --> OnlineStore(Online Feature Store Redis/KV);
        FlinkStream -- Write --> OnlineStore;
        FeatureRegistry(Feature Registry/Management Center);
        SparkBatch -- Register/Use Definition --> FeatureRegistry;
        FlinkStream -- Register/Use Definition --> FeatureRegistry;
    end
    subgraph "Feature Consumption"
        OfflineStore -- Provides Training Data --> ModelTraining(Model Training);
        OnlineStore -- Provides Online Features --> OnlineInference(Online Model Inference);
        ModelTraining -- Query Feature Definition --> FeatureRegistry;
        OnlineInference -- Query Feature Definition --> FeatureRegistry;
    end
    subgraph "Monitoring"
        OfflineStore & OnlineStore --> FeatureMonitoring(Feature Monitoring Great Expectations);
    end
Solved: Redundant feature development and online-offline inconsistency, improving feature engineering efficiency and standardization. Next Challenge: How to deploy trained models quickly and reliably into production for serving?


Phase 4: Difficulty Deploying Models → Model Serving

Challenge Revisited: Efficiency and Stability of Model Deployment

  • Deploying a model trained in a Python Notebook into a stable, high-performance, scalable online prediction service is non-trivial.
  • Need to handle concurrent requests, model version management, resource isolation, performance monitoring, etc.
  • Different models (Scikit-learn, TensorFlow, PyTorch, XGBoost, etc.) require different runtime environments and dependencies.

❓ Architect's Thinking Moment: How to package models trained with various frameworks into standard, high-performance online services?

(Just wrap it with Flask/Django? Can it handle the performance? Are there more professional model deployment solutions? How to do A/B testing and version management?)

✅ Evolution Direction: Adopt Professional Model Inference Servers or Platforms

  1. Model Format Conversion & Optimization:
    • Convert models trained in different frameworks into standard, deployment-friendly formats (e.g., ONNX, TensorFlow SavedModel, TorchScript).
    • Use inference optimization libraries (like TensorRT, OpenVINO, ONNX Runtime) to optimize models (quantization, operator fusion, etc.) for faster inference and lower resource usage.
  2. Dedicated Inference Servers:
    • Use server software specifically designed for model inference, such as:
      • NVIDIA Triton Inference Server: Supports multiple frameworks (TF, PyTorch, ONNX, TensorRT, etc.), multi-model serving, dynamic batching, model ensembles; excellent performance.
      • TensorFlow Serving: Focuses on TensorFlow model deployment.
      • TorchServe: Focuses on PyTorch model deployment.
    • These servers usually provide standard HTTP/gRPC interfaces for easy integration.
  3. Kubernetes-Based Model Serving Platforms:
    • Containerize inference servers and deploy them on Kubernetes.
    • Use open-source platforms like KFServing (now KServe) or Seldon Core. They provide higher-level capabilities on top of K8s for model deployment, versioning, traffic splitting (Canary/Shadow), auto-scaling, explainability, etc.
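
The sketch below illustrates step 1 above: exporting a toy PyTorch model to ONNX and running it with ONNX Runtime. The model, shapes, and opset version are illustrative; real deployments typically define dynamic axes per input and validate the ONNX outputs against the original framework:

```python
# Sketch: convert a PyTorch model to ONNX and run inference with ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 16)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
    opset_version=17,
)

# The same .onnx file can be loaded by ONNX Runtime, Triton's ONNX backend, or TensorRT.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(["logits"], {"input": np.random.randn(4, 16).astype(np.float32)})[0]
print(logits.shape)  # (4, 2)
```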

Architecture Adjustment (Introducing Model Serving Platform KServe/Triton on K8s):

graph TD
    subgraph "Model Training & Conversion"
        ModelTraining(Model Training) --> ModelRegistry(Model Registry MLflow/S3);
        ModelRegistry -- Download Model --> Optimizer(Model Optimization TensorRT/ONNX);
        Optimizer -- Optimized Model --> OptimizedModelRegistry;
    end
    subgraph "Model Deployment (KServe on K8s)"
        KServeController(KServe Controller) -- Watches CRD --> K8sAPIServer;
        K8sAPIServer -- Creates --> InferenceService(InferenceService Pod);
        OptimizedModelRegistry -- Loads Model --> InferenceService;
        subgraph "Inside InferenceService Pod"
            Predictor(Predictor Triton/TFServing/TorchServe);
            Transformer(Optional: Feature Transformer);
            Agent(KServe Agent);
            Agent -- "Serves Model" --> Predictor;
            Transformer <-->|Request| Agent;
            Predictor <-->|Request| Agent;
        end
    end
    subgraph "Service Invocation"
        ClientApp(Client Application) --> K8sIngress(K8s Ingress/Istio Gateway);
        K8sIngress -- Routes/Splits Traffic --> InferenceService;
    end
    %% Monitoring
    InferenceService -- Metrics --> Prometheus;
Solved: Inefficient, low-performance, hard-to-manage model deployment, replaced by standardized, high-performance model serving. Final Challenge: How to connect data processing, feature engineering, model training, model deployment, monitoring, and feedback into an automated, repeatable, and reliable machine learning workflow?


Phase 5: Connecting the Dots → MLOps Lifecycle Automation

The Ultimate Goal: Industrialized AI Production

  • The entire process from data acquisition, preprocessing, feature engineering, model training, validation, deployment, monitoring, to retraining still involves many manual steps and handoffs between different roles (data engineers, data scientists, ML engineers, Ops).
  • Lack of standardization, automation, and reproducibility leads to low efficiency and high risk in delivering AI capabilities.
  • Need a systematic approach to manage the entire machine learning lifecycle.

❓ Architect's Thinking Moment: How to automate the entire ML process from data to production model? How to manage versions, lineage, and reproducibility?

(Need workflow orchestration. Need experiment tracking. Need model registry. Need CI/CD for ML.)

✅ Evolution Direction: Build an MLOps Platform

MLOps (Machine Learning Operations) applies DevOps principles to the machine learning lifecycle to build, deploy, and maintain ML models in production reliably and efficiently. An MLOps platform typically integrates tools for:

  1. Data Management & Versioning: Tools like DVC (Data Version Control), Pachyderm to version control large datasets alongside code. Integration with Feature Stores.
  2. Experiment Tracking: Tools like MLflow Tracking, Weights & Biases (W&B), Kubeflow Metadata to log parameters, metrics, code versions, and artifacts for each experiment, ensuring reproducibility.
  3. Workflow Orchestration / Pipelines: Tools like Kubeflow Pipelines, Airflow, Argo Workflows, MLflow Pipelines to define, schedule, and execute multi-step ML workflows (data prep, training, validation, deployment) as Directed Acyclic Graphs (DAGs). Pipelines are often container-based and run on Kubernetes.
  4. Model Registry: Tools like MLflow Model Registry, Kubeflow Artifact Store (MinIO/S3) to store, version, and manage trained models, facilitating deployment and governance (tagging models for staging/production).
  5. CI/CD for ML: Extending traditional CI/CD pipelines (like Jenkins, GitLab CI, GitHub Actions) to include ML-specific steps: automated testing (data validation, model evaluation), automated model training/retraining, automated model deployment (using KServe/Seldon). Triggered by code changes or data drift.
  6. Model Monitoring: Monitoring model prediction performance (accuracy, drift), operational metrics (latency, throughput, errors), and data quality in production. Tools like Prometheus, Grafana, Evidently AI, WhyLabs. Feedback loop for retraining or alerting.
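
To make items 2 and 4 concrete, here is a minimal sketch of experiment tracking plus model registration with MLflow (the tracking server URL, experiment name, and model name are hypothetical; the Model Registry assumes a database-backed tracking server):

```python
# Sketch: log an experiment run and register the resulting model with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # assumed MLflow tracking server
mlflow.set_experiment("churn-model")

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)                             # reproducibility: hyperparameters

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", accuracy_score(y_val, model.predict(X_val)))

    # Log the model artifact and register a new version, so CI/CD can promote it
    # through staging/production in the Model Registry.
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="churn-classifier")
```

In a full MLOps setup, a pipeline orchestrator (Kubeflow Pipelines, Airflow, Argo) would run this as one step of a DAG, and the CI/CD pipeline would react to the newly registered model version.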

Architecture Adjustment (Integrated MLOps Platform):

graph TD
    subgraph "Versioning"
        CodeRepo(Git Code Repo);
        DataRepo(DVC/Pachyderm Data Repo);
        FeatureStore(Feature Store);
    end
    subgraph "Orchestration & Execution"
        PipelineEngine(ML Pipeline Orchestrator Kubeflow/Airflow/Argo);
        ComputeCluster(K8s Compute Cluster);
        PipelineEngine -- Define & Run Pipeline --> ComputeCluster;
        ComputeCluster -- Run Steps (DataPrep, Train, Eval) --> ComputeCluster;
    end
    subgraph "Tracking & Registry"
        ExperimentTracker(Experiment Tracking MLflow/W&B);
        ModelRegistry(Model Registry MLflow);
    end
    subgraph "Deployment & Serving"
        CD_Pipeline(CI/CD Jenkins/GitLab CI);
        ModelServingPlatform(Model Serving KServe/Seldon);
    end
    subgraph "Monitoring"
        MonitoringStack(Monitoring Prometheus/Grafana/Evidently);
        Alerting(Alerting System);
    end

    %% Connections
    CodeRepo --> PipelineEngine;
    DataRepo --> PipelineEngine;
    FeatureStore -- Features --> PipelineEngine;
    PipelineEngine -- Log Experiments --> ExperimentTracker;
    PipelineEngine -- Register Model --> ModelRegistry;
    ModelRegistry -- Trigger Deployment --> CD_Pipeline;
    CD_Pipeline -- Deploy Model --> ModelServingPlatform;
    ModelServingPlatform -- Serve Predictions --> ClientApp;
    ModelServingPlatform -- Send Metrics/Logs --> MonitoringStack;
    MonitoringStack -- Trigger Alert/Retraining --> Alerting & PipelineEngine;
    ExperimentTracker -- Enables Reproducibility --> Developer;
Achieved: Automation, reproducibility, and reliability across the entire ML lifecycle, enabling industrialized AI production.


Summary: The Evolutionary Path of AI Platforms

| Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
| --- | --- | --- | --- |
| 0. Local | Resource Limit / Reproducibility / Collab | Local Jupyter Notebook | Python, Pandas, Scikit-learn, Conda |
| 1. Central GPU | Compute Resource / Env Consistency | GPU Cluster + Shared Storage + Docker | Slurm/PBS, NFS/HDFS, Docker Registry |
| 2. Dist Train | Training Efficiency / Model Scale | Distributed Training Frameworks | Horovod, PyTorch DDP, TF Strategy, DeepSpeed, NCCL/Gloo |
| 3. FeatureStore | Feature Chaos / Train-Serve Skew | Feature Store Platform | Feast, Tecton, Spark/Flink, Redis/Hive, Great Expectations |
| 4. Model Serve | Deployment Efficiency / Stability | Model Inference Server / Platform | Triton, TF Serving, TorchServe, KServe, Seldon, ONNX Runtime |
| 5. MLOps | Lifecycle Automation / Reliability | Integrated MLOps Platform | Kubeflow Pipelines, Airflow, MLflow, Argo CD, DVC, W&B |