Case 11: Cloud Native
Course 11: Cloud-Native Infrastructure Architecture Evolution Case Study
Goal: Cloud-native is more than containerization; it is a set of architectural principles and technologies designed for cloud environments. This course traces the evolution of infrastructure from traditional data centers (IDCs) to modern cloud-native architectures, so you can grasp core capabilities such as virtualization, container orchestration, service mesh, serverless computing, zero-trust security, and hybrid cloud management, and understand the gains in efficiency, elasticity, and reliability that cloud-native brings.
Phase 0: The Prehistoric Era (Physical Machines and IDC Hosting)
Scenario Description
- Deployment: Applications deployed directly onto physical servers purchased by the company or hosted in an IDC.
- Resource Management: Manual allocation of server resources, static IP address binding.
- Operations: Manual configuration, SSH logins, rsync or script-based deployment.
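To make "script-based deployment" concrete, here is a minimal sketch of what such a deploy script often looked like, using Python around rsync and SSH; the hostnames, user, and paths are hypothetical placeholders.
```python
# Minimal sketch of an era-typical deploy script (hypothetical hosts/paths).
import subprocess

HOSTS = ["app-server-01.example.com", "app-server-02.example.com"]  # statically assigned
BUILD_DIR = "./build/"        # locally built artifact
REMOTE_DIR = "/opt/myapp/"    # fixed path on each physical server

def deploy(host: str) -> None:
    # Mirror the local build onto the server with rsync over SSH.
    subprocess.run(
        ["rsync", "-avz", "--delete", BUILD_DIR, f"deploy@{host}:{REMOTE_DIR}"],
        check=True,
    )
    # Restart the service by hand -- no orchestration, no health checks.
    subprocess.run(["ssh", f"deploy@{host}", "sudo systemctl restart myapp"], check=True)

if __name__ == "__main__":
    for host in HOSTS:  # sequential, host-by-host rollout
        deploy(host)
```
Every step is imperative and per-host, with no rollback or scheduling; this fragility is exactly what the later phases remove.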
Architecture Diagram
graph TD
AppA[Application A] --> Server1(Physical Server 1);
AppB[Application B] --> Server2(Physical Server 2);
User --> AppA;
User --> AppB;
Pain points at this stage:
- Extremely Low Resource Utilization: Servers often run at only 10-20% utilization, leaving most capacity idle.
- Long Scaling Cycles: Hardware procurement, racking, and configuration take weeks or even months.
- Inefficient Operations: Environment inconsistencies, error-prone manual deployments, slow fault recovery.
- Lack of Elasticity: Inability to dynamically adjust resources based on load.
Phase 1: Introduction of Virtualization (Embracing IaaS)
Challenge Emerges
- Business growth requires more flexible resource allocation and shorter application deployment times.
- Need to quickly create and destroy test environments, reducing environment costs.
❓ Architect's Thinking Moment: How to improve resource utilization and flexibility without changing hardware?
(Physical machines are wasteful. Can multiple isolated environments run on one physical machine? Is virtualization the answer? How to manage these virtual machines?)
✅ Evolution Direction: Introduce Server Virtualization Technology and Management Platform
- Hypervisor:
- Install Hypervisor software (like KVM, VMware ESXi, Xen) on physical servers.
- Hypervisors allow multiple independent Virtual Machines (VMs) to run on the same physical machine, each with its own OS and resources (CPU, memory, disk, network).
- Virtualization Management Platform (IaaS Foundation):
- Introduce a virtualization management platform (like OpenStack, VMware vSphere, or directly use public cloud EC2/ECS services) to centrally manage virtual resources like compute (Nova), storage (Cinder/EBS), and network (Neutron/VPC).
- Enables rapid VM creation, deletion, migration, snapshotting, and more (a provisioning sketch follows this list).
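As a sketch of what "minutes instead of weeks" looks like in practice, the snippet below provisions a VM through a public-cloud IaaS API (boto3/EC2 used as one example; the AMI ID, key pair, and tags are placeholders, and OpenStack or other IaaS SDKs follow the same pattern).
```python
# Minimal sketch of IaaS-style VM provisioning via a public cloud API.
# The AMI ID, instance type, and key pair name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a VM in minutes instead of procuring hardware for weeks.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder machine image
    InstanceType="t3.micro",
    KeyName="my-keypair",              # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "test-env-vm"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Created VM {instance_id}")

# Destroying a short-lived test environment is equally fast:
# ec2.terminate_instances(InstanceIds=[instance_id])
```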
Architecture Adjustment (Introducing Virtualization Layer):
graph TD
subgraph "Physical Layer"
HW1(Physical Server 1);
HW2(Physical Server 2);
end
subgraph "Virtualization Layer (Hypervisor)"
Hyp1(Hypervisor on HW1);
Hyp2(Hypervisor on HW2);
end
subgraph "Virtual Machine Layer (VMs)"
VM1(VM 1 - OS + App A) --> Hyp1;
VM2(VM 2 - OS + App B) --> Hyp1;
VM3(VM 3 - OS + App C) --> Hyp2;
VM4(VM 4 - OS + App D) --> Hyp2;
end
subgraph "Management Platform (Optional)"
Mgmt(OpenStack/vSphere/Cloud Console) -- Manages --> Hyp1 & Hyp2;
Mgmt -- Creates/Manages --> VM1 & VM2 & VM3 & VM4;
end
User --> VM1;
User --> VM2;
User --> VM3;
User --> VM4;
What it brought: Significantly improved physical resource utilization; reduced resource delivery time from weeks to minutes; provided some flexibility and isolation.
Remaining Issues: VMs are still relatively heavy (each carries a full guest OS) and slow to boot; isolation is weaker than on dedicated physical machines; "VM sprawl" (uncontrolled growth in VM count) sets in; application packaging and environment-consistency problems persist.
Phase 2: The Containerization Wave (Docker + Kubernetes)
New Pursuit: Lighter, Faster, More Consistent Environments
- While VMs improved resource utilization, their overhead (full guest OS, slow boot, memory footprint) remains significant.
- Inconsistencies between development, testing, and production environments are still prominent, making application migration and deployment complex.
- Need lighter-weight isolation technology, faster deployment speed, and standardized application packaging and delivery methods.
❓ Architect's Thinking Moment: Is there a lighter, faster-booting isolation technology than VMs? How to standardize application packaging and runtime environments? How to manage so many containers?
(Docker arrived! Is container technology the answer? What's the difference between containers and VMs? How to orchestrate and schedule hundreds or thousands of containers?)
✅ Evolution Direction: Adopt Container Technology + Kubernetes Container Orchestration
- Containerize Applications (Docker):
- Use Docker to package applications and all their dependencies (libraries, runtime environments) into lightweight, portable Container Images.
- Containers share the host OS kernel, boot extremely fast (seconds or even milliseconds), and have much lower resource overhead than VMs.
- Ensures high consistency across development, testing, and production environments.
- Container Orchestration (Kubernetes):
- As container numbers grow, a powerful container orchestration system is needed to automate deployment, scaling, scheduling, service discovery, load balancing, fault recovery, etc.
- Kubernetes (K8s) has become the de facto industry standard. It provides a declarative API: users describe the desired application state, and K8s continuously reconciles the actual state to match it (a minimal sketch follows this list).
- Core components: API Server (control center), etcd (distributed key-value store for cluster state), kubelet (node agent managing containers), kube-proxy (network proxy), Controller Manager (maintains desired state), Scheduler (responsible for Pod scheduling).
- Container Network Solutions:
- K8s requires network plugins (CNI) for Pod-to-Pod communication. Common solutions include Flannel (simple, VXLAN overlay-based), Calico (good performance, BGP/IPIP-based), Cilium (eBPF-based, powerful features).
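A minimal sketch of the declarative model mentioned above, using the official Kubernetes Python client (`pip install kubernetes`); it assumes a reachable cluster and a local kubeconfig, and the names and image are placeholders.
```python
# Minimal sketch of Kubernetes' declarative API via the official Python client.
# Assumes a reachable cluster and local kubeconfig; names/image are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

# Desired state: 3 replicas of a containerized app behind the label app=app-a.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="app-a"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "app-a"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "app-a"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="app-a",
                    image="registry.example.com/app-a:1.0.0",  # placeholder image
                    ports=[client.V1ContainerPort(container_port=8080)],
                ),
            ]),
        ),
    ),
)

# Submit the desired state; Scheduler, Controller Manager, and kubelet
# reconcile reality to match it -- Pods are never placed by hand.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```
The key point is that we submit desired state rather than imperative commands; the control plane keeps reconciling toward it, which is what enables self-healing and automated scaling.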
Architecture Adjustment (K8s as Infrastructure Core):
graph TD
subgraph "K8s Control Plane"
APIServer(API Server);
Etcd(etcd);
Scheduler(Scheduler);
ControllerMgr(Controller Manager);
APIServer <--> Etcd;
APIServer <--> Scheduler;
APIServer <--> ControllerMgr;
end
subgraph "K8s Data Plane (Worker Nodes)"
Node1(Node 1);
Node2(Node 2);
NodeN(Node N);
Kubelet1(kubelet) --> Node1;
KubeProxy1(kube-proxy) --> Node1;
CRI1(Container Runtime Docker/containerd) --> Node1;
Kubelet2(kubelet) --> Node2;
KubeProxy2(kube-proxy) --> Node2;
CRI2(Container Runtime Docker/containerd) --> Node2;
KubeletN(kubelet) --> NodeN;
KubeProxyN(kube-proxy) --> NodeN;
CRIN(Container Runtime Docker/containerd) --> NodeN;
APIServer -- Manages --> Kubelet1 & KubeProxy1;
APIServer -- Manages --> Kubelet2 & KubeProxy2;
APIServer -- Manages --> KubeletN & KubeProxyN;
Scheduler -- Schedules Pod to --> Node1 & Node2 & NodeN;
Kubelet1 -- Creates/Manages --> PodA1(App A Pod 1);
Kubelet1 -- Creates/Manages --> PodB1(App B Pod 1);
Kubelet2 -- Creates/Manages --> PodA2(App A Pod 2);
KubeletN -- Creates/Manages --> PodC1(App C Pod 1);
%% Network Communication
PodA1 <--> PodB1;
PodA1 <--> PodA2;
end
User --> K8sService(K8s Service / Ingress);
K8sService --> PodA1 & PodA2;
K8sService --> PodB1;
K8sService --> PodC1;
Core Breakthroughs: Achieved rapid application deployment, elastic scaling, and automated operations; significantly improved resource utilization (often > 60%); built a standardized, portable application delivery platform.
New Challenges: In microservices architectures, managing inter-service communication (like traffic control, security, observability) becomes complex.
Phase 3: Complexity of Microservice Governance → Service Mesh
Challenge Emerges: The Pains of Microservices
- After splitting applications into numerous microservices, inter-service call relationships become extremely complex.
- How to implement service discovery? How to perform fine-grained traffic control (e.g., canary releases, A/B testing)? How to ensure secure inter-service communication? How to trace a request across multiple services?
- Coupling this governance logic into business code significantly increases development burden and complexity.
❓ Architect's Thinking Moment: How to decouple service governance capabilities from business code and sink them into the infrastructure layer?
(Can a proxy intercept all inter-service traffic and implement governance logic there? Is the Sidecar pattern feasible?)
✅ Evolution Direction: Introduce a Service Mesh (Istio / Linkerd)
A Service Mesh decouples service governance by deploying lightweight network proxies (Sidecar Proxies) alongside services:
- Data Plane:
- Typically uses high-performance proxies like Envoy or Linkerd-proxy as Sidecars, deployed in each business Pod.
- All traffic entering and leaving the business container is intercepted by the Sidecar.
- Sidecars are responsible for executing specific traffic rules, security policies, and collecting telemetry data.
- Control Plane:
- Examples: Istio's Istiod (which consolidates the earlier Pilot, Citadel, and Galley components) or the Linkerd control plane.
- Configuration Management: Gets configuration from users (e.g., K8s CRDs) and translates it into configurations understood by Sidecars.
- Service Discovery: Aggregates service information from platforms like K8s.
- Certificate Management: Dynamically distributes TLS certificates to Sidecars for secure mTLS communication.
- Policy Distribution: Dynamically pushes traffic routing rules, security policies, and telemetry configurations to data plane Sidecars.
- Core Capabilities:
- Traffic Management: Fine-grained routing rules (by header, path, or weight), request timeouts, retries, circuit breaking, and fault injection (see the canary-release sketch after this list).
- Security: Automatic mTLS (mutual TLS authentication and encryption) between Pods, identity-based authorization policies (AuthorizationPolicy).
- Observability: Automatically generates inter-service Metrics, distributed Tracing Spans, and access logs without modifying business code.
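A minimal canary-release sketch under these assumptions: the service `app-a` has v1/v2 subsets defined in a matching DestinationRule, and the Kubernetes Python client is used to create an Istio VirtualService custom resource; names and weights are illustrative.
```python
# Minimal sketch of mesh-level traffic management: send 90% of traffic to v1
# and 10% to v2 of a service via an Istio VirtualService. Assumes a matching
# DestinationRule defining the v1/v2 subsets already exists.
from kubernetes import client, config

config.load_kube_config()

virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "app-a-canary", "namespace": "default"},
    "spec": {
        "hosts": ["app-a"],          # the in-mesh service to route
        "http": [{
            "route": [
                {"destination": {"host": "app-a", "subset": "v1"}, "weight": 90},
                {"destination": {"host": "app-a", "subset": "v2"}, "weight": 10},
            ],
        }],
    },
}

# The control plane (Istiod) translates this rule into Envoy configuration and
# pushes it to the Sidecars over xDS -- no change to business code.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io",
    version="v1beta1",
    namespace="default",
    plural="virtualservices",
    body=virtual_service,
)
```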
Architecture Adjustment (Introducing Istio Service Mesh):
graph TD
subgraph "Istio Control Plane (Istiod)"
Pilot(Pilot - Config Distribution/Service Discovery);
Citadel(Citadel - Certificate Management);
Galley(Galley - Config Validation);
end
subgraph "K8s Worker Node"
subgraph "Pod A"
AppA(App A Container);
EnvoyA(Envoy Sidecar);
AppA <--> EnvoyA;
end
subgraph "Pod B"
AppB(App B Container);
EnvoyB(Envoy Sidecar);
AppB <--> EnvoyB;
end
EnvoyA <-->|xDS dynamic config| Pilot;
EnvoyB <-->|xDS dynamic config| Pilot;
EnvoyA <-->|certificates| Citadel;
EnvoyB <-->|certificates| Citadel;
%% Inter-service communication via Envoy
EnvoyA -- mTLS --> EnvoyB;
end
User --> IngressGateway(Istio Ingress Gateway - Envoy);
IngressGateway --> EnvoyA;
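The mTLS edge between the two Sidecars in the diagram is typically enforced by a mesh-wide policy. Below is a minimal sketch using an Istio PeerAuthentication resource in STRICT mode, assuming istio-system is the Istio root namespace.
```python
# Minimal sketch of enforcing mutual TLS mesh-wide with an Istio
# PeerAuthentication resource. With mode STRICT, Sidecars only accept
# mTLS traffic, so plaintext Pod-to-Pod calls are rejected.
from kubernetes import client, config

config.load_kube_config()

peer_auth = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": "istio-system"},  # root namespace assumed
    "spec": {"mtls": {"mode": "STRICT"}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="security.istio.io",
    version="v1beta1",
    namespace="istio-system",
    plural="peerauthentications",
    body=peer_auth,
)
```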
What it solved: Decoupled service governance from business logic; provided powerful traffic control, security, and observability capabilities.
Cost: Increased system complexity and resource overhead (each Pod needs a Sidecar).
Phase 4: Pursuing Ultimate Elasticity and Cost → Serverless Architecture
New Requirements: Pay-per-use, Zero Ops, Event-Driven
- For workloads with extreme fluctuations or infrequent execution (like image processing, data conversion, scheduled tasks), maintaining resident container instances might not be cost-effective.
- Desire to further reduce operational burden, allowing developers to focus solely on business logic without worrying about servers and capacity planning.
- Many scenarios are event-driven (e.g., processing triggered by file uploads to object storage).
❓ Architect's Thinking Moment: Can we pay only when code runs? Can we completely avoid managing servers?
(Is Function as a Service (FaaS) a direction? How does it differ from containers? How to address the cold start problem?)
✅ Evolution Direction: Adopt FaaS Platform or Serverless Containers
Serverless does not mean there are no servers; it means developers no longer manage the underlying server infrastructure:
- FaaS (Function as a Service):
- Developers write and upload business logic functions (e.g., AWS Lambda, Google Cloud Functions, Azure Functions, Alibaba Cloud Function Compute FC).
- The platform automatically runs function instances based on event triggers or HTTP requests.
- Billing is based on actual execution time and resource consumption; no cost when not running.
- The platform handles elastic scaling automatically.
- Challenges: Cold start latency (initial call requires loading code/environment), function execution duration limits, state management difficulties, vendor lock-in.
- Serverless Containers:
- Platforms (like Google Cloud Run, AWS Fargate, Azure Container Instances, Knative) allow deploying container images, but the platform manages on-demand instance startup, scaling, and billing based on actual usage.
- Combines the flexibility of containers with the elasticity and zero-ops benefits of Serverless.
Architecture Comparison (Serverless vs K8s):
Feature | Kubernetes Solution (Containers) | Serverless Solution (FaaS/Containers) | Explanation |
---|---|---|---|
Ops Complexity | High | Low | Serverless avoids managing OS, Runtime, Instances |
Scaling Speed | Minutes/Seconds | Seconds/Milliseconds (Potential Cold Start) | Serverless scales faster, but cold starts impact first call latency |
Billing Model | Pay per Node Resources | Pay per Execution/Request | Serverless better for fluctuating/infrequent loads |
Control Granularity | High | Low | K8s allows fine control over network, storage, etc. |
Vendor Lock-in | Relatively Low | Higher (Especially FaaS) | K8s ecosystem is more open |
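As a back-of-the-envelope illustration of the billing-model row, the sketch below compares an always-on instance with a pay-per-execution function for an infrequent workload. All prices are hypothetical placeholders, not current list prices.
```python
# Toy cost model for the "pay per execution" comparison above.
# All prices are illustrative placeholders, not real list prices.
ALWAYS_ON_INSTANCE_PER_HOUR = 0.05    # hypothetical small VM/node price
FAAS_PRICE_PER_GB_SECOND = 0.0000167  # hypothetical FaaS compute price
FAAS_PRICE_PER_MILLION_REQ = 0.20     # hypothetical per-request price

def monthly_faas_cost(requests: int, avg_ms: float, memory_gb: float) -> float:
    gb_seconds = requests * (avg_ms / 1000.0) * memory_gb
    return gb_seconds * FAAS_PRICE_PER_GB_SECOND + requests / 1e6 * FAAS_PRICE_PER_MILLION_REQ

always_on = ALWAYS_ON_INSTANCE_PER_HOUR * 24 * 30

# Infrequent workload: 100k requests/month, 200 ms each, 512 MB memory.
print(f"always-on: ${always_on:.2f}/month")
print(f"faas     : ${monthly_faas_cost(100_000, 200, 0.5):.2f}/month")
# With these assumed prices, FaaS is far cheaper at low volume; at sustained
# high volume the comparison flips, which is why the table says Serverless is
# better for fluctuating/infrequent loads.
```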
Architecture Example (Event-Driven FaaS):
graph TD
User -- Upload Image --> S3(Object Storage S3/OSS);
S3 -- Triggers Event --> EventBridge(Event Bus EventBridge/MNS);
EventBridge -- Triggers --> Lambda(Function Compute FaaS - Image Processing Function);
Lambda -- Stores Result --> DynamoDB(NoSQL DB DynamoDB/TableStore);
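A minimal sketch of the image-processing function in the diagram above, written as an AWS Lambda-style Python handler; the DynamoDB table name is a placeholder, and the exact event shape differs between cloud providers.
```python
# Minimal sketch of the event-driven image-processing function above.
# Bucket/table names are placeholders; event shape shown is the S3 format.
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("image-metadata")  # placeholder table

def handler(event, context):
    # The object-storage trigger delivers one or more records per event.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # "Processing" kept trivial here: look up the object size, store a row.
        head = s3.head_object(Bucket=bucket, Key=key)
        table.put_item(Item={
            "image_key": key,
            "bucket": bucket,
            "size_bytes": head["ContentLength"],
        })

    # The platform scales instances of this function with the event rate and
    # bills only for execution time -- no resident containers to manage.
    return {"processed": len(event["Records"])}
```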
Suitable for: event-driven tasks, API backends, scheduled jobs, and web services that are not highly latency-sensitive.
Phase 5: Need for Agility Across Clouds → Hybrid/Multi-Cloud Management & GitOps
Final Frontier: Managing Complexity Across Environments
- Enterprises often use a mix of private clouds (on-prem K8s) and multiple public clouds (AWS, Azure, GCP) to leverage specific services, avoid vendor lock-in, or meet data residency requirements (Hybrid Cloud / Multi-Cloud).
- How to manage applications and infrastructure consistently across these diverse environments?
- How to ensure security and compliance policies are uniformly applied?
- How to achieve efficient, declarative, and auditable application deployment and management?
❓ Architect's Thinking Moment: How to manage clusters and applications distributed across different clouds/IDCs? How to make deployment declarative and version-controlled?
(Need a unified control plane. Can K8s itself manage other K8s clusters? What about GitOps principles?)
✅ Evolution Direction: Unified Multi-Cluster Management + GitOps Workflow
- Multi-Cluster Management Platforms:
- Use platforms designed to manage multiple Kubernetes clusters across different environments.
- Examples: Red Hat ACM (Advanced Cluster Management), Rancher, Google Anthos, Azure Arc, Karmada (CNCF sandbox project).
- These platforms provide:
- Unified Cluster Provisioning & Lifecycle Management: Create, upgrade, delete clusters across clouds.
- Centralized Policy Management: Define and enforce security, compliance, and configuration policies across fleets of clusters.
- Application Placement & Deployment: Deploy applications to specific clusters based on policies or labels.
- Unified Observability: Aggregate monitoring and logging data from multiple clusters.
- GitOps Workflow:
- Adopt GitOps as the methodology for managing infrastructure and applications.
- Core Principle: Use Git as the single source of truth for the desired state of the entire system (infrastructure configuration, application deployments).
- Workflow:
- Developers/Ops push declarative configuration (K8s manifests, Helm charts, Kustomize overlays) to a Git repository.
- An automated agent (like Argo CD, Flux CD) running in the cluster(s) monitors the Git repository.
- When changes are detected in Git, the agent automatically applies them to the cluster(s), reconciling the actual state with the desired state defined in Git (a minimal Argo CD sketch follows this list).
- Benefits: Version control, auditability, automated and consistent deployments, easier rollbacks, improved security (declarative state, less direct cluster access).
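A minimal GitOps sketch under these assumptions: Argo CD is installed in the `argocd` namespace, and the Git repository URL and path are placeholders; Flux CD expresses the same idea with different CRDs.
```python
# Minimal GitOps sketch: register an application with Argo CD by creating its
# Application custom resource. Repo URL, path, and target namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "app-a", "namespace": "argocd"},
    "spec": {
        "project": "default",
        # Git is the single source of truth: the desired state lives here.
        "source": {
            "repoURL": "https://git.example.com/platform/config-repo.git",
            "targetRevision": "main",
            "path": "apps/app-a/overlays/prod",
        },
        # Where the agent reconciles that state to (in-cluster here).
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "app-a",
        },
        # Automated sync: drift is detected and corrected without manual kubectl.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argocd",
    plural="applications",
    body=application,
)
```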
Architecture Adjustment (Multi-Cloud Management + GitOps):
graph TD
subgraph "Developer/Ops Workflow"
Developer -- Pushes Code --> AppCodeRepo(App Code Git Repo);
AppCodeRepo -- Triggers CI --> CI_Pipeline(CI Pipeline Build & Push Image);
CI_Pipeline -- Pushes Image --> ContainerRegistry(Container Registry);
Developer -- Pushes Config --> ConfigRepo(Config Git Repo K8s Manifests/Helm);
end
subgraph "Management Layer"
MultiClusterManager(Multi-Cluster Mgmt Platform Anthos/Rancher/ACM) -- Manages --> K8s_OnPrem & K8s_AWS & K8s_Azure;
GitOpsAgent_OnPrem(GitOps Agent ArgoCD/Flux) -- Monitors --> ConfigRepo;
GitOpsAgent_AWS(GitOps Agent ArgoCD/Flux) -- Monitors --> ConfigRepo;
GitOpsAgent_Azure(GitOps Agent ArgoCD/Flux) -- Monitors --> ConfigRepo;
end
subgraph "Target Environments"
K8s_OnPrem("On-Prem K8s Cluster");
K8s_AWS("AWS EKS Cluster");
K8s_Azure("Azure AKS Cluster");
GitOpsAgent_OnPrem -- Applies Config --> K8s_OnPrem;
GitOpsAgent_AWS -- Applies Config --> K8s_AWS;
GitOpsAgent_Azure -- Applies Config --> K8s_Azure;
end
ContainerRegistry --> K8s_OnPrem & K8s_AWS & K8s_Azure;
Summary: The Evolutionary Path of Cloud-Native Infrastructure
Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
---|---|---|---|
0. Physical | Resource Util / Scalability / Ops | Physical Servers + IDC Hosting | Manual Ops, SSH, rsync |
1. Virtual | Resource Util / Flexibility | Server Virtualization + Mgmt Platform (IaaS) | KVM/VMware/Xen, OpenStack, EC2/ECS |
2. Container | Deployment Speed / Env Consistency | Containerization + Orchestration (K8s) | Docker, Kubernetes, Pods, Deployments, Services |
3. Service Mesh | Microservice Governance Complexity | Sidecar Proxy + Control Plane | Istio, Linkerd, Envoy, mTLS, Traffic Mgmt, Observability |
4. Serverless | Ops Burden / Cost / Event-Driven | FaaS / Serverless Containers | AWS Lambda, Cloud Functions, Cloud Run, Knative, Fargate |
5. Multi-Cloud | Consistency / Management / GitOps | Multi-Cluster Mgmt + GitOps Workflow | Anthos, Rancher, ACM, Karmada, Argo CD, Flux CD |