System Design Principles and Practices
Part One: Foundation – The Unshakeable Design Principles
Architecture design isn't arbitrary; it rests on fundamental principles. Mastering these principles is a prerequisite for making sound decisions.
I. Overview of Core Design Principles
Think of building a robust system as constructing a pyramid: reliability is the foundation, scalability and maintainability build on it, and high performance and security run throughout.
- Fundamental Principles Pyramid (Logical Relationship)

graph TD
A[Reliability] --> B[Scalability]
B --> C[Maintainability]
C --> D[High Performance]
D --> E[Security]
- Simultaneously, we need a more comprehensive framework to ensure no critical dimensions are missed:

(Supplementary) Multi-Dimensional Consideration Framework

graph TD
A[Functional Requirements] --> B(System's Non-Functional Requirements)
B --> C[Performance?]
B --> D[Reliable Enough?]
B --> E[Security Ensured?]
B --> F[Easy to Maintain/Iterate?]
B --> G[Cost Acceptable?]
II. Detailed Explanation of Principles: Deep Dive into Meaning
- Reliability: The Bedrock of System Stability
- Fault Domain Isolation: Key to preventing cascading failures. Partitioning the system into independent units (e.g., microservices, K8s AZ isolation) ensures one component's failure doesn't bring down the whole system.
- Redundancy Design: "Don't put all eggs in one basket." Eliminate single points of failure through data replicas (MySQL Master-Slave), service instance replicas (K8s ReplicaSet), etc.
- Automatic Failover: The system needs "self-healing" capability to automatically detect failures and switch to backup components (think Redis Sentinel's auto master-slave switch).
- Graceful Degradation: Sometimes you have to sacrifice to gain. Strategically disable non-core functions under extreme pressure or failure (e.g., temporarily disabling recommendations during a flash sale, see Course 3) to protect core business availability.
- Retry & Idempotency: Network jitters are common; transient failures should be automatically retried. But retries must not cause side effects. Critical operations like payments must ensure idempotency (the same request yields the same result no matter how many times it's retried).
- Circuit Breaker Pattern: The "fuse" preventing fault propagation. When a dependent service continuously fails, fail fast to avoid request piling up, giving the downstream a chance to recover (see Glossary, applied in Courses 4/11).
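The circuit-breaker "fuse" above fits in a few lines. A minimal Python sketch (the failure threshold, reset timeout, and `RuntimeError` signal are illustrative choices, not from any particular library):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: open after `max_failures`
    consecutive failures, fail fast while open, and allow one trial
    call after `reset_timeout` seconds (the half-open state)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Fail-fast is the point: while open, callers get an immediate error instead of piling up requests against a struggling dependency.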
- Scalability: The Ability to Gracefully Handle Growth
- Horizontal Scaling (Scale Out): Many hands make light work. Increase processing capacity by adding more machines or instances (add web servers, K8s HPA auto-scaling).
- Vertical Scaling (Scale Up): "Upgrade the equipment" of a single machine (add CPU, memory). Simple and direct, but has physical limits.
- Stateless Design: The "best friend" of horizontal scaling. Service instances don't store state (e.g., Session stored in Redis), allowing requests to be freely routed to any instance.
- Sharding Strategy: Cut the "big cake." Distribute data (database sharding) or load (Kafka Partition) across multiple units.
- Asynchronous Processing: "Let the bullets fly for a while." Place time-consuming non-core operations (like issuing coupons after a successful flash sale) into a message queue for asynchronous processing, freeing up main process resources and improving response speed (essence of Course 3).
- Caching: "Keep good things close at hand." Store hot data or computation results in high-speed media (memory like Redis) to drastically improve access speed (core optimization in Courses 1/3).
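Stateless design and sharding combine naturally: because routing depends only on a stable hash of the key, any instance can compute it. A small sketch (the shard names are placeholders, and md5 is used purely for illustration):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a stable hash. The same key always
    lands on the same shard, so any stateless instance can route it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical shard identifiers, standing in for real database DSNs.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_dsn(user_id: str) -> str:
    return SHARDS[shard_for(user_id, len(SHARDS))]
```

Note that simple modulo routing reshuffles most keys when `num_shards` changes; production systems often use consistent hashing for that reason.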
- Maintainability: Enabling the System to "Live Long and Be Modifiable"
- Observability: The system needs a "dashboard." Provide sufficient logs, metrics, tracing data to see the internal state clearly (detailed in Part Five of this appendix).
- Modularity & Decoupling: "High cohesion, low coupling" is the eternal pursuit. Split the system into independent modules or services for easier understanding, modification, and replacement (core concept of microservice architecture).
- Automated Operations: Delegate repetitive tasks to machines. Achieve automated deployment, monitoring, scaling through CI/CD, K8s, configuration management tools, etc.
- Documentation as Code: Make documentation "live." Use tools like OpenAPI, ADRs to keep architecture documents synchronized with code.
- Configuration Management: Separate volatile configurations from code, manage them centrally for easy modification (e.g., K8s ConfigMap, Apollo, Nacos).
- High Performance: Pursuing the Ultimate User Experience
- Latency is the Devil: Optimizing request response time is crucial. Methods include CDN acceleration, caching, SQL optimization, efficient serialization protocols (Protobuf), etc.
- Throughput is Key: Increase the system's processing capacity per unit time. Examples: Nginx's high concurrent processing capability, Kafka's high-throughput messaging.
- Be Economical: Optimize CPU, memory, network, disk IO consumption, e.g., using zero-copy techniques to reduce data duplication.
- (Supplementary) User Perception First: Technical metrics ultimately serve user experience. Focus on the "golden rules" of response time (e.g., 0.1s/1s/10s), use techniques like progressive loading, skeleton screens.
- Security: The Non-Negotiable Bottom Line
- Defense in Depth: Build security defenses at multiple layers (network, host, application, data), rather than relying on a single barrier.
- Principle of Least Privilege: "Grant no more privilege than necessary," only the minimum access required to complete a task.
- Secure by Default: Consider security during design; default configurations should be secure, not requiring manual hardening.
- Authentication & Authorization: First verify identity (Who are you? Authentication), then grant permissions (What can you do? Authorization). Common techniques: OAuth 2.0, JWT, RBAC.
- Data Encryption: Consider encryption for data in transit (TLS/SSL) and at rest.
- (Supplementary) Zero Trust Architecture: "Never trust, always verify." Break the traditional perimeter-based security model; verify identity and permissions for every access (detailed in Part Six of this appendix and Course 11).
- (Supplementary) Cost-Effectiveness: The Wisdom of Doing More with Less
- Architecture design isn't cost-agnostic tech showmanship. Optimize resource usage (use Spot instances wisely, design hot/cold data strategies, enhance automation) while meeting business needs and quality attributes.
Part Two: The Wisdom of Architecture Patterns and Pitfalls of Anti-Patterns
Architecture patterns are crystallizations of past experience, mature solutions to specific problems; anti-patterns are common "traps." Recognizing them helps us avoid detours.
I. Common Architecture Patterns: Handy Tools in the Toolbox
- Data Layer Patterns
- Read-Write Splitting: A weapon for read-intensive scenarios (Courses 1/9).
- Data Sharding: The ultimate means to break through single-database capacity and performance bottlenecks (Courses 1/3/8/9).
- Caching Strategies: How to use cache efficiently? Cache-Aside, Read-Through, Write-Through, Write-Back each have merits (see Glossary and Courses 1/3).
- Replication Strategies: Cornerstone for data availability and consistency (Master-Slave/Multi-Master/Quorum, see Courses 1/8/9).
- Data Synchronization: How to keep data replicas consistent? CDC technology is a common solution (Course 9).
- Cold/Hot Separation: An effective means to reduce storage costs (Courses 1/5/6).
- CQRS: When read and write needs differ vastly, separate them completely (Course 3).
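Of the caching strategies listed above, Cache-Aside is the one applications implement themselves. A runnable sketch, with plain dicts standing in for Redis and the database:

```python
class CacheAside:
    """Cache-Aside sketch: read from cache first, fall back to the
    database on a miss and populate the cache; writes go to the
    database and invalidate the cached entry so the next read refills
    it. `cache` and `db` are dicts standing in for Redis and MySQL."""

    def __init__(self):
        self.cache = {}
        self.db = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.db.get(key)     # cache miss: go to the database
        if value is not None:
            self.cache[key] = value  # populate for later reads
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)    # invalidate rather than update
```

Invalidating on write (instead of updating the cache in place) sidesteps a class of race conditions between concurrent writers, at the cost of one extra miss.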
- Application/Service Layer Patterns
- Monolithic Architecture: Simple and direct, the starting point (Course 1 Phase 0).
- Layered Architecture: Basic pattern for separating responsibilities (Presentation/Business Logic/Data Access Layer).
- Microservice Architecture: Modern choice for complex businesses and growing teams (Course 1 Phase 4).
- Event-Driven Architecture (EDA): Decouple services via asynchronous events, enhancing system resilience (uses Kafka/RabbitMQ, see Course 3).
- Serverless: Let developers focus on business logic, delegating server management to the cloud platform (Course 11).
- API Gateway: The "gatekeeper" for microservices, handling routing, authentication, rate limiting, etc. (Course 1).
- Service Mesh: Sidecar pattern, elegantly handling the complexity of inter-service communication (Course 11).
- BFF (Backend For Frontend): Tailor backend API aggregation layers for different frontends (Web/Mobile).
- Strangler Fig Pattern: A practical strategy for smoothly migrating from monolith to microservices.
- Messaging/Communication Patterns
- Synchronous Call: Simple and direct, but carries coupling and blocking risks (RPC/REST).
- Asynchronous Message: A tool for decoupling, peak shaving (Message Queue, Publish/Subscribe).
- Long Connection: Choice for real-time communication (WebSocket/SSE, see Courses 2/7).
- Fault Tolerance Patterns
- Timeout Control: Avoid indefinite waiting.
- Retry: Handle transient failures.
- Circuit Breaking: Prevent snowball effects.
- Degradation: Degraded service to ensure core availability.
- Rate Limiting: Prevent the system from being overwhelmed by traffic (Token Bucket/Leaky Bucket, see Courses 2/3).
- Bulkhead Isolation: Resource isolation to prevent fault propagation.
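The Token Bucket algorithm mentioned under rate limiting is compact enough to sketch directly (parameter names are illustrative, not from a specific library):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter sketch: tokens refill at `rate` per
    second up to `capacity`; a request is admitted only if a token is
    available, so short bursts pass but sustained overload is shed."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazily refill tokens for the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A Leaky Bucket differs mainly in smoothing output to a constant rate; Token Bucket permits bursts up to `capacity`, which is usually what APIs want.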
II. Anti-Pattern Warnings: The Pits We've Fallen Into
Identifying and avoiding anti-patterns is as important as learning positive patterns.
- Common Distributed System Anti-Patterns
- Single Point of Failure: The system's "Achilles' heel." Improvement: Redundancy design, automatic failover.
- Timeout Cascade (Snowball Effect): A small problem triggering a domino effect. Improvement: Reasonable timeouts, circuit breaking, degradation.
- Split Brain: Cluster "schizophrenia," multiple master nodes emerge. Improvement: Quorum mechanism, Fencing.
- Inappropriate Consistency Choice: Sacrificing performance/availability for unnecessary strong consistency. Improvement: Choose consistency model based on need, embrace eventual consistency + compensation.
- Overly Optimistic Concurrency Control: Missing locks or version control leading to data corruption. Improvement: Use optimistic/pessimistic locking appropriately.
- Ignoring Network Unreliability: Treating distributed calls like local calls. Improvement: Design for fault tolerance, consider asynchronous communication.
- Common Microservice Anti-Patterns
- Distributed Monolith: A "tightly coupled monster" disguised as microservices. Improvement: DDD guiding service splitting, asynchronous decoupling.
- God Service: One service doing everything, hard to maintain and scale. Improvement: Further splitting.
- Shared Database: Microservices in name only, coupled at the data layer. Improvement: Each service owns its database, interact via API or events.
- Cyclic Dependencies: Intertwined dependencies, hard to untangle. Improvement: Redesign dependencies, introduce intermediate layers or events.
- Siloed Architecture: Teams reinventing the wheel, lack of standardization. Improvement: Platformization, promoting common components.
- Data Processing Anti-Patterns
- N+1 Query: Simple but brutal loop database query, a performance killer. Improvement: Batch query, JOIN query.
- Data Swamp: Data lake lacking governance, turning into a "garbage dump." Improvement: Data governance, metadata management, data lineage.
- Excessive Logging: Recording too much useless info, wasting resources and hindering troubleshooting. Improvement: Structured logging, levels, sampling.
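The N+1 query anti-pattern and its batch fix are easiest to see side by side. A runnable sqlite3 sketch (the schema and rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25);
""")

def orders_n_plus_one():
    """Anti-pattern: one query for users, then one query per user."""
    result = {}
    for (user_id,) in conn.execute("SELECT id FROM users"):
        rows = conn.execute(
            "SELECT amount FROM orders WHERE user_id = ?", (user_id,))
        result[user_id] = [amount for (amount,) in rows]
    return result

def orders_batched():
    """Fix: a single JOIN fetches everything in one round trip."""
    result = {}
    rows = conn.execute(
        "SELECT u.id, o.amount FROM users u JOIN orders o ON o.user_id = u.id")
    for user_id, amount in rows:
        result.setdefault(user_id, []).append(amount)
    return result
```

With N users the first version issues N+1 queries; the second always issues one, regardless of N.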
- Common Frontend Anti-Patterns
- Mega Component: "Spaghetti code" frontend. Improvement: Component splitting, logic extraction.
- State Chaos: State management like a "tangled mess." Improvement: Use state management libraries appropriately.
- Inefficient Rendering: Unnecessary updates slowing down the page. Improvement: Memoization, virtual lists.
- General Anti-Patterns
- Over-Abstraction/Design: Designing for design's sake, complicating problems. Improvement: YAGNI, KISS principles.
- Magic Configuration: Obscure configurations relying on hidden rules. Improvement: Explicit configuration, validation.
- Reinventing the Wheel: Insisting on writing something that already exists. Improvement: Embrace open source and standard libraries.
- Vendor Lock-in: Being "held hostage" by a specific vendor. Improvement: Embrace open standards, design adapter layers.
Part Three: Architecture Design Process and Methodologies
Architecture design isn't purely technical; it's a process of understanding business, analyzing requirements, weighing trade-offs, making decisions, and continuous evolution.
I. General Iterative Architecture Design Process
Just as software development needs a process, architecture design has its methods. A typical iterative flow:
graph TD
A["1 Understand Requirements & Constraints<br>(Clarify goals & boundaries)"] --> B["2 Analyze & Evaluate Options<br>(Propose solutions, weigh pros/cons)"]
B --> C["3 Define Architecture & Design<br>(Make decisions, detail blueprint)"]
C --> D["4 Document Architecture<br>(Record decisions, build consensus)"]
D --> E["5 Implement & Validate<br>(Execute, verify effectiveness)"]
E --> F["6 Monitor & Feedback<br>(Observe continuously, collect data)"]
F -- Issues/New Needs --> A
F -- Running Well --> G["Complete Current Iteration / Prepare Next Phase"]
- Understand Requirements & Constraints
- Functional Requirements: What does the system need to "do"? (User registration? Order placement? Content publishing?) – This is the foundation.
- Non-Functional Requirements (Quality Attributes): How "well" should the system do it? (Required QPS/Latency? Availability (nines)? Security needs? Future scalability expectations?) – This determines architectural complexity.
- Constraints: What are the "shackles"? (Budget? Timeline? Tech stack limitations? Team skills? Legal/compliance requirements?) – The harsh reality.
- Key: Communicate thoroughly with Product, Business, Ops, Security, etc., identify core needs, prioritize. Don't design in isolation.
- Analyze & Evaluate Options
- For critical architecture decision points (e.g., which database? how to split services? sync vs. async?), don't settle for the first idea; propose at least a few candidates.
- Systematically evaluate and trade-off these options across multiple dimensions (reliability, performance, cost, maintainability, development efficiency, team familiarity, etc.). No perfect solution, only the best fit for the current context.
- When necessary, conduct Proof of Concept (PoC) or performance tests to speak with data and reduce guesswork.
- Define Architecture & Design
- Based on evaluation, choose the most suitable architecture and proceed with detailed design.
- Drawing clear architecture diagrams is key for communication and understanding. Use different views like C4 model, system context diagrams, deployment diagrams.
- Define core components and responsibilities, key interface definitions, data models, core interaction flows.
- Document Architecture
- "A good memory is no match for a worn pen." Recording important architecture decisions and their rationale is crucial. ADR (Architecture Decision Records) is an excellent practice.
- Maintain up-to-date architecture diagrams and design documents, ensuring they reflect the system's actual state.
- Key: Documentation should serve communication and maintenance; aim for brevity, clarity, ease of understanding, avoid lengthy prose.
- Implement & Validate
- Architecture isn't built in the air; it needs implementation by the development team.
- Architects need deep involvement in implementation, e.g., ensuring code aligns with design via Code Reviews.
- Verify if the architecture truly meets functional and non-functional requirements through unit tests, integration tests, performance tests, security tests, etc.
- Monitor & Feedback
- System launch isn't the end, but a new beginning. Continuously monitor system status (logs, metrics, traces are essential).
- Collect user feedback and business operational data.
- Evaluate the actual effectiveness of the architecture based on monitoring data and feedback, identify potential issues or new optimization opportunities, which in turn drives the next iteration.
II. Common Architecture Design Methodologies
- Attribute-Driven Design (ADD): Quality attributes (like performance, reliability) are the primary design drivers. First, define the most important quality goals, then select architectural tactics and patterns that satisfy them.
- Domain-Driven Design (DDD): Guides service splitting and model design by deeply understanding the business domain, defining Bounded Contexts, and establishing a Ubiquitous Language. Especially suitable for complex business systems.
- C4 Model: A simple and effective method for visualizing software architecture through four levels: System Context, Containers, Components, and Code, showing the architecture from high-level to detailed views.
- Architecture Tradeoff Analysis Method (ATAM): A systematic method for evaluating whether an architecture design meets quality attribute goals, focusing on identifying risks and trade-off points.
- Evolutionary Architecture: Emphasizes that architecture should be adaptable, supporting incremental changes and continuous evolution rather than aiming for a one-time perfect design.
III. Key Architect Mindsets
Becoming a great architect involves more than just technology; mindset is crucial.
- Trade-off Thinking: This is the most important architect mindset. Nearly all architecture decisions are trade-offs; there's no absolute right or wrong, only pros and cons. Understanding and clearly articulating these trade-offs is a core competency.
- Abstraction Thinking: Grasping the essence of a problem, ignoring unimportant details, and applying reasonable abstraction and layering are key to managing complexity.
- Critical Thinking: Not blindly following tech trends, able to think independently, question assumptions, and deeply analyze the pros, cons, and applicability of solutions.
- Evolutionary Thinking: Understanding that systems grow and evolve gradually. Design considering future scalability and maintainability, leaving room for change.
- Communication & Collaboration: Architects don't fight alone. Need to clearly articulate designs to different roles (tech team, product managers, management) and collaborate effectively to drive implementation.
Part Four: Full-Stack Design Considerations (Mainly from Appendix Seven)
I. Frontend-Backend Collaboration Patterns
- Web Architecture Evolution

timeline
    title Web Architecture Evolution Path
    ~2005 : Server-Side Rendering (SSR - PHP/JSP/ASP)
    ~2010 : AJAX Asynchronous Interaction (jQuery)
    ~2015 : Frontend-Backend Separation + Single Page Application (SPA - React/Vue/Angular)
    ~2018 : Isomorphic Rendering/Server Components (SSR+SPA - Next.js/Nuxt.js)
    ~2022 : Edge Rendering (Edge SSR) + Islands Architecture
- API Design Specifications

| Factor | REST Best Practice | GraphQL Approach | gRPC Implementation |
|---|---|---|---|
| Style | Resource-oriented, HTTP Verb+Noun | Data graph-oriented, Single Endpoint | Service/Method-oriented, Strong Contract |
| Data Fetching | Multiple Endpoints, potential over/under-fetching | Client declares exact data needs | Server defines return structure |
| Protocol | HTTP/1.1, HTTP/2 | HTTP (usually POST) | HTTP/2 (mandatory) |
| Data Format | JSON (mainstream), XML | JSON | Protocol Buffers (binary) |
| Versioning | URL Path (/v1), Header, Accept | Schema Versioning (often avoided) | Proto file version/Service version |
| Error Handling | HTTP Status Code + JSON Error Body | `errors` field in response | gRPC Status Code + Error Details (opt.) |
| Docs & Tools | OpenAPI (Swagger) | GraphiQL, Apollo Studio | Protocol Buffer Definition, gRPCurl |
| Use Cases | CRUD ops, Public APIs, Simple interactions | Flexible data needs for frontend, Mobile | Internal microservices, Performance-sensitive |
- Performance Optimization Matrix

| Optimization Area | Frontend Strategies | Backend Strategies | Full-Stack/Network Strategies |
|---|---|---|---|
| Load Perf. | Code Splitting, Tree Shaking, Lazy Loading, Prefetch/Preload, Resource Compression (Minify/Gzip/Brotli) | API Aggregation (BFF), Reduce Redirects, HTTP/2 Server Push (use cautiously) | CDN for static assets, HTTP/2 or HTTP/3 |
| Render Perf. | Virtual DOM diff opt., Virtual List, Time Slicing, SSR/SSG/ISR, Memoization, Debounce/Throttle | Streaming SSR | Islands Architecture, Edge Rendering |
| Data Interact. | Client Cache (LocalStorage/SWR/React Query), Req. Throttling/Debounce, Batch Requests, Deferred Req. | Query Opt. (Index, Avoid N+1), Cache (Redis), Async Processing | WebSocket/SSE, GraphQL, Protobuf, gRPC |
| Image Opt. | Responsive Images (srcset), WebP/AVIF format, Image Lazy Loading, Image CDN | - | - |
II. Data Processing System Design
- Data Pipeline Architecture

graph LR
    DataSource("DB/Log/API/...") --> Ingestion("Flume/Filebeat/Logstash/CDC")
    Ingestion --> Buffer("Kafka/Pulsar")
    Buffer --> Storage("HDFS/S3/Data Lake/DB")
    Buffer --> RealtimeProcessing("Flink/Spark Streaming")
    Storage --> BatchProcessing("Spark/Hive/MapReduce")
    RealtimeProcessing --> ServiceApp("API/Dashboard/Alert")
    BatchProcessing --> DW("Data Warehouse/Data Mart (DW/DM)")
    DW --> ServiceApp
- Storage Selection Considerations (Combines Appendices Five and Seven)
- Data Model: Relational (MySQL/PostgreSQL), Document (MongoDB), Key-Value (Redis/etcd), Columnar (ClickHouse/HBase), Graph (Neo4j/JanusGraph), Time-Series (InfluxDB/TDengine), Search (Elasticsearch).
- Consistency Req.: Strong (Financial Tx) vs. Eventual (Social Feed).
- Read/Write Load: Read-heavy (Cache + R/W Split) vs. Write-heavy (LSM Tree DB).
- Data Volume & Scalability: Single node vs. Distributed (Sharding).
- Query Complexity: Simple KV vs. Complex SQL vs. Full-text Search vs. Graph Traversal.
- Latency Req.: Milliseconds (Memory Cache) vs. Seconds (SSD DB) vs. Minutes (Offline Analysis).
- Cost & Ops Complexity.
- Real-time Communication Pattern Comparison

| Method | Connection Type | Server Push | Latency | Efficiency/Overhead | Typical Scenarios |
|---|---|---|---|---|---|
| Polling | Short (HTTP) | No | Secs-Mins | Low Eff., High Ovh. | Simple notifications, Compatibility |
| Long Polling | Long (HTTP) | Pseudo | Seconds | Medium | Upgrading traditional notifications |
| SSE | Long (HTTP) | Yes (One-way) | Seconds | Relatively High Eff. | Server-to-client push (e.g., news) |
| WebSocket | Long (Dedicated) | Yes (Bi-dir.) | Milliseconds | Highest Eff. | Real-time chat, collab., games |
Part Five: Reliability and Fault Tolerance Design (Mainly from Appendix Six)
I. Fault Prevention and Detection
- Fault Domain Partitioning: AZ (Availability Zone), Region, Rack, Server, Process.
- Heartbeat: Periodically send signals to confirm node liveness.
- Health Check: More in-depth checks if a service can process requests normally.
- Monitoring & Alerting: Real-time monitoring of key metrics (RED, USE), set reasonable thresholds, detect anomalies promptly.
- Checksum/CRC: Data validation to prevent corruption during storage or transmission.
- Gray Release/Canary Release: Validate changes in a small scope, reducing fault impact.
- Feature Flag: Quickly disable problematic features.
- Unit/Integration/E2E Tests: Detect code logic errors early.
- Chaos Engineering: Proactively inject faults to test system resilience (Refer to Part Nine of this appendix).
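The heartbeat/health-check distinction above can be made concrete: liveness only proves the process answers, while readiness also probes dependencies. A small sketch (the probe names are hypothetical):

```python
class HealthCheck:
    """Sketch of liveness vs. readiness: `live()` only proves the
    process is up; `ready()` runs dependency probes (e.g., "db",
    "cache" -- hypothetical names) to check the service can actually
    serve requests."""

    def __init__(self, probes):
        self.probes = probes  # name -> zero-arg callable returning bool

    def live(self) -> bool:
        # If the process can answer at all, it is alive.
        return True

    def ready(self) -> dict:
        results = {name: bool(probe()) for name, probe in self.probes.items()}
        results["ready"] = all(results.values())
        return results
```

An orchestrator like K8s restarts a pod on a failed liveness probe but merely stops routing traffic to it on a failed readiness probe, which is exactly the behavior this split enables.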
II. Fault Recovery Strategies
- Prioritize Automatic Recovery: Design systems to automatically recover from common faults.
- Automatic Retry: For transient network or service jitters.
- Automatic Failover: Automatically switch to standby node upon master failure (e.g., DB master-slave switch).
- Automatic Scaling: Adjust resources based on load (K8s HPA).
- Automatic Restart: Automatically bring up Pods/processes after OOM or abnormal exit.
- Fail Fast: Check dependencies and configurations on service startup; fail quickly if unmet, avoiding running with issues.
- Graceful Degradation: Sacrifice partial functionality to ensure core availability under high pressure or dependency failure.
- Compensation Mechanism: For eventual consistency scenarios, execute compensation operations to roll back completed steps if a step fails (Saga Pattern).
- Manual Intervention Plan (Playbook): Define detailed manual procedures and plans for severe failures that cannot be automatically recovered.
- Backup & Restore: Regularly back up critical data, develop and practice data recovery plans (RPO/RTO).
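Automatic retry for transient faults is usually paired with exponential backoff plus jitter, so retries don't stampede a recovering dependency. A minimal sketch (parameter names are illustrative; only safe for idempotent operations):

```python
import random
import time

def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry sketch for transient faults: exponential backoff with
    full jitter. Re-raises the last error once attempts are exhausted
    so the caller can degrade or alert."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up: let the caller handle it
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

The jitter matters: if every client backs off by the same deterministic delay, they all retry at the same instant and recreate the overload.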
III. Fault Tolerance Design Patterns Summary (Refer to Part Two Architecture Patterns)
Part Six: Observability System Design (Mainly from Appendix Seven)
I. Three Pillars of Observability
- Logging
- Goal: Record discrete events for troubleshooting and auditing.
- Best Practices: Structured logs (JSON format), include context (Trace ID, User ID), appropriate log levels, avoid logging sensitive info, log rotation and archiving.
- Tool Stack: ELK (Elasticsearch, Logstash, Kibana), Loki + Grafana, Splunk.
- Metrics
- Goal: Aggregated numerical data reflecting system state and trends, used for monitoring and alerting.
- Best Practices: Focus on key metrics (RED, USE, Four Golden Signals), multi-dimensional labels, reasonable sampling rate and retention period, separate monitoring and alerting.
- Tool Stack: Prometheus + Grafana + Alertmanager, InfluxDB + Grafana, Datadog.
- Tracing
- Goal: Record the complete call chain of a single request across multiple services for performance analysis and bottleneck identification.
- Best Practices: Unified Trace ID propagation (W3C Trace Context), record Span info at critical nodes, appropriate sampling strategy (Head-based vs. Tail-based).
- Tool Stack: Jaeger, Zipkin, SkyWalking, OpenTelemetry.
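The logging best practices above (structured JSON, context such as a Trace ID) can be sketched with the stdlib `logging` module; the field names here are illustrative, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Structured-log sketch: one JSON object per line, carrying the
    trace_id so log lines can be correlated with traces and metrics."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# extra={} attaches context fields; never put secrets or PII here.
logger.info("order created", extra={"trace_id": "abc123"})
```

Because each line is a self-contained JSON object, systems like Loki or Elasticsearch can index the fields directly instead of regex-parsing free text.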
II. End-to-End Monitoring Solution
graph LR
A[User Request] --> B[Frontend App/SDK]
B --> C[Log Collection]
B --> D[Metric Collection]
B --> E[Trace Collection]
C --> F["Log System (Loki/ES)"]
D --> G["Metric System (Prometheus/InfluxDB)"]
E --> H["Trace System (Jaeger/Zipkin)"]
F --> I["Visualization/Alerting (Grafana)"]
G --> I
H --> I
J[Backend Service 1] --> C
J --> D
J --> E
K[Backend Service 2] --> C
K --> D
K --> E
L[Infrastructure] --> C
L --> D
- Correlation Analysis: Correlate logs, metrics, and traces in visualization platforms (like Grafana) to allow clicking on metric charts to jump to related logs or trace details.
III. Key Metrics System
- Business Metrics: Registered users, DAU/MAU, Order volume, GMV, Conversion rate, etc.
- System Metrics (USE Method):
- Utilization: Resource busyness (e.g., CPU Usage, Network Bandwidth Utilization).
- Saturation: Degree of resource bottleneck (e.g., CPU Queue Length, Disk IO Wait).
- Errors: Number or rate of system errors (e.g., HTTP 5xx Error Rate, Log Error Count).
- Service Metrics (RED Method):
- Rate: QPS/RPS.
- Errors: Failed request rate.
- Duration: P50, P90, P99 response time.
- Google SRE Four Golden Signals: Latency, Traffic, Errors, Saturation.
- Specific Component Metrics (Refer to Appendix Seven):
- Database: Query latency, Connections, Cache hit rate, Master-slave lag.
- Message Queue: Message backlog, Produce/Consume latency.
- Cache: Hit rate, Memory usage, Connections.
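The P50/P90/P99 durations in the RED method are percentiles over raw latency samples. A nearest-rank sketch (real systems aggregate with histograms or HDR sketches rather than storing every sample):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p% of samples are <= it. Good enough for a quick P50/P99
    readout over a window of raw latencies."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latency window (ms): mostly fast, with two slow outliers.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 11, 900]
```

Note how P99 (here dominated by the 900 ms outlier) tells a very different story from the median; averages hide exactly this tail.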
Part Seven: Security Design Principles and Practices (Mainly from Appendices Six, Seven)
I. Defense in Depth Model
graph BT
    DataSec["Data Security"] --> StorageEnc["Storage Encryption (TDE/App Layer Encryption)"]
    DataSec --> TransportEnc["Transport Encryption (TLS/mTLS)"]
    DataSec --> DLP["Data Masking/Data Loss Prevention (DLP)"]
    AppSec["Application Security"] --> InputVal["Input Validation (Anti-Injection/XSS)"]
    AppSec --> AuthNZ["Authentication & Authorization (RBAC/OAuth)"]
    AppSec --> SCA["Dependency Scanning (SCA)"]
    HostSec["Host Security"] --> Hardening["OS Hardening/Patch Management"]
    HostSec --> HIDS["Intrusion Detection/Prevention (HIDS/HIPS)"]
    HostSec --> VulnScan["Vulnerability Scanning"]
    NetSec["Network Security"] --> WAF["Firewall/WAF (Web App Firewall)"]
    NetSec --> DDoS["DDoS Protection"]
    NetSec --> VPC["Network Isolation (VPC/Security Group)"]
    PhysSec["Physical Security"] --> DCAccess["Data Center Access Control"]
    MgmtSec["Management Security"] --> Awareness["Security Awareness Training/Audit Logs/Incident Response"]
II. Authentication and Authorization
- Authentication: Who are you?
- Password Auth (Salted Hash Storage).
- Multi-Factor Authentication (MFA).
- Single Sign-On (SSO): SAML, OpenID Connect.
- API Auth: API Key, OAuth 2.0, JWT.
- Authorization: What can you do?
- Access Control List (ACL).
- Role-Based Access Control (RBAC).
- Attribute-Based Access Control (ABAC).
- OAuth 2.0 Authorization.
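"Salted Hash Storage" for passwords can be sketched with the stdlib's scrypt KDF; the cost parameters here are illustrative, and production systems typically reach for a maintained library (bcrypt/argon2) instead:

```python
import hashlib
import hmac
import secrets

def hash_password(password: str):
    """Salted-hash sketch: a fresh random salt per user means
    identical passwords produce different hashes, defeating rainbow
    tables; scrypt's cost parameters slow brute force."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)  # constant-time compare
```

The constant-time comparison matters too: a naive `==` can leak how many leading bytes matched through response timing.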
III. Common Web Security Threats & Defenses (OWASP TOP 10 Core)
| Risk | Description | Defense Measures |
|---|---|---|
Injection | SQL Inj., OS Command Inj., LDAP Inj., etc. | Parameterized Queries/Prepared Stmts, ORM, Input Validation/Filtering, Least Privilege |
Broken Auth. | Credential Stuffing, Session Fixation, Weak Passwords | Strong Password Policy, MFA, Secure Session Mgmt (HTTPS Only, HttpOnly, Secure flag), Credential Rotation |
Sensitive Data Exp. | Unencrypted Transit/Storage, Log Exposure | TLS, Storage Encryption, Data Masking, Avoid exposing sensitive info in logs/URLs |
XML External Entities (XXE) | Processing malicious XML input -> info leak/DoS | Disable external entity parsing, Use secure XML parsers |
Broken Access Ctrl. | Privilege Escalation (Horizontal/Vertical) | RBAC, Check permissions on every access, Avoid exposing IDs in URLs |
Security Misconfig. | Default Credentials, Unnecessary Services, Verbose Errors | Secure Baseline Config, Automated Checks, Minimization Principle, Custom Error Pages |
Cross-Site Scripting (XSS) | Executing malicious scripts in user's browser | Input Validation & Output Encoding (HTML Escape), Content Security Policy (CSP), HttpOnly Cookie |
Insecure Deserialization | Deserializing untrusted data -> RCE | Avoid deserializing untrusted data, Use secure libs, Sign or encrypt serialized data |
Using Components with Known Vulns. | Third-party libraries have security holes | Regularly scan dependencies (SCA), Patch promptly, Use Software Bill of Materials (SBOM) |
Insufficient Logging & Monitoring | Cannot detect/respond to incidents timely | Log critical security events (login, perm change, key ops), Centralized logging, Real-time monitoring & alerting |
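The injection row's fix, parameterized queries, is easiest to appreciate next to the vulnerable version. A runnable sqlite3 sketch (table and payload invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 1), ('bob', 0)")

def find_user_unsafe(name: str):
    # Anti-pattern: string interpolation lets input rewrite the query.
    return conn.execute(
        f"SELECT name FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats input strictly as data.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)).fetchall()

# A classic injection payload: turns the WHERE clause into a tautology.
payload = "x' OR '1'='1"
```

The unsafe version returns every row for this payload; the parameterized version returns none, because the whole string is matched literally against `name`.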
IV. Zero Trust Architecture (ZTA) (From Appendices Six/Eleven)
- Core Concept: "Never Trust, Always Verify". Breaks traditional perimeter-based security.
- Key Principles:
- Identity Centric: Strong identity verification for all users and devices.
- Micro-segmentation: Divide network into small zones, strictly control east-west traffic.
- Least Privilege Access: Dynamically authorize based on identity, device state, context.
- Continuous Monitoring & Verification: Real-time monitoring of user behavior and device state, dynamically adjust trust levels.
- Implementation Components: Identity Provider (IdP), Policy Enforcement Point (PEP), Policy Decision Point (PDP), Access Proxy, Device Health Check.
Part Eight: Evolutionary Architecture and Evaluation (Mainly from Appendices Six, Seven)
I. Evolutionary Architecture Practices
- Understand Business Drivers: Architecture evolution should be driven by business needs (e.g., user growth, new features, globalization) or technical pain points (e.g., performance bottlenecks, maintenance difficulties).
- Small Steps, Continuous Iteration: Avoid large, one-off refactoring; adopt incremental evolution.
- Strangler Fig Pattern: Gradually migrate functionality from the old system to the new, eventually "strangling" the old one.
- Reversibility: Consider rollback plans when designing changes.
- Adaptability: Architecture should possess flexibility to adapt to future changes.
- Standardization & Platformization: Build reusable platform capabilities and standardized components to accelerate evolution.
- Evolution Case References:
- Course 1: Blog platform from monolith to global multi-active.
- Course 2: Social network from simple feed to graph DB and real-time push.
- Course 3: Flash sale system from monolith to async, sharding, elastic scaling.
- Course 7: Recommendation system from static to real-time personalized deep learning.
- Course 11: Infrastructure from physical machines to Serverless and Zero Trust.
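The Strangler Fig pattern above can be sketched as a routing facade that moves one endpoint at a time from the old system to the new; handler names and paths are hypothetical:

```python
# Strangler Fig: a routing facade sends each path to the new system once it has
# been migrated, and falls back to the legacy system otherwise (illustrative).

def legacy_handler(path: str) -> str:
    return f"legacy:{path}"

def new_handler(path: str) -> str:
    return f"new:{path}"

class StranglerFacade:
    def __init__(self):
        self.migrated = set()  # paths already served by the new system

    def migrate(self, path: str) -> None:
        """Incremental evolution: cut over one endpoint at a time."""
        self.migrated.add(path)

    def route(self, path: str) -> str:
        handler = new_handler if path in self.migrated else legacy_handler
        return handler(path)

facade = StranglerFacade()
facade.route("/orders")    # -> "legacy:/orders"
facade.migrate("/orders")
facade.route("/orders")    # -> "new:/orders"
```

Reversibility comes almost for free: removing a path from `migrated` rolls that endpoint back to the legacy system.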
II. Architecture Evaluation Methodologies
- Architecture Decision Record (ADR)
- Purpose: Record important architectural decisions, their context, alternatives, rationale, and consequences.
- Value: Knowledge sharing, onboarding new members, avoiding redundant discussions, traceability.
- Template Elements (Ref Appendix Six): Title, Status (Proposed/Accepted/Deprecated), Context, Decision Drivers, Considered Options, Decision Outcome, Consequences.
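An ADR with these template elements can be kept as structured data and rendered to markdown; the field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ADR:
    """Minimal Architecture Decision Record with the template elements above."""
    title: str
    status: str                                  # Proposed / Accepted / Deprecated
    context: str
    considered_options: list = field(default_factory=list)
    decision: str = ""
    consequences: str = ""

    def to_markdown(self) -> str:
        options = "\n".join(f"- {o}" for o in self.considered_options)
        return (
            f"# {self.title}\n\n"
            f"**Status:** {self.status}\n\n"
            f"## Context\n{self.context}\n\n"
            f"## Considered Options\n{options}\n\n"
            f"## Decision\n{self.decision}\n\n"
            f"## Consequences\n{self.consequences}\n"
        )
```

Committing such files next to the code gives traceability: a new team member can read why a choice was made, not just what was chosen.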
- Quality Attribute Workshop (QAW)
- Purpose: Identify and define the system's key quality attributes (non-functional requirements) and decompose them into concrete, measurable scenarios.
- Method: Stakeholder interviews, brainstorming, transforming vague requirements (e.g., "high availability") into specific scenarios (e.g., "If the primary database node fails, the system should automatically failover within 30 seconds with an RPO of 0").
- Architecture Tradeoff Analysis Method (ATAM)
- Purpose: Systematically evaluate whether an architecture design meets key quality attribute requirements, identifying risks and trade-off points.
- Method: Based on QAW scenarios, analyze how the architecture addresses them, identify sensitivity points (design decisions affecting multiple attributes) and tradeoff points (choices made when multiple attributes conflict).
- Technology Radar
- Purpose: Assess and track the maturity and recommended adoption level of techniques, tools, platforms, and languages/frameworks used within an organization.
- Quadrants: Techniques, Tools, Platforms, Languages & Frameworks.
- Rings: Adopt, Trial, Assess, Hold.
- Value: Guides technology selection, promotes innovation and standardization.
- Architecture Health Check / Scorecard
- Purpose: Periodically quantify the architecture's performance across key dimensions (e.g., reliability, scalability, security, cost), identifying areas for improvement.
- Method: Define assessment dimensions and metrics, assign weights, score, compare against targets (Ref Appendices Six/Eight).
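A minimal sketch of such a scorecard, assuming illustrative dimensions, weights, and a 0-100 scale:

```python
# Architecture health check: score each dimension, weight it, and compare the
# result against a target bar. All numbers here are illustrative.

def health_score(scores: dict, weights: dict) -> float:
    """Weighted average on a 0-100 scale; weights are assumed to sum to 1.0."""
    return sum(scores[d] * weights[d] for d in weights)

weights = {"reliability": 0.35, "scalability": 0.25, "security": 0.25, "cost": 0.15}
scores  = {"reliability": 80, "scalability": 70, "security": 90, "cost": 60}

total = health_score(scores, weights)                 # ~77.0 overall
gaps = {d: s for d, s in scores.items() if s < 75}    # dimensions below a 75-point bar
# gaps -> {'scalability': 70, 'cost': 60}: candidates for the next improvement cycle
```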
Part Nine: Domain-Specific Design Considerations (Mainly from Appendix Eight)
I. Mobile-Specific Design
- Architecture Evolution: MVC -> MVP/MVVM -> Componentization -> Cross-Platform (React Native/Flutter) -> Declarative UI (SwiftUI/Jetpack Compose).
- Performance Optimization: Startup speed (phased loading, AOT), Memory management (leak detection, image caching), Render performance (async drawing, offscreen-rendering optimization), Bundle-size reduction (code/resource obfuscation, dynamic loading).
- Network Optimization: Weak-network handling (retry, timeout, compression), DNS pre-resolution, Connection reuse.
- Power Optimization: Reduce background activity, Batch network requests, Use location/sensors judiciously.
- Hybrid Development: Communication between Native and Web/JS/Dart (Bridge, JSI, FFI).
II. Internet of Things (IoT) System Design
- Massive Device Onboarding: Lightweight protocols (MQTT, CoAP), Persistent-connection management (EMQX/NATS), Device authentication & security.
- Edge Computing: Data filtering/preprocessing/aggregation, Local rule engine/AI models, Store-and-forward on disconnect, Reduce cloud load & bandwidth cost.
- Data Storage: Time-series DB (InfluxDB/TDengine) for sensor data, Cold/Hot separation.
- Device Management: Registration/Discovery, Device Shadow for state sync, Remote configuration push, OTA firmware updates.
- Low Power Design: Protocol choice (CoAP, LoRaWAN, BLE), Sleep/wake mechanisms.
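The Device Shadow idea above reduces to a state diff: the cloud keeps a "desired" and a "reported" state per device and pushes down the delta the device has not yet applied. A minimal sketch with hypothetical field names:

```python
# Device Shadow delta: desired values the device has not yet applied.
# Field names are illustrative.

def shadow_delta(desired: dict, reported: dict) -> dict:
    """Keys whose desired value differs from (or is missing in) the reported state."""
    return {k: v for k, v in desired.items() if reported.get(k) != v}

desired  = {"firmware": "2.1.0", "report_interval_s": 60}
reported = {"firmware": "2.0.3", "report_interval_s": 60}

shadow_delta(desired, reported)   # {'firmware': '2.1.0'}
```

Because the shadow lives in the cloud, the delta survives a disconnect: an offline device picks it up on reconnect, which pairs naturally with store-and-forward at the edge.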
III. Game Server Architecture
- Real-time Sync: Frame Sync (RTS/MOBA, low bandwidth, strong consistency), State Sync (MMO, high bandwidth, server authority), Prediction/Rollback (FPS, handles latency, complex), P2P hybrid.
- Matchmaking Service: Match based on player skill (ELO)/latency/preference, Rule engine.
- Scalability: Zoning/Sharding (e.g., by map/feature), Stateless Gateway + Stateful Logic Servers.
- Anti-Cheat: Client detection (memory/file integrity), Server validation (move speed/action freq.), Data analysis.
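Skill-based matchmaking commonly uses ELO-style ratings. A minimal sketch of the standard ELO expected-score and update formulas (the K-factor of 32 is an illustrative choice):

```python
# ELO rating update as used by skill-based matchmaking.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match; zero-sum exchange."""
    ea = expected_score(rating_a, rating_b)
    score = 1.0 if a_won else 0.0
    delta = k * (score - ea)
    return rating_a + delta, rating_b - delta

update(1500, 1500, a_won=True)   # -> (1516.0, 1484.0): equal players swap 16 points
```

A matchmaker would then pair players whose rating gap (and latency) falls below a threshold, per its rule engine.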
IV. Big Data Platform Architecture
- Architecture Patterns: Lambda (Batch+Stream) vs Kappa (Pure Stream) vs Hybrid (Real-time OLAP).
- Storage Formats: Parquet/ORC (Columnar for analytics), Avro (Serialization/Row-based), Delta/Iceberg/Hudi (Data Lake formats, support ACID).
- Compute Engines: Spark (Batch/Stream), Flink (Stream/Batch), Hive/Impala (SQL on Hadoop).
- Resource Scheduling: YARN, Mesos, Kubernetes.
- Data Governance: Metadata management, Data quality monitoring, Data lineage.
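In a Lambda architecture, the serving layer merges the complete-but-stale batch view with the fresh-but-partial real-time view. A minimal sketch, assuming per-key counters with hypothetical metric names:

```python
# Lambda serving layer: batch result plus real-time increments since the last
# batch run. Metric names and values are illustrative.

def serve(batch_view: dict, realtime_view: dict) -> dict:
    """Merge per-key counts from the batch and speed layers."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch    = {"page_views:/home": 10_000, "page_views:/about": 800}
realtime = {"page_views:/home": 42, "page_views:/pricing": 5}

serve(batch, realtime)
# {'page_views:/home': 10042, 'page_views:/about': 800, 'page_views:/pricing': 5}
```

Kappa removes this merge step entirely by recomputing everything from the stream, trading merge complexity for replay cost.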
V. AI Engineering and Platform
- MLOps Full Lifecycle: Data Preparation -> Feature Engineering -> Model Training -> Model Evaluation -> Model Deployment -> Monitoring & Feedback
- Distributed Training: Data Parallelism (Horovod), Model Parallelism (DeepSpeed).
- Feature Store: Centralized management, storage, serving of online/offline features (Feast).
- Model Serving: High-performance inference service (TF Serving, Triton), Serverless Inference (KServe).
- Experiment Mgmt & Version Control: MLflow, DVC.
- Pipeline Orchestration: Kubeflow Pipelines, Airflow, Metaflow.
- Large Model Challenges: Storage/Compute Opt. (ZeRO, Pipeline Parallelism), Inference Acceleration (Quantization, Pruning).
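A feature store's core contract (materialize offline-computed features, serve them online) can be sketched in-memory; the API shape here is illustrative and deliberately simpler than real systems such as Feast:

```python
# Minimal in-memory feature store sketch: offline features are materialized into
# an online key-value store for low-latency serving at inference time.

class FeatureStore:
    def __init__(self):
        self._online = {}  # (entity_id, feature_name) -> value

    def materialize(self, entity_id: str, features: dict) -> None:
        """Copy offline-computed features into the online store."""
        for name, value in features.items():
            self._online[(entity_id, name)] = value

    def get_online_features(self, entity_id: str, names: list) -> dict:
        """Low-latency lookup at serving time; missing features come back as None."""
        return {n: self._online.get((entity_id, n)) for n in names}

store = FeatureStore()
store.materialize("user:42", {"ctr_7d": 0.031, "purchases_30d": 3})
store.get_online_features("user:42", ["ctr_7d", "purchases_30d", "age"])
# {'ctr_7d': 0.031, 'purchases_30d': 3, 'age': None}
```

Centralizing features this way keeps training and serving consistent: both read the same definitions instead of re-implementing them per pipeline.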
Part Ten: Architect Growth and Decision Making (Mainly from Appendix Six)
I. Architect Competency Model
graph LR
A[Technical Depth] --> B(Systems Thinking)
B --> C(Business Acumen)
C --> D(Communication & Collaboration)
D --> E(Decision Making & Trade-offs)
E --> F(Technical Leadership)
- Technical Depth: Master core tech principles, possess hands-on skills.
- Systems Thinking: Understand interactions between system parts, have a holistic view.
- Business Acumen: Understand business goals and processes, connect tech with business.
- Communication & Collaboration: Clearly articulate designs, facilitate cross-team work.
- Decision Making & Trade-offs: Make sound tech choices under constraints.
- Technical Leadership: Influence teams, drive tech evolution and innovation.
II. Learning and Growth Path
| Stage | Key Competencies | Recommended Practices |
|---|---|---|
| Junior Engineer | Component Principles, Coding Standards | Participate in projects, Read source code, Write unit tests |
| Mid-level Engineer | Module/Subsystem Design, Troubleshooting | Lead module development, Code Review, Participate in On-Call |
| Senior Engineer | Complex System Design, Technology Selection | Lead project architecture, Solve performance bottlenecks, Tech sharing |
| Architect | Tech Strategy, Cross-domain Coordination | Define architecture evolution roadmap, Review designs, Solve tough problems |
| Senior Architect | Business Architecture Design, Organizational Enablement | Define tech standards, Drive platformization, Mentor talent |
III. Architecture Decision Framework
- Define Problem & Goals: What business problem to solve? Key quality attributes?
- Identify Constraints: Time, cost, team skills, existing tech stack, compliance, etc.
- Propose Candidate Solutions: Based on experience, research, industry practices.
- Evaluate & Trade-off: Use methods like ATAM, scorecards; assess pros/cons across dimensions, identify trade-offs.
- Record Decision (ADR): Clearly document final choice, rationale, potential impact.
- Validate & Iterate: Verify decision effectiveness via a POC or canary (gray) release; adjust based on feedback.
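The "Evaluate & Trade-off" step can be made concrete with a weighted decision matrix; the options, dimensions, weights, and scores below are hypothetical:

```python
# Weighted decision matrix: score each candidate solution per dimension,
# weight the dimensions, and rank. All names and numbers are illustrative.

def rank_options(options: dict, weights: dict) -> list:
    """Return option names sorted by weighted score, best first."""
    totals = {
        name: sum(scores[d] * weights[d] for d in weights)
        for name, scores in options.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

weights = {"performance": 0.4, "cost": 0.3, "team_familiarity": 0.3}
options = {
    "managed_kafka":        {"performance": 9, "cost": 5, "team_familiarity": 6},
    "self_hosted_rabbitmq": {"performance": 6, "cost": 8, "team_familiarity": 8},
}

rank_options(options, weights)
```

The numbers never decide for you; their value is forcing trade-offs (here, raw performance vs cost and familiarity) into the open, where an ADR can record them.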