Glossary of Technical Terms

1. Distributed Systems Fundamentals

1.1 Consistency Models

Strong Consistency
Definition: Any read operation can retrieve the most recently written data; the system behaves like a single machine.
Scenarios: Banking transfers, inventory deduction, and other scenarios with extremely high consistency requirements.
Association: Usually requires sacrificing availability (CP system in the CAP theorem).

Eventual Consistency
Definition: Data replicas reach a consistent state after some time; stale data might be read during this period.
Scenarios: Social media like counts, comment displays, and other scenarios where brief inconsistencies are tolerable.
Association: Default model for DNS systems, AP-type databases (e.g., Cassandra).

1.2 Fault Tolerance Mechanisms

Byzantine Fault Tolerance (BFT)
Definition: Allows a subset of nodes to fail arbitrarily (including malicious behavior) while still reaching consensus.
Scenarios: Blockchains, military systems, and other scenarios requiring defense against malicious attacks.
Association: Directly related to the PBFT (Practical Byzantine Fault Tolerance) algorithm.

Split Brain
Definition: A network partition splits a cluster into multiple sub-clusters, each of which elects or keeps its own master and may accept conflicting writes.
Scenarios: Can occur in improperly configured ZooKeeper/Redis Sentinel clusters.
Association: Needs prevention through Quorum or Lease mechanisms.

1.3 Core Protocols/Algorithms

Quorum Mechanism
Definition: A strategy to ensure consistency in distributed systems by setting read (R) and write (W) replica counts such that R + W > N (N is the total number of replicas).
Scenarios: Consistency control in NoSQL databases like Cassandra, ensuring read operations see at least one latest written replica.
Association: CAP theorem, trade-off between read/write performance and consistency.
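
The R + W > N condition can be checked mechanically. The sketch below is illustrative only: the function names and the (version, value) replica representation are assumptions, not any database's API. It shows that with N = 3 and W = 2, every possible read set of size R = 2 contains at least one replica holding the latest write.

```python
from itertools import combinations


def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """Return True when every read set must intersect every write set."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def newest_value(responses):
    """Pick the value with the highest version among (version, value) pairs."""
    return max(responses, key=lambda pair: pair[0])[1]


# Hypothetical state: N = 3 replicas, a write with W = 2 reached only the first two.
replicas = [(2, "new"), (2, "new"), (1, "old")]

# Every possible read quorum of size R = 2 sees the latest write.
reads = [newest_value(subset) for subset in combinations(replicas, 2)]
```

With R = 1 the condition fails (1 + 2 is not greater than 3), and a read that happens to contact only the third replica would return the stale value.
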
Lease
Definition: A mechanism in distributed protocols used to coordinate access to shared resources or confirm master node identity by issuing "leases" with expiration times.
Scenarios: Chunk leases granted by the master to a primary replica in GFS (Google File System), master leases in Chubby, preventing split brain.
Association: Distributed locks, Master election.
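
A lease can be modeled as a holder plus an expiration time. The following is a minimal single-process sketch; the `Lease` class and its method names are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow. A real implementation must also allow for clock drift between nodes.

```python
class Lease:
    """A lease that one node at a time may hold until it expires."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node: str, now: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def is_held_by(self, node: str, now: float) -> bool:
        """A node may act as master only while its lease is still valid."""
        return self.holder == node and now < self.expires_at
```

Because a node stops acting as master once its lease expires, two nodes cannot both consider themselves master at the same instant, provided their clocks agree closely enough.
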
Paxos Protocol
Definition: A message-passing based, highly fault-tolerant distributed consensus algorithm used to solve consensus problems in distributed systems.
Scenarios: Chubby (Google's distributed lock service), ensuring atomicity and ordering of operations in a distributed environment.
Association: Raft protocol (a more understandable variant), 2PC.
Raft Protocol
Definition: A distributed consensus algorithm designed to be easy to understand and implement, decomposing consensus into Leader Election, Log Replication, and Safety.
Scenarios: Core of distributed systems like etcd, Consul, TiKV.
Association: Paxos protocol, an engineering simplification and improvement.
Consistent Hashing
Definition: A special hashing algorithm where only a small amount of data needs remapping when nodes are added or removed, unlike traditional hash modulo which causes large-scale data migration.
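Scenarios: Client-side key distribution across Memcached clusters, partitioning in Dynamo-style stores such as Cassandra, content distribution in CDN. Redis Cluster uses a related but different scheme based on 16384 fixed hash slots.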
Association: Hash ring, Virtual Nodes.
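
The hash ring with virtual nodes can be sketched in a few lines. This is an illustrative implementation, not the one used by any particular client library; `HashRing`, the `vnodes` parameter, and the `node#i` naming of virtual nodes are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # SHA-256 is used only to spread keys evenly around the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns many positions (virtual nodes) on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (_hash(key),)) % len(self._ring)
        return self._ring[idx][1]
```

When a fourth node is added to a three-node ring, the only keys that change owner are those the new node takes over; keys never move between the existing nodes, which is the property that distinguishes this scheme from hash modulo.
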

2. Data Storage & Databases

2.1 Storage Engines

LSM Tree (Log-Structured Merge Tree)
Definition: A data structure achieving high write throughput through append-only writes and background merging (Compaction).
Scenarios: Write-intensive storage like LevelDB/RocksDB/HBase/Cassandra.
Association: Contrasted with B+ trees (e.g., MySQL InnoDB), which update pages in place and therefore incur random writes.

WAL (Write-Ahead Log)
Definition: Logging changes before applying them to data, used for crash recovery and ensuring durability.
Scenarios: Durability guarantee in databases (MySQL, PostgreSQL), message queues (Kafka).
Association: Works with Checkpoint mechanism, AOF (Append-Only File).
B+ Tree
Definition: A self-balancing tree data structure commonly used for indexing in databases and file systems. Characterized by storing all data in leaf nodes, non-leaf nodes storing only keys, and pointers linking leaf nodes.
Scenarios: Index implementation in relational databases like MySQL InnoDB, PostgreSQL.
Association: B Tree, optimized for range queries.

2.2 Transaction Isolation Levels

Repeatable Read
Definition: Multiple reads of the same data within a transaction yield consistent results. Under the SQL standard, Phantom Reads may still occur at this level.
Scenarios: Default level in MySQL InnoDB, suitable for most OLTP scenarios.
Association: Implemented via MVCC in InnoDB, which additionally uses Gap Locks (next-key locks) to prevent phantom reads for locking reads.

Snapshot Isolation
Definition: Transactions operate on a data snapshot from a specific time point, avoiding dirty reads, non-repeatable reads, and usually phantom reads, but Write Skew may occur.
Scenarios: Supported by Oracle, PostgreSQL, etc., suitable for long transactions.
Association: Strongly related to Multi-Version Concurrency Control (MVCC).

2.3 Concurrency Control

MVCC (Multi-Version Concurrency Control)
Definition: A technique enabling concurrent access by preserving snapshots of data at specific time points. Reads access snapshots, writes create new versions, achieving non-blocking reads.
Scenarios: Implementing non-blocking reads in databases like PostgreSQL, MySQL InnoDB, Oracle.
Association: Snapshot Isolation, Repeatable Read, Undo Log.
2PC (Two-Phase Commit)
Definition: A protocol ensuring atomicity for distributed transactions. Divided into a Prepare phase and a Commit phase.
Scenarios: Basis for the XA specification, used for distributed transactions across multiple databases or resources.
Association: Distributed transactions, performance bottleneck (synchronous blocking), comparison with Saga, TCC patterns.
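
The two phases can be illustrated with an in-memory coordinator. This is a deliberately simplified sketch: `Participant` and `two_phase_commit` are hypothetical names, and the timeouts, durable logging, and coordinator-failure handling that a real XA implementation needs are omitted.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Vote yes (and hold resources) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit): commit only on a unanimous yes, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

The synchronous blocking noted above is visible here: no participant can finish until every vote has been collected.
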
Distributed Lock
Definition: A mechanism in a distributed environment to control mutually exclusive access to shared resources by multiple processes/threads.
Scenarios: Preventing overselling in flash sales (locking product ID), preventing duplicate execution of scheduled tasks by multiple instances.
Association: Implementations based on Redis (RedLock), ZooKeeper, etcd; potential performance bottlenecks and deadlock risks.

2.4 Database Architectures

Read-Write Splitting
Definition: A database architecture pattern routing write operations to a primary database (Master) and read operations to one or more secondary databases (Slave) to improve read performance and availability.
Scenarios: Business scenarios with high read-to-write ratios, such as news portals, blog platforms.
Association: Master-slave replication, data latency issues, read-write splitting middleware (ProxySQL, ShardingSphere).
Sharding (Database/Table Splitting)
Definition: Splitting data into different database instances (database sharding) or tables (table sharding) based on specific rules (e.g., hash, range) to address performance bottlenecks caused by excessively large single databases/tables.
Scenarios: User tables, order tables with tens or hundreds of millions of records.
Association: Horizontal sharding, vertical sharding, sharding middleware (MyCAT, ShardingSphere), cross-shard query issues, distributed ID generation.
HTAP (Hybrid Transactional/Analytical Processing)
Definition: An architecture aiming to support both high-concurrency online transaction processing (OLTP) and complex online analytical processing (OLAP) within the same database system.
Scenarios: Businesses requiring real-time analysis of transactional data, such as real-time risk control, real-time reporting.
Association: TiDB (TiKV + TiFlash), combination of row store and column store, resource isolation.
CDC (Change Data Capture)
Definition: A technique used to capture data changes (INSERT, UPDATE, DELETE) in a database.
Scenarios: Data synchronization (e.g., MySQL to Elasticsearch), building real-time data warehouses, event-driven architectures.
Association: Debezium, Maxwell, Canal, based on database Binlog.

2.5 NoSQL & Caching

Redis Sorted Set
Definition: A Redis data structure similar to a combination of Java's SortedSet and HashMap. Each element is associated with a score, Redis sorts elements by score, and allows efficient score retrieval by element name.
Scenarios: Leaderboards (sorted by score), social network Feed streams (sorted by timestamp).
Association: Backed by a skip list plus a hash table for larger sets: O(log N) insertion (ZADD) and rank lookup (ZRANK), O(1) score lookup by member (ZSCORE).
Cache Pattern
Cache-Aside: Most common pattern. Read: Check cache first, if miss, read DB, then write back to cache. Write: Update DB first, then delete cache.
Read-Through: Application reads only from cache; cache is responsible for reading from DB.
Write-Through: Application writes to cache; cache is responsible for writing to DB (synchronously).
Write-Back (Write-Behind): Application writes to cache; cache writes to DB asynchronously.
Scenarios: Caching data in web applications to improve read performance.
Association: Redis, Memcached, cache coherency issues.
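
The Cache-Aside read and write paths described above can be sketched with two dictionaries standing in for the cache and the database. The `CacheAside` class and the `db_reads` counter are assumptions made for illustration; a real deployment would use a cache such as Redis and an actual database client.

```python
class CacheAside:
    def __init__(self, db: dict):
        self.db = db          # stands in for the database
        self.cache = {}       # stands in for Redis or Memcached
        self.db_reads = 0     # counts how often the database is hit

    def read(self, key):
        # Read: check the cache first; on a miss, read the DB and fill the cache.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Write: update the DB first, then delete (not update) the cache entry.
        self.db[key] = value
        self.cache.pop(key, None)
```

Deleting the cache entry on write, instead of overwriting it, means the next read repopulates the cache from the database, which narrows the window in which the two can disagree.
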
Cache Issues
Cache Breakdown: A hot key expires, causing numerous requests to hit the DB simultaneously. Solutions: Mutex lock, logical expiration.
Cache Penetration: Querying non-existent data bypasses the cache and hits the DB directly. Solutions: Bloom filter, caching null values.
Cache Avalanche: A large number of cache keys expire at the same time, or the cache service goes down, leading to immense pressure on the DB. Solutions: Add random jitter to expiration times, multi-level caching, rate limiting, circuit breaking.
Bloom Filter
Definition: A space-efficient probabilistic data structure used to test whether an element is possibly a member of a set. Has a certain false positive rate (may mistakenly identify an element not in the set as being in the set) but never false negatives (an element in the set will always be identified correctly).
Scenarios: Preventing cache penetration, web crawler URL deduplication, blacklist filtering.
Association: False Positive Rate, hash functions.
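
A minimal Bloom filter shows why false negatives are impossible while false positives are not. The sizes below and the use of seeded SHA-256 digests as the hash family are assumptions chosen for brevity; production filters size the bit array from the expected element count and target false positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

Adding an element only ever sets bits, so an added element always finds all of its bits set. An element that was never added can still find its bits set by other elements, which is the source of false positives.
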

3. Computation & Messaging Systems

3.1 Stream Processing Concepts

Event Time vs Processing Time
Definition: Event time is when the event actually occurred at its source; processing time is the wall-clock time at which the system processes it.
Scenarios: Processing out-of-order events (e.g., delayed IoT device reports) requires event time.
Association: Flink's Watermark mechanism is based on event time.

State Management
Definition: Intermediate data retained across events in stream computing (e.g., window aggregation results).
Scenarios: Real-time statistics, session window analysis.
Association: Requires coordination with Checkpoint for fault recovery (e.g., Flink's RocksDB state backend).
Watermark
Definition: A mechanism in stream processing indicating the progress of event time, suggesting that the system believes data before a certain point has arrived, allowing event-time-based window computations to be triggered.
Scenarios: Flink processing out-of-order events, ensuring the completeness of window calculations.
Association: Event Time, Window (Tumbling/Sliding/Session), Allowed Lateness.
Window
Definition: A mechanism in stream processing that divides an unbounded data stream into bounded chunks for processing.
Types: Tumbling Window, Sliding Window, Session Window.
Scenarios: Real-time statistics (e.g., PV per minute), aggregate analysis (e.g., user session ends after 30 minutes of inactivity).
Association: Watermark, Event Time, Processing Time.
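
Tumbling windows and watermarks can be illustrated without a streaming framework. The sketch below is not Flink's API; the function names are hypothetical and event times are assumed to be integer seconds. Events are assigned to windows by event time, so an out-of-order event still lands in the correct window, and a window is considered complete once the watermark passes its end.

```python
from collections import defaultdict


def tumbling_window_counts(events, size):
    """Count events per [start, start + size) window, keyed by event time."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // size) * size
        counts[window_start] += 1
    return dict(counts)


def closed_windows(counts, size, watermark):
    """Windows whose end is at or before the watermark may emit their result."""
    return {start: n for start, n in counts.items() if start + size <= watermark}


# The event at t=59 arrives after the event at t=61 (out of order).
events = [(1, "a"), (61, "b"), (59, "c"), (125, "d")]
counts = tumbling_window_counts(events, size=60)
ready = closed_windows(counts, size=60, watermark=120)
```

Here the late event at t=59 is still counted in the window starting at 0, and with the watermark at 120 only the windows starting at 0 and 60 are ready to emit.
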
Backpressure
Definition: In a data processing pipeline, when a downstream operator's processing speed cannot keep up with an upstream operator's sending speed, the downstream provides feedback pressure to the upstream, causing it to reduce its sending rate.
Scenarios: Preventing memory overflow or task failure due to insufficient processing capacity in stream processing frameworks like Spark Streaming, Flink.
Association: TCP sliding window, flow control.
CEP (Complex Event Processing)
Definition: A technology based on event streams that infers, analyzes, and makes decisions in real-time by identifying patterns, relationships, and abstractions within the streams.
Scenarios: Financial fraud detection (e.g., "multiple small transactions followed by a large withdrawal within a short period"), real-time risk control, IoT device anomaly detection.
Association: Flink CEP library, Drools Fusion, rule engines.

3.2 Messaging Patterns

Publish/Subscribe (Pub-Sub)
Definition: Message producers (Publisher) send messages to a topic (Topic), and multiple consumers (Subscriber) consume them independently.
Scenarios: Kafka's Topic mode, Redis Pub/Sub, log broadcasting, event notification.
Association: Contrasted with the Point-to-Point (Queue) model.
Dead Letter Queue (DLQ)
Definition: Messages that cannot be consumed normally (e.g., reached max retry count, incorrect message format) are routed to a special queue for later analysis or processing.
Scenarios: Handling messages rejected due to format errors or business rules, avoiding blocking the normal queue.
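Association: RabbitMQ supports this natively through dead letter exchanges; Kafka has no broker-level DLQ, so it is implemented as a separate topic by the consumer application or by Kafka Connect.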
Push-Pull Hybrid
Definition: A strategy combining push (write diffusion) and pull (read diffusion) models, often used in scenarios like Feed streams.
Scenarios: Social network Timeline. Use push model for users with few followers (push posts to followers' cache upon posting), use pull model for popular users (fetch timeline on read), balancing read and write performance.
Association: Feed stream design, write diffusion, read diffusion.

3.3 Computation Frameworks

MapReduce
Definition: A programming model and software framework proposed by Google for processing large data sets. The core idea is to decompose computation tasks into Map and Reduce phases.
Scenarios: Offline batch processing, such as log analysis, inverted index construction, PageRank calculation.
Association: Hadoop, Spark (as its performance improvement), divide and conquer strategy.
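
The Map and Reduce phases can be shown with the classic word count, run in a single process. This is a sketch of the programming model only; the function names are illustrative, and a real framework would distribute the map, shuffle, and reduce steps across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, both phases can run in parallel, which is the divide and conquer strategy noted above.
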
DAG (Directed Acyclic Graph)
Definition: A graph structure where nodes represent operations or tasks, directed edges represent dependencies, and there are no cycles.
Scenarios: Spark's task scheduling engine, building a DAG from a series of RDD transformations to optimize the execution plan and reduce intermediate data disk writes.
Association: Spark RDD, wide/narrow dependencies.
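
Scheduling a DAG amounts to running tasks in a topological order. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and is unrelated to Spark's scheduler; the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """dependencies maps each task to the set of tasks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-stage job: read, then map, then reduce.
stages = {"reduce": {"map"}, "map": {"read"}, "read": set()}
order = execution_order(stages)
```

If the dependencies contain a cycle, `static_order()` raises `CycleError`, which is why the graph must be acyclic for a valid execution plan to exist.
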
RDD (Resilient Distributed Dataset)
Definition: Spark's core abstraction, an immutable, partitioned collection of elements that can be operated on in parallel, with fault tolerance (recovery via lineage).
Scenarios: The foundation for in-memory computation in Spark.
Association: Spark, memory caching (persist), DAG, lineage.

4. Cloud Native & Operations

4.1 Service Governance

Circuit Breaking
Definition: Temporarily prevents further calls to a dependent service when its failure rate or response time exceeds a threshold (like an electrical circuit breaker), attempting recovery after a period.
Scenarios: Dependency calls between microservices (e.g., Hystrix/Sentinel/Istio), preventing cascading failures (snowball effect).
Association: Forms the fault tolerance trio with Fallback and Rate Limiting.
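
The open, half-open, and closed behavior of a circuit breaker can be sketched as a small state holder. This is a simplification, not the Hystrix or Sentinel implementation: it trips on a count of consecutive failures instead of a failure rate, the class and parameter names are hypothetical, and the current time is passed in explicitly.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """Return True if a call may be attempted at time `now`."""
        if self.opened_at is None:
            return True
        # Half-open: once the timeout has elapsed, let a trial call through.
        return now - self.opened_at >= self.reset_timeout

    def record_success(self):
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit
```

A caller would check `allow()` before each request, return a fallback when it is False, and report the outcome with `record_success()` or `record_failure()`.
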

Canary Release
Definition: Introduces a new version of a service into the production environment, directing only a small portion of user traffic (e.g., 1%) to it initially to observe its behavior. Gradually increases the traffic proportion if verified okay, eventually completing the full rollout.
Scenarios: A/B testing, smooth deployment of high-risk changes, new feature validation.
Association: Requires traffic shifting mechanisms (e.g., Istio VirtualService), Blue-Green Deployment.
Service Mesh
Definition: An infrastructure layer for handling inter-service communication. It reliably delivers requests, typically implemented by deploying a lightweight network proxy (Sidecar) alongside each service, forming the mesh.
Scenarios: Traffic management, service discovery, load balancing, circuit breaking, telemetry, secure communication in microservice architectures.
Association: Istio, Linkerd, Envoy (as Sidecar proxy).
API Gateway
Definition: Acts as a single entry point for all client requests in a microservice architecture. Handles common functions like request routing, protocol translation, authentication/authorization, rate limiting/circuit breaking, logging/monitoring.
Scenarios: Frontend access layer in microservice architectures.
Association: Kong, Spring Cloud Gateway, Zuul.
RPC (Remote Procedure Call)
Definition: Allows a program on one computer to call a subroutine on another computer without the programmer explicitly coding for this interaction.
Scenarios: Synchronous communication between microservices.
Association: gRPC, Dubbo, Thrift, REST API (for comparison).

4.2 Observability

Three Pillars of Observability
Definition: Refers to the three key data sources for building observable systems: Logs, Metrics, and Traces.
Logs: Record discrete events, like error messages, application startup logs.
Metrics: Aggregatable numerical data reflecting system state, like CPU usage, QPS.
Traces: Record the complete path and latency of a single request across multiple services.
Scenarios: Understanding distributed system behavior, quickly locating and diagnosing problems. Core goal of monitoring system evolution (Course 4).
Association: ELK (Logs), Prometheus (Metrics), Jaeger/Zipkin (Traces), OpenTelemetry.
OpenTelemetry
Definition: An open-source project hosted by CNCF aiming to provide a unified set of APIs, libraries, agents, and collector services for generating, collecting, and exporting telemetry data (metrics, logs, traces) for effective observability.
Scenarios: Standardizing monitoring data collection for distributed systems.
Association: Formed by merging OpenTracing and OpenCensus, which it supersedes; integrates with backends like Jaeger, Prometheus.
RED Metrics
Definition: A metric model for monitoring microservice health, focusing on three key indicators: Rate (requests per second), Errors (proportion of failed requests), Duration (distribution of request processing times, e.g., P99 latency).
Scenarios: Quickly assessing service operational status and performance.
Association: Similar to Google SRE's Four Golden Signals (Latency, Traffic, Errors, Saturation). USE metrics (Utilization/Saturation/Errors).
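
The three RED indicators can be computed directly from a list of observed requests. The sketch below is illustrative: the input shape (duration in milliseconds plus a success flag) and the nearest-rank definition of P99 are assumptions, and monitoring systems such as Prometheus typically estimate percentiles from histograms instead.

```python
def red_metrics(requests, window_seconds):
    """requests: list of (duration_ms, succeeded) observed within the window."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_ms": None}
    n = len(requests)
    durations = sorted(duration for duration, _ in requests)
    errors = sum(1 for _, succeeded in requests if not succeeded)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integer arithmetic.
    rank = (99 * n + 99) // 100
    return {
        "rate": n / window_seconds,   # Rate: requests per second
        "error_ratio": errors / n,    # Errors: proportion of failed requests
        "p99_ms": durations[rank - 1],  # Duration: P99 latency
    }
```

For example, 100 requests in a 10-second window with one failure give a rate of 10 requests per second and an error ratio of 0.01.
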
Alertmanager
Definition: An alert handling component in the Prometheus ecosystem. Receives alert notifications from Prometheus Server (or other clients), performs deduplication, grouping, silencing, inhibition, and routes alerts via configured receivers (e.g., Email, Slack, Webhook).
Scenarios: Centralized management and dispatch of alerts in monitoring systems (Course 4).
Association: Prometheus, Alerting Rules.

4.3 Containers & Orchestration

Containerization
Definition: A lightweight operating-system-level virtualization technique allowing applications and all their dependencies (libraries, config files, etc.) to be packaged into a standardized unit (container) for deployment.
Scenarios: Application deployment, ensuring environment consistency, packaging microservices.
Association: Docker, Kubernetes, OCI (Open Container Initiative).
Kubernetes (K8s)
Definition: An open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
Scenarios: Large-scale container cluster management, microservice deployment, CI/CD.
Association: Docker, Pod, Service, Deployment, CNCF.
Service Discovery
Definition: The process of dynamically finding available service instances (IP address and port) in a distributed system.
Scenarios: Service consumers needing to find the network location of service providers in a microservice architecture.
Association: Kubernetes Service, Consul, Nacos, Eureka.
Sidecar Pattern
Definition: A deployment pattern splitting application functionality into separate processes (sidecars). The Sidecar container is deployed alongside the main application container in the same Pod, sharing network and storage volumes, used to enhance the main application's functionality (e.g., log collection, monitoring agent, service mesh proxy).
Scenarios: Istio's Envoy proxy, log collection agents (e.g., Filebeat).
Association: Kubernetes Pod, Service Mesh.

4.4 Serverless

Serverless Architecture
Definition: A cloud computing execution model where the cloud provider dynamically manages the allocation and provisioning of server resources. Applications are deployed as functions, billed based on actual usage, and developers don't manage underlying servers.
Scenarios: Event-driven applications (e.g., image processing, API backends), scheduled tasks, scenarios requiring extreme elasticity.
Association: FaaS (Function as a Service), BaaS (Backend as a Service), AWS Lambda, Google Cloud Functions, Azure Functions, Alibaba Cloud Function Compute (FC).

5. AI & Big Data

5.1 Machine Learning

Embedding
Definition: Maps discrete objects (like words, items, users) into low-dimensional dense continuous vectors. These vectors capture semantic or relational features of the objects.
Scenarios: Similarity calculation in recommendation systems (e.g., finding similar items/users), word vector representation in NLP (Word2Vec, GloVe, BERT), image recognition.
Association: Used with vector search libraries and databases such as Faiss, Pinecone, and Milvus for efficient similarity search.
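
Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors purely for illustration; real embeddings come from a trained model and have far more dimensions, and large collections are searched with an approximate index, not the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Return the name whose vector is closest to the query's vector."""
    return max(
        (name for name in embeddings if name != query),
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
    )


# Hypothetical toy embeddings, invented for this example.
toy = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
nearest = most_similar("cat", toy)
```

In this toy data the vector for "kitten" points in nearly the same direction as "cat", so it is returned as the nearest neighbor.
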

Federated Learning
Definition: A distributed machine learning technique allowing multiple parties to train models locally using their own data, then sharing only model updates (like gradients or weights) instead of raw data, enabling joint modeling while protecting data privacy.
Scenarios: Joint modeling in privacy-sensitive domains like healthcare and finance, cross-institutional data collaboration.
Association: Relies on Secure Aggregation protocols, Differential Privacy. FATE framework.
Feature Engineering
Definition: The process of using domain knowledge and data analysis techniques to extract, construct, or select features (variables) most useful for machine learning model predictive performance from raw data.
Scenarios: Almost all machine learning tasks, a critical step determining the upper limit of model effectiveness. User, item, and context feature construction in recommendation systems (Course 7) and risk control systems (Course 5).
Association: Feature extraction, feature selection, feature transformation, feature stores (e.g., Feast).
Model Serving
Definition: The process of deploying a trained machine learning model as a service accessible via API (e.g., REST or gRPC), enabling it to receive input data and return predictions in a production environment.
Scenarios: Integrating AI models into real applications, such as online recommendations (Course 7), image recognition APIs, fraud detection services (Course 5). A core component of AI platforms (Course 12).
Association: TensorFlow Serving, Triton Inference Server, KFServing/KServe, Seldon Core, model compression & optimization (ONNX, TensorRT).
MLOps
Definition: A set of practices aimed at reliably and efficiently building, deploying, and operating machine learning systems. It combines Machine Learning (ML), Data Engineering, and DevOps, aiming to streamline the entire lifecycle from data preparation to model monitoring.
Scenarios: Achieving end-to-end automation, standardization, and reproducibility for machine learning projects from experiment to production. Core philosophy of AI platforms (Course 12).
Association: CI/CD for ML, Kubeflow, MLflow, Metaflow, experiment tracking, model monitoring, version control, feature stores.
CTR (Click-Through Rate)
Definition: The ratio of clicks to impressions for an online advertisement or recommended content.
Scenarios: A core optimization metric in recommendation systems (Course 7) and online advertising, used to evaluate content attractiveness.
Association: CVR (Conversion Rate), Rank Model.
CVR (Conversion Rate)
Definition: The ratio of users completing a desired action (e.g., purchase, registration) to the number of impressions or clicks.
Scenarios: Key business metric for e-commerce recommendations (Course 7) and ad effectiveness evaluation.
Association: CTR, multi-objective optimization.
Recall
Definition: In recommendation systems or information retrieval, the process of initially filtering a candidate subset of potentially relevant items from a massive item pool (also called retrieval or candidate generation; distinct from the recall evaluation metric).
Scenarios: The first stage of recommendation systems (Course 7), aiming to include as many relevant items as possible, prioritizing coverage over precision. Common strategies include collaborative filtering, content-based recall, popular item recall.
Association: Ranking, Multi-channel Recall, coverage.
Ranking
Definition: In recommendation systems, the process of finely sorting the candidate item set generated during the recall phase. The goal is to predict the user's preference degree (e.g., click-through rate) for each item and rank the most likely interested items higher.
Scenarios: The core stage of recommendation systems (Course 7), directly impacting user experience and business metrics.
Association: Fine-Ranking, Coarse-Ranking, CTR estimation, machine learning ranking models (LR, GBDT, DeepFM).
Multi-Objective Optimization
Definition: In recommendation or decision systems, simultaneously optimizing multiple conflicting or related objectives (e.g., CTR, CVR, user dwell time, content diversity) rather than just a single objective.
Scenarios: Later stages of recommendation systems (Course 7), balancing short-term user interest with long-term experience and platform business goals.
Association: Ranking model design, A/B testing.
EE (Explore & Exploit)
Definition: A strategic balance in recommendation or decision systems. Exploit refers to recommending items most likely to succeed based on known user preferences; Explore refers to trying new items or areas the user might be interested in to discover new preferences and avoid filter bubbles.
Scenarios: Enhancing diversity and novelty in recommendation systems (Course 7), addressing the cold start problem.
Association: Bandit algorithms (e.g., Thompson Sampling, LinUCB).
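
The explore/exploit trade-off can be made concrete with UCB1, a simple non-contextual bandit algorithm; it is used here in place of the Thompson Sampling and LinUCB algorithms named above because it is deterministic and short. The class and arm names are hypothetical.

```python
import math


class UCB1:
    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Explore: try every arm once before trusting any estimate.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]  # exploit term
            bonus = math.sqrt(2 * math.log(t) / self.counts[arm])  # explore term
            return mean + bonus

        return max(self.counts, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward
```

Arms with a high average reward are chosen most often, but the bonus term grows for arms that have rarely been tried, so a seldom-shown item still gets occasional exposure.
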

5.2 Big Data Architecture

Lambda Architecture
Definition: A data processing architecture combining batch processing (Batch Layer) and stream processing (Speed Layer). The batch layer processes all historical data for accurate results, the speed layer processes real-time data for low-latency approximate results, and the serving layer merges results from both.
Scenarios: Systems needing both real-time and offline analysis, but complex and costly to maintain.
Association: Gradually being replaced or simplified by the Kappa architecture (pure stream processing).
Kappa Architecture
Definition: A simplified data processing model derived from Lambda, assuming all data processing can be done via stream processing. It has only one processing layer (stream layer); historical data can be reprocessed by re-consuming messages from the message queue.
Scenarios: Scenarios prioritizing architectural simplicity and high real-time requirements.
Association: Stream processing technologies like Flink, Kafka; evolution of the Lambda architecture.
Data Lake
Definition: A centralized repository allowing storage of all structured and unstructured data at any scale. Data can be stored as-is without predefined structure, processed later when needed for analysis.
Scenarios: Enterprise-level data analytics foundation, storing raw data for use by multiple analytics engines.
Association: Contrasted with Data Warehouse's Schema-on-Write model; Data Lakes use Schema-on-Read. Delta Lake, Iceberg, and Hudi are common open table formats for data lakes.
Data Warehouse
Definition: A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process. Typically stores structured data that has been cleaned, transformed, and integrated.
Scenarios: Business Intelligence (BI) reporting, decision support systems.
Association: ETL (Extract, Transform, Load), OLAP, Data Mart, Snowflake, Redshift, BigQuery.

5.3 Distributed Training

Data Parallelism
Definition: A distributed training strategy where training data is split into multiple minibatches. Each compute device (e.g., GPU) loads a full model replica, processes one minibatch, computes gradients, and then synchronizes or averages gradients (e.g., via AllReduce) to update model parameters.
Scenarios: Accelerating training when a single GPU's memory is sufficient for the model. Common acceleration technique in AI platforms (Course 12).
Association: Horovod, PyTorch DDP, NCCL/Gloo (communication libraries), comparison with Parameter Server architecture.
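
The gradient-averaging step can be simulated in plain Python. The sketch below is a toy, not PyTorch DDP or Horovod: the model is the one-parameter function y = w * x, the "devices" are list slices, and the batch size is assumed to be divisible by the number of devices.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every device ends up with the average."""
    return sum(values) / len(values)


def data_parallel_step(w, batch, num_devices, lr):
    size = len(batch) // num_devices
    shards = [batch[i * size:(i + 1) * size] for i in range(num_devices)]
    # Each device holds the same w and computes a gradient on its own shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging the local gradients gives every replica the same update.
    return w - lr * all_reduce_mean(local_grads)
```

With equal-sized shards, the average of the per-shard gradients equals the gradient over the whole batch, so the replicas stay identical after every step.
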
Model Parallelism
Definition: A distributed training strategy used when a model is too large to fit into a single compute device. Different parts of the model (whole layers, or slices of the tensors within a layer) are distributed across different devices for computation.
Types: Pipeline Parallelism, Tensor Parallelism.
Scenarios: Training ultra-large models (e.g., GPT-3 scale models with hundreds of billions of parameters). Key technology for large model training in AI platforms (Course 12).
Association: DeepSpeed, Megatron-LM, ZeRO.
ZeRO (Zero Redundancy Optimizer)
Definition: A technology proposed by Microsoft DeepSpeed to optimize memory usage for large-scale model training. It significantly reduces memory redundancy per GPU by partitioning optimizer states, gradients, and model parameters across data-parallel processes.
Scenarios: Training extremely large models with high memory demands on AI platforms (Course 12).
Association: DeepSpeed, Data Parallelism, Model Parallelism.
Parameter Server (PS)
Definition: A distributed machine learning architecture. Compute nodes are divided into parameter servers (storing and updating model parameters) and worker nodes (computing gradients). Workers pull parameters from PS, compute gradients, and push them back to PS for updates.
Scenarios: Scenarios with extremely high feature dimensions (sparse models), such as recommendation systems, ad click-through rate prediction. An alternative distributed training approach in AI platforms (Course 12).
Association: Widely used in early TensorFlow versions; asynchronous updates can lead to stale gradients. Comparison with Data Parallelism (AllReduce).
ONNX (Open Neural Network Exchange)
Definition: An open format for representing deep learning models. Aims to enable model interoperability between different frameworks (e.g., TensorFlow, PyTorch, Caffe2) and facilitate deployment across various hardware platforms.
Scenarios: Cross-framework model training, optimization, and deployment in AI platforms (Course 12).
Association: Model Serving, Inference Optimization (TensorRT, ONNX Runtime).
Inference Optimization
Definition: Various optimizations applied to trained machine learning models to improve their prediction (inference) speed, reduce latency, and minimize resource consumption in production environments.
Methods: Quantization, Pruning, Distillation, Operator Fusion, using optimized inference engines.
Scenarios: Key part of model serving in AI platforms (Course 12), especially for large models and edge computing.
Association: TensorRT, OpenVINO, ONNX Runtime, vLLM, DeepSpeed Inference.
Triton Inference Server
Definition: A high-performance inference server developed by NVIDIA, supporting multiple frameworks (TensorFlow, PyTorch, ONNX, TensorRT, etc.), providing HTTP/gRPC interfaces, and supporting features like dynamic batching, model management, multi-GPU inference.
Scenarios: Deploying and managing multiple machine learning models for online inference in AI platforms (Course 12) and recommendation systems (Course 7).
Association: Model Serving, KFServing/KServe, TensorFlow Serving, TorchServe.
KFServing / KServe
Definition: An open-source platform built on Kubernetes for deploying and serving machine learning models (now renamed KServe). Provides advanced features like Serverless inference, autoscaling, traffic management (Canary/Shadow), model explainability.
Scenarios: Standardized model deployment and management in Kubernetes environments for AI platforms (Course 12).
Association: Kubernetes, Serverless, Istio/Knative, Model Serving, Triton Inference Server.

6. Other Important Concepts

CDN (Content Delivery Network)
Definition: A network of servers distributed across different geographical locations used to deliver static content (like images, videos, CSS, JS files) faster and more reliably to users. Achieves acceleration by caching content on edge nodes closer to users.
Scenarios: Website acceleration, video on demand/live streaming, large file downloads.
Association: Edge computing, caching strategies (TTL).
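The TTL-based edge caching described in this entry can be sketched as follows. EdgeCache, fake_origin, and the 60-second TTL are hypothetical names and values chosen for illustration; a real CDN also handles cache invalidation, request routing, and HTTP cache headers.

```python
import time


class EdgeCache:
    """Hypothetical edge node: serves cached copies until their TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch_from_origin   # callable: path -> content
        self._ttl = ttl_seconds
        self._clock = clock               # injectable so expiry can be tested
        self._store = {}                  # path -> (content, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]
        # Missing or expired: go back to the origin and refresh the copy.
        self.misses += 1
        content = self._fetch(path)
        self._store[path] = (content, now + self._ttl)
        return content


# Demo with a fake clock and a fake origin that records each request.
origin_requests = []


def fake_origin(path):
    origin_requests.append(path)
    return f"content of {path}"


fake_time = [0.0]
cache = EdgeCache(fake_origin, ttl_seconds=60, clock=lambda: fake_time[0])

cache.get("/logo.png")   # miss: fetched from the origin
cache.get("/logo.png")   # hit: served from the edge copy
fake_time[0] = 61.0      # the TTL has now elapsed
cache.get("/logo.png")   # miss again: the cached copy expired
```

A longer TTL reduces origin traffic but lets users see stale content for longer after the origin changes.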
DNS (Domain Name System)
Definition: A core internet service that resolves human-readable domain names (e.g., www.google.com) into machine-readable IP addresses (e.g., 172.217.160.142).
Scenarios: Accessing websites, email routing, and all internet services relying on domain names.
Association: Eventual consistency, DNS caching, intelligent DNS (returning different IPs based on user location).
Nginx
Definition: A high-performance HTTP and reverse proxy server, also an IMAP/POP3/SMTP proxy server. Known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.
Scenarios: Web server, reverse proxy, load balancing, static/dynamic content separation, API gateway.
Association: Apache (for comparison), OpenResty (Nginx-based extension).
Load Balancing
Definition: The process of distributing network or application traffic across multiple servers. Aims to optimize resource use, maximize throughput, minimize response time, and ensure high availability.
Types: L4 (Transport Layer, e.g., TCP/UDP), L7 (Application Layer, e.g., HTTP).
Algorithms: Round Robin, Least Connections, IP Hash, URL Hash.
Scenarios: Improving scalability and reliability of services like web applications, databases.
Association: Nginx, HAProxy, F5 BIG-IP, Cloud provider load balancers (ELB, ALB).
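Three of the algorithms listed in this entry can be sketched in a few lines. The class and function names and the server names are hypothetical, and the sketch ignores health checks and server weights, which production load balancers such as Nginx or HAProxy also take into account.

```python
import itertools
import zlib


class RoundRobin:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the count reflects live connections.
        self.active[server] -= 1


def ip_hash(servers, client_ip):
    """Map a client IP to a server so the same client keeps hitting the same one."""
    return servers[zlib.crc32(client_ip.encode("utf-8")) % len(servers)]


servers = ["app-1", "app-2", "app-3"]
rr = RoundRobin(servers)
rr_order = [rr.pick() for _ in range(4)]   # wraps around after the last server
sticky = ip_hash(servers, "203.0.113.9")   # stable for a fixed server list
```

Note that this simple modulo-based IP hash remaps most clients whenever the server list changes.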
OAuth 2.0
Definition: An open authorization standard allowing users to grant third-party applications access to their information stored with another service provider without giving the third-party application their username and password.
Scenarios: Third-party login (e.g., "Login with Google/WeChat account"), API authorization.
Association: OpenID Connect (an identity layer built on OAuth 2.0).
JWT (JSON Web Token)
Definition: An open standard (RFC 7519) defining a compact and self-contained way of securely transmitting information (claims) between parties. This information can be verified and trusted because it is digitally signed; signing alone does not encrypt the payload, so anyone holding the token can read its claims.
Scenarios: API authentication and authorization, Single Sign-On (SSO).
Association: Session authentication (for comparison). A signed JWT typically consists of three parts: Header, Payload, and Signature.
SLA (Service Level Agreement)
Definition: A defined commitment between a service provider and a client regarding service performance (e.g., availability, response time).
Scenarios: Cloud service contracts, metrics for system reliability.
Association: SLI (Service Level Indicator), SLO (Service Level Objective). E.g., 99.99% availability ("four nines").
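An availability target translates directly into a downtime budget. The helper below is a small sketch of that arithmetic; the function name and the 30-day default period are assumptions, since real SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Downtime budget implied by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


three_nines_month = allowed_downtime_minutes(99.9)                  # about 43.2 minutes
four_nines_month = allowed_downtime_minutes(99.99)                  # about 4.32 minutes
four_nines_year = allowed_downtime_minutes(99.99, period_days=365)  # about 52.56 minutes
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why higher targets are disproportionately expensive to meet.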
Chaos Engineering
Definition: The discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. Involves proactively injecting failures (like network latency, node crashes) to test system resilience and recovery capabilities.
Scenarios: Verifying system fault tolerance, discovering potential weaknesses, improving system resilience.
Association: Netflix Chaos Monkey, Chaos Mesh, Gremlin, Litmus.

7. Additional Architecture Patterns

Monolithic Architecture
Definition: A traditional software architecture style where all functional modules of an application (e.g., UI, business logic, data access) are packaged into a single deployment unit.
Scenarios: Simple applications, rapid prototyping in early project stages, small team development.
Association: Microservice architecture (the monolith is its usual starting point and point of contrast), layered architecture.
Event-Driven Architecture (EDA)
Definition: A software architecture pattern emphasizing asynchronous communication and decoupling between components through the production, detection, consumption, and reaction to events.
Scenarios: Microservice decoupling, real-time response systems (e.g., risk control), complex business process orchestration.
Association: Message queues (Kafka/RabbitMQ), publish/subscribe pattern, Serverless.
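The publish/subscribe decoupling in this entry can be sketched with a tiny in-process event bus. EventBus, the "order_placed" event, and the handlers are hypothetical, and delivery here is synchronous; in a real event-driven system a broker such as Kafka or RabbitMQ sits between producers and consumers and delivers asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # event type -> list of handlers

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who consumes the event, or whether anyone does.
        for handler in self._subscribers.get(event_type, ()):
            handler(payload)


bus = EventBus()
reserved_items = []
sent_emails = []

# Two independent consumers react to the same event.
bus.subscribe("order_placed", lambda event: reserved_items.append(event["item"]))
bus.subscribe("order_placed", lambda event: sent_emails.append(event["customer"]))

bus.publish("order_placed", {"item": "keyboard", "customer": "alice@example.com"})
bus.publish("order_cancelled", {"item": "keyboard"})   # no subscribers: ignored
```

Adding a third consumer requires no change to the code that publishes the event, which is the decoupling the pattern aims for.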
Unified Batch and Stream Processing
Definition: A data processing paradigm aiming to use the same codebase or API to handle both bounded (batch) and unbounded (stream) data.
Scenarios: Situations requiring both real-time and offline analysis where simplified development and maintenance are desired.
Association: Flink, Spark Structured Streaming, Kappa architecture.
Cold/Hot Data Separation
Definition: A strategy of storing data on different storage media with varying costs and performance based on access frequency and timeliness. Hot data is stored on high-speed storage (e.g., SSD, memory), cold data on low-cost storage (e.g., HDD, object storage).
Scenarios: Log systems, monitoring systems, historical order archiving where data needs long-term storage but access frequency decreases over time.
Association: Data lifecycle management, storage cost optimization, time-series databases.
CQRS (Command Query Responsibility Segregation)
Definition: A pattern separating the model for data update operations (Commands) from the model for data read operations (Queries).
Scenarios: Systems with vastly different read and write loads, scenarios requiring separate optimization for reads and writes, complex query scenarios.
Association: Event Sourcing, read-write splitting, microservice architecture.
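A minimal sketch of the separation follows. The class names and the order fields are hypothetical; the command side validates and stores writes, and the query side keeps a denormalized view shaped for one specific read.

```python
class OrderReadModel:
    """Query side: a precomputed view optimized for 'total spent per customer'."""

    def __init__(self):
        self._totals = {}

    def apply(self, order):
        customer = order["customer"]
        self._totals[customer] = self._totals.get(customer, 0) + order["amount"]

    def total_for(self, customer):
        return self._totals.get(customer, 0)


class OrderWriteModel:
    """Command side: validates and records state changes."""

    def __init__(self, read_model):
        self._orders = {}
        self._read_model = read_model

    def place_order(self, order_id, customer, amount):
        if order_id in self._orders:
            raise ValueError("duplicate order id")
        if amount <= 0:
            raise ValueError("amount must be positive")
        order = {"id": order_id, "customer": customer, "amount": amount}
        self._orders[order_id] = order
        # Assumption: the view is updated synchronously here. In practice the
        # update is often asynchronous, so reads may briefly lag behind writes.
        self._read_model.apply(order)


view = OrderReadModel()
commands = OrderWriteModel(view)
commands.place_order("o-1", "alice", 30)
commands.place_order("o-2", "alice", 12)
commands.place_order("o-3", "bob", 5)
alice_total = view.total_for("alice")
```

Queries never touch the write model, so each side can be stored, indexed, and scaled independently.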
Compensation Mechanism / Saga Pattern
Definition: In distributed transactions, when a sub-transaction fails, predefined compensation operations are executed to undo the effects of successfully executed sub-transactions, achieving eventual consistency. Saga is a common pattern for implementing compensation.
Scenarios: Ensuring eventual consistency across multiple service operations where 2PC is infeasible or unsuitable.
Association: Distributed transactions, eventual consistency, Seata Saga mode, TCC.
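The compensation flow can be sketched as an orchestrator that runs steps in order and, on failure, undoes the completed ones in reverse. run_saga and the three order-processing steps are hypothetical; frameworks such as Seata also persist saga state and retry failed compensations, which this sketch omits.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; roll back completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo already-completed steps in reverse order.
            for _, undo in reversed(completed):
                undo()
            return False
        completed.append((name, undo_for(compensate)))
    return True


def undo_for(compensate):
    # Small indirection so each step keeps its own compensation callable.
    return compensate


log = []


def fail_shipping():
    raise RuntimeError("no courier available")


steps = [
    ("reserve_stock", lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    ("charge_card", lambda: log.append("charge card"), lambda: log.append("refund card")),
    ("book_shipping", fail_shipping, lambda: log.append("cancel shipping")),
]
succeeded = run_saga(steps)
```

The failing step's own compensation is not run, because that step never took effect; only the two earlier steps are undone.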
A/B Testing
Definition: An online experimentation method in which users are randomly divided into two groups (A and B), each group is shown a different product version (e.g., UI, algorithm), and user behavior metrics (e.g., CTR, conversion rate) are compared to determine which version performs better.
Scenarios: Product feature optimization, UI/UX improvements, evaluating recommendation algorithm/ad strategy effectiveness.
Association: Canary release, Feature Flag, statistical significance.
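Two pieces of an A/B test can be sketched briefly: stable group assignment and a significance check. The function names, the hash-based bucketing scheme, and the conversion counts are assumptions for illustration; the two-proportion z-test shown is one common choice, not the only valid analysis.

```python
import hashlib
import math


def assign_group(user_id, experiment, treatment_share=0.5):
    """Deterministically bucket a user so they always see the same version."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # uniform value in [0, 1)
    return "B" if bucket < treatment_share else "A"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value


group = assign_group("user-42", "new-checkout-button")

# Hypothetical results: 100/1000 conversions in A versus 150/1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
significant = p_value < 0.05
```

Hashing the experiment name together with the user ID keeps assignments independent across experiments while remaining stable for each user.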

8. Additional Data Storage & Processing

Database Index
Definition: A database object used to speed up data retrieval operations on a table. It works by creating pointers to the physical location of data, usually based on the values of one or more columns.
Scenarios: Accelerating WHERE clauses, ORDER BY clauses, and JOIN operations in SELECT queries.
Association: B+ Tree, LSM Tree (as index structures), query optimizer, covering index.
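The idea of an index as a sorted structure of (column value, row location) pairs can be sketched with a sorted list and binary search. The table contents and function names are hypothetical, and a sorted list stands in for the B+ Tree that a real database would use.

```python
import bisect

# Hypothetical table: a row's position in the list plays the role of its physical location.
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bo", "age": 24},
    {"id": 3, "name": "Cy", "age": 31},
    {"id": 4, "name": "Di", "age": 45},
]


def build_index(table, column):
    """Sorted (value, row position) pairs, analogous to a secondary index."""
    return sorted((row[column], position) for position, row in enumerate(table))


def range_lookup(table, index, low, high):
    """Rows with low <= value <= high, found by binary search instead of a full scan."""
    start = bisect.bisect_left(index, (low, -1))
    end = bisect.bisect_right(index, (high, len(table)))
    return [table[position] for _, position in index[start:end]]


age_index = build_index(rows, "age")
thirties = range_lookup(rows, age_index, 30, 39)   # like WHERE age BETWEEN 30 AND 39
```

Because the index is already sorted by age, the same structure can also serve an ORDER BY age without an extra sort step.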
Master-Slave Replication
Definition: A database high availability and read scaling technique where data changes from one database instance (Master) are copied to one or more other instances (Slave).
Scenarios: Database read-write splitting, data backup, failover.
Association: Read-write splitting, replication lag, asynchronous/semi-synchronous/synchronous replication, MySQL Group Replication (MGR).
Time Series Database (TSDB)
Definition: A database optimized for storing, retrieving, and analyzing time-stamped (time-series) data.
Scenarios: Monitoring metrics, IoT sensor data, financial transaction data, real-time application performance data.
Association: InfluxDB, TDengine, Prometheus TSDB, high cardinality issues, data compression.
Vector Database
Definition: A database specifically designed for storing, managing, and efficiently retrieving high-dimensional vector data, typically used for similarity searches.
Scenarios: Embedding-based recommendation systems, image/video retrieval, semantic search in natural language processing.
Association: Embedding, Approximate Nearest Neighbor (ANN) search algorithms (e.g., HNSW, IVF_FLAT), Faiss, Milvus, Pinecone.
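The similarity search at the core of this entry can be sketched as an exact brute-force scan using cosine similarity. The three-dimensional vectors and their labels are made up for illustration; ANN indexes such as HNSW exist precisely to avoid this linear scan on large collections, at the cost of approximate results.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """Exact nearest neighbors: score every stored vector, keep the k best."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings; real ones typically have hundreds of dimensions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], embeddings, k=2)
```

The scan costs time proportional to the number of stored vectors per query, which is the cost that approximate indexes reduce.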
Graph Database
Definition: A database that uses graph structures (nodes, edges, properties) to store and query data, particularly adept at handling complex relationship networks.
Scenarios: Social network analysis, fraud detection, knowledge graphs, recommendation systems.
Association: Neo4j, JanusGraph, Cypher, Gremlin, relationship graphs.

9. Additional AI & MLOps

Feature Store
Definition: A centralized system for managing, storing, discovering, and serving machine learning features, ensuring feature consistency between training and inference.
Scenarios: Feature engineering management in MLOps workflows, solving Training-Serving Skew, promoting feature reuse.
Association: Feast, Tecton, MLOps, Feature Engineering, real-time/offline features.
Online Learning
Definition: A machine learning paradigm where the model continuously updates and learns from real-time incoming data, rather than relying on periodic offline batch training.
Scenarios: Situations requiring models to quickly adapt to user behavior or environmental changes, such as real-time recommendations, online advertising, fraud detection.
Association: FTRL (Follow The Regularized Leader), incremental learning, real-time features.
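The contrast with batch training can be sketched with a model that updates after every incoming sample. FTRL itself is not implemented here; plain per-sample stochastic gradient descent on logistic regression is used as a simpler stand-in, and the class name, learning rate, and toy click stream are all assumptions.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one sample at a time (plain SGD, not FTRL)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        # Gradient of the log loss for one sample is (prediction - label) * features.
        error = self.predict_proba(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=2)
before = model.predict_proba([1.0, 0.0])   # untrained model: no preference yet

# Hypothetical stream: feature 0 signals a click, feature 1 signals no click.
for _ in range(200):
    model.update([1.0, 0.0], 1)
    model.update([0.0, 1.0], 0)

after_click = model.predict_proba([1.0, 0.0])
after_no_click = model.predict_proba([0.0, 1.0])
```

No sample is stored after it is processed, so the model can keep adapting to a stream without periodic retraining over a growing dataset.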
Cold Start
Definition: In recommendation systems and other machine learning applications, the difficulty of making effective recommendations or predictions for new users, new items, or newly deployed models because historical data is lacking.
Scenarios: New user registration, new item listing, initial model deployment.
Association: Recommendation systems, content features, user profiles, Exploration vs. Exploitation (E&E).

Appendix Usage Guide

Search Function: Quickly locate terms by technical area (e.g., "Distributed", "Database") or specific noun (e.g., "Paxos").
Associated Learning: Jump to related terms via the "Association" field to build a knowledge network.
Scenario Mapping: Understand the practical basis for technology selection using the "Scenarios" field.