Course 4: Real-time Monitoring System Architecture Evolution Case Study

Goal: Modern distributed systems rely on robust monitoring to stay healthy. This course uses the evolution of a typical monitoring system as an example to help you gradually master core capabilities such as high-throughput log collection, real-time metric calculation, storage of massive volumes of monitoring data, and intelligent alerting, and to understand the key elements of building observability.


Phase 0: Prehistoric Era (Log Files + Shell Scripts)

System Description

  • The earliest monitoring might be this simple:
    • Web servers (like Nginx) diligently generate access logs (access.log).
    • Ops personnel write a Shell script (or Perl/Python), run it daily at midnight via a Cron job, and use tools like awk and grep to analyze the logs and calculate key metrics such as yesterday's PV/UV (a minimal sketch appears after the architecture diagram below).
    • The analysis results? Sent to the relevant staff as an email report.
  • Tech Stack: Quite "primitive":
    • Logs: local server files
    • Analysis: Shell (AWK, grep, sort, uniq...) + Cron

Current Architecture Diagram

[Nginx Server] → [Local access.log File] → [Scheduled Shell Script Analysis] → [Send Email Report]
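
To make the "Scheduled Shell Script Analysis" box concrete, here is a minimal sketch of such a daily report job. The course mentions Shell/Perl/Python; this version is in Python, assumes the default Nginx combined log format (client IP as the first field), and uses an illustrative log path.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Phase-0 daily report script, run once per day from cron.
Assumes the default Nginx combined log format; the log path is illustrative."""

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path, rotated daily

def daily_pv_uv(path: str) -> tuple[int, int]:
    pv = 0
    visitors = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            pv += 1                  # every request counts toward PV
            visitors.add(fields[0])  # crude UV: distinct client IPs
    return pv, len(visitors)

if __name__ == "__main__":
    pv, uv = daily_pv_uv(LOG_PATH)
    # In a real Phase-0 setup this output would be piped into `mail`/sendmail.
    print(f"Yesterday's report: PV={pv}, UV={uv}")
```

A crontab entry along the lines of `0 0 * * * python3 /opt/report.py | mail -s "Daily report" ops@example.com` (paths and addresses illustrative) ties it together. Everything here is local to a single machine, which is exactly the limitation the following phases address.
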
Pain Points at this stage:

  • Too Slow: Yesterday's problems are only discovered today, which is too late! There is no way to detect and respond in real time.
  • Storage Pressure: Log files keep accumulating and quickly fill up the server's disks.
  • Poor Scalability: Once there are many servers, manually managing and analyzing their logs becomes a nightmare.


Phase 1: Goodbye T+1, Embrace Real-time Log Analysis → ELK Stack Appears

Challenge Emerges

  • The business demands real-time visibility into application errors; for example, a sudden increase in HTTP 500 responses needs immediate attention.
  • The growing number of servers calls for a centralized platform to manage and query logs from all of them.
  • Shell scripts struggle with complex log formats and large log volumes.

❓ Architect's Thinking Moment: How to process logs in real-time and centrally?

(How to collect so many logs? Where to store them? How to query quickly? What are the popular industry solutions?)

✅ Evolution Direction: Introduce ELK for Centralized Log Management and Real-time Analysis

The ELK Stack (now often called Elastic Stack) is the classic combination for solving these problems:

  1. Log Collection by Filebeat:
    • Deploy lightweight Filebeat Agents on each server needing log collection.
    • Filebeat monitors specified log files (like access.log, error.log) in real-time and efficiently sends new log entries.
  2. Log Processing & Transformation by Logstash (or handled in Filebeat/Elasticsearch Ingest Node):
    • Logstash (or other processing pipelines) receives logs from Filebeat.
    • It parses log formats (like Nginx access logs, JSON logs), extracts key fields (status_code, request_time, user_agent), and performs data cleaning or enrichment (a parsing sketch follows this list).
  3. Storage & Search by Elasticsearch:
    • Processed structured log data is sent to an Elasticsearch cluster for storage.
    • Built on Lucene, Elasticsearch provides powerful full-text search and aggregation capabilities, enabling near real-time log querying.
  4. Visualization & Exploration by Kibana:
    • Users access Kibana's web interface to easily search logs, create visualization dashboards (like error rate trend charts, Top N access paths), and set up monitoring alerts.
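
To illustrate the field extraction in step 2, the following is a rough Python equivalent of what a Logstash grok filter (or an Elasticsearch ingest pipeline) does for the default Nginx combined log format. It is only a sketch; real pipelines use the built-in grok patterns, and fields like request_time require a custom log format.

```python
import re
from typing import Optional

# Matches the default Nginx "combined" log format.
NGINX_COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Turn one raw access-log line into a structured document for Elasticsearch."""
    m = NGINX_COMBINED.match(line)
    if not m:
        return None  # a real pipeline would route this to a dead-letter index
    doc = m.groupdict()
    doc["status"] = int(doc["status"])
    return doc

sample = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0800] '
          '"GET /index.html HTTP/1.1" 500 2326 "-" "Mozilla/5.0"')
print(parse_line(sample)["status"])  # 500
```
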

Architecture Adjustment:

graph TD
    subgraph "Various Business Servers"
        Nginx --> A(access.log);
        App --> B(app.log);
        A & B --> C("Filebeat Agent");
    end
    subgraph "Log Processing Pipeline"
        C --> D("Logstash / Ingest Node");
    end
    subgraph "Storage & Visualization"
        D --> E("Elasticsearch Cluster");
        E --> F(Kibana);
        U["User/Ops"] --> F;
    end
(Note: In high-traffic scenarios, Kafka is often added as a buffer between Filebeat and Logstash)


Phase 2: Beyond Logs, Need Metrics → Prometheus + Grafana Ecosystem

New Requirement: System Metrics Monitoring

  • Logs help pinpoint issue details but don't fully reflect the system's real-time operational status. We need to monitor key metrics like server CPU, memory, disk I/O, network traffic, as well as JVM GC activity, thread counts, etc.
  • While the ELK Stack is powerful, it's primarily log-oriented (unstructured/semi-structured text). It's not optimal for storing and querying numerical time-series data (like CPU utilization over time) in terms of performance and storage efficiency.

❓ Architect's Thinking Moment: How to efficiently collect, store, and visualize time-series metrics?

(What tools to collect metrics? Where to store most efficiently? How to do alerting?)

✅ Evolution Direction: Introduce Prometheus Ecosystem for Metrics Monitoring

Prometheus has become the de facto standard for metrics monitoring in the cloud-native era:

  1. Prometheus Server Does the Core Work:
    • Prometheus Server uses a Pull model, actively scraping metrics data from configured Targets periodically.
    • Built-in TSDB (Time Series Database) for efficient metric data storage.
    • Provides a powerful query language PromQL for flexible querying and aggregation.
  2. Exporters Expose Metrics:
    • Corresponding Exporter programs need to be deployed on the monitored targets (servers, applications, databases, etc.).
    • For example, Node Exporter exposes host metrics (CPU, memory, disk, network); mysqld_exporter exposes MySQL metrics; applications can also use Client Libraries to expose business metrics directly (see the sketch after this list).
  3. Grafana Provides Professional Visualization:
    • Although Prometheus has a basic UI, it's typically used with Grafana.
    • Grafana supports Prometheus as a data source and can create rich, beautiful, interactive monitoring dashboards.
  4. Alertmanager Handles Alerting:
    • Alerting rules (based on PromQL expressions) are defined in the Prometheus Server.
    • Triggered alerts are sent to the Alertmanager component.
    • Alertmanager handles deduplication, grouping, inhibition, silencing, and finally sends notifications via configured receivers (like email, Slack, PagerDuty).
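
As a sketch of the "Client Libraries" option in item 2, here is a toy application exposing its own metrics with the official Python client (prometheus_client); the metric names and port are illustrative, and Prometheus would be configured to scrape the /metrics endpoint it serves.

```python
"""Minimal sketch of an app exposing custom metrics for Prometheus to pull.
Assumes `pip install prometheus-client`; metric names and port are illustrative."""
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total HTTP requests handled",
                   ["path", "status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request(path: str) -> None:
    with LATENCY.time():                       # observe request duration
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://<host>:8000/metrics
    while True:
        handle_request("/api/demo")
```

On the Prometheus side, a scrape job pointing at port 8000 picks these up, and a PromQL expression such as `rate(app_requests_total[5m])` turns the raw counter into a request rate for dashboards and alerting rules.
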

Architecture Adjustment (Adding Metrics Monitoring Path):

graph TD
    subgraph "Monitored Targets (Servers/Apps)"
        A("Node Exporter") --> B("Expose Host Metrics");
        C("App Exporter/SDK") --> D("Expose App Metrics");
    end
    subgraph "Prometheus Ecosystem"
        E("Prometheus Server") -- Pull --> B;
        E -- Pull --> D;
        E --> F(Store TSDB);
        E -- Evaluate Rules --> G("Alertmanager");
        G --> H("Send Alert Notifications");
        F -- Query --> I("Grafana");
        U["User/Ops"] --> I;
    end

Phase 3: Cluster Size Grows, Can't Keep Up → Distributed Scaling of Monitoring System

New Challenges from Scale

  • As business grows, server count increases from tens to hundreds or thousands.
    • Prometheus Bottleneck: A single Prometheus Server reaches limits in scraping capacity, storage volume, and query performance.
    • Elasticsearch Bottleneck: Log volume surges; the ES cluster faces high write pressure, potentially frequent Full GCs, and slower queries.

❓ Architect's Thinking Moment: How to horizontally scale the monitoring system itself?

(Single Prometheus/ES can't handle it, how to split? How to aggregate data? How to optimize the log collection pipeline?)

✅ Evolution Direction: Federation/Remote Storage + Log Buffering & Tiering

  1. Prometheus Scaling Solutions:
    • Federation: Suitable for hierarchical aggregation. Deploy a local Prometheus in each region/business line to scrape local metrics; deploy a global Prometheus to scrape aggregated key metrics (not raw data) from local instances.
    • Remote Read/Write: The more mainstream approach. Treat Prometheus's own storage as short-term. Use the remote_write interface to send all scraped metric data in real-time to horizontally scalable long-term storage backends supporting the Prometheus remote storage protocol, such as Thanos, Cortex, VictoriaMetrics, M3DB. Queries can then target these long-term backends directly via Grafana or through a global query view provided by components like Thanos Query / VictoriaMetrics.
  2. Log Pipeline Optimization & ES Scaling:
    • Introduce Kafka as a Buffer Layer: Add a Kafka cluster between Filebeat and Logstash/ES. Filebeat writes logs to Kafka, and Logstash consumes from Kafka (a consumer sketch follows this list). Benefits: peak shaving (absorb sudden log bursts), decoupling (collection and processing are separated), and increased reliability (Kafka persistence).
    • Elasticsearch Cluster Scaling: Add more ES nodes, plan shard and replica counts reasonably.
    • Hot/Warm/Cold Data Tiering: For log data, typically only recent data (days/weeks) needs frequent querying (hot data). Store hot data on high-performance nodes (e.g., SSDs), migrate older cold data to lower-cost nodes (e.g., HDDs), or archive it to cheap bulk storage (HDFS or S3-style object storage), reducing ES cluster pressure and cost.
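
The sketch below shows, under stated assumptions, the consumer side of the Kafka buffer layer: a process in the Logstash role reads log events from a topic and bulk-indexes them into Elasticsearch. In practice Logstash's kafka input and elasticsearch output do this job; the topic, index, and host names here are made up, and the `kafka-python` and `elasticsearch` packages are assumed.

```python
import json

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch
from kafka import KafkaConsumer                   # pip install kafka-python

consumer = KafkaConsumer(
    "nginx-access-logs",                 # hypothetical topic that Filebeat writes to
    bootstrap_servers=["kafka1:9092"],
    group_id="log-indexer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
es = Elasticsearch("http://es-hot-1:9200")

def flush(batch: list) -> None:
    # Bulk indexing keeps write pressure on the ES cluster manageable.
    actions = ({"_index": "logs-nginx-hot", "_source": record} for record in batch)
    helpers.bulk(es, actions)
    batch.clear()

batch: list = []
for message in consumer:                 # blocks, consuming at its own pace
    batch.append(message.value)
    if len(batch) >= 500:
        flush(batch)
```

Because Kafka buffers the stream, a burst from the producers (Filebeat) simply increases consumer lag instead of overwhelming Elasticsearch.
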

Architecture Adjustment (Remote Storage & Kafka Example):

graph TD
    subgraph "Log Pipeline"
        FB("Filebeat") --> Kafka("Kafka Cluster");
        Kafka --> LS("Logstash Cluster");
        LS --> ES("Elasticsearch Cluster - Hot/Warm/Cold");
        ES --> Kibana;
    end
    subgraph "Metrics Pipeline"
        Exp("Exporter") --> P("Prometheus Server");
        P -- remote_write --> TS("Thanos/VictoriaMetrics/M3DB - Long Term Storage");
        TS --> Grafana("Global Query View");
        Am("Alertmanager") --> Notify("Notifications");
        P -- Alerts --> Am;
    end

Phase 4: Too Many Noisy Alerts? → Towards Intelligent Alerting & AIOps

The Pain of Alert Storms

  • Alerts based on fixed thresholds (like CPU > 90%) are simple but often generate significant "noise" in practice.
    • Brief CPU spikes above 90% might be normal during business peaks.
    • A single underlying failure can trigger hundreds of related alerts, drowning out critical information.
  • Need smarter ways to detect real anomalies and reduce false positives.

❓ Architect's Thinking Moment: How to make alerting more precise and intelligent?

(Fixed thresholds are bad, dynamic thresholds? What can machine learning do? How to correlate alerts?)

✅ Evolution Direction: Dynamic Baselines + Anomaly Detection Algorithms + Alert Aggregation/Correlation

  1. Dynamic Baseline Alerting:
    • Instead of fixed thresholds, dynamically calculate the normal baseline range based on historical metric data (e.g., same time last week).
    • Trigger alerts only when metrics significantly deviate from this dynamic baseline. Prometheus functions like holt_winters offer basic capabilities; more complex implementations can query long-term storage and use scripts or dedicated platforms (a simple sketch follows this list).
  2. Introduce AI Anomaly Detection:
    • Use machine learning algorithms (like Isolation Forest, time-series decomposition, LSTM) to automatically learn the normal patterns of metrics and detect anomalies that simple rules can't describe (e.g., a sudden 50% drop in traffic, or increased response-time jitter).
    • Can integrate open-source libraries (like PyOD, Prophet) or use commercial AIOps platforms.
  3. Alert Aggregation & Correlation Analysis:
    • Utilize Alertmanager's Grouping feature to aggregate alerts of the same type from the same source.
    • Introduce more advanced event correlation platforms. Based on system topology, trace information, etc., correlate multiple alerts caused by an underlying failure, identify the root cause, and send only root cause alerts, significantly reducing alert volume.
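
As a simple sketch of the dynamic-baseline idea in item 1 (deliberately simpler than the ML approaches in item 2), the check below compares the current value of a metric against the mean and standard deviation of historical samples from comparable time windows, e.g. fetched from long-term storage via the Prometheus HTTP API. The threshold k and the sample data are illustrative.

```python
"""Minimal sketch of dynamic-baseline alerting: flag a metric only when it
deviates strongly from samples taken at the same time of day in past weeks."""
from statistics import mean, stdev

def is_anomalous(current: float, history: list[float], k: float = 3.0) -> bool:
    """Return True if `current` is more than k standard deviations away from
    the historical baseline. `history` holds samples from comparable windows
    (e.g. the same hour on previous days/weeks)."""
    if len(history) < 2:
        return False                      # not enough data to form a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > k

# Example: QPS at 14:00 on the last four Mondays vs. right now.
history_qps = [1180.0, 1240.0, 1205.0, 1215.0]
print(is_anomalous(560.0, history_qps))   # True: traffic roughly halved
```
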

Architecture Adjustment (Alert Processing Part):

graph TD
    P("Prometheus/Metric Source") --> AD{"Dynamic Baseline/AI Anomaly Detection"};
    AD -- Anomaly Determined --> AM("Alertmanager");
    AM --> Agg{"Alert Aggregation/Correlation Platform"};
    Agg --> Notify("Precise Alert Notification");

Phase 5: From Monitoring to Observability → Introducing Distributed Tracing

The Dilemma of Cross-Service Issues

  • In a microservices architecture, a single user request might span multiple services. When performance issues arise (like a particularly slow request), it's hard to quickly pinpoint which step in which service is the problem.
  • Logs and metrics only show the performance of individual services, failing to connect the entire request journey.

❓ Architect's Thinking Moment: How to trace a request's complete journey end-to-end?

(Which service is slow? Slow where? Database query? Network call?)

✅ Evolution Direction: Embrace OpenTelemetry, Build Distributed Tracing System

Distributed Tracing is the third pillar of observability:

  1. Instrumentation:
    • Need to integrate OpenTelemetry SDK (or other compatible tracing libraries like SkyWalking Agent) into application code.
    • The SDK automatically (or manually) generates and propagates Trace IDs and Span IDs for requests, recording the duration and metadata of key operations (like receiving requests, making RPC calls, DB queries) to form a call chain (see the sketch after this list).
  2. Data Collection & Transmission:
    • The SDK sends generated Span data to an OpenTelemetry Collector (or directly to the tracing backend).
    • The Collector can process Span data (e.g., sampling, adding attributes) and export it to backend storage.
  3. Backend Storage & Visualization:
    • Use open-source tracing systems like Jaeger or Zipkin to store Span data and visualize the complete call chain graph, service dependencies, and pinpoint performance bottlenecks via their UI.
  4. Linking with Metrics/Logs is Key:
    • Best Practice: Ensure logs and metrics also include the Trace ID. This allows clicking on a slow request's Trace in Grafana or Kibana to jump directly to its related logs, or filtering traces related to an abnormal service metric.
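
For item 1, here is a minimal manual-instrumentation sketch with the OpenTelemetry Python SDK, exporting spans over OTLP to a local Collector. The service name, endpoint, and span names are illustrative, and the opentelemetry-sdk and opentelemetry-exporter-otlp packages are assumed; in practice much of this comes from auto-instrumentation.

```python
"""Minimal sketch of manual tracing with the OpenTelemetry Python SDK."""
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_order(order_id: str) -> None:
    # Each `with` block becomes a span; nested blocks form the call chain.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("query_db"):
            pass  # stand-in for the real DB call
        with tracer.start_as_current_span("call_payment_rpc"):
            pass  # stand-in for the downstream RPC

handle_order("A-1001")
```

Propagating the generated Trace ID into log lines (e.g. as a trace_id field) is what enables the metric-to-trace-to-log jumps described in item 4.
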

Architecture Adjustment (Full Observability View):

graph TD
    subgraph "Application Layer (Instrumentation)"
        App1 -- Trace/Metrics/Logs --> OTelSDK1("OTel SDK");
        App2 -- Trace/Metrics/Logs --> OTelSDK2("OTel SDK");
    end
    subgraph "Data Collection Layer"
        OTelSDK1 & OTelSDK2 --> Collector("OpenTelemetry Collector");
    end
    subgraph "Backend Storage & Analysis"
        Collector -- Traces --> Jaeger("Jaeger/Zipkin");
        Collector -- Metrics --> Prom("Prometheus/VictoriaMetrics");
        Collector -- Logs --> Loki("Loki/Elasticsearch");
        Jaeger --> Grafana("Grafana");
        Prom --> Grafana;
        Loki --> Grafana;
        User["User"] --> Grafana;
    end
(This is an idealized integrated view; actual deployment might be more complex)


Summary: Monitoring System Evolution: From Seeing to Understanding

| Phase | Core Goal | Key Solution | Representative Tech/Pattern |
| --- | --- | --- | --- |
| 0. Primitive | T+1 reporting | Local logs + scheduled script | Cron, AWK, grep |
| 1. Real-time Logs | Central management & search | Log collection + central store/search | Filebeat, Logstash, Elasticsearch, Kibana (ELK) |
| 2. Metrics Monitoring | System/app status | Metric collection + TSDB + viz/alerting | Exporter, Prometheus, Grafana, Alertmanager |
| 3. Scale Out | Horizontal scaling / HA | Remote storage/federation + Kafka buffer + tiering | Thanos/VictoriaMetrics, Kafka, ES tiering |
| 4. Smart Alerting | Reduce noise, precise alerts | Dynamic baseline + anomaly detection + correlation | AI anomaly detection, Alertmanager group/inhibit |
| 5. Observability | End-to-end performance diagnosis | Distributed tracing + linking the three pillars | OpenTelemetry, Jaeger/Zipkin, Trace ID linkage |