Case 5: Security System
Case 5: Security and Risk Control System Architecture Evolution Case Study
Goal: In the online world, offense and defense are ever-present. This course simulates the step-by-step construction and upgrading of a typical online business risk control system, giving you an in-depth understanding of core security architecture capabilities: real-time rule engines, user behavior analysis, the application of machine learning models to risk control, and how to dynamically counter evolving fraud.
Phase 0: Stone Age (Simple Rule Lists)
System Description
- The earliest risk control might be simple "blacklists/whitelists":
- When a user logs in or registers, check if the IP address is on the "blacklist".
- When a user places an order and pays, verify if the phone number is a known "abnormal number".
- Tech Stack: Very basic:
- Database: A few tables in MySQL, e.g., `blacklist_ips`, `abnormal_phones`.
- Backend Logic: Hardcode the judgment logic in business code, e.g., `if ip in blacklist_ips: reject()`.
Current Architecture Diagram
[User Request (Login/Payment)] → [Application Server (Hardcoded Rule Check)] → [Query MySQL Blacklist Table] → [Allow/Deny]
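The Phase 0 logic can be sketched in a few lines. This is a minimal illustration assuming the blacklist tables have already been loaded from MySQL into in-memory sets; the function names and sample values are made up for the example:

```python
# Phase 0 sketch: hardcoded blacklist checks.
# Assumes blacklist_ips / abnormal_phones were loaded from the MySQL tables.
blacklist_ips = {"203.0.113.7", "198.51.100.23"}
abnormal_phones = {"13800000000"}

def check_login(ip: str) -> str:
    # Reject any request coming from a blacklisted IP.
    return "deny" if ip in blacklist_ips else "allow"

def check_payment(phone: str) -> str:
    # Reject payments from known abnormal phone numbers.
    return "deny" if phone in abnormal_phones else "allow"
```

Every new rule means editing this code and redeploying, which is exactly the pain point Phase 1 addresses.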
Phase 1: Goodbye Hardcoding → Introduce Dynamic Rule Engine
Challenge Emerges
- Business operations find that the lists of fraudulent IPs change massively every day; manually changing code and redeploying is far too slow.
- Need more flexibility to define and manage risk control rules, e.g., "Identify attempts where the same IP address fails login more than 20 times within 5 minutes" (typical brute-force characteristic).
- Hope business personnel can also participate in rule configuration.
❓ Architect's Thinking Moment: How to make rules effective dynamically and handle more complex logic?
(Hardcoding is out. Any mature rule engine solutions in the industry? How to handle time-window-based event patterns?)
✅ Evolution Direction: Introduce Rule Engine + Complex Event Processing (CEP)
- Dynamic Rule Management & Loading:
- Separate risk control rules from code, store them in a dedicated place (like a database, config center, or cache like Redis for better read performance).
- The application or a dedicated rule engine service periodically (e.g., every 10 seconds) pulls the latest rule definitions.
- Introduce Rule Engine Execution Logic:
- Use mature rule engines (like Drools or AviatorScript) to execute the dynamically loaded rules, replacing hardcoded `if-else` blocks.
- Handle Complex Event Patterns (CEP):
- For rules requiring judgment based on time windows and event sequences (like "20 login failures in 5 minutes"), introduce Complex Event Processing (CEP) capabilities.
- Use libraries like Flink CEP or dedicated stream processing platforms (like Apache Siddhi). Receive user behavior event streams in real-time, match predefined patterns, and trigger corresponding risk control actions.
- Risk Decision as a Service:
- Encapsulate the risk control judgment logic into an independent Risk Engine service. Business systems call this service via RPC or HTTP API to get risk decisions (e.g., "Pass," "Reject," "Needs secondary verification").
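The core idea of "rules as data" can be sketched as follows. The rule schema (field/op/value/action) and the in-memory list standing in for the Redis rule store are illustrative assumptions, not a Drools or Aviator API:

```python
# Sketch of a dynamic rule engine: rules live in a store (here a list
# standing in for Redis) and are reloaded periodically, not hardcoded.
RULE_STORE = [
    {"field": "ip_fail_count_5m", "op": "gt", "value": 20, "action": "reject"},
    {"field": "order_amount", "op": "gt", "value": 10000, "action": "verify"},
]

# Operators the rule DSL supports.
OPS = {"gt": lambda a, b: a > b, "eq": lambda a, b: a == b}

def decide(event: dict, rules=None) -> str:
    """Return 'reject'/'verify' on the first matching rule, else 'pass'."""
    for rule in rules if rules is not None else RULE_STORE:
        value = event.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            return rule["action"]
    return "pass"
```

Adding a new rule is now a data change (write to the rule store), not a code deploy, which is what lets business personnel participate in rule configuration.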
Architecture Adjustment:
graph TD
subgraph "Business Application"
App -- Risk Request --> RiskEngine;
end
subgraph "Risk Service (Risk Engine)"
RiskEngine -- Rule Execution --> RE(Drools/Aviator);
RiskEngine -- Complex Pattern Detection --> CEP("Flink CEP / Siddhi");
RE -- Load Rules --> Cache("Redis Rule Store");
CEP -- Real-time Event Stream --> Kafka("User Behavior Events");
end
subgraph "Rule Management"
AdminUI --> DB(Rule Database) --> Sync --> Cache;
end
RiskEngine -- Risk Decision --> App;
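The time-window rule above ("20 login failures in 5 minutes") is a per-key sliding-window count. In production this is what Flink CEP or Siddhi computes over the Kafka event stream; the in-memory version below only illustrates the logic:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # 5-minute window
THRESHOLD = 20         # failures tolerated within the window

_failures = defaultdict(deque)  # ip -> timestamps of recent failures

def on_login_failure(ip: str, ts: float) -> bool:
    """Record one failure; return True if the IP trips the brute-force rule."""
    window = _failures[ip]
    window.append(ts)
    # Evict events that fell out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > THRESHOLD
```

A real CEP engine adds what this sketch lacks: distributed state, event-time handling with watermarks, and pattern sequences beyond simple counts.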
Phase 2: Rules Aren't Enough → Introduce Behavior Analysis & User Profiling
New Challenge: Fraudsters Bypass Simple Rules
- Fraudsters start using proxy IP pools, device farms, etc., bypassing simple rules based on single dimensions (like IP, phone number).
- Need deeper analysis of user behavior patterns to identify anomalies. For example:
- A newly registered account with no browsing activity attempts a large order directly.
- Multiple seemingly unrelated accounts log in using the same device fingerprint information.
❓ Architect's Thinking Moment: How to discover risks from user behavior sequences and correlations?
(Looking at single points isn't enough. How to build user profiles? How to mine potential connections between devices, IPs, and accounts?)
✅ Evolution Direction: Build Real-time User Profiles + Association Graph Analysis
- Collect User Behavior Data in Real-time:
- Use tracking points (instrumentation) to send various user behaviors (login, browse, add to cart, order, payment, device ID used, login IP, etc.) to a Kafka message queue in real-time.
- Build Real-time/Near Real-time User Profiles:
- Use stream processing engines (like Spark Streaming or Flink) to consume behavior data from Kafka.
- Calculate users' short-term behavioral features, forming user profile tags, stored in high-speed KV storage (like Redis or HBase). Examples: "Recently used login cities," "Browsed product category distribution in the last hour," "Is this the first time using this device?" etc.
- Utilize Graph Database for Association Analysis:
- Store entities like user accounts, device info, IP addresses, shipping addresses, and their relationships in a Graph Database (like Neo4j or JanusGraph).
- Graph queries can efficiently surface hidden association risks, e.g., detecting that one device ID is linked to numerous different user accounts (a possible device farm), or that one shipping address is shared by multiple risky accounts (a possible order-brushing ring).
- Comprehensive Risk Scoring:
- The risk engine no longer makes simple "Reject/Pass" decisions but calculates a comprehensive risk score (e.g., 0-100) based on multiple dimensions: rule hits, user profile tags, graph analysis results, etc.
- Adopt different handling strategies based on risk score ranges: low score passes, medium score requires secondary verification (like SMS code), high score directly rejects.
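The scoring-and-tiering idea can be sketched as a weighted sum over boolean risk signals. The signal names, weights, and thresholds below are illustrative assumptions; real systems tune them from labeled data:

```python
# Illustrative weights for risk signals drawn from rules, profile, and graph.
WEIGHTS = {
    "rule_hits": 30,                # a risk rule matched
    "new_device": 20,               # profile tag: first time on this device
    "shared_device_accounts": 40,   # graph: device linked to many accounts
    "abnormal_city": 10,            # profile tag: login city looks unusual
}

def risk_score(signals: dict) -> int:
    """Weighted sum of the signals that fired, capped at 100."""
    score = sum(w for name, w in WEIGHTS.items() if signals.get(name))
    return min(score, 100)

def action(score: int) -> str:
    """Map the 0-100 score onto the tiered handling strategy."""
    if score >= 70:
        return "reject"
    if score >= 40:
        return "verify"   # e.g., SMS secondary verification
    return "pass"
```

The key design shift from Phase 1 is that no single signal decides the outcome; the decision comes from the combined score.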
Architecture Adjustment (Data & Analysis Layer):
graph TD
subgraph "Data Collection"
Behavior("User Behavior Tracking") --> Kafka("Kafka Cluster");
end
subgraph "Real-time Computation"
Kafka --> Spark("Spark Streaming / Flink");
Spark -- Update Profile --> Profile("Redis/HBase User Profile Store");
Spark -- Update Graph --> GraphDB("Neo4j/JanusGraph Association Graph");
end
subgraph "Risk Decision"
RiskEngine -- Query Profile --> Profile;
RiskEngine -- Query Graph --> GraphDB;
RiskEngine -- Rules/Model --> Decision("Output Risk Score");
end
Phase 3: Rules + Profiles Still Miss Things → Deploy Machine Learning Models
Challenge Escalates: Advanced Bots & Camouflaged Behavior
- Fraudsters use smarter techniques, mimicking normal user behavior patterns, making judgments based on fixed rules and simple profiles increasingly difficult. False positive and false negative rates start to rise.
- Need stronger pattern recognition capabilities to distinguish real users from advanced bots.
❓ Architect's Thinking Moment: How to leverage AI power to improve risk control accuracy?
(Traditional methods hitting limits. How to use ML models in risk control? What preparation is needed? How to deploy online?)
✅ Evolution Direction: Introduce Machine Learning Models for Real-time Risk Prediction
- Feature Engineering:
- This is key to ML success. Need to extract features with discriminative power for risk identification from raw behavior data, user profiles, and graph information.
- Features can include: statistical features (e.g., "number of failed payments in last 24 hours"), time-series features (e.g., "time interval since last login"), interaction features (e.g., "IP address risk score * device risk score").
- Feature data needs preparation for model training and online prediction, often involving building a Feature Store.
- Model Training & Selection:
- Collect historical data with labels (normal/fraudulent) to train supervised learning models.
- Common risk control models include: Logistic Regression (LR), Gradient Boosting Trees (GBDT, XGBoost, LightGBM), Neural Networks (DNN), etc.
- Models need regular retraining (e.g., daily or weekly) or incremental updates with the latest data.
- Model Serving:
- Deploy the trained model as an online prediction service.
- Use specialized model serving frameworks like TensorFlow Serving, Triton Inference Server, KFServing, which provide low-latency, high-concurrency prediction interfaces (usually gRPC or REST API).
- The risk engine, when making decisions, calls the model service, passes in real-time computed features, and gets the predicted risk probability from the model.
- Combining Models and Rules:
- In practice, you rarely rely on the model alone. A common pattern is model score + rule engine: the model outputs a risk probability, and the rule engine makes the final decision based on that probability combined with other business rules.
- A/B Testing & Monitoring:
- Before launching a new model, conduct A/B testing to compare its effectiveness (e.g., accuracy, recall, false positive rate) against the existing strategy.
- Online models need continuous monitoring of their performance to prevent model degradation.
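The "model score + rule engine" combination can be sketched as follows. The hand-picked weights stand in for a trained LR/GBDT model served behind TF Serving or Triton; the feature names, thresholds, and veto logic are illustrative assumptions:

```python
import math

# Illustrative weights standing in for a trained logistic regression;
# in production these come from the model training platform.
MODEL_WEIGHTS = {
    "failed_payments_24h": 0.8,        # statistical feature
    "minutes_since_last_login": -0.01, # time-series feature
    "ip_risk": 1.5,                    # interaction/graph-derived feature
}
BIAS = -2.0

def model_probability(features: dict) -> float:
    """Sigmoid over a linear combination of features (LR-style score)."""
    z = BIAS + sum(MODEL_WEIGHTS[k] * features.get(k, 0.0) for k in MODEL_WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def final_decision(features: dict, hard_rule_hit: bool) -> str:
    """Hard rules keep veto power; otherwise threshold the model probability."""
    if hard_rule_hit:
        return "reject"
    p = model_probability(features)
    if p >= 0.9:
        return "reject"
    if p >= 0.5:
        return "verify"
    return "pass"
```

Keeping the rule veto outside the model gives operators an immediate kill switch for confirmed attack patterns, without waiting for retraining.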
Architecture Adjustment (Adding Model Path):
graph TD
subgraph "Feature & Training (Offline/Nearline)"
Data(Historical Data) --> FeatureEng("Feature Engineering Spark/Flink");
FeatureEng --> FeatureStore("Feature Store Feast/HBase");
FeatureStore --> ModelTrain(Model Training Platform);
ModelTrain -- Deploy --> ModelServing;
end
subgraph "Online Prediction"
RiskEngine -- Real-time Features --> FeatureStore;
RiskEngine -- Get Features & Call --> ModelServing("Model Service TF Serving/Triton");
ModelServing -- Return Probability --> RiskEngine;
RiskEngine -- Combine with Rules --> Decision("Final Decision");
end
Phase 4: The Arms Race → Dynamic Countermeasures & Defense Upgrades
Fraudster Evolution: Model Evasion & Adversarial Samples
- Fraudsters also study your risk control strategies, even trying to find weaknesses in your models and construct adversarial samples to bypass detection.
- Example: Slightly modifying request parameters or behavior patterns to mislead the model.
- The defense system cannot be static; it must have dynamic adaptation and countermeasure capabilities.
❓ Architect's Thinking Moment: When fraudsters start targeting your model, how to respond?
(What if the model is bypassed? Any proactive methods? How to identify more hidden camouflage?)
✅ Evolution Direction: Introduce Device Fingerprinting, Honeypots, Online Learning, etc., for Dynamic Defense
- Enhance Device Identification: Device Fingerprinting:
- Beyond simple device IDs, collect harder-to-forge device environment information to generate a device fingerprint.
- Techniques include using browser APIs to get Canvas fingerprint, WebGL fingerprint, font list, screen resolution, plugin info, etc., combined to generate a relatively unique device identifier.
- Server-side validation of device fingerprint consistency to identify emulators, proxies, etc.
- Active Trapping: Honeypot:
- Set up "traps" in pages or APIs that are invisible to normal users but easily scanned by automated scripts (like hidden form fields, fake API interfaces).
- Once a request accesses these honeypots, it can be determined with high probability to be a bot or malicious scan, and immediately blocked/blacklisted.
- Continuous Model Evolution: Online Learning:
- For rapidly changing attack patterns, relying on offline batch-trained model updates might not be timely enough.
- Introduce online learning mechanisms, allowing the model to continuously fine-tune its parameters based on real-time feedback (e.g., which requests were blocked, which were false positives), adapting to new risks faster.
- Upgrade Human Verification (CAPTCHA) Challenges:
- Introduce more advanced human verification, such as Google's reCAPTCHA v3 (invisible verification) or behavior-based slider challenges, raising the cost for automated scripts to imitate real users.
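The fingerprint idea reduces to hashing a stable combination of environment attributes into one identifier. The attribute names below are illustrative; real SDKs collect far more signals (Canvas/WebGL hashes, font lists, plugins) and add anti-tamper and consistency checks on the server side:

```python
import hashlib
import json

def device_fingerprint(env: dict) -> str:
    """Combine a stable subset of device/browser attributes into one ID.

    Serializing with sorted keys makes the hash deterministic regardless
    of the order in which the client reported the attributes.
    """
    keys = ["canvas_hash", "webgl_hash", "fonts", "screen", "timezone"]
    stable = {k: env.get(k) for k in keys}
    payload = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

The server can then flag mismatches, e.g., the same fingerprint reported by many accounts (device farm), or a fingerprint whose attributes change implausibly between requests (emulator or proxy).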
Architecture Adjustment (Adding Frontend/Proactive Defense Layer):
graph TD
subgraph "Client/Frontend"
Client -- Collect Env Info --> Fingerprint("Device Fingerprint JS SDK");
Client -- Interact --> Captcha("Human-Machine Verification");
end
subgraph "Server-side"
Fingerprint -- Report Fingerprint --> RiskEngine;
Captcha -- Verification Result --> RiskEngine;
RiskEngine --> Honeypot("Honeypot Detection");
Honeypot -- Hit --> Blacklist("Real-time Blacklist");
RiskEngine -- Online Feedback --> OnlineLearning("Online Learning Module");
OnlineLearning -- Update --> ModelServing;
end
Phase 5: Fighting Alone Isn't Enough → Towards Intelligence Sharing & Federated Ecosystem
Limitations of Data Silos
- A single company's risk control data and perspective are limited. Fraudsters often operate across platforms and industries.
- How to utilize external intelligence? How to collaborate with other institutions to jointly enhance risk control capabilities while protecting user privacy?
❓ Architect's Thinking Moment: How to break down data barriers and achieve joint defense?
(Can blacklists be shared directly? How to ensure privacy? Are there more advanced collaboration methods?)
✅ Evolution Direction: Access Threat Intelligence + Explore Federated Learning
- Access External Threat Intelligence:
- Subscribe to or purchase professional threat intelligence services (like Alibaba Cloud Risk Identification, Tencent TianYu, third-party security vendor intelligence feeds).
- Obtain the latest malicious IP address lists, risk device lists, known attack patterns, etc., in real-time to supplement internal risk control data.
- Explore Privacy-Preserving Joint Risk Control: Federated Learning:
- An emerging distributed machine learning technique.
- Allows multiple participants (like different banks, e-commerce platforms) to jointly train machine learning models without sharing raw data.
- Each party trains the model locally with its own data, only uploading encrypted model parameter updates (gradients or weights) to a coordinator for aggregation, resulting in a global model that incorporates data from all parties and performs better.
- Requires specialized federated learning frameworks (like Google's TensorFlow Federated, WeBank's FATE framework).
- (Optional) Blockchain for Evidence Recording:
- For critical malicious behaviors requiring evidence (like confirmed fraudulent transactions), related evidence (after anonymization) can be recorded using blockchain technology to ensure immutability, facilitating subsequent legal processes.
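One federated round can be sketched as federated averaging (FedAvg): each party takes a local step on its own data and uploads only the resulting parameters, which the coordinator averages. Real frameworks (FATE, TensorFlow Federated) also encrypt the updates and weight parties by data volume; this toy sketch omits both:

```python
def local_update(global_weights, local_gradient, lr=0.1):
    """One party's local gradient step; raw data never leaves the party."""
    return [w - lr * g for w, g in zip(global_weights, local_gradient)]

def aggregate(updates):
    """Coordinator averages the parties' uploaded weight vectors."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

# One round with two parties, each holding private data (here represented
# only by the gradients they computed locally).
global_w = [0.0, 0.0]
party_a = local_update(global_w, [1.0, -1.0])
party_b = local_update(global_w, [3.0, 1.0])
global_w = aggregate([party_a, party_b])
```

The resulting global model reflects both parties' data, yet neither party ever saw the other's raw records, which is the privacy property that makes cross-institution risk control feasible.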
Architecture Adjustment (Ecosystem Collaboration Layer):
graph TD
subgraph "Internal Risk Control System"
RiskEngine;
end
subgraph "External Collaboration"
RiskEngine <-- Real-time Sync --> TI("Threat Intelligence Platform");
ModelTrain <-- Parameter Aggregation --> FL("Federated Learning Coordinator Node");
OrgA_FL("Org A Federated Node") --> FL;
OrgB_FL("Org B Federated Node") --> FL;
RiskEngine -- "(Optional) Record Evidence" --> BC("Blockchain Network");
end
Summary: The Offense-Defense Evolution Path of Risk Control Systems
| Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
|---|---|---|---|
| 0. Rules | Static/known risks | Hardcoded black/whitelists | MySQL / hardcode |
| 1. Dynamic Rules | Rule timeliness/complexity | Dynamic rule engine + CEP | Drools / Flink CEP, Redis rule store |
| 2. Behavior | Bypassing single dimensions | Real-time profile + graph analysis | Spark/Flink Streaming, Redis/HBase, Neo4j/JanusGraph |
| 3. ML Model | Advanced bots/camouflage | ML risk prediction + feature eng. | LR/GBDT/DNN, TF Serving/Triton, Feature Store |
| 4. Dynamic Defense | Model evasion/adversarial samples | Device fingerprint + honeypot + online learning | Canvas/WebGL fingerprint, honeypot, online learning |
| 5. Ecosystem | Data silos/privacy | Threat intel sharing + federated learning | Threat intelligence API, FATE / federated learning |