Case 5: Security System
Case 5: Security and Risk Control System Architecture Evolution Case Study
Goal: In the online world, offense and defense are ever-present. This course simulates the step-by-step construction and upgrading of a typical online business risk control system, giving you an in-depth understanding of core security architecture capabilities: real-time rule engines, user behavior analysis, the application of machine learning models to risk control, and how to dynamically counter evolving fraud.
Phase 0: Stone Age (Simple Rule Lists)
System Description
- The earliest risk control might be simple "blacklists/whitelists":
- When a user logs in or registers, check if the IP address is on the "blacklist".
- When a user places an order and pays, verify if the phone number is a known "abnormal number".
- Tech Stack: Very basic:
- Database: A few tables in MySQL, e.g., `blacklist_ips`, `abnormal_phones`.
- Backend Logic: Hardcode the judgment logic in business code, e.g., `if ip in blacklist_ips: reject()`.
Current Architecture Diagram
[User Request (Login/Payment)] → [Application Server (Hardcoded Rule Check)] → [Query MySQL Blacklist Table] → [Allow/Deny]
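The Phase 0 logic can be sketched in a few lines. This is a minimal illustration assuming the blacklist tables have already been loaded from MySQL into in-memory sets; the function names and sample values are made up for the example:

```python
# Phase 0 sketch: hardcoded blacklist checks.
# Assumes blacklist_ips / abnormal_phones were loaded from the MySQL tables.
blacklist_ips = {"203.0.113.7", "198.51.100.23"}
abnormal_phones = {"13800000000"}

def check_login(ip: str) -> str:
    # Reject any request coming from a blacklisted IP.
    return "deny" if ip in blacklist_ips else "allow"

def check_payment(phone: str) -> str:
    # Reject payments from known abnormal phone numbers.
    return "deny" if phone in abnormal_phones else "allow"
```

Every new rule means editing this code and redeploying, which is exactly the pain point Phase 1 addresses.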
Phase 1: Goodbye Hardcoding → Introduce Dynamic Rule Engine
Challenge Emerges
- Business operations find that the lists of fraudulent IPs change massively every day; manually changing code and redeploying is far too slow.
- Need more flexibility to define and manage risk control rules, e.g., "Identify attempts where the same IP address fails login more than 20 times within 5 minutes" (typical brute-force characteristic).
- Hope business personnel can also participate in rule configuration.
❓ Architect's Thinking Moment: How to make rules effective dynamically and handle more complex logic?
(Hardcoding is out. Any mature rule engine solutions in the industry? How to handle time-window-based event patterns?)
✅ Evolution Direction: Introduce Rule Engine + Complex Event Processing (CEP)
- Dynamic Rule Management & Loading:
- Separate risk control rules from code, store them in a dedicated place (like a database, config center, or cache like Redis for better read performance).
- The application or a dedicated rule engine service periodically (e.g., every 10 seconds) pulls the latest rule definitions.
- Introduce Rule Engine Execution Logic:
- Use mature rule engines (like Drools or AviatorScript) to execute the dynamically loaded rules, replacing hardcoded `if-else` blocks.
- Handle Complex Event Patterns (CEP):
- For rules requiring judgment based on time windows and event sequences (like "20 login failures in 5 minutes"), introduce Complex Event Processing (CEP) capabilities.
- Use libraries like Flink CEP or dedicated stream processing platforms (like Apache Siddhi). Receive user behavior event streams in real-time, match predefined patterns, and trigger corresponding risk control actions.
- Risk Decision as a Service:
- Encapsulate the risk control judgment logic into an independent Risk Engine service. Business systems call this service via RPC or HTTP API to get risk decisions (e.g., "Pass," "Reject," "Needs secondary verification").
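The core idea of "rules as data" can be sketched as follows. The rule schema (field/op/value/action) and the in-memory list standing in for the Redis rule store are illustrative assumptions, not a Drools or Aviator API:

```python
# Sketch of a dynamic rule engine: rules live in a store (here a list
# standing in for Redis) and are reloaded periodically, not hardcoded.
RULE_STORE = [
    {"field": "ip_fail_count_5m", "op": "gt", "value": 20, "action": "reject"},
    {"field": "order_amount", "op": "gt", "value": 10000, "action": "verify"},
]

# Operators the rule DSL supports.
OPS = {"gt": lambda a, b: a > b, "eq": lambda a, b: a == b}

def decide(event: dict, rules=None) -> str:
    """Return 'reject'/'verify' on the first matching rule, else 'pass'."""
    for rule in rules if rules is not None else RULE_STORE:
        value = event.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            return rule["action"]
    return "pass"
```

Adding a new rule is now a data change (write to the rule store), not a code deploy, which is what lets business personnel participate in rule configuration.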
Architecture Adjustment:
graph TD
subgraph "Business Application"
App -- Risk Request --> RiskEngine;
end
subgraph "Risk Service (Risk Engine)"
RiskEngine -- Rule Execution --> RE(Drools/Aviator);
RiskEngine -- Complex Pattern Detection --> CEP("Flink CEP / Siddhi");
RE -- Load Rules --> Cache("Redis Rule Store");
CEP -- Real-time Event Stream --> Kafka("User Behavior Events");
end
subgraph "Rule Management"
AdminUI --> DB(Rule Database) --> Sync --> Cache;
end
RiskEngine -- Risk Decision --> App;
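The time-window rule above ("20 login failures in 5 minutes") is a per-key sliding-window count. In production this is what Flink CEP or Siddhi computes over the Kafka event stream; the in-memory version below only illustrates the logic:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # 5-minute window
THRESHOLD = 20         # failures tolerated within the window

_failures = defaultdict(deque)  # ip -> timestamps of recent failures

def on_login_failure(ip: str, ts: float) -> bool:
    """Record one failure; return True if the IP trips the brute-force rule."""
    window = _failures[ip]
    window.append(ts)
    # Evict events that fell out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > THRESHOLD
```

A real CEP engine adds what this sketch lacks: distributed state, event-time handling with watermarks, and pattern sequences beyond simple counts.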
Phase 2: Rules Aren't Enough → Introduce Behavior Analysis & User Profiling
New Challenge: Fraudsters Bypass Simple Rules
- Fraudsters start using proxy IP pools, device farms, etc., bypassing simple rules based on single dimensions (like IP, phone number).
- Need deeper analysis of user behavior patterns to identify anomalies. For example:
- A newly registered account with no browsing activity attempts a large order directly.
- Multiple seemingly unrelated accounts log in using the same device fingerprint information.
❓ Architect's Thinking Moment: How to discover risks from user behavior sequences and correlations?
(Looking at single points isn't enough. How to build user profiles? How to mine potential connections between devices, IPs, and accounts?)
✅ Evolution Direction: Build Real-time User Profiles + Association Graph Analysis
- Collect User Behavior Data in Real-time:
- Use tracking points (instrumentation) to send various user behaviors (login, browse, add to cart, order, payment, device ID used, login IP, etc.) to a Kafka message queue in real-time.
- Build Real-time/Near Real-time User Profiles:
- Use stream processing engines (like Spark Streaming or Flink) to consume behavior data from Kafka.
- Calculate users' short-term behavioral features, forming user profile tags, stored in high-speed KV storage (like Redis or HBase). Examples: "Recently used login cities," "Browsed product category distribution in the last hour," "Is this the first time using this device?" etc.
- Utilize Graph Database for Association Analysis:
- Store entities like user accounts, device info, IP addresses, shipping addresses, and their relationships in a Graph Database (like Neo4j or JanusGraph).
- Graph queries can efficiently surface hidden association risks, e.g., detecting that one device ID is linked to numerous different user accounts (a possible device farm), or that one shipping address is shared by multiple risky accounts (a possible order-brushing ring).
- Comprehensive Risk Scoring:
- The risk engine no longer makes simple "Reject/Pass" decisions but calculates a comprehensive risk score (e.g., 0-100) based on multiple dimensions: rule hits, user profile tags, graph analysis results, etc.
- Adopt different handling strategies based on risk score ranges: low score passes, medium score requires secondary verification (like SMS code), high score directly rejects.
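The scoring-and-tiering idea can be sketched as a weighted sum over boolean risk signals. The signal names, weights, and thresholds below are illustrative assumptions; real systems tune them from labeled data:

```python
# Illustrative weights for risk signals drawn from rules, profile, and graph.
WEIGHTS = {
    "rule_hits": 30,                # a risk rule matched
    "new_device": 20,               # profile tag: first time on this device
    "shared_device_accounts": 40,   # graph: device linked to many accounts
    "abnormal_city": 10,            # profile tag: login city looks unusual
}

def risk_score(signals: dict) -> int:
    """Weighted sum of the signals that fired, capped at 100."""
    score = sum(w for name, w in WEIGHTS.items() if signals.get(name))
    return min(score, 100)

def action(score: int) -> str:
    """Map the 0-100 score onto the tiered handling strategy."""
    if score >= 70:
        return "reject"
    if score >= 40:
        return "verify"   # e.g., SMS secondary verification
    return "pass"
```

The key design shift from Phase 1 is that no single signal decides the outcome; the decision comes from the combined score.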
Architecture Adjustment (Data & Analysis Layer):
graph TD
subgraph "Data Collection"
Behavior("User Behavior Tracking") --> Kafka("Kafka Cluster");
end
subgraph "Real-time Computation"
Kafka --> Spark("Spark Streaming / Flink");
Spark -- Update Profile --> Profile("Redis/HBase User Profile Store");
Spark -- Update Graph --> GraphDB("Neo4j/JanusGraph Association Graph");
end
subgraph "Risk Decision"
RiskEngine -- Query Profile --> Profile;
RiskEngine -- Query Graph --> GraphDB;
RiskEngine -- Rules/Model --> Decision("Output Risk Score");
end
Phase 3: Rules + Profiles Still Miss Things → Deploy Machine Learning Models
Challenge Escalates: Advanced Bots & Camouflaged Behavior
- Fraudsters use smarter techniques, mimicking normal user behavior patterns, making judgments based on fixed rules and simple profiles increasingly difficult. False positive and false negative rates start to rise.
- Need stronger pattern recognition capabilities to distinguish real users from advanced bots.
❓ Architect's Thinking Moment: How to leverage AI power to improve risk control accuracy?
(Traditional methods hitting limits. How to use ML models in risk control? What preparation is needed? How to deploy online?)
✅ Evolution Direction: Introduce Machine Learning Models for Real-time Risk Prediction
- Feature Engineering:
- This is key to ML success. Need to extract features with discriminative power for risk identification from raw behavior data, user profiles, and graph information.
- Features can include: statistical features (e.g., "number of failed payments in last 24 hours"), time-series features (e.g., "time interval since last login"), interaction features (e.g., "IP address risk score * device risk score").
- Feature data needs preparation for model training and online prediction, often involving building a Feature Store.
- Model Training & Selection:
- Collect historical data with labels (normal/fraudulent) to train supervised learning models.
- Common risk control models include: Logistic Regression (LR), Gradient Boosting Trees (GBDT, XGBoost, LightGBM), Neural Networks (DNN), etc.
- Models need regular retraining (e.g., daily or weekly) or incremental updates with the latest data.
- Model Serving:
- Deploy the trained model as an online prediction service.
- Use specialized model serving frameworks like TensorFlow Serving, Triton Inference Server, KFServing, which provide low-latency, high-concurrency prediction interfaces (usually gRPC or REST API).
- The risk engine, when making decisions, calls the model service, passes in real-time computed features, and gets the predicted risk probability from the model.
- Combining Models and Rules:
- In practice, you rarely rely on the model alone. A common pattern is model score + rule engine: the model outputs a risk probability, and the rule engine makes the final decision based on that probability combined with other business rules.
- A/B Testing & Monitoring:
- Before launching a new model, conduct A/B testing to compare its effectiveness (e.g., accuracy, recall, false positive rate) against the existing strategy.
- Online models need continuous monitoring of their performance to prevent model degradation.
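The "model score + rule engine" combination can be sketched as follows. The hand-picked weights stand in for a trained LR/GBDT model served behind TF Serving or Triton; the feature names, thresholds, and veto logic are illustrative assumptions:

```python
import math

# Illustrative weights standing in for a trained logistic regression;
# in production these come from the model training platform.
MODEL_WEIGHTS = {
    "failed_payments_24h": 0.8,        # statistical feature
    "minutes_since_last_login": -0.01, # time-series feature
    "ip_risk": 1.5,                    # interaction/graph-derived feature
}
BIAS = -2.0

def model_probability(features: dict) -> float:
    """Sigmoid over a linear combination of features (LR-style score)."""
    z = BIAS + sum(MODEL_WEIGHTS[k] * features.get(k, 0.0) for k in MODEL_WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def final_decision(features: dict, hard_rule_hit: bool) -> str:
    """Hard rules keep veto power; otherwise threshold the model probability."""
    if hard_rule_hit:
        return "reject"
    p = model_probability(features)
    if p >= 0.9:
        return "reject"
    if p >= 0.5:
        return "verify"
    return "pass"
```

Keeping the rule veto outside the model gives operators an immediate kill switch for confirmed attack patterns, without waiting for retraining.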
Architecture Adjustment (Adding Model Path):
graph TD
subgraph "Feature & Training (Offline/Nearline)"
Data(Historical Data) --> FeatureEng("Feature Engineering Spark/Flink");
FeatureEng --> FeatureStore("Feature Store Feast/HBase");
FeatureStore --> ModelTrain(Model Training Platform);
ModelTrain -- Deploy --> ModelServing;
end
subgraph "Online Prediction"
RiskEngine -- Real-time Features --> FeatureStore;
RiskEngine -- Get Features & Call --> ModelServing("Model Service TF Serving/Triton");
ModelServing -- Return Probability --> RiskEngine;
RiskEngine -- Combine with Rules --> Decision("Final Decision");
end
Phase 4: The Arms Race → Dynamic Countermeasures & Defense Upgrades
Fraudster Evolution: Model Evasion & Adversarial Samples
- Fraudsters also study your risk control strategies, even trying to find weaknesses in your models and construct adversarial samples to bypass detection.
- Example: Slightly modifying request parameters or behavior patterns to mislead the model.
- The defense system cannot be static; it must have dynamic adaptation and countermeasure capabilities.
❓ Architect's Thinking Moment: When fraudsters start targeting your model, how to respond?
(What if the model is bypassed? Any proactive methods? How to identify more hidden camouflage?)
✅ Evolution Direction: Introduce Device Fingerprinting, Honeypots, Online Learning, etc., for Dynamic Defense
- Enhance Device Identification: Device Fingerprinting:
- Beyond simple device IDs, collect harder-to-forge device environment information to generate a device fingerprint.
- Techniques include using browser APIs to get Canvas fingerprint, WebGL fingerprint, font list, screen resolution, plugin info, etc., combined to generate a relatively unique device identifier.
- Server-side validation of device fingerprint consistency to identify emulators, proxies, etc.
- Active Trapping: Honeypot:
- Set up "traps" in pages or APIs that are invisible to normal users but easily scanned by automated scripts (like hidden form fields, fake API interfaces).
- Once a request accesses these honeypots, it can be determined with high probability to be a bot or malicious scan, and immediately blocked/blacklisted.
- Continuous Model Evolution: Online Learning:
- For rapidly changing attack patterns, relying on offline batch-trained model updates might not be timely enough.
- Introduce online learning mechanisms, allowing the model to continuously fine-tune its parameters based on real-time feedback (e.g., which requests were blocked, which were false positives), adapting to new risks faster.
- Upgrade Human Verification (CAPTCHA) Challenges:
- Introduce more advanced human verification, such as Google's reCAPTCHA v3 (invisible verification) or behavior-based slider challenges, raising the cost for automated scripts to imitate real users.
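The fingerprint idea reduces to hashing a stable combination of environment attributes into one identifier. The attribute names below are illustrative; real SDKs collect far more signals (Canvas/WebGL hashes, font lists, plugins) and add anti-tamper and consistency checks on the server side:

```python
import hashlib
import json

def device_fingerprint(env: dict) -> str:
    """Combine a stable subset of device/browser attributes into one ID.

    Serializing with sorted keys makes the hash deterministic regardless
    of the order in which the client reported the attributes.
    """
    keys = ["canvas_hash", "webgl_hash", "fonts", "screen", "timezone"]
    stable = {k: env.get(k) for k in keys}
    payload = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

The server can then flag mismatches, e.g., the same fingerprint reported by many accounts (device farm), or a fingerprint whose attributes change implausibly between requests (emulator or proxy).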
Architecture Adjustment (Adding Frontend/Proactive Defense Layer):
graph TD
subgraph "Client/Frontend"
Client -- Collect Env Info --> Fingerprint("Device Fingerprint JS SDK");
Client -- Interact --> Captcha("Human-Machine Verification");
end
subgraph "Server-side"
Fingerprint -- Report Fingerprint --> RiskEngine;
Captcha -- Verification Result --> RiskEngine;
RiskEngine --> Honeypot("Honeypot Detection");
Honeypot -- Hit --> Blacklist("Real-time Blacklist");
RiskEngine -- Online Feedback --> OnlineLearning("Online Learning Module");
OnlineLearning -- Update --> ModelServing;
end
Phase 5: Fighting Alone Isn't Enough → Towards Intelligence Sharing & Federated Ecosystem
Limitations of Data Silos
- A single company's risk control data and perspective are limited. Fraudsters often operate across platforms and industries.
- How to utilize external intelligence? How to collaborate with other institutions to jointly enhance risk control capabilities while protecting user privacy?
❓ Architect's Thinking Moment: How to break down data barriers and achieve joint defense?
(Can blacklists be shared directly? How to ensure privacy? Are there more advanced collaboration methods?)
✅ Evolution Direction: Access Threat Intelligence + Explore Federated Learning
- Access External Threat Intelligence:
- Subscribe to or purchase professional threat intelligence services (like Alibaba Cloud Risk Identification, Tencent TianYu, third-party security vendor intelligence feeds).
- Obtain the latest malicious IP address lists, risk device lists, known attack patterns, etc., in real-time to supplement internal risk control data.
- Explore Privacy-Preserving Joint Risk Control: Federated Learning:
- An emerging distributed machine learning technique.
- Allows multiple participants (like different banks, e-commerce platforms) to jointly train machine learning models without sharing raw data.
- Each party trains the model locally with its own data, only uploading encrypted model parameter updates (gradients or weights) to a coordinator for aggregation, resulting in a global model that incorporates data from all parties and performs better.
- Requires specialized federated learning frameworks (like Google's TensorFlow Federated, WeBank's FATE framework).
- (Optional) Blockchain for Evidence Recording:
- For critical malicious behaviors requiring evidence (like confirmed fraudulent transactions), related evidence (after anonymization) can be recorded using blockchain technology to ensure immutability, facilitating subsequent legal processes.
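One federated round can be sketched as federated averaging (FedAvg): each party takes a local step on its own data and uploads only the resulting parameters, which the coordinator averages. Real frameworks (FATE, TensorFlow Federated) also encrypt the updates and weight parties by data volume; this toy sketch omits both:

```python
def local_update(global_weights, local_gradient, lr=0.1):
    """One party's local gradient step; raw data never leaves the party."""
    return [w - lr * g for w, g in zip(global_weights, local_gradient)]

def aggregate(updates):
    """Coordinator averages the parties' uploaded weight vectors."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

# One round with two parties, each holding private data (here represented
# only by the gradients they computed locally).
global_w = [0.0, 0.0]
party_a = local_update(global_w, [1.0, -1.0])
party_b = local_update(global_w, [3.0, 1.0])
global_w = aggregate([party_a, party_b])
```

The resulting global model reflects both parties' data, yet neither party ever saw the other's raw records, which is the privacy property that makes cross-institution risk control feasible.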
Architecture Adjustment (Ecosystem Collaboration Layer):
graph TD
subgraph "Internal Risk Control System"
RiskEngine;
end
subgraph "External Collaboration"
RiskEngine <-- Real-time Sync --> TI("Threat Intelligence Platform");
ModelTrain <-- Parameter Aggregation --> FL("Federated Learning Coordinator Node");
OrgA_FL("Org A Federated Node") --> FL;
OrgB_FL("Org B Federated Node") --> FL;
RiskEngine -- "(Optional) Record Evidence" --> BC("Blockchain Network");
end
Summary: The Offense-Defense Evolution Path of Risk Control Systems
| Phase | Core Challenge | Key Solution | Representative Tech/Pattern |
|---|---|---|---|
| 0. Rules | Static/known risks | Hardcoded black/whitelists | MySQL / hardcode |
| 1. Dynamic Rules | Rule timeliness/complexity | Dynamic rule engine + CEP | Drools / Flink CEP, Redis rule store |
| 2. Behavior | Bypassing single dimensions | Real-time profile + graph analysis | Spark/Flink Streaming, Redis/HBase, Neo4j/JanusGraph |
| 3. ML Model | Advanced bots/camouflage | ML risk prediction + feature eng. | LR/GBDT/DNN, TF Serving/Triton, Feature Store |
| 4. Dynamic Defense | Model evasion/adversarial samples | Device fingerprint + honeypot + online learning | Canvas/WebGL fingerprint, honeypot, online learning |
| 5. Ecosystem | Data silos/privacy | Threat intel sharing + federated learning | Threat intelligence API, FATE / federated learning |