Overall architecture design of the data middle platform
Layered architecture of the data middle platform
Data acquisition layer
- Data source types: business systems (ERP, CRM), logs, IoT devices, third-party APIs, etc.
- Collection methods:
  - Real-time collection: Kafka, Flink CDC (change data capture).
  - Offline collection: Sqoop, DataX (batch database synchronization).
  - Log collection: Flume, Filebeat.
- Data buffering and preprocessing: use a message queue (such as Kafka) as a buffer to absorb data traffic peaks.
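A minimal sketch of the buffering idea using the kafka-python client; the broker address and the topic name `raw-events` are illustrative assumptions, not part of any particular platform:

```python
# Publish a collected record to a Kafka topic that acts as a buffer between
# acquisition and downstream processing. Broker and topic are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One event collected from a business system; downstream consumers
# (e.g. Flink or Spark jobs) drain the topic at their own pace.
event = {"order_id": "20240101-0001", "amount": 99.5, "source": "erp"}
producer.send("raw-events", value=event)
producer.flush()
```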
Data storage layer
- Data lake: stores raw data (structured, semi-structured, unstructured) with low cost and high scalability (HDFS, Alibaba Cloud OSS).
- Data warehouse: stores cleaned, structured data modeled around subject areas (Hive, ClickHouse, Snowflake).
- Converged architecture (Lakehouse): combines the advantages of data lakes and data warehouses, e.g. Delta Lake and Apache Iceberg.
- Tiered storage strategy:
  - Hot data (high-frequency access): stored in distributed databases (such as HBase, Cassandra).
  - Warm data (low-frequency access): stored in the data warehouse.
  - Cold data (archive): stored in object storage (such as AWS Glacier).
Data computation layer
- Batch processing engine: handles offline tasks (T+1 reports, ETL cleaning), e.g. Apache Spark, Hive.
- Stream processing engine: processes data streams in real time (e.g. risk control, real-time monitoring), e.g. Apache Flink, Kafka Streams.
- Interactive analysis engine: supports fast OLAP queries, e.g. Presto, Doris.
- AI/ML computing platform: integrates machine learning frameworks (TensorFlow, PyTorch) and supports the full workflow from feature engineering to model deployment.
- Unified job scheduling: use Airflow or DolphinScheduler to manage task dependencies and resource allocation, as in the sketch below.
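A minimal scheduling sketch assuming Airflow 2.x; the DAG id, task names, and schedule are illustrative assumptions rather than a prescribed setup:

```python
# Wire an extract -> clean -> load chain as an Airflow DAG so dependencies
# and the daily schedule are managed centrally.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",                 # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    clean = BashOperator(task_id="clean", bash_command="echo clean")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> clean >> load            # declare task dependencies
```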
Data service layer
- Unified data service interfaces: RESTful APIs, GraphQL, SQL query interfaces, API gateways (Kong, Apigee).
- Data visualization and self-service analysis: integrate BI tools (Tableau, Superset) so business users can analyze data on their own.
- Data subscription and push: real-time data push based on a message queue (Kafka); see the consumer sketch after this list.
- Data sandbox: provides a secure, isolated test environment for data exploration and experiments.
- Service governance: rate limiting, circuit breaking, and monitoring (e.g. Prometheus + Grafana).
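A minimal sketch of the subscription-and-push point, again with the kafka-python client; the topic name and consumer group are illustrative assumptions:

```python
# A downstream application subscribes to a Kafka topic and receives pushed
# records as they arrive; topic and group id are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-profile-updates",
    bootstrap_servers="localhost:9092",
    group_id="crm-subscriber",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each subscriber in its own consumer group gets the full stream of updates.
    print(message.value)
```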
Data governance and security layer
- Metadata management: records data lineage, business meaning, and technical attributes (Apache Atlas, Alibaba DataWorks).
- Data quality management: defines rules (uniqueness, completeness) and automatically detects anomalies (Great Expectations, Talend).
- Data security: data masking (e.g. names, mobile phone numbers), encryption (AES), permission control (RBAC); compliance with GDPR, CCPA, and other regulations.
- Lifecycle management: automates the tiering, archiving, and deletion of hot and cold data, as sketched below.
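As a hedged sketch of automated archiving and deletion, the following attaches an S3 lifecycle rule via boto3; the bucket name, prefix, and retention thresholds are illustrative assumptions:

```python
# Move objects under a "logs/" prefix to Glacier after 30 days and delete them
# after 365 days; bucket name and thresholds are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```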
Cloud-native data middle platform
- Infrastructure layer: a cloud platform (AWS / Azure / Alibaba Cloud) provides IaaS resources (ECS, container services such as Kubernetes).
- Data storage layer: data lake (S3) + data warehouse (Redshift) + real-time database (Redis).
- Computing engine layer: Spark (batch processing), Flink (stream processing), SageMaker (AI).
- Service layer: API gateway + microservices (Spring Cloud) expose data services.
- Governance and monitoring: unified logging (ELK Stack), monitoring (Prometheus), permissions (IAM).
Data aggregation
Data acquisition and access
- Data source classification:
  - Structured data: databases (MySQL, Oracle), data warehouses (Hive, ClickHouse).
  - Semi-structured data: logs (JSON, XML), message queues (Kafka).
  - Unstructured data: text, images, audio and video (stored in object storage).
- Offline batch synchronization: Sqoop (database → HDFS), DataX (synchronization across heterogeneous sources); suited to T+1 reports and historical data migration.
- Real-time streaming collection: Kafka Connect, Flink CDC (captures database changes in real time); suited to real-time monitoring of transaction flows and collection of user behavior logs.
- Log collection: Filebeat (lightweight log shipper), Flume (distributed log aggregation).
Data cleaning and standardization
- Data parsing: parse unstructured data (such as logs) into structured fields; tools: JSONPath, Avro Schema.
- Data deduplication: remove duplicate records by business primary key (such as order ID), e.g. Spark's dropDuplicates() or Hive window functions.
- Field standardization: unify time formats (e.g. UTC timestamps) and units (e.g. all amounts in US dollars).
- Null value handling: fill in default values (such as 0), remove invalid records, or impute missing values algorithmically.
- Data validation: verify value ranges (e.g. age > 0) and enumerations (e.g. gender limited to "male/female"); tools: Great Expectations, Debezium (real-time validation). A cleaning sketch follows this list.
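A minimal PySpark sketch of the steps above, assuming hypothetical column names (order_id, event_time, amount, age) and input/output paths:

```python
# Deduplicate by a business key, standardize a timestamp, fill nulls,
# and filter out invalid rows. Column names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()
raw = spark.read.json("raw_orders.json")          # hypothetical input path

cleaned = (
    raw.dropDuplicates(["order_id"])              # dedup on the business key
       .withColumn("event_time",
                   F.to_utc_timestamp("event_time", "Asia/Shanghai"))
       .fillna({"amount": 0})                     # default value for nulls
       .filter(F.col("age") > 0)                  # range validation
)

cleaned.write.mode("overwrite").parquet("cleaned_orders/")
```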
Data development
Data development and modeling
- Batch development: offline reports, user-profile updates; tools: Spark SQL, Hive, Airflow (task scheduling).
- Real-time development: real-time risk control, dynamic pricing; tools: Flink SQL, Kafka Streams.
- Dimensional modeling: fact tables (e.g. transaction records) plus dimension tables (e.g. users, products), suited to BI analysis and OLAP queries.
- Wide-table modeling: pre-compute multi-table joins into wide tables to reduce query complexity; suited to user behavior analysis (combining click, purchase, and browsing data). See the sketch after this list.
- Graph modeling: build entity-relationship networks (e.g. social networks, anti-fraud graphs); tools: Neo4j, TigerGraph.
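A minimal Spark SQL sketch of wide-table modeling; the fact and dimension tables (and their columns) are illustrative assumptions and are expected to already exist in the catalog:

```python
# Pre-join a fact table with dimension tables into one wide table so downstream
# queries avoid repeated multi-table joins. Table/column names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("wide-table-demo")
    .enableHiveSupport()
    .getOrCreate()
)

wide = spark.sql("""
    SELECT t.order_id,
           t.amount,
           u.user_level,
           p.category
    FROM   fact_transactions t
    JOIN   dim_users    u ON t.user_id    = u.user_id
    JOIN   dim_products p ON t.product_id = p.product_id
""")

wide.write.mode("overwrite").saveAsTable("dws_order_wide")
```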
Data quality management and monitoring
- Quality rule definitions:
  - Completeness: key fields are non-null (e.g. user ID).
  - Consistency: data matches across systems (e.g. amounts agree between the finance and order systems).
  - Timeliness: data arrives on time (e.g. alert on delayed T+1 tasks).
- Quality monitoring and alerting: detect anomalies → notify the owner → trigger a rerun or manual repair; use Prometheus to monitor real-time metrics.
- Data lineage tracing: record the complete path from source to application to ease troubleshooting and impact analysis (Apache Atlas, Alibaba DataWorks).
Data storage
Storage hierarchy and architecture
- Raw layer:
  - Content: unprocessed raw data (logs, database binlogs, IoT device data).
  - Data lake: HDFS, AWS S3, Alibaba Cloud OSS (supports massive unstructured data).
  - Distributed file systems: e.g. Ceph (enterprise private-cloud scenarios).
  - Characteristics: low cost, high-throughput writes, suited to long-term retention of raw data.
- Cleaned layer:
  - Content: data after standardization, deduplication, and null handling.
  - Columnar formats: Parquet, ORC (high compression ratio, suited to analytical workloads).
  - Distributed databases: HBase (semi-structured data), Cassandra (highly available writes).
  - Characteristics: unified data format, supports efficient queries.
- Service layer:
  - Content: high-frequency data accessed by the business (e.g. user profiles, real-time statistics).
  - OLAP databases: ClickHouse, Doris (fast aggregation queries).
  - In-memory databases: Redis, Memcached (caching hot data).
  - Characteristics: low-latency responses, directly serves business applications.
- Archive layer:
  - Content: low-frequency historical data (e.g. log archives required for compliance).
  - Cold object storage: AWS Glacier, Alibaba Cloud Archive Storage.
  - Characteristics: very low cost but high read latency (data must be restored before access).
- Hot/cold separation:
  - Hot data (high-frequency access): stored in distributed databases (e.g. HBase, Redis).
  - Cold data (archive): compressed and stored in object storage (e.g. AWS Glacier).
Comparison of storage technology selection
| Scenario | Technology | Advantages | Limitations |
|---|---|---|---|
| Massive raw data storage | HDFS / S3 | High scalability, low cost | Poor random read/write performance |
| Real-time data ingestion | Kafka, Apache Pulsar | High throughput, low latency | Must be paired with downstream storage |
| Structured data analysis | Hive, Snowflake | SQL-compatible, supports complex queries | Limited real-time capability |
| Real-time queries and caching | Redis, ClickHouse | Sub-second responses, high concurrency | Data volume limited by memory/resources |
Data computation
Computing engine classification and selection
- Batch processing engines:
  - Scenarios: offline ETL, T+1 reports, historical data cleaning.
  - Apache Spark: in-memory computation, suited to complex data processing.
  - Hive: based on MapReduce, suited to stable, large-scale offline jobs.
- Stream processing engines:
  - Scenarios: real-time monitoring, risk control, dynamic pricing.
  - Apache Flink: low latency, exactly-once semantics, supports complex event processing (CEP).
  - Kafka Streams: lightweight, deeply integrated with Kafka.
- Interactive analysis engines:
  - Scenarios: ad-hoc queries, BI tool integration.
  - Presto: federated queries across multiple data sources, second-level responses.
  - Apache Druid: OLAP optimized for time-series data.
- AI computing engines:
  - Scenarios: machine learning model training and inference.
  - TensorFlow / PyTorch: deep learning frameworks.
  - Spark MLlib: distributed machine learning library.
Computing architecture patterns
- Lambda architecture:
  - Design: a batch layer (accurate offline results) plus a speed layer (approximate real-time results), merged by a serving layer.
  - Pros and cons: balances accuracy and timeliness, but maintaining two sets of processing logic adds complexity.
- Kappa architecture:
  - Design: process both historical and real-time data solely through a stream processing engine (e.g. Flink), relying on log replay.
  - Pros and cons: simpler architecture with unified logic, but depends on long-term retention in the message queue (e.g. the Kafka retention policy).
- Lakehouse:
  - Design: converges the data lake (flexible storage) and the data warehouse (efficient analytics), e.g. Delta Lake and Apache Iceberg.
  - Advantages: supports ACID transactions and schema evolution, enabling analysis directly on the data lake; a minimal example follows.
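A minimal Lakehouse sketch with Delta Lake on Spark; the storage path and sample data are illustrative assumptions, and the delta-spark package is assumed to be available:

```python
# Write a Delta table (ACID, schema tracking) on a data-lake path, then read it
# back for analysis. Path and sample rows are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "click"), (2, "purchase")], ["user_id", "event"])
df.write.format("delta").mode("overwrite").save("/data/lake/events")

spark.read.format("delta").load("/data/lake/events").show()
```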
Data governance
Metadata Management
- Metadata collection: automatically collect table schemas, field definitions, ETL task lineage, etc. (Apache Atlas, Alibaba DataWorks).
- Data catalog: provides a search interface so business users can quickly locate data assets; supports tag-based classification, data popularity statistics, and user ratings.
- Data lineage: visually presents the complete path of data from source to application (e.g. "user table → ETL task → report").
Data Quality Management
- Rule definitions:
  - Completeness: key fields are non-null (e.g. order ID).
  - Accuracy: numeric range validation (e.g. age > 0).
  - Consistency: data matches across systems (e.g. total amounts agree between the finance and order systems).
  - Timeliness: data arrives on time (e.g. alert on delayed T+1 tasks).
- Quality monitoring: scheduled jobs scan the data and trigger alerts; Great Expectations (define rules and generate reports), Debezium (real-time data validation). A rule sketch follows this list.
- Closed-loop handling: correct data via SQL scripts or ETL jobs.
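A hedged sketch of the rule definitions above using Great Expectations' classic pandas-backed API (the exact API differs across versions); column names and sample data are illustrative assumptions:

```python
# Declare completeness and accuracy rules against a small sample; each call
# returns a validation result that a scheduled job could turn into an alert.
import pandas as pd
import great_expectations as ge

orders = ge.from_pandas(pd.DataFrame({
    "order_id": ["A1", "A2", None],
    "age": [25, -3, 40],
}))

# Completeness: the key field must be non-null.
print(orders.expect_column_values_to_not_be_null("order_id"))
# Accuracy: numeric range validation.
print(orders.expect_column_values_to_be_between("age", min_value=0, max_value=120))
```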
Master Data Management (MDM)
- Master data definition: establish enterprise-wide standards (e.g. a "customer" must include ID, name, mobile phone number, and registration time).
- Master data distribution: synchronize master data to each business system via APIs or a message queue (Kafka).
- Conflict resolution: define priority rules (e.g. the CRM system is the authoritative source of "customer" information).
Data lifecycle management
- Hot data: high-frequency access, stored on high-performance storage (e.g. Redis, SSD).
- Warm data: low-frequency access, stored in the data warehouse (e.g. Hive).
- Cold data: archived to low-cost storage (e.g. AWS Glacier) and automatically deleted once the regulatory retention period expires.
Data security
Data classification and grading
- Classification: divide data by business attributes (e.g. user data, transaction data, log data).
- Grading: apply differentiated protection policies by sensitivity level (e.g. public, internal, confidential).
Access Control
- Role-based access control (RBAC): define roles (e.g. data analyst, risk controller) and grant least-privilege permissions (Apache Ranger, AWS IAM).
- Dynamic masking: return partial data according to the user's role (e.g. customer service sees only the last four digits of a phone number), via ProxySQL or data-masking middleware; see the sketch below.
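A minimal sketch of role-based dynamic masking; the role names and masking rule are illustrative assumptions:

```python
# Only privileged roles see the full phone number; everyone else gets all but
# the last four digits masked. Role names are assumptions.
def mask_phone(phone: str, role: str) -> str:
    if role in {"risk_admin", "dba"}:               # privileged roles see the full value
        return phone
    return "*" * (len(phone) - 4) + phone[-4:]

print(mask_phone("13812345678", role="customer_service"))  # prints *******5678
```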
Data encryption
- Encryption at rest: database encryption (e.g. MySQL TDE), file system encryption (e.g. HDFS Transparent Encryption).
- Encryption in transit: protect data transfers with TLS/SSL (e.g. Kafka SSL encryption).
Audit and monitoring
- Operational audit: record data access, modification, and deletion logs and retain them for at least six months to meet compliance requirements; ELK Stack (log analysis), AWS CloudTrail.
- Anomaly detection: identify abnormal access patterns with machine learning models (e.g. bulk data exports during non-working hours).
Data Services and APIs
Layered structure of data services
- Data preparation layer:
  - Data modeling: design wide tables, aggregate tables, or feature tables (e.g. a user-profile table) according to business needs.
  - Data processing: generate reusable data assets (e.g. a daily sales statistics table) via ETL/ELT.
  - Technology choices: Spark, Flink, Hive.
- Service encapsulation layer:
  - Service abstraction: encapsulate data behind interfaces with clear business semantics (e.g. "get a user's recent orders").
  - Service types:
    - Query: SQL queries, key-value lookups (e.g. fetching a profile by user ID).
    - Analytical: aggregation statistics (e.g. sales ranking by region).
    - Intelligent: model predictions (e.g. user churn probability score).
  - Technology choices: Spring Boot (microservices), GraphQL (flexible queries).
- Service exposure layer:
  - API gateway: a unified entry point that handles API authentication, rate limiting, and monitoring.
  - Protocol support: RESTful APIs (roughly 80% of scenarios), WebSocket (real-time push), gRPC (high-performance communication).
  - Technology choices: Kong, Apigee, Alibaba Cloud API Gateway.
API design principles
- Standardization: follow the OpenAPI specification (Swagger) and provide clear interface documentation.
- Security: support OAuth 2.0 and JWT token authentication, and mask sensitive data (e.g. hide part of the mobile phone number).
- Performance optimization (see the sketch after this list):
  - Caching policy: cache high-frequency query results in Redis to reduce database load.
  - Pagination and rate limiting: prevent large result sets from dragging the system down (e.g. at most 100 items per page, QPS limited to 1000 requests/second).
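A hedged Flask + Redis sketch of the caching, pagination, and page-size cap described above; the endpoint path, cache key format, TTL, and the `query_orders` stand-in are illustrative assumptions:

```python
# Serve a paginated query, capping the page size and caching hot results in
# Redis for 60 seconds. Endpoint, key format, and TTL are assumptions.
import json
from flask import Flask, request, jsonify
import redis

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379)

def query_orders(page: int, size: int):
    # Stand-in for a real database query with LIMIT/OFFSET pagination.
    return [{"order_id": i} for i in range((page - 1) * size, page * size)]

@app.route("/v1/orders")
def list_orders():
    page = int(request.args.get("page", 1))
    size = min(int(request.args.get("size", 20)), 100)   # at most 100 items per page

    key = f"orders:{page}:{size}"
    hit = cache.get(key)
    if hit:                                              # serve hot queries from cache
        return jsonify(json.loads(hit))

    rows = query_orders(page, size)
    cache.setex(key, 60, json.dumps(rows))               # cache for 60 seconds
    return jsonify(rows)
```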
API Lifecycle Management
- Development and testing:
  - Mock services: use Postman or Swagger UI to simulate interfaces so front-end development can proceed in parallel.
  - Automated testing: verify interface performance and correctness with JMeter or Python scripts.
- Deployment and version control:
  - Blue-green release: when a new API version goes live, keep the old version running and shift traffic gradually.
  - Version identifier: URL path (e.g. /v1/user) or request header (e.g. Accept-Version: 2.0).
- Monitoring and operations:
  - Metric monitoring: track API call volume, latency, and error rates in real time (Prometheus + Grafana).
  - Log analysis: record request details to ease troubleshooting (ELK Stack).
Typical API types and scenarios
| API type | Scenario | Implementation |
|---|---|---|
| Query API | Look up user information by ID | Redis cache + sharded MySQL |
| Analytics API | Count active users over the past 7 days | Presto querying a Hive wide table |
| Real-time push API | Trading risk alert notifications | WebSocket + Kafka message queue |
| Prediction API | User credit scoring | Flask serving a TensorFlow model |
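A hedged sketch of the "Prediction API" row: a Flask service that loads a saved Keras model and scores incoming feature vectors; the model path, feature shape, and route are illustrative assumptions:

```python
# Load a saved model once at startup and expose a scoring endpoint.
# Model path, input shape, and route are assumptions.
import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)
model = tf.keras.models.load_model("credit_score_model.keras")  # hypothetical path

@app.route("/v1/score", methods=["POST"])
def score():
    features = np.array(request.json["features"], dtype="float32").reshape(1, -1)
    prediction = model.predict(features)                         # run inference
    return jsonify({"score": float(prediction[0][0])})

if __name__ == "__main__":
    app.run(port=8080)
```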