1. Overview
When building a big data analysis system, we face the challenge of massive, multi-source data, and how to effectively analyze this fragmented data has long been a core concern of big data research. A big data analysis and processing platform is a tool for meeting this challenge: it integrates the current mainstream big data processing and analysis frameworks and tools to enable comprehensive mining and in-depth analysis of data. In this blog, I will introduce how to build a big data analytics platform that can accurately extract and deeply analyze valuable information in a complex data environment.
2. Content
Building a complete big data analytics platform involves numerous components, each with its own focus and functional characteristics. From data collection and storage to processing and analysis, every component plays a key role. Making these components work in synergy and combining them organically is a complex and critical task.
This process needs to take into account the requirements of data size, variety, and real-time processing. At the same time, in order to achieve the goal of mining massive data, it is also necessary to consider the technical challenges of distributed computing, storage optimization, and efficient algorithm design. Only through the careful design and integration of these components can the deep mining of massive data be accomplished to obtain valuable information for business and decision-making.
2.1 Understanding Big Data Analytics Systems
Before building a big data analytics platform, we must first start from the business requirements and deeply understand user expectations and usage scenarios. A big data analytics platform is not just a stack of technologies; it is an intelligent engine that serves the business. Clarifying the business scenarios and user expectations, and understanding what valuable information we are pursuing in this ocean of data, is the key starting point for building a big data analytics system.
2.1.1 Understanding the value of big data analytics systems
A well-built big data analysis system not only gives the enterprise a basic data center and a unified data storage system, but also lays a solid foundation for realizing the value of data through data modeling.
1. Building a basic data center
The first value of a big data analytics system is reflected in the construction of an enterprise's basic data center. Through a unified data storage system, enterprises can effectively manage, store and retrieve massive amounts of data, including data from different business units and multiple sources. This centralized data management not only improves data reliability and consistency, but also reduces the complexity and cost of data management.
2. Unified data modeling
Through unified modeling of data, big data analytics systems provide enterprises with a standardized way of representing data, enabling different departments and businesses to work with the same data model. This consistent data model helps eliminate data silos and promotes cross-departmental and cross-system data sharing and collaborative work, thus improving overall enterprise efficiency and decision-making.
3. Pushing down data-processing capabilities
Big data analytics systems push data-processing capabilities down into a centralized data-processing center, providing enterprises with powerful processing capacity. This means that enterprises can clean, transform and analyze data more efficiently, and thus better explore the potential value of the data. At the same time, this centralized processing model helps improve processing efficiency and reduce processing costs.
4. Unified data management monitoring system
To ensure the stable operation of the big data analysis system, a unified data management and monitoring system is built. This includes comprehensive monitoring of data quality, security and availability, as well as real-time monitoring of system performance and faults. With such a monitoring system in place, enterprises can identify and solve potential problems in a timely manner, ensuring that the system runs stably and reliably.
5. Build a unified application center
In the end, the big data analysis system truly reflects the value of data by building a unified application center to meet the business needs of enterprises. Through the application center, enterprises are able to develop a variety of intelligent applications based on the data and analysis results provided by the big data analytics system to provide stronger support for business. This makes data no longer a passive resource, but a power source that can actively create value for business.
In a nutshell, the value of a big data analytics system lies not only in processing and analyzing massive amounts of data, but also in building a unified and efficient data infrastructure for the enterprise, which provides strong support for business innovation.
2.1.2 Understanding the purpose of big data analytics systems
In today's wave of digitization, big data is no longer just a huge pile of information; it has become a core resource driving intelligent decision-making and business innovation. Understanding the purpose of a big data analytics system is about far more than chasing technology trends; it is about gaining deep insight into how data guides business actions.
1. Data metrics: insights into business trends
One of the primary purposes of a big data analytics system is to help organizations gain insight into business trends. By analyzing massive amounts of data, the system is able to identify and understand market trends, consumer behavior, and competitor strategies. This in-depth insight helps organizations predict future trends, develop strategic plans, and make sharp business decisions.
2. Data understanding: improving the decision-making process
Another key objective of a big data analytics system is to improve the decision-making process. By providing real-time, accurate data analytics, the system helps management better understand the current business situation and reduces guesswork in decision-making. Such data-driven decision-making reduces risk, increases the probability of success, and preserves flexibility in a competitive marketplace.
3. Data-driven: optimizing operational efficiency
Big data analytics systems also aim to optimize the operational efficiency of a business. Through in-depth analysis of business processes, the system can identify potential optimization points to improve productivity and reduce resource wastage. This optimization not only brings cost reduction, but also accelerates business operations and improves customer satisfaction.
4. Data prediction: enabling personalized marketing
Big data analytics systems help companies achieve a more personalized marketing strategy. Through in-depth understanding of customer behavior and preferences, the system is able to generate accurate user profiles and provide companies with more targeted marketing programs. This personalized marketing not only improves the effectiveness of marketing, but also strengthens customer relationships and enhances brand loyalty.
5. Data security: enhancing security and compliance
Another important goal of a big data analytics system is to enhance the security and compliance of an organization. By monitoring and analyzing data, the system is able to detect unusual activities and potential security threats in a timely manner. It also helps to ensure that organizations follow regulations and industry standards, reducing legal risks.
2.1.3 Understanding the application scenarios of big data analytics systems
In today's information age, big data is becoming a key force driving technology and business development. With the continuous progress of technology, the application scenarios of big data analysis systems are becoming more and more extensive. These systems are not only a powerful assistant for enterprise decision-making, but also show strong application potential in many fields such as healthcare, urban planning and finance.
1. Enterprise decision-making optimization
Big data analytics systems have long been applied in marketing and sales. By analyzing large-scale market data, companies can better understand consumer behavior, trends and preferences. Based on these analysis results, companies can optimize advertising strategies, develop personalized marketing plans, and target potential customers more precisely. In the sales process, big data analytics systems can also help companies monitor inventory, adjust pricing strategies in real time, and improve sales efficiency.
2. Financial risk control and anti-fraud
In the financial sector, big data analysis systems provide strong support for risk management and anti-fraud. By analyzing users' transaction history, behavioral patterns and other multidimensional data, financial institutions can more accurately assess credit risk and detect abnormal transaction behavior in a timely manner, thereby improving risk control. Big data analytics systems are also capable of building sophisticated fraud detection models to identify potential fraudulent activities and protect users' assets.
3. Medical and health management
In the medical field, big data analytics systems provide unprecedented support for health management and medical decision-making. By analyzing information such as patients' medical history data, medical records and vital signs, healthcare providers can better understand patients' health status, predict the risk of chronic diseases and develop personalized treatment plans. Big data analytics systems can also assist in medical research and accelerate the process of new drug development and clinical trials.
4. Smart City Construction
In urban management, big data analytics systems provide strong support for the construction of smart cities. By collecting and analyzing data from all aspects of the city, including traffic flow, environmental pollution, energy consumption, etc., city managers can better plan urban development, optimize traffic flow, and improve the overall operational efficiency of the city.
5. Manufacturing Intelligent Production
In the manufacturing industry, big data analytics systems provide critical support for smart production. By monitoring a large amount of sensor data on the production line, companies can understand the production status in real time, predict equipment failures, so as to carry out timely maintenance and improve production efficiency. Big data analytics systems can also optimize supply chain management, reduce inventory costs and improve the accuracy of production planning.
Overall, the application scenarios of big data analysis systems are becoming more and more extensive, and their role in different fields cannot be ignored. By deeply mining and analyzing data, we are able to understand complex systems and phenomena more comprehensively and accurately, thus providing powerful support for decision-making, innovation and development.
2.2 Understanding Big Data Analytics Systems Architecturally
A big data analytics system integrates, organizes and analyzes huge data sets. It is not simply a data warehouse, but a complex system covering multiple dimensions of information, such as system data and business data.
The core task of a big data analysis system is to realize the mining and analysis of data under a unified data framework. This means that numerous components and complex functions are involved, so how to skillfully combine these components organically becomes a crucial aspect in the construction process of the system. This subsection will explore the component structure of the big data analytics system, analyze the synergy between the components, and how to achieve efficient data processing and visualization display in this multi-layered and multi-functional system.
2.2.1 Understanding the architecture of big data analytics systems
With the dramatic expansion of data size and diversity, building an efficient and scalable big data analytics system has become critical. In order to deeply understand the operation of this mammoth system, this section will lead readers to explore its architecture together, from data collection to the final presentation of insights, revealing how a big data analytics system can discover valuable information in the vast and diverse ocean of data. As shown in the figure.
1. Data acquisition layer: connecting diverse data sources
The data collection layer is the foundation of the big data analysis platform, which is directly related to the acquisition and integration of data. The bottom layer is various types of data sources, including various business data, user data, log data, and so on. To ensure comprehensiveness, both traditional ETL offline acquisition and real-time acquisition are often used. The goal of this layer is to integrate fragmented data from all corners to form a comprehensive and coherent data set.
2. Data storage and processing layer: strong support for data
With the underlying data, the next step is to store the data in a suitable persistent storage layer (e.g. Hive Data Warehouse) and preprocess the data according to different needs and scenarios. This includes OLAP, machine learning, and many other forms. At this level, the data is further processed to ensure the quality, availability and security of the data, providing a solid foundation for subsequent deeper analysis.
3. Data analysis layer: unlocking the deeper value of data
At the data analysis layer, reporting systems and BI analytics systems play the key role. The data that was only lightly processed in the previous stage is now analyzed and mined at a deeper level. The task of this layer is to extract valuable information from the huge amount of data to provide strong support for enterprise decision-making. At this stage, data becomes more intelligent and easier to understand.
4. Data application layer: transforming data into business insights
Ultimately, data is channeled into different categories of applications based on business needs, including data reports, dashboards, large digital screens, and ad hoc queries. The data application layer is the output of the entire data analysis process and the key to demonstrating the value of data externally. Through visualization, the analysis results are vividly presented to end users to support business decisions.
In-depth understanding of the system architecture is not only a matter of technology, but also requires a deep understanding of business requirements and user expectations. Only through a reasonably designed architecture, from data collection to final application, can we realize comprehensive mining and deep analysis of data.
2.2.2 Designing the core modules of a big data analytics system
The core modules for designing a big data analysis system cover data acquisition, data storage, data analysis, and data services, etc. These key modules work together to build a complete and efficient big data analysis system. As shown in the figure.
1. Data Acquisition
As the first step of the system, the data collection module undertakes the task of gathering data from the various business subsystems. The system supports Kafka, Flume and traditional ETL collection tools to ensure efficient processing and integration of diverse data sources.
2. Data storage
The data storage module adopts an integrated storage scheme that combines Hive, HBase, Redis and MySQL to form a distributed storage system that supports massive data. This integrated storage model ensures efficient management and retrieval of large-scale data.
3. Data analysis
The data analytics module is the core engine of the system, supporting both traditional OLAP analytics and conventional Spark-based machine learning algorithms. This enables the system to dig deeper into huge datasets to discover potential values and trends, providing strong support for decision-making.
4. Data services
The data service module is the hub of the system, providing unified management and scheduling of data resources. Through data services, the system realizes the overall governance of data, enabling the flow, storage and analysis of data in an orderly and efficient manner. At the same time, it provides data services to the outside world, providing a standardized interface and access to other systems and applications.
The synergy of these core modules enables the big data analytics system to form an organic and perfect architecture from data collection to storage, to analysis, and finally to provide services to the outside world. By integrating the functions of each module, the system is able to cope with changing data environments and provide users with efficient, reliable and flexible big data analysis solutions.
2.3 Realization of a big data analysis system
The process of implementing a big data analytics system mainly covers the following key steps, including data collection, data integration, data processing and data visualization. This series of steps constitutes what is commonly referred to as a one-stop big data analytics platform.
On this platform, data acquisition is responsible for obtaining raw data from multiple sources, and subsequent data integration aggregates this data and ensures format consistency. Next, the data processing phase performs data cleansing, transformation and processing to bring the data up to analyzable standards.
Ultimately, through data visualization, users are able to understand and explore data in an intuitive way, providing strong support for decision-making. This standard process provides the basic framework for designing and implementing a big data analytics system that can efficiently handle large data sets and meet diverse analytical needs.
2.3.1 Data acquisition
Data collection is the crucial first step in a big data analysis system, which plays a key role in the system's access to the source of information. At this stage, the system collects raw data extensively and efficiently through various channels and technologies to lay the foundation for subsequent analysis and processing. The data collection process covers diverse data sources from sensors, logs, external databases to online platforms, ensuring that the system is able to obtain comprehensive and multi-dimensional information.
In this subsection, we design an application to simulate a data-collection scenario. Its main function is to generate simulated data as raw data and send it to the Kafka message middleware.
Here is a simple application written in Java that generates simulated movie data and sends it continuously to Kafka. In this example, we use the Java client library for Apache Kafka. The specific dependencies are shown in the code.
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.13</artifactId>
    <version>3.4.0</version>
</dependency>
The detailed steps to implement sending simulated data to Kafka are shown in the code. Here are some key implementation details:
- Kafka configuration: In the code, you need to configure parameters such as Kafka's server address (bootstrap.servers) and the serializers for key and value, in order to establish a connection to the Kafka cluster.
- Creating a KafkaProducer: Use the configuration information to create a KafkaProducer object, which is responsible for sending data to the Kafka cluster.
- Generate simulation data: In a loop, use your data generation logic to generate simulated data. This may include creating data in JSON format, setting data fields, simulating dates, etc.
- Building a ProducerRecord: Use the generated mock data to construct a ProducerRecord object, which includes the target topic, the key (if any), and the data to be sent.
- Send data: Use the send method of KafkaProducer to send the ProducerRecord to the Kafka topic.
- Control send rate (optional): In the loop, you can control the rate at which data is generated and sent, for example by sleeping briefly between sends, to avoid sending too frequently.
The implementation is shown in the code.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Properties;
import java.util.Random;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

@Slf4j
public class MovieDataProducer {

    public static void main(String[] args) {
        sendRawData();
    }

    private static void sendRawData() {
        // Kafka server address
        String kafkaBootstrapServers = "localhost:9092";
        // Kafka topic
        String kafkaTopic = "ods_movie_data";

        // Create the Kafka producer configuration
        Properties properties = new Properties();
        properties.put("bootstrap.servers", kafkaBootstrapServers);
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Create the Kafka producer
        try (Producer<String, String> producer = new KafkaProducer<>(properties)) {
            // Generate and send simulated movie data
            for (int i = 1; i <= 1000; i++) {
                String movieData = generateMovieData(i);
                producer.send(new ProducerRecord<>(kafkaTopic, String.valueOf(i), movieData));
                // Print the data that was sent (optional)
                log.info("Sending data to Kafka: " + movieData);
                // Control the rate of data generation, e.g. once per second
                Thread.sleep(1000);
            }
        } catch (InterruptedException e) {
            log.error("Sending data to Kafka failed: {}", e);
        }
    }

    // Generate simulated movie data
    private static String generateMovieData(int rank) {
        String[] countries = {"America", "China", "India", "England", "Japan"};
        String[] genres = {"Action", "Plot", "Comedy", "Sci-Fi", "Adventure"};

        LocalDate releaseDate = LocalDate.now().minusDays(new Random().nextInt(180));
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");

        MovieData movieData = new MovieData(
                rank,
                "Movie" + rank,
                releaseDate.format(formatter),
                countries[new Random().nextInt(countries.length)],
                genres[new Random().nextInt(genres.length)],
                5 + 5 * Math.random(),
                new Random().nextInt(1000000)
        );

        // Use the Jackson library to convert the object to a JSON string
        String result = "";
        try {
            ObjectMapper objectMapper = new ObjectMapper();
            result = objectMapper.writeValueAsString(movieData);
        } catch (Exception e) {
            log.error("Converting to a JSON string failed: {}", e);
        }
        return result;
    }

    // Movie data class
    @Data
    private static class MovieData {
        private int rank;
        private String name;
        private String releaseDate;
        private String country;
        private String genre;
        private double rating;
        private int playCount;

        public MovieData(int rank, String name, String releaseDate,
                         String country, String genre,
                         double rating, int playCount) {
            this.rank = rank;
            this.name = name;
            this.releaseDate = releaseDate;
            this.country = country;
            this.genre = genre;
            this.rating = rating;
            this.playCount = playCount;
        }
    }
}
Make sure to replace localhost:9092 and ods_movie_data with the address of the Kafka server you are actually using and your topic name. This simple Java application generates mock movie data containing fields such as movie rank, movie name, release date, country of production, genre, rating, and play count, and sends it to the specified Kafka topic.
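To sanity-check the producer, you can read a few records back from the topic with Kafka's console consumer, a command-line tool that ships in Kafka's bin directory (adjust the server address and topic to your environment):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ods_movie_data --from-beginning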
2.3.2 Data storage
Data storage plays a key role in a big data analytics system: it must not only provide a highly reliable storage mechanism, but also partition data intelligently according to business requirements to support subsequent offline analysis and queries. In the current scenario, we face a continuous stream of real-time data that needs to be processed in real time and then stored in Hive to serve subsequent offline analysis.
In order to ensure the timeliness of the data, we plan to store the data in 5-minute time windows, which not only helps improve query efficiency but also better supports time-based analytics. To achieve this goal, we will use Apache Flink as our stream processing engine, consuming and processing the data in the Kafka topic in real time through Flink's Kafka integration. The specific implementation flow is shown in the figure.
1. Environmental dependence
When consuming data from a Kafka cluster, Flink needs to introduce a series of dependencies to ensure the smooth operation of the system.
In order to achieve efficient consumption of data in a Kafka cluster, we need to introduce Flink-related dependencies. These dependencies include not only the core Flink libraries, but also libraries that connect and interact with Kafka. The specific dependencies are shown in the code.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
2. Read data
Write Flink code to consume the Kafka topic and store the data directly to HDFS, without additional processing logic, so that the data can later be preprocessed with MapReduce. The specific implementation is shown in the code.
import java.util.Properties;
import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

@Slf4j
public class FlinkTemplateTask {

    public static void main(String[] args) throws Exception {
        // Check that the input arguments meet the requirements
        if (args.length != 3) {
            log.error("kafka(server01:9092), hdfs(hdfs://cluster01/data/), flink(parallelism=2) must be exist.");
            return;
        }
        String bootStrapServer = args[0];
        String hdfsPath = args[1];
        int parallelism = Integer.parseInt(args[2]);

        // Create the Flink streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setParallelism(parallelism);

        // Read data from Kafka
        DataStream<String> transction = env.addSource(new FlinkKafkaConsumer010<>("ods_movie_data",
                new SimpleStringSchema(), configByKafkaServer(bootStrapServer)));

        // Store to HDFS
        BucketingSink<String> sink = new BucketingSink<>(hdfsPath);
        // Customize the name of the files stored on HDFS with hours and minutes,
        // to make it easier to apply the merge strategy later on
        sink.setBucketer(new JDateTimeBucketer<String>("HH-mm"));
        sink.setBatchSize(1024 * 1024 * 4);       // Size: 4MB
        sink.setBatchRolloverInterval(1000 * 30); // Time: 30s
        transction.addSink(sink);

        // Execute the Flink task
        env.execute("Kafka2Hdfs");
    }

    // Set up the Kafka consumer configuration
    private static Properties configByKafkaServer(String bootStrapServer) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootStrapServer);
        props.put("group.id", "test_bll_group");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
A special note here is that we set the rollover interval to be short, checking every 30 seconds: even if no new data arrives within a batch's time window, the current file is still rolled and saved to HDFS.
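The job expects exactly three arguments, matching the check at the top of main: the Kafka bootstrap server, the HDFS output path, and the parallelism. As an illustrative sketch (the jar file name is hypothetical, and the main class is assumed to be in the default package), submitting the job with the Flink CLI might look like this:

flink run -c FlinkTemplateTask flink-kafka2hdfs.jar server01:9092 hdfs://cluster01/data/ 2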
In addition, we rewrote DateTimeBucketer as JDateTimeBucketer. The logic of this tweak is not complicated: it simply adds a year-month-day/hour-minute component to the file generation path of the original method. For example, the path generated on HDFS might be: xxxx/2023-10-10/00-00. This tweak helps to better organize and manage the generated files, making them more consistent with the time and date structure.
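For reference, here is a minimal sketch of what such a bucketer could look like, assuming the Bucketer interface from the flink-connector-filesystem module introduced above; the original implementation is not shown in this post, so treat this as an illustration rather than the definitive code.

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;

// Buckets records into <base>/yyyy-MM-dd/<formatString> directories,
// e.g. /data/2023-10-10/14-05
public class JDateTimeBucketer<T> implements Bucketer<T> {

    private final String formatString;
    private transient SimpleDateFormat dayFormat;
    private transient SimpleDateFormat timeFormat;

    public JDateTimeBucketer(String formatString) {
        this.formatString = formatString; // e.g. "HH-mm"
    }

    @Override
    public Path getBucketPath(Clock clock, Path basePath, T element) {
        if (dayFormat == null) { // SimpleDateFormat is not serializable, so create it lazily
            dayFormat = new SimpleDateFormat("yyyy-MM-dd");
            timeFormat = new SimpleDateFormat(formatString);
        }
        Date now = new Date(clock.currentTimeMillis());
        // Add the year-month-day/hour-minute component to the bucket path
        return new Path(basePath + "/" + dayFormat.format(now) + "/" + timeFormat.format(now));
    }
}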
3. File naming strategy
In this step, we need to preprocess the files that have been stored on HDFS. The processing logic is as follows: if the current time is 2023-10-10 14:00, for example, we need to process the files of the previous 5 minutes of the day, 13-55, 13-56, 13-57, 13-58, and 13-59, and load them into the corresponding 5-minute partition in Hive. To accomplish this, we generate a collection of policies with HH-mm as the key and the 5 files nearest to it as the value. This collection will be used for data preprocessing and merging. The implementation is shown in the code.
public class DateRangeStrategy {

    public static void main(String[] args) {
        getFileNameStrategy();
    }

    // Generate the strategy: for every 5-minute boundary (HH-mm), map it to
    // the five preceding minute-level file names, wrapping across hour and
    // day boundaries
    private static void getFileNameStrategy() {
        for (int i = 0; i < 24; i++) {
            for (int j = 0; j < 60; j += 5) {
                StringBuilder tmp = new StringBuilder();
                for (int k = 1; k <= 5; k++) {
                    // Step back k minutes from the current boundary
                    int totalMinutes = (i * 60 + j - k + 24 * 60) % (24 * 60);
                    tmp.append(String.format("%02d-%02d,", totalMinutes / 60, totalMinutes % 60));
                }
                System.out.println(String.format("%02d-%02d=>%s",
                        i, j, tmp.substring(0, tmp.length() - 1)));
            }
        }
    }
}
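For example, the generated entries include 00-00=>23-59,23-58,23-57,23-56,23-55 (wrapping across the day boundary) and 14-00=>13-59,13-58,13-57,13-56,13-55, which matches the merge window described above.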
4. Data loading
When the data is ready, we can load the pre-processed files on HDFS into the corresponding tables directly with the help of Hive's LOAD command. The specific implementation is shown in the code.
LOAD DATA INPATH '/data/hive/hfile/data/min/2023-10-10/14-05/' OVERWRITE INTO TABLE game_user_db.ods_movie_data PARTITION(day='2023-10-10',hour='14',min='05');
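For context, the LOAD statement above assumes that ods_movie_data is partitioned by day, hour and min. The table DDL is not shown in this post; a possible sketch, with columns inferred from the simulated movie data and the JSON SerDe as an assumption, might look like this:

CREATE TABLE IF NOT EXISTS game_user_db.ods_movie_data (
    `rank`       INT,
    name         STRING,
    release_date STRING,
    country      STRING,
    genre        STRING,
    rating       DOUBLE,
    play_count   INT
)
PARTITIONED BY (day STRING, hour STRING, min STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

Note that the JSON field names produced by the simulator are camelCase (releaseDate, playCount), so in practice the column names would need to match the JSON fields or be mapped through SerDe properties.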
When executing the command, a missing file will cause the load to fail. Therefore, before loading from the HDFS path, we can first check whether the path exists. The implementation is shown in the code.
#!/bin/bash

# HDFS data path
hdfs_path='/data/hive/hfile/data/min/2023-10-10/14-05/'

# Check if the HDFS path exists
if hdfs dfs -test -e "$hdfs_path"; then
    # If present, perform the load operation
    echo "Performing the Hive data load operation"
    hive -e "LOAD DATA INPATH '$hdfs_path' OVERWRITE INTO TABLE game_user_db.ods_movie_data PARTITION(day='2023-10-10',hour='14',min='05');"
else
    echo "HDFS path: ['$hdfs_path'] does not exist"
fi
It is important to note that this script first checks if the HDFS path exists, and if it does, it performs the load operation, otherwise it outputs an error message. This avoids loading errors caused by non-existent files.
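In a real deployment, this check-and-load step would typically run on a schedule aligned with the 5-minute partitions, with the date, hour and minute in the path derived from the current time rather than hardcoded. As a hedged illustration (the script path and name are hypothetical), a crontab entry might look like this:

*/5 * * * * /opt/scripts/load_movie_data.sh >> /var/log/load_movie_data.log 2>&1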
2.3.3 Data analysis
Data analysis is a science: by organizing, cleaning and analyzing data, we can uncover the laws and trends hidden behind huge volumes of data. The continuous development of data analysis tools makes this process increasingly efficient. The application of statistics, machine learning, artificial intelligence and other technologies enables us to understand the information in the data more deeply. Through data visualization techniques, we can turn abstract data into intuitive charts and images, making the meaning of the data easier to understand and convey.
1. Analyzing the movie year histogram
To generate a histogram of movie years, we aggregate the data by release date. The implementation is shown in the code.
-- Movie year
SELECT release_date, COUNT(1) AS pv
FROM ods_movie_data
WHERE day = '2023-10-10'
GROUP BY release_date;
Execute the above code and analyze the results as shown in the figure.
2. Analyzing the movie genre pie chart
To generate a pie chart of movie genres, we aggregate the data by genre. The implementation is shown in the code.
-- Movie genre
SELECT genre, COUNT(1) AS pv
FROM ods_movie_data
WHERE day = '2023-10-10'
GROUP BY genre;
Execute the above code and analyze the results as shown in the figure.
3. Analyzing the movie rating scatterplot
To generate a scatterplot of movie ratings, we aggregate the data by rating. The implementation is shown in the code.
-- Movie rating
SELECT rating, COUNT(1) AS pv
FROM ods_movie_data
WHERE day = '2023-10-10'
GROUP BY rating;
Execute the above code and analyze the results as shown in the figure.
3. Summary
Focusing on building a big data analytics system, this chapter provides a comprehensive introduction to the architectural design of the system and delves into the implementation details of each module. Through a step-by-step approach, readers are able to gradually understand the process of building a big data analytics system, so as to better understand its operation mechanism.
Overall, this content aims to provide readers with a comprehensive and in-depth guide to the implementation of big data analytics systems, and to help readers better master the core concepts and techniques of big data analytics through a combination of theory and practice, so as to lay a solid foundation for practical project applications.
4. Concluding remarks
This blog ends here. If you have any questions in the process of research and study, you can join the group for discussion or send me an email, and I will do my best to answer them for you!
Also, the blogger has published a new book, "Deeper Understanding of Hive", along with the simultaneously released "Kafka is not hard to learn" and "Hadoop Big Data Mining from Introduction to Advanced Practice". If you like them, you can purchase them through the buy link on the bulletin board; thank you for your support. Follow the public account below and follow the prompts to get free instructional videos for the books.