Today we're talking about big data. As a Hadoop novice myself, I won't dare to dig into the complex underlying principles; instead, this article takes a practical, introductory angle and walks you through the basic workflow of a big data application. We'll get started with a classic case - WordCounter. Simply put, the goal of this case is to read each line of a text file, count how often each word occurs, and produce a statistical result. On the surface the task doesn't look difficult; after all, we could easily do it locally with a plain Java program.
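To make the local version concrete, here is a minimal single-machine sketch of the same idea in plain Java 8 (no Hadoop involved; the file name words.txt is just a placeholder for your own input file):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        // Read every line, split it on whitespace, and count how often each word appears.
        Map<String, Long> counts = Files.lines(Paths.get("words.txt"))
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

This works fine as long as the file fits comfortably on one machine, which is exactly the assumption that breaks down at big data scale.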
However, the reality is not that simple. A simple Java program like this handles similar tasks on a single computer just fine, but in a big data scenario the volume of data far exceeds what one machine can process. At that point, relying on a single machine's computing resources can no longer cope, which is exactly where distributed computing and storage come in. Distributed computing splits a task into multiple subtasks and lets multiple machines work together, making it possible to process massive data efficiently, while distributed storage slices the data across multiple nodes to remove the bottleneck of storing and accessing it.
Therefore, through today's introduction, I hope to take you from a simple example, step by step, to understand how big data processing uses a distributed framework such as Hadoop to carry out computation and storage efficiently.
Environment Preparation
Hadoop Installation
I'm not a big fan of installing on a local Windows system, as local environments usually accumulate a lot of unnecessary files and configurations that may affect the cleanliness and smoothness of the system. Therefore, the focus of the demo will be on a Linux server-based environment with Docker for rapid deployment.
We will use the Pagoda panel's one-click installation, which lets you complete the entire deployment with a few simple operations, saving the hassle of typing commands by hand and making installation easier and faster.
Open the Port
By default the installation already exposes some ports, such as 9870 for the Web UI, but one important port, 8020, is not opened. This is the port our local IntelliJ IDEA project will connect through, so we have to open it manually to make sure it is reachable. You can refer to the following screenshot to configure the port and complete the connection.
If you have successfully started and completed the configuration, you should be able to access and view the web page without any problems at this point. This is shown in the figure:
Project Development
Create a project
We can create a new project directly and configure the basic project information (groupId, artifactId, version) according to the project's needs. To ensure compatibility and stability, we use JDK 8 as the development environment.
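If you manage the compiler level through Maven, a minimal sketch for pinning JDK 8 in the pom (these are the standard Maven compiler properties; adjust if your project configures the compiler plugin differently) looks like this:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>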
First, let's take a look at the project's file directory structure in order to get a clear picture of how the entire project is organized and how the files are distributed.
The structure below can be generated directly with tree /f:
├─input
├─output
├─src
│  ├─main
│  │  ├─java
│  │  │  └─org
│  │  │      └─xiaoyu
│  │  │              InputCountMapper.java
│  │  │              Main.java
│  │  │              WordsCounterReducer.java
│  │  └─resources
│  │          core-site.xml
Next, we will implement the classic "Hello, World!" of big data, which we usually call WordCounter. To do this, we first need to write the MapReduce program. In the Map phase, the main task is to parse the input file and break the data down into a regular format (e.g., key-value pairs of words and their occurrence counts). Then, in the Reduce phase, we aggregate the data output by the Map phase to get the statistics we want, such as the number of occurrences of each word.
In addition, we also need a startup class, the Job class, which configures and launches the MapReduce task and ensures the Map and Reduce phases run smoothly. By implementing this whole flow we complete a basic WordCounter program and get a feel for the core ideas and applications of MapReduce.
pom dependencies
There's not much to say here, just add the relevant dependencies:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>
<!--mapreduce-->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>3.2.0</version>
</dependency>
Next, in the resources directory, we configure the connection to the remote Hadoop cluster (this follows the standard core-site.xml format):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href=""?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://your-own-ip:8020</value>
    </property>
</configuration>
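Once this file is on the classpath, the Configuration object created in the Job class later will pick up fs.defaultFS automatically. As an optional sanity check (a small sketch, not part of the original project), you can print the value to confirm the client sees your remote address:

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Should print hdfs://your-own-ip:8020 if core-site.xml was loaded from the classpath.
        System.out.println(conf.get("fs.defaultFS"));
    }
}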
We are mainly focusing on the demo this time, so we don't need to deal with very large files. To simplify the demo process, I have provided only a portion of the data here.
xiaoyu xiaoyu
cuicui ntfgh
hanhan dfb
yy yy
asd dfg
123 43g
nmao awriojd
InputCountMapper
Let's start by building the InputCountMapper class. The code is as follows.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InputCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input value is one line of text; split it on spaces and emit (word, 1) for every token.
        String line = value.toString().trim();
        for (int i = 0; i < line.split(" ").length; i++) {
            word.set(line.split(" ")[i]);
            context.write(word, one);
        }
    }
}
In Hadoop MapReduce programming, the boilerplate is actually fairly simple; the key is to understand and define the generics correctly. You need to extend the Mapper class and give it four generic type parameters based on the needs of the task. Every two of these form a K-V (key-value pair) structure: in the example above, the input K-V types are LongWritable-Text and the output K-V types are Text-IntWritable. Here LongWritable, Text, IntWritable and so on are Hadoop's own data types representing different formats; apart from String, which is replaced in Hadoop by Text, the other data types usually just carry a Writable suffix.
Next, regarding the Mapper class's output: although we declared the output types in the code, note that our overridden map method does not return a value directly. Instead, the Mapper passes its results on through the Context object. So all we need to do is write the formatted data into the Context inside map and hand it over to the Reducer to process.
WordsCounterReducer
The code for this step is as follows:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordsCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum up all the 1s the Mapper emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
In Hadoop MapReduce programming, the Reduce phase also follows a fixed pattern. First, we extend the Reducer class and define four generic parameters, just like in the Mapper phase: the input key type, the input value type, the output key type, and the output value type.
In the Reduce stage, the shape of the input data changes, especially on the value side, which becomes an Iterable collection. The reason is that in the Mapper stage we write each word into the Context with a count of 1. For example, when the Mapper encounters the word "xiaoyu", it outputs a (xiaoyu, 1) key-value pair. If "xiaoyu" occurs multiple times in the input data, the framework groups these pairs into an Iterable, such as (xiaoyu, [1, 1]), indicating the word appeared twice.
In this example, the Reduce operation is very simple: we just accumulate the values in each Iterable. For instance, for the input (xiaoyu, [1, 1]), we add up the 1s to get the final result, 2.
Main
Finally we need to generate a Job with the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    static {
        try {
            // An absolute path is recommended, pointing at the native library (hadoop.dll) in your local bin directory.
            System.load("E:\\");
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n" + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordCounter");
        job.setJarByClass(Main.class);
        job.setMapperClass(InputCountMapper.class);
        job.setReducerClass(WordsCounterReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("file:///E:/hadoop/test/input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///E:/hadoop/test/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
What is shown here is essentially fixed boilerplate, but in practice special attention is needed: we are connecting from a Windows environment to a remote Hadoop cluster, and there are plenty of potential problems and pitfalls along the way, especially around configuration, connectivity, and permissions.
Next, I will analyze and solve these common difficulties one by one, hoping to provide you with some practical reference and guidance to help you complete the operation more smoothly.
Solving the Difficulties
Directory does not exist
If you are not working against a local Windows directory but against a directory on the remote server (i.e., HDFS), you would write something like the following:
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
In this case, we must first create the input directory, but take special care not to create the output directory in advance: Hadoop creates it automatically when the job runs, and if it already exists the job will fail. So just enter the Docker environment and execute the following command:
hdfs dfs -mkdir /input
Of course, there is a simpler way to create relevant directories or resources directly on the page through the graphical interface. You can refer to the following steps, as shown in the figure:
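Remember that the sample data also needs to end up inside /input before the job runs. Assuming you saved the text above as words.txt on the server (the file name is just my placeholder), a typical upload command would be:

hdfs dfs -put words.txt /input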
Permission denied
Next, when you run the Job, the final step tries to create the output directory. Since the current user does not have sufficient privileges, you will see an error like: Permission denied: user=yu, access=WRITE, inode="/":root:supergroup:drwxr-xr-x. It means the current user (yu) is trying to create a directory or file under the root path, but that path is writable only by its owner (root), so ordinary users cannot write to it and the job fails.
So you still need to go into the docker container and execute the following command:
hadoop fs -chmod 777 /
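Opening the HDFS root with 777 is acceptable for a quick demo but not something to keep in a real environment. An alternative sketch (my own suggestion, not what this demo originally used) is to make the client identify itself as the user that owns the directory, for example root, by setting HADOOP_USER_NAME before the Configuration and Job are created:

// Hypothetical alternative: act as the HDFS user that owns "/" (assumed here to be root).
System.setProperty("HADOOP_USER_NAME", "root");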
With the permissions sorted out, the task should complete successfully. Next, you can click into the output directory to view the result file. Note, however, that since we did not configure a specific IP address, you will need to manually replace the address in the download link with your own server IP to make sure the download goes smoothly and you get the file you need.
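If adjusting the download link is a hassle, you can also read the result directly inside the container. part-r-00000 is Hadoop's default name for the first reducer's output file; adjust it if yours differs:

hdfs dfs -cat /output/part-r-00000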
Error: $Windows
This problem is usually caused by missing files. When running Hadoop on a Windows system, winutils.exe and hadoop.dll are required dependencies, because they provide the native code support and execution environment that Hadoop needs on Windows.
To run smoothly, you need to download the winutils.exe and hadoop.dll that match your Hadoop version. These files have been prepared for multiple Hadoop versions and can all be downloaded from the following repository: /cdarlint/winutils
We only download one version here, and to avoid rebooting the computer to pick up an environment variable, we simply hard-code the path in the code:
static {
    try {
        // An absolute path is recommended, pointing at the native library (hadoop.dll) in your local bin directory.
        System.load("E:\\");
    } catch (UnsatisfiedLinkError e) {
        System.err.println("Native code library failed to load.\n" + e);
        System.exit(1);
    }
}
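If you would rather not load the DLL by hand, another commonly used approach (a sketch based on my own assumed install path, not what this article's setup did) is to point Hadoop at a local unpacked distribution whose bin directory contains winutils.exe and hadoop.dll, either through the HADOOP_HOME environment variable or programmatically:

// Hypothetical local path to an unpacked Hadoop distribution; its bin folder must hold winutils.exe and hadoop.dll.
System.setProperty("hadoop.home.dir", "E:\\hadoop-3.2.0");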
If you still have problems after that, configure the Windows Subsystem for Linux (WSL):
Use the Windows + R shortcut to open the Run dialog box and execute OptionalFeatures to open Windows Features.
Check "Windows Subsystem for Linux" and "Virtual Machine Platform", and then click "OK".
Final Result
We finally got a successful run! The output is sorted in the default key order, but the sorting can of course be customized as needed; if you're interested in controlling it, it's worth digging into the sorting mechanism, as sketched below.
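As a taste of what that customization looks like, here is a hedged sketch (my own example, not part of the original demo): Hadoop sorts the Mapper's output keys with a RawComparator, and supplying your own via job.setSortComparatorClass(...) changes the order in which the Reducer, and therefore the output file, sees the keys. The comparator below simply reverses the default ascending Text order:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true = instantiate keys so the object-level compare below is used
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // negate the default comparison to sort in reverse order
    }
}

Registering it in the Job class is a single extra line: job.setSortComparatorClass(ReverseTextComparator.class);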
Summary
Through today's sharing, we've taken a brief look at WordCounter, a classic application in big data processing, and shown how to use MapReduce for distributed computing in practice with the Hadoop framework. Although WordCounter looks like a relatively simple program on the surface, it reveals the core ideas of big data processing.
From installing and configuring to writing code, we have walked through the process of building a Hadoop cluster step by step. We hope that through this article, you can get some inspiration and help for big data application development, especially MapReduce programming under the Hadoop framework. The world of big data is huge and complex, but every little practice will take you one step closer to really mastering this technology.
I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟