Today we're talking about big data. As a Hadoop novice myself, I won't dare to dig into the complex underlying principles; instead, this article takes a practical, introductory angle and walks you through the basic workflow of a big data application. We'll get started with a classic case - WordCounter. Simply put, the goal of this case is to read each line of a text file, count how often each word occurs, and produce a statistical result. On the surface the task doesn't look difficult; after all, we could easily do it locally with a plain Java program.
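To make the local version concrete, here is a minimal single-machine sketch of the same idea in plain Java 8 (no Hadoop involved; the file name words.txt is just a placeholder for your own input file):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        // Read every line, split it on whitespace, and count how often each word appears.
        Map<String, Long> counts = Files.lines(Paths.get("words.txt"))
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

This works fine as long as the file fits comfortably on one machine, which is exactly the assumption that breaks down at big data scale.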
However, the reality is not that simple. A simple Java program like this handles similar tasks on a single computer just fine, but in a big data scenario the volume of data far exceeds what one machine can process. At that point, relying on a single machine's computing resources can no longer cope, which is exactly where distributed computing and storage come in. Distributed computing splits a task into multiple subtasks and lets multiple machines work together, making it possible to process massive data efficiently, while distributed storage slices the data across multiple nodes to remove the bottleneck of storing and accessing it.
Therefore, through today's introduction, I hope to take you from a simple example, step by step, to understand how big data processing uses a distributed framework such as Hadoop to carry out computation and storage efficiently.
Environment Preparation
Hadoop Installation
I'm not a big fan of installing on a local Windows system, as local environments usually accumulate a lot of unnecessary files and configurations that may affect the cleanliness and smoothness of the system. Therefore, the focus of the demo will be on a Linux server-based environment with Docker for rapid deployment.
We will use the Pagoda panel's one-click installation, which lets you complete the entire deployment with a few simple operations, saving the hassle of typing commands by hand and making installation easier and faster.
Open the Port
By default the installation already exposes some ports, such as 9870 for the Web UI, but one important port, 8020, is not opened. This is the port our local IntelliJ IDEA project will connect through, so we have to open it manually to make sure it is reachable. You can refer to the following screenshot to configure the port and complete the connection.
If you have successfully started and completed the configuration, you should be able to access and view the web page without any problems at this point. This is shown in the figure:
Project Development
Create a project
We can create a new project directly and configure the basic project information (groupId, artifactId, version) according to the project's needs. To ensure compatibility and stability, we use JDK 8 as the development environment.
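If you manage the compiler level through Maven, a minimal sketch for pinning JDK 8 in the pom (these are the standard Maven compiler properties; adjust if your project configures the compiler plugin differently) looks like this:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>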
First, let's take a look at the project's file directory structure in order to get a clear picture of how the entire project is organized and how the files are distributed.
The structure below can be generated directly with tree /f:
├─input
├─output
├─src
│  ├─main
│  │  ├─java
│  │  │  └─org
│  │  │      └─xiaoyu
│  │  │              InputCountMapper.java
│  │  │              Main.java
│  │  │              WordsCounterReducer.java
│  │  └─resources
│  │          core-site.xml
Next, we will implement the classic "Hello, World!" of big data, which we usually call WordCounter. To do this, we first need to write the MapReduce program. In the Map phase, the main task is to parse the input file and break the data down into a regular format (e.g., key-value pairs of words and their occurrence counts). Then, in the Reduce phase, we aggregate the data output by the Map phase to get the statistics we want, such as the number of occurrences of each word.
In addition, we also need a startup class, the Job class, which configures and launches the MapReduce task and ensures the Map and Reduce phases run smoothly. By implementing this whole flow we complete a basic WordCounter program and get a feel for the core ideas and applications of MapReduce.
pom dependencies
There's not much to say here, just add the relevant dependencies:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>
<!--mapreduce-->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>3.2.0</version>
</dependency>
Next, in the resources directory, we configure the connection to the remote Hadoop cluster (this follows the standard core-site.xml format):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href=""?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://your-own-ip:8020</value>
    </property>
</configuration>
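Once this file is on the classpath, the Configuration object created in the Job class later will pick up fs.defaultFS automatically. As an optional sanity check (a small sketch, not part of the original project), you can print the value to confirm the client sees your remote address:

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Should print hdfs://your-own-ip:8020 if core-site.xml was loaded from the classpath.
        System.out.println(conf.get("fs.defaultFS"));
    }
}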
We are mainly focusing on the demo this time, so we don't need to deal with very large files. To simplify the demo process, I have provided only a portion of the data here.
xiaoyu xiaoyu
cuicui ntfgh
hanhan dfb
yy yy
asd dfg
123 43g
nmao awriojd
InputCountMapper
Let's start by building the InputCountMapper class. The code is as follows.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InputCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input value is one line of text; split it on spaces and emit (word, 1) for every token.
        String line = value.toString().trim();
        for (int i = 0; i < line.split(" ").length; i++) {
            word.set(line.split(" ")[i]);
            context.write(word, one);
        }
    }
}
In Hadoop MapReduce programming, the boilerplate is actually fairly simple; the key is to understand and define the generics correctly. You need to extend the Mapper class and give it four generic type parameters based on the needs of the task. Every two of these form a K-V (key-value pair) structure: in the example above, the input K-V types are LongWritable-Text and the output K-V types are Text-IntWritable. Here LongWritable, Text, IntWritable and so on are Hadoop's own data types representing different formats; apart from String, which is replaced in Hadoop by Text, the other data types usually just carry a Writable suffix.
Next, regarding the Mapper class's output: although we declared the output types in the code, note that our overridden map method does not return a value directly. Instead, the Mapper passes its results on through the Context object. So all we need to do is write the formatted data into the Context inside map and hand it over to the Reducer to process.
WordsCounterReducer
The code for this step is as follows:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordsCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum up all the 1s the Mapper emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
In Hadoop MapReduce programming, the Reduce phase also follows a fixed pattern. First, we extend the Reducer class and define four generic parameters, just like in the Mapper phase: the input key type, the input value type, the output key type, and the output value type.
In the Reduce stage, the shape of the input data changes, especially on the value side, which becomes an Iterable collection. The reason is that in the Mapper stage we write each word into the Context with a count of 1. For example, when the Mapper encounters the word "xiaoyu", it outputs a (xiaoyu, 1) key-value pair. If "xiaoyu" occurs multiple times in the input data, the framework groups these pairs into an Iterable, such as (xiaoyu, [1, 1]), indicating the word appeared twice.
In this example, the Reduce operation is very simple: we just accumulate the values in each Iterable. For instance, for the input (xiaoyu, [1, 1]), we add up the 1s to get the final result, 2.
Main
Finally we need to generate a Job with the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    static {
        try {
            // An absolute path is recommended, pointing at the native library (hadoop.dll) in your local bin directory.
            System.load("E:\\");
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n" + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordCounter");
        job.setJarByClass(Main.class);
        job.setMapperClass(InputCountMapper.class);
        job.setReducerClass(WordsCounterReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("file:///E:/hadoop/test/input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///E:/hadoop/test/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
What is shown here is essentially fixed boilerplate, but in practice special attention is needed: we are connecting from a Windows environment to a remote Hadoop cluster, and there are plenty of potential problems and pitfalls along the way, especially around configuration, connectivity, and permissions.
Next, I will analyze and solve these common difficulties one by one, hoping to provide you with some practical reference and guidance to help you complete the operation more smoothly.
Solving the Difficulties
Directory does not exist
If you are not working against a local Windows directory but against a directory on the remote server (i.e., HDFS), you would write something like the following:
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
In this case, we must first create the input directory, but take special care not to create the output directory in advance: Hadoop creates it automatically when the job runs, and if it already exists the job will fail. So just enter the Docker environment and execute the following command:
hdfs dfs -mkdir /input
Of course, there is a simpler way to create relevant directories or resources directly on the page through the graphical interface. You can refer to the following steps, as shown in the figure:
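Remember that the sample data also needs to end up inside /input before the job runs. Assuming you saved the text above as words.txt on the server (the file name is just my placeholder), a typical upload command would be:

hdfs dfs -put words.txt /input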
Permission denied
Next, when you run the Job, the final step tries to create the output directory. Since the current user does not have sufficient privileges, you will see an error like: Permission denied: user=yu, access=WRITE, inode="/":root:supergroup:drwxr-xr-x. It means the current user (yu) is trying to create a directory or file under the root path, but that path is writable only by its owner (root), so ordinary users cannot write to it and the job fails.
So you still need to go into the docker container and execute the following command:
hadoop fs -chmod 777 /
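Opening the HDFS root with 777 is acceptable for a quick demo but not something to keep in a real environment. An alternative sketch (my own suggestion, not what this demo originally used) is to make the client identify itself as the user that owns the directory, for example root, by setting HADOOP_USER_NAME before the Configuration and Job are created:

// Hypothetical alternative: act as the HDFS user that owns "/" (assumed here to be root).
System.setProperty("HADOOP_USER_NAME", "root");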
With the permissions sorted out, the task should complete successfully. Next, you can click into the output directory to view the result file. Note, however, that since we did not configure a specific IP address, you will need to manually replace the address in the download link with your own server IP to make sure the download goes smoothly and you get the file you need.
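If adjusting the download link is a hassle, you can also read the result directly inside the container. part-r-00000 is Hadoop's default name for the first reducer's output file; adjust it if yours differs:

hdfs dfs -cat /output/part-r-00000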
Error: $Windows
This problem is usually caused by missing files. When running Hadoop on a Windows system, winutils.exe and hadoop.dll are required dependencies, because they provide the native code support and execution environment that Hadoop needs on Windows.
To run smoothly, you need to download the winutils.exe and hadoop.dll that match your Hadoop version. These files have been prepared for multiple Hadoop versions and can all be downloaded from the following repository: /cdarlint/winutils
We only download one version here, and to avoid rebooting the computer to pick up an environment variable, we simply hard-code the path in the code:
static {
    try {
        // An absolute path is recommended, pointing at the native library (hadoop.dll) in your local bin directory.
        System.load("E:\\");
    } catch (UnsatisfiedLinkError e) {
        System.err.println("Native code library failed to load.\n" + e);
        System.exit(1);
    }
}
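If you would rather not load the DLL by hand, another commonly used approach (a sketch based on my own assumed install path, not what this article's setup did) is to point Hadoop at a local unpacked distribution whose bin directory contains winutils.exe and hadoop.dll, either through the HADOOP_HOME environment variable or programmatically:

// Hypothetical local path to an unpacked Hadoop distribution; its bin folder must hold winutils.exe and hadoop.dll.
System.setProperty("hadoop.home.dir", "E:\\hadoop-3.2.0");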
If you still have problems after that, configure the Windows Subsystem for Linux (WSL):
Use the Windows + R shortcut to open the Run dialog box and execute OptionalFeatures to open Windows Features.
Check "Windows Subsystem for Linux" and "Virtual Machine Platform", and then click "OK".
Final Result
We finally got a successful run! The output is sorted in the default key order, but the sorting can of course be customized as needed; if you're interested in controlling it, it's worth digging into the sorting mechanism, as sketched below.
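As a taste of what that customization looks like, here is a hedged sketch (my own example, not part of the original demo): Hadoop sorts the Mapper's output keys with a RawComparator, and supplying your own via job.setSortComparatorClass(...) changes the order in which the Reducer, and therefore the output file, sees the keys. The comparator below simply reverses the default ascending Text order:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true = instantiate keys so the object-level compare below is used
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // negate the default comparison to sort in reverse order
    }
}

Registering it in the Job class is a single extra line: job.setSortComparatorClass(ReverseTextComparator.class);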
Summary
Through today's sharing, we've taken a brief look at WordCounter, a classic application in big data processing, and shown how to use MapReduce for distributed computing in practice with the Hadoop framework. Although WordCounter looks like a relatively simple program on the surface, it reveals the core ideas of big data processing.
From installing and configuring to writing code, we have walked through the process of building a Hadoop cluster step by step. We hope that through this article, you can get some inspiration and help for big data application development, especially MapReduce programming under the Hadoop framework. The world of big data is huge and complex, but every little practice will take you one step closer to really mastering this technology.
I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟