
Deeper Understanding of HDFS Error Recovery


Let's look at HDFS from a dynamic perspective.

First, recall how HDFS writes a file.

Data is written to HDFS through a pipeline of DataNodes, and for reads the client picks one of the DataNodes that holds a replica of the block.
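For reference, here is a minimal sketch of that client-side write path using the public FileSystem API; the NameNode URI and file path below are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleHdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this usually comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Behind this single create/write/close sequence, the client obtains a lease
        // for the file and streams the data through a pipeline of DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.writeBytes("hello hdfs\n");
        } // close() finalizes the last block and releases the lease
    }
}
```

With that write path in mind, consider the following two scenarios.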

  • An HBase RegionServer (RS) writes its WAL to HDFS. If the RS dies, its regions are migrated to another RS, which replays the WAL to restore the previous state. But if the pipeline writing the WAL had not completed when the RS died, the WAL data may differ across DataNodes. How does HDFS ensure that the regions can still be restored to a correct state after the migration?
  • A streaming job writes to HDFS and a DataNode in the middle of the pipeline dies. How does HDFS keep the streaming program from failing and let it continue to run?

This leads to a very important HDFS feature: write error recovery. To understand it, you need three related concepts: lease recovery, block recovery, and pipeline recovery. The fault tolerance of HDFS writes is guaranteed by these three mechanisms; they are interrelated and all revolve around writing files.

  • Lease recovery: before a client can write to an HDFS file, it must acquire a lease, which is essentially a lock. If the client wishes to keep writing, it must renew the lease within an agreed-upon period. If the lease is not renewed, or the client holding it dies, the lease expires. When this happens, HDFS closes the file on behalf of the client and releases the lease so that other clients can write to the file. This process is called lease recovery.
  • Block recovery: if the last block of a file being written has not been propagated to all DataNodes in the pipeline, different nodes may have written different amounts of data when lease recovery happens. Before lease recovery closes the file, a process is needed to ensure that all replicas of the last block have the same length. This process is called block recovery. Block recovery is triggered only during lease recovery, and lease recovery triggers it only when the last block of the file is not in the COMPLETE state.
  • Pipeline recovery: during a write, some DataNodes in the pipeline may fail. When this happens, the write should not simply fail; instead, HDFS tries to recover from the error so that the pipeline keeps running and the client can continue writing to the file. The mechanism that recovers from pipeline errors is called pipeline recovery.

We know that writing a file means writing its blocks. The ultimate goal of the error recovery above is to ensure that every block of the client's file is completely written to all of the target DataNodes. So we need to look at blocks in more detail and understand their concepts and semantics.

First, a block stored on a DataNode is called a replica, to distinguish it from the block tracked by the NameNode. A replica has the following states, which correspond to the dynamic process of writing the replica to the DataNode.

  • FINALIZED: when a replica is in this state, writing to it is finished and its data is "frozen" (the length is fixed), unless the replica is reopened for append. All finalized replicas of a block that carry the same generation stamp (GS) should have the same data. The GS of a finalized replica may be incremented as a result of recovery.
  • RBW (Replica Being Written): the state of any replica that is being written, whether the file was created for write or reopened for append. An RBW replica is always the last block of an open file. Data is still being written to it and it has not yet been finalized. The data of an RBW replica (not necessarily all of it) is visible to reading clients. If a failure occurs, an attempt is made to preserve the data in the RBW replica.
  • RWR (Replica Waiting to be Recovered): if a DataNode dies and restarts, all of its RBW replicas become RWR. An RWR replica is either out of date and therefore discarded, or it participates in the block recovery triggered by lease recovery.
  • RUR (Replica Under Recovery): a non-TEMPORARY replica changes to RUR when it participates in lease recovery.
  • TEMPORARY: a temporary replica created for block replication, initiated by the replication monitor or the cluster balancer. It is similar to an RBW replica, except that its data is invisible to all reader clients. If block replication fails, a TEMPORARY replica is deleted.

Those are the replica states on a DataNode. Now compare them with the block states on the NameNode.

  • UNDER_CONSTRUCTION: the state of a block while it is being written. An UNDER_CONSTRUCTION block is the last block of an open file; its length and GS are still mutable, and its data (not necessarily all of it) is visible to readers. An UNDER_CONSTRUCTION block in the NameNode keeps track of the write pipeline (the locations of its valid RBW replicas) and the locations of its RWR replicas.
  • UNDER_RECOVERY: if the last block of a file is in the UNDER_CONSTRUCTION state when the corresponding client's lease expires, block recovery begins and the block changes to UNDER_RECOVERY.
  • COMMITTED: the block's data and GS are no longer mutable (unless it is reopened for append), but fewer than the configured minimum number of DataNodes have reported FINALIZED replicas with the matching GS/length. To serve read requests, a COMMITTED block must keep track of the locations of its RBW replicas, as well as the GS and length of its FINALIZED replicas. An UNDER_CONSTRUCTION block changes to COMMITTED when the client asks the NameNode to add a new block to the file or to close the file. If the last or the second-to-last block is in the COMMITTED state, the file cannot be closed and the client must retry.
  • COMPLETE: a COMMITTED block changes to COMPLETE when the NameNode sees that the number of FINALIZED replicas matching the GS/length has reached the configured minimum replication. A file can be closed only when all of its blocks are COMPLETE. A block may be forced into the COMPLETE state even if it does not yet have the minimum number of replicas, for example when a client asks for a new block before the previous block is complete.

DataNodes persist replica states to disk, but the NameNode does not persist block states. When the NameNode restarts, it changes the state of the last block of every previously open file to UNDER_CONSTRUCTION and the state of all other blocks to COMPLETE.

Simplified state transitions for replicas and blocks are shown in the following two figures.

[Figure: simplified replica state transitions]

[Figure: simplified block state transitions]
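As a textual complement to the figures, here is a rough, illustrative sketch of the replica-state transitions described in the text above. This is not the actual HDFS state machine; the class, enum, and method are hypothetical and only encode transitions mentioned in the prose.

```java
import java.util.EnumSet;

// Hypothetical helper that encodes only the replica-state transitions
// described in the text above; it is not HDFS's own implementation.
public class ReplicaStateSketch {

    enum ReplicaState { FINALIZED, RBW, RWR, RUR, TEMPORARY }

    static EnumSet<ReplicaState> nextStates(ReplicaState s) {
        switch (s) {
            case RBW:
                // Write completes -> FINALIZED; DataNode restarts -> RWR;
                // lease/block recovery -> RUR.
                return EnumSet.of(ReplicaState.FINALIZED, ReplicaState.RWR, ReplicaState.RUR);
            case RWR:
                // Either discarded as outdated, or pulled into block recovery.
                return EnumSet.of(ReplicaState.RUR);
            case RUR:
                // Once recovery finishes, the replica is finalized.
                return EnumSet.of(ReplicaState.FINALIZED);
            case TEMPORARY:
                // Successful replication finalizes it; failure deletes it.
                return EnumSet.of(ReplicaState.FINALIZED);
            case FINALIZED:
            default:
                // A finalized replica can re-enter recovery (e.g. lease recovery
                // or append), which may bump its GS.
                return EnumSet.of(ReplicaState.RUR);
        }
    }
}
```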

In the replica/block state transitions above, an important piece of evidence is the generation stamp (GS).

The GS is an 8-byte, monotonically increasing number maintained per block and persisted by the NameNode. For blocks and replicas, the GS mainly serves the following purposes:

  • Detecting stale replicas of a block: that is, when the replica GS is older than the block GS, which can happen when, for example, an append operation was somehow skipped on that replica.
  • Detecting outdated replicas on a DataNode, for example a DataNode that was dead for a long time and then rejoined the cluster.
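For example, a client can observe the GS of each block of a file through the HDFS-specific input stream class. Below is a minimal sketch; the NameNode URI and path are placeholders, and HdfsDataInputStream is an HDFS client-internal class rather than stable public API.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

public class PrintGenerationStamps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The cast works when the underlying FileSystem is a DistributedFileSystem,
        // so the opened stream is the HDFS-specific type.
        try (HdfsDataInputStream in =
                 (HdfsDataInputStream) fs.open(new Path("/tmp/example.txt"))) {
            for (LocatedBlock lb : in.getAllBlocks()) {
                // Print each block's name together with its generation stamp.
                System.out.println(lb.getBlock().getBlockName()
                    + " GS=" + lb.getBlock().getGenerationStamp());
            }
        }
    }
}
```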

A new GS needs to be generated when any of the following occurs:

  • A new file is created
  • A client opens an existing file for append or truncate
  • A client encounters an error while writing data to the DataNode(s) and requests a new GS
  • The NameNode initiates lease recovery for a file

Next, let's look at lease recovery. Block recovery is triggered by lease recovery and is included in the lease recovery process.

The lease recovery process is triggered on the NameNode in two scenarios: when the lease monitor thread detects that a lease's hard limit has expired, or when a client tries to take over the lease from another client after the soft limit expires. Lease recovery examines each file the lease holder had open for write; if the last block of a file is not in the COMPLETE state, block recovery is performed on that file, and the file is then closed.

The following is the lease recovery process for a given file f. When a client dies abnormally, the same process also happens for every file the client had open for write.

  1. Get the DataNodes that contain the last block of f.
  2. Assign one of those DataNodes as the primary DataNode p.
  3. p obtains a new GS from the NameNode.
  4. p obtains information about the block from each DataNode.
  5. p computes the minimum length of the block.
  6. p updates the replicas on the DataNodes that have a valid GS with the new GS and the minimum block length.
  7. p notifies the NameNode of the result of the update.
  8. The NameNode updates the BlockInfo.
  9. NameNode deletes the lease for f (other writers can now get leases to write to f).
  10. NameNode commits changes to edit log.

Steps 3 through 7 are the block recovery portion of the recovery process.
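To make steps 4 through 6 concrete, here is an illustrative sketch of how the primary DataNode could pick the sync length: take the minimum of the replica lengths reported by the DataNodes and apply it, together with the new GS, to every replica whose GS is still valid. This is not the actual HDFS code; the class and field names are hypothetical.

```java
import java.util.List;

// Hypothetical model of steps 4-6 of the block recovery portion above.
public class BlockRecoverySketch {

    // Minimal stand-in for the per-replica info the primary collects in step 4.
    static class ReplicaInfo {
        final String dataNode;
        final long numBytes;
        final long generationStamp;
        ReplicaInfo(String dataNode, long numBytes, long generationStamp) {
            this.dataNode = dataNode;
            this.numBytes = numBytes;
            this.generationStamp = generationStamp;
        }
    }

    /** Step 5: sync to the shortest replica so that all copies end up the same length. */
    static long computeSyncLength(List<ReplicaInfo> replicas) {
        return replicas.stream().mapToLong(r -> r.numBytes).min().orElse(0L);
    }

    /** Step 6: only replicas whose GS is not older than the block's GS take part in the sync. */
    static boolean hasValidGs(ReplicaInfo r, long blockGs) {
        return r.generationStamp >= blockGs;
    }
}
```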

Sometimes it is necessary to force recovery of a file's lease before the hard limit expires. This can be done with the following command:

hdfs debug recoverLease -path <path> [-retries <num-retries>]
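The same thing can be done programmatically: DistributedFileSystem exposes recoverLease(Path), which HBase, for example, uses before replaying WAL files. A minimal sketch follows; the NameNode URI, path, and retry interval are illustrative.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ForceLeaseRecovery {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem)
            FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/tmp/some-open-file"); // placeholder path
        // recoverLease() returns true once the file has been closed; it may take
        // a few attempts while block recovery completes on the DataNodes.
        boolean closed = dfs.recoverLease(file);
        while (!closed) {
            Thread.sleep(1000L);                     // illustrative retry interval
            closed = dfs.recoverLease(file);
        }
        System.out.println("Lease recovered, file closed: " + file);
    }
}
```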

Moving from the inside out, let's now look at the outer layer: pipeline recovery.

First, look at the flow of the write pipeline.

When an HDFS client writes a file, the data is written as sequential blocks. To write or construct a block, HDFS breaks it into packets (not actual network packets but messages; "packet" refers to the class that embodies these messages) and propagates them to the DataNodes in the write pipeline, as shown below.

[Figure: the HDFS write pipeline]

The write pipeline is divided into three stages:

  1. Pipeline setup. The client sends a Write_Block request down the pipeline, and the last DataNode sends an acknowledgement back. After receiving the acknowledgement, the pipeline is ready for writing.
  2. Data streaming. Data is sent through the pipeline in packets. The client buffers data until a packet is full, then sends the packet down the pipeline. If the client calls hflush(), the current packet is sent even if it is not full, and the next packet is not sent until the acknowledgement of the hflush'ed packet has been received (see the sketch after this list).
  3. Close (finalize the replicas and shut down the pipeline). The client waits until all packets have been acknowledged and then sends a close request. All DataNodes in the pipeline change their replicas to the FINALIZED state and report to the NameNode. Once at least the configured minimum number of DataNodes have reported FINALIZED replicas of the block, the NameNode changes the block's state to COMPLETE.
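The data-streaming stage is the one application code touches most directly, through hflush(). A minimal sketch (NameNode URI and path are placeholders): after hflush() returns, the data written so far has been acknowledged by the DataNodes in the pipeline and is visible to new readers, even though the block is not yet finalized.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-like-file"))) {
            out.write("record 1\n".getBytes(StandardCharsets.UTF_8));
            // Push the partially filled packet down the pipeline and wait for its
            // acknowledgement; new readers can now see "record 1".
            out.hflush();

            out.write("record 2\n".getBytes(StandardCharsets.UTF_8));
            out.hflush();
        } // close() sends the remaining packets and finalizes the replicas
    }
}
```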

Pipeline recovery is initiated when one or more DataNodes in the pipeline encounter an error in any of the three phases of writing a block.

Recovering from pipeline setup failure

  1. If the pipeline was created for a new block, the client abandons the block and asks the NameNode for a new block and a new list of DataNodes. The pipeline is then re-initialized for the new block.
  2. If the pipeline was created to append to a block, the client rebuilds the pipeline with the remaining DataNodes and bumps the block's GS.

Recovering from data streaming failure

  1. When a DataNode in the pipeline detects an error (for example, a checksum error or a failure to write to disk), it removes itself from the pipeline by closing all of its TCP/IP connections.
  2. When the client detects the failure, it stops sending data to the pipeline and rebuilds a new pipeline from the remaining DataNodes. All replicas of the block are then bumped to a new GS.
  3. The client continues sending packets with the new GS. If a DataNode has already received some of the data, it simply ignores the corresponding packets and passes them downstream in the pipeline.

Recovering from close failure

When the client detects a failure during the close stage, it rebuilds the pipeline with the remaining DataNodes. Each DataNode bumps the replica's GS and finalizes the replica if it is not yet finalized.

When a DataNode goes bad, it removes itself from the pipeline. During pipeline recovery, the client may need to rebuild a new pipeline using the remaining DataNodes. (It may or may not replace the bad DataNode with a new one, depending on the replacement policy described below.) The replication monitor is responsible for re-replicating the block to meet the configured replication factor.

DataNode replacement policy on failure

There are four configurable policies that govern whether to add a new DataNode to replace a failed one when the pipeline is rebuilt with the remaining DataNodes:

  1. DISABLE: disables DataNode replacement and throws an error (at the server); at the client this behaves like NEVER.
  2. NEVER: never replace a DataNode when a pipeline fails (generally not recommended).
  3. DEFAULT: replace based on the following conditions (see the sketch after this list):
    a. Let r be the configured replication factor.
    b. Let n be the number of DataNodes that currently hold replica data.
    c. Add a new DataNode only if r >= 3 and at least one of the following holds:
    • floor(r/2) >= n
    • r > n and the block is hflushed/appended.
  4. ALWAYS: always add a new DataNode when an existing DataNode fails; the write fails if a DataNode cannot be replaced.
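As referenced in the DEFAULT item above, the replacement condition can be expressed as a small predicate. This is an illustrative restatement of the rules listed above, not the actual HDFS implementation.

```java
// Illustrative predicate for the DEFAULT replacement policy described above.
public class DefaultReplacementPolicySketch {

    /**
     * @param r configured replication factor
     * @param n DataNodes that currently hold replica data
     * @param hflushedOrAppended whether the block has been hflushed or appended
     */
    static boolean shouldAddReplacement(int r, int n, boolean hflushedOrAppended) {
        if (r < 3) {
            return false;                       // the policy only kicks in for r >= 3
        }
        return (r / 2) >= n                     // floor(r/2) >= n
            || (r > n && hflushedOrAppended);   // or a shrunk, hflushed/appended pipeline
    }
}
```

For example, with r = 3 and one failed DataNode (n = 2), a replacement is added only if the block has been hflushed or appended.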

The switch for the replacement policy is dfs.client.block.write.replace-datanode-on-failure.enable; when it is false, all policies are disabled.

When it is true, the policy to use is selected through dfs.client.block.write.replace-datanode-on-failure.policy, and the default policy is DEFAULT.

When using DEFAULT or ALWAYS, if only one DataNode in the pipeline succeeds, recovery will never succeed and the client will not be able to perform the write until it times out. This can be addressed with dfs.client.block.write.replace-datanode-on-failure.best-effort, whose default is false. With the default setting, the client keeps retrying until the configured policy is satisfied. When the property is set to true, the client is allowed to keep writing even when the policy cannot be satisfied (for example, only one DataNode in the pipeline succeeded, which is fewer than the policy requires).
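These client-side settings can be placed in hdfs-site.xml or set on the client's Configuration object. A minimal sketch follows; the values shown are the defaults described above, and the NameNode URI is a placeholder.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplacementPolicyConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Master switch: false disables DataNode replacement entirely.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
        // NEVER, DEFAULT, or ALWAYS; only consulted when the switch above is true.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // true lets the client keep writing even if the policy cannot be satisfied.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", false);

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // ... writes performed through fs use these pipeline-recovery settings ...
        fs.close();
    }
}
```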

Lease recovery, block recovery, and pipeline recovery are essential to HDFS fault tolerance. Together, they ensure that writes to HDFS are durable and consistent, even in the presence of network and node failures.