Introduction

Imagine a giant jigsaw puzzle that is too big to fit on one table. To solve this, you spread the puzzle pieces across several tables in different rooms. You keep a notebook that records which pieces are on which tables, so you can easily find any piece you need. To make sure pieces don’t get lost, you make extra copies of each piece and place them on different tables. When you want to work on the puzzle, you check your notebook to find where the pieces are and gather them. This system ensures that you can always find and assemble the puzzle, even if some tables are unavailable. This is how Hadoop HDFS stores and manages large amounts of data across multiple servers. 

HDFS Architecture and Core Components 

HDFS divides data into blocks, typically 128 MB or 256 MB in size, and replicates these blocks across multiple DataNodes within a cluster. This division and replication are pivotal for achieving fault tolerance and scalability. The two main components of HDFS are the NameNode and the DataNodes, each playing a crucial role in the system's operation. 
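The block arithmetic can be sketched in a few lines. This is an illustrative calculation, not the Hadoop API; the constants match the common HDFS defaults (128 MB blocks, replication factor 3):

```python
# Illustrative sketch: how a file maps onto fixed-size HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the common HDFS default
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (start, end) byte ranges of the blocks for a file."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                      # 3
print(blocks[-1][1] - blocks[-1][0])    # size of the final, partial block
```

Note that the last block only occupies as much space as it needs; with replication, each of these blocks would then be stored on three different DataNodes.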

NameNode 

The NameNode is the master server in HDFS, responsible for managing the file system namespace and controlling access to files by clients. It handles the following key functions: 

  • Metadata Management: The NameNode keeps the entire file system namespace — the directory tree, file names, and permissions — in memory for fast access, and persists it to disk as an fsimage snapshot plus an edit log of subsequent changes. Block locations are held in memory only and are rebuilt from DataNode block reports. 
  • File Operations Coordination: It coordinates file operations such as reading, writing, and deleting files, ensuring data consistency and integrity. 
  • Single Point of Failure: Given its critical role, a lone NameNode is a single point of failure in HDFS. Ensuring its high availability is therefore paramount; since Hadoop 2, HDFS High Availability supports an active/standby NameNode pair to mitigate this risk. 
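The two kinds of NameNode state can be pictured with a toy data structure. The names below are invented for illustration; they are not Hadoop classes:

```python
# Conceptual sketch of NameNode state (illustrative, not Hadoop code).
# 1) The namespace: persisted to disk via fsimage + edit log.
namespace = {
    "/data/logs/app.log": {"permissions": "rw-r--r--",
                           "blocks": ["blk_1", "blk_2"]},
}
# 2) Block locations: memory only, rebuilt from DataNode block reports.
block_locations = {
    "blk_1": ["datanode-1", "datanode-3", "datanode-7"],
    "blk_2": ["datanode-2", "datanode-3", "datanode-5"],
}

def locate(path):
    """A client read: resolve a path to blocks, then blocks to DataNodes."""
    return [block_locations[b] for b in namespace[path]["blocks"]]

print(locate("/data/logs/app.log"))
```

This split explains why a restarting NameNode must wait for block reports before serving reads: the namespace comes back from disk, but block locations must be re-learned from the DataNodes.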

DataNode 

DataNodes are the worker nodes in HDFS, tasked with the storage of actual data blocks. They perform the following functions: 

  • Data Storage: DataNodes manage the storage devices attached to them and store data blocks as directed by the NameNode. 
  • Health Monitoring: They periodically send heartbeats and block reports to the NameNode to confirm their operational status and report on the data blocks they hold. 
  • Replication and Recovery: DataNodes handle data block replication and recovery processes, ensuring data availability and fault tolerance across the cluster. 

Multiple DataNodes collectively form a distributed storage layer, providing the scalability and robustness required for handling large datasets in Hadoop. 
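The heartbeat mechanism can be simulated in miniature. The logic below is a simplification; in real HDFS, DataNodes heartbeat every 3 seconds by default and are presumed dead after roughly ten minutes of silence, after which their blocks are re-replicated:

```python
# Toy simulation of NameNode-side heartbeat tracking (simplified).
last_heartbeat = {}  # DataNode name -> timestamp of last heartbeat

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def dead_nodes(now, timeout=600):
    """Nodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

record_heartbeat("datanode-1", now=0)
record_heartbeat("datanode-2", now=500)
# At t=700, datanode-1 has been silent for 700 s (> 600): presumed dead,
# so its blocks would be re-replicated from the surviving copies.
print(dead_nodes(now=700))   # ['datanode-1']
```

The key design point is that failure detection is passive: the NameNode never polls DataNodes; it simply notices when heartbeats stop arriving.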

The Role of the Secondary NameNode 

While not a direct failover for the NameNode, the Secondary NameNode plays a vital supportive role in HDFS maintenance. Its primary function is to perform periodic checkpoints of the file system metadata. This involves merging the edits log with the fsimage to create a new, updated version of the fsimage, thus reducing the startup time of the NameNode and enhancing its reliability. 
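The merge of the edit log into the fsimage can be modeled as replaying a list of operations over a snapshot. The data structures below are illustrative stand-ins for the real on-disk formats:

```python
# Toy model of a checkpoint: the fsimage is a snapshot, and the edit
# log is a list of operations replayed on top of it.
fsimage = {"/a": "file", "/b": "file"}              # last checkpoint
edit_log = [("create", "/c"), ("delete", "/b")]     # changes since then

def checkpoint(image, edits):
    """Merge the edit log into the fsimage, yielding a new fsimage."""
    new_image = dict(image)
    for op, path in edits:
        if op == "create":
            new_image[path] = "file"
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

new_fsimage = checkpoint(fsimage, edit_log)
print(sorted(new_fsimage))   # ['/a', '/c']
# The merged edits can now be discarded; a NameNode restart only has
# to replay whatever accumulated after this checkpoint.
```

This is exactly why checkpointing shortens NameNode startup: without it, the edit log grows without bound and every restart must replay it in full.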

Checkpointing Process 

Checkpointing is a critical process for maintaining the efficiency and reliability of HDFS. Here's how it typically unfolds: 

  1. Checkpoint Triggering: A checkpoint is triggered when a configurable time interval elapses or a configurable number of transactions accumulates in the edit log; an administrator can also trigger one manually. Both thresholds are set in the Hadoop configuration. 

  2. Rolling the Edit Log: Upon triggering, the NameNode rolls its edit log: new filesystem changes are recorded in a fresh edits file, while the closed segment is handed off for checkpointing. 

  3. Fetching the Metadata: The Secondary NameNode downloads the current fsimage and the closed edit log segments from the NameNode. 

  4. Merging Edit Logs: The Secondary NameNode loads the fsimage into memory and replays the edit log transactions on top of it, producing a new fsimage that reflects the most recent changes. 

  5. Upload and Edit Log Truncation: The new fsimage is uploaded back to the NameNode, which replaces its old fsimage and purges the edit log segments that have been merged, keeping the log small and operations efficient. 

  6. Backup and Recovery: The checkpointed fsimage and the remaining edit log are stored on the NameNode's local disk or a shared storage system. In the event of a failure or restart, the NameNode loads the latest fsimage and replays only the short edit log on top of it, restoring the namespace to a consistent state quickly. 
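The trigger logic in step 1 can be sketched as follows. The property names in the comments are the real hdfs-site.xml keys and the values shown are their documented defaults; the decision function itself is a simplification of what Hadoop actually does:

```python
# Sketch of the checkpoint-trigger decision (simplified).
CHECKPOINT_PERIOD = 3600       # dfs.namenode.checkpoint.period (seconds)
CHECKPOINT_TXNS = 1_000_000    # dfs.namenode.checkpoint.txns

def should_checkpoint(secs_since_last, txns_since_last):
    """Trigger when either the time or the transaction threshold is hit."""
    return (secs_since_last >= CHECKPOINT_PERIOD
            or txns_since_last >= CHECKPOINT_TXNS)

print(should_checkpoint(120, 5000))        # False: neither threshold reached
print(should_checkpoint(4000, 5000))       # True: over an hour has elapsed
print(should_checkpoint(120, 2_000_000))   # True: transaction threshold hit
```

Tuning these two values trades checkpointing overhead against recovery time: longer intervals mean less background work but a longer edit log to replay after a failure.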

Conclusion 

HDFS, with its sophisticated architecture and processes, provides a scalable, fault-tolerant storage solution for large datasets in Hadoop. The NameNode and DataNode architecture ensures efficient metadata management and data storage, while the Secondary NameNode's checkpointing process enhances reliability and recovery capabilities. Together, these components form a robust framework that underpins the powerful data processing capabilities of Hadoop, making it an indispensable tool for handling big data in a distributed environment. 

 
