HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and simple to expand. HDFS (Hadoop Distributed File System) comes bundled with Hadoop.
When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system. HDFS is one such system.
In this tutorial, we will learn about the HDFS architecture, read and write operations, the Java API for accessing HDFS, and the command-line interface.
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB (128 MB in Hadoop 2.x and later). For example, with a 64 MB block size, a 200 MB file is stored as three 64 MB blocks plus one 8 MB block.
HDFS operates on a concept of data replication, wherein multiple replicas of each data block are created and distributed across nodes throughout the cluster, so that data remains available in the event of node failure.
Did you know? A file in HDFS that is smaller than a single block does not occupy the block's full storage.
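To make blocks and replicas concrete, here is a minimal sketch (not part of the original tutorial; the class name ListBlocks and the NameNode address hdfs://localhost:9000 are our assumptions) that asks the NameNode for the block locations of a file:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    // hdfs://localhost:9000 is an assumed NameNode address
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // This is a metadata query answered by the NameNode; no file data is read
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the DataNodes holding its replicas
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}

Each printed line corresponds to one block, and the hosts list shows the DataNodes holding that block's replicas.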
A data read request is served by HDFS through the NameNode and the DataNodes. Let's call the reader a 'client'. In a file read operation, the NameNode supplies the locations of the blocks that make up the file, and the client then streams the block data directly from the DataNodes.
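As a minimal sketch of the client side of this flow (the class name FileSystemCat and the NameNode address hdfs://localhost:9000 are assumptions; the file path comes from the command line), the FileSystem API hides the whole exchange behind a single open() call:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    // FileSystem.get() connects to the NameNode named in the URI
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
    InputStream in = null;
    try {
      // open() fetches block locations from the NameNode;
      // the returned stream reads block data from the DataNodes
      in = fs.open(new Path(args[0]));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}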
In this section, we will understand how data files are written into HDFS. As with reads, the client asks the NameNode to allocate blocks and then streams the data to a pipeline of DataNodes, which replicate each block; the sketch below illustrates the client side of this flow.
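The original section shows no write code; below is a minimal write sketch under the same assumptions as before (a NameNode at hdfs://localhost:9000; the target path /example/hello.txt is hypothetical):

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
    // create() asks the NameNode to allocate blocks; the data written to
    // the stream is sent to a pipeline of DataNodes for replication
    FSDataOutputStream out = fs.create(new Path("/example/hello.txt"));
    try {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    } finally {
      out.close();
    }
  }
}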
In this section, we try to understand the Java interface used for accessing Hadoop's file system.
In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package org.apache.hadoop.fs contains classes useful for manipulating files in Hadoop's filesystem. These operations include open, read, write, and close. The file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.
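As a quick illustration of that genericity (a sketch; the class name SchemeDemo and both URIs are our choices for illustration), the same call that connects to HDFS can target the local filesystem just by changing the URI scheme:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same API; the backing filesystem is selected by the URI scheme
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs  = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
    System.out.println(local.getClass().getSimpleName()); // typically LocalFileSystem
    System.out.println(hdfs.getClass().getSimpleName());  // typically DistributedFileSystem
  }
}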
Reading a file from HDFS programmatically
The java.net.URL class is used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the setURLStreamHandlerFactory method on the URL class and passing it an instance of FsUrlStreamHandlerFactory. This method can be executed only once per JVM, hence it is enclosed in a static block.
Example code:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  static {
    // Register Hadoop's handler for hdfs:// URLs; allowed only once per JVM
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      // Open the file named on the command line and copy it to stdout
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command-line argument.
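To run it (a sketch; the NameNode address is an assumption), compile the class, put it on the Hadoop classpath, and invoke it through the hadoop launcher:

export HADOOP_CLASSPATH=.
$HADOOP_HOME/bin/hadoop URLCat hdfs://localhost:9000/temp.txt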
Accessing HDFS using the command-line interface
This is one of the simplest ways to interact with HDFS. The command-line interface has support for filesystem operations like reading files, creating directories, moving files, deleting data, and listing directories.
We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command. Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.
Some of the widely used commands are listed below, along with details of each one.
1. Copy a file from the local filesystem to HDFS
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /
This command copies the file temp.txt from the local filesystem to the root directory of HDFS.
2. We can list files present in a directory using -ls
$HADOOP_HOME/bin/hdfs dfs -ls /
We can see the file 'temp.txt' (copied earlier) listed under the '/' directory.
3. Command to copy a file to the local filesystem from HDFS
$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt
We can see temp.txt copied to the local filesystem (the current directory, since no destination was given).
4. Command to create a new directory
$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory
Check whether the directory was created. By now, you should know how to do it ;-)