
HDFS

Posted by database-wiki on June 12, 2016

What is Big Data?

A huge volume of structured, semi-structured and unstructured data that has the potential to be mined for information is considered big data.

Big data is not a technology; it is an evolving problem.

The problem consists of three components:

a.) Volume – the extreme amount of data generated in a short span (for example, hundreds of GB per hour).

b.) Variety – the wide variety of types of data.

c.) Velocity – the speed at which the data is generated and must be processed.

Although big data doesn’t refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data, much of which cannot be integrated easily.

Because big data takes too much time and costs too much money to load into a traditional relational database for analysis, new approaches to storing and analyzing data have emerged that rely less on data schema and data quality. Instead, raw data with extended metadata is aggregated in a data lake and machine learning and artificial intelligence (AI) programs use complex algorithms to look for repeatable patterns.

The analysis of large data sets in real-time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.

What is Hadoop?

Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.

What components make up Hadoop 2.0?

[Figure: Hadoop 2.0 components (HDFS, YARN, MapReduce)]

Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines (on the cluster) in Hadoop without prior organization.

YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator.)

MapReduce – a software programming model for processing large sets of data in parallel.

What is HDFS?

HDFS is Hadoop Distributed File System, which is responsible for storing data on the cluster in Hadoop. Files in HDFS are split into blocks before they are stored on the cluster.

The default size of a block is 128MB. The blocks belonging to one file are then stored on different nodes. The blocks are also replicated to ensure high reliability.

What are Hadoop Daemons?

Hadoop daemons are processes that run in the background. Hadoop has three such daemons: the NameNode, the DataNode and the Secondary NameNode. Each daemon runs separately in its own JVM.

[Figure: Hadoop daemons (NameNode, DataNodes, Secondary NameNode)]

Note: In Hadoop 2.0 there is no JobTracker or TaskTracker. When a client submits a job, YARN controls the execution of the job.

  • NameNode – the master node, responsible for storing the metadata for all files and directories. It knows which blocks make up a file and where those blocks are located in the cluster. (The role of the master node is to assign tasks to the slaves.)
  • DataNode – a slave node that stores the actual data. It reports the blocks it contains to the NameNode periodically, and the NameNode uses these block reports for validation. The role of a slave node is to process the tasks assigned by the NameNode.

The NameNode should run at all times; a failure makes the cluster inaccessible, as there would be no information on where the files are located in the cluster. The DataNodes report their status to the NameNode using heartbeats. The default heartbeat interval is 3 seconds, and if a DataNode does not send a heartbeat for 600 seconds (10 minutes), the NameNode considers that DataNode dead.

  • Secondary NameNode – file system changes in the NameNode are not applied directly to the FSImage; instead they are recorded in the edit log, which is written from the in-memory file system metadata. The Secondary NameNode periodically merges the changes in the edit log into the FSImage so that the edit log does not grow too large. It also keeps a copy of the FSImage, which can be used in case of a NameNode failure. This merge activity happens every 60 minutes or when the combined size of the edit log exceeds 256MB.

Blocks are used:

For horizontal scalability.

For parallel processing, as the blocks are spread widely across DataNodes.

For avoiding space constraints.

Replication is used:

To avoid data loss. By default, 3 copies of the same block are stored on 3 different DataNodes.

To handle network failures, multiple copies are needed.
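
To see how HDFS has split a file into blocks and where the replicas are placed, the NameNode metadata can be queried through the Java FileSystem API. The following is only a minimal sketch, assuming the crimes_2015-16.csv file used in the write example below and a client whose classpath contains the cluster configuration files:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path.
        Path file = new Path("/user/balajimani/crimes_2015-16.csv");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // One BlockLocation per block; getHosts() lists the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}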

Different operations in HDFS 2.0:-

Writing a 2TB file on the cluster:

[Figure: anatomy of an HDFS file write]

  1. The HDFS client requests to write the crimes_2015-16.csv file on the Hadoop cluster by calling the ‘create()’ method of the DistributedFileSystem object.
  2. The DistributedFileSystem object uses an RPC call to connect to the NameNode and initiate the file creation. No blocks are allocated at this stage. The NameNode checks whether the file crimes_2015-16.csv already exists in the Hadoop cluster and whether the HDFS client has the required permissions to create a new file. If the file already exists or the client does not have sufficient permission, an IOException is thrown to the client. Otherwise the operation succeeds and the NameNode creates a new record for the file.
  3. Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS.
  4. FSDataOutputStream wraps a DFSOutputStream object, which handles communication with the DataNodes and the NameNode. As the client keeps writing data, DFSOutputStream keeps creating packets from it. These packets are enqueued into a queue called the DataQueue.
  5. A component called the DataStreamer consumes the DataQueue. The DataStreamer also asks the NameNode to allocate new blocks, and the NameNode responds with the DataNodes to be used for replication.
  6. The replication process now starts by building a pipeline of DataNodes. In our case the replication factor is 3, so there are 3 DataNodes in the pipeline.
  7. The DataStreamer streams the packets to the first DataNode, DN-1, in the pipeline. Packets are generally 4MB in size.
  8. Once a packet is written to DN-1, it is forwarded to DN-2 and then to DN-4.
  9. Another queue, the ‘Ack Queue’, is maintained by DFSOutputStream to hold packets that are waiting for acknowledgement from the DataNodes.
  10. Once a packet is replicated successfully, an acknowledgement is sent back to DFSOutputStream.
  11. DFSOutputStream then removes the packet from the ‘Ack Queue’. In the event of a DataNode failure, packets from the ‘Ack Queue’ are used to reinitiate the operation.
  12. When the HDFS client has finished writing, it calls the close() method, which flushes the remaining data packets to the pipeline and waits for acknowledgements.
  13. The above process is repeated until the end of the file is reached. Once the final acknowledgement for a block is received, the successful write is communicated to the NameNode. The NameNode stores the HDFS metadata (block locations, permissions, etc.) in the FsImage file on disk, the in-memory file system metadata and the edit log. Note that all the blocks are written serially.
  14. The Secondary NameNode pulls the edit log at regular intervals and applies it to its copy of the FsImage. Once it has a new FsImage, it copies it back to the NameNode. The NameNode uses this FsImage on the next restart to reduce startup time.
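
For reference, the same write path can be driven from a short Java client. This is only a sketch under the assumption that fs.defaultFS points at the cluster and that a local crimes_2015-16.csv file exists; create() performs the NameNode RPC from step 2, and close() triggers the final flush and acknowledgements from step 12.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns a DistributedFileSystem when fs.defaultFS is an hdfs:// URI.
        FileSystem fs = FileSystem.get(conf);

        // create() contacts the NameNode; no blocks are allocated yet.
        FSDataOutputStream out = fs.create(new Path("/user/balajimani/crimes_2015-16.csv"));

        // Hypothetical local source file.
        BufferedReader in = new BufferedReader(new FileReader("crimes_2015-16.csv"));
        String line;
        while ((line = in.readLine()) != null) {
            // DFSOutputStream packages these writes into packets on the DataQueue.
            out.writeBytes(line + "\n");
        }
        in.close();

        // Flushes the remaining packets and waits for the pipeline acknowledgements.
        out.close();
        fs.close();
    }
}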

Reading a 2TB file from the cluster:

[Figure: anatomy of an HDFS file read]

  1. The user provides the filename crimes_2015-16.csv to the HDFS client. The HDFS client initiates the read request by calling the ‘open()’ method of the DistributedFileSystem object.
  2. The DistributedFileSystem object connects to the NameNode using RPC and checks whether the file exists and whether the HDFS client has access to read it. If so, the NameNode provides metadata such as the locations of the blocks of the file.
  3. In response to this metadata request, the addresses of the DataNodes holding copies of the blocks are returned. The NameNode provides the locations of the first few blocks only.
  4. Once the DataNode addresses are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of the interactions with the DataNodes and the NameNode. In step 4 of the diagram above, the client invokes the ‘read()’ method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
  5. Data is read in the form of streams as the client invokes ‘read()’ repeatedly. This read() operation continues until it reaches the end of the block, which is block b0 in our case. Once the end of block b0 is reached, the HDFS client performs a CRC32 checksum validation of the block. If the checksum validation succeeds, DFSInputStream closes the connection. If the checksum fails, meaning the block is corrupt, the HDFS client reads block b0 from another replica.
  6. DFSInputStream then moves on to the next block on the DataNodes.
  7. The process is repeated until the last block, b15 on DataNode DN-4, has been read by DFSInputStream.
  8. Once the client has finished reading, it calls the close() method.
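
The read path looks similar from the client side. A minimal sketch, assuming the file written above already exists in HDFS:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for the locations of the first blocks.
        FSDataInputStream in = fs.open(new Path("/user/balajimani/crimes_2015-16.csv"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));

        long records = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // DFSInputStream moves from block to block (and to another replica on a checksum error) transparently.
            records++;
        }
        System.out.println("Records read: " + records);

        reader.close();   // closes the underlying FSDataInputStream
        fs.close();
    }
}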

There are three kinds of failures:

  1. Complete failure:

While writing a particular block, the DataStreamer requests block locations from the NameNode, and the NameNode responds with a set of DataNodes for the data pipeline. If the HDFS client is unable to communicate with any of the 3 DataNodes after 4 retries, a complete failure is reported to the NameNode. The NameNode discards the previous set of DataNodes and provides a new set to the HDFS client. We call this a complete failure because the HDFS client has not even initiated the write operation; the failure happened in the connection establishment phase itself.

  2. Partial failure:

When a write has been initiated on a data pipeline of 3 DataNodes and the write fails on one of them, the HDFS client waits for a specific time and then reports the error to the NameNode as a partial failure, depending on the replication factor: if the replication factor is set to 1 or 2, no error is reported because two copies have been written successfully; if the replication factor is greater than 2, the error is reported because only two copies of the block were written successfully. The write then moves forward, and the failed block is copied later, once the NameNode validates the block reports from all the DataNodes, so that the configured replication factor is satisfied.

  3. Acknowledgement failure:

If the block has been written successfully but the acknowledgement from one of the DataNodes fails, the NameNode treats this as a partial failure. The HDFS client rewrites the block as a duplicate (the failed block ID is renamed) after requesting new allocation locations from the NameNode. The NameNode provides a fresh set of DataNodes for the pipeline, and the write operation is repeated until a success acknowledgement is received from all the DataNodes. Once the write is complete, the DataNodes scan their blocks and send block reports to the NameNode. The NameNode checks whether any block has no metadata associated with it and sends a command to the DataNode to remove such blocks. In this way, orphaned blocks left behind by acknowledgement failures are removed.

High Availability for NameNode:

We can implement high availability for the NameNode by adding an additional NameNode, removing the single point of failure (SPOF). The additional NameNode is passive. If the active NameNode dies, the ZooKeeper failover controller changes the status of the passive NameNode to active, using fencing and a lock mechanism.

The metadata between the two nodes is kept in sync by: 1. a quorum of JournalNodes (recommended), or 2. a shared disk (not recommended).

  1. Quorum of Journals:

[Figure: NameNode HA with a quorum of JournalNodes]

Prior to Hadoop 2.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine was unavailable, the cluster on the whole would be unavailable until the NameNode was either restarted or started on a separate machine. In a classic HA cluster, two separate machines are configured as NameNodes. At any point, one of the NameNodes will be in Active state and the other will be in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover.

In order for the Standby node to keep its state coordinated with the Active node, both nodes communicate with a group of separate daemons called ‘JournalNodes’ (JNs). Every edit log modification in the Active node is logged in the JournalNodes. The Standby node is capable of reading the amended information from the JNs, and regularly monitors them for changes. As the Standby node sees the changes, it applies them to its own edit log. In case of a failover, the Standby will make sure that it has read all the changes from the JournalNodes before changing its state to Active. This guarantees that the namespace state is fully synched before a failover occurs.

To provide a fast failover, it is essential that the Standby node has up-to-date information about the location of blocks in the cluster. For this to happen, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is essential that only one of the NameNodes is Active at a time. Otherwise, the namespace state would diverge between the two and lead to data loss or erroneous results. To avoid this, the JournalNodes only permit a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active takes over the responsibility of writing to the JournalNodes.

  2. Shared Disk:

Here the metadata is stored on a network storage disk shared by both the Active and Passive NameNodes. If the shared disk fails, the entire cluster goes down, which is why this option is not recommended.

The above two solutions only solve the SPOF problem of the NameNode. NameNode unavailability due to an overwhelming number of requests is addressed by HDFS federation.

HDFS is a fit when,

  • Files to be stored are large in size.
  • Your application needs to write once and read many times.
  • You want to use cheap, commonly available hardware.

And it is not a fit when,

  • You want to store a large number of small files. It is better to store millions of large files than billions of small files.
  • There are multiple writers. It is only designed for writing at the end of file and not at a random offset.

Installation of Hadoop.

Pseudo Distributed Cluster Installation Mode:

Apache Hadoop v2.7.1

Linux Operating System (Ubuntu 12.04)

  1. Download the latest stable version of the Apache Hadoop tarball distribution.
  2. Download the Java 1.7 JDK tarball. Consider the architecture, 32-bit (i386, i586, i686) or 64-bit (x86_64), before downloading.
  3. Assuming the downloaded tarballs are present under the user's home directory, extract them:

~$ tar -xvf hadoop-2.7.1.tar.gz

~$ tar -xvf jdk-7u79-linux-x86_64.gz

  4. After extracting, set up the environment variables in the ~/.bashrc file:

~$vim .bashrc

export JAVA_HOME=/home/balajimani/jdk1.7.0_79

export HADOOP_PREFIX=/home/balajimani/hadoop-2.7.1

export HADOOP_HOME=${HADOOP_PREFIX}

export HADOOP_CONF_DIR=${HADOOP_PREFIX}/etc/hadoop

export HUE_HOME=/home/balajimani/hue

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HUE_HOME/build/env/bin:$PATH

After appending these lines, save and close the file.

  5. For these variables to be set for the current shell, source the file.

~$ source .bashrc

Check whether the changes have been applied properly

~$ java -version

~$ hadoop version

  6. Next, edit the Hadoop configuration files.

 ~$ cd $HADOOP_CONF_DIR

~hadoop-2.7.1/etc/hadoop$ vim hadoop-env.sh

 export JAVA_HOME=/home/balajimani/jdk1.7.0_79

~hadoop-2.7.1/etc/hadoop$ vim core-site.xml

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:8020</value>

</property>

<property>

<name>hadoop.proxyuser.balajimani.hosts</name>

<value>*</value>

</property>

<property>

<name>hadoop.proxyuser.balajimani.groups</name>

<value>*</value>

</property>

</configuration>

~hadoop-2.7.1/etc/hadoop$ vim hdfs-site.xml

<configuration>

<property>

<name>dfs.namenode.name.dir</name>

<value>/home/balajimani/name</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>/home/balajimani/data</value>

</property>

<property>

<name>dfs.webhdfs.enabled</name>

<value>true</value>

</property>

</configuration>

~hadoop-2.7.1/etc/hadoop$ vim yarn-site.xml

 <configuration>

<property>

<name>yarn.resourcemanager.hostname</name>

<value>localhost</value>

</property>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>

~hadoop-2.7.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml

~hadoop-2.7.1/etc/hadoop$ vim mapred-site.xml

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

~hadoop-2.7.1/etc/hadoop$ vim httpfs-site.xml

<configuration>

<property>

<name>httpfs.proxyuser.balajimani.hosts</name>

<value>*</value>

</property>

<property>

<name>httpfs.proxyuser.balajimani.groups</name>

<value>*</value>

</property>

</configuration>

~hadoop-2.7.1/etc/hadoop$ vim slaves

localhost

After adding all these entries to their respective configuration files, Hadoop is ready to start in pseudo-distributed mode.

  7. To enable password-less login through ssh:

~$ ssh-keygen

~$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost

This procedure avoids password prompts when starting the daemons.

  8. Format the NameNode before starting the daemons.

~$ hdfs namenode -format

This formats the dfs.namenode.name.dir location and creates the necessary files and folders required for namenode.

Note: Steps 7 and 8 are one-time procedures.

  9. Start the cluster

~$ start-dfs.sh

~$ start-yarn.sh

Alternatively, to start all the daemons

~$ start-all.sh

  10. To check for the daemons, use jps (Java process status)

~$ jps

  11. To stop the cluster

~$ stop-yarn.sh

~$ stop-dfs.sh

To stop all the daemons in one go

~$ stop-all.sh

Note: To stop or start daemons individually.

~$ hadoop-daemon.sh <start | stop> <namenode | datanode>

~$ yarn-daemon.sh <start | stop> <resourcemanager | nodemanager>

~$ mr-jobhistory-daemon.sh <start | stop> historyserver

HUE INSTALLATION:

Note: HUE is a GUI used for file manipulation in HDFS.

Hue prerequisites (these may already have been installed along with the OS):

~$ sudo apt-get update

~$ sudo apt-get install gcc g++ libxml2-dev libxslt-dev libsasl2-dev libsasl2-modules-gssapi-mit libmysqlclient-dev python-dev python-setuptools libsqlite3-dev ant libsasl2-dev libsasl2-modules-gssapi-mit libkrb5-dev libtidy-0.99-0 libldap2-dev libssl-dev

The first command updates your Ubuntu repositories and the second installs the packages needed to build Hue.

Let us begin with the assumption that hue tarball (hue-3.8.1.tgz) is already downloaded from the vendor site and is present under the home folder of the “balajimani” user.

  1. Extract the contents of the tarball

~$ tar -xvf /home/balajimani/hue-3.8.1.tgz

  2. Enter the folder and install using the following command:

~$ cd /home/balajimani/hue-3.8.1

~$ PREFIX=/home/balajimani/ make install

The above command will run for some time. Proceed with the next step after the completion of the above command.

Configure Hue:

~$ cd $HUE_HOME/desktop/conf/

~$ vim hue.ini

Edit the following parameters

[desktop]

server_user=balajimani

server_group=balajimani

default_user=balajimani

default_hdfs_superuser=balajimani

Start HUE:

~$ nohup supervisor &

NAMENODE UI: http://localhost:50070

[Screenshot: NameNode web UI]

YARN UI: http://localhost:8088

[Screenshot: YARN ResourceManager web UI]

HUE UI: http://localhost:8888

[Screenshot: HUE web UI]

HDFS commands:

Hadoop file system (fs) shell commands are used to perform various file operations such as copying files, changing permissions, viewing file contents, changing ownership and creating directories.

The syntax of an fs shell command is:

hdfs dfs <args>
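
Most of these shell commands have direct equivalents in the Java FileSystem API. As a rough sketch (using a hypothetical /user/balajimani/apidemo directory that is not part of the examples below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/balajimani/apidemo");       // hypothetical directory

        fs.mkdirs(dir);                                        // hdfs dfs -mkdir -p
        fs.copyFromLocalFile(new Path("sample.txt"), dir);     // hdfs dfs -copyFromLocal
        for (FileStatus s : fs.listStatus(dir)) {              // hdfs dfs -ls
            System.out.println(s.getPath() + "  " + s.getLen());
        }
        fs.delete(dir, true);                                  // hdfs dfs -rm -r
        fs.close();
    }
}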

Hdfs dfs Shell Commands

hdfs dfs ls:

The hadoop ls command is used to list out the directories and files. An example is shown below:

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani

Found 2 items

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash

-rw-r--r--   3 balajimani balajimani          6 2016-05-29 04:41 /user/balajimani/hello.txt

The above command lists out the files in the balajimani directory.

The second column for the file hello.txt indicates the number of replicas; for a directory this field shows a dash instead.

hdfs dfs lsr:

The hadoop lsr command recursively displays the directories, sub directories and files in the specified directory. The usage example is shown below:

balajimani@ubuntu:~$ hdfs dfs -lsr /user/balajimani

lsr: DEPRECATED: Please use 'ls -R' instead.

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp

-rw-r--r--   3 balajimani supergroup            0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp/hue_config_validation.2028913943167022256

-rw-r--r--   3 balajimani balajimani          6 2016-05-29 04:41 /user/balajimani/hello.txt

The hdfs dfs lsr command is similar to the ls -R command in unix.

hdfs dfs cat:

Hadoop cat command is used to print the contents of the file on the terminal (stdout). The usage example of hadoop cat command is shown below:

balajimani@ubuntu:~$ hdfs dfs -cat /user/balajimani/hello.txt

hello

hdfs dfs chgrp:

The hadoop chgrp shell command is used to change the group association of files. Optionally you can use the -R option to change recursively through the directory structure. The usage of hdfs dfs -chgrp is shown below:

hdfs dfs -chgrp [-R] <NewGroupName> <file or directory name>

hdfs dfs chmod:

The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure. The usage is shown below:

hdfs dfs -chmod [-R] <mode | octal mode> <file or directory name>

hdfs dfs chown:

The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure. The usage is shown below:

hdfs dfs -chown [-R] <NewOwnerName>[:NewGroupName] <file or directory name>
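
The chmod/chown/chgrp operations can also be performed from Java. A minimal sketch, assuming HDFS superuser privileges for the ownership change and a hypothetical group name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class ChangePermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/balajimani/hello.txt");

        fs.setPermission(file, new FsPermission((short) 0644));   // hdfs dfs -chmod 644
        // chown/chgrp require HDFS superuser privileges; "hadoopusers" is a hypothetical group.
        fs.setOwner(file, "balajimani", "hadoopusers");
        fs.close();
    }
}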

hdfs dfs mkdir:

The hadoop mkdir command is for creating directories in the hdfs. You can use the -p option for creating parent directories. This is similar to the unix mkdir command. The usage example is shown below:

balajimani@ubuntu:~$ hdfs dfs -mkdir /user/balajimani/hadoopdemo

balajimani@ubuntu:~$ hdfs dfs -lsr /user/balajimani

lsr: DEPRECATED: Please use 'ls -R' instead.

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp

-rw-r--r--   3 balajimani supergroup            0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp/hue_config_validation.2028913943167022256

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:31 /user/balajimani/hadoopdemo

-rw-r--r--   3 balajimani balajimani          6 2016-05-29 04:41 /user/balajimani/hello.txt

The above command creates the hadoopdemo directory in the /user/balajimani directory.

balajimani@ubuntu:~$ hdfs dfs -mkdir -p /user/balajimani/dir1/dir2

balajimani@ubuntu:~$ hdfs dfs -lsr /user/balajimani

lsr: DEPRECATED: Please use 'ls -R' instead.

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current

drwxr-xr-x   - balajimani balajimani          0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp

-rw-r--r--   3 balajimani supergroup            0 2016-05-29 04:14 /user/balajimani/.Trash/Current/tmp/hue_config_validation.2028913943167022256

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:32 /user/balajimani/dir1

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:32 /user/balajimani/dir1/dir2

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:31 /user/balajimani/hadoopdemo

-rw-r--r--   3 balajimani balajimani          6 2016-05-29 04:41 /user/balajimani/hello.txt

The above command creates the dir1/dir2 directory in /user/balajimani directory.

hdfs dfs copyFromLocal:

The hadoop copyFromLocal command is used to copy a file from the local file system to the hadoop hdfs. The syntax and usage example are shown below:

Syntax:

hdfs dfs -copyFromLocal <localsrc> URI

Example:

Check the data in local file

balajimani@ubuntu:~$ cat sample.txt

Kill Bill 1

Kill Bill 2

pulp fiction

Now copy this file to hdfs

balajimani@ubuntu:~$ hdfs dfs -copyFromLocal sample.txt /user/balajimani/dir1

View the contents of the hdfs file.

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir1

Found 2 items

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:32 /user/balajimani/dir1/dir2

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:14 /user/balajimani/dir1/sample.txt

balajimani@ubuntu:~$ hdfs dfs -cat /user/balajimani/dir1/sample.txt

Kill Bill 1

Kill Bill 2

pulp fiction

hdfs dfs copyToLocal:

The hadoop copyToLocal command is used to copy a file from the hdfs to the local file system. The syntax and usage example is shown below:

Syntax

hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Example:

balajimani@ubuntu:~$ hdfs dfs -copyToLocal /user/balajimani/hello.txt dir1

16/06/03 08:26:26 WARN hdfs.DFSClient: DFSInputStream has been closed already

The -ignorecrc option is used to copy the files that fail the crc check. The -crc option is for copying the files along with their CRC.

hdfs dfs cp:

The hadoop cp command is for copying the source into the target. The cp command can also be used to copy multiple files into the target. In this case the target should be a directory. The syntax is shown below:

balajimani@ubuntu:~$ hdfs dfs -cp /user/balajimani/hello.txt /user/balajimani/dir1/hello.txt

16/06/03 08:31:15 WARN hdfs.DFSClient: DFSInputStream has been closed already

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir1

Found 3 items

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:32 /user/balajimani/dir1/dir2

-rw-r--r--   3 balajimani balajimani          6 2016-06-03 08:31 /user/balajimani/dir1/hello.txt

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:14 /user/balajimani/dir1/sample.txt

balajimani@ubuntu:~$ hdfs dfs -cp /user/balajimani/hello.txt /user/balajimani/sample.txt hdfs://namenodehost/user/balajimani/dir2

-cp: java.net.UnknownHostException: namenodehost

Usage: hdfs dfs [generic options] -cp [-f] [-p | -p[topax]] <src> … <dst>

balajimani@ubuntu:~$ hdfs dfs -cp /user/balajimani/hello.txt /user/balajimani/sample.txt hdfs://localhost/user/balajimani/dir2

16/06/03 08:39:51 WARN hdfs.DFSClient: DFSInputStream has been closed already

16/06/03 08:39:51 WARN hdfs.DFSClient: DFSInputStream has been closed already

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir2

Found 2 items

-rw-r--r--   3 balajimani balajimani          6 2016-06-03 08:39 /user/balajimani/dir2/hello.txt

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:39 /user/balajimani/dir2/sample.txt

hdfs dfs -put:

Hadoop put command is used to copy multiple sources to the destination system. The put command can also read the input from the stdin. The different syntaxes for the put command are shown below:

Syntax1: copy single file to hdfs

balajimani@ubuntu:~$ hdfs dfs -put test.txt /user/balajimani/dir2

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir2

Found 3 items

-rw-r--r--   3 balajimani balajimani          6 2016-06-03 08:39 /user/balajimani/dir2/hello.txt

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:39 /user/balajimani/dir2/sample.txt

-rw-r--r--   3 balajimani balajimani         31 2016-06-03 08:44 /user/balajimani/dir2/test.txt

Syntax2: copy multiple files to hdfs

balajimani@ubuntu:~$ hdfs dfs -put test2.txt test3.txt /user/balajimani/dir2

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir2

Found 5 items

-rw-r--r--   3 balajimani balajimani          6 2016-06-03 08:39 /user/balajimani/dir2/hello.txt

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:39 /user/balajimani/dir2/sample.txt

-rw-r--r--   3 balajimani balajimani         31 2016-06-03 08:44 /user/balajimani/dir2/test.txt

-rw-r--r--   3 balajimani balajimani         14 2016-06-03 08:49 /user/balajimani/dir2/test2.txt

-rw-r--r--   3 balajimani balajimani         13 2016-06-03 08:49 /user/balajimani/dir2/test3.txt

hdfs dfs get:

Hadoop get command copies the files from hdfs to the local file system. The syntax of the get command is shown below:

balajimani@ubuntu:~$ hdfs dfs -get /user/balajimani/dir2/test2.txt test2.txt

16/06/03 09:00:04 WARN hdfs.DFSClient: DFSInputStream has been closed already

balajimani@ubuntu:~$ ls -lrt

-rw-r--r--  1 balajimani balajimani   14 Jun  3 09:00 test2.txt

hdfs dfs getmerge:

hadoop getmerge command concatenates the files in the source directory into the destination file. The syntax of the getmerge shell command is shown below:

hdfs dfs -getmerge <src> <localdst> [addnl]

The addnl option is for adding new line character at the end of each file.

hdfs dfs moveFromLocal:

The hadoop moveFromLocal command moves a file from local file system to the hdfs directory. It removes the original source file. The usage example is shown below:

balajimani@ubuntu:~$ hdfs dfs -rm /user/balajimani/dir2/test2.txt /user/balajimani/dir2/test3.txt

16/06/03 09:01:43 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted /user/balajimani/dir2/test2.txt

16/06/03 09:01:43 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted /user/balajimani/dir2/test3.txt

balajimani@ubuntu:~$ hdfs dfs -moveFromLocal test2.txt test3.txt /user/balajimani/dir2

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir2

Found 2 items

-rw-r--r--   3 balajimani balajimani         14 2016-06-03 09:03 /user/balajimani/dir2/test2.txt

-rw-r--r--   3 balajimani balajimani         13 2016-06-03 09:03 /user/balajimani/dir2/test3.txt

hdfs dfs mv:

It moves the files from source hdfs to destination hdfs. Hadoop mv command can also be used to move multiple source files into the target directory. In this case the target should be a directory. The syntax is shown below:

balajimani@ubuntu:~$ hdfs dfs -mv /user/balajimani/dir2/test2.txt /user/balajimani/dir1

balajimani@ubuntu:~$ hdfs dfs -ls /user/balajimani/dir1

Found 4 items

drwxr-xr-x   - balajimani balajimani          0 2016-06-03 07:32 /user/balajimani/dir1/dir2

-rw-r--r--   3 balajimani balajimani          6 2016-06-03 08:31 /user/balajimani/dir1/hello.txt

-rw-r--r--   3 balajimani balajimani         37 2016-06-03 08:14 /user/balajimani/dir1/sample.txt

-rw-r--r--   3 balajimani balajimani         14 2016-06-03 09:03 /user/balajimani/dir1/test2.txt

hdfs dfs du:

The du command displays the aggregate length of files contained in a directory, or the length of a single file if the path is just a file. The syntax and usage are shown below:

balajimani@ubuntu:~$ hdfs dfs -du hdfs://localhost/user/balajimani/dir2/

6   hdfs://localhost/user/balajimani/dir2/hello.txt

37  hdfs://localhost/user/balajimani/dir2/sample.txt

31  hdfs://localhost/user/balajimani/dir2/test.txt

13  hdfs://localhost/user/balajimani/dir2/test3.txt

hdfs dfs dus:

The hadoop dus command prints the summary of file lengths

balajimani@ubuntu:~$ hdfs dfs -dus hdfs://localhost/user/balajimani

dus: DEPRECATED: Please use ‘du -s’ instead.

183  hdfs://localhost/user/balajimani

hdfs dfs expunge:

Used to empty the trash. The usage of expunge is shown below:

balajimani@ubuntu:~$ hdfs dfs -expunge

16/06/03 09:26:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

16/06/03 09:26:51 INFO fs.TrashPolicyDefault: Created trash checkpoint: /user/balajimani/.Trash/160603092651

hdfs dfs rm:

Removes the specified list of files and empty directories. An example is shown below:

balajimani@ubuntu:~$ hdfs dfs -rm /user/balajimani/dir2/test3.txt

16/06/03 09:28:34 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted /user/balajimani/dir2/test3.txt

hdfs dfs -rmr:

Recursively deletes the files and sub directories. The usage of rmr is shown below:

balajimani@ubuntu:~$ hdfs dfs -rmr /user/balajimani/dir2

rmr: DEPRECATED: Please use ‘rm -r’ instead.

16/06/03 09:29:04 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted /user/balajimani/dir2

hdfs dfs setrep:

Hadoop setrep is used to change the replication factor of a file. Use the -R option for recursively changing the replication factor.

balajimani@ubuntu:~$ hdfs dfs -setrep -w 2 /user/balajimani/dir1/sample.txt

Replication 2 set: /user/balajimani/dir1/sample.txt

Waiting for /user/balajimani/dir1/sample.txt ……
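
The replication factor can also be changed from Java with FileSystem.setReplication(). A minimal sketch for the same file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/balajimani/dir1/sample.txt");

        // Same effect as 'hdfs dfs -setrep 2'; returns true if the NameNode accepted the change.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);

        // Unlike -setrep -w, this call does not wait; the NameNode re-replicates or removes
        // extra replicas in the background.
        fs.close();
    }
}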

hdfs dfs stat:

Hadoop stat returns the stats information on a path. The syntax of stat is shown below:

hdfs dfs -stat URI

balajimani@ubuntu:~$ hdfs dfs -stat /user/balajimani

2016-06-03 16:29:04

hdfs dfs tail:

Hadoop tail command prints the last kilobytes of the file. The -f option can be used same as in unix.

balajimani@ubuntu:~$ hdfs dfs -tail /user/balajimani/dir1/sample.txt

Kill Bill 1

Kill Bill 2

pulp fiction
