Blog
How to download DailyMotion Videos: 2 Easy Methods
Dailymotion is a popular platform for watching videos online. Because of network connectivity or...
Apache Oozie is a workflow scheduler for Hadoop. It is a system which runs the workflow of dependent jobs. Here, users are permitted to create Directed Acyclic Graphs of workflows, which can be run in parallel and sequentially in Hadoop.
In this tutorial, you will learn,
It consists of two parts:
Oozie is scalable and can manage the timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.
Oozie is very much flexible, as well. One can easily start, stop, suspend and rerun jobs. Oozie makes it very easy to rerun failed workflows. One can easily understand how difficult it can be to catch up missed or failed jobs due to downtime or failure. It is even possible to skip a specific failed node.
Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.
Oozie workflow consists of action nodes and control-flow nodes.
An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive jobs, importing data using Sqoop or running a shell script of a program written in Java.
A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic wherein different branches may be followed depending on the result of earlier action node.
Start Node, End Node, and Error Node fall under this category of nodes.
Start Node, designates the start of the workflow job.
End Node, signals end of the job.
Error Node designates the occurrence of an error and corresponding error message to be printed.
At the end of execution of a workflow, HTTP callback is used by Oozie to update the client with the workflow status. Entry-to or exit from an action node may also trigger the callback.
A workflow application consists of the workflow definition and all the associated resources such as MapReduce Jar files, Pig scripts etc. Applications need to follow a simple directory structure and are deployed to HDFS so that Oozie can access them.
An example directory structure is shown below-
<name of workflow>/</name> ??? lib/ ? ??? hadoop-examples.jar ??? workflow.xml
It is necessary to keep workflow.xml (a workflow definition file) in the top level directory (parent directory with workflow name). Lib directory contains Jar files containing MapReduce classes. Workflow application conforming to this layout can be built with any build tool e.g., Ant or Maven.
Such a build need to be copied to HDFS using a command, for example -
% hadoop fs -put hadoop-examples/target/<name of workflow dir> name of workflow
Steps for Running an Oozie workflow job
In this section, we will see how to run a workflow job. To run this, we will use the Oozie command-line tool (a client program which communicates with the Oozie server).
1. Export OOZIE_URL environment variable which tells the oozie command which Oozie server to use (here we’re using one running locally):
% export OOZIE_URL="http://localhost:11000/oozie"
2. Run workflow job using-
% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
The -config option refers to a local Java properties file containing definitions for the parameters in the workflow XML file, as well as oozie.wf.application.path, which tells Oozie the location of the workflow application in HDFS.
Example contents of the properties file:
nameNode=hdfs://localhost:8020 jobTracker=localhost:8021 oozie.wf.application.path=${nameNode}/user/${user.name}/<name of workflow>
3. Get the status of workflow job-
Status of workflow job can be seen using subcommand 'job' with '-info' option and specifying job id after '-info'.
e.g., % oozie job -info <job id>
Output shows status which is one of RUNNING, KILLED or SUCCEEDED.
4. Results of successful workflow execution can be seen using Hadoop command like-
% hadoop fs -cat <location of result>
The main purpose of using Oozie is to manage different type of jobs being processed in Hadoop system.
Dependencies between jobs are specified by a user in the form of Directed Acyclic Graphs. Oozie consumes this information and takes care of their execution in the correct order as specified in a workflow. That way user's time to manage complete workflow is saved. In addition, Oozie has a provision to specify the frequency of execution of a particular job.
Dailymotion is a popular platform for watching videos online. Because of network connectivity or...
Audio Equalization is a technique for adjusting the balance between audible frequency components....
Linux is a multi-user system, which allows many users to work on it simultaneously. So what if...
cPanel is one of the most famous web hosting control panel. It enables website owners to...
Instagram downloader tools are applications that help you to download Instagram videos and photos.
A Virtual Machine (VM) is a software environment that emulates a computer system. It facilitates a...