Hive
Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). Hive makes it easy to perform operations such as data encapsulation, ad-hoc queries, and analysis of huge datasets.
To set up MySQL as the database for storing metadata, check the tutorial "Installation and Configuration of HIVE and MYSQL".
Some of the key points about Hive:
By using Hive, we can achieve some functionality that is not available in relational databases. For huge amounts of data, in the range of petabytes, it is important to be able to query that data and get results in seconds. Hive does this quite efficiently: it processes queries fast and produces results in seconds.
Let us now see what makes Hive so fast.
Some key differences between Hive and relational databases are the following:
Relational databases follow "schema on READ and schema on WRITE": first a table is created, then data is inserted into that particular table. On relational database tables, operations like insertions, updates, and modifications can be performed.
Hive is "Schema on READ only". So, functions like the update, modifications, etc. don't work with this. Because the Hive query in a typical cluster runs on multiple Data Nodes. So it is not possible to update and modify data across multiple nodes.( Hive versions below 0.13)
Also, Hive supports a "READ many, WRITE once" pattern, which means data is typically inserted into a table once and read many times; updating the table afterwards is possible only in the latest Hive versions.
NOTE: However, newer versions of Hive come with updated features. Hive 0.14 and later versions provide UPDATE and DELETE options as new features.
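A minimal sketch of how this looks in Hive 0.14 or later, assuming ACID support has been enabled for the session and table (the table name, columns, and values here are hypothetical, and the required transaction settings can differ per cluster):

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be stored as ORC, bucketed, and flagged as transactional
CREATE TABLE employee_acid (id INT, name STRING, salary INT)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_acid VALUES (1, 'Anna', 1000), (2, 'Bob', 2000);
UPDATE employee_acid SET salary = 1500 WHERE id = 1;
DELETE FROM employee_acid WHERE id = 2;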
The diagram above explains the Apache Hive architecture in detail.
Hive consists mainly of three core parts:
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
For Java-related applications, it provides JDBC drivers. For all other types of applications, it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
Client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operations in Hive, it has to communicate through Hive Services.
The CLI is the command-line interface that acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in Hive Services, as shown in the architecture diagram above.
The driver present in Hive Services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes requests from those different applications and passes them to the metastore and file systems for further processing.
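For example, a DDL operation issued through the Hive CLI could look like the following (a minimal sketch; the database name, table name, and columns are hypothetical):

-- Create a database and a simple delimited table, then list the tables
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;
CREATE TABLE IF NOT EXISTS orders (
  order_id INT,
  customer STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
SHOW TABLES;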
Hive Storage and Computing:
Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and perform the following actions:
Metadata information of the tables created in Hive is stored in the Hive metastore database.
Query results and the data loaded into the tables are stored on the Hadoop cluster in HDFS.
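To see this split in practice, a command like the one below (reusing the hypothetical sales_db.orders table from above) shows both the column metadata kept in the metastore and the HDFS location where the table's data files live:

-- Inspect metastore metadata and the table's HDFS location
DESCRIBE FORMATTED sales_db.orders;
-- The "Location:" field of the output points to the table's directory in HDFS,
-- for example hdfs://namenode:8020/user/hive/warehouse/sales_db.db/orders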
Job execution flow:
From the above diagram, we can understand the job execution flow in Hive with Hadoop.
The data flow in Hive follows this pattern: Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.
Hive can operate in two modes, depending on the number of data nodes in Hadoop.
These modes are local mode and MapReduce mode.
When to use local mode: when Hadoop is installed in pseudo-distributed mode with a single data node, or when the data to be processed is small enough to be handled by a single local machine, since smaller datasets are processed faster locally.
When to use MapReduce mode: when Hadoop has multiple data nodes and the data is distributed across them, or when large datasets have to be processed, since MapReduce processes large amounts of data in parallel across the cluster.
In Hive, we can set a property to specify which mode Hive should work in. By default, it works in MapReduce mode; for local mode, you can use the following setting.
To make Hive work in local mode, set:
SET mapred.job.tracker=local;
From Hive version 0.7 onwards, Hive supports a mode that runs MapReduce jobs in local mode automatically.
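A minimal sketch of enabling that automatic behavior (the threshold values below are only illustrative defaults, not recommendations):

-- Let Hive decide automatically when a job is small enough to run locally
SET hive.exec.mode.local.auto=true;
-- Optional thresholds: maximum total input size and maximum number of
-- input files for which local mode is chosen automatically
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.input.files.max=4;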
HiveServer2 (HS2) is a server interface that performs the following functions: it enables remote clients to execute queries against Hive and retrieve the results of those queries.
The latest versions provide some advanced features based on Thrift RPC, such as multi-client concurrency and authentication.
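As a rough sketch of how a remote client reaches HiveServer2 (assuming HS2 listens on its default port 10000 on localhost; the host, user name, and table name are placeholders), the Beeline client connects over JDBC and then issues ordinary HiveQL:

beeline -u jdbc:hive2://localhost:10000 -n hiveuser
SELECT COUNT(*) FROM sales_db.orders;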
Summary:
Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem, used for processing structured and semi-structured data.