Hive ETL: Loading JSON, XML, Text Data Examples

Hive as an ETL and data warehousing tool on top of Hadoop ecosystem provides functionalities like Data modeling, Data manipulation, Data processing and Data querying. Data Extraction in Hive means the creation of tables in Hive and loading structured and semi structured data as well as querying data based on the requirements.

For batch processing, we are going to write custom defined scripts using a custom map and reduce scripts using a scripting language. It provides SQL like environment and support for easy querying.

Working with Structured Data using Hive

Structured Data means that data is in the proper format of rows and columns. This is more of like RDBMS data with proper rows and columns.

Here we are going to load structured data present in text files in Hive

Step 1) In this step we are creating table “employees_guru” with column names such as Id, Name, Age, Address, Salary and Department of the employees with data types.

Working with Structured Data using Hive

From the above screenshot, we can observe the following,

  1. Creation of table “employees_guru”
  2. Loading data from Employees.txt into table”employees_guru”

Step 2) In this step we are displaying the contents stored in this table by using “Select” command. We can observe the table contents in the following screen shot.

Working with Structured Data using Hive

– Sample code Snippet

Queries to be performed

1) Create table employees_guru(Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT, Department STRING)
> Row format delimited
> Fields terminated by ',';
2) load data local inpath '/home/hduser/Employees.txt' into TABLE employees_guru;
3) select * from employees_guru;

Working with Semi structured data using Hive (XML, JSON)

Hive performs ETL functionalities in Hadoop ecosystem by acting as ETL tool. It can be difficult to perform map reduce in some type of applications, Hive can reduce the complexity and provides the best solution to the IT applications in terms of data warehousing sector.

Semi structured data such as XML and JSON can be processed with less complexity using Hive. First we will see how we can use Hive for XML.

XML TO HIVE TABLE

In this, we are going to load XML data into Hive tables, and we will fetch the values stored inside the XML tags.

Step 1) Creation of Table “xmlsample_guru” with str column with string data type.

Working with Semi Structured Data using Hive

From the above screenshot, we can observe the following

  1. Creation of table “xmlsample_guru”
  2. Loading data from the test.xml into table “xmlsample_guru”

Step 2) Using XPath () method we will be able to fetch the data stored inside XML tags.

Working with Semi Structured Data using Hive

From the above screenshot, we can observe the following

  1. Using XPATH( ) method we are fetching the values stored under /emp/esal/ and /emp/ename/
  2. Values present Inside XML tags. In this step, we are displaying actual values stored under XML tags in table “xmlsample_guru”

Step 3) In this step, we will fetch and display the Raw XML of table “xmlsample_guru.”

Working with Semi Structured Data using Hive

From the above screenshot, we can observe the following

  • The actual XML data displaying with tags
  • If we observe single tag, it is with “emp” as parent tag with”ename” and “esal” as child tags.

Code Snippet:

Queries to be performed

1) create table xmlsample_guru(str string);                                                                                                   2) load data local inpath '/home/hduser/test.xml' overwrite  into table xmlsample_guru;
3) select xpath(str,'emp/ename/text()'), xpath(str,'emp/esal/text()') from xmlsample_guru;

JSON (JavaScript Object Notation)

Twitter and websites data is stored in JSON format. Whenever we try to fetch data from online servers it will return JSON files. Using Hive as data store we can able to load JSON data into Hive tables by creating schemas.

JSON TO HIVE TABLE

In this, we are going to load JSON data into Hive tables, and we will fetch the values stored in JSON schema.

Step 1) In this step, we are going to create JSON table name “json_guru”. Once created loading and displaying contents of the actual schema.

Json To Hive Table

From the above screenshot, we can observe the following

  1. Creation of table “json_guru”
  2. Loading data from test.json into table “json_guru”
  3. Displaying actual schema of JSON file stored in json_guru tables

Step 2) Using get_json_object() Method we can able to fetch the Data values stored in JSON hierarchy

Json To Hive Table

From the above screen shot we can observe the following

  1. Using get_json_object (str,’$.ecode) it can fetch ecode values from table json_guru. Similarly using get_json_object (str,’$.ename) ,get_json_object (str,’$.Sali) it will fetch ename sal values from table json_guru
  2. Values stored inside of JSON Hierarchy in json_guru

Code Snippet

Queries to be performed

1) create table json_guru(str string);
2) load data inpath 'home/hduser/test.json' into table json_guru;
3) select * from json1;
4) select get_json_object(str,'$.ecode') as ecode, get_json_object(str,'$.ename') as ename ,get_json_object(str,'$.sal') as salary from json_guru;

Complex JSON TO HIVE TABLE

In this, we are going to load Complex JSON data into Hive tables, and we will fetch the values stored in JSON schema

Step 1) Creating complexjson_guru with single column field

Complex Json To Hive Table

From the above screenshot, we can observe the following

  1. Creation on table complexjson_guru with single column field as string data type
  2. Loading data into complexjson_guru from emp.json complex JSON file

Step 2) By using get_json_object we can retrieve the actual content that stored inside of JSON file hierarchy.

From the following screenshot, we can able to see the output of the data stored in complexjson_guru.

Complex Json To Hive Table

Step 3) In this step, by using “Select” command we actually able to see complex JSON data stored inside table “complexjson_guru”

Complex Json To Hive Table

-Sample Code Snippet,

Queries to be performed

1) create table complexjson_guru(json string);
2) load data inpath 'home/hduser/emp.json' into table complexjson_guru;
3) select get_json_object(json,'$.ecode') as ecode ,get_json_object(json,'$.b') as code, get_json_object(json,'$.c') from complexjson_guru;
4) select * from complexjson_guru;

Hive in Real time projects – When and Where to Use

When and Where to Use Hive on Hadoop Ecosystem:

When

  • When working with strong and powerful statistical functions on Hadoop ecosystem
  • When working with structured and Semi structured data processing
  • As data warehouse tool with Hadoop
  • Real time data ingestion with HBASE, Hive can be used

Where

  • For ease usage of ETL and data warehousing tool
  • To provide SQL type environment and to query like SQL using HIVEQL
  • To use and deploy custom specified map and reducer scripts for the specific client requirements