Data operations in Hive

Data operations are the operations performed on the databases and tables in Hive, such as DDL and DML commands.

In this tutorial, you will learn:

  • Data types in Hive
  • Creation and dropping of databases in Hive
  • Creating, dropping, and altering tables in Hive
  • Table types and their usage (internal and external tables)
  • Partitions
  • Buckets
  • Views and Indexes

Data types in Hive

Data types are very important elements in Hive query language and data modeling. To define table columns, we need to know the available data types and their usage.

The following gives a brief overview of the main data types in Hive:

  • Numeric Types
  • String Types
  • Date/Time Types
  • Complex Types

Numeric Types:

Type     | Memory allocation
TINYINT  | 1-byte signed integer (-128 to 127)
SMALLINT | 2-byte signed integer (-32,768 to 32,767)
INT      | 4-byte signed integer (-2,147,483,648 to 2,147,483,647)
BIGINT   | 8-byte signed integer
FLOAT    | 4-byte single-precision floating-point number
DOUBLE   | 8-byte double-precision floating-point number
DECIMAL  | Number with user-defined precision and scale
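To make the numeric types concrete, here is a minimal, hypothetical table declaring one column of each type (the table and column names are invented for illustration); note how DECIMAL takes its precision and scale in parentheses:

Hive> CREATE TABLE numeric_demo (
  tiny_col   TINYINT,
  small_col  SMALLINT,
  int_col    INT,
  big_col    BIGINT,
  float_col  FLOAT,
  double_col DOUBLE,
  dec_col    DECIMAL(10,2)  -- precision 10, scale 2
);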

String Types:

Type    | Length
CHAR    | Fixed length, up to 255 characters
VARCHAR | Variable length, 1 to 65,535 characters
STRING  | Variable length, no declared maximum

Date/Time Types:

Type      | Usage
TIMESTAMP | Supports the traditional UNIX timestamp with optional nanosecond precision
DATE      | Values are in YYYY-MM-DD format; the supported range is 0000-01-01 to 9999-12-31, dependent on support by the primitive Java Date type

Complex Types:

Type    | Usage
Arrays  | ARRAY<data_type> (negative values and non-constant expressions are not allowed as array indexes)
Maps    | MAP<primitive_type, data_type> (negative values and non-constant expressions are not allowed as map keys)
Structs | STRUCT<col_name : data_type, ...>
Union   | UNIONTYPE<data_type, data_type, ...>
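As a hypothetical sketch, the table below declares one column of each of the most common complex types (the table name, column names, and delimiters are invented; the delimiter clauses shown are typical choices for text-backed tables):

Hive> CREATE TABLE complex_demo (
  phones  ARRAY<STRING>,                               -- list of phone numbers
  scores  MAP<STRING, INT>,                            -- subject name to score
  address STRUCT<street:STRING, city:STRING, zip:INT>  -- nested record
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';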

Creation and dropping of Database in Hive:

Create Database:

For creating a database in the Hive shell, we use the command shown in the syntax below:

Syntax:

CREATE DATABASE <DatabaseName>;

Example:

CREATE DATABASE guru99;

[Screenshot: creating database guru99 and verifying it with the show databases command]

From the above screenshot, we can observe the following:

  • Creating the database "guru99" in Hive
  • Displaying the existing databases using the "show databases" command
  • The database "guru99" appears at the end of the show output, which means it was created successfully
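The steps can be reproduced in the Hive shell as follows:

Hive> CREATE DATABASE guru99;
Hive> SHOW DATABASES;  -- guru99 should now appear in the list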

Drop Database:

For dropping a database in the Hive shell, we use the "drop" command as shown in the syntax below:

Syntax:

DROP DATABASE <DatabaseName>;

Example:

DROP DATABASE guru99;

[Screenshot: dropping database guru99 and verifying with the show databases command]

In the above screenshot, we can observe the following:

  • We are dropping the database "guru99" from Hive
  • Cross-checking the result with the "show databases" command
  • After the show command runs, "guru99" no longer appears in the list, confirming that the database was dropped
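The equivalent shell session is:

Hive> DROP DATABASE guru99;
Hive> SHOW DATABASES;  -- guru99 should no longer appear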

Creating, dropping, and altering tables in Hive

Table Operations such as Creation, Altering, and Dropping tables in Hive can be observed in this section.

[Screenshot: creating table guru_sample, listing tables, and renaming the table to guru_sampleNew]

In the above screenshot, we are creating a table with columns and then renaming it; an equivalent HiveQL sketch follows the list.

  1. Creating table guru_sample with two columns, "empid" and "empname"
  2. Displaying the tables present in the guru99 database
  3. guru_sample is displayed under the tables
  4. Renaming table "guru_sample" to "guru_sampleNew"
  5. When you execute the "show tables" command again, it displays the new name guru_sampleNew
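A sketch of the statements behind the screenshot (the empid and empname column types are assumptions, since the screenshot does not spell them out):

Hive> CREATE TABLE guru_sample (empid INT, empname STRING);  -- column types assumed
Hive> SHOW TABLES;
Hive> ALTER TABLE guru_sample RENAME TO guru_sampleNew;
Hive> SHOW TABLES;  -- now shows the new name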

Dropping table guru_sampleNew:

[Screenshot: dropping table guru_sampleNew]
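The equivalent command is:

Hive> DROP TABLE guru_sampleNew;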

Table types and their usage:

Tables in Hive are similar to the tables we create in traditional relational databases. Operations such as filtering and joins can be performed on them.

Hive deals with two types of table structures, internal and external, depending on how the data is loaded and how the schema is designed.

Internal tables

  • An internal table is tightly coupled in nature. With this type of table, we first create the table and then load the data.
  • We can call this approach "data on schema".
  • Dropping the table removes both the data and the schema.
  • The data for this table is stored at /user/hive/warehouse.

When to Choose Internal Table:

  • If the data to be processed is available in the local file system
  • If we want Hive to manage the complete lifecycle of the data, including deletion

Sample code Snippet for Internal Table

1. To create the internal table

Hive> CREATE TABLE guruhive_internaltable (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

2. Load the data into internal table

Hive> LOAD DATA INPATH '/user/guru99hive/data.txt' INTO TABLE guruhive_internaltable;

3. Display the content of the table

Hive> SELECT * FROM guruhive_internaltable;

4. To drop the internal table

Hive> DROP TABLE guruhive_internaltable;

If you drop guruhive_internaltable, both its metadata and its data will be deleted from Hive.

From the following screenshot, we can observe the output

[Screenshot: creating, loading, querying, and dropping the internal table]

In the above code and screenshot, we do the following things:

  • Create the internal table
  • Load the data into the internal table
  • Display the content of the table
  • Drop the internal table

External tables

  • An external table is loosely coupled in nature. The data is already available in HDFS, and the table is created on top of that HDFS data.
  • In other words, it creates a schema on the data.
  • Dropping the table drops only the schema; the data remains in HDFS as before.
  • External tables let us define multiple schemas for the same data stored in HDFS, instead of deleting the data every time the schema changes.

When to Choose External Table:

  • If the data to be processed is already available in HDFS
  • Useful when the files are also being used outside of Hive

Sample code Snippet for External Table

1. Create External table

Hive> CREATE EXTERNAL TABLE guruhive_external (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/guru99hive/guruhive_external';

2. If we do not specify the location at the time of table creation, we can load the data manually:

Hive> LOAD DATA INPATH '/user/guru99hive/data.txt' INTO TABLE guruhive_external;

3. Display the content of the table

Hive> SELECT * FROM guruhive_external;

4. To drop the external table

Hive> DROP TABLE guruhive_external;

From the following screenshot, we can observe the output

[Screenshot: creating, loading, querying, and dropping the external table]

In the above code, we do the following things:

  • Create the external table
  • Load the data into the external table
  • Display the content of the table
  • Drop the external table

Difference between Internal vs External tables

Feature           | Internal                      | External
Schema            | Data on schema                | Schema on data
Storage location  | /user/hive/warehouse          | User-specified HDFS location
Data availability | In Hive's warehouse directory | In HDFS, at the original location

Tables, Partitions, and Buckets are the parts of Hive data modeling.

Partitions

Hive organizes tables into partitions. Partitioning is a way of dividing a table into parts based on the values of partition keys. It is helpful when the table has one or more partition keys, which determine how the data is stored in the table.

For example:

Suppose a client has e-commerce data for its India operations, with the operations of each of the 38 states in the dataset recorded as a whole. If we take the state column as the partition key and partition the India data on it, we get 38 partitions, equal to the number of states in the dataset, and each state's data can then be viewed separately in its partition.

Sample Code Snippet for partitions

  1. Create the table allstates

create table allstates(state string, district string, enrolments string)
row format delimited
fields terminated by ',';

  2. Load data into the created table allstates

load data local inpath '/home/hduser/Desktop/AllStates.csv' into table allstates;

  3. Create the partition table

create table state_part(district string, enrolments string) PARTITIONED BY (state string);

  4. Enable dynamic partitioning; for this partition load we have to set the property:

set hive.exec.dynamic.partition.mode=nonstrict;

  5. Load data into the partition table

INSERT OVERWRITE TABLE state_part PARTITION(state)
SELECT district, enrolments, state FROM allstates;

  6. Actual processing and formation of the partition tables based on state as the partition key
  7. There will be 38 partition outputs in HDFS storage, each file named after its state. We will verify this in the screenshots below.

The following screenshots show the execution of the above code

[Screenshots: creating allstates, loading its data, creating state_part, and loading the partition table]

From the above code, we do the following things:

  1. Create the table allstates with three columns: state, district, and enrolments
  2. Load data into the table allstates
  3. Create the partition table state_part with state as the partition key
  4. Set the partition mode to non-strict (this enables dynamic partitioning)
  5. Load data into the partition table state_part
  6. Actual processing and formation of the partition tables based on state as the partition key
  7. Observe the 38 partition outputs in HDFS, each named after its state

Buckets

Buckets are used for efficient querying. Bucketing in Hive provides an effective way of segregating table data into multiple files or directories.

  • The data present in partitions can be divided further into buckets
  • The division is performed based on the hash of a particular column that we select in the table
  • Buckets use a hashing algorithm behind the scenes to read each record and place it into a bucket
  • In Hive, we have to enable bucketing by using set hive.enforce.bucketing=true;

Step 1) Creating a bucketed table, as shown below.

[Screenshot: creating the bucketed table sample_bucket]

From the above screenshot:

  • We are creating sample_bucket with the columns first_name, job_id, department, salary, and country (see the sketch after this list)
  • We are creating 4 buckets here
  • Once the data is loaded, it is automatically placed into the 4 buckets
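A sketch of the statement shown in the screenshot (the choice of country as the clustering column is an assumption; any column from the table could serve as the bucketing key):

Hive> CREATE TABLE sample_bucket (first_name STRING, job_id INT, department STRING, salary INT, country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS;  -- clustering column assumed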

Step 2) Loading data into the table sample_bucket

We assume that an "employees" table has already been created in the Hive system. In this step, we will load data from the employees table into the table sample_bucket.

Before we start moving the employees data into buckets, make sure that it consists of the columns first_name, job_id, department, salary, and country.

Here we are loading data into sample_bucket from the employees table.

[Screenshot: inserting data from employees into sample_bucket]
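The load shown in the screenshot can be written as follows (assuming the employees table exists with matching columns and bucketing is enabled as described above):

Hive> INSERT OVERWRITE TABLE sample_bucket
SELECT first_name, job_id, department, salary, country FROM employees;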

Step 3) Displaying the 4 buckets created in Step 1

[Screenshot: listing the 4 bucket files under the sample_bucket directory]

From the above screenshot, we can see that the data from the employees table is transferred into 4 buckets created in step 1.
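If sample_bucket is a managed table, the four bucket files can be listed from within the Hive shell; the path below assumes the default warehouse location and may differ in your installation:

Hive> dfs -ls /user/hive/warehouse/sample_bucket;  -- default warehouse path assumed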

Views and Indexes

Views:

Views are similar to tables and are generated based on requirements.

  • We can save any result-set data as a view in Hive
  • Usage is similar to views in SQL
  • Views in Hive are read-only: we can query them like tables, but we cannot insert or load data into them

Creation of View:

Syntax:

Create VIEW <view_name> AS SELECT ...

Example:

Hive> Create VIEW Sample_View AS SELECT * FROM employees WHERE salary > 25000;

In this example, we are creating the view Sample_View, which will display all the rows with a salary greater than 25000.
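Once created, the view can be queried like any table:

Hive> SELECT * FROM Sample_View LIMIT 10;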

Index:

An index is a pointer to a particular column of a table.

  • The user has to define the index manually
  • Wherever we create an index, we are creating a pointer to a particular column of the table
  • Any changes made to the indexed column are reflected through the index value created on that column

Syntax:

Create INDEX <index_name> ON TABLE <table_name>(<column names>) AS 'index_type' WITH DEFERRED REBUILD

Example:

Create INDEX sample_Index ON TABLE guruhive_internaltable(id) AS 'COMPACT' WITH DEFERRED REBUILD;

Here we are creating a compact index on the column id of the table guruhive_internaltable.
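Since the index above is created WITH DEFERRED REBUILD, it has to be built before use, and it can be listed afterwards (note that Hive indexes were removed in Hive 3.0):

Hive> ALTER INDEX sample_Index ON guruhive_internaltable REBUILD;
Hive> SHOW INDEX ON guruhive_internaltable;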