Hive Queries Examples: Order By, Group By & Cluster By

⚡ Smart Summary

Hive queries use ORDER BY, GROUP BY, SORT BY, CLUSTER BY and DISTRIBUTE BY clauses to sort, group and spread rows across reducers, and each clause is demonstrated here on a single sample employees table.

🧱 Sample table first: employees_guru is created with six columns and loaded from Employees.txt before any clause runs.
🔢 ORDER BY totals: ORDER BY sends the whole result set to one reducer, which guarantees total order but slows large queries.
🔤 String sorting: A string column such as Department is returned in lexicographical order rather than numeric order.
📊 GROUP BY counts: Pairing GROUP BY with count(*) returns one row per department with its employee total.
🔀 SORT BY is per reducer: SORT BY orders rows inside each reducer, so multiple reducers produce partially ordered output.
🎯 CLUSTER BY combines: CLUSTER BY acts as DISTRIBUTE BY plus SORT BY, while DISTRIBUTE BY alone routes matching keys to one reducer unsorted.

Hive provides a SQL-type querying language for ETL purposes on top of the Hadoop file system.

Hive Query Language (HiveQL) provides a SQL-type environment in Hive for working with tables, databases and queries.

HiveQL offers a range of clauses for data manipulation and querying, and Hive provides JDBC connectivity so that clients outside the Hive environment can connect to it.

Hive queries provide the following features:

Data modeling such as creation of databases, tables, etc.
ETL functionalities such as extraction, transformation, and loading data into tables
Joins to merge different data tables
User-specific custom scripts that simplify the code
Faster querying tool on top of Hadoop

Creating Table in Hive

Before starting with the main topic of this tutorial, we will first create a table to use as a reference for the sections that follow.

Here in this tutorial, we are going to create the table “employees_guru” with 6 columns, as the screenshot below shows.

From the above screenshot,

We are creating the table “employees_guru” with 6 columns (Id, Name, Age, Address, Salary and Department) that describe the employees of the organization “guru.”
In this step we are loading data into the employees_guru table. The data to be loaded is placed in the Employees.txt file.
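The two steps above can be sketched in HiveQL as follows. This is a minimal sketch, not the exact statements from the screenshot: the column types, the comma delimiter and the file path are assumptions, since the article names only the six columns and the Employees.txt file.

```sql
-- Column types and the comma delimiter are assumptions; adjust them to match Employees.txt.
CREATE TABLE IF NOT EXISTS employees_guru (
  Id         INT,
  Name       STRING,
  Age        INT,
  Address    STRING,
  Salary     FLOAT,
  Department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- The local path is hypothetical; point it at wherever Employees.txt is stored.
LOAD DATA LOCAL INPATH '/path/to/Employees.txt' INTO TABLE employees_guru;
```

Running SELECT * FROM employees_guru LIMIT 5; afterwards is a quick way to confirm that the delimiter matched and the columns are not NULL.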

Order by query

The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in the SQL language.

ORDER BY is the clause we use with the “SELECT” statement in Hive queries to sort data. It sorts the result by the values of the columns listed after ORDER BY, in ascending order by default or in descending order when DESC is specified.

If the mentioned ORDER BY field is a string, then it displays the result in lexicographical order. At the back end, the whole result set has to be passed on to a single reducer.

That single reducer also makes ORDER BY expensive on a large table, so in strict mode (hive.mapred.mode=strict) Hive rejects an ORDER BY that carries no LIMIT clause.
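As a hedged illustration of the strict-mode rule, the sketch below bounds an ORDER BY with LIMIT. The limit of 10 and the choice of Salary as the sort column are arbitrary, and newer Hive releases may control this check through other settings, so treat the SET line as an assumption about your version.

```sql
-- Assumes a Hive version that still honours hive.mapred.mode.
SET hive.mapred.mode=strict;

-- Accepted in strict mode: the LIMIT bounds the work of the single reducer.
SELECT Id, Name, Salary
FROM employees_guru
ORDER BY Salary DESC
LIMIT 10;
```

Without the LIMIT clause, the same query is expected to be rejected while strict mode is active.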

The screenshot below shows the ORDER BY query and its sorted rows.

From the above screenshot, we can observe the following:

It is the query performed on the “employees_guru” table with the ORDER BY clause and Department as the defined ORDER BY column name. “Department” is a string, so it displays results based on lexicographical order.
This is the actual output for the query. The results are displayed based on the Department column, such as ADMIN, Finance and so on, in order.

Query:

SELECT * FROM employees_guru ORDER BY Department;

Group by query

The GROUP BY clause groups rows that share the same values in the listed columns, and the query returns one result row per group, usually together with an aggregate such as count(*).

For example, the screenshot below displays the total count of employees present in each department. Here we have “Department” as the GROUP BY value.

From the above screenshot, we can observe the following:

It is the query that is performed on the “employees_guru” table with the GROUP BY clause and Department as the defined GROUP BY column name.
The output shown here is the department name and the employee count in the different departments. All the employees belonging to a specific department are grouped and displayed, so each result row is a department name with its total number of employees.

Query:

SELECT Department, count(*) FROM employees_guru GROUP BY Department;
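A slightly extended variant of the same query is sketched below. It names the count column and adds an average, which assumes Salary is stored as a numeric type; the aliases employee_count and avg_salary are illustrative only.

```sql
-- One output row per department; the aliases are hypothetical names chosen for readability.
SELECT Department,
       count(*)    AS employee_count,
       avg(Salary) AS avg_salary
FROM employees_guru
GROUP BY Department;
```

Every column in the SELECT list that is not wrapped in an aggregate has to appear in the GROUP BY clause, which is why Department is the only bare column here.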

Sort by

The SORT BY clause sorts the output by the listed columns of a Hive table. We can mention DESC for sorting in descending order and ASC for ascending order of the sort.

SORT BY sorts the rows before feeding them to the reducer, so the ordering is guaranteed inside each reducer rather than across the whole result. Sorting always depends on the column type.

For instance, if the column type is numeric it sorts in numeric order, and if the column type is string it sorts in lexicographical order.
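To see the per-reducer behaviour, the reducer count can be pinned before the query runs. The sketch below assumes the MapReduce engine and uses two reducers purely for illustration.

```sql
-- Two reducers is an arbitrary choice for the demonstration.
SET mapreduce.job.reduces=2;

-- Each reducer emits its own rows in descending Id order;
-- the combined output is not guaranteed to be globally sorted.
SELECT Id, Name
FROM employees_guru
SORT BY Id DESC;
```

With a single reducer, SORT BY and ORDER BY give the same result, which is why a small sample table often hides the difference.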

The screenshot below shows the SORT BY query using DESC.

From the above screenshot we can observe the following:

It is the query performed on the table “employees_guru” with the SORT BY clause and “Id” as the defined SORT BY column name. We used the keyword DESC.
So the output is displayed in descending order of “Id”. With more than one reducer, that order holds only within each reducer's output.

Query:

SELECT * FROM employees_guru SORT BY Id DESC;

Cluster By

CLUSTER BY is shorthand for using both DISTRIBUTE BY and SORT BY on the same columns in HiveQL.

The CLUSTER BY clause is used on tables present in Hive. Hive uses the columns in CLUSTER BY to distribute the rows among reducers, so rows with the same CLUSTER BY values go to the same reducer. It also sorts the rows by those columns within each reducer.

For example, the CLUSTER BY clause is mentioned on the Id column of the employees_guru table. Executing this query spreads the rows over multiple reducers at the back end, while in the query text the single clause replaces the SORT BY and DISTRIBUTE BY pair.

In terms of the MapReduce framework, this is the back-end process when we perform a query with SORT BY, GROUP BY, or CLUSTER BY. So if we want the results spread over multiple reducers, with each reducer's rows sorted, we go with CLUSTER BY.

The CLUSTER BY grammar accepts column names only, so ASC and DESC cannot be attached to it; a descending result needs DISTRIBUTE BY with a separate SORT BY … DESC.
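The equivalence and the descending workaround can be sketched side by side. Both queries use only the Id and Name columns of the sample table.

```sql
-- Shorthand form: distribute on Id and sort ascending on Id within each reducer.
SELECT Id, Name FROM employees_guru CLUSTER BY Id;

-- Spelled-out form of the same request.
SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id SORT BY Id ASC;

-- Descending variant, which CLUSTER BY cannot express on its own.
SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id SORT BY Id DESC;
```

In all three cases the ordering applies within each reducer, not across the whole result.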

The screenshot below shows the CLUSTER BY query on Id.

From the above screenshot we can observe the following:

It is the query that applies the CLUSTER BY clause to the Id column, so the rows are sorted on the Id values.
It displays the Id and Name values present in employees_guru in sorted order.

Query:

SELECT Id, Name FROM employees_guru CLUSTER BY Id;

Distribute By

The DISTRIBUTE BY clause is used on tables present in Hive. Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers, so all rows that share the same DISTRIBUTE BY column value go to the same reducer.

It ensures that each of the N reducers receives a non-overlapping set of the column values
It does not sort the output of each reducer, and matching rows are not guaranteed to sit next to each other
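When matching rows do need to sit next to each other, DISTRIBUTE BY is usually paired with SORT BY. The sketch below distributes on Department and sorts on a second column; ordering by Salary is an illustrative choice and assumes that column is numeric.

```sql
-- All rows of one department reach the same reducer,
-- where they are grouped together and ordered by salary, highest first.
SELECT Department, Name, Salary
FROM employees_guru
DISTRIBUTE BY Department
SORT BY Department ASC, Salary DESC;
```

Listing Department first in SORT BY is what places rows of the same department next to each other, because one reducer may receive several departments.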

The screenshot below shows the DISTRIBUTE BY query on Id.

From the above screenshot, we can observe the following:

The DISTRIBUTE BY clause is performed on the Id of the “employees_guru” table.
The output shows Id and Name. At the back end, rows with the same Id go to the same reducer.

Query:

SELECT Id, Name FROM employees_guru DISTRIBUTE BY Id;

The four clauses are easy to confuse, so the table below compares them.

Clause	Reducers	What it guarantees
ORDER BY	One	Total order across the whole result
SORT BY	Many	Ordering within each reducer only
DISTRIBUTE BY	Many	Same key reaches the same reducer, unsorted
CLUSTER BY	Many	DISTRIBUTE BY plus SORT BY, ascending

FAQs

In strict mode Hive requires a LIMIT on ORDER BY because a single reducer has to sort every row, which can run for hours on a large table. LIMIT bounds that work. Setting hive.mapred.mode to nonstrict removes the restriction entirely.

Set mapreduce.job.reduces before the query to fix the count, otherwise Hive estimates it from input size. ORDER BY ignores the setting, because a total order always collapses onto one reducer.

No, CLUSTER BY cannot sort in descending order. The clause accepts column names only and always sorts ascending. Write DISTRIBUTE BY on the partition column with a separate SORT BY column DESC instead; the two columns may also differ.

CLUSTER BY and bucketing are related but not identical. Bucketing stores rows permanently in a fixed number of files, while CLUSTER BY distributes and sorts rows only for the duration of one query.
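For contrast, a bucketed table is declared in the DDL rather than in the query. The sketch below is illustrative only: the table name employees_guru_bucketed, the column types and the bucket count of 4 are all assumptions.

```sql
-- Hypothetical bucketed copy of the sample table; name, types and bucket count are illustrative.
CREATE TABLE employees_guru_bucketed (
  Id         INT,
  Name       STRING,
  Department STRING
)
CLUSTERED BY (Id) SORTED BY (Id ASC) INTO 4 BUCKETS;

-- Populating the table writes the rows into the bucket files.
INSERT OVERWRITE TABLE employees_guru_bucketed
SELECT Id, Name, Department FROM employees_guru;
```

Older Hive releases may also need hive.enforce.bucketing set to true before the insert, so check the behaviour of your version.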

Machine learning assistants read the EXPLAIN plan and flag causes such as an unbounded ORDER BY, a missing partition filter or a skewed key. Confirm every suggestion against the actual runtime.
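The plan mentioned above can be produced with the EXPLAIN keyword in front of any query. The sketch uses the ORDER BY query from earlier in this tutorial.

```sql
-- Prints the execution plan instead of running the query.
EXPLAIN
SELECT * FROM employees_guru ORDER BY Department;
```

The output lists the stages of the job, and the exact layout differs between Hive versions and execution engines.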

It drafts these patterns well from a short comment. Verify anything engine specific, because it readily mixes in Spark SQL or Presto syntax that Hive rejects, such as DESC after CLUSTER BY.

A plain text file named Employees.txt holds the six column values and is loaded into the table before the first query runs. Any delimited file with matching columns works the same way.

The semantics of these clauses are identical across execution engines, because they are HiveQL language features rather than engine features. Only the physical plan changes: the engines schedule sort and shuffle stages differently, so runtimes vary while results do not.
