Hadoop Pig Tutorial: What is Apache Pig? Architecture, Example

⚡ Smart Summary

Apache Pig is a high-level platform for analysing large data sets on Hadoop, pairing the Pig Latin data-flow language with a runtime that compiles each script into MapReduce, Tez or Spark jobs automatically.

🔘 Two components: Pig Latin supplies the language and the Pig runtime turns each script into cluster jobs.
☑️ Execution types: Current releases offer six exectypes selected with the -x flag, from local to Spark.
✅ Prerequisites: A working Hadoop installation plus Java, with HADOOP_HOME and JAVA_HOME exported, must exist first.
🧪 Worked example: A four-statement script loads SalesJan2009.csv, groups it by country and stores the product counts.
🛠️ Data model: Scalar types plus tuple, bag and map let Pig read nested and loosely structured files.
📈 Current version: Apache Pig 0.18.0 arrived in September 2025 with Hadoop 3, Tez, Spark and Python 3 support.

This tutorial starts with the idea behind Apache Pig, then walks through installing it and running a complete Pig Latin script end to end.

What is Apache Pig?

Pig is a high-level programming language for analyzing large data sets. It grew out of a development effort at Yahoo!

In a MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Apache Pig lets people spend more time analyzing bulk data sets and less time writing MapReduce programs. Like pigs, which eat almost anything, the Apache Pig programming language is designed to work on any kind of data. Hence the name, Pig!

The project is still active. Apache Pig 0.18.0 was released on 15 September 2025 and adds support for Hadoop 3, Tez 0.10, Hive 3, Spark 3, HBase 2 and Python 3, according to the Apache Pig releases page. The walkthrough below was originally written against the much older 0.12.1 line, so a few steps carry a note where a current release behaves differently.

Pig Architecture

The architecture of Pig consists of two components:

Pig Latin, which is a language
A runtime environment, for running Pig Latin programs.

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which the Pig execution environment translates into an executable representation. Under the hood, the transformations run as a series of MapReduce jobs that the programmer never has to write or see. So, in a way, Pig allows the programmer to focus on the data rather than on the nature of execution.

Pig Latin is a comparatively rigid language which uses familiar keywords from data processing, e.g., JOIN, GROUP and FILTER. A script passes from the Grunt shell through the parser, optimizer and compiler on its way to the execution engine.
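One way to watch that translation happen is the EXPLAIN operator, which prints the plans Pig builds for an alias. The sketch below is only an illustration: the scratch directory, the file names and the three sample rows are made up, and it assumes pig is on the PATH (otherwise it just leaves the script on disk).

```shell
#!/bin/sh
# Hypothetical scratch directory and sample rows, used only for this illustration.
workdir=$(mktemp -d)
printf 'US,Product1\nUK,Product2\nUS,Product1\n' > "$workdir/sample.csv"

# A three-statement data flow followed by EXPLAIN, which asks Pig to print
# the plans it builds for the alias "counts".
cat > "$workdir/explain_demo.pig" <<EOF
sales   = LOAD '$workdir/sample.csv' USING PigStorage(',') AS (country:chararray, product:chararray);
grouped = GROUP sales BY country;
counts  = FOREACH grouped GENERATE group, COUNT(sales);
EXPLAIN counts;
EOF

# Run in local mode so no cluster is needed; skip cleanly if Pig is absent.
if command -v pig >/dev/null 2>&1; then
    pig -x local "$workdir/explain_demo.pig"
else
    echo "pig not found on PATH; script left at $workdir/explain_demo.pig"
fi
```

The output is long, but the useful part is seeing the GROUP statement turn into a map stage and a reduce stage without any Java being written.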

Execution modes

Pig in Hadoop has two execution modes, and the installation walkthrough further down uses both of them:

Local mode: Pig runs in a single JVM and uses the local file system. This mode is suitable only for analysis of small datasets.
MapReduce mode: Queries written in Pig Latin are translated into MapReduce jobs and run on a Hadoop cluster (the cluster may be pseudo-distributed or fully distributed). MapReduce mode on a fully distributed cluster is the one to use for large datasets.

Those two are the classic pair. Current releases actually expose six execution types, or exectypes, each selected with the same -x flag, as documented in the Pig 0.18.0 Getting Started guide.

Exectype	Command	Where the work runs
Local	pig -x local	A single JVM against the local file system
Tez local	pig -x tez_local	A single JVM using the Tez runtime (experimental)
Spark local	pig -x spark_local	A single JVM using the Spark runtime (experimental)
MapReduce	pig or pig -x mapreduce	A Hadoop cluster with HDFS; this is the default
Tez	pig -x tez	A Hadoop cluster running the Tez engine
Spark	pig -x spark	A Spark, YARN or Mesos cluster with HDFS
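Because every exectype goes through the same -x flag, switching engines is a one-word change on the command line. The helper below is a hypothetical convenience, not part of Pig: it only builds the command string and rejects names that are not in the table above. The script name wordcount.pig is a placeholder.

```shell
#!/bin/sh
# pig_command is a hypothetical helper: it prints the command line for a given
# exectype and script, and fails for names that are not one of the six exectypes.
pig_command() {
    exectype=$1
    script=$2
    case "$exectype" in
        local|tez_local|spark_local|mapreduce|tez|spark)
            echo "pig -x $exectype $script"
            ;;
        *)
            echo "unknown exectype: $exectype" >&2
            return 1
            ;;
    esac
}

pig_command local wordcount.pig       # prints: pig -x local wordcount.pig
pig_command mapreduce wordcount.pig   # prints: pig -x mapreduce wordcount.pig
```

A common habit is to develop against a small sample with the local exectype and change only the flag when the script moves to the cluster.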

Pig Latin Data Types and Operators

Before writing a script it helps to know what Pig can hold and what it can do to it. The data model is fully nested, which is what lets Pig read log files and other loosely structured input that a relational table would reject.

Pig Latin offers two families of types:

Scalar types: int, long, float, double, chararray, bytearray, boolean, datetime, biginteger and bigdecimal. The example script later on declares every column as chararray.
Complex types: a tuple is an ordered set of fields and behaves like a row; a bag is an unordered collection of tuples and behaves like a table; a map is a set of key and value pairs whose key must be a chararray.
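The complex types are easier to picture with a small file. The sketch below writes two made-up records in the text notation PigStorage reads (braces for a bag, parentheses for a tuple, square brackets with # for a map) and a script that declares all three types in its schema. File names, field names and data are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data.
workdir=$(mktemp -d)

# One record per line, tab-separated: a chararray, a bag of one-field tuples, a map.
printf 'alice\t{(90),(85)}\t[city#Oslo]\n' >  "$workdir/students.txt"
printf 'bob\t{(70)}\t[city#Lima]\n'        >> "$workdir/students.txt"

# The schema names each complex type; COUNT works on the bag and # looks up a map key.
cat > "$workdir/types_demo.pig" <<EOF
students = LOAD '$workdir/students.txt'
           AS (name:chararray, scores:bag{t:tuple(score:int)}, info:map[]);
summary  = FOREACH students GENERATE name, COUNT(scores) AS n_scores, info#'city' AS city;
DESCRIBE summary;
DUMP summary;
EOF

if command -v pig >/dev/null 2>&1; then
    pig -x local "$workdir/types_demo.pig"
else
    echo "pig not found on PATH; script left at $workdir/types_demo.pig"
fi
```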

The operators fall into a small number of jobs, and a handful of them carry almost every pipeline:

Operator	What it does
LOAD / STORE	Read a relation from the file system and write the result back
FILTER	Keep only the tuples that match a condition
FOREACH … GENERATE	Work column by column, building new fields
GROUP / COGROUP	Group one relation, or two or more relations, by a key
JOIN	Combine relations with an inner or outer join
UNION / SPLIT	Merge relations together, or partition one into several
ORDER BY / DISTINCT / LIMIT	Sort, de-duplicate and cap the number of tuples
DUMP / DESCRIBE / EXPLAIN / ILLUSTRATE	Inspect results, schemas, execution plans and sample runs

The four inspection operators have Grunt shortcuts as well, which saves typing during debugging: \d for DUMP, \de for DESCRIBE, \e for EXPLAIN and \i for ILLUSTRATE.
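A short pipeline shows how several of these operators chain together. This is a sketch under stated assumptions: the four sample rows, the alias names and the 1500 threshold are invented for the illustration and have nothing to do with the SalesJan2009.csv example later on.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample rows (country, price).
workdir=$(mktemp -d)
printf 'US,1200\nUK,3600\nUS,7500\nFR,1200\n' > "$workdir/orders.csv"

cat > "$workdir/operators_demo.pig" <<EOF
-- LOAD reads the file and attaches a schema.
sales   = LOAD '$workdir/orders.csv' USING PigStorage(',') AS (country:chararray, price:int);
-- FILTER keeps only the tuples that match the condition.
large   = FILTER sales BY price >= 1500;
-- GROUP collects the remaining tuples into one bag per country.
grouped = GROUP large BY country;
-- FOREACH ... GENERATE builds one output tuple per group.
totals  = FOREACH grouped GENERATE group AS country, SUM(large.price) AS revenue;
-- ORDER BY and LIMIT sort the result and cap its size.
sorted  = ORDER totals BY revenue DESC;
top5    = LIMIT sorted 5;
DUMP top5;
EOF

if command -v pig >/dev/null 2>&1; then
    pig -x local "$workdir/operators_demo.pig"
else
    echo "pig not found on PATH; script left at $workdir/operators_demo.pig"
fi
```

Each statement names a new relation, which is why a Pig script reads as a data flow rather than as one nested query.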

Prerequisites

Pig is a client-side tool. It parses a script, plans the work and submits jobs, but it does not provide storage or a cluster of its own, so a working Hadoop installation has to exist first. The Apache requirements list is short.

Component	Requirement	Why it is needed
Hadoop	Hadoop 2.x or 3.x, with HADOOP_HOME exported	Supplies HDFS and the execution engine. Without HADOOP_HOME, Pig 0.18.0 falls back to an embedded Hadoop 2.7.3
Java	Java 1.7 or later, with JAVA_HOME set to the installation root	Pig and the Hadoop daemons are Java programs
Python (optional)	Python 2.7	Only needed for streaming Python user defined functions
Ant (optional)	Ant 1.8	Only needed when building or recompiling Pig from source

Two practical points sit alongside that list. First, run everything as a Linux user that already has a home directory in HDFS and permission to write to it; this tutorial uses hduser, the account created during Hadoop setup. Second, start the cluster before launching Pig in MapReduce mode, because Pig fails immediately if the NameNode and ResourceManager are down. If Hadoop is not installed yet, work through how to install Hadoop first.
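The checklist above is quick to verify from a terminal before going any further. The function below is a hypothetical helper, not something shipped with Pig or Hadoop; it only reports whether the two environment variables are set and whether java is on the PATH.

```shell
#!/bin/sh
# check_prereqs is a hypothetical helper that reports on the environment
# variables Pig expects and returns non-zero if anything is missing.
check_prereqs() {
    status=0
    for var in JAVA_HOME HADOOP_HOME; do
        # Indirect lookup of the variable named in $var, empty if unset.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var is not set"
            status=1
        else
            echo "ok: $var=$value"
        fi
    done
    if ! command -v java >/dev/null 2>&1; then
        echo "missing: java is not on PATH"
        status=1
    fi
    return $status
}

check_prereqs || echo "fix the items above before launching Pig"
```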

How to Download and Install Pig

Now in this Apache Pig tutorial, we will learn how to download and install Pig:

Before we start with the actual process, ensure you have Hadoop installed. Change user to ‘hduser’ (the id used while configuring Hadoop; you can switch to the userid used during your own Hadoop config).

Step 1) Download the latest stable release of Pig from any one of the mirror sites available at

https://pig.apache.org/releases.html

Select the tar.gz file (and not src.tar.gz) to download.

Step 2) Once the download is complete, navigate to the directory containing the downloaded tar file and move the tar to the location where you want to set up Pig. In this case, we will move it to /usr/local.

Move to the directory that will contain the Pig files.

cd /usr/local

Extract the contents of the tar file as below.

sudo tar -xvf pig-0.12.1.tar.gz

Step 3) Modify ~/.bashrc to add Pig related environment variables.

Open the ~/.bashrc file in any text editor of your choice and make the modifications below.

export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH

Step 4) Now, source this environment configuration using the command below.

. ~/.bashrc

Step 5) We need to recompile Pig to support Hadoop 2.2.0.

Here are the steps to do this.

Go to the Pig home directory.

cd $PIG_HOME

Install Ant.

sudo apt-get install ant

Note: This starts a download, and how long it takes depends on your internet speed.

Recompile Pig.

sudo ant clean jar-all -Dhadoopversion=23

Please note that in this recompilation process multiple components are downloaded, so the system should be connected to the internet.

Also, if this process gets stuck and you see no movement on the command prompt for more than 20 minutes, press Ctrl + c and rerun the same command.

In our case, it took 20 minutes.

This recompilation step belongs to the 0.12.1 era, when the shipped jars were built against Hadoop 1 and the -Dhadoopversion=23 flag switched the Ant build to the Hadoop 0.23 and 2.x line. Apache Pig 0.18.0 ships binaries that already run on Hadoop 2.7 and above as well as Hadoop 3, so on a current release you can normally unpack the tarball and skip straight to the test below.

Step 6) Test the Pig installation using the command

pig -help

Example Pig Script

With Pig installed, the rest of this Apache Pig tutorial runs a complete script. We will use a Pig script to find the number of products sold in each country.

Input: Our input data set is a CSV file, SalesJan2009.csv.

Step 1) Start Hadoop.

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

Step 2) In MapReduce mode, Pig reads its input from HDFS and stores the results back to HDFS.

Copy the file SalesJan2009.csv (stored on the local file system at ~/input/SalesJan2009.csv) to the root directory of HDFS (Hadoop Distributed File System), which is where the LOAD statement later in this example expects it.

Here in this Apache Pig example, the file is in the folder input. If the file is stored in some other location, give that path instead.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /

Verify whether the file was actually copied or not.

$HADOOP_HOME/bin/hdfs dfs -ls /

Step 3) Pig Configuration.

First, navigate to $PIG_HOME/conf and keep a copy of the original properties file.

cd $PIG_HOME/conf

sudo cp pig.properties pig.properties.original

Open pig.properties using a text editor of your choice, and specify the log file path using pig.logfile.

sudo gedit pig.properties

The logger will make use of this file to log errors.
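If you would rather script this change than edit the file by hand, something like the following could work. It is a sketch: set_pig_logfile is a hypothetical helper, /tmp/pig-err.log is an example path, and the demonstration runs on a scratch file that imitates a properties file where the entry is commented out. Against the real $PIG_HOME/conf/pig.properties the same commands would need write permission on that file.

```shell
#!/bin/sh
# set_pig_logfile is a hypothetical helper: it sets pig.logfile in the given
# properties file, replacing an existing uncommented entry or appending one.
set_pig_logfile() {
    props=$1
    logfile=$2
    if grep -q '^pig\.logfile=' "$props"; then
        # Rewrite through a temporary file to stay portable across sed versions.
        sed "s|^pig\.logfile=.*|pig.logfile=$logfile|" "$props" > "$props.tmp" &&
            mv "$props.tmp" "$props"
    else
        printf 'pig.logfile=%s\n' "$logfile" >> "$props"
    fi
}

# Demonstration on a scratch file, not on the real pig.properties.
demo=$(mktemp)
printf '# pig.logfile=\n' > "$demo"
set_pig_logfile "$demo" /tmp/pig-err.log
cat "$demo"
```

Running the helper a second time with a different path updates the existing line instead of adding a duplicate.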

Step 4) Run the command ‘pig’, which will start the Pig command prompt, an interactive shell for Pig queries.

pig

Step 5) In the Grunt command prompt for Pig, execute the Pig commands below in order.

-- A. Load the file containing data.

salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);

Press Enter after this command.

-- B. Group data by the field Country.

GroupByCountry = GROUP salesTable BY Country;

-- C. For each tuple in ‘GroupByCountry’, generate the resulting string of the form Name of Country: No. of products sold.

CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));

Press Enter after this command.

-- D. Store the results of the data flow in the directory ‘pig_output_sales’ on HDFS.

STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');

This command takes some time to execute, because STORE is the statement that actually triggers the MapReduce job.
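The same four statements can also be saved to a file and run in batch mode instead of being typed into Grunt. The sketch below assumes the file name sales_by_country.pig and a scratch directory, both made up for the illustration; the statements themselves are copied unchanged from the steps above.

```shell
#!/bin/sh
# Hypothetical scratch directory and script name.
workdir=$(mktemp -d)

# The quoted 'EOF' keeps $0, $1 and \t exactly as Pig expects them.
cat > "$workdir/sales_by_country.pig" <<'EOF'
-- Same four statements as the Grunt session.
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);
GroupByCountry = GROUP salesTable BY Country;
CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
EOF

# With no -x flag Pig uses the default MapReduce exectype, so the cluster
# from Step 1 has to be running.
if command -v pig >/dev/null 2>&1; then
    pig "$workdir/sales_by_country.pig"
else
    echo "pig not found on PATH; script left at $workdir/sales_by_country.pig"
fi
```

One thing to watch when rerunning: STORE refuses to write into an output directory that already exists, so remove it first with hdfs dfs -rm -r pig_output_sales.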

Step 6) The result can be seen through the command interface as

$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000
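Since each output line has the form Country:count, ordinary Unix tools can rank the result. The helper name rank_countries and the three sample lines below are made up for the illustration; they are not the real output of the job.

```shell
#!/bin/sh
# rank_countries is a hypothetical helper: it reads Country:count lines on
# stdin and prints them with the largest count first.
rank_countries() {
    sort -t: -k2,2nr
}

# Made-up sample lines in the same shape as the job output.
printf 'France:3\nUnited States:10\nIreland:7\n' | rank_countries

# Against the real output, assuming the cluster is up:
#   $HADOOP_HOME/bin/hdfs dfs -cat 'pig_output_sales/part-r-*' | rank_countries | head
```

For the sample lines this prints United States:10 first, then Ireland:7, then France:3.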

Results can also be seen via a web interface.

Open http://localhost:50070/ in a web browser.

Port 50070 is the NameNode web UI on Hadoop 2. Hadoop 3 moved the same page to port 9870, so use that address instead if your cluster runs Hadoop 3.

Now select ‘Browse the filesystem’ and navigate to /user/hduser/pig_output_sales.

Open part-r-00000 to read the country and product-count pairs the script produced.

Apache Pig vs Hive vs MapReduce

Pig is rarely evaluated on its own. The usual question is whether a job belongs in Pig, in Hive or in a hand-written MapReduce program, since all three end up as jobs on the same cluster.

Aspect	MapReduce (Java)	Apache Pig	Apache Hive
Interface	Java API	Pig Latin, a data-flow scripting language	HiveQL, a SQL dialect
Level of abstraction	Low	Medium	High
Typical user	Java developer	Data engineer or researcher	Analyst comfortable with SQL
Data it suits	Anything, handled manually	Structured and semi-structured; the schema is optional at load time	Structured tables registered in a metastore
Code volume	Most lines	Fewer lines	Fewest lines for a query
Runs as	A MapReduce job	Compiles to MapReduce, Tez or Spark jobs	Compiles to MapReduce, Tez or Spark jobs
Best at	Full control and custom logic	Multi-step ETL pipelines	Ad hoc querying and reporting

In practice the split is straightforward. Reach for Hive when the data already looks like a table and the question looks like SQL. Reach for Pig when the input is messy, when the pipeline is a chain of transformations rather than a single query, or when the same script has to branch and rejoin. Reach for MapReduce only when neither abstraction can express the logic, because the line count grows quickly.

Pig also sits comfortably beside the rest of the ecosystem. Sqoop and Flume land the raw data, Pig or Hive shape it, and a scheduler such as Apache Oozie chains the steps into a repeatable pipeline. For the wider picture of where these tools fit, see the guide to big data and the roundup of big data tools; for a commercial ETL alternative to hand-written scripts, see Talend.

FAQs

Apache Pig is still maintained. Version 0.18.0 was released on 15 September 2025 and brought support for Hadoop 3, Tez 0.10, Hive 3, Spark 3, HBase 2 and Python 3. Development is slower than in its Yahoo! years, but the project is not retired.

AI assistants draft transformation logic from a plain description, suggest join keys, flag schema drift between runs, and summarise cluster logs into a probable cause. Treat the output as a first draft that still needs profiling against real data.

Copilot produces plausible Pig Latin quickly because the syntax is well represented in public repositories. It does invent function names and mismatch field positions, so run DESCRIBE and ILLUSTRATE on a sample before trusting a generated script.

Pig only executes when a statement forces materialisation. A chain of LOAD and FOREACH statements is validated but never run until a DUMP or STORE follows, so a script that ends after a transformation finishes silently.

DUMP prints a relation to the terminal and is meant for testing. STORE writes the relation to the file system through a store function such as PigStorage. Production scripts should always end with STORE.

A schema is not required when loading data, because the AS clause is optional. Without it, every field is loaded as bytearray and referenced by position, such as $0 and $1. Declaring a schema simply makes scripts readable and lets Pig type-check earlier.
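A small experiment makes the difference visible. In the sketch below, the scratch directory, the two sample rows and the alias names are assumptions for illustration; the script loads a file without an AS clause, then casts the positional fields and names them in a FOREACH.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample rows.
workdir=$(mktemp -d)
printf 'US,Product1\nUK,Product2\n' > "$workdir/sample.csv"

# \$0 and \$1 are escaped so the shell writes literal $0 and $1 into the script.
cat > "$workdir/no_schema.pig" <<EOF
-- No AS clause: fields are bytearray and only reachable by position.
raw   = LOAD '$workdir/sample.csv' USING PigStorage(',');
-- Cast and name the fields explicitly once they are needed.
pairs = FOREACH raw GENERATE (chararray)\$0 AS country, (chararray)\$1 AS product;
DESCRIBE raw;
DESCRIBE pairs;
EOF

if command -v pig >/dev/null 2>&1; then
    pig -x local "$workdir/no_schema.pig"
else
    echo "pig not found on PATH; script left at $workdir/no_schema.pig"
fi
```

DESCRIBE should report that the schema for raw is unknown, while pairs shows the two named chararray fields.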

Most new pipelines start on Spark, which has a larger community and richer libraries. Pig remains sensible where scripts already exist, where the team prefers a data-flow syntax, or where Pig runs on Spark through the spark exectype.

Sqoop imports relational tables and Flume streams log data into HDFS. Pig then transforms whatever has landed. Oozie chains the import, transform and export steps into one scheduled workflow so the pipeline repeats without manual runs.
