Pandas Cheat Sheet for Data Science in Python

What is Pandas Cheat Sheet?

Pandas library has many functions, but some of these are confusing for some people. We have here provided a helpful resource available called the Python Pandas Cheat Sheet. It explains the basics of Pandas in a simple and concise manner.

👉 Download the PDF of Cheat Sheet here

Whether you are a newbie or experienced with Pandas, this cheat sheet can serve as a useful reference guide. It covers a variety of topics, including working with Series and DataFrame data structures, selecting and ordering data, and applying functions to your data.

In summary, this Pandas Python Cheat Sheet is a good resource for anyone looking to learn more about using Python for Data Science. It is a handy reference tool. It can help you improve your data analysis skills and work more efficiently with Pandas.

Explaining important functions in Pandas:

To start working with pandas functions, you need to install and import pandas. There are two commands to do this:

Step 1) # Install Pandas

Pip install pandas

Step 2) # Import Pandas

Import pandas as pd

Now, you can start working with Pandas functions. We will work to manipulate, analyze and clean the data. Here are some important functions of pandas.

Pandas Data Structures

As we have already discussed that Pandas has two data structures called Series and DataFrames. Both are labeled arrays and can hold any data type. There is The only difference that Series is a one-dimensional array, and DataFrame is two-dimensional array.

1. Series

It is a one-dimensional labeled array. It can hold any data type.

s = pd.Series([2, -4, 6, 3, None], index=['A', 'B', 'C', 'D', 'E'])

2. DataFrame

It is a two-dimensional labeled array. It can hold any data type and different sizes of columns.

data = {'RollNo' : [101, 102, 75, 99],
        'Name' : ['Mithlesh', 'Ram', 'Rudra', 'Mithlesh'],
        'Course' : ['Nodejs', None, 'Nodejs', 'JavaScript']
}
df = pd.DataFrame(data, columns=['RollNo', 'Name', 'Course'])
df.head()

Importing Data

Pandas have the ability to import or read various types of files in your Notebook.

Here are some examples given below.

# Import a CSV file pd
pd.read_csv(filename)

# Import a TSV file
pd.read_table(filename)

# Import a Excel file pd
pd.read_excel(filename)

# Import a SQL table/database
pd.read_sql(query, connection_object)

# Import a JSON file
pd.read_json(json_string)

# Import a HTML file
pd.read_html(url)

# From clipboard to read_table()
pd.read_clipboard()

# From dict
pd.DataFrame(dict)

Selection

You can select elements by its location or index. You can select rows, columns, and distinct values using these techniques.

1. Series

# Accessing one element from Series
s['D']

# Accessing all elements between two given indices
s['A':'C']

# Accessing all elements from starting till given index
s[:'C']

# Accessing all elements from given index till end
s['B':]

2. DataFrame

# Accessing one column df
df['Name']

# Accessing rows from after given row
df[1:]

# Accessing till before given row
df[:1]

# Accessing rows between two given rows
df[1:2]

Selecting by Boolean Indexing and Setting

1. By Position

df.iloc[0, 1]

df.iat[0, 1]

2. By Label

df.loc[[0],  ['Name']]

3. By Label/Position

df.loc[2] # Both are same
df.iloc[2]

4. Boolean Indexing

# Series s where value is > 1
s[(s > 0)]

# Series s where value is <-2 or >1
s[(s < -2) | ~(s > 1)]

# Use filter to adjust DataFrame
df[df['RollNo']>100]

# Set index a of Series s to 6
s['D'] = 10
s.head()

Data Cleaning

For data cleaning purposes, you can perform the following operations:

  • Rename columns using the rename() method.
  • Update values using the at[] or iat[] method to access and modify specific elements.
  • Create a copy of a Series or data frame using the copy() method.
  • Check for NULL values using the isnull() method, and drop them using the dropna() method.
  • Check for duplicate values using the duplicated() method. Drop them using the drop_duplicates() method.
  • Replace NULL values using the fill () method with a specified value.
  • Replace values using the replace() method.
  • Sort values using the sort_values() method.
  • Rank values using the rank() method.
# Renaming columns
df.columns = ['a','b','c']
df.head()

# Mass renaming of columns
df = df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'})

# Or use this edit in same DataFrame instead of in copy
df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'}, inplace=True)
df.head()

# Counting duplicates in a column
df.duplicated(subset='Name')

# Removing entire row that has duplicate in given column
df.drop_duplicates(subset=['Name'])

# You can choose which one keep - by default is first
df.drop_duplicates(subset=['Name'], keep='last')

# Checks for Null Values
s.isnull()

# Checks for non-Null Values - reverse of isnull()
s.notnull()

# Checks for Null Values df
df.isnull()

# Checks for non-Null Values - reverse of isnull()
df.notnull()

# Drops all rows that contain null values
df.dropna()

# Drops all columns that contain null values
df.dropna(axis=1)

# Replaces all null values with 'Guru99'
df.fillna('Guru99')

# Replaces all null values with the mean
s.fillna(s.mean())

# Converts the datatype of the Series to float
s.astype(float)

# Replaces all values equal to 6 with 'Six'
s.replace(6,'Six')

# Replaces all 2 with 'Two' and 6 with 'Six'
s.replace([2,6],['Two','Six'])

# Drop from rows (axis=0)
s.drop(['B',  'D'])

# Drop from columns(axis=1)
df.drop('Name', axis=1)

# Sort by labels with axis
df.sort_index()

# Sort by values with axis
df.sort_values(by='RollNo')

# Ranking entries
df.rank()

# s1 is pointing to same Series as s
s1 = s

# s_copy of s, but not pointing same Series
s_copy = s.copy()

# df1 is pointing to same DataFrame as df
df1 = s

# df_copy of df, but not pointing same DataFrame
df_copy = df.copy()

Retrieving Information

You can perform these operation to retrieve information:

  • Use shape attribute to get the number of rows and columns.
  • Use the head() or tail() method to obtain the first or last few rows as a sample.
  • Use the info(), describe(), or dtypes method to obtain information about the data type, count, mean, standard deviation, minimum, and maximum values.
  • Use the count(), min(), max(), sum(), mean(), and median() methods to obtain specific statistical information for values.
  • Use the loc[] method to obtain a row.
  • Use the groupby() method to apply the GROUP BY function to group similar values in a column of a DataFrame.

1. Basic information

# Counting all elements in Series
len(s)

# Counting all elements in DataFrame
len(df)

# Prints number of rows and columns in dataframe
df.shape

# Prints first 10 rows by default, if no value set
df.head(10)

# Prints last 10 rows by default, if no value set
df.tail(10)

# For counting non-Null values column-wise
df.count()

# For range of index df
df.index

# For name of attributes/columns
df.columns

# Index, Data Type and Memory information
df.info()

# Datatypes of each column
df.dtypes

# Summary statistics for numerical columns
df.describe()

2. Summary

# For adding all values column-wise
df.sum()

# For min column-wise
df.min()

# For max column-wise
df.max()

# For mean value in number column
df.mean()

# For median value in number column
df.median()

# Count non-Null values
s.count()

# Count non-Null values
df.count()

# Return Series of given column
df['Name'].tolist()

# Name of columns
df.columns.tolist()

# Creating subset
df[['Name', 'Course']]

# Return number of values in each group
df.groupby('Name').count()

Applying Functions

# Define function
f = lambda x: x*5

# Apply this function on given Series - For each value
s.apply(f)

# Apply this function on given DataFrame - For each value
df.apply(f)

1. Internal Data Alignment

# NA values for indices that don't overlap
s2 = pd.Series([8, -1, 4],  index=['A',  'C',  'D'])
s + s2

2. Arithmetic Operations with Fill Methods

# Fill values that don't overlap
s.add(s2, fill_value=0)

3. Filter, Sort and Group By

These following functions can be used for filtering, sorting, and grouping by Series and DataFrame.

# Filter rows where column is greater than 100
df[df['RollNo']>100]

# Filter rows where 70 < column < 101
df[(df['RollNo'] > 70) & (df['RollNo'] < 101)]

# Sorts values in ascending order
s.sort_values()

# Sorts values in descending order
s.sort_values(ascending=False)

# Sorts values by RollNo in ascending order
df.sort_values('RollNo')

# Sorts values by RollNo in descending order
df.sort_values('RollNo', ascending=False)

Exporting Data

Pandas has the ability to export or write data in various formats. Here are some examples given below.

# Export as a CSV file df
df.to_csv(filename)

# Export as a Excel file df
df.to_excel(filename)

# Export as a SQL table df
df.to_sql(table_name, connection_object)

# Export as a JSON file
df.to_json(filename)

# Export as a HTML table
df.to_html(filename)

# Write to the clipboard
df.to_clipboard()

Conclusion:

Pandas is open-source library in Python for working with data sets. Its ability to analyze, clean, explore, and manipulate data. It is an important tool for data scientists. Pandas is built on top of Numpy. It is used with other programs like Matplotlib and Scikit-learn. Pandas Cheat Sheet is a helpful resource for beginners and experienced users. It covers topics such as data structures, data selection, importing data, Boolean indexing, dropping values, sorting, and data cleaning. We have also prepared pandas cheat sheet pdf for article. Pandas is a library in Python and data science uses this library for working with pandas dataframes and series. We have discussed various pandas commands in this cheatsheet.

Colab of Cheat Sheet

My Colab Exercise file for Pandas – Pandas Cheat Sheet – Python for Data Science.ipynb