R Aggregate Function: summarise() & group

⚡ Smart Summary

Aggregate Function in R with dplyr collapses many rows into one statistic per group. This walkthrough pairs group_by() with summarise() on the Lahman batting data to compute means, sums, spreads, counts, and positional values across 104,324 observations.

🧮 Core Verb: summarise(df, name = function(column)) returns one row, and group_by() first turns that into one row per group.
📊 Statistic Families: mean and median describe centre, sd and IQR describe spread, min and max describe range.
🔢 Counting: n() counts rows in the current group and n_distinct() counts unique values within it.
📍 Positional Helpers: first(), last(), and nth() extract a specific observation from each group.
🔄 Multi-Column Work: across() applies one function to many columns without repeating the code for each.
🔓 Grouping Hygiene: Always call ungroup() before a calculation that must run on the whole table again.

Dataset Used in This Tutorial

Summarising a variable gives you a first sense of the data, and summarising it by group tells you far more about how the values are distributed. This tutorial shows how to aggregate a dataset by group with the dplyr library.

This tutorial uses the Lahman batting dataset. After importing it and keeping only the nine columns listed below, the working data frame holds 104,324 observations and 9 variables:

playerID: Player ID code. Factor
yearID: Year. Factor
teamID: Team. factor
lgID: League. Factor: AA AL FL NL PL UA
AB: At bats. Numeric
G: Games: number of games by a player. Numeric
R: Runs. Numeric
HR: Homeruns. Numeric
SH: Sacrifice hits. Numeric

Before you perform summary, you will do the following steps to prepare the data:

Step 1: Import the data
Step 2: Select the relevant variables
Step 3: Sort the data

library(dplyr)

# Step 1
data <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/lahman-batting.csv") %>%

# Step 2
select(c(playerID, yearID, AB, teamID, lgID, G, R, HR, SH))  %>% 

# Step 3
arrange(playerID, teamID, yearID)

A good practice when you import a dataset is to use the glimpse() function to have an idea about the structure of the dataset.

# Structure of the data
glimpse(data)

Output:

Observations: 104,324
Variables: 9
$ playerID <fctr> aardsda01, aardsda01, aardsda01, aardsda01, aardsda01, a...
$ yearID   <int> 2015, 2008, 2007, 2006, 2012, 2013, 2009, 2010, 2004, 196...
$ AB       <int> 1, 1, 0, 2, 0, 0, 0, 0, 0, 603, 600, 606, 547, 516, 495, ...
$ teamID   <fctr> ATL, BOS, CHA, CHN, NYA, NYN, SEA, SEA, SFN, ATL, ATL, A...
$ lgID     <fctr> NL, AL, AL, NL, AL, NL, AL, AL, NL, NL, NL, NL, NL, NL, ...
$ G        <int> 33, 47, 25, 45, 1, 43, 73, 53, 11, 158, 155, 160, 147, 15...
$ R        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 117, 113, 84, 100, 103, 95, 75...
$ HR       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 44, 39, 29, 44, 38, 47, 34, 40...
$ SH       <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 6, ...

summarise()

The syntax of summarise() is basic and consistent with the other verbs included in the dplyr library.

summarise(df, variable_name=condition) 
arguments: 
- `df`: Dataset used to construct the summary statistics 
- `variable_name=condition`: Formula to create the new variable

Look at the code below:

summarise(data, mean_run =mean(R))

Code Explanation

summarise(data, mean_run = mean(R)): Creates a variable named mean_run which is the average of the column run from the dataset data.

Output:

##   mean_run
## 1 19.20114

You can add as many variables as you want. You return the average games played and the average sacrifice hits.

summarise(data, mean_games = mean(G),
    mean_SH = mean(SH, na.rm = TRUE))

Code Explanation

mean_SH = mean(SH, na.rm = TRUE): Summarize a second variable. You set na.rm = TRUE because the column SH contains missing observations.

Output:

##   mean_games  mean_SH
## 1   51.98361 2.340085

group_by() vs No group_by()

Used alone, summarise() collapses the whole data frame into a single row, as the examples above show. Its real power appears once you add group_by(): the dplyr library then applies the same calculation separately to every group and returns one row per group.

Note that, group_by works perfectly with all the other verbs (i.e. mutate(), filter(), arrange(), …).

It is convenient to use the pipeline operator when you have more than one step. You can compute the average homerun by baseball league.

data %>%
	group_by(lgID) %>%
	summarise(mean_run = mean(HR))

Code Explanation

data: Dataset used to construct the summary statistics
group_by(lgID): Split the data by the values of lgID before summarising
summarise(mean_run = mean(HR)): Compute the average homerun

Output:

## 
# A tibble: 7 x 2
##     lgID  mean_run
##   <fctr>     <dbl>
## 1     AA 0.9166667
## 2     AL 3.1270988
## 3     FL 1.3131313
## 4     NL 2.8595953
## 5     PL 2.5789474
## 6     UA 0.6216216
## 7   <NA> 0.2867133

The pipe works with ggplot() as well, so the whole chain from raw data to finished chart stays in one statement. The code below combines group_by(), summarise() and ggplot() to show the average home runs per league as a bar chart.

You will do the following step:

Step 1: Select data frame
Step 2: Group data
Step 3: Summarize the data
Step 4: Plot the summary statistics

library(ggplot2)
# Step 1
data %>% 
#Step 2
group_by(lgID) %>% 
#Step 3
summarise(mean_home_run = mean(HR)) %>% 
#Step 4
ggplot(aes(x = lgID, y = mean_home_run, fill = lgID)) +
    geom_bar(stat = "identity") +
    theme_classic() +
    labs(
        x = "baseball league",
        y = "Average home run",
        title = paste(
            "Example group_by() with summarise()"
        )
    )

Output:

Functions You Can Use Inside summarise()

The verb summarise() is compatible with almost all the functions in R. Here is a short list of useful functions you can use together with summarise():

Objective	Function	Description
Basic	mean()	Average of vector x
	median()	Median of vector x
	sum()	Sum of vector x
variation	sd()	standard deviation of vector x
	IQR()	Interquartile of vector x
Range	min()	Minimum of vector x
	max()	Maximum of vector x
	quantile()	Quantile of vector x
Position	first()	Use with group_by() First observation of the group
	last()	Use with group_by(). Last observation of the group
	nth()	Use with group_by(). nth observation of the group
Count	n()	Use with group_by(). Count the number of rows
	n_distinct()	Use with group_by(). Count the number of distinct observations

The sections below work through an example for each function in the table above. If you are new to the pipe and the other verbs, start with the select, filter and arrange tutorial.

Basic function

In the previous example, you didn’t store the summary statistic in a data frame.

You can generate a data frame from a summary in two steps:

Step 1: Store the data frame for further use
Step 2: Use the dataset to create a line plot

Step 1) You compute the average number of games played by year.

## Mean
ex1 <- data %>%
	group_by(yearID) %>%
	summarise(mean_game_year = mean(G))
head(ex1)

Code Explanation

The summary statistic of batting dataset is stored in the data frame ex1.

Output:

## # A tibble: 6 x 2
##   yearID mean_game_year
##    <int>          <dbl>
## 1   1871       23.42308
## 2   1872       18.37931
## 3   1873       25.61538
## 4   1874       39.05263
## 5   1875       28.39535
## 6   1876       35.90625

Step 2) You show the summary statistic with a line plot and see the trend.

# Plot the graph
ggplot(ex1, aes(x = yearID, y = mean_game_year)) +
    geom_line() +
    theme_classic() +
    labs(
        x = "Year",
        y = "Average games played",
        title = paste(
            "Average games played from 1871 to 2016"
        )
    )

Output:

Subsetting

The function summarise() is compatible with subsetting.

## Subsetting + Median
data %>%
group_by(lgID) %>%
summarise(median_at_bat_league = median(AB), 
	#Compute the median without the zero 
	median_at_bat_league_no_zero = median(AB[AB > 0]))

Code Explanation

median_at_bat_league_no_zero = median(AB[AB > 0]): The variable AB contains lots of 0. You can compare the median of the at bat variable with and without 0.

Output:

## # A tibble: 7 x 3
##     lgID median_at_bat_league median_at_bat_league_no_zero
##   <fctr>                <dbl>                        <dbl>
## 1     AA                  130                          131
## 2     AL                   38                           85
## 3     FL                   88                           97
## 4     NL                   56                           67
## 5     PL                  238                          238
## 6     UA                   35                           35
## 7   <NA>                  101                          101

Sum

Another useful function to aggregate the variable is sum().

You can check which leagues have the more homeruns.

## Sum
data %>%
	group_by(lgID) %>%
	summarise(sum_homerun_league = sum(HR))

Output:

## # A tibble: 7 x 2
##     lgID sum_homerun_league
##   <fctr>              <int>
## 1     AA                341
## 2     AL              29426
## 3     FL                130
## 4     NL              29817
## 5     PL                 98
## 6     UA                 46
## 7   <NA>                 41

Standard deviation

Spread in the data is computed with the standard deviation or sd() in R.

# Spread
data %>%
	group_by(teamID) %>%
	summarise(sd_at_bat_league = sd(HR))

Output:

## # A tibble: 148 x 2
##    teamID sd_at_bat_league
##    <fctr>            <dbl>
##  1    ALT               NA
##  2    ANA        8.7816395
##  3    ARI        6.0765503
##  4    ATL        8.5363863
##  5    BAL        7.7350173
##  6    BFN        1.3645163
##  7    BFP        0.4472136
##  8    BL1        0.6992059
##  9    BL2        1.7106757
## 10    BL3        1.0000000
## # ... with 138 more rows

The spread in home runs differs widely between teams, and ALT returns NA because that team has only one row, so no standard deviation can be computed.

Minimum and maximum

You can access the minimum and the maximum of a vector with the function min() and max().

The code below returns the lowest and highest number of games in a season played by a player.

# Min and max
data %>%
	group_by(playerID) %>%
	summarise(min_G = min(G),
    max_G = max(G))

Output:

## # A tibble: 10,395 x 3
##     playerID min_G max_G
##       <fctr>       <int>
##  1 aardsda01    53    73
##  2 aaronha01   120   156
##  3  aasedo01    24    66
##  4  abadfe01    18    18
##  5 abadijo01    11    11
##  6 abbated01     3   153
##  7 abbeybe01    11    11
##  8 abbeych01    80   132
##  9 abbotgl01     5    23
## 10 abbotji01    13    29
## # ... with 10,385 more rows

Count

Counting observations per group is always worth doing. In R, n() returns the number of rows in the current group.

For instance, the code below computes the number of years played by each player.

# count observations
data %>%
	group_by(playerID) %>%
	summarise(number_year = n()) %>%
	arrange(desc(number_year))

Output:

## # A tibble: 10,395 x 2
##     playerID number_year
##       <fctr>       <int>
##  1 pennohe01          11
##  2 joosted01          10
##  3 mcguide01          10
##  4  rosepe01          10
##  5 davisha01           9
##  6 johnssi01           9
##  7  kaatji01           9
##  8 keelewi01           9
##  9 marshmi01           9
## 10 quirkja01           9
## # ... with 10,385 more rows

First and last

You can select the first, last or nth position of a group.

For instance, you can find the first and last year of each player.

# first and last
data %>%
	group_by(playerID) %>%
	summarise(first_appearance = first(yearID),
		last_appearance = last(yearID))

Output:

## # A tibble: 10,395 x 3
##     playerID first_appearance last_appearance
##       <fctr>            <int>           <int>
##  1 aardsda01             2009            2010
##  2 aaronha01             1973            1975
##  3  aasedo01             1986            1990
##  4  abadfe01             2016            2016
##  5 abadijo01             1875            1875
##  6 abbated01             1905            1897
##  7 abbeybe01             1894            1894
##  8 abbeych01             1895            1897
##  9 abbotgl01             1973            1979
## 10 abbotji01             1992            1996
## # ... with 10,385 more rows

nth observation

The nth() function complements first() and last(). It returns the observation at the index you pass within each group.

For instance, you can filter only the second year that a team played.

# nth
data %>%
	group_by(teamID) %>%
	summarise(second_game = nth(yearID, 2)) %>%
	arrange(second_game)

Output:

## # A tibble: 148 x 2
##    teamID second_game
##    <fctr>       <int>
##  1    BS1        1871
##  2    CH1        1871
##  3    FW1        1871
##  4    NY2        1871
##  5    RC1        1871
##  6    BR1        1872
##  7    BR2        1872
##  8    CL1        1872
##  9    MID        1872
## 10    TRO        1872
## # ... with 138 more rows

Distinct number of observations

n() returns the number of rows in the current group. The related n_distinct() counts unique values instead.

In the next example, you add up the total of players a team recruited during the all periods.

# distinct values
data %>%
	group_by(teamID) %>%
	summarise(number_player = n_distinct(playerID)) %>%
	arrange(desc(number_player))

Code Explanation

group_by(teamID): Group by team
summarise(number_player = n_distinct(playerID)): Count the distinct players per team
arrange(desc(number_player)): Sort the data by the number of player

Output:

## # A tibble: 148 x 2
##    teamID number_player
##    <fctr>         <int>
##  1    CHN           751
##  2    SLN           729
##  3    PHI           699
##  4    PIT           683
##  5    CIN           679
##  6    BOS           647
##  7    CLE           646
##  8    CHA           636
##  9    DET           623
## 10    NYA           612
## # ... with 138 more rows

Multiple groups

A summary statistic can be realized among multiple groups.

# Multiple groups
data %>%
	group_by(yearID, teamID) %>%
	summarise(mean_games = mean(G)) %>%
	arrange(desc(teamID, yearID))

Code Explanation

group_by(yearID, teamID): Group by year and team
summarise(mean_games = mean(G)): Average the number of games played
arrange(desc(teamID, yearID)): Sort the data by team and year

Output:

## # A tibble: 2,829 x 3
## # Groups:   yearID [146]
##    yearID teamID mean_games
##     <int> <fctr>      <dbl>
##  1   1884    WSU   20.41667
##  2   1891    WS9   46.33333
##  3   1886    WS8   22.00000
##  4   1887    WS8   51.00000
##  5   1888    WS8   27.00000
##  6   1889    WS8   52.42857
##  7   1884    WS7    8.00000
##  8   1875    WS6   14.80000
##  9   1873    WS5   16.62500
## 10   1872    WS4    4.20000
## # ... with 2,819 more rows

Filter

Before you intend to do an operation, you can filter the dataset. The dataset starts in 1871, and the analysis does not need the years prior to 1980.

# Filter
data %>%
	filter(yearID > 1980) %>%
	group_by(yearID) %>%
	summarise(mean_game_year = mean(G))

Code Explanation

filter(yearID > 1980): Filter the data to show only the relevant years (i.e. after 1980)
group_by(yearID): Group by year
summarise(mean_game_year = mean(G)): Summarize the data

Output:

## # A tibble: 36 x 2
##    yearID mean_game_year
##     <int>          <dbl>
##  1   1981       40.64583
##  2   1982       56.97790
##  3   1983       60.25128
##  4   1984       62.97436
##  5   1985       57.82828
##  6   1986       58.55340
##  7   1987       48.74752
##  8   1988       52.57282
##  9   1989       58.16425
## 10   1990       52.91556
## # ... with 26 more rows

Ungroup

Finally, you must remove the grouping before changing the level at which a calculation runs. Without ungroup(), the second summarise() would still operate per player.

# Ungroup the data
data %>%
	filter(HR > 0) %>%
	group_by(playerID) %>%
	summarise(average_HR_game = sum(HR) / sum(G)) %>%
	ungroup() %>%
	summarise(total_average_homerun = mean(average_HR_game))

Code Explanation

filter(HR >0) : Exclude zero homerun
group_by(playerID): group by player
summarise(average_HR_game = sum(HR)/sum(G)): Compute average homerun by player
ungroup(): remove the grouping
summarise(total_average_homerun = mean(average_HR_game)): Summarize the data

Output:

## # A tibble: 1 x 1
##   total_average_homerun
##                   <dbl>
## 1            0.06882226

How to Summarise Multiple Columns with across()

Every example above names one column at a time. When the same statistic is needed for many columns, across() applies it to all of them in a single call.

# One function, several columns
data %>%
    group_by(lgID) %>%
    summarise(across(c(R, HR, G), mean, .names = "mean_{.col}"))

# Every numeric column at once
data %>%
    group_by(lgID) %>%
    summarise(across(where(is.numeric), mean, na.rm = TRUE))

# Several functions applied to several columns
data %>%
    group_by(teamID) %>%
    summarise(across(c(R, HR), list(avg = mean, total = sum)))

Three arguments control the behaviour:

.cols: the columns to act on, written with the same helpers as select(), including where(is.numeric).
.fns: one function, or a named list of functions to apply to each column.
.names: the naming pattern for the output, where {.col} is the column and {.fn} the function name.

across() replaced the older summarise_all(), summarise_at() and summarise_if() helpers in dplyr 1.0.0. Those still run but are superseded, so new code should use across().

Common summarise() Errors and How to Fix Them

Four problems account for most failed aggregations.

The result is NA. mean() and sum() return NA if any value is missing. Add na.rm = TRUE, as the mean_SH example above does.
The “summarise() has grouped output” message. Grouping by two variables leaves the result grouped by the first one. Pass .groups = “drop” to remove the grouping, or call ungroup() afterwards.
A later calculation gives the wrong level. A data frame stays grouped until you ungroup it, so a second summarise() still runs per group. This is exactly what the ungroup() example demonstrates.
The output has more rows than expected. summarise() returns one row per group, so an unintended grouping variable multiplies the output. Check the “Groups:” line printed above any tibble.

# Drop the grouping as part of the call
data %>%
    group_by(yearID, teamID) %>%
    summarise(mean_games = mean(G), .groups = "drop")

summarise() in R: Function Reference

To return a summary by group, split the data first:

# group by X1, X2, X3
group_by(df, X1, X2, X3)

and remove the grouping when the calculation changes level:

ungroup(df)

The table below lists every function used with summarise() in this tutorial:

Method	Function	Code
mean	mean	summarise(df, mean_x1 = mean(x1))
median	median	summarise(df, median_x1 = median(x1))
sum	sum	summarise(df, sum_x1 = sum(x1))
standard deviation	sd	summarise(df, sd_x1 = sd(x1))
interquartile	IQR	summarise(df, interquartile_x1 = IQR(x1))
minimum	min	summarise(df, minimum_x1 = min(x1))
maximum	max	summarise(df, maximum_x1 = max(x1))
quantile	quantile	summarise(df, quantile_x1 = quantile(x1, probs = 0.75))
first observation	first	summarise(df, first_x1 = first(x1))
last observation	last	summarise(df, last_x1 = last(x1))
nth observation	nth	summarise(df, nth_x1 = nth(x1, 2))
number of occurrence	n	summarise(df, n_x1 = n())
number of distinct occurrence	n_distinct	summarise(df, n_distinct_x1 = n_distinct(x1))

FAQs

summarise() collapses each group into a single row of statistics. mutate() keeps every row and adds a new column. Use mutate() when the group statistic must sit alongside the original observations.

Grouping by two or more variables leaves the result grouped by all but the last. dplyr warns you because later verbs would silently run per group. Pass .groups = “drop” to avoid it.

n() counts every row in the current group, including repeats. n_distinct() counts unique values in a column. Counting players per team needs n_distinct(), because one player appears in many seasons.

Group-level aggregates such as customer averages, counts, and recency values are among the strongest predictors in tabular AI models. group_by() with summarise() is the standard way to build them reproducibly.

Yes. AI assistants can rewrite aggregate() and tapply() as group_by() and summarise() pipelines. Compare row counts on both versions, because base R and dplyr handle missing groups differently.

R Aggregate Function: summarise() & group_by() Example

Dataset Used in This Tutorial

summarise()

group_by() vs No group_by()

Functions You Can Use Inside summarise()

Basic function

Subsetting

Sum

Standard deviation

Minimum and maximum

Count

First and last

nth observation

Distinct number of observations

Multiple groups

Filter

Ungroup

How to Summarise Multiple Columns with across()

Common summarise() Errors and How to Fix Them

summarise() in R: Function Reference

FAQs

Summarize this post with:

Sign up for the newsletter

Dataset Used in This Tutorial

summarise()

group_by() vs No group_by()

RELATED ARTICLES

Functions You Can Use Inside summarise()

Basic function

Subsetting

Sum

Standard deviation

Minimum and maximum

Count

First and last

nth observation

Distinct number of observations

Multiple groups

Filter

Ungroup

How to Summarise Multiple Columns with across()

Common summarise() Errors and How to Fix Them

summarise() in R: Function Reference

FAQs

Summarize this post with:

Sign up for the newsletter