Merge Data Frames in R: Full and Partial Match

⚡ Smart Summary

Merging data frames in R joins two tables on one or more shared key columns. The merge() function covers inner, left, right, and full joins, and this walkthrough demonstrates both a full match and a partial match.

🔑 Key Columns: merge() joins on by, or on by.x and by.y when the key carries a different name in each frame.
🔗 Full Match: The default inner join keeps only the rows whose key appears in both data frames.
🧩 Partial Match: Setting all.x = TRUE keeps every row of the first frame and fills the gaps with NA.
📐 Dimension Check: Compare dim() before and after a merge to catch rows lost or duplicated by the join.
⚙️ Join Control: The all, all.x and all.y flags select inner, left, right, or full outer behaviour.
🛠️ Modern Alternative: dplyr names each join in the verb itself and preserves the original row order.

Data often comes from multiple sources. Before we can analyze it, we need to merge the data frames on one or more common key variables.

Types of Merge (Join) in R

Before working through the examples, it helps to know that merge() performs every join type an SQL user expects. The behaviour is controlled by the all, all.x and all.y arguments rather than by a different function name.

Join type	Rows kept	merge() call	dplyr equivalent
Inner join (default)	Only keys present in both frames	merge(x, y, by = "k")	inner_join(x, y, by = "k")
Left outer join	All rows of x, NA where y has no match	merge(x, y, by = "k", all.x = TRUE)	left_join(x, y, by = "k")
Right outer join	All rows of y, NA where x has no match	merge(x, y, by = "k", all.y = TRUE)	right_join(x, y, by = "k")
Full outer join	Every row from both frames	merge(x, y, by = "k", all = TRUE)	full_join(x, y, by = "k")
Cross join	Every combination of rows	merge(x, y, by = NULL)	cross_join(x, y)
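As a quick illustration of the table above, here is a minimal sketch that runs each merge() call on two tiny frames. The frames x and y and their columns are made up for the example; they are not part of the tutorial's dataset.

```r
# Two small illustrative frames (hypothetical data) sharing the key column "k"
x <- data.frame(k = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(k = c("b", "c", "d"), y_val = c(10, 20, 30))

inner <- merge(x, y, by = "k")                # keys in both frames: b, c
left  <- merge(x, y, by = "k", all.x = TRUE)  # every row of x: a, b, c
right <- merge(x, y, by = "k", all.y = TRUE)  # every row of y: b, c, d
full  <- merge(x, y, by = "k", all = TRUE)    # union of the keys: a, b, c, d
cross <- merge(x, y, by = NULL)               # 3 x 3 = 9 row combinations

# Row count per join type
sapply(list(inner = inner, left = left, right = right,
            full = full, cross = cross), nrow)
```

The row counts come out as 2, 3, 3, 4 and 9, which matches the "Rows kept" column for each join type.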

Two details are worth remembering before the first call:

Leaving out by makes R merge on every column name the two frames share, which is rarely what you want.
merge() sorts the result by the key column; pass sort = FALSE to skip the sort, but note that the documentation then leaves the row order unspecified.
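Both details can be seen on a small example. The sketch below uses made-up frames whose shared column names are id and grp, and a helper column named row_pos that is purely an assumption of this example. It restores the incoming order with an explicit sort rather than relying on sort = FALSE.

```r
# Hypothetical frames that share two column names: id and grp
x <- data.frame(id = c(3, 1, 2), grp = c("u", "v", "w"), x_val = c("c", "a", "b"))
y <- data.frame(id = c(1, 2, 3), grp = c("v", "z", "u"), y_val = c(TRUE, FALSE, TRUE))

# No `by`: R joins on id AND grp, so id 2 drops out (grp "w" vs "z")
natural <- merge(x, y)
nrow(natural)

# Explicit key: only id is matched, and the grp columns are kept apart
by_id <- merge(x, y, by = "id")
names(by_id)
by_id$id   # sorted on the key by default

# To get the order of x back, carry a position column and sort on it
x$row_pos <- seq_len(nrow(x))
restored <- merge(x, y, by = "id")
restored <- restored[order(restored$row_pos), ]
restored$id
```

Carrying an explicit position column costs one extra line, but it makes the final order independent of how merge() arranges its rows.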

Full match

A full match returns only the values that have a counterpart in the destination table. Rows without a match are dropped from the new data frame. A partial match, by contrast, keeps those rows and fills the missing columns with NA.

We start with a simple inner join, which keeps only the records whose key values match in both tables. To join two data frames, we use the merge() function with four arguments:

merge(x, y, by.x = "key_in_x", by.y = "key_in_y")

Arguments:
-x: The origin data frame
-y: The data frame to merge
-by.x: The name of the key column in the x data frame
-by.y: The name of the key column in the y data frame

Example:

Create the first dataset, producers, with the variables

surname
nationality

Create the second dataset, movies, with the variables

surname
title

The common key variable is surname. We can merge both data frames and check that the dimensionality is 7×3.

We add stringsAsFactors=FALSE in the data frame because we do not want R to convert the strings into a factor; the variable should stay a character vector. Since R 4.0.0 this is already the default for data.frame(), so the argument is optional in new code and harmless in old code.

# Create origin dataframe
producers <- data.frame(
    surname = c("Spielberg", "Scorsese", "Hitchcock", "Tarantino", "Polanski"),
    nationality = c("US", "US", "UK", "US", "Poland"),
    stringsAsFactors = FALSE)

# Create destination dataframe
movies <- data.frame(
    surname = c("Spielberg",
                "Scorsese",
                "Hitchcock",
                "Hitchcock",
                "Spielberg",
                "Tarantino",
                "Polanski"),
    title = c("Super 8",
              "Taxi Driver",
              "Psycho",
              "North by Northwest",
              "Catch Me If You Can",
              "Reservoir Dogs",
              "Chinatown"),
    stringsAsFactors = FALSE)

# Merge two datasets on the shared key column
m1 <- merge(producers, movies, by.x = "surname", by.y = "surname")
m1
# Check the dimensions of the result
dim(m1)

Output:

##     surname nationality               title
## 1 Hitchcock          UK              Psycho
## 2 Hitchcock          UK  North by Northwest
## 3  Polanski      Poland           Chinatown
## 4  Scorsese          US         Taxi Driver
## 5 Spielberg          US             Super 8
## 6 Spielberg          US Catch Me If You Can
## 7 Tarantino          US      Reservoir Dogs

## [1] 7 3

Let’s merge data frames when the common key variables have different names.

We change surname to name in the movies data frame. We use the function identical(x1, x2) to check if both dataframes are identical.

# Rename the key column of the `movies` dataframe
colnames(movies)[colnames(movies) == 'surname'] <- 'name'
# Merge on key columns that carry different names
m2 <- merge(producers, movies, by.x = "surname", by.y = "name")
# Print the first six rows
head(m2)

Output:

##     surname nationality               title
## 1 Hitchcock          UK              Psycho
## 2 Hitchcock          UK  North by Northwest
## 3  Polanski      Poland           Chinatown
## 4  Scorsese          US         Taxi Driver
## 5 Spielberg          US             Super 8
## 6 Spielberg          US Catch Me If You Can

# Check if data are identical
identical(m1, m2)

Output:

## [1] TRUE

This shows that merge() returns the same result when the key columns carry different names, as long as by.x and by.y identify them.

Partial match

The inner join above quietly dropped anything without a counterpart. In practice, two data frames rarely contain exactly the same key values. A full match returns only the rows whose key is found in both the x and y data frames. A partial match also keeps the rows of x that have no matching row in y; for those rows, the columns that are normally filled from y contain NA. We can do that by setting all.x = TRUE.

For instance, we can add a new producer, Lucas, to the producers data frame without adding any of his movies to the movies data frame. With all.x = FALSE, the default, R keeps only the keys that appear in both data frames. In our case, Lucas is left out of the result because he is missing from movies.

Let’s see the dimension of each output when we specify all.x= TRUE and when we don’t.

# Create a new producer
add_producer <- c('Lucas', 'US')
# Append it to the `producers` dataframe
producers <- rbind(producers, add_producer)
# Use a partial merge
m3 <- merge(producers, movies, by.x = "surname", by.y = "name", all.x = TRUE)
m3

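Output:

The console output shows the extra Lucas row, with NA standing in for the missing film title.

##     surname nationality               title
## 1 Hitchcock          UK              Psycho
## 2 Hitchcock          UK  North by Northwest
## 3     Lucas          US                <NA>
## 4  Polanski      Poland           Chinatown
## 5  Scorsese          US         Taxi Driver
## 6 Spielberg          US             Super 8
## 7 Spielberg          US Catch Me If You Can
## 8 Tarantino          US      Reservoir Dogs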

# Compare the dimension of each data frame
dim(m1)

Output:

## [1] 7 3

dim(m2)

Output:

## [1] 7 3

dim(m3)

Output:

## [1] 8 3

As we can see, the new data frame is 8×3, compared with 7×3 for m1 and m2. R fills the title column with NA for Lucas, the producer that has no matching row in the movies data frame.
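The NA values also make it easy to list the rows that found no partner. The sketch below is self-contained and uses a cut-down, illustrative version of the two frames; the Kubrick row is added only so that the full outer join has an unmatched row on each side.

```r
# Cut-down illustrative frames; each side has one key the other lacks
producers <- data.frame(surname = c("Spielberg", "Lucas"),
                        nationality = c("US", "US"))
movies <- data.frame(name = c("Spielberg", "Kubrick"),
                     title = c("Super 8", "The Shining"))

# Partial match: keep every producer
left <- merge(producers, movies, by.x = "surname", by.y = "name", all.x = TRUE)

# Producers without a film are the rows where title is NA
unmatched <- left[is.na(left$title), "surname"]
unmatched

# all = TRUE keeps the unmatched rows of both frames
full <- merge(producers, movies, by.x = "surname", by.y = "name", all = TRUE)
dim(full)
```

Here unmatched holds "Lucas", and the full outer join has three rows because Kubrick is kept as well, with NA in the nationality column.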

merge() vs dplyr Joins in R

Base R solves every join with one function and a set of flags. The dplyr package instead gives each join its own verb, which many teams find easier to read in a pipeline.

Criteria	Base merge()	dplyr joins
Choosing the join	all.x / all.y / all flags	Named in the verb: left_join(), full_join()
Row order	Sorted by the key unless sort = FALSE	Order of the left table is preserved
Speed on large frames	Slower	Generally faster
Key left unspecified	Silently joins on every shared column name	Joins on every shared column name but prints a message listing them
Dependencies	Base R, nothing to install	Requires the dplyr package

The two merges shown above translate directly into dplyr verbs:

library(dplyr)
# Same inner join as merge(producers, movies, by.x = "surname", by.y = "name")
m4 <- inner_join(producers, movies, by = c("surname" = "name"))
# Same partial merge as all.x = TRUE
m5 <- left_join(producers, movies, by = c("surname" = "name"))

Base merge() is the safer choice inside a package that should carry no dependencies. dplyr is the better choice in an interactive analysis, where readability and speed matter more.

Common Merge Errors in R and How to Fix Them

Most merge problems announce themselves as an unexpected row count rather than an error message, so checking dim() after every join is a habit worth building.

The result has more rows than either input. A key that repeats in both frames produces a many-to-many join, and R returns every combination. Inspect the keys with table(duplicated(x$k)) before merging.
The result is empty. The key columns usually differ in type, or one carries trailing spaces. Run str() on both frames and clean the key with trimws() before the merge.
Columns come back as title.x and title.y. Both frames carry a non-key column with the same name. Pass suffixes = c("_producer", "_film") to make the origin obvious.
The rows are no longer in their original order. merge() sorts on the key by default. sort = FALSE skips the sort but leaves the order unspecified, so use an explicit sort afterwards when the order matters.
A key comparison fails on factors. Compare characters, not factor levels: wrap the key in as.character() before merging frames created under older R defaults.
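The first three symptoms in the list above can be reproduced on toy data. This is a minimal sketch with made-up frames; every name here (x, y, left, right, p, f) is an assumption of the example.

```r
# 1. Many-to-many: key "a" repeats in both frames
x <- data.frame(k = c("a", "a", "b"), x_val = 1:3)
y <- data.frame(k = c("a", "a", "b"), y_val = 4:6)
many <- merge(x, y, by = "k")
nrow(many)                  # 2 x 2 rows for "a" plus 1 row for "b"
table(duplicated(x$k))      # shows how many keys are repeats
has_dupes <- anyDuplicated(x$k) > 0

# 2. Trailing space: "a " does not match "a"
left  <- data.frame(k = c("a ", "b"), v = 1:2)
right <- data.frame(k = c("a", "b"), w = 3:4)
before <- nrow(merge(left, right, by = "k"))
left$k <- trimws(left$k)    # strip leading and trailing whitespace
after <- nrow(merge(left, right, by = "k"))

# 3. Name clash on a non-key column: choose readable suffixes
p <- data.frame(k = "a", title = "Director")
f <- data.frame(k = "a", title = "Psycho")
clash <- merge(p, f, by = "k", suffixes = c("_producer", "_film"))
names(clash)
```

The many-to-many merge returns five rows from two three-row inputs, and the trimws() call raises the match count from one row to two.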

When the same merge has to run repeatedly, wrap it in a user-defined function so the arguments cannot drift between runs.
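One possible shape for such a wrapper is sketched below. The function name merge_checked and its arguments are hypothetical, not part of base R; the wrapper only fixes the arguments in one place and reports the row counts on every run.

```r
# Hypothetical helper: one place to hold the merge arguments and the row check
merge_checked <- function(x, y, key_x, key_y = key_x, keep_all_x = FALSE) {
  # Fail early if a key column is missing from either frame
  stopifnot(is.data.frame(x), is.data.frame(y),
            all(key_x %in% names(x)), all(key_y %in% names(y)))
  out <- merge(x, y, by.x = key_x, by.y = key_y, all.x = keep_all_x)
  # Report the row counts so lost or duplicated rows are visible
  message(sprintf("rows: x = %d, y = %d, merged = %d",
                  nrow(x), nrow(y), nrow(out)))
  out
}

# Illustrative data
producers <- data.frame(surname = c("Hitchcock", "Lucas"),
                        nationality = c("UK", "US"))
movies <- data.frame(name = c("Hitchcock", "Hitchcock"),
                     title = c("Psycho", "North by Northwest"))

inner <- merge_checked(producers, movies, key_x = "surname", key_y = "name")
left  <- merge_checked(producers, movies, key_x = "surname", key_y = "name",
                       keep_all_x = TRUE)
```

The inner call reports two merged rows and the left call reports three, so a change in either input shows up in the message without a separate dim() call.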

FAQs

What happens if I leave out the by argument in merge()?
R merges on every column name the two frames share. With one shared column that is convenient; with several it silently narrows the result, and with none it returns a cross join. Naming the key explicitly avoids all three surprises.

Can I merge more than two data frames at once?
Not in a single merge() call. Chain the calls, or fold a list with Reduce(function(x, y) merge(x, y, by = "k"), list(df1, df2, df3)). Verify the row count after each step, because duplicate keys compound quickly.
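The Reduce() fold from the answer above looks like this on three small frames. The frames df1, df2 and df3 and the key column k are placeholders made up for the sketch.

```r
# Three illustrative frames sharing the key column "k"
df1 <- data.frame(k = c("a", "b", "c"), v1 = 1:3)
df2 <- data.frame(k = c("a", "b"), v2 = c(10, 20))
df3 <- data.frame(k = c("b", "c"), v3 = c(TRUE, FALSE))

# Inner join across all three: only "b" appears in every frame
combined <- Reduce(function(x, y) merge(x, y, by = "k"),
                   list(df1, df2, df3))
combined

# The same fold as a full outer join keeps every key
combined_all <- Reduce(function(x, y) merge(x, y, by = "k", all = TRUE),
                       list(df1, df2, df3))
nrow(combined_all)
```

The inner version keeps one row and the outer version keeps three, which shows how quickly an inner join narrows as more frames are added.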

Why do my merged columns end in .x and .y?
Both frames carried a non-key column with the same name, so R disambiguates them. Pass suffixes = c("_producer", "_film") to give the copies meaningful names, or drop the redundant column before merging.

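Does merge() keep the original row order?
No. merge() sorts the result on the key column by default. Passing sort = FALSE skips the sort, but the documentation then leaves the row order unspecified. To guarantee the order, sort explicitly afterwards or use a dplyr join, which preserves the order of the left-hand frame.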

How do I merge on more than one key column?
Supply a vector: merge(x, y, by = c("surname", "year")). When the names differ, pass matching vectors to by.x and by.y. Both vectors must list the columns in the same order.
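A minimal sketch of a two-column key with differing names follows. The frames and the score column are invented for illustration; the point is only that by.x and by.y pair up position by position.

```r
# Illustrative frames: the key is the pair (surname, year) / (director, release)
films <- data.frame(surname = c("Hitchcock", "Hitchcock", "Spielberg"),
                    year = c(1959, 1960, 2002),
                    title = c("North by Northwest", "Psycho", "Catch Me If You Can"))
scores <- data.frame(director = c("Hitchcock", "Spielberg", "Hitchcock"),
                     release = c(1960, 2002, 1958),
                     score = c(1, 2, 3))   # hypothetical values

# surname pairs with director, year pairs with release
both_keys <- merge(films, scores,
                   by.x = c("surname", "year"),
                   by.y = c("director", "release"))
both_keys
```

Only the rows where both parts of the key agree survive, so the 1959 and 1958 rows drop out even though the surname matches.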

How do I merge on row names?
Use by = "row.names", or the shorthand by = 0. R adds the row names back as a column called Row.names, which you usually want to rename or drop before continuing the analysis.
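The row-name merge from the answer above can be sketched as follows. The frames a and b are made up, and turning Row.names back into real row names is just one way to tidy up afterwards.

```r
# Illustrative frames keyed only by their row names
a <- data.frame(v1 = 1:3, row.names = c("r1", "r2", "r3"))
b <- data.frame(v2 = c(10, 20), row.names = c("r2", "r3"))

# Merge on row names; by = 0 is the shorthand for the same thing
m  <- merge(a, b, by = "row.names")
m0 <- merge(a, b, by = 0)
names(m)

# Move the Row.names column back into the row names and drop it
rownames(m) <- as.character(m$Row.names)
m$Row.names <- NULL
m
```

After the cleanup, m has the columns v1 and v2 again, with r2 and r3 as its row names.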

Can an AI assistant help diagnose a merge that returns too many rows?
Describe the two frames and the row counts, and an AI assistant will typically point at duplicated keys as the likely cause and suggest a duplicated() check. Confirm the diagnosis against your own data before changing the join.

Can GitHub Copilot write merge() code in RStudio?
Yes. RStudio Desktop 2023.09.0 and later ships an opt-in GitHub Copilot integration that drafts a merge from a comment. Check the join type it picks, since a wrong flag changes the row count silently.
