How to use merge in R

base-r
merge
Master merge in R programming with clear examples. Complete tutorial covering syntax, use cases, and best practices.
Published

February 22, 2026

Introduction

The merge() function in R combines two data frames based on common columns or row names. It’s essential when you need to join datasets that share key variables, similar to SQL joins in database operations.

Getting Started

library(tidyverse)
data(mtcars)

Example 1: Basic Usage

The Problem

We need to combine two separate datasets about cars - one with performance metrics and another with manufacturer details. This demonstrates the fundamental merge operation.

Step 1: Create sample datasets

Let’s create two related datasets from the mtcars data.

# Performance data
performance <- mtcars |>
  select(mpg, hp, qsec) |>
  rownames_to_column("model")

head(performance, 3)

This creates our first dataset with car models and their performance metrics.

Step 2: Create manufacturer data

Now we’ll create a second dataset with manufacturer information.

# Manufacturer data
manufacturers <- data.frame(
  model = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive", 
            "Valiant", "Merc 240D"),
  manufacturer = c("Mazda", "Nissan", "Dodge", "Plymouth", "Mercedes"),
  country = c("Japan", "Japan", "USA", "USA", "Germany")
)

We now have manufacturer details for some of the cars in our performance dataset.

Step 3: Perform basic merge

Let’s combine these datasets using the model column as our key.

# Inner join - only matching records
merged_data <- merge(performance, manufacturers, 
                     by = "model")

print(merged_data)

The merge created a new dataset containing only cars that appear in both datasets, combining all their information.

Example 2: Practical Application

The Problem

You’re analyzing car sales data where customer information and purchase details are stored separately. You need to combine them to analyze customer purchasing patterns while handling cases where not all records match.

Step 1: Create customer dataset

We’ll simulate customer data with some cars from mtcars.

# Customer purchase data
customers <- data.frame(
  car_model = c("Mazda RX4", "Honda Civic", "Merc 240D", 
                "Fiat 128", "Toyota Corona"),
  customer_id = c(101, 102, 103, 104, 105),
  purchase_date = c("2023-01-15", "2023-02-20", "2023-01-30", 
                   "2023-03-10", "2023-02-05")
)

This represents our sales records with customer information.

Step 2: Prepare car specifications

Let’s extract relevant car specifications for merging.

# Car specifications
car_specs <- mtcars |>
  select(mpg, cyl, hp) |>
  rownames_to_column("car_model") |>
  head(8)

We now have technical specifications for several car models.

Step 3: Perform left join

Use merge to keep all customer records, even if car specs are missing.

# Left join - keep all customers
sales_analysis <- merge(customers, car_specs, 
                       by = "car_model", 
                       all.x = TRUE)

print(sales_analysis)

This preserves all customer records while adding available car specifications, with NA values where specs don’t exist.

Step 4: Handle different join types

Demonstrate various merge types for different analytical needs.

# Right join - all car specs
right_merge <- merge(customers, car_specs, 
                    by = "car_model", all.y = TRUE)

# Full outer join - all records
full_merge <- merge(customers, car_specs, 
                   by = "car_model", all = TRUE)

Different join types help answer different business questions about your data relationships.

Step 5: Merge with different column names

Handle cases where key columns have different names in each dataset.

# Rename for demonstration
customers_renamed <- customers |>
  rename(model_name = car_model)

# Merge with different column names
final_merge <- merge(customers_renamed, car_specs,
                    by.x = "model_name", 
                    by.y = "car_model")

This shows how to merge when your key columns don’t share the same name across datasets.

Summary

  • merge() combines data frames based on common columns, similar to SQL joins
  • Use by parameter to specify which columns to join on
  • all.x = TRUE performs left joins, all.y = TRUE performs right joins
  • all = TRUE creates full outer joins including all records from both datasets
  • Use by.x and by.y when key columns have different names in each dataset