How to use merge in R
Introduction
The merge() function in R combines two data frames based on common columns or row names. It’s essential when you need to join datasets that share key variables, similar to SQL joins in database operations.
Getting Started
library(tidyverse)
data(mtcars)Example 1: Basic Usage
The Problem
We need to combine two separate datasets about cars - one with performance metrics and another with manufacturer details. This demonstrates the fundamental merge operation.
Step 1: Create sample datasets
Let’s create two related datasets from the mtcars data.
# Performance data
performance <- mtcars |>
select(mpg, hp, qsec) |>
rownames_to_column("model")
head(performance, 3)This creates our first dataset with car models and their performance metrics.
Step 2: Create manufacturer data
Now we’ll create a second dataset with manufacturer information.
# Manufacturer data
manufacturers <- data.frame(
model = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive",
"Valiant", "Merc 240D"),
manufacturer = c("Mazda", "Nissan", "Dodge", "Plymouth", "Mercedes"),
country = c("Japan", "Japan", "USA", "USA", "Germany")
)We now have manufacturer details for some of the cars in our performance dataset.
Step 3: Perform basic merge
Let’s combine these datasets using the model column as our key.
# Inner join - only matching records
merged_data <- merge(performance, manufacturers,
by = "model")
print(merged_data)The merge created a new dataset containing only cars that appear in both datasets, combining all their information.
Example 2: Practical Application
The Problem
You’re analyzing car sales data where customer information and purchase details are stored separately. You need to combine them to analyze customer purchasing patterns while handling cases where not all records match.
Step 1: Create customer dataset
We’ll simulate customer data with some cars from mtcars.
# Customer purchase data
customers <- data.frame(
car_model = c("Mazda RX4", "Honda Civic", "Merc 240D",
"Fiat 128", "Toyota Corona"),
customer_id = c(101, 102, 103, 104, 105),
purchase_date = c("2023-01-15", "2023-02-20", "2023-01-30",
"2023-03-10", "2023-02-05")
)This represents our sales records with customer information.
Step 2: Prepare car specifications
Let’s extract relevant car specifications for merging.
# Car specifications
car_specs <- mtcars |>
select(mpg, cyl, hp) |>
rownames_to_column("car_model") |>
head(8)We now have technical specifications for several car models.
Step 3: Perform left join
Use merge to keep all customer records, even if car specs are missing.
# Left join - keep all customers
sales_analysis <- merge(customers, car_specs,
by = "car_model",
all.x = TRUE)
print(sales_analysis)This preserves all customer records while adding available car specifications, with NA values where specs don’t exist.
Step 4: Handle different join types
Demonstrate various merge types for different analytical needs.
# Right join - all car specs
right_merge <- merge(customers, car_specs,
by = "car_model", all.y = TRUE)
# Full outer join - all records
full_merge <- merge(customers, car_specs,
by = "car_model", all = TRUE)Different join types help answer different business questions about your data relationships.
Step 5: Merge with different column names
Handle cases where key columns have different names in each dataset.
# Rename for demonstration
customers_renamed <- customers |>
rename(model_name = car_model)
# Merge with different column names
final_merge <- merge(customers_renamed, car_specs,
by.x = "model_name",
by.y = "car_model")This shows how to merge when your key columns don’t share the same name across datasets.
Summary
merge()combines data frames based on common columns, similar to SQL joins- Use
byparameter to specify which columns to join on all.x = TRUEperforms left joins,all.y = TRUEperforms right joins
all = TRUEcreates full outer joins including all records from both datasetsUse
by.xandby.ywhen key columns have different names in each dataset