How to do inner join with dplyr in R
Introduction
Inner joins combine two datasets by keeping only rows that have matching values in both tables. This is essential when you need to merge related data from different sources while excluding unmatched records.
Getting Started
library(tidyverse)
library(palmerpenguins)Example 1: Basic Usage
The Problem
We need to combine penguin data with additional information about their species. Let’s create separate datasets and join them together.
Step 1: Create the main dataset
We’ll start with basic penguin measurements from the Palmer penguins dataset.
penguins_basic <- penguins |>
select(species, island, bill_length_mm, body_mass_g) |>
slice_head(n = 8) |>
na.omit()This creates a simplified dataset with key penguin characteristics and removes any missing values.
Step 2: Create the reference dataset
Now we’ll create a separate table with additional species information.
species_info <- tibble(
species = c("Adelie", "Chinstrap", "Gentoo"),
diet = c("Krill", "Fish", "Krill"),
avg_lifespan = c(15, 20, 13)
)This reference table contains supplementary data about each penguin species that we want to add to our main dataset.
Step 3: Perform the inner join
We’ll combine both datasets using the species column as our joining key.
joined_data <- penguins_basic |>
inner_join(species_info, by = "species")
print(joined_data)The inner join successfully merges the datasets, adding diet and lifespan information to each penguin record based on matching species.
Example 2: Practical Application
The Problem
Imagine you’re analyzing car performance data where engine specifications are stored separately from basic car information. You need to combine these datasets to analyze the relationship between engine specs and fuel efficiency.
Step 1: Prepare the main car dataset
We’ll extract basic car information and create a unique identifier for joining.
cars_basic <- mtcars |>
rownames_to_column("car_model") |>
select(car_model, mpg, hp, wt) |>
slice_head(n = 6)This creates our primary dataset with fuel efficiency and basic performance metrics for six car models.
Step 2: Create engine specifications dataset
Now we’ll simulate a separate database table containing detailed engine information.
engine_specs <- tibble(
car_model = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Hornet Sportabout"),
engine_type = c("Rotary", "Rotary", "Inline-4", "V6", "V8"),
displacement = c(2.6, 2.6, 2.4, 3.2, 5.2)
)Notice this table has fewer cars than our main dataset, which will demonstrate how inner join excludes unmatched records.
Step 3: Join the datasets
We’ll combine the datasets using car model names as the joining key.
complete_car_data <- cars_basic |>
inner_join(engine_specs, by = "car_model")
print(complete_car_data)The result contains only five cars because “Valiant” from our main dataset had no matching engine specification record.
Step 4: Analyze the joined data
Now we can perform analysis using data from both original datasets.
complete_car_data |>
group_by(engine_type) |>
summarise(avg_mpg = mean(mpg),
avg_hp = mean(hp))This analysis combines fuel efficiency from the first dataset with engine types from the second dataset to show performance patterns by engine type.
Summary
- Inner joins keep only rows with matching values in both datasets, excluding unmatched records
- Use
inner_join(dataset2, by = "column_name")to specify the joining column - The joining column must exist in both datasets with the same name (or use different syntax for different names)
- Inner joins are perfect when you only need complete records with data from both sources
Always verify your results since inner joins can reduce your total row count by excluding unmatched data