How to do inner join with dplyr in R

dplyr inner_join()
Learn how to perform do inner join with dplyr in R. Step-by-step statistical tutorial with examples.
Published

October 7, 2023

Introduction

Inner joins combine two datasets by keeping only rows that have matching values in both tables. This is essential when you need to merge related data from different sources while excluding unmatched records.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We need to combine penguin data with additional information about their species. Let’s create separate datasets and join them together.

Step 1: Create the main dataset

We’ll start with basic penguin measurements from the Palmer penguins dataset.

penguins_basic <- penguins |>
  select(species, island, bill_length_mm, body_mass_g) |>
  slice_head(n = 8) |>
  na.omit()

This creates a simplified dataset with key penguin characteristics and removes any missing values.

Step 2: Create the reference dataset

Now we’ll create a separate table with additional species information.

species_info <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  diet = c("Krill", "Fish", "Krill"),
  avg_lifespan = c(15, 20, 13)
)

This reference table contains supplementary data about each penguin species that we want to add to our main dataset.

Step 3: Perform the inner join

We’ll combine both datasets using the species column as our joining key.

joined_data <- penguins_basic |>
  inner_join(species_info, by = "species")

print(joined_data)

The inner join successfully merges the datasets, adding diet and lifespan information to each penguin record based on matching species.

Example 2: Practical Application

The Problem

Imagine you’re analyzing car performance data where engine specifications are stored separately from basic car information. You need to combine these datasets to analyze the relationship between engine specs and fuel efficiency.

Step 1: Prepare the main car dataset

We’ll extract basic car information and create a unique identifier for joining.

cars_basic <- mtcars |>
  rownames_to_column("car_model") |>
  select(car_model, mpg, hp, wt) |>
  slice_head(n = 6)

This creates our primary dataset with fuel efficiency and basic performance metrics for six car models.

Step 2: Create engine specifications dataset

Now we’ll simulate a separate database table containing detailed engine information.

engine_specs <- tibble(
  car_model = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710", 
                "Hornet 4 Drive", "Hornet Sportabout"),
  engine_type = c("Rotary", "Rotary", "Inline-4", "V6", "V8"),
  displacement = c(2.6, 2.6, 2.4, 3.2, 5.2)
)

Notice this table has fewer cars than our main dataset, which will demonstrate how inner join excludes unmatched records.

Step 3: Join the datasets

We’ll combine the datasets using car model names as the joining key.

complete_car_data <- cars_basic |>
  inner_join(engine_specs, by = "car_model")

print(complete_car_data)

The result contains only five cars because “Valiant” from our main dataset had no matching engine specification record.

Step 4: Analyze the joined data

Now we can perform analysis using data from both original datasets.

complete_car_data |>
  group_by(engine_type) |>
  summarise(avg_mpg = mean(mpg),
            avg_hp = mean(hp))

This analysis combines fuel efficiency from the first dataset with engine types from the second dataset to show performance patterns by engine type.

Summary

  • Inner joins keep only rows with matching values in both datasets, excluding unmatched records
  • Use inner_join(dataset2, by = "column_name") to specify the joining column
  • The joining column must exist in both datasets with the same name (or use different syntax for different names)
  • Inner joins are perfect when you only need complete records with data from both sources
  • Always verify your results since inner joins can reduce your total row count by excluding unmatched data