How to multiple linear regression in R

statistics
multiple linear regression
Learn how to perform multiple linear regression in R. Step-by-step statistical tutorial with examples.
Published

February 21, 2026

Introduction

Multiple linear regression is a statistical technique used to model the relationship between one continuous dependent variable and two or more independent variables. Unlike simple linear regression which uses only one predictor, multiple regression allows you to examine how several factors simultaneously influence your outcome variable.

This technique is particularly useful when you want to understand complex relationships in your data, make predictions based on multiple factors, or control for confounding variables. Common applications include predicting house prices based on size, location, and age, or analyzing how various factors affect student performance. Multiple regression helps quantify each variable’s unique contribution while accounting for the effects of other predictors in the model.

Getting Started

First, let’s load the necessary packages for our analysis:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple multiple regression using the mtcars dataset to predict fuel efficiency (mpg) based on weight (wt) and horsepower (hp):

data(mtcars)

model1 <- lm(mpg ~ wt + hp, data = mtcars)

summary(model1)

We can also examine the model’s performance and check assumptions:

par(mfrow = c(2, 2))
plot(model1)

anova(model1)

To make predictions with our model:

new_cars <- data.frame(wt = c(3.0, 3.5), hp = c(150, 200))
predict(model1, new_cars)

Example 2: Practical Application

Now let’s work with a more comprehensive example using the Palmer Penguins dataset. We’ll predict penguin body mass based on multiple morphological measurements:

penguins_clean <- penguins |>
  filter(!is.na(bill_length_mm), 
         !is.na(bill_depth_mm),
         !is.na(flipper_length_mm),
         !is.na(body_mass_g)) |>
  select(body_mass_g, bill_length_mm, bill_depth_mm, 
         flipper_length_mm, species, sex)

model2 <- penguins_clean |>
  lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, 
     data = _)

summary(model2)

Let’s examine which variables contribute most to the model:

model_stats <- penguins_clean |>
  summarise(
    model = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex)),
    .groups = "drop"
  ) |>
  mutate(
    r_squared = map_dbl(model, ~ summary(.x)$r.squared),
    adj_r_squared = map_dbl(model, ~ summary(.x)$adj.r.squared)
  )

model_stats |>
  select(r_squared, adj_r_squared)

We can also create a more streamlined model by comparing different variable combinations:

models_comparison <- penguins_clean |>
  summarise(
    model_simple = list(lm(body_mass_g ~ flipper_length_mm, data = cur_data())),
    model_full = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = cur_data())),
    .groups = "drop"
  ) |>
  pivot_longer(cols = starts_with("model_"), 
               names_to = "model_type", 
               values_to = "model") |>
  mutate(
    aic = map_dbl(model, AIC),
    r_squared = map_dbl(model, ~ summary(.x)$r.squared)
  )

models_comparison |>
  select(model_type, aic, r_squared)

To visualize residuals and check model assumptions:

penguins_clean |>
  mutate(
    predicted = predict(model2),
    residuals = residuals(model2)
  ) |>
  ggplot(aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 0.8) +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Residuals versus fitted values diagnostic plot in R for a multiple linear regression model predicting penguin body mass, with a dashed zero reference line drawn in ggplot2

Summary

Multiple linear regression in R is straightforward using the lm() function with the formula syntax y ~ x1 + x2 + x3. Key takeaways include: always check your data for missing values before modeling, use summary() to examine coefficient significance and model fit (R-squared), and validate assumptions through residual plots. The tidyverse approach with pipes makes data preparation and model comparison more intuitive, while functions like predict(), AIC(), and anova() help evaluate model performance. Remember that multiple regression assumes linear relationships, independence of observations, and homoscedasticity of residuals.