How to multiple linear regression in R

statistics

multiple linear regression

Learn how to perform multiple linear regression in R. Step-by-step statistical tutorial with examples.

Published

February 21, 2026

Introduction

Multiple linear regression is a statistical technique used to model the relationship between one continuous dependent variable and two or more independent variables. Unlike simple linear regression which uses only one predictor, multiple regression allows you to examine how several factors simultaneously influence your outcome variable.

This technique is particularly useful when you want to understand complex relationships in your data, make predictions based on multiple factors, or control for confounding variables. Common applications include predicting house prices based on size, location, and age, or analyzing how various factors affect student performance. Multiple regression helps quantify each variable’s unique contribution while accounting for the effects of other predictors in the model.

Getting Started

First, let’s load the necessary packages for our analysis:

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

Let’s start with a simple multiple regression using the mtcars dataset to predict fuel efficiency (mpg) based on weight (wt) and horsepower (hp):

data(mtcars)

model1 <- lm(mpg ~ wt + hp, data = mtcars)

summary(model1)

We can also examine the model’s performance and check assumptions:

par(mfrow = c(2, 2))
plot(model1)

anova(model1)

To make predictions with our model:

new_cars <- data.frame(wt = c(3.0, 3.5), hp = c(150, 200))
predict(model1, new_cars)

Example 2: Practical Application

Now let’s work with a more comprehensive example using the Palmer Penguins dataset. We’ll predict penguin body mass based on multiple morphological measurements:

penguins_clean <- penguins |>
  filter(!is.na(bill_length_mm), 
         !is.na(bill_depth_mm),
         !is.na(flipper_length_mm),
         !is.na(body_mass_g)) |>
  select(body_mass_g, bill_length_mm, bill_depth_mm, 
         flipper_length_mm, species, sex)

model2 <- penguins_clean |>
  lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, 
     data = _)

summary(model2)

Let’s examine which variables contribute most to the model:

model_stats <- penguins_clean |>
  summarise(
    model = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex)),
    .groups = "drop"
  ) |>
  mutate(
    r_squared = map_dbl(model, ~ summary(.x)$r.squared),
    adj_r_squared = map_dbl(model, ~ summary(.x)$adj.r.squared)
  )

model_stats |>
  select(r_squared, adj_r_squared)

We can also create a more streamlined model by comparing different variable combinations:

models_comparison <- penguins_clean |>
  summarise(
    model_simple = list(lm(body_mass_g ~ flipper_length_mm, data = cur_data())),
    model_full = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = cur_data())),
    .groups = "drop"
  ) |>
  pivot_longer(cols = starts_with("model_"), 
               names_to = "model_type", 
               values_to = "model") |>
  mutate(
    aic = map_dbl(model, AIC),
    r_squared = map_dbl(model, ~ summary(.x)$r.squared)
  )

models_comparison |>
  select(model_type, aic, r_squared)

To visualize residuals and check model assumptions:

penguins_clean |>
  mutate(
    predicted = predict(model2),
    residuals = residuals(model2)
  ) |>
  ggplot(aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 0.8) +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

Residuals versus fitted values diagnostic plot in R for a multiple linear regression model predicting penguin body mass, with a dashed zero reference line drawn in ggplot2

Summary

Multiple linear regression in R is straightforward using the `lm()` function with the formula syntax `y ~ x1 + x2 + x3`. Key takeaways include: always check your data for missing values before modeling, use `summary()` to examine coefficient significance and model fit (R-squared), and validate assumptions through residual plots. The `tidyverse` approach with pipes makes data preparation and model comparison more intuitive, while functions like `predict()`, `AIC()`, and `anova()` help evaluate model performance. Remember that multiple regression assumes linear relationships, independence of observations, and homoscedasticity of residuals.

--- title: "How to multiple linear regression in R" description: "Learn how to perform multiple linear regression in R. Step-by-step statistical tutorial with examples." date: 2026-02-21 categories: ['statistics', 'multiple linear regression'] format: html: code-fold: false code-tools: true --- ## Introduction Multiple linear regression is a statistical technique used to model the relationship between one continuous dependent variable and two or more independent variables. Unlike simple linear regression which uses only one predictor, multiple regression allows you to examine how several factors simultaneously influence your outcome variable. This technique is particularly useful when you want to understand complex relationships in your data, make predictions based on multiple factors, or control for confounding variables. Common applications include predicting house prices based on size, location, and age, or analyzing how various factors affect student performance. Multiple regression helps quantify each variable's unique contribution while accounting for the effects of other predictors in the model. ## Getting Started First, let's load the necessary packages for our analysis: ```r library(tidyverse) library(palmerpenguins) ``` ## Example 1: Basic Usage Let's start with a simple multiple regression using the `mtcars` dataset to predict fuel efficiency (`mpg`) based on weight (`wt`) and horsepower (`hp`): ```r data(mtcars) model1 <- lm(mpg ~ wt + hp, data = mtcars) summary(model1) ``` We can also examine the model's performance and check assumptions: ```r par(mfrow = c(2, 2)) plot(model1) anova(model1) ``` To make predictions with our model: ```r new_cars <- data.frame(wt = c(3.0, 3.5), hp = c(150, 200)) predict(model1, new_cars) ``` ## Example 2: Practical Application Now let's work with a more comprehensive example using the Palmer Penguins dataset. We'll predict penguin body mass based on multiple morphological measurements: ```r penguins_clean <- penguins |> filter(!is.na(bill_length_mm), !is.na(bill_depth_mm), !is.na(flipper_length_mm), !is.na(body_mass_g)) |> select(body_mass_g, bill_length_mm, bill_depth_mm, flipper_length_mm, species, sex) model2 <- penguins_clean |> lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = _) summary(model2) ``` Let's examine which variables contribute most to the model: ```r model_stats <- penguins_clean |> summarise( model = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex)), .groups = "drop" ) |> mutate( r_squared = map_dbl(model, ~ summary(.x)$r.squared), adj_r_squared = map_dbl(model, ~ summary(.x)$adj.r.squared) ) model_stats |> select(r_squared, adj_r_squared) ``` We can also create a more streamlined model by comparing different variable combinations: ```r models_comparison <- penguins_clean |> summarise( model_simple = list(lm(body_mass_g ~ flipper_length_mm, data = cur_data())), model_full = list(lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species + sex, data = cur_data())), .groups = "drop" ) |> pivot_longer(cols = starts_with("model_"), names_to = "model_type", values_to = "model") |> mutate( aic = map_dbl(model, AIC), r_squared = map_dbl(model, ~ summary(.x)$r.squared) ) models_comparison |> select(model_type, aic, r_squared) ``` To visualize residuals and check model assumptions: ```r penguins_clean |> mutate( predicted = predict(model2), residuals = residuals(model2) ) |> ggplot(aes(x = predicted, y = residuals)) + geom_point(alpha = 0.7) + geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 0.8) + labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals") + theme_minimal() ``` ![Residuals versus fitted values diagnostic plot in R for a multiple linear regression model predicting penguin body mass, with a dashed zero reference line drawn in ggplot2](/images/statistics/multiple-linear-regression-in-r-residuals-vs-fitted-ggplot.png) ## Summary Multiple linear regression in R is straightforward using the [`lm()`](/statistics/how-to-linear-regression-basics-in-r.html) function with the formula syntax `y ~ x1 + x2 + x3`. Key takeaways include: always check your data for missing values before modeling, use `summary()` to examine coefficient significance and model fit (R-squared), and validate assumptions through residual plots. The `tidyverse` approach with pipes makes data preparation and model comparison more intuitive, while functions like `predict()`, `AIC()`, and `anova()` help evaluate model performance. Remember that multiple regression assumes linear relationships, independence of observations, and homoscedasticity of residuals. --- ## Related Posts - [How to Extract p-values from multiple simple linear regression models](/statistics/extract-p-values-from-multiple-simple-linear-regression-models.html) - [How to extract residuals from a linear regression model](/statistics/extract-residuals-from-a-linear-regression-model.html) - [How to get p-value from linear regression model](/statistics/get-p-value-from-linear-regression-model.html) - [How to apply a function on multiple columns using across()](/dplyr/apply-a-function-on-multiple-columns-using-across.html) - [dplyr case_when() to create new variable using multiple conditions](/dplyr/dplyr-case_when-to-create-new-variable-using-multiple-conditions.html)