Introduction to logistic regression in R

statistics
logistic regression
Learn the basics of logistic regression in R with clear examples and explanations.
Published March 26, 2026

Introduction

Logistic regression is a statistical method used to model binary outcomes (yes/no, true/false, 1/0) from one or more predictor variables. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that an observation belongs to a particular category. This tutorial demonstrates how to perform logistic regression in R using the Palmer Penguins dataset to predict whether a penguin belongs to the Gentoo species.
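Under the hood, the model expresses the log-odds of the outcome as a linear function of the predictors, then maps the result back to a probability with the logistic (inverse-logit) function. A quick sketch in base R:

```r
# The logistic (inverse-logit) function maps any log-odds value into (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

inv_logit(0)         # log-odds of 0 correspond to a probability of 0.5
inv_logit(c(-5, 5))  # large negative/positive values approach 0 and 1
```

Base R's plogis() computes the same mapping; spelling it out here just makes the transformation explicit.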

Setup and Data Preparation

Let’s start by loading the necessary packages and examining our data:

library(palmerpenguins)
library(tidyverse)
library(broom)
theme_set(theme_bw(16))

First, let’s look at the structure of our penguin data:

penguins |>
  head()

The dataset contains information about three penguin species with various physical measurements.

Now we’ll prepare our data by removing missing values and creating a binary outcome variable:

penguins <- penguins |>
  drop_na() |>
  mutate(is_gentoo = species == "Gentoo")
penguins |> head()

We created is_gentoo as our binary response variable, which will be TRUE for Gentoo penguins and FALSE for all others.
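Before fitting anything, it is worth checking how balanced the two classes are. dplyr's count() (loaded with the tidyverse) tallies each value of the outcome:

```r
# Tally the binary outcome: Gentoo vs. all other species
penguins |> count(is_gentoo)
```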

Logistic Regression with Single Predictor

Let’s build our first logistic regression model using body mass as a predictor:

logistic_model_sp <- glm(
  is_gentoo ~ body_mass_g,
  data = penguins,
  family = binomial(link = "logit")
)

The glm() function with family = binomial(link = "logit") specifies that we want logistic regression. Let’s examine the model:

logistic_model_sp

For a more detailed summary of the model results:

summary(logistic_model_sp)

The summary shows coefficient estimates, standard errors, z-statistics, and p-values for the intercept and the body-mass predictor.
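A fitted model can also turn new measurements into predicted probabilities via predict() with type = "response" (the body masses below are made-up values for illustration):

```r
# Predicted probability of being a Gentoo for three hypothetical body masses
new_penguins <- data.frame(body_mass_g = c(3500, 4500, 5500))
predict(logistic_model_sp, newdata = new_penguins, type = "response")
```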

Interpreting Coefficients

The broom package provides a cleaner way to extract model results:

logistic_model_sp |> tidy()

This gives us a tidy data frame with our model coefficients and statistics.

To interpret the coefficient as an odds ratio, we exponentiate it:

exp(coefficients(logistic_model_sp)[2])

This odds ratio is the multiplicative change in the odds of being a Gentoo penguin for each 1-gram increase in body mass.
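Because body mass is measured in grams, this per-gram odds ratio sits very close to 1 and is hard to read. Rescaling the coefficient before exponentiating gives the odds ratio for a more meaningful step, say 100 g:

```r
# Odds ratio for a 100 g (rather than 1 g) increase in body mass
exp(coefficients(logistic_model_sp)[2] * 100)
```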

Multiple Predictor Model

Now let’s create a model with multiple predictors:

logistic_model <- glm(
  is_gentoo ~ bill_length_mm + body_mass_g,
  data = penguins,
  family = binomial(link = "logit")
)

This model includes both bill length and body mass as predictors.

logistic_model |> tidy()

Both predictors show their individual effects while controlling for the other variable.
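If you mostly care about odds ratios, tidy() can exponentiate the coefficients and attach 95% confidence intervals in a single call:

```r
# Odds ratios with confidence intervals for the two-predictor model
logistic_model |> tidy(exponentiate = TRUE, conf.int = TRUE)
```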

Let’s get the odds ratio for bill length:

exp(coefficients(logistic_model)[2])

This is the multiplicative change in the odds for each 1 mm increase in bill length, holding body mass constant.

Alternative Single Predictor Model

Let’s try a different single predictor: flipper length. Note that the code below reuses the name logistic_model, replacing the multiple-predictor model from above.

logistic_model <- glm(
  is_gentoo ~ flipper_length_mm,
  data = penguins,
  family = binomial(link = "logit")
)
logistic_model |> tidy()

Flipper length appears to be a strong predictor: its coefficient is large relative to its standard error, and the p-value is small.
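One quick, rough way to compare this model against the earlier body-mass model is AIC, where lower values indicate a better trade-off between fit and complexity (a sketch, not a full model-selection workflow):

```r
# Compare the flipper-length model against the body-mass model by AIC
AIC(logistic_model_sp, logistic_model)
```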

The odds ratio for flipper length:

exp(coefficients(logistic_model)[2])

This shows how flipper length affects the odds of a penguin being a Gentoo species.
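A plot makes the relationship concrete: geom_smooth() can refit the same logistic model and draw the fitted probability curve over the raw data (a sketch using ggplot2, loaded with the tidyverse):

```r
# Fitted probability of being a Gentoo as a function of flipper length
penguins |>
  ggplot(aes(flipper_length_mm, as.numeric(is_gentoo))) +
  geom_point(alpha = 0.3) +
  geom_smooth(
    method = "glm",
    method.args = list(family = binomial),
    se = FALSE
  ) +
  labs(x = "Flipper length (mm)", y = "P(Gentoo)")
```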

Summary

Logistic regression in R is straightforward using the glm() function with the binomial family. Key points to remember:

  • Use family = binomial(link = "logit") for logistic regression
  • Coefficients represent log-odds; exponentiate them to get odds ratios
  • The broom::tidy() function provides clean model output
  • Multiple predictors can be included using the + operator in the formula

Logistic regression is particularly useful when you need to predict binary outcomes and understand which variables influence the probability of success.