How to Perform Logistic Regression in R

statistics
logistic regression
Complete guide to logistic regression in R using glm(). Learn binary classification with worked examples, model interpretation, predictions, and accuracy evaluation.
Published

February 21, 2026

Introduction

Logistic regression is a statistical method used to predict binary outcomes (yes/no, success/failure, 0/1) based on one or more predictor variables. Unlike linear regression, which predicts continuous values, logistic regression uses the logistic function to model the probability that an instance belongs to a particular category.

You would use logistic regression when your dependent variable is categorical with two levels, such as predicting whether an email is spam, if a customer will purchase a product, or whether a medical treatment will be successful. It’s particularly valuable because it provides both predictions and probabilities, along with interpretable coefficients that show how each predictor affects the odds of the outcome.

The Math Behind Logistic Regression

Logistic regression models the probability of the outcome using the logistic (sigmoid) function:

P(Y=1) = 1 / (1 + e^(-z))

Where z is the linear combination of predictors:

z = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ

The key insight is that we’re modeling the log-odds (also called logit):

log(P / (1-P)) = β₀ + β₁X₁ + β₂X₂ + ...

This means:

  • Each coefficient βᵢ represents the change in log-odds for a one-unit increase in Xᵢ
  • To get odds ratios, exponentiate the coefficients: exp(βᵢ)
  • An odds ratio of 2 means the odds double for each unit increase in the predictor
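To make the arithmetic concrete, here is a small base-R sketch; the coefficient values β₀ = -3 and β₁ = 0.7 are made up for illustration, not from any fitted model:

```r
# Made-up coefficients for illustration (not from a fitted model)
beta0 <- -3
beta1 <- 0.7

# Linear predictor (log-odds) and probability at x = 4
x <- 4
z <- beta0 + beta1 * x
p <- 1 / (1 + exp(-z))        # the logistic function; identical to plogis(z)

# Odds ratio for a one-unit increase in x
exp(beta1)

# Check: raising x by 1 multiplies the odds p / (1 - p) by exactly exp(beta1)
p2 <- plogis(beta0 + beta1 * (x + 1))
(p2 / (1 - p2)) / (p / (1 - p))   # equals exp(beta1)
```

The last line shows why exponentiated coefficients are called odds ratios: a one-unit change in the predictor scales the odds by a constant factor, whatever the starting value of x.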

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to predict penguin sex (male/female) based on body mass. This is a binary classification problem - perfect for logistic regression.

Preparing the data

First, load the data and remove rows with missing values in our variables of interest:

data(penguins)

penguins_clean <- penguins |>
  filter(!is.na(sex), !is.na(body_mass_g))

Fitting the model

Use glm() with family = binomial to fit a logistic regression:

model_basic <- glm(sex ~ body_mass_g,
                   data = penguins_clean,
                   family = binomial)

summary(model_basic)

The family = binomial argument tells R we want logistic regression (not linear regression).

Making predictions

Use predict() with type = "response" to get probabilities (not log-odds):

penguins_clean <- penguins_clean |>
  mutate(
    predicted_prob = predict(model_basic, type = "response"),
    predicted_sex = ifelse(predicted_prob > 0.5, "male", "female")
  )

Probabilities > 0.5 are classified as “male”, otherwise “female”.

Checking accuracy

accuracy <- mean(penguins_clean$sex == penguins_clean$predicted_sex)
print(paste("Accuracy:", round(accuracy, 3)))

This tells us what percentage of predictions were correct.

Example 2: Practical Application

The Problem

A single predictor (body mass) might not be enough. Let’s build a better model using multiple measurements: bill length, bill depth, flipper length, body mass, and species.

Preparing the data

Remove rows with missing values in any of our predictors:

penguins_full <- penguins |>
  filter(
    !is.na(sex),
    !is.na(bill_length_mm),
    !is.na(bill_depth_mm),
    !is.na(flipper_length_mm),
    !is.na(body_mass_g)
  )

Fitting the multiple logistic regression

Add multiple predictors separated by +:

model_full <- glm(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + species,
  data = penguins_full,
  family = binomial
)

summary(model_full)

The summary shows a coefficient for each predictor. Because R models the second level of the sex factor (“male”) as the event, positive coefficients increase the log-odds (and therefore the probability) of “male”, and negative coefficients decrease it.

Making predictions and evaluating

penguins_results <- penguins_full |>
  mutate(
    predicted_prob = predict(model_full, type = "response"),
    predicted_sex = ifelse(predicted_prob > 0.5, "male", "female"),
    correct_prediction = sex == predicted_sex
  )

Calculating accuracy

accuracy_full <- mean(penguins_results$correct_prediction)
print(paste("Full model accuracy:", round(accuracy_full, 3)))

The multi-predictor model should have higher accuracy than the single-predictor model.

Creating a confusion matrix

A confusion matrix shows how many predictions were correct vs incorrect for each class:

confusion_matrix <- penguins_results |>
  count(sex, predicted_sex) |>
  pivot_wider(names_from = predicted_sex, values_from = n, values_fill = 0)

print(confusion_matrix)

This shows true positives, true negatives, false positives, and false negatives.
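For example, with a toy set of made-up labels you can pull those four counts out of a base-R table() and compute sensitivity and specificity, treating “male” as the positive class:

```r
# Toy vectors with made-up labels, just to show how the counts break down
actual    <- c("male", "male", "male", "female", "female", "female", "female", "male")
predicted <- c("male", "male", "female", "female", "female", "male", "female", "male")

cm <- table(actual = actual, predicted = predicted)
print(cm)

tp <- cm["male", "male"]       # true positives
tn <- cm["female", "female"]   # true negatives
fp <- cm["female", "male"]     # false positives
fn <- cm["male", "female"]     # false negatives

sensitivity <- tp / (tp + fn)  # share of actual males correctly identified
specificity <- tn / (tn + fp)  # share of actual females correctly identified
```

With real model output, replace the toy vectors with the sex and predicted_sex columns of penguins_results.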

Visualizing predicted probabilities

penguins_results |>
  ggplot(aes(x = predicted_prob, fill = sex)) +
  geom_histogram(alpha = 0.7, bins = 20, position = "identity") +
  labs(
    title = "Distribution of Predicted Probabilities by Actual Sex",
    x = "Predicted Probability of Being Male",
    y = "Count"
  ) +
  theme_minimal()

Figure: histogram of predicted probabilities from the logistic regression model, colored by actual sex, showing clear separation between male and female penguins.

A good model shows clear separation: actual males should have high predicted probabilities, females should have low ones.

Interpreting Odds Ratios

To understand coefficient magnitudes, calculate odds ratios:

# Get odds ratios from model coefficients
odds_ratios <- exp(coef(model_full))
print(round(odds_ratios, 3))

Interpretation example: If the odds ratio for body_mass_g is 1.005, it means for each additional gram of body mass, the odds of being male increase by 0.5%. For a 100g increase, odds increase by approximately 65% (1.005^100 ≈ 1.65).
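You can check that arithmetic directly in R; the 1.005 odds ratio here is the hypothetical value from the example above, not a fitted coefficient:

```r
# Hypothetical per-gram odds ratio (illustrative, not from a fitted model)
or_per_gram <- 1.005

# Odds multiplier for a 100 g increase: compound over 100 one-unit steps
or_per_100g <- or_per_gram^100
round(or_per_100g, 3)             # about 1.647, i.e. roughly 65% higher odds

# Equivalent computation on the log-odds scale
exp(100 * log(or_per_gram))
```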

Assumptions and Limitations

Key assumptions of logistic regression:

  1. Binary outcome: The dependent variable must have exactly two categories
  2. Independence: Observations should be independent of each other
  3. No multicollinearity: Predictors should not be highly correlated with each other
  4. Linear relationship: The log-odds should be linearly related to continuous predictors
  5. Large sample size: A common rule of thumb is at least 10-20 events per predictor variable

Check for multicollinearity:

# Check variance inflation factors
library(car)
vif(model_full)

VIF values above 5-10 indicate problematic multicollinearity.

Don’t use logistic regression when:

  • Your outcome has more than 2 categories (use multinomial logistic regression instead)
  • You have very few events relative to predictors (causes separation problems)
  • Relationships between predictors and log-odds are highly non-linear

Common Mistakes

1. Forgetting family = binomial

# Wrong - defaults to family = gaussian (ordinary linear regression)
model_wrong <- glm(sex ~ body_mass_g, data = penguins_clean)

# Correct - fits logistic regression
model_right <- glm(sex ~ body_mass_g, data = penguins_clean, family = binomial)

2. Using predictions without type = "response"

# Wrong - returns log-odds (hard to interpret)
predict(model_basic)

# Correct - returns probabilities (0 to 1)
predict(model_basic, type = "response")

3. Interpreting coefficients as probabilities

Coefficients are in log-odds scale, not probability scale. A coefficient of 0.5 doesn’t mean “50% increase in probability” - it means the log-odds increase by 0.5.
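A quick base-R illustration of why this matters: the same log-odds increase of 0.5 (a made-up coefficient) translates into different probability changes depending on the baseline log-odds:

```r
# Made-up coefficient on the log-odds scale
beta <- 0.5

# From a baseline probability of 0.5 (log-odds = 0):
plogis(0 + beta) - plogis(0)    # increase of about 0.12, not 0.50

# From a baseline probability of about 0.88 (log-odds = 2):
plogis(2 + beta) - plogis(2)    # a much smaller increase, about 0.04
```

plogis() is R's logistic CDF, i.e. the inverse of the log-odds transformation, so it maps log-odds back to probabilities.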

4. Ignoring class imbalance

If one class is much more common, the model may predict only the majority class. Consider:

  • Adjusting the classification threshold
  • Using stratified sampling
  • Using metrics like AUC-ROC instead of accuracy
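As a sketch of threshold adjustment, here is a simulation with an artificially rare positive class; all values (the seed, sample size, and coefficients) are made up for illustration:

```r
# Simulated imbalanced data: intercept -3 makes the "1" class rare
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + 2 * x))

fit <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Default 0.5 cutoff vs a lower cutoff tuned toward the rare class
mean(prob > 0.5)   # fraction predicted positive at the default threshold
mean(prob > 0.2)   # a lower threshold flags more of the rare class
```

Lowering the threshold trades false negatives for false positives; pick the cutoff that matches the costs of each error in your application.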

Summary

  • Use glm(y ~ x, family = binomial) for logistic regression
  • Use predict(model, type = "response") to get probabilities (0-1), not log-odds
  • Classify using a threshold: ifelse(prob > 0.5, "positive", "negative")
  • Multiple predictors usually improve accuracy
  • Evaluate with accuracy and confusion matrix
  • Visualize predicted probabilities to assess model quality