Logistic Regression with Single Predictor in R

Published September 18, 2023

Introduction

Logistic regression is used when you want to predict a binary outcome (yes/no, success/failure) based on one predictor variable. Unlike linear regression, it models the probability of an event occurring using the logistic function, which ensures predictions stay between 0 and 1.
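The logistic function itself is easy to compute directly. The sketch below (base R only) shows how it squashes any real-valued input into the 0-1 range, and that R's built-in plogis() implements the same curve:

```r
# The logistic (sigmoid) function maps any real number into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

logistic(0)          # 0.5: a log-odds of 0 corresponds to a 50/50 chance
logistic(c(-5, 5))   # inputs far from 0 give probabilities near 0 and 1
plogis(2)            # base R's logistic CDF: same value as logistic(2)
```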

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to predict whether a penguin is from the Adelie species based on its bill length. This is a binary classification problem where we’re modeling the probability of being Adelie versus not Adelie.

Step 1: Prepare the Data

We’ll create a binary outcome variable and examine our data structure.

penguins_clean <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(species)) |>
  mutate(is_adelie = ifelse(species == "Adelie", 1, 0))

head(penguins_clean)

We now have a dataset with a binary variable is_adelie where 1 means Adelie and 0 means other species.

Step 2: Visualize the Relationship

Before modeling, let’s see how bill length relates to species.

ggplot(penguins_clean, aes(x = bill_length_mm, y = is_adelie)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Bill Length (mm)", y = "Probability of being Adelie")

[Figure: fitted logistic curve showing the probability of a penguin being Adelie as a function of bill length]

The smooth curve shows the logistic relationship: shorter bills are associated with a higher probability of being Adelie.

Step 3: Fit the Logistic Model

We use glm() with family = binomial to fit our logistic regression.

model <- glm(is_adelie ~ bill_length_mm, 
            data = penguins_clean, 
            family = binomial)

summary(model)

The negative coefficient indicates that as bill length increases, the probability of being Adelie decreases.
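To connect the coefficients back to probabilities, you can apply the inverse logit (plogis()) to the linear predictor by hand. The sketch below uses a small made-up dataset (the names d and m and the x/y values are hypothetical, standing in for bill length and species, so the block runs on its own):

```r
# Hypothetical data: x mimics bill length, y a binary species indicator
d <- data.frame(x = c(34, 36, 38, 42, 46, 50),
                y = c(1, 1, 0, 1, 0, 0))
m <- glm(y ~ x, data = d, family = binomial)

# intercept + slope * x gives the log-odds; plogis() converts to probability
b <- coef(m)
plogis(b[1] + b[2] * 40)
# ...which matches what predict() computes internally:
predict(m, data.frame(x = 40), type = "response")
```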

Step 4: Make Predictions

Let’s predict probabilities for specific bill lengths.

new_data <- data.frame(bill_length_mm = c(35, 40, 45, 50))
predictions <- predict(model, new_data, type = "response")
data.frame(bill_length = new_data$bill_length_mm, 
          probability_adelie = round(predictions, 3))

These probabilities show how bill length affects the likelihood of a penguin being Adelie.

Example 2: Practical Application

The Problem

A researcher wants to determine if a car’s weight can predict whether it has high fuel efficiency (mpg > 20). This helps understand the relationship between vehicle weight and fuel economy for purchasing decisions.

Step 1: Create Binary Outcome

We’ll transform the continuous mpg variable into a binary high/low efficiency indicator.

mtcars_binary <- mtcars |>
  mutate(high_mpg = ifelse(mpg > 20, 1, 0))

table(mtcars_binary$high_mpg)

We have 14 cars with high efficiency (>20 mpg) and 18 with lower efficiency.

Step 2: Explore the Data

Let’s visualize how weight relates to fuel efficiency.

ggplot(mtcars_binary, aes(x = wt, y = high_mpg)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Weight (1000 lbs)", y = "Probability of High MPG")

[Figure: fitted logistic curve showing the probability of high fuel efficiency (mpg > 20) as a function of car weight]

Heavier cars clearly have lower probability of achieving high fuel efficiency.

Step 3: Build the Model

We fit a logistic regression to quantify this relationship.

weight_model <- glm(high_mpg ~ wt, 
                   data = mtcars_binary, 
                   family = binomial)

summary(weight_model)

The significant negative coefficient confirms that weight strongly predicts lower fuel efficiency.
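A handy quantity from a single-predictor fit is the decision boundary: the predictor value where the fitted probability crosses 0.5, which is where the log-odds equal zero, at -intercept/slope. A base-R sketch refitting the same model (high_mpg is recreated directly on mtcars so the block runs without tidyverse):

```r
# Refit the same single-predictor model using base R only
mtcars$high_mpg <- as.integer(mtcars$mpg > 20)
m <- glm(high_mpg ~ wt, data = mtcars, family = binomial)

# P = 0.5 exactly where intercept + slope * wt = 0
boundary <- -coef(m)[1] / coef(m)[2]
unname(boundary)   # weight (in 1000 lbs) at which the model is indifferent

# Sanity check: the fitted probability at the boundary is 0.5
predict(m, data.frame(wt = unname(boundary)), type = "response")
```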

Step 4: Calculate Odds Ratios

Odds ratios help interpret the practical impact of weight changes.

exp(coef(weight_model))
exp(confint(weight_model))

For each additional 1000 lbs of weight, the odds of high fuel efficiency are multiplied by the odds ratio for wt (a value less than 1 means the odds decrease).
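The odds-ratio interpretation can be verified by hand: the ratio of the odds at wt + 1 to the odds at wt equals exp(slope), whatever the starting weight. A base-R sketch (the helper odds_at() is our own, and high_mpg is recreated so the block is self-contained):

```r
mtcars$high_mpg <- as.integer(mtcars$mpg > 20)
m <- glm(high_mpg ~ wt, data = mtcars, family = binomial)

# Helper: fitted odds of high efficiency at a given weight
odds_at <- function(wt) {
  p <- predict(m, data.frame(wt = wt), type = "response")
  p / (1 - p)
}

# The odds ratio for a 1-unit (1000 lb) increase is constant...
odds_at(3) / odds_at(2)
# ...and equals the exponentiated slope coefficient
exp(coef(m)["wt"])
```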

Step 5: Evaluate Model Performance

We’ll check how well our model classifies cars.

predicted_probs <- predict(weight_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
confusion_matrix <- table(Actual = mtcars_binary$high_mpg, 
                         Predicted = predicted_class)
print(confusion_matrix)

This confusion matrix shows our model’s accuracy in predicting high versus low fuel efficiency.
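From the confusion matrix, overall accuracy is the share of cars on the diagonal (the correct predictions). A base-R sketch, again recreating the binary outcome so the block runs on its own:

```r
mtcars$high_mpg <- as.integer(mtcars$mpg > 20)
m <- glm(high_mpg ~ wt, data = mtcars, family = binomial)

pred <- as.integer(predict(m, type = "response") > 0.5)
cm <- table(Actual = mtcars$high_mpg, Predicted = pred)

# Diagonal entries are correct classifications
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```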

Summary

  • Logistic regression models binary outcomes using predictor variables and the logistic function
  • Use glm() with family = binomial to fit logistic regression models in R
  • Coefficients are on the log-odds scale; a negative coefficient means the probability of the outcome falls as the predictor increases
  • Convert coefficients using exp() to get odds ratios for easier interpretation
  • Always visualize your data first and evaluate model performance with predicted probabilities