# Logistic Regression with Single Predictor in R
In this post, we will learn how to perform a simple logistic regression using Generalized Linear Models (GLMs) in R. We will fit a logistic regression model with a binary categorical variable as the response and a single numerical predictor.
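As a quick refresher, logistic regression models the log-odds of the binary outcome as a linear function of the predictor, where p is the probability of the "success" level of the response:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

The coefficients we estimate below are therefore on the log-odds scale.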
### Data for logistic regression
Let us load the packages needed.

```r
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))
```

We will use two variables from the Palmer penguins dataset. To simplify, let us use data from one of the penguin species and select just two variables, sex and body mass. Here, sex is a categorical variable and body mass is a numerical variable.
```r
df <- penguins %>%
  drop_na() %>%
  filter(species == "Gentoo") %>%
  select(sex, body_mass_g)

df %>% head()
```
```
# A tibble: 6 × 2
  sex    body_mass_g
  <fct>        <int>
1 female        4500
2 male          5700
3 female        4450
4 male          5700
5 male          5400
6 female        4550
```

### Variables for logistic regression
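Before visualizing, we can get a quick numerical feel for how body mass differs by sex. This summary is a small optional check (not needed for the model itself), using standard dplyr verbs:

```r
# compare body mass between female and male Gentoo penguins
df %>%
  group_by(sex) %>%
  summarize(mean_mass = mean(body_mass_g),
            sd_mass = sd(body_mass_g),
            n = n())
```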
In this logistic regression model, we are interested in classifying the sex of penguins using body mass. Let us make a simple boxplot to see how well the sex variable is associated with the body mass of penguins.
```r
df %>%
  ggplot(aes(x = sex, y = body_mass_g, color = sex)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1) +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Set1")
ggsave("penguins_sex_vs_body_mass_boxplot.png")
```
*Logistic regression to classify Penguin sex with one predictor*

### Building Logistic Regression with glm()
To build the logistic regression model we will use the glm() function from the stats package, which has implementations of GLMs. The glm() function takes three key arguments:

- formula: where we tell glm() the variables and how we want to model them
- family: here we specify the exponential family that we want to use, "binomial" for logistic regression
- data: a dataframe with all the variables specified in the formula
```r
# single predictor
glm_res_sp <- df %>%
  glm(formula = sex ~ body_mass_g,
      family = binomial,
      data = .)
```

By printing the resulting object, we can see the basic information regarding the model we built. We get the formula we used and the coefficients from the logistic regression model.
```r
glm_res_sp
```

```
Call:  glm(formula = sex ~ body_mass_g, family = binomial, data = .)

Coefficients:
(Intercept)  body_mass_g
  -55.03337      0.01089

Degrees of Freedom: 118 Total (i.e. Null);  117 Residual
Null Deviance:      164.9
Residual Deviance: 45.1    AIC: 49.1
```

### Results from Logistic Regression with single predictor using glm()
To fully investigate the results, we need to use the summary() function on the result object. This gives us the estimated coefficients and the p-value of the association between the two variables. We can see that body mass is strongly associated with sex (p ≈ 2.5e-06).
```r
glm_res_sp %>% summary()
```

```
Call:
glm(formula = sex ~ body_mass_g, family = binomial, data = .)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -55.033367  11.651375  -4.723 2.32e-06 ***
body_mass_g   0.010888   0.002311   4.711 2.47e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 164.893  on 118  degrees of freedom
Residual deviance:  45.105  on 117  degrees of freedom
AIC: 49.105

Number of Fisher Scoring iterations: 7
```
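Because the estimates are on the log-odds scale, a handy way to interpret them is to exponentiate them into odds ratios. A minimal sketch, assuming the "success" level here is male (the second level of the sex factor):

```r
# convert log-odds coefficients into odds ratios
exp(coef(glm_res_sp))
# each extra gram of body mass multiplies the odds of being male
# by exp(0.010888) ≈ 1.011; e.g. +100 g multiplies the odds
# by exp(1.0888) ≈ 3
```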
### Cleaning Logistic Regression results with broom

The results from statistical models like glm() are often hard to read. So we will use the broom package's tidy() function to convert the resulting statistical object into a tidy dataframe.
```r
library(broom)

glm_res_sp %>%
  tidy()
```

```
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept) -55.0      11.7        -4.72  0.00000232
2 body_mass_g   0.0109    0.00231     4.71  0.00000247
```
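tidy() can also report confidence intervals and put the estimates on the odds-ratio scale directly; conf.int and exponentiate are arguments of broom's tidy() method for glm objects:

```r
# odds ratios with 95% confidence intervals
glm_res_sp %>%
  tidy(conf.int = TRUE, exponentiate = TRUE)
```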
We can also visualize the logistic regression model by plotting the two variables and a logistic curve. Note that here we use ggplot2's geom_smooth() function to make the plot.

```r
df %>%
  mutate(sex = as.numeric(sex) - 1) %>%
  ggplot(aes(x = body_mass_g, y = sex)) +
  geom_point() +
  geom_smooth(method = "glm",
              se = TRUE,
              method.args = list(family = binomial))
ggsave("plotting_logistic_regression_model_with_glm.png")
```
*Plotting logistic regression model with geom_smooth()*
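Finally, since our goal was to classify the sex of penguins, we can also pull predicted probabilities out of the fitted model and turn them into class labels. A minimal sketch using base R's predict(); the 0.5 cutoff is an arbitrary choice for illustration:

```r
# predicted probability of each penguin being male
pred_prob <- predict(glm_res_sp, type = "response")

# classify with a 0.5 cutoff and check accuracy on the training data
pred_sex <- ifelse(pred_prob > 0.5, "male", "female")
mean(pred_sex == df$sex)
```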