How to select only numeric columns in a dataframe

dplyr select()
Learn how to select only numeric columns in a dataframe with this comprehensive R tutorial. Includes practical examples and code snippets.
Published May 12, 2022

Introduction

Selecting only numeric columns from a dataframe is a common data preprocessing task in R. This technique is particularly useful when you need to perform mathematical operations, create correlation matrices, or prepare data for statistical analysis that requires only numerical variables.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We have a dataframe with mixed column types, but we only want to work with the numeric columns. Let’s explore different methods to extract just the numerical data from the penguins dataset.

Step 1: Examine the data structure

First, let’s look at what types of columns we’re working with.

# Load the penguins data and examine its structure
data(penguins)
str(penguins)

The output shows that penguins has 344 rows and 8 columns: five numeric columns (bill_length_mm and bill_depth_mm stored as doubles; flipper_length_mm, body_mass_g, and year stored as integers) and three factor columns (species, island, sex).
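When str() output gets long, a compact per-column view can be easier to scan. The sketch below uses a small hypothetical data frame named toy (not the real penguins data) and base R only, to show which columns is.numeric() accepts. Note that integer columns count as numeric, while factors do not.

```r
# Hypothetical toy data standing in for penguins (values are illustrative)
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo")),
  bill_length_mm = c(39.1, 46.1),
  year = c(2007L, 2008L)
)

# First class of each column, as a named character vector
col_classes <- vapply(toy, function(col) class(col)[1], character(1))

# TRUE for columns that is.numeric() would keep (doubles and integers)
is_num <- vapply(toy, is.numeric, logical(1))

col_classes
is_num
```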

Step 2: Use select() with where()

The currently recommended approach passes where() to select(). where() takes a predicate function, applies it to each column, and keeps the columns for which the function returns TRUE.

# Select only numeric columns using where()
numeric_penguins <- penguins |>
  select(where(is.numeric))

# Check the result
names(numeric_penguins)

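This creates a new dataframe containing only the five numeric columns: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year. Note that year is included because it is stored as an integer, and is.numeric() returns TRUE for both integer and double columns.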
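Because where() accepts any predicate, the selection can be narrowed when is.numeric is too broad. The following is a minimal sketch on a hypothetical tibble named toy (illustrative values, not the real dataset). It shows two options: excluding a numeric column by name, and keeping only double-precision columns.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  bill_length_mm = c(39.1, 46.1, 49.5),
  body_mass_g = c(3750L, 5000L, 3800L),
  year = c(2007L, 2008L, 2009L)
)

# is.numeric is TRUE for both double and integer columns
all_numeric <- toy |> select(where(is.numeric))

# Combine where() with a negated name to drop one numeric column
measurements <- toy |> select(where(is.numeric), -year)

# is.double keeps only double-precision columns
doubles_only <- toy |> select(where(is.double))

names(all_numeric)
names(measurements)
names(doubles_only)
```

Which variant fits depends on whether an identifier-like column such as year should be treated as a measurement in your analysis.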

Step 3: Alternative method with select_if()

You can also use select_if() for the same result. It still works, but dplyr marks it as superseded, so you will mostly encounter it in older code.

# Alternative approach using select_if
numeric_penguins_alt <- penguins |>
  select_if(is.numeric)

# Verify both methods give same result
identical(numeric_penguins, numeric_penguins_alt)

Both methods produce identical results, but select(where()) is the preferred syntax for new code because select_if() is superseded.
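For comparison, the same selection can be written without dplyr. This is a base R alternative, not one of the two methods shown above, and it is sketched here on a hypothetical data frame named toy. It may be handy in scripts where loading the tidyverse is not an option.

```r
# Hypothetical toy data standing in for penguins
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo")),
  bill_length_mm = c(39.1, 46.1),
  year = c(2007L, 2008L)
)

# Logical vector marking the numeric columns
is_num <- vapply(toy, is.numeric, logical(1))

# Index the data frame by that logical vector
numeric_base <- toy[is_num]

# Filter() applies the predicate to each column and keeps the matches
numeric_filter <- Filter(is.numeric, toy)

names(numeric_base)
names(numeric_filter)
```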

Example 2: Practical Application

The Problem

Imagine you’re conducting a correlation analysis of penguin physical measurements. You need to remove all categorical variables and missing values, then calculate correlations between the remaining numeric variables to understand relationships between different body measurements.

Step 1: Select numeric columns and remove missing values

We’ll combine numeric selection with data cleaning to prepare for analysis.

# Select numeric columns and remove rows with any missing values
clean_numeric <- penguins |>
  select(where(is.numeric)) |>
  na.omit()

# Check dimensions before and after cleaning
c(original = nrow(penguins), cleaned = nrow(clean_numeric))

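This removes every row that has a missing value in any numeric column, leaving 342 of the original 344 rows. Because the selection happens first, missing values in non-numeric columns such as sex do not cause rows to be dropped.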
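The order of selection and cleaning matters. The sketch below uses a hypothetical tibble named toy (illustrative values) to show that calling na.omit() before the selection also discards rows whose only missing value is in a categorical column.

```r
library(dplyr)

# Hypothetical toy data: one NA in a factor column, one NA in a numeric column
toy <- tibble(
  sex = factor(c("male", NA, "female", "female")),
  bill_length_mm = c(39.1, 46.1, NA, 49.5),
  body_mass_g = c(3750L, 5000L, 3800L, 4100L)
)

# Select first, then drop: only NAs in numeric columns cost a row
clean_after_select <- toy |>
  select(where(is.numeric)) |>
  na.omit()

# Drop first, then select: the NA in sex also costs a row
clean_before_select <- toy |>
  na.omit() |>
  select(where(is.numeric))

c(
  after_select = nrow(clean_after_select),
  before_select = nrow(clean_before_select)
)
```

Selecting first keeps more observations when the categorical columns are not needed for the analysis.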

Step 2: Calculate correlation matrix

Now we can easily compute correlations between all numeric variables.

# Calculate correlation matrix for numeric columns
correlation_matrix <- clean_numeric |>
  cor()

# Display the correlation matrix
round(correlation_matrix, 2)

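The correlation matrix reveals relationships between penguin measurements, such as the strong positive correlation between flipper length and body mass. Because year is stored as an integer, it also appears in the matrix; add select(-year) to the pipeline if you only want body measurements.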
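As an alternative to calling na.omit() beforehand, cor() can handle missing values itself through its use argument. The following sketch relies on a hypothetical data frame named toy with made-up values. It compares listwise deletion ("complete.obs") with pairwise deletion ("pairwise.complete.obs").

```r
# Hypothetical numeric data with a few missing values
toy <- data.frame(
  flipper_length_mm = c(181, 186, NA, 193, 210, 215),
  body_mass_g = c(3750, 3800, 3250, NA, 4500, 5000),
  bill_depth_mm = c(18.7, 17.4, 18.0, 19.3, 14.5, 15.0)
)

# Approach from the text: drop incomplete rows, then correlate
cor_omit <- cor(na.omit(toy))

# Same casewise deletion, done inside cor()
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair of columns uses all rows where both are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

round(cor_complete, 2)
round(cor_pairwise, 2)
```

Pairwise deletion uses more of the data, but each coefficient may then be based on a different set of rows, so the choice is a judgment call.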

Step 3: Create a summary of numeric variables

We can quickly generate summary statistics for all numeric columns.

# Generate summary statistics for numeric columns only
numeric_summary <- clean_numeric |>
  summary()

# Display the summary
numeric_summary

This provides min, max, mean, median, and quartiles for each numeric variable, giving us a comprehensive overview of the data distribution.
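summary() returns a printed table, which is awkward to reuse in later steps. If a dataframe of statistics is more convenient, one option is summarise() with across(), sketched here on a hypothetical tibble named toy. The choice of mean() as the statistic is only an example.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = factor(c("Adelie", "Gentoo", "Gentoo")),
  bill_length_mm = c(40, 50, NA),
  year = c(2007L, 2008L, 2009L)
)

# Mean of every numeric column, ignoring missing values
numeric_means <- toy |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

numeric_means
```

The result is a one-row tibble with one column per numeric variable, so it can be piped into further dplyr steps.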

Summary

  • Use select(where(is.numeric)) to extract only numeric columns from a dataframe using modern tidyverse syntax
  • The where() function allows you to select columns based on their data type or other properties
  • select_if(is.numeric) is an older, superseded syntax that achieves the same result
  • Combining numeric selection with other data cleaning operations like na.omit() creates analysis-ready datasets
  • This technique is essential for correlation analysis, mathematical operations, and statistical modeling that requires purely numeric data