How to extract a column of a dataframe as a vector in R

dplyr
dplyr pull()
Learn how to extract a column of a dataframe as a vector in R. A step-by-step tutorial with examples.
Published

April 15, 2022

Introduction

Extracting a column from a dataframe as a vector is a fundamental operation in R data analysis. You need it whenever you perform statistical calculations, create plots, or pass column data to functions that expect a vector rather than a one-column dataframe.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We have the penguins dataset and want to extract the body mass column as a standalone vector. This allows us to use the data with functions that specifically require vector inputs.

Step 1: View the Data Structure

First, let’s examine our dataset to understand what we’re working with.

# Look at the structure of penguins data
head(penguins)
str(penguins$body_mass_g)

This shows us the first rows of the penguins dataframe and confirms that body_mass_g is an integer column (a numeric type).

Step 2: Extract Using Dollar Sign Notation

The most common method uses the $ operator to extract a column as a vector.

# Extract body mass as a vector
body_mass_vector <- penguins$body_mass_g

# Check the class and length
class(body_mass_vector)
length(body_mass_vector)

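This creates a vector containing every body mass value, including any NA values. Because body mass is stored in whole grams, class() reports "integer", and the length equals the number of rows in the dataframe.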
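One caveat with $ can be sketched on a small made-up data frame (toy_df and toy_tbl are illustrative names, not part of the tutorial's data). On a base data.frame, $ partially matches column names, while [[ ]] needs an exact match, and tibbles such as penguins do not partial-match at all.

```r
library(tibble)

# Hypothetical toy data, not the penguins dataset
toy_df <- data.frame(body_mass_g = c(3750L, 3800L, NA))
toy_tbl <- as_tibble(toy_df)

# On a base data.frame, $ partially matches an abbreviated column name
partial <- toy_df$body

# [[ ]] matches exactly by default, so the abbreviated name returns NULL
exact <- toy_df[["body"]]

# A tibble refuses the partial match: it warns and returns NULL
tbl_partial <- suppressWarnings(toy_tbl$body)
```

Writing the full column name avoids the difference entirely.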

Step 3: Extract Using Bracket Notation

Bracket notation is an alternative, but only double brackets return a vector. Single brackets return a one-column dataframe.

# Using single brackets (returns dataframe)
mass_df <- penguins["body_mass_g"]

# Using double brackets (returns vector)
mass_vector <- penguins[["body_mass_g"]]
class(mass_vector)

Double brackets return a vector, while single brackets maintain the dataframe structure.
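The bracket forms can be compared side by side on a small made-up data frame (toy_df and toy_tbl are hypothetical names used only here). The sketch also shows one difference that is easy to miss: the two-index form df[, "col"] drops to a vector on a base data.frame but stays a tibble on a tibble such as penguins.

```r
library(tibble)

# Hypothetical toy data, not the penguins dataset
toy_df <- data.frame(id = 1:3, body_mass_g = c(3750L, 3800L, NA))
toy_tbl <- as_tibble(toy_df)

# Single brackets keep the data frame structure
one_col_df <- toy_df["body_mass_g"]

# Double brackets return the underlying vector
vec <- toy_df[["body_mass_g"]]

# Row/column indexing: a base data.frame drops a single column to a vector...
dropped <- toy_df[, "body_mass_g"]

# ...while a tibble keeps it as a one-column tibble
kept <- toy_tbl[, "body_mass_g"]

class(one_col_df)
class(vec)
class(dropped)
class(kept)
```

Because of this, [[ ]] is the safer choice when the code has to work with both data frames and tibbles.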

Step 4: Extract Using dplyr’s pull()

The modern tidyverse approach uses the pull() function for vector extraction.

# Extract using pull() function
mass_pulled <- penguins |> 
  pull(body_mass_g)

# Verify it's identical to $ notation
identical(mass_pulled, penguins$body_mass_g)

The pull() function integrates seamlessly with pipe workflows and returns a vector.
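pull() also accepts a column position and a name argument. The following is a minimal sketch on made-up data (toy_df and its values are hypothetical) that shows both options.

```r
library(dplyr)

# Hypothetical toy data, not the penguins dataset
toy_df <- data.frame(
  id = c("a", "b", "c"),
  body_mass_g = c(3750, 3800, 3250),
  stringsAsFactors = FALSE
)

# With no column given, pull() returns the last column
last_col <- pull(toy_df)

# A positive integer selects a column by position from the left
first_col <- pull(toy_df, 1)

# name = builds a named vector, taking the names from another column
named_mass <- pull(toy_df, body_mass_g, name = id)
names(named_mass)
```

A named vector can be handy when the values need to be looked up by a key later on.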

Example 2: Practical Application

The Problem

We want to calculate summary statistics for penguin flipper lengths, but only for Adelie penguins. We need to filter the data first, then extract the flipper length column as a vector for statistical analysis.

Step 1: Filter and Extract in One Pipeline

We’ll combine filtering with vector extraction using dplyr functions.

# Filter for Adelie penguins and extract flipper length
adelie_flippers <- penguins |> 
  filter(species == "Adelie") |> 
  pull(flipper_length_mm)

head(adelie_flippers)

This creates a vector containing only the flipper lengths of Adelie penguins. All other species are excluded by the filter.

Step 2: Remove Missing Values

Clean the vector by removing NA values for accurate calculations.

# Remove NA values from the vector
adelie_flippers_clean <- adelie_flippers[!is.na(adelie_flippers)]

# Check the difference
length(adelie_flippers)
length(adelie_flippers_clean)

This comparison shows how many missing values were removed from our vector.
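The same check can be sketched on a short made-up vector (the values below are hypothetical). It counts the missing values directly and shows that many summary functions can skip them through na.rm = TRUE, without building a cleaned copy first.

```r
# Hypothetical flipper lengths with one missing value
flippers <- c(181L, 186L, NA, 193L)

# Count the missing values directly
n_missing <- sum(is.na(flippers))

# Option 1: drop NA values, then summarise
flippers_clean <- flippers[!is.na(flippers)]
mean_clean <- mean(flippers_clean)

# Option 2: let the function skip NA values
mean_na_rm <- mean(flippers, na.rm = TRUE)

# Without either option, the result is NA
mean_raw <- mean(flippers)
```

Both options give the same mean. A cleaned vector is the more convenient choice when several calculations reuse it.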

Step 3: Perform Statistical Analysis

Now use the clean vector for statistical calculations that require vector input.

# Calculate summary statistics
mean_flipper <- mean(adelie_flippers_clean)
sd_flipper <- sd(adelie_flippers_clean)
quantiles <- quantile(adelie_flippers_clean, c(0.25, 0.5, 0.75))

print(paste("Mean:", round(mean_flipper, 2)))

These calculations demonstrate why extracting vectors is useful for statistical functions.
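To see what goes wrong without extraction, here is a minimal sketch on made-up data (toy_df is a hypothetical name). Passing a one-column data frame to mean() produces a warning and NA, while the extracted vector gives the expected result.

```r
# Hypothetical toy data, not the penguins dataset
toy_df <- data.frame(flipper_length_mm = c(181L, 186L, 195L))

# A one-column data frame is not a numeric vector, so mean() warns
# and returns NA (the warning is suppressed here to keep output clean)
from_df <- suppressWarnings(mean(toy_df["flipper_length_mm"]))

# The extracted vector works as expected
from_vec <- mean(toy_df[["flipper_length_mm"]])

is.na(from_df)
from_vec
```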

Step 4: Alternative Using Base R

Compare with a base R approach for the same result.

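# Base R equivalent: subset the column vector directly, because
# penguins[rows, "flipper_length_mm"] returns a tibble, not a vector
keep <- penguins$species == "Adelie" & !is.na(penguins$flipper_length_mm)
adelie_base <- penguins$flipper_length_mm[keep]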

# Verify identical results
identical(adelie_flippers_clean, adelie_base)

Both approaches yield the same vector, but dplyr offers more readable syntax.

Summary

  • Use $ notation (e.g., df$column) for quick vector extraction in interactive analysis
  • Use the pull() function when working within tidyverse pipelines for better code readability
  • Double brackets [[]] work well for programmatic extraction, especially with variable column names
  • Always consider whether you need to handle missing values before performing calculations
  • Vector extraction is essential for statistical functions, plotting, and passing data to other R functions
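The point about variable column names can be sketched as follows. The helper extract_column() and the data in toy_df are hypothetical and serve only as an illustration of extraction with [[ ]] when the column name is stored in a string.

```r
# Hypothetical toy data, not the penguins dataset
toy_df <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3),
  flipper_length_mm = c(181, 186, 195)
)

# Hypothetical helper: return the column named by a string as a vector
extract_column <- function(data, column_name) {
  stopifnot(column_name %in% names(data))
  data[[column_name]]
}

# The column name is only known at run time
target <- "flipper_length_mm"
flippers <- extract_column(toy_df, target)

# Reuse the helper across every column to get a named vector of means
column_means <- sapply(
  names(toy_df),
  function(nm) mean(extract_column(toy_df, nm))
)
```

Note that toy_df$target would not work here, because $ looks for a column literally called target rather than using the string stored in the variable.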