How to select columns that start with a prefix/string in R

base R startsWith()
dplyr starts_with()
Learn how to select columns that start with a prefix/string in R. Step-by-step tutorial with examples.
Published

June 23, 2022

Introduction

When working with large datasets, you often need to select multiple columns that share a common naming pattern or prefix. The dplyr package provides select() together with selection helpers such as starts_with(), which match columns by name instead of requiring you to list each one. This approach is particularly useful when dealing with survey data, time series, or any dataset with systematically named variables.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to select only the columns from the penguins dataset that start with “bill” to focus our analysis on bill-related measurements.

Step 1: Examine the dataset structure

First, let’s look at what columns are available in our dataset.

# View column names
colnames(penguins)

# Quick glimpse of the data
glimpse(penguins)

This shows us all available columns, including bill_length_mm and bill_depth_mm which start with our target prefix.

Step 2: Select columns with starts_with()

Now we’ll select only columns that begin with “bill”.

# Select columns starting with "bill"
bill_data <- penguins |>
  select(starts_with("bill"))

head(bill_data)

Here starts_with("bill") matches both bill-related columns, bill_length_mm and bill_depth_mm, so bill_data contains just these two measurements.
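For comparison, the same selection can be sketched in base R with startsWith(), which tests each column name against the prefix and returns a logical vector. The names is_bill and bill_base are illustrative, not part of the original example.

```r
library(palmerpenguins)

# startsWith() returns one TRUE/FALSE per column name
is_bill <- startsWith(names(penguins), "bill")

# Use the logical vector to keep only the matching columns
bill_base <- penguins[, is_bill]

names(bill_base)
```

Unlike starts_with(), base R's startsWith() is always case-sensitive and works on any character vector, not only inside select().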

Step 3: Combine with other selections

We can combine prefix selection with other column selection methods.

# Select bill columns plus species
combined_data <- penguins |>
  select(species, starts_with("bill"))

glimpse(combined_data)

This creates a dataset with the species identifier plus all bill measurements, perfect for species-specific bill analysis.
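Two details of starts_with() are worth a quick sketch: matching ignores case by default (the ignore.case argument), and the helper can be negated with ! to drop the matching columns instead of keeping them. The object names below are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Matching ignores case by default, so "BILL" still finds both bill columns
upper_match <- penguins |>
  select(starts_with("BILL"))

# With ignore.case = FALSE the upper-case prefix matches nothing
strict_match <- penguins |>
  select(starts_with("BILL", ignore.case = FALSE))

# Negate the helper with ! to keep every column except the bill columns
no_bill <- penguins |>
  select(!starts_with("bill"))

names(upper_match)
names(strict_match)
names(no_bill)
```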

Example 2: Practical Application

The Problem

Imagine you’re analyzing car performance data and want to focus on a group of related metrics. The built-in mtcars dataset has no systematically prefixed column names, so we’ll first add a few prefixed columns and then select them by prefix, as you would in a real-world analysis.

Step 1: Create a sample dataset with multiple prefixes

Let’s expand mtcars to demonstrate working with multiple column prefixes.

# Create enhanced dataset with prefixed columns
enhanced_cars <- mtcars |>
  mutate(
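    perf_speed = hp / wt,         # horsepower per 1000 lbs of weight
    perf_efficiency = mpg / wt,   # miles per gallon per 1000 lbs of weight
    spec_displacement = disp      # copy of engine displacement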
  )

We’ve added performance and specification columns with clear prefixes to simulate a real-world scenario.

Step 2: Select performance metrics

Now we’ll select all columns that start with “perf” to focus on performance analysis.

# Select all performance-related columns
performance_data <- enhanced_cars |>
  select(starts_with("perf"))

head(performance_data, 3)

This gives us a clean dataset containing only the performance metrics we calculated.
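The same helper also works inside across(), which is handy when the goal is to summarise the prefixed columns and not only to extract them. The following is a minimal sketch that rebuilds enhanced_cars so it runs on its own; perf_means is an illustrative name.

```r
library(dplyr)

# Rebuild the enhanced dataset so this snippet is self-contained
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

# Compute the mean of every column whose name starts with "perf"
perf_means <- enhanced_cars |>
  summarise(across(starts_with("perf"), mean))

perf_means
```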

Step 3: Multiple prefix selection

We can select columns matching multiple prefixes in a single operation.

# Select multiple prefixes at once
multi_select <- enhanced_cars |>
  select(starts_with("perf"), starts_with("spec"), mpg)

glimpse(multi_select)

This approach lets us gather related variables from different prefix groups while maintaining a clean, focused dataset.
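As an alternative sketch, starts_with() also accepts a character vector of prefixes and selects the union of the matches, which keeps the call short when there are many prefixes. The snippet rebuilds enhanced_cars so it runs on its own; multi_vec is an illustrative name.

```r
library(dplyr)

# Rebuild the enhanced dataset so this snippet is self-contained
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

# One starts_with() call with a vector of prefixes
multi_vec <- enhanced_cars |>
  select(starts_with(c("perf", "spec")), mpg)

names(multi_vec)
```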

Step 4: Advanced filtering with patterns

For more complex scenarios, we can combine several starts_with() calls with logical operators such as | (or).

# Select columns starting with 'm' or 'c'
pattern_select <- mtcars |>
  select(starts_with("m") | starts_with("c"))

colnames(pattern_select)

Here | takes the union of the two selections, so the result contains mpg, cyl, and carb. Logical operators like this let you build more sophisticated column selection rules.
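Two related variations, sketched under the same setup: matches() expresses the "m or c" rule as a single regular expression, and & keeps only the columns that satisfy both helpers. The object names are illustrative.

```r
library(dplyr)

# matches() takes a regular expression; "^[mc]" means "name starts with m or c"
regex_select <- mtcars |>
  select(matches("^[mc]"))

# & keeps only columns that start with "d" AND end with "t"
both_select <- mtcars |>
  select(starts_with("d") & ends_with("t"))

colnames(regex_select)
colnames(both_select)
```

In mtcars the regular expression selects the same three columns as the | version, while the & version narrows disp and drat down to drat alone.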

Summary

  • Use starts_with("prefix") within select() to choose columns beginning with specific text
  • Combine starts_with() with other column names or selection helpers for flexible data subset creation
  • Multiple prefixes can be selected using multiple starts_with() calls separated by commas
  • This method works seamlessly with the pipe operator |> for clean, readable code chains