How to select columns that start with a prefix/string in R
Introduction
When working with large datasets, you often need to select multiple columns that share a common naming pattern or prefix. The dplyr package provides select() together with selection helpers such as starts_with() that make this task simple and efficient. This approach is particularly useful when dealing with survey data, time series, or any dataset with systematically named variables.
Getting Started
library(tidyverse)
library(palmerpenguins)
Example 1: Basic Usage
The Problem
We want to select only the columns from the penguins dataset that start with “bill” to focus our analysis on bill-related measurements.
Step 1: Examine the dataset structure
First, let’s look at what columns are available in our dataset.
# View column names
colnames(penguins)
# Quick glimpse of the data
glimpse(penguins)
This shows us all available columns, including bill_length_mm and bill_depth_mm, which start with our target prefix.
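Before selecting, it can help to check which names a prefix would match. The sketch below uses base R's startsWith() on a hard-coded vector of the penguins column names, so it runs without palmerpenguins installed; the variable names penguin_cols and bill_cols are illustrative only. Note that startsWith() is case-sensitive, whereas dplyr's starts_with() ignores case by default.

```r
# Column names as they appear in palmerpenguins::penguins,
# hard-coded here so the snippet is self-contained
penguin_cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
                  "flipper_length_mm", "body_mass_g", "sex", "year")

# Base R: logical test of each name against the prefix (case-sensitive)
bill_cols <- penguin_cols[startsWith(penguin_cols, "bill")]
bill_cols
```

With the real dataset loaded, names(penguins) could replace the hard-coded vector.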
Step 2: Select columns with starts_with()
Now we’ll select only columns that begin with “bill”.
# Select columns starting with "bill"
bill_data <- penguins |>
  select(starts_with("bill"))
head(bill_data)
The starts_with("bill") function identifies and selects both bill-related columns, creating a focused dataset with just these measurements.
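One detail worth knowing is that starts_with() matches case-insensitively unless ignore.case = FALSE is passed. The following sketch illustrates this on the built-in iris dataset (swapped in here because its column names are capitalised); loose and strict are illustrative names.

```r
library(dplyr)

# iris columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

# Default behaviour: case is ignored, so "sepal" matches "Sepal.*"
loose <- iris |>
  select(starts_with("sepal"))

# Case-sensitive matching: no column starts with lowercase "sepal",
# so the result has zero columns (no error is raised)
strict <- iris |>
  select(starts_with("sepal", ignore.case = FALSE))

names(loose)
ncol(strict)
```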
Step 3: Combine with other selections
We can combine prefix selection with other column selection methods.
# Select bill columns plus species
combined_data <- penguins |>
  select(species, starts_with("bill"))
glimpse(combined_data)
This creates a dataset with the species identifier plus all bill measurements, perfect for species-specific bill analysis.
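The order of arguments in select() also determines the order of the resulting columns, and everything() can be appended to keep the remaining columns. A minimal sketch, again using the built-in iris dataset as a stand-in so that no extra package is needed:

```r
library(dplyr)

# Put Species first, then the Petal columns, then all remaining columns
reordered <- iris |>
  select(Species, starts_with("Petal"), everything())

names(reordered)
```

This keeps every column while moving the prefixed group next to the identifier.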
Example 2: Practical Application
The Problem
Imagine you’re analyzing car performance data and want to focus on groups of related metrics. Using the mtcars dataset, we’ll add columns with “perf” and “spec” prefixes, select them by prefix, and then select existing columns starting with “m” or “c” (which include mpg, cyl, and carb).
Step 1: Create a sample dataset with multiple prefixes
Let’s expand mtcars to demonstrate working with multiple column prefixes.
# Create enhanced dataset with prefixed columns
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )
We’ve added performance and specification columns with clear prefixes to simulate a real-world scenario.
Step 2: Select performance metrics
Now we’ll select all columns that start with “perf” to focus on performance analysis.
# Select all performance-related columns
performance_data <- enhanced_cars |>
  select(starts_with("perf"))
head(performance_data, 3)
This gives us a clean dataset containing only the performance metrics we calculated.
Step 3: Multiple prefix selection
We can select columns matching multiple prefixes in a single operation.
# Select multiple prefixes at once
multi_select <- enhanced_cars |>
  select(starts_with("perf"), starts_with("spec"), mpg)
glimpse(multi_select)
This approach lets us gather related variables from different prefix groups while maintaining a clean, focused dataset.
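As an alternative to repeating the helper, starts_with() also accepts a character vector and selects the union of the matches. The sketch below rebuilds enhanced_cars so that it runs on its own; prefixes and vector_select are illustrative names.

```r
library(dplyr)

# Recreate the enhanced dataset from Step 1 so the snippet is self-contained
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

# A character vector of prefixes: columns matching any of them are selected
prefixes <- c("perf", "spec")

vector_select <- enhanced_cars |>
  select(starts_with(prefixes), mpg)

names(vector_select)
```

Keeping the prefixes in a vector can be convenient when the list is long or built programmatically.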
Step 4: Advanced selection with patterns
For more complex scenarios, we can combine several starts_with() calls with logical operators such as |.
# Select columns starting with 'm' or 'c'
pattern_select <- mtcars |>
  select(starts_with("m") | starts_with("c"))
colnames(pattern_select)
This demonstrates how to use logical operators to create more sophisticated column selection rules.
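Besides | for "or", selection helpers can be combined with & for "and" and ! for "not". The following is a small sketch on mtcars; the result names c_and_b and rest are illustrative.

```r
library(dplyr)

# mtcars columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

# Intersection: starts with "c" AND ends with "b" (only carb qualifies)
c_and_b <- mtcars |>
  select(starts_with("c") & ends_with("b"))

# Complement: every column that starts with neither "m" nor "c"
rest <- mtcars |>
  select(!(starts_with("m") | starts_with("c")))

names(c_and_b)
names(rest)
```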
Summary
- Use starts_with("prefix") within select() to choose columns beginning with specific text
- Combine starts_with() with other column names or selection helpers for flexible data subset creation
- Multiple prefixes can be selected using multiple starts_with() calls separated by commas
- This method works seamlessly with the pipe operator |> for clean, readable code chains
- Perfect for analyzing related variables in large datasets with systematic naming conventions