How to select columns that start with a prefix/string in R

base R startsWith()

dplyr starts_with()

Learn how to select columns that start with a prefix/string in R. Step-by-step statistical tutorial with examples.

Published

June 23, 2022

Introduction

When working with large datasets, you often need to select multiple columns that share a common naming pattern or prefix. The dplyr package provides select() together with selection helpers such as starts_with() that make this task simple and efficient. This approach is particularly useful when dealing with survey data, time series, or any dataset with systematically named variables.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

We want to select only the columns from the penguins dataset that start with “bill” to focus our analysis on bill-related measurements.

Step 1: Examine the dataset structure

First, let’s look at what columns are available in our dataset.

# View column names
colnames(penguins)

# Quick glimpse of the data
glimpse(penguins)

This shows us all available columns, including bill_length_mm and bill_depth_mm, which start with our target prefix.

Step 2: Select columns with starts_with()

Now we’ll select only columns that begin with “bill”.

# Select columns starting with "bill"
bill_data <- penguins |>
  select(starts_with("bill"))

head(bill_data)

The starts_with("bill") function identifies and selects both bill-related columns, creating a focused dataset with just these measurements.
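For comparison, the same selection can be sketched in base R with startsWith(), which tests each column name against the prefix. To keep the snippet self-contained, it uses a small made-up data frame (toy, with invented values) that borrows a few penguins column names rather than the real dataset. One difference worth knowing: base R's startsWith() is case-sensitive, whereas starts_with() ignores case by default (ignore.case = TRUE).

```r
# Hypothetical stand-in for penguins: similar column names, invented values
toy <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2),
  body_mass_g    = c(3750, 4500)
)

# Logical vector: TRUE where a column name begins with "bill"
is_bill <- startsWith(names(toy), "bill")

# Keep only the matching columns; drop = FALSE keeps the result a data frame
# even when a single column matches
bill_base <- toy[, is_bill, drop = FALSE]

names(bill_base)
```

The dplyr version is usually easier to read inside a pipeline, while the base R version avoids the extra dependency.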

Step 3: Combine with other selections

We can combine prefix selection with other column selection methods.

# Select bill columns plus species
combined_data <- penguins |>
  select(species, starts_with("bill"))

glimpse(combined_data)

This creates a dataset with the species identifier plus all bill measurements, perfect for species-specific bill analysis.
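The reverse selection follows the same pattern: prefixing the helper with ! keeps every column except the matches. The sketch below uses a small invented data frame (toy) in place of penguins so that it runs on its own; the values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for penguins with invented values
toy <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2),
  body_mass_g    = c(3750, 4500)
)

# Keep everything except the columns starting with "bill"
no_bill <- toy |>
  select(!starts_with("bill"))

names(no_bill)
```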

Example 2: Practical Application

The Problem

Imagine you’re analyzing car performance data and want to focus on groups of related metrics. Starting from the mtcars dataset, we’ll add columns with the prefixes “perf” and “spec”, select them by prefix, and then select the original columns starting with “m” (which includes mpg) or “c”.

Step 1: Create a sample dataset with multiple prefixes

Let’s expand mtcars to demonstrate working with multiple column prefixes.

# Create enhanced dataset with prefixed columns
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

We’ve added performance and specification columns with clear prefixes to simulate a real-world scenario.

Step 2: Select performance metrics

Now we’ll select all columns that start with “perf” to focus on performance analysis.

# Select all performance-related columns
performance_data <- enhanced_cars |>
  select(starts_with("perf"))

head(performance_data, 3)

This gives us a clean dataset containing only the performance metrics we calculated.
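Once a prefix identifies the columns, the same helper can drive calculations as well as selection. As one possible next step (not part of the walkthrough above), this sketch uses across() to average every "perf" column. It rebuilds enhanced_cars so that it runs standalone.

```r
library(dplyr)

# Rebuild the enhanced dataset from Step 1
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

# Average every column whose name starts with "perf"
perf_means <- enhanced_cars |>
  summarise(across(starts_with("perf"), mean))

perf_means
```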

Step 3: Multiple prefix selection

We can select columns matching multiple prefixes in a single operation.

# Select multiple prefixes at once
multi_select <- enhanced_cars |>
  select(starts_with("perf"), starts_with("spec"), mpg)

glimpse(multi_select)

This approach lets us gather related variables from different prefix groups while maintaining a clean, focused dataset.
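A more compact variant of the same idea: starts_with() also accepts a character vector of prefixes and returns the union of the matches. The sketch below rebuilds enhanced_cars so that it runs on its own; multi_select_vec is just an illustrative name.

```r
library(dplyr)

# Rebuild the enhanced dataset from Step 1
enhanced_cars <- mtcars |>
  mutate(
    perf_speed = hp / wt,
    perf_efficiency = mpg / wt,
    spec_displacement = disp
  )

# One call, two prefixes: columns matching either "perf" or "spec", plus mpg
multi_select_vec <- enhanced_cars |>
  select(starts_with(c("perf", "spec")), mpg)

names(multi_select_vec)
```

This should select the same columns as the two separate starts_with() calls; which form reads better is largely a matter of taste.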

Step 4: Advanced selection with patterns

For more complex scenarios, we can combine several starts_with() calls, or other selection helpers, using Boolean operators such as |.

# Select columns starting with 'm' or 'c'
pattern_select <- mtcars |>
  select(starts_with("m") | starts_with("c"))

colnames(pattern_select)

The | operator takes the union of the two selections, so the result contains mpg, cyl, and carb. This demonstrates how to use logical operators to create more sophisticated column selection rules.
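The other Boolean operators work the same way inside select(). As a minimal sketch on mtcars, & keeps only the columns matched by both helpers and ! takes the complement; d_and_t and not_c are illustrative names.

```r
library(dplyr)

# Intersection: columns that start with "d" AND end with "t"
d_and_t <- mtcars |>
  select(starts_with("d") & ends_with("t"))

# Complement: every column that does not start with "c"
not_c <- mtcars |>
  select(!starts_with("c"))

names(d_and_t)
names(not_c)
```

In mtcars, only drat both starts with "d" and ends with "t", and the complement drops cyl and carb.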

Summary

Use starts_with("prefix") within select() to choose columns beginning with specific text
Combine starts_with() with other column names or selection helpers for flexible data subset creation
Multiple prefixes can be selected using multiple starts_with() calls separated by commas
This method works seamlessly with the pipe operator |> for clean, readable code chains
Perfect for analyzing related variables in large datasets with systematic naming conventions
