How to use starts_with() in R

Learn how to use starts_with() in R with practical examples. Step-by-step guide with code you can copy and run immediately.
Published: February 21, 2026

Introduction

The starts_with() function is a selection helper that picks columns whose names begin with a given prefix. It comes from the tidyselect package and is re-exported by dplyr, so it is available inside select(), across(), and other verbs that accept tidy selection. It’s particularly useful for datasets with systematic naming conventions, such as columns prefixed with dates, categories, or measurement types, because you no longer need to type each column name by hand.

Getting Started

library(tidyverse)
library(palmerpenguins)

Example 1: Basic Usage

The Problem

Imagine you have a dataset with multiple columns and want to select only those that start with specific letters or prefixes. Manually typing each column name would be tedious and error-prone.

Step 1: Examine the dataset structure

Let’s first look at the column names in the penguins dataset.

data(penguins)
colnames(penguins)
head(penguins, 3)

We can see columns like “bill_length_mm”, “bill_depth_mm”, and “body_mass_g” that follow naming patterns.

Step 2: Select columns starting with “bill”

We’ll use starts_with() to select only columns beginning with “bill”.

penguins |>
  select(starts_with("bill")) |>
  head(5)

This returns only the two bill-related columns: bill_length_mm and bill_depth_mm.
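Two details of the matching are worth seeing in action: by default the prefix match ignores case, and the prefix argument can be a character vector. The sketch below uses a small made-up tibble (the `measurements` object and its column names are assumptions for illustration, not part of the penguins data).

```r
library(dplyr)

# Hypothetical toy data; column names chosen only for illustration
measurements <- tibble(
  Bill_length = c(39.1, 39.5),
  bill_depth  = c(18.7, 17.4),
  body_mass   = c(3750, 3800),
  species     = c("Adelie", "Adelie")
)

# Default: matching ignores case, so "Bill_length" is matched too
any_case <- names(select(measurements, starts_with("bill")))

# Case-sensitive matching keeps only the lower-case column
exact_case <- names(select(measurements, starts_with("bill", ignore.case = FALSE)))

# A character vector of prefixes selects the union of the matches
two_prefixes <- names(select(measurements, starts_with(c("bill", "body"))))

print(any_case)
print(exact_case)
print(two_prefixes)
```

If your column names differ only by capitalization, setting ignore.case = FALSE is the safer choice.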

Step 3: Combine with other selection methods

You can mix starts_with() with other column selection approaches.

penguins |>
  select(species, starts_with("bill"), body_mass_g) |>
  head(4)

This selects the species column, both bill columns, and body_mass_g, giving us a focused subset of the data.
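Beyond listing extra columns, starts_with() can also be negated or intersected with other selection helpers. The following is a minimal sketch on a made-up tibble (`birds` and its `bill_note` column are hypothetical, added only so the intersection has something to exclude).

```r
library(dplyr)

# Hypothetical toy data standing in for a penguins-like table
birds <- tibble(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2),
  bill_note      = c("ok", "recheck"),
  body_mass_g    = c(3750, 5000)
)

# Drop every column whose name starts with "bill"
no_bill <- select(birds, !starts_with("bill"))

# Keep only the "bill" columns that are also numeric
numeric_bill <- select(birds, starts_with("bill") & where(is.numeric))

print(names(no_bill))
print(names(numeric_bill))
```

Here `!` inverts the selection and `&` keeps only columns matched by both conditions, so the character column bill_note is left out of the second result.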

Example 2: Practical Application

The Problem

You’re analyzing car performance data and need to create a summary focusing only on efficiency-related metrics. The mtcars dataset contains various measurements, but you specifically want columns related to miles per gallon and similar efficiency measures.

Step 1: Create extended dataset with prefixed columns

Let’s add some efficiency-related columns with consistent prefixes to demonstrate the concept.

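mtcars_extended <- mtcars |>
  mutate(
    efficiency_mpg = mpg,                # miles per gallon, copied as-is
    efficiency_ratio = mpg / wt,         # mpg per 1,000 lbs (wt is in 1,000 lbs)
    performance_hp = hp,                 # gross horsepower
    performance_speed = mpg * hp / 100   # illustrative composite, not a measured speed
  )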

Now we have columns with “efficiency_” and “performance_” prefixes for systematic selection.

Step 2: Analyze efficiency metrics

Select and summarize all efficiency-related columns.

efficiency_summary <- mtcars_extended |>
  select(starts_with("efficiency")) |>
  summarise(across(everything(), 
                   list(mean = mean, sd = sd)))

This produces a one-row data frame holding the mean and standard deviation of each efficiency column.
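When across() receives a named list of functions, each output column is named from the input column and the function name. A short sketch of how that naming works, passing starts_with() directly to across() instead of calling select() first (a variation on the step above, rebuilt here so the block runs on its own):

```r
library(dplyr)

efficiency_summary <- mtcars |>
  mutate(
    efficiency_mpg   = mpg,
    efficiency_ratio = mpg / wt
  ) |>
  summarise(across(
    starts_with("efficiency"),
    list(mean = mean, sd = sd),
    .names = "{.col}_{.fn}"   # spelled out here; also the default for a named list
  ))

print(names(efficiency_summary))
```

Changing the .names pattern, for example to "{.fn}_{.col}", lets you control whether the statistic appears as a suffix or a prefix.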

Step 3: Compare different metric categories

Pull out the columns for a different measurement category, here the performance metrics, and preview the first six rows.

performance_data <- mtcars_extended |>
  select(starts_with("performance")) |>
  slice_head(n = 6)

print(performance_data)

This approach allows you to quickly isolate and analyze specific categories of measurements without manually specifying each column.
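If you repeat this for several categories, the prefix can become a function argument. The helper below, `summarise_prefix()`, is a hypothetical name introduced for this sketch and is not part of dplyr; it assumes every matched column is numeric.

```r
library(dplyr)

# Hypothetical helper: mean of every column whose name starts with `prefix`
summarise_prefix <- function(data, prefix) {
  data |>
    summarise(across(starts_with(prefix), mean))
}

# Rebuild a reduced version of the extended dataset so the block is self-contained
mtcars_extended <- mtcars |>
  mutate(
    efficiency_mpg   = mpg,
    efficiency_ratio = mpg / wt,
    performance_hp   = hp
  )

efficiency_means  <- summarise_prefix(mtcars_extended, "efficiency")
performance_means <- summarise_prefix(mtcars_extended, "performance")

print(efficiency_means)
print(performance_means)
```

Because the prefix is an ordinary string, the same helper works for any naming scheme without touching the function body.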

Step 4: Use in data transformation pipelines

Apply transformations to groups of similarly-named columns.

mtcars_scaled <- mtcars_extended |>
  # scale() returns a one-column matrix, so convert back to a plain vector
  mutate(across(starts_with("efficiency"), \(x) as.numeric(scale(x)))) |>
  select(starts_with("efficiency"))

head(mtcars_scaled, 4)

Inside across(), starts_with() chooses which columns the transformation applies to, so a single call standardizes every efficiency column at once.
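Sometimes you want the transformed values next to the originals instead of overwriting them. One possible approach, sketched below, uses the .names argument of across(); the "_z" suffix is an arbitrary choice for this example, and the z-score is computed by hand so the result is a plain numeric vector.

```r
library(dplyr)

mtcars_z <- mtcars |>
  mutate(
    efficiency_mpg   = mpg,
    efficiency_ratio = mpg / wt
  ) |>
  mutate(across(
    starts_with("efficiency"),
    \(x) (x - mean(x)) / sd(x),   # z-score as a plain numeric vector
    .names = "{.col}_z"           # keep the originals, add *_z columns
  ))

mtcars_z |>
  select(starts_with("efficiency")) |>
  head(4)
```

The final select() now returns four columns, because the new *_z columns share the "efficiency" prefix with the originals.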

Summary

  • starts_with() is a tidyselect helper, re-exported by dplyr, that selects columns by name prefix so you don’t have to list them individually
  • Inside select() it picks out a group of columns, and it reads naturally in pipelines built with the pipe operator |>
  • It can be combined with other selection methods and used within across() to apply transformations to column groups
  • It’s most valuable for datasets with systematic naming conventions, where it keeps analysis code readable and maintainable
  • Because columns are matched by pattern, code keeps working when new columns with the same prefix are added, and there are fewer names to mistype