How to get top and bottom rows of each group in R
Introduction
The slice_max() function in dplyr is a powerful tool for selecting the top n rows with the highest values from your data. Unlike simple sorting, slice_max() lets you efficiently extract just the records you need, making it perfect for finding top performers, highest scores, or maximum values within groups. This function is especially useful when working with grouped data where you want to find the top entries for each category.
Getting Started
First, let’s load the tidyverse package and create some sample data to work with:
library(tidyverse)
We’ll create a dataset with symbols and values to demonstrate different uses of slice_max():
set.seed(2024)
df <- tibble(
  symbol = sample(letters, 10),
  value = rnorm(10, mean = 5, sd = 10)
)
df
This gives us a dataset with 10 random symbols and their corresponding numeric values, including both positive and negative numbers.
Basic Sorting vs slice_max()
Let’s first see what our data looks like when sorted by value:
df |> arrange(value)
The arrange() function sorts all rows, but what if we only want the top 3 highest values? This is where slice_max() becomes useful.
Finding Top Values
To get the 3 rows with the highest values, we use slice_max():
df |> slice_max(value, n = 3)
This returns only the 3 rows with the highest values, ordered from largest to smallest. It is more concise than sorting the entire dataset and then trimming it when you only need the top entries.
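To make the comparison with arrange() concrete, here is a minimal sketch on a small hypothetical tibble (the name scores and its values are made up for illustration and are not part of the dataset above). When there are no ties, sorting in descending order and keeping the first rows should select the same records as slice_max():

```r
library(dplyr)

# Hypothetical example data with distinct values (no ties)
scores <- tibble(
  symbol = c("a", "b", "c", "d", "e"),
  value  = c(3.5, -1.2, 8.0, 0.4, 6.1)
)

# Approach 1: sort in descending order, then keep the first 3 rows
top_sorted <- scores |>
  arrange(desc(value)) |>
  head(3)

# Approach 2: slice_max() selects and orders the top rows in one call
top_sliced <- scores |>
  slice_max(value, n = 3)

# Both approaches pick the same symbols in the same order
identical(top_sorted$symbol, top_sliced$symbol)
```

The two approaches can differ when values are tied, because slice_max() keeps ties by default while head() always cuts at exactly 3 rows.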
Working with Grouped Data
slice_max() becomes even more powerful when combined with group_by(). Let’s first create a grouping variable:
df_grouped <- df |>
  mutate(direction = ifelse(value > 0, "positive", "negative"))
df_grouped
Now we can find the top values within each group:
df_grouped |>
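  group_by(direction) |>
  slice_max(value, n = 2)
This gives us the 2 highest values within the positive group and, separately, the 2 highest values within the negative group. The result is still grouped by direction, so add ungroup() if later steps should treat it as a single table.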
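For the bottom rows of each group, dplyr provides slice_min(), which takes the same arguments as slice_max(). The sketch below uses a hypothetical sales tibble (the column names and numbers are invented for illustration) to pull the lowest and the highest amount per region:

```r
library(dplyr)

# Hypothetical example data: three sales amounts per region
sales <- tibble(
  region = c("north", "north", "north", "south", "south", "south"),
  amount = c(120, 340, 95, 410, 60, 220)
)

# Bottom row per region: the smallest amount in each group
bottom_per_region <- sales |>
  group_by(region) |>
  slice_min(amount, n = 1) |>
  ungroup()

# Top row per region, for comparison
top_per_region <- sales |>
  group_by(region) |>
  slice_max(amount, n = 1) |>
  ungroup()

bottom_per_region
top_per_region
```

If both ends are needed in one table, the two results can be combined with bind_rows().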
Handling Ties
When there are tied values, slice_max() includes all tied observations by default. You can control this behavior:
# Include all ties (default)
df |> slice_max(value, n = 3)
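# Keep exactly n rows; ties are cut off in row order, not at random
df |> slice_max(value, n = 3, with_ties = FALSE)
The with_ties parameter determines whether to include all rows that tie for the nth position. With the default with_ties = TRUE, the result can contain more than n rows; with with_ties = FALSE, slice_max() returns the first n rows and drops the remaining tied rows.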
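Because the random values in df are unlikely to contain ties, the two calls above will usually return the same rows. A small hypothetical tibble with a deliberate tie (the name tied and its values are illustrative) makes the difference visible:

```r
library(dplyr)

# Hypothetical example data: "b" and "c" tie for second place
tied <- tibble(
  symbol = c("a", "b", "c", "d"),
  value  = c(10, 8, 8, 5)
)

# Default: both rows tied for the 2nd position are kept
ties_kept <- tied |> slice_max(value, n = 2)

# with_ties = FALSE: exactly 2 rows, the earlier tied row wins
exact_n <- tied |> slice_max(value, n = 2, with_ties = FALSE)

nrow(ties_kept)  # 3
nrow(exact_n)    # 2
```

If a specific tie-breaking rule matters, one option is to sort the data by a secondary column first, so that the row order reflects the rule you want.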
Practical Example with Real Data
Let’s use the penguins data to see a more realistic example:
library(palmerpenguins)
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  slice_max(body_mass_g, n = 2)
This finds the 2 heaviest penguins of each species, which is useful for understanding the size distribution across different penguin types.
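The filter() step matters most when a group has fewer non-missing values than n, since rows with a missing value could otherwise be used to fill the remaining slots. The sketch below illustrates the filtering pattern on a hypothetical animals tibble (names and values are made up) that does not require palmerpenguins:

```r
library(dplyr)

# Hypothetical measurements; group "y" has only one non-missing mass
animals <- tibble(
  group = c("x", "x", "x", "y", "y"),
  mass  = c(4.1, NA, 3.7, NA, 5.2)
)

# Drop missing values first, then take up to 2 heaviest rows per group
heaviest <- animals |>
  filter(!is.na(mass)) |>
  group_by(group) |>
  slice_max(mass, n = 2) |>
  ungroup()

heaviest
```

Group "y" contributes a single row here because only one non-missing measurement is available. Recent dplyr releases also document an na_rm argument for slice_max(); check the documentation of your installed version before relying on it.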
Summary
slice_max() is a concise way to extract the top n rows based on a specific variable, especially when you only need the highest entries and not the fully sorted dataset. It works particularly well with grouped data, allowing you to find top values within each category. Remember to handle missing values appropriately and to set with_ties = FALSE when exact row counts matter. This function is a great alternative to combining arrange(desc()) and head(), providing cleaner and more intuitive code for common data analysis tasks.