How to select top and bottom rows by a column simultaneously
Introduction
When analyzing data, you often need to examine both extremes of a distribution - the highest and lowest values in a column. Instead of running separate queries, dplyr allows you to efficiently select top and bottom rows simultaneously. This technique is particularly useful for identifying outliers, comparing extremes, or creating summary reports that highlight contrasts in your data.
Getting Started
library(tidyverse)Example 1: Basic Usage
The Problem
We want to find both the heaviest and lightest penguins from our dataset to understand the weight distribution extremes. This requires selecting rows from both ends of the weight spectrum in a single operation.
Step 1: Load and examine the data
First, let’s load the penguins dataset and take a quick look at the structure.
library(palmerpenguins)
data(penguins)
head(penguins)
glimpse(penguins$body_mass_g)This shows us the penguin data structure and confirms we have body mass measurements to work with.
Step 2: Use slice_min() and slice_max() with bind_rows()
We’ll select the 3 lightest and 3 heaviest penguins by combining two slice functions.
lightest <- penguins |>
slice_min(body_mass_g, n = 3, na.rm = TRUE)
heaviest <- penguins |>
slice_max(body_mass_g, n = 3, na.rm = TRUE)
extremes <- bind_rows(lightest, heaviest)This approach creates two separate dataframes and combines them, giving us 6 penguins representing both weight extremes.
Step 3: View the results
Let’s examine our combined results to see the weight distribution.
extremes |>
select(species, island, body_mass_g) |>
arrange(body_mass_g)The output shows penguins ordered from lightest to heaviest, clearly displaying the weight range extremes in our dataset.
Example 2: Practical Application
The Problem
A car dealership wants to create a marketing report showing both the most fuel-efficient and least fuel-efficient vehicles in their inventory. They need to identify the top 5 cars in each category based on miles per gallon (mpg) to highlight their range of offerings to different customer segments.
Step 1: Prepare the dataset with car names
We’ll add car names to make our results more meaningful for the marketing report.
cars_data <- mtcars |>
rownames_to_column("car_model") |>
select(car_model, mpg, cyl, hp, wt)
head(cars_data)This creates a cleaner dataset with car names as a column, making our results easier to interpret.
Step 2: Create a custom function for top and bottom selection
We’ll build a reusable function that combines the selection process into one operation.
select_extremes <- function(data, column, n = 5) {
top_values <- data |>
slice_max({{ column }}, n = n, with_ties = FALSE)
bottom_values <- data |>
slice_min({{ column }}, n = n, with_ties = FALSE)
bind_rows(bottom_values, top_values)
}This function encapsulates our logic and can be reused for any column and number of extreme values.
Step 3: Apply the function and create the report
Now we’ll use our function to generate the fuel efficiency report.
fuel_extremes <- cars_data |>
select_extremes(mpg, n = 5)
fuel_extremes |>
mutate(category = ifelse(mpg < 20, "Low MPG", "High MPG")) |>
select(car_model, mpg, category, hp, wt) |>
arrange(mpg)The results show 10 vehicles split between fuel-efficient and gas-guzzling categories, perfect for targeted marketing campaigns.
Step 4: Add percentage-based selection
For more flexible analysis, we can select extremes based on percentages rather than fixed numbers.
cars_data |>
filter(
mpg <= quantile(mpg, 0.2) |
mpg >= quantile(mpg, 0.8)
) |>
arrange(mpg)This approach selects the bottom 20% and top 20% of vehicles by fuel efficiency, automatically adjusting to dataset size.
Summary
- Use
slice_min()andslice_max()combined withbind_rows()to select extremes simultaneously - Create reusable functions with
{ }syntax to make extreme selection more efficient across different analyses
- Consider percentage-based selection with
quantile()for more flexible extreme identification that scales with data size - Always use
na.rm = TRUEorwith_ties = FALSEparameters to handle missing values and ensure consistent result counts This technique works excellently for outlier analysis, comparative reporting, and creating datasets that showcase data range extremes