How to select top and bottom rows by a column simultaneously

dplyr slice_max()
rank()
Learn how to select top and bottom rows by a column simultaneously with this comprehensive R tutorial. Includes practical examples and code snippets.
Published

September 2, 2024

Introduction

When analyzing data, you often need to examine both extremes of a distribution - the highest and lowest values in a column. Instead of running separate queries, dplyr allows you to efficiently select top and bottom rows simultaneously. This technique is particularly useful for identifying outliers, comparing extremes, or creating summary reports that highlight contrasts in your data.

Getting Started

library(tidyverse)

Example 1: Basic Usage

The Problem

We want to find both the heaviest and lightest penguins from our dataset to understand the weight distribution extremes. This requires selecting rows from both ends of the weight spectrum in a single operation.

Step 1: Load and examine the data

First, let’s load the penguins dataset and take a quick look at the structure.

library(palmerpenguins)
data(penguins)

head(penguins)
glimpse(penguins$body_mass_g)

This shows us the penguin data structure and confirms we have body mass measurements to work with.

Step 2: Use slice_min() and slice_max() with bind_rows()

We’ll select the 3 lightest and 3 heaviest penguins by combining two slice functions.

lightest <- penguins |>
  slice_min(body_mass_g, n = 3, na.rm = TRUE)

heaviest <- penguins |>
  slice_max(body_mass_g, n = 3, na.rm = TRUE)

extremes <- bind_rows(lightest, heaviest)

This approach creates two separate dataframes and combines them, giving us 6 penguins representing both weight extremes.

Step 3: View the results

Let’s examine our combined results to see the weight distribution.

extremes |>
  select(species, island, body_mass_g) |>
  arrange(body_mass_g)

The output shows penguins ordered from lightest to heaviest, clearly displaying the weight range extremes in our dataset.

Example 2: Practical Application

The Problem

A car dealership wants to create a marketing report showing both the most fuel-efficient and least fuel-efficient vehicles in their inventory. They need to identify the top 5 cars in each category based on miles per gallon (mpg) to highlight their range of offerings to different customer segments.

Step 1: Prepare the dataset with car names

We’ll add car names to make our results more meaningful for the marketing report.

cars_data <- mtcars |>
  rownames_to_column("car_model") |>
  select(car_model, mpg, cyl, hp, wt)

head(cars_data)

This creates a cleaner dataset with car names as a column, making our results easier to interpret.

Step 2: Create a custom function for top and bottom selection

We’ll build a reusable function that combines the selection process into one operation.

select_extremes <- function(data, column, n = 5) {
  top_values <- data |>
    slice_max({{ column }}, n = n, with_ties = FALSE)
  
  bottom_values <- data |>
    slice_min({{ column }}, n = n, with_ties = FALSE)
  
  bind_rows(bottom_values, top_values)
}

This function encapsulates our logic and can be reused for any column and number of extreme values.

Step 3: Apply the function and create the report

Now we’ll use our function to generate the fuel efficiency report.

fuel_extremes <- cars_data |>
  select_extremes(mpg, n = 5)

fuel_extremes |>
  mutate(category = ifelse(mpg < 20, "Low MPG", "High MPG")) |>
  select(car_model, mpg, category, hp, wt) |>
  arrange(mpg)

The results show 10 vehicles split between fuel-efficient and gas-guzzling categories, perfect for targeted marketing campaigns.

Step 4: Add percentage-based selection

For more flexible analysis, we can select extremes based on percentages rather than fixed numbers.

cars_data |>
  filter(
    mpg <= quantile(mpg, 0.2) | 
    mpg >= quantile(mpg, 0.8)
  ) |>
  arrange(mpg)

This approach selects the bottom 20% and top 20% of vehicles by fuel efficiency, automatically adjusting to dataset size.

Summary

  • Use slice_min() and slice_max() combined with bind_rows() to select extremes simultaneously
  • Create reusable functions with { } syntax to make extreme selection more efficient across different analyses
  • Consider percentage-based selection with quantile() for more flexible extreme identification that scales with data size
  • Always use na.rm = TRUE or with_ties = FALSE parameters to handle missing values and ensure consistent result counts
  • This technique works excellently for outlier analysis, comparative reporting, and creating datasets that showcase data range extremes