How to use data.frame in R
Introduction
A data.frame is R’s fundamental data structure for storing rectangular data with rows and columns, similar to a spreadsheet or database table. It’s the most commonly used object for data analysis because it can hold different data types (numeric, character, logical) in different columns while maintaining the same length for each column.
Getting Started
library(tidyverse)
Example 1: Basic Usage
The Problem
We need to create and manipulate a simple dataset to understand how data.frames store and organize information. Let’s start by building a data.frame from scratch and exploring its basic properties.
Step 1: Create a basic data.frame
We’ll construct a data.frame using vectors of equal length.
# Create a simple data.frame
students <- data.frame(
  name = c("Alice", "Bob", "Carol", "David"),
  age = c(20, 22, 19, 21),
  grade = c("A", "B", "A", "C"),
  passed = c(TRUE, TRUE, TRUE, FALSE)
)
This creates a data.frame with 4 rows and 4 columns of different data types.
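To check that each column really keeps its own type, a minimal sketch can inspect the column classes with sapply(). The block re-creates the same students data.frame so it runs on its own. The names col_classes and bad are illustrative, and stringsAsFactors = FALSE is set explicitly on the assumption that some readers run R older than 4.0.0, where strings would otherwise become factors.

```r
# Re-create the example data so this block is self-contained
students <- data.frame(
  name = c("Alice", "Bob", "Carol", "David"),
  age = c(20, 22, 19, 21),
  grade = c("A", "B", "A", "C"),
  passed = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE  # explicit for R < 4.0.0; the default since 4.0.0
)

# One class per column: character, numeric, character, logical
col_classes <- sapply(students, class)
print(col_classes)

# Columns whose lengths cannot be recycled to a common length raise an error
bad <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
print(bad)
```

The second call fails because a vector of length 2 cannot be recycled evenly to length 3, which is how data.frame() enforces the equal-length rule.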
Step 2: Examine the structure
Understanding your data.frame’s structure is essential for further analysis.
# Explore the data.frame structure
str(students)
head(students)
dim(students)
The str() function shows data types, head() displays the first few rows, and dim() returns the dimensions.
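A few other base R helpers cover the same ground and may be handy alongside str(), head(), and dim(). The sketch below re-creates students so it runs standalone; the variable names n_rows, n_cols, and col_names are illustrative.

```r
# Re-create the example data so this block is self-contained
students <- data.frame(
  name = c("Alice", "Bob", "Carol", "David"),
  age = c(20, 22, 19, 21),
  grade = c("A", "B", "A", "C"),
  passed = c(TRUE, TRUE, TRUE, FALSE)
)

n_rows <- nrow(students)      # number of rows
n_cols <- ncol(students)      # number of columns
col_names <- names(students)  # column names as a character vector

print(c(rows = n_rows, cols = n_cols))
print(col_names)

# summary() gives per-column summaries (quartiles for numeric columns)
print(summary(students$age))
```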
Step 3: Access specific elements
Data.frames offer multiple ways to extract data using indexing and column names.
# Access columns and rows
students$name # Access name column
students[1, ] # Access first row
students[, "age"] # Access age column
students[1:2, c("name", "grade")] # Multiple rows/columns
These indexing methods allow precise data extraction using row numbers, column names, or combinations.
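One detail worth knowing about the indexing forms above is that selecting a single column with [ , "col"] returns a plain vector, not a data.frame. The sketch below, with illustrative variable names, contrasts the vector and data.frame results.

```r
# Re-create the example data so this block is self-contained
students <- data.frame(
  name = c("Alice", "Bob", "Carol", "David"),
  age = c(20, 22, 19, 21),
  grade = c("A", "B", "A", "C"),
  passed = c(TRUE, TRUE, TRUE, FALSE)
)

age_vec  <- students[, "age"]                # numeric vector
age_vec2 <- students[["age"]]                # same vector via [[ ]]
age_df   <- students[, "age", drop = FALSE]  # one-column data.frame
age_df2  <- students["age"]                  # also a one-column data.frame

print(class(age_vec))
print(class(age_df))
```

Using drop = FALSE (or single-bracket list-style indexing) is a reasonable choice when later code expects a data.frame regardless of how many columns were selected.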
Example 2: Practical Application
The Problem
Let’s work with the built-in mtcars dataset to perform realistic data analysis tasks. We need to filter data, create new variables, and summarize information to understand car performance characteristics across different categories.
Step 1: Load and examine real data
We’ll start by exploring the mtcars dataset structure and contents.
# Load and examine mtcars dataset
data(mtcars)
head(mtcars)
str(mtcars)
rownames(mtcars)[1:5]
The mtcars dataset contains 32 car models with 11 performance variables, with car names stored as row names.
Step 2: Filter and subset data
Let’s extract cars that meet specific performance criteria.
# Filter high-performance cars
fast_cars <- mtcars[mtcars$hp > 150 & mtcars$mpg > 15, ]
nrow(fast_cars)
# Select specific columns
efficiency <- mtcars[, c("mpg", "hp", "wt", "qsec")]
We filtered for cars with more than 150 horsepower and more than 15 mpg, then created a subset focusing on efficiency metrics.
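The same filter can also be written with base R's subset(), which lets you refer to columns without the mtcars$ prefix. This is a sketch of an alternative, not a replacement for the bracket form; fast_cars_subset is an illustrative name.

```r
# Bracket form, as in the step above
fast_cars <- mtcars[mtcars$hp > 150 & mtcars$mpg > 15, ]

# Equivalent filter with subset(); column names are looked up inside mtcars
fast_cars_subset <- subset(mtcars, hp > 150 & mpg > 15)

# Both approaches select the same cars
print(rownames(fast_cars_subset))
print(identical(rownames(fast_cars), rownames(fast_cars_subset)))
```

One behavioural difference to keep in mind: if the condition evaluates to NA for some row, bracket indexing returns an all-NA row, while subset() drops it. mtcars has no missing values, so the two agree here.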
Step 3: Create new variables
Adding calculated columns enhances our analysis capabilities.
# Create new variables
mtcars$hp_per_weight <- mtcars$hp / mtcars$wt
mtcars$efficiency_class <- ifelse(mtcars$mpg > 20, "High", "Low")
# View the additions
head(mtcars[, c("hp", "wt", "hp_per_weight", "efficiency_class")])
We calculated horsepower-to-weight ratio and classified cars by fuel efficiency for deeper analysis.
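If you would rather leave your working copy of mtcars unchanged, one option is to add the columns to a separate copy with transform(). The sketch below assumes a hypothetical copy named cars and computes the same two columns as above.

```r
# Work on a copy so the original dataset keeps its 11 columns
cars <- mtcars

# transform() evaluates the expressions using the columns of cars
cars <- transform(
  cars,
  hp_per_weight = hp / wt,
  efficiency_class = ifelse(mpg > 20, "High", "Low")
)

head(cars[, c("hp", "wt", "hp_per_weight", "efficiency_class")])
```

Keeping derived columns in a copy makes it easier to rerun earlier steps against the unmodified data.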
Step 4: Group analysis using modern syntax
Using the native pipe (|>, available in R 4.1 and later) with dplyr's group_by() and summarise() makes data manipulation more readable and intuitive.
# Analyze by cylinder groups using modern R
mtcars |>
  group_by(cyl) |>
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    count = n()
  )
This pipeline groups cars by cylinder count and calculates average performance metrics for each group.
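For readers without dplyr installed, roughly the same summary can be sketched in base R with aggregate(). This is shown only for comparison with the pipeline above; the names by_cyl and counts are illustrative.

```r
# Mean mpg and hp per cylinder group, using the formula interface
by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
names(by_cyl) <- c("cyl", "avg_mpg", "avg_hp")

# Count of cars per group, matched to by_cyl by cylinder value
counts <- table(mtcars$cyl)
by_cyl$count <- as.vector(counts[as.character(by_cyl$cyl)])

print(by_cyl)
```

The result is an ordinary data.frame with one row per cylinder count, whereas the dplyr pipeline returns a tibble.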
Step 5: Export and save results
Saving your processed data.frame preserves analysis results for future use.
# Save processed data
write.csv(mtcars, "processed_mtcars.csv", row.names = TRUE)
# Or save as R object
saveRDS(mtcars, "mtcars_analysis.rds")
These functions export your data.frame to CSV format or save as an R object for later loading.
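To load the saved results back, read.csv() and readRDS() are the natural counterparts. The sketch below writes to temporary files, which is an assumption made so that it leaves nothing behind in your working directory; csv_path, rds_path, from_csv, and from_rds are illustrative names.

```r
# Temporary file paths, used here only so the example cleans up after itself
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(mtcars, csv_path, row.names = TRUE)
saveRDS(mtcars, rds_path)

# row.names = 1 restores the car names from the first CSV column
from_csv <- read.csv(csv_path, row.names = 1)
from_rds <- readRDS(rds_path)

# RDS preserves the object exactly; CSV re-infers column types on import
print(identical(from_rds, mtcars))
print(class(mtcars$hp))
print(class(from_csv$hp))

unlink(c(csv_path, rds_path))
```

Because CSV is plain text, column types are guessed again when the file is read, so whole-number columns may come back as integer rather than numeric. RDS is the safer choice when the exact R object matters.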
Summary
- Data.frames are R’s primary structure for rectangular data, combining different data types in columns
- Create data.frames using the data.frame() function or load existing datasets with data()
- Access data using $ notation, bracket indexing, or column names for flexible data extraction
- Filter and subset using logical conditions and bracket notation for targeted analysis
- Use the native pipe operator |> with dplyr functions for readable data manipulation workflows