Linear regression basics in R

statistics
linear regression basics
Published

February 21, 2026

1. Introduction

Linear regression is a fundamental statistical method that models the relationship between a continuous outcome variable (dependent variable) and one or more predictor variables (independent variables). It assumes this relationship can be represented by a straight line, making it one of the most interpretable machine learning techniques.

You would use linear regression when you want to:

  • Predict a continuous outcome based on other variables
  • Understand how much each predictor influences the outcome
  • Test hypotheses about relationships between variables
  • Create a baseline model for comparison with more complex methods

Linear regression requires several key assumptions:

  • Linearity: The relationship between predictors and outcome is linear
  • Independence: Observations are independent of each other
  • Homoscedasticity: Residuals have constant variance across all fitted values
  • Normality: Residuals are normally distributed
  • No multicollinearity: Predictor variables aren’t highly correlated with each other

When these assumptions are met, linear regression provides reliable, interpretable results that form the foundation for more advanced statistical modeling techniques.

2. The Math

The basic linear regression equation is:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Where:

  • Y = the outcome variable we’re trying to predict
  • β₀ = the intercept (the value of Y when all X variables equal zero)
  • β₁, β₂, …, βₚ = coefficients showing how much Y changes for a one-unit increase in each X
  • X₁, X₂, …, Xₚ = predictor variables
  • ε = the error term (the residuals: the differences between actual and predicted values)

For simple linear regression (one predictor), this simplifies to:

Y = β₀ + β₁X + ε

The goal is to find the best-fitting line by minimizing the sum of squared residuals (ordinary least squares). R calculates these coefficients automatically, but understanding what they represent is crucial for interpretation.
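For simple regression, ordinary least squares has a closed-form solution: the slope is cov(x, y) / var(x) and the intercept is ȳ − β₁x̄. A minimal sketch on made-up toy data, checking the hand computation against lm():

```r
# Closed-form OLS for one predictor, verified against lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1 <- cov(x, y) / var(x)      # slope = cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)  # intercept = ybar - slope * xbar

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE
```

The same idea generalizes to multiple predictors via matrix algebra, which is what lm() does internally.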

3. R Implementation

# Load required packages
library(tidyverse)
library(palmerpenguins)
library(broom)

# Load the data
data(penguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

Basic linear regression in R uses the lm() function:

# Simple linear regression: body mass predicted by flipper length
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(model1)
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5780.83     305.51  -18.93   <2e-16 ***
flipper_length_mm    49.69       1.52   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16
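Once fitted, the model is just the regression equation evaluated at a new predictor value. A quick sketch predicting body mass for a hypothetical penguin with a 200 mm flipper:

```r
library(palmerpenguins)

# Refit the simple model, then predict at flipper_length_mm = 200
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
predict(model1, newdata = data.frame(flipper_length_mm = 200))
# equivalent by hand: -5780.83 + 49.69 * 200, roughly 4157 g
```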

4. Full Worked Example

Let’s predict penguin body mass using flipper length and bill length:

# Remove missing values for clean analysis
penguins_clean <- penguins %>% 
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

# Multiple linear regression
model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, 
             data = penguins_clean)

# Get detailed results
summary(model2)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm, 
    data = penguins_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-1003.49  -242.37   -23.98   222.05  1287.42 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6787.89     389.00  -17.46   <2e-16 ***
flipper_length_mm    45.99       1.73   26.62   <2e-16 ***
bill_length_mm       17.33       4.49    3.86  0.000134 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 376.7 on 339 degrees of freedom
Multiple R-squared:  0.7739,    Adjusted R-squared:  0.7725 
F-statistic: 580.4 on 2 and 339 DF,  p-value: < 2.2e-16

Interpreting the results:

  • Intercept (-6787.89): The predicted body mass when both flipper length and bill length are zero (not meaningful in this context)
  • Flipper length coefficient (45.99): For each 1mm increase in flipper length, body mass increases by about 46 grams, holding bill length constant
  • Bill length coefficient (17.33): For each 1mm increase in bill length, body mass increases by about 17 grams, holding flipper length constant
  • R-squared (0.7739): About 77% of the variation in body mass is explained by these two predictors
  • p-values: All coefficients are highly significant (p < 0.001)
# Get confidence intervals for coefficients
confint(model2)
                      2.5 %     97.5 %
(Intercept)       -7554.06 -6021.7253
flipper_length_mm    42.60    49.3803
bill_length_mm        8.52    26.1425
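The broom package loaded at the start turns these summaries into tidy data frames, which is convenient for reporting or plotting coefficients. A sketch using model2 from above:

```r
library(broom)

# Coefficients as a data frame: term, estimate, std.error, statistic, p.value
# (conf.int = TRUE adds the confidence-interval columns)
tidy(model2, conf.int = TRUE)

# One-row model summary: r.squared, adj.r.squared, sigma, ...
glance(model2)
```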

5. Visualization

# Load packages for visualization
library(tidyverse)
library(palmerpenguins)

# Prepare clean data
penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

# Scatter plot with regression line
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(
    title = "Penguin Body Mass vs Flipper Length",
    subtitle = "Linear regression line with 95% confidence interval",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

This plot shows the strong positive linear relationship between flipper length and body mass. The gray shaded area represents the 95% confidence interval around the regression line. Different colored points show the three penguin species, revealing that the relationship holds across species, though there are some species-specific patterns.
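To look at those species-specific patterns directly, you can fit a separate line per species by mapping species to colour in the top-level aesthetics, so geom_smooth inherits the grouping. A sketch building on the same cleaned data:

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

# One regression line per species instead of a single pooled line
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g,
                           color = species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Species-specific regression lines",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal()
```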

6. Assumptions & Limitations

When NOT to use linear regression:

  • Non-linear relationships: If the relationship curves significantly, consider polynomial regression or other non-linear methods
  • Non-constant variance: If residuals show patterns (fan shapes, curves), you may need data transformation or robust regression
  • Highly correlated predictors: Multicollinearity makes coefficients unstable and hard to interpret
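A common numeric check for multicollinearity is the variance inflation factor (VIF). One way to compute it is vif() from the car package (an assumption here; car is not used elsewhere in this post and would need to be installed):

```r
library(car)              # assumed installed; provides vif()
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm,
             data = penguins_clean)

# Rule of thumb: VIF values above roughly 5-10 suggest problematic collinearity
vif(model2)
```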

Checking assumptions:

# Fit model for diagnostics
penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins_clean)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model2)

These plots help identify:

  1. Residuals vs Fitted: should show random scatter (linearity, homoscedasticity)
  2. Normal Q-Q: points should follow the diagonal line (normality)
  3. Scale-Location: should show random scatter (homoscedasticity)
  4. Residuals vs Leverage: identifies influential outliers
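If you prefer ggplot2 over base graphics, broom::augment() attaches fitted values and residuals to the data, so the first diagnostic plot can be rebuilt by hand. A sketch using model2 fitted above:

```r
library(broom)
library(ggplot2)

# Residuals vs fitted values, with a dashed reference line at zero
aug <- augment(model2)
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()
```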

7. Common Mistakes

1. Assuming causation from correlation: Linear regression shows association, not causation. Just because flipper length predicts body mass doesn’t mean longer flippers cause higher mass.

2. Extrapolating beyond data range: Don’t use the model to predict outcomes for predictor values outside your observed range. The linear relationship may not hold.
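A simple guard against extrapolation is to compare new predictor values to the observed range before trusting a prediction. A sketch using model2 and the cleaned data from the diagnostics section (the 300 mm flipper is a deliberately out-of-range made-up value):

```r
# Observed flipper lengths in the data (roughly 172 to 231 mm)
range(penguins_clean$flipper_length_mm)

# predict() will happily return a number for a 300 mm flipper,
# but the linear relationship is unverified that far outside the data
predict(model2, newdata = data.frame(flipper_length_mm = 300,
                                     bill_length_mm = 45))
```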

3. Ignoring assumption violations: Always check diagnostic plots. Violating assumptions can lead to biased estimates, incorrect confidence intervals, and poor predictions.