Linear regression basics in R

statistics
linear regression basics
Published

February 21, 2026

1. Introduction

Linear regression is a fundamental statistical method that models the relationship between a continuous outcome variable (dependent variable) and one or more predictor variables (independent variables). It assumes this relationship can be represented by a straight line, making it one of the most interpretable machine learning techniques.

You would use linear regression when you want to:

  • Predict a continuous outcome based on other variables
  • Understand how much each predictor influences the outcome
  • Test hypotheses about relationships between variables
  • Create a baseline model for comparison with more complex methods

Linear regression requires several key assumptions:

  • Linearity: The relationship between predictors and outcome is linear
  • Independence: Observations are independent of each other
  • Homoscedasticity: Residuals have constant variance across all fitted values
  • Normality: Residuals are normally distributed
  • No multicollinearity: Predictor variables aren’t highly correlated with each other

When these assumptions are met, linear regression provides reliable, interpretable results that form the foundation for more advanced statistical modeling techniques.

2. The Math

The basic linear regression equation is:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Where:

  • Y = the outcome variable we’re trying to predict
  • β₀ = the intercept (the value of Y when all X variables equal zero)
  • β₁, β₂, …, βₚ = coefficients showing how much Y changes for a one-unit increase in each X
  • X₁, X₂, …, Xₚ = predictor variables
  • ε = the error term (the residuals: the differences between actual and predicted values)

For simple linear regression (one predictor), this simplifies to:

Y = β₀ + β₁X + ε

The goal is to find the best-fitting line by minimizing the sum of squared residuals (ordinary least squares). R calculates these coefficients automatically, but understanding what they represent is crucial for interpretation.
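For simple regression, ordinary least squares has a closed-form solution: the slope is cov(x, y) / var(x) and the intercept is ȳ − β₁x̄. A minimal sketch on made-up toy data, checking the hand computation against lm():

```r
# Closed-form OLS for one predictor, verified against lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1 <- cov(x, y) / var(x)      # slope = cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)  # intercept = ybar - slope * xbar

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE
```

The same idea generalizes to multiple predictors via matrix algebra, which is what lm() does internally.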

3. R Implementation

# Load required packages
library(tidyverse)
library(palmerpenguins)
library(broom)

# Load the data
data(penguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

Basic linear regression in R uses the lm() function:

# Simple linear regression: body mass predicted by flipper length
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(model1)
Call:
lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5780.83     305.51  -18.93   <2e-16 ***
flipper_length_mm    49.69       1.52   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16
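Once fitted, the model is just the regression equation evaluated at a new predictor value. A quick sketch predicting body mass for a hypothetical penguin with a 200 mm flipper:

```r
library(palmerpenguins)

# Refit the simple model, then predict at flipper_length_mm = 200
model1 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
predict(model1, newdata = data.frame(flipper_length_mm = 200))
# equivalent by hand: -5780.83 + 49.69 * 200, roughly 4157 g
```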

4. Full Worked Example

Let’s predict penguin body mass using flipper length and bill length:

# Remove missing values for clean analysis
penguins_clean <- penguins %>% 
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

# Multiple linear regression
model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, 
             data = penguins_clean)

# Get detailed results
summary(model2)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm, 
    data = penguins_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-1003.49  -242.37   -23.98   222.05  1287.42 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6787.89     389.00  -17.46   <2e-16 ***
flipper_length_mm    45.99       1.73   26.62   <2e-16 ***
bill_length_mm       17.33       4.49    3.86  0.000134 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 376.7 on 339 degrees of freedom
Multiple R-squared:  0.7739,    Adjusted R-squared:  0.7725 
F-statistic: 580.4 on 2 and 339 DF,  p-value: < 2.2e-16

Interpreting the results:

  • Intercept (-6787.89): The predicted body mass when both flipper length and bill length are zero (not meaningful in this context)
  • Flipper length coefficient (45.99): For each 1mm increase in flipper length, body mass increases by about 46 grams, holding bill length constant
  • Bill length coefficient (17.33): For each 1mm increase in bill length, body mass increases by about 17 grams, holding flipper length constant
  • R-squared (0.7739): About 77% of the variation in body mass is explained by these two predictors
  • p-values: All coefficients are highly significant (p < 0.001)
# Get confidence intervals for coefficients
confint(model2)
                      2.5 %     97.5 %
(Intercept)       -7554.06 -6021.7253
flipper_length_mm    42.60    49.3803
bill_length_mm        8.52    26.1425
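The broom package loaded at the start turns these summaries into tidy data frames, which is convenient for reporting or plotting coefficients. A sketch using model2 from above:

```r
library(broom)

# Coefficients as a data frame: term, estimate, std.error, statistic, p.value
# (conf.int = TRUE adds the confidence-interval columns)
tidy(model2, conf.int = TRUE)

# One-row model summary: r.squared, adj.r.squared, sigma, ...
glance(model2)
```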

5. Visualization

# Load packages for visualization
library(tidyverse)
library(palmerpenguins)

# Prepare clean data
penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

# Scatter plot with regression line
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(
    title = "Penguin Body Mass vs Flipper Length",
    subtitle = "Linear regression line with 95% confidence interval",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

This plot shows the strong positive linear relationship between flipper length and body mass. The gray shaded area represents the 95% confidence interval around the regression line. Different colored points show the three penguin species, revealing that the relationship holds across species, though there are some species-specific patterns.
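To look at those species-specific patterns directly, you can fit a separate line per species by mapping species to colour in the top-level aesthetics, so geom_smooth inherits the grouping. A sketch building on the same cleaned data:

```r
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm))

# One regression line per species instead of a single pooled line
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g,
                           color = species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Species-specific regression lines",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal()
```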

6. Assumptions & Limitations

When NOT to use linear regression:

  • Non-linear relationships: If the relationship curves significantly, consider polynomial regression or other non-linear methods
  • Non-constant variance: If residuals show patterns (fan shapes, curves), you may need data transformation or robust regression
  • Highly correlated predictors: Multicollinearity makes coefficients unstable and hard to interpret
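A common numeric check for multicollinearity is the variance inflation factor (VIF). One way to compute it is vif() from the car package (an assumption here; car is not used elsewhere in this post and would need to be installed):

```r
library(car)              # assumed installed; provides vif()
library(tidyverse)
library(palmerpenguins)

penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm,
             data = penguins_clean)

# Rule of thumb: VIF values above roughly 5-10 suggest problematic collinearity
vif(model2)
```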

Checking assumptions:

# Fit model for diagnostics
penguins_clean <- penguins |>
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm), !is.na(bill_length_mm))

model2 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins_clean)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model2)

These plots help identify:

  1. Residuals vs Fitted: should show random scatter (linearity, homoscedasticity)
  2. Normal Q-Q: points should follow the diagonal line (normality)
  3. Scale-Location: should show random scatter (homoscedasticity)
  4. Residuals vs Leverage: identifies influential outliers
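If you prefer ggplot2 over base graphics, broom::augment() attaches fitted values and residuals to the data, so the first diagnostic plot can be rebuilt by hand. A sketch using model2 fitted above:

```r
library(broom)
library(ggplot2)

# Residuals vs fitted values, with a dashed reference line at zero
aug <- augment(model2)
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()
```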

7. Common Mistakes

1. Assuming causation from correlation: Linear regression shows association, not causation. Just because flipper length predicts body mass doesn’t mean longer flippers cause higher mass.

2. Extrapolating beyond data range: Don’t use the model to predict outcomes for predictor values outside your observed range. The linear relationship may not hold.
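A simple guard against extrapolation is to compare new predictor values to the observed range before trusting a prediction. A sketch using model2 and the cleaned data from the diagnostics section (the 300 mm flipper is a deliberately out-of-range made-up value):

```r
# Observed flipper lengths in the data (roughly 172 to 231 mm)
range(penguins_clean$flipper_length_mm)

# predict() will happily return a number for a 300 mm flipper,
# but the linear relationship is unverified that far outside the data
predict(model2, newdata = data.frame(flipper_length_mm = 300,
                                     bill_length_mm = 45))
```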

3. Ignoring assumption violations: Always check diagnostic plots. Violating assumptions can lead to biased estimates, incorrect confidence intervals, and poor predictions.