colSums in R - compute sum of all columns in a dataframe or matrix
In this tutorial, we will learn about the colSums() function in base R and use it to calculate the sum of every column in a matrix or a dataframe. We will work through two examples to understand how colSums() is used. First, we will calculate the column sums of a matrix and a dataframe with no missing values (NAs). Next, we will learn how to compute the column sums when the matrix or dataframe has missing values.
Create a matrix and dataframe from scratch
Let us create a matrix and a dataframe from scratch using random numbers generated with the sample() function. First, we create a vector of numbers.
set.seed(42)
data <- sample(c(1:6), 50, replace = TRUE)
data
## [1] 1 5 1 1 2 4 2 2 1 4 1 5 6 4 2 2 3 1 1 3 4 5 5 5 4 2 4 3 2 1 2 6 3 6 2 4 4 6
## [39] 2 5 4 5 4 2 2 3 1 5 2 2
And then we use the matrix() function to create a matrix.
data_mat <- matrix(data, ncol=5)
Finally, we use the as.data.frame() function to create a dataframe.
data_df <- as.data.frame(data_mat)
Sum of columns of a matrix
Let us compute the sum of all the columns using colSums() on the matrix. Our data matrix is complete with no missing data.
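To make clear what colSums() computes, here is a minimal sketch on a tiny matrix whose sums are easy to check by hand. The names `m`, `col_totals`, and `apply_totals` are made up for this illustration and are not part of the tutorial's data. It also compares the result with apply() over columns, which should give the same totals.

```r
# A small 3 x 2 matrix, filled column by column (R's default)
# Column 1 holds 1, 2, 3 and column 2 holds 4, 5, 6
m <- matrix(1:6, ncol = 2)

# colSums() adds up the values in each column
col_totals <- colSums(m)

# The same totals computed with apply() over columns (MARGIN = 2)
apply_totals <- apply(m, 2, sum)

print(col_totals)
print(all.equal(col_totals, apply_totals))
```

Here the first column sums to 6 and the second to 15, and both approaches agree.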
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    4    2    4
## [2,]    5    5    5    6    5
## [3,]    1    6    5    3    4
## [4,]    1    4    5    6    2
## [5,]    2    2    4    2    2
## [6,]    4    2    2    4    3
Applying colSums() on the matrix, we get the sum of each column as a vector.
colSums(data_mat)
## [1] 23 28 35 40 30
Sum of columns of a dataframe
We can also use the colSums() function to calculate the sum of all columns in a dataframe. The dataframe must not contain non-numeric columns, such as character or factor columns; otherwise colSums() raises an error.
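If a dataframe does contain a non-numeric column, one possible workaround is to keep only the numeric columns before summing. The sketch below assumes a small hypothetical dataframe `df` with a character column `id`; none of these names come from the tutorial's data.

```r
# Hypothetical dataframe with one character column and two numeric columns
df <- data.frame(
  id = c("a", "b", "c"),
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

# colSums() on the full dataframe fails because of the character column;
# tryCatch() captures the error so the script keeps running
attempt <- tryCatch(colSums(df), error = function(e) e)
print(inherits(attempt, "error"))

# Keep only the numeric columns, then sum them
numeric_cols <- sapply(df, is.numeric)
totals <- colSums(df[, numeric_cols, drop = FALSE])
print(totals)
```

Using `drop = FALSE` keeps the result a dataframe even when only one numeric column remains, so colSums() still receives a two-dimensional object.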
head(data_df)
##   V1 V2 V3 V4 V5
## 1  1  1  4  2  4
## 2  5  5  5  6  5
## 3  1  6  5  3  4
## 4  1  4  5  6  2
## 5  2  2  4  2  2
## 6  4  2  2  4  3
In our sample dataframe, all the columns are numeric, so we get the sum of every column in the dataframe.
colSums(data_df)
## V1 V2 V3 V4 V5
## 23 28 35 40 30
How to calculate sum of columns of a matrix with missing data (NAs)
First, let us create a matrix and a dataframe with missing values.
data <- sample(c(1:5, NA), 50, replace = TRUE)
data_mat <- matrix(data, ncol=5)
data_df <- as.data.frame(data_mat)
In this example, the data matrix has missing values (NAs) in every column except the second. In the first six rows shown below, the NAs appear in the first and fourth columns.
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA    2    4   NA    4
## [2,]   NA    5    1    2    2
## [3,]    2    1    3    2    2
## [4,]    4    1    3    1    3
## [5,]    3    4    5    2    5
## [6,]   NA    5    5    5    5
So when we apply colSums() on the data matrix, it computes the sum only for the columns that have no missing values. For columns containing missing values, we get NA. This is because the colSums() function has the argument na.rm=FALSE by default.
colSums(data_mat)
## [1] NA 30 NA NA NA
With the na.rm=TRUE argument, colSums() calculates the sum after ignoring the missing values.
colSums(data_mat, na.rm=TRUE)
## [1] 18 30 34 22 28
How to calculate sum of columns of a dataframe with missing data (NAs)
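Before summing a dataframe with missing values, it can help to know how many values are missing in each column. One common way is to apply colSums() to the logical matrix returned by is.na(). The dataframe `df_na` below is a made-up example, not the tutorial's data.

```r
# Hypothetical dataframe with a few missing values
df_na <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 6),
  c = c(7, 8, 9)
)

# is.na() returns TRUE/FALSE for every cell; colSums() counts the TRUEs
na_counts <- colSums(is.na(df_na))
print(na_counts)
```

In this sketch, column `a` has one missing value, column `b` has two, and column `c` has none.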
head(data_df)
##   V1 V2 V3 V4 V5
## 1 NA  2  4 NA  4
## 2 NA  5  1  2  2
## 3  2  1  3  2  2
## 4  4  1  3  1  3
## 5  3  4  5  2  5
## 6 NA  5  5  5  5
When there are missing values, colSums() returns NA for dataframe columns as well by default.
colSums(data_df)
## V1 V2 V3 V4 V5
## NA 30 NA NA NA
We can use the na.rm=TRUE argument to compute the sum of all columns with missing values. The sums then ignore the missing values in the dataframe columns.
colSums(data_df, na.rm=TRUE)
## V1 V2 V3 V4 V5
## 18 30 34 22 28
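One detail worth keeping in mind with na.rm=TRUE is what happens when a column consists entirely of NAs. The following sketch uses a small made-up matrix `m_na` to compare the default behaviour with na.rm=TRUE; the names are illustrative only.

```r
# Hypothetical 3 x 2 matrix: column 1 has one NA, column 2 is all NA
m_na <- matrix(c(1, NA, 3,
                 NA, NA, NA), ncol = 2)

# Default (na.rm = FALSE): any NA in a column makes that column's sum NA
default_sums <- colSums(m_na)
print(default_sums)

# na.rm = TRUE: NAs are dropped before summing
clean_sums <- colSums(m_na, na.rm = TRUE)
print(clean_sums)
```

With na.rm=TRUE the first column sums to 4, while the all-NA column returns 0 rather than NA, since nothing is left to add. A zero in the result can therefore mean either a true zero sum or a column with no observed values.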