# Descriptive statistics in R, part III

In my last couple of posts, I discussed some of the ways R can summarize data. I started this discussion by demonstrating how to calculate frequencies and create data tables, and then covered some common functions in base R and other packages that provide more detailed descriptive statistics. In today’s post, I will cover four more ways to calculate descriptive statistics and give some of my thoughts on these methods.

In my most recent post, I said I would use the `gapminder` dataset as a common example for both parts II and III. Turns out, I lied. I am actually going to create some fake data in this post. While this fake dataset may not be as interesting an example as the `gapminder` data, it does have the added benefit of containing both discrete and continuous data as well as missing values. Thus, my fake dataset will illustrate how the descriptive methods here handle missing data.

```
# Create some fake data
employee <- c("Mary", "John", "Peter", "Jolie") # create employee variable
salary <- c(21000, NA, 23400, 26800) # create salary variable
department <- c("1", "1", "2", "2") # create department variable
startdate <- as.Date(c("2015-11-1", "2018-1-25", "2007-3-14", "2011-08-10")) # create start date variable
company <- data.frame(employee, salary, department, startdate) # create dataframe containing variables
str(company)
## 'data.frame': 4 obs. of 4 variables:
## $ employee : Factor w/ 4 levels "John","Jolie",..: 3 1 4 2
## $ salary : num 21000 NA 23400 26800
## $ department: Factor w/ 2 levels "1","2": 1 1 2 2
## $ startdate : Date, format: "2015-11-01" "2018-01-25" ...
```

# ggpairs, from the GGally package

The `ggpairs` function is a unique method of summarizing data. It does not tell you the frequency, central tendency, variation, or any other single-number descriptive statistic. It also does not provide any information on missing data. At this point, you may be wondering how the hell this function even ended up in a blog post about descriptive statistics. Well, the approach `ggpairs` takes to summarizing data is to provide a pairs plot, which shows the interactions of each variable in a dataset with each of the other variables. For example:

```
library(GGally)
## Loading required package: ggplot2
ggpairs(company)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removing 1 row that contained a missing value
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_point).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

`ggpairs` is sensitive to whether the data are discrete or continuous, so it is capable of automatically finding an appropriate visualization for the data. But this can be tweaked if you want something other than what the defaults provide (see the documentation link in the section header for details).
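As a quick sketch of that tweaking, `ggpairs` takes `upper`, `lower`, and `diag` arguments that name the plot type used in each part of the matrix (the option names below come from the GGally documentation):

```
library(GGally)
# Swap the diagonal density plots for histograms ("barDiag") and add a
# smoother to the lower-triangle scatterplots ("smooth")
ggpairs(company,
        diag = list(continuous = "barDiag"),
        lower = list(continuous = "smooth"))
```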

The diagonal from the upper left to lower right of the matrix provides univariate information on each variable. Again, because `ggpairs` is sensitive to data type, it will automatically display bar plots for categorical variables and density plots for continuous data. This can be very helpful for spotting unusual distributions (e.g., a bimodal distribution) or outliers. However, it does seem to treat dates (written in ISO format, e.g., “2018-01-01”) as continuous data rather than as a separate date type.

To compare descriptive statistics by group, you can pass `ggplot2` aesthetics through `ggpairs`. This means you could compare groups by assigning them to different colors (or other aesthetics). In my opinion, this forces a lot of information into an already complex figure. Plus, if you have more than a few groups that you want to summarize by, you could end up with quite the color spectrum in your figure. Instead, I would find another way to summarize my data. Then, after I have an idea of what the data look like, I would use `ggpairs` as a way of visualizing the dataset (or a subset of variables). This function is great for getting an overview of the relationship between any two variables in a dataset quickly and efficiently.
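A minimal sketch of that aesthetic-passing approach with the fake data, mapping department to color via the standard `mapping` argument:

```
library(GGally)
# Map department to color so every panel in the matrix is split by group
ggpairs(company, mapping = ggplot2::aes(colour = department))
```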

# CreateTableOne, from the tableone package

With the `CreateTableOne` function, you need to specify the `data` argument before you can execute the command.

```
library(tableone)
CreateTableOne(data = company)
## Warning in CreateTableOne(data = company): Dropping variable(s) startdate due to unsupported class.
##
## Overall
## n 4
## employee (%)
## John 1 (25.0)
## Jolie 1 (25.0)
## Mary 1 (25.0)
## Peter 1 (25.0)
## salary (mean (sd)) 23733.33 (2914.33)
## department = 2 (%) 2 (50.0)
```

The output is minimal, and it clearly displays counts for each variable. Like `ggpairs`, `CreateTableOne` automatically differentiates between discrete and continuous variables. However, the output is very sparse for numeric data (i.e., salary), providing only a mean and SD. Despite this shortcoming, the categorical data are accompanied by the percentage breakdown, which can be very useful.

A warning message at the top of the output states that the “startdate” variable was dropped, so this function gives a clear indication of which (if any) variables are not displayed. While these messages are much appreciated, the output does not really provide any information on missing data.

However, you can use our old friend `summary` from part II to get an output that displays more information. To do this, simply wrap the `CreateTableOne` function in the `summary` function:

```
summary(CreateTableOne(data = company))
## Warning in CreateTableOne(data = company): Dropping variable(s) startdate due to unsupported class.
##
## ### Summary of continuous variables ###
##
## strata: Overall
## n miss p.miss mean sd median p25 p75 min max skew kurt
## salary 4 1 25 23733 2914 23400 22200 25100 21000 26800 0.5 NaN
##
## =======================================================================================
##
## ### Summary of categorical variables ###
##
## strata: Overall
## var n miss p.miss level freq percent cum.percent
## employee 4 0 0.0 John 1 25.0 25.0
## Jolie 1 25.0 50.0
## Mary 1 25.0 75.0
## Peter 1 25.0 100.0
##
## department 4 0 0.0 1 2 50.0 50.0
## 2 2 50.0 100.0
##
```

This tells us what data are missing as well as a variety of descriptive statistics. Additionally, if you specify the “strata” argument in `CreateTableOne`, you can summarize by group.

```
CreateTableOne(strata = "department", data = company)
## Warning in CreateTableOne(strata = "department", data = company): Dropping variable(s) startdate due to unsupported class.
## Stratified by department
## 1 2 p test
## n 2 2
## employee (%) 0.261
## John 1 (50.0) 0 ( 0.0)
## Jolie 0 ( 0.0) 1 ( 50.0)
## Mary 1 (50.0) 0 ( 0.0)
## Peter 0 ( 0.0) 1 ( 50.0)
## salary (mean (sd)) 21000.00 (NA) 25100.00 (2404.16) NA
## department = 2 (%) 0 ( 0.0) 2 (100.0) 0.317
```

Not only does the output display descriptive statistics by group, it also automatically calculates p-values from a hypothesis test examining whether the groups differ from one another. By default, these are either ANOVA (for continuous variables) or chi-squared (for categorical variables) analyses. You can also alter the default statistical tests (e.g., use exact tests) should you need to.
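For instance, the `print` method for the returned table object takes `exact` and `nonnormal` arguments; a sketch with the fake data:

```
library(tableone)
tab <- CreateTableOne(strata = "department", data = company)
# Request Fisher's exact test for employee, and summarize salary as
# median [IQR] with a nonparametric test instead of the ANOVA default
print(tab, exact = "employee", nonnormal = "salary")
```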

# desctable, from the desctable package

`desctable` provides much of the same functionality as `CreateTableOne`, but it is much more flexible. By default, its output is in tabular format, making it tidy enough to pass on to other functions in the `tidyverse`. It can handle all data types and automatically displays the appropriate descriptive statistic. For instance:

```
library(desctable)
##
## Attaching package: 'desctable'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test, IQR
desctable(company)
## N % Median IQR
## 1 employee 4 NA NA NA
## 2 employee: John 1 25 NA NA
## 3 employee: Jolie 1 25 NA NA
## 4 employee: Mary 1 25 NA NA
## 5 employee: Peter 1 25 NA NA
## 6 salary 3 NA 23400 2900
## 7 department 4 NA NA NA
## 8 department: 1 2 50 NA NA
## 9 department: 2 2 50 NA NA
## 10 startdate 4 NA 15968 NA
```

The output is unique (at least to me) in that some fields (e.g., department) are hierarchical. This could get confusing with a large dataset, especially if you are not expecting it. There is also no overall summary count of missing data, which can be a major limitation in some cases.

Since the output is tidy, you can easily use the `group_by` function from the `dplyr` package to summarize the data by group.

```
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
group_by(company, department) %>%
desctable()
## department: 1 (n=2) / N department: 1 (n=2) / %
## 1 employee 2 NA
## 2 employee: John 1 50
## 3 employee: Jolie 0 0
## 4 employee: Mary 1 50
## 5 employee: Peter 0 0
## 6 salary 1 NA
## 7 startdate 2 NA
## department: 1 (n=2) / Median department: 1 (n=2) / IQR
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 21000 0
## 7 17148 NA
## department: 2 (n=2) / N1 department: 2 (n=2) / %1
## 1 2 NA
## 2 0 0
## 3 1 50
## 4 0 0
## 5 1 50
## 6 2 NA
## 7 2 NA
## department: 2 (n=2) / Median1 department: 2 (n=2) / IQR1 tests / p
## 1 NA NA 1.0000000
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 25100 1700 0.6666667
## 7 14391 NA NaN
## tests / test
## 1 fisher.test
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 wilcox.test
## 7 wilcox.test
```

The output is not very intuitive because it is so spread out. This can make it a bit difficult to compare summary statistics across groups (especially if you have more groups than I am using in this example). However, the nice thing about the output is that it includes a statistical test comparing differences across the groups.

The `desctable` function is also extremely customizable. For example, if I wanted to display more descriptive statistics than the default output provides, I could do something like this:

```
desctable(company, stats = list("N" = length,
"Mean" = mean,
"SD" = sd,
"Median" = median,
"IQR" = IQR,
"Min" = min,
"Max" = max))
## N Mean SD Median IQR Min Max
## 1 employee 4 NA 1.2909944 NA NA NA NA
## 2 employee: John 1 NA NA NA NA NA NA
## 3 employee: Jolie 1 NA NA NA NA NA NA
## 4 employee: Mary 1 NA NA NA NA NA NA
## 5 employee: Peter 1 NA NA NA NA NA NA
## 6 salary 3 23733.33 2914.3323993 23400 2900 21000 26800
## 7 department 4 NA 0.5773503 NA NA NA NA
## 8 department: 1 2 NA 0.0000000 NA NA NA NA
## 9 department: 2 2 NA 0.0000000 NA NA NA NA
## 10 startdate 4 15769.50 1754.0455144 15968 NA 13586 17556
```

You can also customize the statistical tests that are conducted or add conditional formulas. This function is so customizable that you may be able to create a summary output that fits your needs for virtually any situation. See the documentation (link in section header) for more details.
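As a sketch of that test customization (treat the exact list keys as an assumption based on the `desctable` vignette: `.auto` keeps automatic test selection, and purrr-style formulas name a specific test per variable):

```
library(desctable)
library(dplyr)
# Keep automatic test selection for most variables, but force a
# Wilcoxon rank-sum test for salary
group_by(company, department) %>%
  desctable(tests = list(.auto = tests_auto,
                         salary = ~wilcox.test))
```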

# stat.desc, from the pastecs package

```
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
stat.desc(company)
## employee salary department startdate
## nbr.val NA 3.000000e+00 NA 4.000000e+00
## nbr.null NA 0.000000e+00 NA 0.000000e+00
## nbr.na NA 1.000000e+00 NA 0.000000e+00
## min NA 2.100000e+04 NA 1.358600e+04
## max NA 2.680000e+04 NA 1.755600e+04
## range NA 5.800000e+03 NA 3.970000e+03
## sum NA 7.120000e+04 NA 6.307800e+04
## median NA 2.340000e+04 NA 1.596800e+04
## mean NA 2.373333e+04 NA 1.576950e+04
## SE.mean NA 1.682591e+03 NA 8.770228e+02
## CI.mean NA 7.239603e+03 NA 2.791078e+03
## var NA 8.493333e+06 NA 3.076676e+06
## std.dev NA 2.914332e+03 NA 1.754046e+03
## coef.var NA 1.227949e-01 NA 1.112303e-01
```

Much like the `ds_summary_stats` function, a potential limitation of the `stat.desc` function is that it cannot handle categorical data. As the output shows, categorical data are treated as NA by this function. However, it does display all the major descriptive statistics and missing-data information for numeric data.

One of the nice features of `stat.desc` is that it provides measures of uncertainty (e.g., confidence intervals, standard errors). This is great information to know about the data, and something not often provided by other summary functions. Moreover, with other arguments, you can apply tests of normality to the data.
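For example, `stat.desc` has a `norm` argument (FALSE by default); setting it to TRUE appends skewness, kurtosis, and a Shapiro-Wilk normality test for numeric data:

```
library(pastecs)
# Append normality statistics (skewness, kurtosis, Shapiro-Wilk test)
# for the numeric salary variable
stat.desc(company$salary, norm = TRUE)
```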

The output is rather tidy, but you may have to do a little manipulation depending on your needs. There is no group-by comparison function, but you can use the `by` function (covered in part II) with it. This yields a list that contains a copy of the output for each group, which can be useful when examining the summary of a particular group but difficult when comparing a summary statistic across groups.
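A sketch of that `by` approach, restricted to the numeric variables of the fake data:

```
library(pastecs)
# One stat.desc summary per department, returned as a list-like
# "by" object with one element per group
by(company[, c("salary", "startdate")], company$department, stat.desc)
```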

# Summary

And that is it for this series on descriptive statistics in R! In this post, I covered four additional packages for summarizing data:

- `ggpairs` in the `GGally` package
- `CreateTableOne` in the `tableone` package
- `desctable` in the `desctable` package
- `stat.desc` in the `pastecs` package

Out of these four methods of summarizing data, I find the `desctable` function to be one of the more powerful approaches. It can provide more informative summaries than the other functions discussed in this post, and with a few simple tweaks, it is likely to be useful in almost any situation that calls for summary statistics. I also like the data visualization approach that `ggpairs` takes. Descriptive statistics are very useful, and in many circumstances, we need a lot of data to be summarized with a single number. However, these numbers can sometimes be very misleading. `ggpairs` helps to minimize this illusion by visually presenting the data in a quick and intuitive manner.

That being said, I think a combination of functions from each post is probably the most practical and useful approach for summarizing data. For example, you may want to start with `tabyl` from the `janitor` package or `skim` from the `skimr` package to get an understanding of your overall dataset. Then, you may want to use `ggpairs` from the `GGally` package to visually inspect the distribution of the data. You might even take this further by utilizing the strength of tidy data and pipe the output from one function into another (e.g., the output from `skim` into `dplyr` into `ggplot2`). Then again, you may just need some simple, bare-bones summary statistics. In these cases, `summary` and `table` in base R may be sufficient.

Across this three-part series, I have tried to cover as much as possible while also balancing the detail given to each approach. As you can see, R does indeed make some simple tasks (e.g., summarizing a dataset) complicated. Despite my best efforts, parts I, II, and III only scratch the surface of all the ways R can summarize data. CRAN *literally* has thousands of available packages, and I have only mentioned methods/packages that I am familiar with. Undoubtedly, there are other (and probably more efficient) methods of calculating descriptive statistics with R. I am always open to new ideas, so if anyone has any suggestions, please feel free to comment or contact me!