Summarizing Data

class: center, middle, inverse, title-slide

# Summarizing Data
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D. and Angela Lui, Ph.D.
### November 9, 2026

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-2-1.png" alt="" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-3-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Announcements

* There will be no meetup next Monday, February 16th. This does not affect due dates.

* I will be giving a talk on March 10th at NYU for the New York Open Statistical Programming Meetup https://nyhackr.org

---
# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---
# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---
# Types of Data

.pull-left[
* Numerical (quantitative)
	* Continuous
	* Discrete
]
.pull-right[
* Categorical (qualitative)
	* Regular categorical
	* Ordinal
]
.center[
<img src='images/continuous_discrete.png' height='400' />
]

---
# Data Types in R

---
# Data Types / Descriptives / Visualizations

Data Type                  | Descriptive Stats                             | Visualization
---------------------------|-----------------------------------------------|------------------------------
Continuous                 | mean, median, mode, standard deviation, IQR   | histogram, density, box plot
Discrete                   | contingency table, proportional table, median | bar plot
Categorical                | contingency table, proportional table         | bar plot
Ordinal                    | contingency table, proportional table, median | bar plot
Two quantitative           | correlation                                   | scatter plot
Two qualitative            | contingency table, chi-squared                | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test              | box plot
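
As a rough illustration of how these types appear in R, the sketch below builds one small vector of each type and applies a matching descriptive statistic. The vectors (`heights`, `siblings`, `colors`, `sizes`) are made-up values for illustration, not course data.

``` r
# Hypothetical toy vectors, one per data type
heights  <- c(160.5, 172.3, 168.0, 181.2)             # continuous
siblings <- c(0L, 2L, 1L, 2L)                         # discrete
colors   <- factor(c('red', 'blue', 'red', 'green'))  # categorical
sizes    <- factor(c('S', 'L', 'M', 'S'),
                   levels = c('S', 'M', 'L'), ordered = TRUE)  # ordinal

class(heights); class(siblings); class(colors); class(sizes)

mean(heights); sd(heights)        # center and spread for continuous data
table(colors)                     # contingency table
prop.table(table(colors))         # proportional table
median(as.integer(sizes))         # median position among the ordered levels
```

Storing ordinal data as an ordered factor keeps the level order (`S < M < L`) attached to the data, so tables and plots list the levels in that order rather than alphabetically.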

---
# Statistics

.pull-left[
When describing a quantitative variable we are often interested in two things:

1. A measure of center
2. A measure of spread

The most common measures we will use in this class are the mean and median.

$$ \bar{x} = \frac{\Sigma(x_i)}{n} $$

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

]
.pull-right[

Note that in the numerator of the variance calculation we square the differences (also known as deviations). Squaring terms is a common practice in statistics that serves two purposes:

1. It makes all the values non-negative.

2. It gives more weight to observations that are farther from the center.

]
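
To make the two formulas concrete, here is a minimal sketch that computes the mean and the variance (dividing by the number of observations) step by step. The five values in `x` are made up for illustration.

``` r
x <- c(2, 4, 4, 5, 10)          # hypothetical data

n     <- length(x)
x_bar <- sum(x) / n             # mean: 25 / 5 = 5
devs  <- x - x_bar              # deviations: -3 -1 -1  0  5
sq    <- devs^2                 # squared deviations: 9 1 1 0 25

sum(devs)                       # deviations sum to 0, hence the squaring
S2 <- sum(sq) / n               # variance: 36 / 5 = 7.2
S2
```

The largest observation contributes 25 of the 36 units in the numerator, which shows how squaring gives more weight to values far from the center.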

---
# Variance

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
Consider a dataset with five values (black points in the figure). For the largest value, the deviation ( `$x_i - \bar{x}$` ) is represented by the blue line.

See also:
https://shiny.rit.albany.edu/stat/visualizess/  
https://github.com/jbryer/VisualStats/

]
.pull-right[

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-5-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the *y* direction.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-6-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We end up with a square.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-7-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
We can plot the squared deviation for all the data points. That is, each component in the numerator is the area of one of these squares.
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-8-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
The variance is therefore the average of the area of all these squares, here represented by the orange square.
]
.pull-right[

<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-9-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Population versus Sample Variance

.pull-left[
Typically we want the sample variance. The difference is that we divide by `$n - 1$` to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by `$n$`.

Population Variance (yellow):
$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Sample Variance (green):
$$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$

]
.pull-right[

]
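
R's `var()` and `sd()` use the `$n - 1$` denominator. The sketch below, using made-up values, checks both formulas by hand; the names `pop_var` and `sample_var` are introduced here for illustration.

``` r
x  <- c(2, 4, 4, 5, 10)                # hypothetical data
n  <- length(x)
ss <- sum((x - mean(x))^2)             # sum of squared deviations: 36

pop_var    <- ss / n                   # divide by n:     7.2
sample_var <- ss / (n - 1)             # divide by n - 1: 9

var(x)                                 # matches sample_var
sqrt(sample_var)                       # matches sd(x): 3
var(x) * (n - 1) / n                   # converts back to pop_var
```

The gap between the two versions shrinks as the number of observations grows, since `(n - 1) / n` approaches 1.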

---
# Robust Statistics

Consider the following data randomly selected from the normal distribution:

.pull-left[

``` r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
```

```
## [1] 103.1934
```

```
## [1] 16.8945
```

``` r
median(x); IQR(x)
```

```
## [1] 103.9947
```

```
## [1] 25.68004
```
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-12-1.png" alt="" style="display: block; margin: auto;" />
]

---
# Robust Statistics

---
# Robust Statistics

Let's add an extreme value:

``` r
x <- c(x, 1000)
```
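
One way to see the effect is to compute all four statistics before and after adding the extreme value. This sketch regenerates the sample with the same seed; `x_out` and `stats` are names introduced here for illustration.

``` r
set.seed(41)
x     <- rnorm(30, mean = 100, sd = 15)   # sample drawn as on the earlier slide
x_out <- c(x, 1000)                       # with the extreme value added

# Small helper (hypothetical name) returning the four statistics
stats <- function(v) c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))

rbind(original = stats(x), with_outlier = stats(x_out))
```

A single extreme value pulls the mean and SD a long way, while the median and IQR depend only on the middle of the ordered data and barely move.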

---
# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread

* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

---
class: font80
# About `legosets` <img src="images/hex/brickset.png" class="title-hex">

To install the `brickset` package:

``` r
remotes::install_github('jbryer/brickset')
```

To load the `legosets` dataset:

``` r
data('legosets', package = 'brickset')
```

The `legosets` data has 21546 observations of 36 variables.

.code70[

``` r
names(legosets)
```

```
##  [1] "setID"                 "number"                "numberVariant"        
##  [4] "name"                  "year"                  "theme"                
##  [7] "themeGroup"            "subtheme"              "category"             
## [10] "released"              "pieces"                "minifigs"             
## [13] "bricksetURL"           "rating"                "reviewCount"          
## [16] "packagingType"         "availability"          "agerange_min"         
## [19] "thumbnailURL"          "imageURL"              "US_retailPrice"       
## [22] "US_dateFirstAvailable" "US_dateLastAvailable"  "UK_retailPrice"       
## [25] "UK_dateFirstAvailable" "UK_dateLastAvailable"  "CA_retailPrice"       
## [28] "CA_dateFirstAvailable" "CA_dateLastAvailable"  "DE_retailPrice"       
## [31] "DE_dateFirstAvailable" "DE_dateLastAvailable"  "height"               
## [34] "width"                 "depth"                 "weight"
```
]

---
# Structure (`str`) <img src="images/hex/brickset.png" class="title-hex">

.code50[

``` r
str(legosets)
```

```
## 'data.frame':	21546 obs. of  36 variables:
##  $ setID                : int  7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
##  $ number               : chr  "1" "2" "3" "4" ...
##  $ numberVariant        : int  8 8 6 4 6 1 1 1 3 4 ...
##  $ name                 : chr  "Small house set" "Medium house set" "Medium house set" "Large house set" ...
##  $ year                 : int  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
##  $ theme                : chr  "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
##  $ themeGroup           : chr  "Vintage" "Vintage" "Vintage" "Vintage" ...
##  $ subtheme             : chr  NA NA NA NA ...
##  $ category             : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ released             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ pieces               : int  67 109 158 233 NA 1 1 60 65 NA ...
##  $ minifigs             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL          : chr  "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
##  $ rating               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviewCount          : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ packagingType        : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ availability         : chr  "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
##  $ agerange_min         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ thumbnailURL         : chr  "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
##  $ imageURL             : chr  "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
##  $ US_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ US_dateFirstAvailable: Date, format: NA NA ...
##  $ US_dateLastAvailable : Date, format: NA NA ...
##  $ UK_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ UK_dateFirstAvailable: Date, format: NA NA ...
##  $ UK_dateLastAvailable : Date, format: NA NA ...
##  $ CA_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CA_dateFirstAvailable: Date, format: NA NA ...
##  $ CA_dateLastAvailable : Date, format: NA NA ...
##  $ DE_retailPrice       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DE_dateFirstAvailable: Date, format: NA NA ...
##  $ DE_dateLastAvailable : Date, format: NA NA ...
##  $ height               : num  NA NA NA NA NA ...
##  $ width                : num  NA NA NA NA NA ...
##  $ depth                : num  NA NA NA NA NA NA NA NA 5.08 NA ...
##  $ weight               : num  NA NA NA NA NA NA NA NA NA NA ...
```

]

---
# RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex">

---
class: hide-logo
# Table View

.font60[

First 10 of the 100 rows in the interactive table:

setID | name                                                    | year | theme                   | themeGroup       | category | US_retailPrice | pieces | minifigs | rating
------|---------------------------------------------------------|------|-------------------------|------------------|----------|----------------|--------|----------|-------
9788  | Moose                                                   | 2012 | Promotional             | Miscellaneous    | Normal   | NA             | 27     | NA       | 0
28901 | LEGO Minifigures - The LEGO Movie 2 Series {Random bag} | 2019 | Collectable Minifigures | Miscellaneous    | Random   | NA             | NA     | NA       | 3.8
24155 | Temple of Airjitzu                                      | 2015 | Ninjago                 | Action/Adventure | Normal   | 199.99         | 2028   | 13       | 4.6
1324  | Marie in Rainbow Skirt                                  | 1998 | Scala                   | Girls            | Normal   | NA             | 10     | 1        | 0
4184  | Fire Engine                                             | 1995 | Technic                 | Technical        | Normal   | NA             | 424    | 2        | 3.6
28548 | Kai Minifigure Alarm Clock                              | 2018 | Gear                    | Miscellaneous    | Gear     | NA             | NA     | NA       | 0
31973 | LEGO Meet the Minifigures                               | 2022 | Books                   | Miscellaneous    | Book     | NA             | 8      | NA       | 0
4118  | Universal Set with Flex System                          | 1991 | Technic                 | Technical        | Normal   | NA             | 313    | NA       | 3.7
51241 | Mario Kart - Mario & Standard Kart                      | 2025 | Super Mario             | Licensed         | Normal   | 169.99         | 1972   | NA       | 4.4
31879 | Luigi Key Chain                                         | 2021 | Gear                    | Miscellaneous    | Gear     | NA             | NA     | NA       | 0

]

---
# Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a>
]

---
# Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex">

.center[
<a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a>
]

---
class: font90
# Pipes `%>%` and `|>` <img src="images/hex/magrittr.png" class="title-hex">

.font90[
The pipe operator (`%>%`), introduced with the `magrittr` R package, allows for the chaining of R operations. As of version 4.1, R also has a native pipe operator (`|>`). Both take the output of the left-hand side and pass it as the first argument to the function on the right-hand side.
]

.pull-left[
You can do this in two steps:

``` r
tab_out <- table(legosets$category)
prop.table(tab_out)
```

Or as nested function calls.

``` r
prop.table(table(legosets$category))
```
]
.pull-right[
Using the pipe (`|>`) operator we can chain these calls in what is arguably a more readable format:

``` r
table(legosets$category) |> prop.table()
```
]

<hr />

```
## 
##        Book  Collection    Extended        Gear      Normal       Other 
## 0.035087719 0.029889539 0.036248027 0.158869396 0.668894458 0.067205050 
##      Random 
## 0.003805811
```
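
As a small self-contained check of the "first parameter" rule, the sketch below runs the same computation nested and piped on a made-up vector. Only the native pipe is used, so it assumes R 4.1 or later; `%>%` behaves the same way for these calls once `magrittr` or `dplyr` is loaded.

``` r
values <- c(1, 4, 9, 16)                 # hypothetical data

# Nested: read from the inside out
nested <- round(mean(sqrt(values)), digits = 1)

# Piped: read left to right; each result becomes the first argument
piped <- values |> sqrt() |> mean() |> round(digits = 1)

identical(nested, piped)                 # TRUE
piped                                    # 2.5
```

Additional arguments, such as `digits = 1` here, are written as usual; the pipe only fills in the first one.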

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_filter_sm.png' width='800' />
]

---
# Logical Operators

* `!a` - TRUE if a is FALSE
* `a == b` - TRUE if a and b are equal
* `a != b` - TRUE if a and b are not equal
* `a > b` - TRUE if a is strictly larger than b
* `a >= b` - TRUE if a is larger than or equal to b
* `a < b` - TRUE if a is strictly smaller than b
* `a <= b` - TRUE if a is smaller than or equal to b
* `a %in% b` - TRUE if a is in b where b is a vector

``` r
which( letters %in% c('a','e','i','o','u') )
```

```
## [1]  1  5  9 15 21
```
* `a | b` - TRUE if a *or* b is TRUE
* `a & b` - TRUE if a *and* b are TRUE
* `isTRUE(a)` - TRUE if a is TRUE
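
These operators work element by element on vectors. A short sketch with two made-up vectors, `a` and `b`:

``` r
a <- c(1, 5, 10)                 # hypothetical vectors
b <- c(1, 3, 20)

a == b                           # TRUE FALSE FALSE
a != b                           # FALSE TRUE TRUE
a > b                            # FALSE TRUE FALSE
a %in% b                         # TRUE FALSE FALSE (only 1 appears in b)

(a > 2) & (b > 2)                # FALSE TRUE TRUE
(a > 2) | (b > 15)               # FALSE TRUE TRUE
!(a > 2)                         # TRUE FALSE FALSE

isTRUE(a[1] == b[1])             # TRUE: a single TRUE value
isTRUE(a == b)                   # FALSE: not a length-one TRUE
```

Note that `isTRUE()` is not vectorized: it returns a single value and is TRUE only when its argument is exactly one TRUE.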

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
library(dplyr)
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
```

### Base R

``` r
# which() drops rows where the condition is NA, matching filter()
mylego <- legosets[which(legosets$themeGroup == 'Educational' & legosets$year > 2015),]
```

<hr />

``` r
nrow(mylego)
```

```
## [1] 121
```
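
One difference between the two approaches is worth a quick sketch: `filter()` drops rows where the condition evaluates to `NA`, whereas plain logical indexing in base R returns a row of `NA`s for each of them. The toy data frame below is made up for illustration and uses only base R; `which()` and `subset()` are two ways to drop the `NA` cases.

``` r
toy <- data.frame(themeGroup = c('Educational', 'Basic', NA, 'Educational'),
                  year       = c(2018, 2019, 2020, 2010))

keep <- toy$themeGroup == 'Educational' & toy$year > 2015
keep                              # TRUE FALSE NA FALSE

nrow(toy[keep, ])                 # 2: the NA condition yields an all-NA row
nrow(toy[which(keep), ])          # 1: which() keeps only the TRUE positions
nrow(subset(toy, themeGroup == 'Educational' & year > 2015))  # 1
```

This matters for `legosets` because `themeGroup` contains missing values, so row counts can differ between the two styles unless the `NA` cases are handled.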

---
# Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
```

### Base R

``` r
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
```

<hr />

``` r
head(mylego, n = 4)
```

```
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    109 Education {Not specified}             NA        6
## 2 26277    188 Education     Educational          94.95       NA
## 3 27742    160 Education {Not specified}             NA       NA
## 4 26805   1000 Education {Not specified}             NA       NA
```

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_relocate.png' width='800' />
]

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    109             NA        6
## 2 Education     Educational 26277    188          94.95       NA
## 3 Education {Not specified} 27742    160             NA       NA
```

### Base R

``` r
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
```

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/rename_sm.jpg' width='1000' />
]

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3)
```

```
##   setID pieces     theme    availability   USD minifigs
## 1 26803    109 Education {Not specified}    NA        6
## 2 26277    188 Education     Educational 94.95       NA
## 3 27742    160 Education {Not specified}    NA       NA
```

### Base R

``` r
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces   USD minifigs
## 1 Education {Not specified} 26803    109    NA        6
## 2 Education     Educational 26277    188 94.95       NA
## 3 Education {Not specified} 27742    160    NA       NA
```

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_mutate.png' width='700' />
]

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% 
	mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          94.95       NA       0.5050532
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
```

### Base R

``` r
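# Keep rows with both values present, then compute price per piece
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
rownames(mylego2) <- NULL
head(mylego2, n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          94.95       NA       0.5050532
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000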
```

---
# Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.code80[

``` r
legosets %>% group_by(themeGroup) %>% summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
												sd_price = sd(US_retailPrice, na.rm = TRUE),
												median_price = median(US_retailPrice, na.rm = TRUE),
												n = n(),
												missing = sum(is.na(US_retailPrice)))
```

```
## # A tibble: 17 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure       43.0     41.9         30.0  1620     845
##  2 Art and crafts         41.0     53.0         20.0   104       9
##  3 Basic                  22.8     19.4         15.0   884     733
##  4 Constraction           16.4     12.4         13.0   503     285
##  5 Educational           185.     188.         138.    546     508
##  6 Girls                  35.8     24.0         23.0   240     227
##  7 Historical             34.2     32.4         20.0   474     401
##  8 Junior                 22.0     10.1         20.0   228     165
##  9 Licensed               55.5     72.9         35.0  3353    1245
## 10 Miscellaneous          23.4     33.5         15.0  7151    4501
## 11 Model making           86.2     99.4         50.0   919     422
## 12 Modern day             39.8     36.8         30.0  2669    1594
## 13 Pre-school             31.7     23.2         25.0  1613    1108
## 14 Racing                 26.4     26.8         15.0   266     176
## 15 Technical              86.3     96.5         50.0   667     344
## 16 Vintage               NaN       NA           NA     307     307
## 17 <NA>                  NaN       NA           NA       2       2
```
]
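
For comparison with the other verbs, a base R counterpart could be sketched with `tapply()`. The toy data frame and its column names (`group`, `price`) are made up for illustration.

``` r
toy <- data.frame(group = c('a', 'a', 'b', 'b', 'b'),
                  price = c(10, 20, 5, NA, 15))

mean_price <- tapply(toy$price, toy$group, mean, na.rm = TRUE)   # a = 15, b = 10
n          <- tapply(toy$price, toy$group, length)               # a = 2,  b = 3
missing    <- tapply(is.na(toy$price), toy$group, sum)           # a = 0,  b = 1

# Combine the per-group results into one summary table
data.frame(group      = names(mean_price),
           mean_price = as.vector(mean_price),
           n          = as.vector(n),
           missing    = as.vector(missing))
```

As in the `summarize()` call, `n` counts all rows in a group, including those with a missing price, so `n - missing` is the number of values behind each mean.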

---
# Describe and Describe By

``` r
library(psych)
describe(legosets$US_retailPrice)
```

```
##    vars    n  mean    sd median trimmed   mad  min    max range skew kurtosis
## X1    1 8674 42.26 59.75  24.99   30.25 22.24 1.49 999.99 998.5 4.97    40.38
##      se
## X1 0.64
```

``` r
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
```

```
##      item                         group1 vars    n      mean        sd median   min    max range         se
## X11     1                {Not specified}    1 2004  27.76015  38.92310  19.99  1.49 789.99 788.5  0.8694780
## X12     2                    Educational    1   12 217.03333 108.17617 232.45 14.95 399.95 385.0 31.2277699
## X13     3 Gift with Purchase at LEGO.com    1    0       NaN        NA     NA   Inf   -Inf  -Inf         NA
## X14     4                Insiders Reward    1    0       NaN        NA     NA   Inf   -Inf  -Inf         NA
## X15     5                 LEGO exclusive    1 1268  65.12353 112.88247  14.99  1.99 999.99 998.0  3.1700555
## X16     6             LEGOLAND exclusive    1    2   4.99000   0.00000   4.99  4.99   4.99   0.0  0.0000000
## X17     7                       Not sold    1    1  12.99000        NA  12.99 12.99  12.99   0.0         NA
## X18     8                    Promotional    1    5   4.79000   0.83666   4.99  3.99   5.99   2.0  0.3741657
## X19     9          Promotional (Airline)    1    0       NaN        NA     NA   Inf   -Inf  -Inf         NA
## X110   10                         Retail    1 5062  40.59401  40.96294  29.99  1.99 699.99 698.0  0.5757448
## X111   11               Retail - limited    1  319  63.40755  69.70365  39.99  2.49 449.99 447.5  3.9026552
## X112   12                        Unknown    1    1   3.99000        NA   3.99  3.99   3.99   0.0         NA
```

---
class: middle
# Grammar of Graphics

.center[
<img src="images/ggplot2_masterpiece.png" height="550" />
]

---
# Data Visualizations with ggplot2 <img src="images/hex/ggplot2.png" class="title-hex">

* `ggplot2` is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.

* `ggplot2` is, in general, more flexible for creating "prettier" and complex plots.

* Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons)

* `ggplot2` has at least three ways of creating plots:
     1. `qplot` (deprecated as of `ggplot2` 3.4.0)
     2. `ggplot(...) + geom_XXX(...) + ...`
     3. `ggplot(...) + layer(...)`

* We will focus only on the second.

---
# Parts of a `ggplot2` Statement <img src="images/hex/ggplot2.png" class="title-hex">

* Data  
`ggplot(myDataFrame, aes(x=x, y=y))`

* Layers  
`geom_point()`, `geom_histogram()`

* Facets  
`facet_wrap(~ cut)`, `facet_grid(~ cut)`

* Scales  
`scale_y_log10()`

* Other options  
`ggtitle('my title')`, `ylim(c(0, 10000))`, `xlab('x-axis label')`
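
Putting the parts together, a complete statement might look like the sketch below. It assumes the `diamonds` dataset that ships with `ggplot2` (suggested by the `cut` variable in the facet examples); the title and axis labels are made up for illustration.

``` r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +   # data and aesthetics
    geom_point(alpha = 0.2) +                        # layer
    facet_wrap(~ cut) +                              # facets
    scale_y_log10() +                                # scale
    ggtitle('Diamond price by carat') +              # other options
    xlab('Carat') + ylab('Price (USD, log scale)')
p
```

Each part is added with `+`, so a plot can be built up and stored in an object such as `p`, then extended later with more layers or options.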

---
# Lots of geoms <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))]
```

```
##  [1] "geom_abline"            "geom_area"              "geom_bar"               "geom_bin_2d"           
##  [5] "geom_bin2d"             "geom_blank"             "geom_boxplot"           "geom_col"              
##  [9] "geom_contour"           "geom_contour_filled"    "geom_count"             "geom_crossbar"         
## [13] "geom_curve"             "geom_density"           "geom_density_2d"        "geom_density_2d_filled"
## [17] "geom_density2d"         "geom_density2d_filled"  "geom_dotplot"           "geom_errorbar"         
## [21] "geom_errorbarh"         "geom_freqpoly"          "geom_function"          "geom_hex"              
## [25] "geom_histogram"         "geom_hline"             "geom_jitter"            "geom_label"            
## [29] "geom_line"              "geom_linerange"         "geom_map"               "geom_path"             
## [33] "geom_point"             "geom_pointrange"        "geom_polygon"           "geom_qq"               
## [37] "geom_qq_line"           "geom_quantile"          "geom_raster"            "geom_rect"             
## [41] "geom_ribbon"            "geom_rug"               "geom_segment"           "geom_sf"               
## [45] "geom_sf_label"          "geom_sf_text"           "geom_smooth"            "geom_spoke"            
## [49] "geom_step"              "geom_text"              "geom_tile"              "geom_violin"           
## [53] "geom_vline"
```

---
# Data Visualization Cheat Sheet <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf'><img src='images/data-visualization-2.1.png' width='700' /></a>
]

---
# Scatterplot  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point()
```

---
# Scatterplot (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability)
```

---
# Boxplots  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot()
```

---
# Boxplots (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot()
```

---
# Boxplots (cont.)  <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip()
```

---
# Histograms <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(binwidth = 25)
```

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(bins = 15) + scale_x_log10()
```

---
# Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(binwidth = 25) + facet_wrap(~ availability)
```

---
# Density Plots <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density()
```

---
# Density Plots (cont.) <img src="images/hex/ggplot2.png" class="title-hex">

``` r
ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density() + scale_x_log10()
```

---
# `ggplot2` aesthetics <img src="images/hex/ggplot2.png" class="title-hex">

.center[
<a href='images/ggplot_aesthetics_cheatsheet.png' target='_new'> <img src='images/ggplot_aesthetics_cheatsheet.png' height='550' /></a>
]

---
# Likert Scales <img src="images/hex/likert.png" class="title-hex">

Likert scales are a type of questionnaire in which respondents rate items on an ordered scale, usually with four to seven levels (e.g. strongly disagree to strongly agree).

``` r
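library(likert)
library(reshape)  # provides rename(), used below
data(pisaitems)   # PISA items bundled with the likert package
# Keep only the columns whose names start with 'ST24Q'
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
# Replace the item codes with the full statement text
items24 <- rename(items24, c(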
			ST24Q01="I read only if I have to.",
			ST24Q02="Reading is one of my favorite hobbies.",
			ST24Q03="I like talking about books with other people.",
			ST24Q04="I find it hard to finish books.",
			ST24Q05="I feel happy if I receive a book as a present.",
			ST24Q06="For me, reading is a waste of time.",
			ST24Q07="I enjoy going to a bookstore or a library.",
			ST24Q08="I read only to get information that I need.",
			ST24Q09="I cannot sit still and read for more than a few minutes.",
			ST24Q10="I like to express my opinions about books I have read.",
			ST24Q11="I like to exchange books with my friends."))
```

---
# `likert` R Package <img src="images/hex/likert.png" class="title-hex">

``` r
l24 <- likert(items24)
summary(l24)
```

```
##                                                        Item      low neutral     high     mean        sd
## 10   I like to express my opinions about books I have read. 41.07516       0 58.92484 2.604913 0.9009968
## 5            I feel happy if I receive a book as a present. 46.93475       0 53.06525 2.466751 0.9446590
## 8               I read only to get information that I need. 50.39874       0 49.60126 2.484616 0.9089688
## 7                I enjoy going to a bookstore or a library. 51.21231       0 48.78769 2.428508 0.9164136
## 3             I like talking about books with other people. 54.99129       0 45.00871 2.328049 0.9090326
## 11                I like to exchange books with my friends. 55.54115       0 44.45885 2.343193 0.9609234
## 2                    Reading is one of my favorite hobbies. 56.64470       0 43.35530 2.344530 0.9277495
## 1                                 I read only if I have to. 58.72868       0 41.27132 2.291811 0.9369023
## 4                           I find it hard to finish books. 65.35125       0 34.64875 2.178299 0.8991628
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0 23.75476 1.974736 0.8793028
## 6                       For me, reading is a waste of time. 82.88729       0 17.11271 1.810093 0.8611554
```
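
---
# `likert` R Package (cont.) <img src="images/hex/likert.png" class="title-hex">

In the summary above, `low` and `high` appear to be the percentages of responses in the lower and upper halves of the scale. With four levels there is no middle category, so `neutral` is 0. The base R sketch below reproduces that arithmetic for a single item. The responses are hypothetical (not taken from `pisaitems`), and the calculation reflects an assumption about how the package summarizes items, not its actual source code.

```r
# Hypothetical responses to a single four-level item.
lvls <- c('Strongly disagree', 'Disagree', 'Agree', 'Strongly agree')
responses <- factor(c('Agree', 'Disagree', 'Strongly agree', 'Agree',
                      'Strongly disagree', 'Agree', 'Disagree', 'Agree'),
                    levels = lvls, ordered = TRUE)

# Proportional table: percentage of responses at each level.
pct <- prop.table(table(responses)) * 100

# With an even number of levels, split the scale in half.
half <- length(lvls) / 2
low  <- sum(pct[1:half])                   # two lowest levels
high <- sum(pct[(half + 1):length(lvls)])  # two highest levels

# Mean and standard deviation treat the levels as the integers 1 to 4.
item_mean <- mean(as.integer(responses))
item_sd   <- sd(as.integer(responses))

c(low = low, high = high, mean = item_mean, sd = item_sd)
```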

---
# `likert` Plots  <img src="images/hex/likert.png" class="title-hex">

``` r
plot(l24)
```

---
# Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts, which represent preferences for five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

---
# Pie Charts

Source: [https://en.wikipedia.org/wiki/Pie_chart](https://en.wikipedia.org/wiki/Pie_chart).

---
class: middle
# Just say NO to pie charts!

.font150[
"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"]
.right[.font130[John Tukey]]
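
---
# Bar Plots Instead of Pie Charts <img src="images/hex/ggplot2.png" class="title-hex">

To see why similar slices are hard to compare, the same shares can be drawn both ways. This is a sketch only: the `prefs` data frame and its percentages are made up for illustration and are not the values behind the Wikipedia figure.

```r
library(ggplot2)

# Hypothetical shares for five colors; the values are illustrative only.
prefs <- data.frame(
  color = c('Red', 'Blue', 'Green', 'Yellow', 'Purple'),
  percent = c(17, 18, 20, 22, 23)
)

# Bar plot: bars share a common baseline, so small differences are visible.
p_bar <- ggplot(prefs, aes(x = color, y = percent)) + geom_col()

# The same data as a pie chart: one stacked bar drawn in polar coordinates.
p_pie <- ggplot(prefs, aes(x = '', y = percent, fill = color)) +
  geom_col(width = 1) +
  coord_polar(theta = 'y')

p_bar
p_pie
```

In the bar plot the ordering of the five colors can be read directly from the bar heights, whereas the pie chart requires comparing angles that differ by only a few degrees.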

---
# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For data visualization:

* `ggplot2` website: https://ggplot2.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/data-visualisation.html
* R Graphics Cookbook: https://r-graphics.org
* Data visualization cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf

---
class: left, font140
# One Minute Paper

.pull-left[
1. What was the most important thing you learned during this class?
2. What important question remains unanswered for you?
]
.pull-right[
<img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-60-1.png" alt="" style="display: block; margin: auto;" />
]

https://forms.gle/Ze19MooQHvZmQE2ZA