class: center, middle, inverse, title-slide # Summarizing Data ## DATA 606 - Statistics & Probability for Data Analytics ### Jason Bryer, Ph.D. and Angela Lui, Ph.D. ### November 9, 2026 --- # One Minute Paper Results .pull-left[ **What was the most important thing you learned during this class?** <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-2-1.png" alt="" style="display: block; margin: auto;" /> ] .pull-right[ **What important question remains unanswered for you?** <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-3-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Announcements * There will be no meetup next Monday, February 16th. This does not affect due dates. * I will be giving a talk on March 10th at NYU for the New York Open Statistical Programming Meetup https://nyhackr.org --- # Workflow .center[ <img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' /> ] .font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)] --- # Tidy Data .center[ <img src='images/tidydata_1.jpg' height='500' /> ] See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html). 
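A minimal sketch of the tidy-data idea, in which each variable is a column and each observation is a row. The table and its column names are made up for illustration, and base R's `reshape()` is just one of several ways to do this.

``` r
# Hypothetical "wide" table: one column per year (not tidy)
wide <- data.frame(country    = c("A", "B"),
                   cases_2019 = c(10, 20),
                   cases_2020 = c(15, 25))

# Reshape so that each row is one country-year observation (tidy)
long <- reshape(wide,
                direction = "long",
                varying   = c("cases_2019", "cases_2020"),
                v.names   = "cases",
                timevar   = "year",
                times     = c(2019, 2020),
                idvar     = "country")
long
```

The result has one row per country and year, with `year` and `cases` as separate columns.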
--- # Types of Data .pull-left[ * Numerical (quantitative) * Continuous * Discrete ] .pull-right[ * Categorical (qualitative) * Regular categorical * Ordinal ] .center[ <img src='images/continuous_discrete.png' height='400' /> ] --- # Data Types in R <img src="images/DataTypesConceptModel.png" alt="" width="1000" style="display: block; margin: auto;" /> --- # Data Types / Descriptives / Visualizations Data Type | Descriptive Stats | Visualization -------------|-----------------------------------------------|-------------------| Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot Discrete | contingency table, proportional table, median | bar plot Categorical | contingency table, proportional table | bar plot Ordinal | contingency table, proportional table, median | bar plot Two quantitative | correlation | scatter plot Two qualitative | contingency table, chi-squared | mosaic plot, bar plot Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot --- # Statistics .pull-left[ When describing a quantitative variable we are often interested in two things: 1. A measure of center 2. A measure of spread The most common measures of center we will use in this class are the mean and median. $$ \bar{x} = \frac{\Sigma(x_i)}{n} $$ $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ ] .pull-right[ Note that in the numerator for the variance calculation we square the differences (also known as deviations). Squaring terms is a common practice in statistics that serves two purposes: 1. It makes all the values positive. 2. It gives more weight to observations that are farther from the center. ] --- # Variance .pull-left[ Population Variance: $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ Consider a dataset with five values (black points in the figure). For the largest value, the deviation is represented by the blue line ( `\(x_i - \bar{x}\)` ). 
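As a minimal sketch of the numerator, the code below computes each deviation from the mean, squares it, and averages the squares. The five values are hypothetical and are not the ones plotted in the figure.

``` r
# Five hypothetical values (not the data shown in the figure)
x <- c(2, 4, 5, 7, 12)

x_bar      <- mean(x)        # measure of center
deviations <- x - x_bar      # signed distance of each value from the mean
squared    <- deviations^2   # area of each square

# Population variance: the average area of the squares
pop_var <- sum(squared) / length(x)
pop_var
```

Here the largest value, 12, has a deviation of 6 and contributes a square of area 36.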
See also: https://shiny.rit.albany.edu/stat/visualizess/ https://github.com/jbryer/VisualStats/ ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-5-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Variance (cont.) .pull-left[ Population Variance: $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the *y* direction. ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-6-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Variance (cont.) .pull-left[ Population Variance: $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ We end up with a square. ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-7-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Variance (cont.) .pull-left[ Population Variance: $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ We can plot the squared deviation for all the data points. That is, each term in the numerator is the area of one of these squares. ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-8-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Variance (cont.) .pull-left[ Population Variance: $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ The variance is therefore the average area of all these squares, here represented by the orange square. ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-9-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Population versus Sample Variance .pull-left[ Typically we want the sample variance. The difference is that we divide by `\(n - 1\)` to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by `\(n\)`. 
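A short sketch of the difference, again using five hypothetical values. R's built-in `var()` divides by `\(n - 1\)`, so the population version is computed by hand here.

``` r
x <- c(2, 4, 5, 7, 12)        # hypothetical values for illustration
n <- length(x)

ss <- sum((x - mean(x))^2)    # sum of squared deviations (the numerator)

pop_var  <- ss / n            # population variance: divide by n
samp_var <- ss / (n - 1)      # sample variance: divide by n - 1

# var() uses the n - 1 denominator, so it matches samp_var
c(population = pop_var, sample = samp_var, builtin = var(x))
```

With these values the sample variance is 14.5 and the population variance is 11.6; the gap shrinks as `\(n\)` grows.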
Population Variance (yellow): $$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$ Sample Variance (green): $$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$ ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-10-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Robust Statistics Consider the following data randomly selected from a normal distribution: .pull-left[ ``` r set.seed(41) x <- rnorm(30, mean = 100, sd = 15) mean(x); sd(x) ``` ``` ## [1] 103.1934 ``` ``` ## [1] 16.8945 ``` ``` r median(x); IQR(x) ``` ``` ## [1] 103.9947 ``` ``` ## [1] 25.68004 ``` ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-12-1.png" alt="" style="display: block; margin: auto;" /> ] --- # Robust Statistics <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-13-1.png" alt="" style="display: block; margin: auto;" /> --- # Robust Statistics <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-14-1.png" alt="" style="display: block; margin: auto;" /> Let's add an extreme value: ``` r x <- c(x, 1000) ``` -- <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-16-1.png" alt="" style="display: block; margin: auto;" /> --- # Robust Statistics Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, * for skewed distributions it is often more helpful to use median and IQR to describe the center and spread * for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread --- class: font80 # About `legosets` <img src="images/hex/brickset.png" class="title-hex"> To install the `brickset` package: ``` r remotes::install_github('jbryer/brickset') ``` To load the `legosets` dataset: ``` r data('legosets', package = 'brickset') ``` The `legosets` data frame has 21546 observations of 36 variables. 
.code70[ ``` r names(legosets) ``` ``` ## [1] "setID" "number" "numberVariant" ## [4] "name" "year" "theme" ## [7] "themeGroup" "subtheme" "category" ## [10] "released" "pieces" "minifigs" ## [13] "bricksetURL" "rating" "reviewCount" ## [16] "packagingType" "availability" "agerange_min" ## [19] "thumbnailURL" "imageURL" "US_retailPrice" ## [22] "US_dateFirstAvailable" "US_dateLastAvailable" "UK_retailPrice" ## [25] "UK_dateFirstAvailable" "UK_dateLastAvailable" "CA_retailPrice" ## [28] "CA_dateFirstAvailable" "CA_dateLastAvailable" "DE_retailPrice" ## [31] "DE_dateFirstAvailable" "DE_dateLastAvailable" "height" ## [34] "width" "depth" "weight" ``` ] --- # Structure (`str`) <img src="images/hex/brickset.png" class="title-hex"> .code50[ ``` r str(legosets) ``` ``` ## 'data.frame': 21546 obs. of 36 variables: ## $ setID : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ... ## $ number : chr "1" "2" "3" "4" ... ## $ numberVariant : int 8 8 6 4 6 1 1 1 3 4 ... ## $ name : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ... ## $ year : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ... ## $ theme : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ... ## $ themeGroup : chr "Vintage" "Vintage" "Vintage" "Vintage" ... ## $ subtheme : chr NA NA NA NA ... ## $ category : chr "Normal" "Normal" "Normal" "Normal" ... ## $ released : logi TRUE TRUE TRUE TRUE TRUE TRUE ... ## $ pieces : int 67 109 158 233 NA 1 1 60 65 NA ... ## $ minifigs : int NA NA NA NA NA NA NA NA NA NA ... ## $ bricksetURL : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ... ## $ rating : num 0 0 0 0 0 0 0 0 0 0 ... ## $ reviewCount : int 0 0 1 0 0 0 0 0 0 0 ... ## $ packagingType : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ... ## $ availability : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ... 
## $ agerange_min : int NA NA NA NA NA NA NA NA NA NA ... ## $ thumbnailURL : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ... ## $ imageURL : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ... ## $ US_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ US_dateFirstAvailable: Date, format: NA NA ... ## $ US_dateLastAvailable : Date, format: NA NA ... ## $ UK_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ UK_dateFirstAvailable: Date, format: NA NA ... ## $ UK_dateLastAvailable : Date, format: NA NA ... ## $ CA_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ CA_dateFirstAvailable: Date, format: NA NA ... ## $ CA_dateLastAvailable : Date, format: NA NA ... ## $ DE_retailPrice : num NA NA NA NA NA NA NA NA NA NA ... ## $ DE_dateFirstAvailable: Date, format: NA NA ... ## $ DE_dateLastAvailable : Date, format: NA NA ... ## $ height : num NA NA NA NA NA ... ## $ width : num NA NA NA NA NA ... ## $ depth : num NA NA NA NA NA NA NA NA 5.08 NA ... ## $ weight : num NA NA NA NA NA NA NA NA NA NA ... ``` ] --- # RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex"> <img src="images/legosets_rstudio_environment.png" alt="" width="500" style="display: block; margin: auto;" /> --- class: hide-logo # Table View .font60[
] --- # Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex"> .center[ <a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a> ] --- # Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex"> .center[ <a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a> ] --- class: font90 # Pipes `%>%` and `|>` <img src="images/hex/magrittr.png" class="title-hex"> <img src='images/magrittr_pipe.jpg' align='right' width='200' /> .font90[ The pipe operator (`%>%`) introduced with the `magrittr` R package allows for the chaining of R operations. As of version 4.1, R also has a native pipe operator (`|>`). Both take the output from the left-hand side and pass it as the first argument to the function on the right-hand side. ] .pull-left[ You can do this in two steps: ``` r tab_out <- table(legosets$category) prop.table(tab_out) ``` Or as nested function calls:
``` r prop.table(table(legosets$category)) ``` ] .pull-right[ Using the pipe (`|>`) operator we can chain these calls in what is arguably a more readable format: ``` r table(legosets$category) |> prop.table() ``` ] <hr /> ``` ## ## Book Collection Extended Gear Normal Other ## 0.035087719 0.029889539 0.036248027 0.158869396 0.668894458 0.067205050 ## Random ## 0.003805811 ``` --- # Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/dplyr_filter_sm.png' width='800' /> ] --- # Logical Operators * `!a` - TRUE if a is FALSE * `a == b` - TRUE if a and b are equal * `a != b` - TRUE if a and b are not equal * `a > b` - TRUE if a is larger than b, but not equal * `a >= b` - TRUE if a is larger than or equal to b * `a < b` - TRUE if a is smaller than b, but not equal * `a <= b` - TRUE if a is smaller than or equal to b * `a %in% b` - TRUE if a is in b where b is a vector ``` r which( letters %in% c('a','e','i','o','u') ) ``` ``` ## [1] 1 5 9 15 21 ``` * `a | b` - TRUE if a *or* b is TRUE * `a & b` - TRUE if a *and* b are TRUE * `isTRUE(a)` - TRUE if a is TRUE --- # Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ``` r mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015) ``` ### Base R ``` r mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,] ``` <hr /> ``` r nrow(mylego) ``` ``` ## [1] 121 ``` --- # Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ``` r mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs) ``` ### Base R ``` r mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')] ``` <hr /> ``` r head(mylego, n = 4) ``` ``` ## setID pieces theme availability US_retailPrice minifigs ## 1 26803 109 Education {Not specified} 
NA 6 ## 2 26277 188 Education Educational 94.95 NA ## 3 27742 160 Education {Not specified} NA NA ## 4 26805 1000 Education {Not specified} NA NA ``` --- # Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/dplyr_relocate.png' width='800' /> ] --- # Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ``` r mylego %>% relocate(where(is.numeric), .after = where(is.character)) %>% head(n = 3) ``` ``` ## theme availability setID pieces US_retailPrice minifigs ## 1 Education {Not specified} 26803 109 NA 6 ## 2 Education Educational 26277 188 94.95 NA ## 3 Education {Not specified} 27742 160 NA NA ``` ### Base R ``` r mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')] head(mylego2, n = 3) ``` ``` ## theme availability setID pieces US_retailPrice minifigs ## 1 Education {Not specified} 26803 109 NA 6 ## 2 Education Educational 26277 188 94.95 NA ## 3 Education {Not specified} 27742 160 NA NA ``` --- # Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/rename_sm.jpg' width='1000' /> ] --- # Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ``` r mylego %>% dplyr::rename(USD = US_retailPrice) %>% head(n = 3) ``` ``` ## setID pieces theme availability USD minifigs ## 1 26803 109 Education {Not specified} NA 6 ## 2 26277 188 Education Educational 94.95 NA ## 3 27742 160 Education {Not specified} NA NA ``` ### Base R ``` r names(mylego2)[5] <- 'USD' head(mylego2, n = 3) ``` ``` ## theme availability setID pieces USD minifigs ## 1 Education {Not specified} 26803 109 NA 6 ## 2 Education Educational 26277 188 94.95 NA ## 3 Education {Not specified} 27742 160 NA NA ``` --- # Mutate <img src="images/hex/tidyverse.png" 
class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .center[ <img src='images/dplyr_mutate.png' width='700' /> ] --- # Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> ### `dplyr` ``` r mylego %>% filter(!is.na(pieces) & !is.na(US_retailPrice)) %>% mutate(Price_per_piece = US_retailPrice / pieces) %>% head(n = 3) ``` ``` ## setID pieces theme availability US_retailPrice minifigs Price_per_piece ## 1 26277 188 Education Educational 94.95 NA 0.5050532 ## 2 25949 280 Education Educational 224.95 NA 0.8033929 ## 3 25954 1 Education Educational 14.95 NA 14.9500000 ``` ### Base R ``` r mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),] mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces rownames(mylego2) <- NULL head(mylego2, n = 3) ``` ``` ## setID pieces theme availability US_retailPrice minifigs Price_per_piece ## 1 26277 188 Education Educational 94.95 NA 0.5050532 ## 2 25949 280 Education Educational 224.95 NA 0.8033929 ## 3 25954 1 Education Educational 14.95 NA 14.9500000 ``` --- # Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex"> .code80[ ``` r legosets %>% group_by(themeGroup) %>% summarize(mean_price = mean(US_retailPrice, na.rm = TRUE), sd_price = sd(US_retailPrice, na.rm = TRUE), median_price = median(US_retailPrice, na.rm = TRUE), n = n(), missing = sum(is.na(US_retailPrice))) ``` ``` ## # A tibble: 17 × 6 ## themeGroup mean_price sd_price median_price n missing ## <chr> <dbl> <dbl> <dbl> <int> <int> ## 1 Action/Adventure 43.0 41.9 30.0 1620 845 ## 2 Art and crafts 41.0 53.0 20.0 104 9 ## 3 Basic 22.8 19.4 15.0 884 733 ## 4 Constraction 16.4 12.4 13.0 503 285 ## 5 Educational 185. 188. 138. 
546 508 ## 6 Girls 35.8 24.0 23.0 240 227 ## 7 Historical 34.2 32.4 20.0 474 401 ## 8 Junior 22.0 10.1 20.0 228 165 ## 9 Licensed 55.5 72.9 35.0 3353 1245 ## 10 Miscellaneous 23.4 33.5 15.0 7151 4501 ## 11 Model making 86.2 99.4 50.0 919 422 ## 12 Modern day 39.8 36.8 30.0 2669 1594 ## 13 Pre-school 31.7 23.2 25.0 1613 1108 ## 14 Racing 26.4 26.8 15.0 266 176 ## 15 Technical 86.3 96.5 50.0 667 344 ## 16 Vintage NaN NA NA 307 307 ## 17 <NA> NaN NA NA 2 2 ``` ] --- # Describe and Describe By ``` r library(psych) describe(legosets$US_retailPrice) ``` ``` ## vars n mean sd median trimmed mad min max range skew kurtosis ## X1 1 8674 42.26 59.75 24.99 30.25 22.24 1.49 999.99 998.5 4.97 40.38 ## se ## X1 0.64 ``` ``` r describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE) ``` ``` ## item group1 vars n mean sd median min max range se ## X11 1 {Not specified} 1 2004 27.76015 38.92310 19.99 1.49 789.99 788.5 0.8694780 ## X12 2 Educational 1 12 217.03333 108.17617 232.45 14.95 399.95 385.0 31.2277699 ## X13 3 Gift with Purchase at LEGO.com 1 0 NaN NA NA Inf -Inf -Inf NA ## X14 4 Insiders Reward 1 0 NaN NA NA Inf -Inf -Inf NA ## X15 5 LEGO exclusive 1 1268 65.12353 112.88247 14.99 1.99 999.99 998.0 3.1700555 ## X16 6 LEGOLAND exclusive 1 2 4.99000 0.00000 4.99 4.99 4.99 0.0 0.0000000 ## X17 7 Not sold 1 1 12.99000 NA 12.99 12.99 12.99 0.0 NA ## X18 8 Promotional 1 5 4.79000 0.83666 4.99 3.99 5.99 2.0 0.3741657 ## X19 9 Promotional (Airline) 1 0 NaN NA NA Inf -Inf -Inf NA ## X110 10 Retail 1 5062 40.59401 40.96294 29.99 1.99 699.99 698.0 0.5757448 ## X111 11 Retail - limited 1 319 63.40755 69.70365 39.99 2.49 449.99 447.5 3.9026552 ## X112 12 Unknown 1 1 3.99000 NA 3.99 3.99 3.99 0.0 NA ``` --- class: middle # Grammar of Graphics .center[ <img src="images/ggplot2_masterpiece.png" height="550" /> ] --- # Data Visualizations with ggplot2 <img src="images/hex/ggplot2.png" class="title-hex"> * `ggplot2` is an R package that provides an 
alternative framework based upon Wilkinson’s (2005) Grammar of Graphics. * `ggplot2` is, in general, more flexible for creating "prettier" and complex plots. * Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons). `ggplot2` has at least three ways of creating plots: 1. `qplot` 2. `ggplot(...) + geom_XXX(...) + ...` 3. `ggplot(...) + layer(...)` * We will focus only on the second. --- # Parts of a `ggplot2` Statement <img src="images/hex/ggplot2.png" class="title-hex"> * Data `ggplot(myDataFrame, aes(x=x, y=y))` * Layers `geom_point()`, `geom_histogram()` * Facets `facet_wrap(~ cut)`, `facet_grid(~ cut)` * Scales `scale_y_log10()` * Other options `ggtitle('my title')`, `ylim(c(0, 10000))`, `xlab('x-axis label')` --- # Lots of geoms <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ls('package:ggplot2')[grep('^geom_', ls('package:ggplot2'))] ``` ``` ## [1] "geom_abline" "geom_area" "geom_bar" "geom_bin_2d" ## [5] "geom_bin2d" "geom_blank" "geom_boxplot" "geom_col" ## [9] "geom_contour" "geom_contour_filled" "geom_count" "geom_crossbar" ## [13] "geom_curve" "geom_density" "geom_density_2d" "geom_density_2d_filled" ## [17] "geom_density2d" "geom_density2d_filled" "geom_dotplot" "geom_errorbar" ## [21] "geom_errorbarh" "geom_freqpoly" "geom_function" "geom_hex" ## [25] "geom_histogram" "geom_hline" "geom_jitter" "geom_label" ## [29] "geom_line" "geom_linerange" "geom_map" "geom_path" ## [33] "geom_point" "geom_pointrange" "geom_polygon" "geom_qq" ## [37] "geom_qq_line" "geom_quantile" "geom_raster" "geom_rect" ## [41] "geom_ribbon" "geom_rug" "geom_segment" "geom_sf" ## [45] "geom_sf_label" "geom_sf_text" "geom_smooth" "geom_spoke" ## [49] "geom_step" "geom_text" "geom_tile" "geom_violin" ## [53] "geom_vline" ``` --- # Data Visualization Cheat Sheet <img src="images/hex/ggplot2.png" class="title-hex"> .center[ <a href='https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf'><img 
src='images/data-visualization-2.1.png' width='700' /></a> ] --- # Scatterplot <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=pieces, y=US_retailPrice)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-45-1.png" alt="" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=pieces, y=US_retailPrice, color=availability)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-46-1.png" alt="" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs, color=availability)) + geom_point() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-47-1.png" alt="" style="display: block; margin: auto;" /> --- # Scatterplot (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=pieces, y=US_retailPrice, size=minifigs)) + geom_point() + facet_wrap(~ availability) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-48-1.png" alt="" style="display: block; margin: auto;" /> --- # Boxplots <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x='Lego', y=US_retailPrice)) + geom_boxplot() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-49-1.png" alt="" style="display: block; margin: auto;" /> --- # Boxplots (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-50-1.png" alt="" style="display: block; margin: auto;" /> --- # Boxplot (cont.) 
<img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x=availability, y=US_retailPrice)) + geom_boxplot() + coord_flip() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-51-1.png" alt="" style="display: block; margin: auto;" /> --- # Histograms <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(binwidth = 25) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-52-1.png" alt="" style="display: block; margin: auto;" /> --- # Histograms (cont.)<img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(bins = 15) + scale_x_log10() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-53-1.png" alt="" style="display: block; margin: auto;" /> --- # Histograms (cont.) <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x = US_retailPrice)) + geom_histogram(binwidth = 25) + facet_wrap(~ availability) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-54-1.png" alt="" style="display: block; margin: auto;" /> --- # Density Plots <img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-55-1.png" alt="" style="display: block; margin: auto;" /> --- # Density Plots (cont.) 
<img src="images/hex/ggplot2.png" class="title-hex"> ``` r ggplot(legosets, aes(x = US_retailPrice, color = availability)) + geom_density() + scale_x_log10() ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-56-1.png" alt="" style="display: block; margin: auto;" /> --- # `ggplot2` aesthetics <img src="images/hex/ggplot2.png" class="title-hex"> .center[ <a href='images/ggplot_aesthetics_cheatsheet.png' target='_new'> <img src='images/ggplot_aesthetics_cheatsheet.png' height='550' /></a> ] --- # Likert Scales <img src="images/hex/likert.png" class="title-hex"> Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree). ``` r library(likert) library(reshape) data(pisaitems) items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q'] items24 <- rename(items24, c( ST24Q01="I read only if I have to.", ST24Q02="Reading is one of my favorite hobbies.", ST24Q03="I like talking about books with other people.", ST24Q04="I find it hard to finish books.", ST24Q05="I feel happy if I receive a book as a present.", ST24Q06="For me, reading is a waste of time.", ST24Q07="I enjoy going to a bookstore or a library.", ST24Q08="I read only to get information that I need.", ST24Q09="I cannot sit still and read for more than a few minutes.", ST24Q10="I like to express my opinions about books I have read.", ST24Q11="I like to exchange books with my friends.")) ``` --- # `likert` R Package <img src="images/hex/likert.png" class="title-hex"> ``` r l24 <- likert(items24) summary(l24) ``` ``` ## Item low neutral high mean sd ## 10 I like to express my opinions about books I have read. 41.07516 0 58.92484 2.604913 0.9009968 ## 5 I feel happy if I receive a book as a present. 46.93475 0 53.06525 2.466751 0.9446590 ## 8 I read only to get information that I need. 50.39874 0 49.60126 2.484616 0.9089688 ## 7 I enjoy going to a bookstore or a library. 
51.21231 0 48.78769 2.428508 0.9164136 ## 3 I like talking about books with other people. 54.99129 0 45.00871 2.328049 0.9090326 ## 11 I like to exchange books with my friends. 55.54115 0 44.45885 2.343193 0.9609234 ## 2 Reading is one of my favorite hobbies. 56.64470 0 43.35530 2.344530 0.9277495 ## 1 I read only if I have to. 58.72868 0 41.27132 2.291811 0.9369023 ## 4 I find it hard to finish books. 65.35125 0 34.64875 2.178299 0.8991628 ## 9 I cannot sit still and read for more than a few minutes. 76.24524 0 23.75476 1.974736 0.8793028 ## 6 For me, reading is a waste of time. 82.88729 0 17.11271 1.810093 0.8611554 ``` --- # `likert` Plots <img src="images/hex/likert.png" class="title-hex"> ``` r plot(l24) ``` <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-59-1.png" alt="" style="display: block; margin: auto;" /> --- # Pie Charts There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer. <center><img src='images/Pie.png' width='500'></center> --- # Pie Charts There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer. <center><img src='images/Pie.png' width='500'></center> <center><img src='images/Bar.png' width='500'></center> Source: [https://en.wikipedia.org/wiki/Pie_chart](https://en.wikipedia.org/wiki/Pie_chart). --- class: middle # Just say NO to pie charts! 
.font150[ "There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"] .right[.font130[John Tukey]] --- # Additional Resources For data wrangling: * `dplyr` website: https://dplyr.tidyverse.org * R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html * Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome * Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf For data visualization: * `ggplot2` website: https://ggplot2.tidyverse.org * R for Data Science book: https://r4ds.had.co.nz/data-visualisation.html * R Graphics Cookbook: https://r-graphics.org * Data visualization cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf --- class: left, font140 # One Minute Paper .pull-left[ 1. What was the most important thing you learned during this class? 2. What important question remains unanswered for you? ] .pull-right[ <img src="02-Summarizing_Data_files/figure-html/unnamed-chunk-60-1.png" alt="" style="display: block; margin: auto;" /> ] https://forms.gle/Ze19MooQHvZmQE2ZA