Dave's Web Corner

MTH-225 Week 2 (January 12-18)

Week 2 in Data Science

This week you are beginning to investigate the organization and manipulation of data.
At the beginning of the class, we are practicing on data that is already well-organized in the format that we call tidy data. Tidy data is a standard for organizing data that we will formally discuss in Chapter 5.
You are already accustomed to organizing and consuming information in tables and spreadsheets. Data science begins with the concept of a "table" and makes it much more fluid, functional, and formal.
- A data frame is a table of information organized into observations (rows) and variables (columns). The information stored in a data frame for a given observation and a given variable is called a value.
- A tibble is a data frame with some additional organization that allows us to easily manipulate the observations, variables, and values.
- Tidy data is data that has been organized according to specific rules (each variable is a column, each observation is a row, and each value is a cell) to maximize efficiency and reduce redundancy.
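To make the vocabulary above concrete, here is a minimal sketch that builds a small tibble by hand. The table name `students` and its contents are made up for illustration; only `tibble()` and the base R functions are real.

```r
library(tibble)

# Hypothetical example data: three observations (rows), three variables (columns)
students <- tibble(
  name  = c("Ana", "Ben", "Cy"),
  major = c("Math", "Data Science", "Math"),
  gpa   = c(3.6, 3.2, 3.9)
)

students                 # printing a tibble shows its dimensions and column types
nrow(students)           # number of observations
ncol(students)           # number of variables
students$gpa[2]          # the value for observation 2 of the variable gpa
is.data.frame(students)  # a tibble is still a data frame
```

This table is already tidy: each column is one variable, each row is one student, and each cell holds a single value.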
Major functions for manipulating data frames:
- filter(), arrange(), and distinct() manipulate observations
- mutate(), select(), rename(), and relocate() manipulate variables
Some valuable helper functions:
- desc(), group_by(), summarize(), and the slice_*() functions
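As a rough illustration of the functions listed above, the sketch below applies each one to a tiny table. The data set `flights_small` is invented for this example (the textbook uses a much larger flights data set), and the column names are assumptions chosen to resemble it.

```r
library(dplyr)

# Hypothetical flight data, made up for illustration
flights_small <- tibble(
  carrier   = c("AA", "UA", "AA", "DL", "UA"),
  dep_delay = c(5, 30, -2, 12, 30),
  distance  = c(500, 1200, 500, 800, 1200),
  air_time  = c(90, 180, 85, 120, 175)
)

# Functions that manipulate observations (rows)
filter(flights_small, dep_delay > 10)     # keep rows that meet a condition
arrange(flights_small, desc(dep_delay))   # sort rows, largest delay first
distinct(flights_small, carrier)          # one row per unique carrier

# Functions that manipulate variables (columns)
mutate(flights_small, speed = distance / air_time * 60)  # add a computed column
select(flights_small, carrier, dep_delay)                # keep only some columns
rename(flights_small, delay = dep_delay)                 # new_name = old_name
relocate(flights_small, distance)                        # move a column to the front
```

None of these calls changes `flights_small` itself. Each one returns a new data frame, which you can print, assign to a name, or pass along to the next function.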
This week you are introduced to the pipe, |>, an efficient tool for writing and organizing R commands that manipulate data frames.
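One way to see what the pipe does is to write the same multistep command twice, once nested and once piped. This is only a sketch: `flights_small` is a made-up table, and the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical flight data, made up for illustration
flights_small <- tibble(
  carrier   = c("AA", "UA", "AA", "DL", "UA"),
  dep_delay = c(5, 30, -2, 12, 30)
)

# Without the pipe: calls are nested, so you read from the inside out
nested <- arrange(filter(flights_small, dep_delay > 0), desc(dep_delay))

# With the pipe: read top to bottom, saying "then" at each |>
piped <- flights_small |>
  filter(dep_delay > 0) |>
  arrange(desc(dep_delay))

identical(nested, piped)  # both versions produce the same data frame
```

The pipe passes the data frame on its left as the first argument of the function on its right, which is why each step in the piped version omits the data argument.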
Some pointers for this week:
- The pipe is powerful! Pay special attention to how you use and read commands that use the pipe. The pipe creates a flow from one data frame to the next.
- Read the documentation any time the authors or the solutions suggest it. You will begin to understand how and where to find help for functions and commands, and you will learn how to read and search the documentation.
- Pay special attention to the RStudio shortcuts and tips that the authors share.
- Practice using all four panes of RStudio.
- Practice creating, saving, storing, and sharing *.R files in the editor and file manager. Working in a script file is a much more forgiving way to enter R commands and to document and share your work.

Outline

2 Workflow: basics

2.1 Coding basics
- assignment, setting values of variables
2.2 Comments
- "notes to your future self" - a beautiful analogy from the authors
2.3 What’s in a name?
- naming variables and files to improve the consistency of your workspace
2.4 Calling functions
- functions are tools that transform your data
2.5 Exercises
2.6 Summary
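The Chapter 2 topics in this outline can be sketched in a few lines of base R. The object names here are hypothetical and chosen only to show the naming style.

```r
# 2.1 Assignment: create an object with <-
sample_size <- 25

# 2.2 Comments: a note to your future self about WHY, not just what.
# The delay is stored in minutes so it matches the rest of the data.
mean_delay_minutes <- 12.5

# 2.3 Names: lowercase words separated by underscores (snake_case)
flights_per_day <- 3

# 2.4 Calling functions: function_name(argument = value)
one_to_ten <- seq(from = 1, to = 10)  # the integers 1 through 10
odd_numbers <- seq(1, 10, by = 2)     # arguments can also be matched by position

one_to_ten
odd_numbers
```

Assignment does not print anything. To see a value, type the object's name on its own line, as in the last two lines.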

3 Data transformation

3.1 Introduction
- dplyr is meant to stand for "data pliers," as in the pliers in your toolbox in the garage.
3.2 Rows
- main functions: filter(), arrange(), and distinct()
- helper function: desc()
3.3 Columns
- main functions: mutate(), select(), rename(), and relocate()
3.4 The pipe
- learn how to read the flow of data frames in multistep commands that use the pipe
- learn the keyboard shortcut for the pipe!
3.5 Groups
- main functions: group_by() and summarize()
- helper functions: ungroup() and the slice_*() functions
- the .by argument is a relatively new and useful tool for grouping
3.6 Case study: aggregates and sample size
- there are a couple of interesting points in this small "case study"
  - "variation decreases as sample size increases": you already understand this general truth, but it is useful to remember and to see illustrated. It is closely related to what statistics calls "the law of large numbers."
  - note the interaction between the pipe workflow of manipulating data frames using |> and the layer workflow of adding layers to plots using +. This is a powerful technique.
  - in the graphic in this section, note the authors' use of transparency with alpha to illustrate density in a plot with a very large number of points
3.7 Summary
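The grouping tools and the case study in Chapter 3 can be imitated on simulated data. The sketch below is not the authors' example: the table `games`, its columns, and the success rate of 0.25 are all assumptions invented here, and the `.by` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(ggplot2)

set.seed(225)  # make the simulated data reproducible

# Simulated (hypothetical) data: one row per game played.
# Higher-numbered players appear more often, and every player
# has the same true success rate of 0.25.
games <- tibble(
  player = sample(1:200, size = 20000, replace = TRUE, prob = 1:200),
  hit    = rbinom(20000, size = 1, prob = 0.25)  # 1 = success, 0 = failure
)

# group_by() + summarize(): one row per player
by_player <- games |>
  group_by(player) |>
  summarize(games = n(), rate = mean(hit))

# The .by argument computes the same summary without creating a grouped
# data frame; its rows come back in order of first appearance, not sorted
by_player_2 <- games |>
  summarize(games = n(), rate = mean(hit), .by = player)

# A slice_*() helper: the three players with the most games (ties are kept)
by_player |> slice_max(games, n = 3)

# The pipe (|>) hands the data frame to ggplot(); from there layers are added with +
by_player |>
  ggplot(aes(x = games, y = rate)) +
  geom_point(alpha = 1 / 4)  # transparency makes overlapping points visible
```

In the resulting plot, players with few games should show success rates scattered widely around 0.25, while players with many games cluster more tightly, which is the pattern the case study describes.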

Assessments

Deadlines and File Submission
- Assessment deadlines will be 11:59pm each Saturday.
- All assessments are submitted to the Homework Folder inside your assigned Google Drive folder.
- There are no make-ups for missed assessments. Contact me before a deadline if you have an issue meeting the deadline and we will find a mutually agreeable solution.
Homework
- Homework 2 (due Saturday, January 18)
  - Download the R script file homework_02.R.
  - Open the R script file in RStudio and complete the exercises contained in the file by adding the appropriate R commands.
  - Upload the completed R script file to your Homework Folder.
  - additional resources
    - Average_Flight_Delays.pdf
  - solutions
    - homework_02_solution.R

Handouts

Announcements
- Schmidt Award
- Math Graduate Award
- Honors Program introduction
Required Materials:
- R for Data Science (2e), Wickham, Çetinkaya-Rundel, Grolemund.
- R for Data Science (2e): Solutions to Exercises, Ghaffar, Person, and Çetinkaya-Rundel.
- R and RStudio (open source edition) and an account at RStudio Cloud.

Videos

Technology

Communication
- Zoom
- Joining a Zoom Webinar
RStudio
- RStudio Cloud
- RStudio