This week you are beginning to investigate the organization and manipulation of
data.
At the beginning of the class, we are practicing on data that is already
well organized in the format that we call tidy data. Tidy data
is a standard for organizing data that we will formally discuss in Chapter 5.
You are already accustomed to organizing and consuming information in tables and
spreadsheets. Data science begins with the concept of a "table" and makes it much
more fluid, functional, and formal.
A data frame is a table of information organized into
observations (rows) and variables (columns). The
information stored in a data frame for a given observation and a given
variable is called a value.
A tibble is a data frame with some additional organization that
allows us to easily manipulate the observations, variables, and values.
Tidy data is data that has been organized according to specific
rules to maximize efficiency and reduce redundancy.
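The vocabulary above can be seen in a small sketch. The `pets` table and its column names are hypothetical, invented only for illustration, and the example assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical toy data: three observations (rows) and two variables (columns)
pets <- tibble(
  name = c("Rex", "Tom", "Kiwi"),
  age  = c(3, 7, 1)
)

n_observations <- nrow(pets)     # how many rows (observations)
n_variables    <- ncol(pets)     # how many columns (variables)
one_value      <- pets$age[2]    # the value of `age` for the second observation

pets
```

Printing `pets` shows the column types under the column names, one of the conveniences a tibble adds to a plain data frame.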
Major functions for manipulating data frames:
filter(), arrange(), and
distinct() manipulate observations
mutate(), select(),
rename(), and relocate()
manipulate variables
Some valuable helper functions:
desc(), group_by(),
summarize(), and the slice_*() functions
This week you are introduced to the pipe, |>, an
efficient tool for writing and organizing R commands that manipulate data frames.
Some pointers for this week:
The pipe is powerful! Pay special attention to how you use and read
commands that use the pipe. The pipe creates a flow from one data frame
to the next.
Read the documentation anytime the authors or solutions suggest. You will
begin to understand how and where to find help for functions and commands,
and you will learn how to read and search the documentation.
Pay special attention to the RStudio shortcuts and tips that the authors
share.
Practice using all four panes of RStudio.
Practice creating, saving, storing, and sharing *.R files in the
editor and file manager. This is a much more forgiving way to enter R
commands, document, and share your work.
"notes to your future self" - a beautiful analogy of the authors
2.3 What’s in a name?
naming variables and files to improve the consistency of your workspace
2.4 Calling functions
functions are tools that transform your data
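As a minimal sketch of calling a function, the example below uses base R's `seq()`. The object names `x` and `y` are arbitrary choices for illustration.

```r
# Call a function by writing its name, then its arguments inside parentheses.
# Naming the arguments makes the call easier to read.
x <- seq(from = 1, to = 10, by = 3)

# Arguments can also be matched by position, in the order the function expects.
y <- seq(1, 10, 3)

# Both calls produce the same sequence.
same_result <- identical(x, y)
x
```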
2.5 Exercises
2.6 Summary
3 Data transformation
3.1 Introduction
dplyr is meant to stand for "data pliers", as in the
pliers in your toolbox in the garage.
3.2 Rows
main functions: filter(), arrange(), and
distinct()
helper function: desc()
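A short sketch of the row functions, using a hypothetical `trips` table made up for illustration (it is not data from the book). It assumes dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; note that the Boston row appears twice
trips <- tibble(
  city  = c("Austin", "Boston", "Austin", "Denver", "Boston"),
  miles = c(120, 45, 300, 80, 45)
)

long_trips  <- filter(trips, miles > 50)     # keep only rows that meet a condition
by_distance <- arrange(trips, desc(miles))   # sort rows, largest miles first
unique_rows <- distinct(trips)               # drop exact duplicate rows
```

None of these functions changes `trips` itself. Each one returns a new data frame, which is why the results are assigned to new names.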
3.3 Columns
main functions: mutate(), select(), rename(),
and relocate()
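The column functions can be sketched one step at a time. The `runs` table and its column names are hypothetical, and the example assumes dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
runs <- tibble(
  runner  = c("Ana", "Ben", "Cy"),
  km      = c(5, 10, 21),
  minutes = c(30, 55, 130)
)

with_pace <- mutate(runs, pace = minutes / km)        # add a new variable
renamed   <- rename(with_pace, distance_km = km)      # new_name = old_name
moved     <- relocate(renamed, pace, .after = runner) # change column order
trimmed   <- select(moved, runner, pace)              # keep only some variables
```

Each step takes the data frame from the previous step as its first argument, which is the pattern the pipe is designed to simplify.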
3.4 The pipe
learn how to read the flow of data frames in multistep commands that use
the pipe
learn the keyboard shortcut for the pipe!
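To practice reading the flow, compare a nested command with the same command written with the pipe. The `runs` table is hypothetical, and the sketch assumes dplyr, tibble, and R 4.1 or later (needed for `|>`).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
runs <- tibble(
  runner  = c("Ana", "Ben", "Cy"),
  km      = c(5, 10, 21),
  minutes = c(30, 55, 130)
)

# Nested form: read from the inside out
nested <- arrange(filter(runs, km >= 10), desc(minutes))

# Piped form: read top to bottom, saying "then" at each |>
piped <- runs |>
  filter(km >= 10) |>
  arrange(desc(minutes))

same_result <- isTRUE(all.equal(nested, piped))
```

In RStudio, Ctrl+Shift+M (Cmd+Shift+M on a Mac) inserts a pipe. Whether it inserts `|>` or `%>%` depends on the "Use native pipe operator" setting in the Code options.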
3.5 Groups
main functions: group_by() and summarize()
helper functions: ungroup() and the slice_*() functions
the .by argument is a relatively new and useful tool for grouping
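A minimal sketch of grouping, shown both with group_by() and with the .by argument. The `trips` table is hypothetical, and the .by form assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data
trips <- tibble(
  city  = c("Austin", "Boston", "Austin", "Denver", "Boston"),
  miles = c(120, 45, 300, 80, 60)
)

# Grouped summary: one row per city
by_group <- trips |>
  group_by(city) |>
  summarize(total = sum(miles), n = n())

# The same summary, grouping only for this one operation
by_arg <- trips |>
  summarize(total = sum(miles), n = n(), .by = city)

# Longest trip within each city, then remove the grouping
longest <- trips |>
  group_by(city) |>
  slice_max(miles, n = 1) |>
  ungroup()
```

With .by there is no grouping left over afterward, so no ungroup() step is needed.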
3.6 Case study: aggregates and sample size
there are a couple of interesting points in this small "case study"
"variation decreases as sample size increases" - that is a general
truth that you understand, but it is useful to remember and useful
to see illustrated. In statistics, this idea is closely related to
"the law of large numbers"
note the interaction between the pipe workflow of manipulating
data frames using |> and the layer workflow of adding
layers to plots using +. This is a powerful technique
in the graphic in this section, note the authors' use of
transparency with alpha to illustrate density in a plot
with a very large number of points
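The points in this case study can be tried on simulated data. The sketch below is not the authors' example: the simulation, the 0.3 probability, and all object names are assumptions made for illustration. It assumes dplyr, tibble, and ggplot2 are installed.

```r
library(dplyr)
library(tibble)
library(ggplot2)

set.seed(42)

# Hypothetical simulation: 1000 groups, each with its own sample size,
# all drawn with the same underlying success probability of 0.3
sizes <- sample(5:500, size = 1000, replace = TRUE)
sim <- tibble(
  sample_size = sizes,
  avg = rbinom(length(sizes), size = sizes, prob = 0.3) / sizes
)

# Spread of the averages for small versus large samples
small_sd <- sim |> filter(sample_size < 50)  |> pull(avg) |> sd()
large_sd <- sim |> filter(sample_size > 400) |> pull(avg) |> sd()

# Pipe into ggplot(), then switch from |> to + to add layers;
# alpha makes overlapping points show up as darker regions
p <- sim |>
  filter(sample_size >= 10) |>
  ggplot(aes(x = sample_size, y = avg)) +
  geom_point(alpha = 0.2)
```

Printing `p` draws the plot. The averages fan out widely at small sample sizes and tighten around 0.3 as the sample size grows.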
Assessment deadlines will be 11:59pm each Saturday.
All assessments are submitted to the Homework Folder inside your assigned
Google Drive folder.
There are no make-ups for missed assessments. Contact me before a deadline
if you have an issue meeting the deadline and we will find a mutually
agreeable solution.