We are finishing up Part III: Transform of the book. If you want to say that
you have taken a first course in data science, you must know the first three
parts of our book. By "must know" I mean that you have worked through them and
are continuously improving your skills. It will take some time before you are
comfortable that you are using them effectively.
We have one more topic in Transform next week, the idea of combining
data frames to make new data frames with joins. That is a relatively
technical topic that will require some concentration.
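As a small preview of what a join does, here is a minimal sketch using dplyr. The two tables (orders and customers) and their columns are made up for illustration and are not part of any course data set.

```r
library(dplyr)

# Made-up tables for illustration only
orders <- tibble(order_id = 1:3, customer_id = c(10, 20, 10))
customers <- tibble(customer_id = c(10, 20), name = c("Ana", "Ben"))

# left_join() keeps every row of orders and adds the matching
# columns from customers, matched on the shared key customer_id
orders_named <- left_join(orders, customers, by = "customer_id")
orders_named
```

The result is a new data frame with one row per order and the customer name attached to each.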
This week I want you to read three relatively small chapters: Chapter 16
Factors, Chapter 17 Dates and Times, and Chapter 18 Missing Values. Together
they are a fair amount of content.
Factors - organizing, modifying, and incorporating categorical data
in R.
Dates and Times - surprisingly messy and varied, around the world
and even within countries. If you are going to encounter actual global data,
date and time wrangling is unavoidable.
Missing Values - we wish we always had complete data, but you
know from your own experience how much information is "missing". Your
main defense, just as in your daily life, is to know what information is
missing and where.
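To give a feel for the three topics before you read, here is a minimal sketch in R. All values are made up for illustration, and it assumes the lubridate package is installed (it comes with the tidyverse).

```r
library(lubridate)

# Factors: categorical data with an explicit, meaningful level order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)   # levels in the order we chose, not alphabetical
table(sizes)    # counts per level

# Dates and Times: the same day written three different ways
d1 <- ymd("2024-03-05")   # year-month-day
d2 <- mdy("03/05/2024")   # month/day/year
d3 <- dmy("05/03/2024")   # day/month/year
d1 == d2 && d2 == d3      # all three parse to the same Date

# Missing Values: find out how many are missing and where
temps <- c(61, NA, 58, 64, NA)
sum(is.na(temps))           # how many values are missing
which(is.na(temps))         # positions of the missing values
mean(temps, na.rm = TRUE)   # summarize while ignoring the NAs
```

Notice that "03/05/2024" and "05/03/2024" are the same day only because we told R which convention each string follows. That ambiguity is exactly why date wrangling gets messy.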
Even though we have done some experimentation and investigation with data, the
data has still been mostly handed to you. When you find your own data that is
interesting to you, and make something of it, that is when you have really caught
the data science bug.
If you would like to see people who have really caught the bug and put data
science to work for the public good, I would suggest that you subscribe to the
New York Times. Their reporting covers politics, air travel, business,
economics, health, sports, ..., often with some deep and impressive
data analysis. You can find an introductory subscription to their online
content for $1 a week for one year.
Major news outlets are clearly commercial applications of data that advocate
for their own perspectives. They do not always readily share the raw data and
techniques behind their analysis because this is their business, but with what
you have learned you can start to imagine how professionals do things and you
can start to imitate them. This is the beginning of the rest of your journey.
To that end I am giving you a different kind of assignment to submit this
Saturday. I am posting two data sets,
delta_flight_275.csv
and
joann_locations.csv,
that I want you to investigate. You can certainly open them as text files or in
Excel, but try to import them into R with your knowledge of "read_csv()". Think
about how to clean them up and what you might do with them.
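One possible way to take a first look is sketched below. The helper peek_csv() is a hypothetical name of my own, not a function from any package. It reads a file with read_csv() and reports each variable's type and how many of its values are missing. The sketch assumes the two files have been saved in your R working directory and skips any file it cannot find.

```r
library(readr)

# Hypothetical helper: read a CSV file and list, for every column,
# its name, its type, and the number of missing values
peek_csv <- function(path) {
  data <- read_csv(path, show_col_types = FALSE)
  data.frame(
    variable  = names(data),
    type      = vapply(data, function(col) class(col)[1], character(1)),
    n_missing = vapply(data, function(col) sum(is.na(col)), integer(1)),
    row.names = NULL
  )
}

# Assumes the files are in the working directory; skipped otherwise
for (path in c("delta_flight_275.csv", "joann_locations.csv")) {
  if (file.exists(path)) {
    print(peek_csv(path))
  }
}
```

A summary like this is only a starting point. You will still want to look at the actual rows to decide what each observation represents.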
Your assignment this week is to look at both of these data sets and give me a
rough description of what they are about. What are the observations, variables,
and values? You will propose three things that you can learn from each data set
by analyzing and visualizing in R. See our Week 12 assessments for a complete
description of your assignment.