<p><a href="http://spring18.cds101.com/">CDS-101: Introduction to Computational and Data Sciences</a></p>
<h1>Final Portfolio</h1>
<p>Dr. Glasbrenner</p>
<p>In lieu of a written final exam, you will construct a course portfolio of three RMarkdown files containing annotated R functions, topic suggestions for a follow-up course to <span class="caps">CDS</span>-101, and a comparative discussion of two simulations.</p>
<p><span class="h3"><strong>Due:</strong> May 11, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/final_portfolio.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/2Jx2sVq0"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Final Portfolio</strong></a></p></li>
<li><p><a href="https://jkglasbrenner.github.io/ising-model-js/"><i class="fa fa-link" data-fa-transform="grow-2"></i> Ising Model simulation</a></p></li>
<li><a href="https://jkglasbrenner.github.io/schelling-model-js/"><i class="fa fa-link" data-fa-transform="grow-2"></i> Schelling’s Model of Segregation simulation</a></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p>In lieu of a written final exam, you will construct a personal course portfolio containing three elements:</p>
<ol type="1">
<li><p><a href="#annotated-list-of-r-functions">An annotated list of R functions</a></p></li>
<li><p><a href="#reflection-and-further-learning-response">Reflection and further learning response</a></p></li>
<li><p><a href="#comparative-discussion-of-simulations">Comparative discussion of related simulations</a></p></li>
</ol>
<p>You will be provided with a starter repository on Github that must be used for putting together your individual portfolio, following the instructions provided below. The portfolio must be submitted by the end of the day on <strong>May 11, 2018</strong> by issuing a <em>Pull Request</em> in the usual way.</p>
<p>In addition to the portfolio, students will also meet with the course instructor for a 5–10 minute final interview on <strong>May 15, 2018</strong> during the scheduled final exam period. If you have an unavoidable conflict with this time slot, you must inform the instructor and schedule another interview time during the final exam period.</p>
<h2 id="grading">Grading</h2>
<p>The assessment of your portfolio and final interview will be combined into a single grade, with the portfolio worth 70% and the final interview worth 30%. <a href="/syllabus.html#breakdown">As per the syllabus</a>, the combined grade is worth 25% of your overall course grade.</p>
<h2 id="portfolio-guidelines">Portfolio guidelines</h2>
<p>Consider the following guidelines as you get started with putting together your portfolio, <strong>but please note that this list is not exhaustive</strong>. It is important that you fulfill the basic requirements for each element, but organization, thoroughness, and creativity are encouraged. Grades will reflect the overall quality of the portfolio.</p>
<h3 id="annotated-list-of-r-functions">Annotated list of R functions</h3>
<p><strong>Purpose</strong> — A significant portion of the course is dedicated to learning how to use R and the <span class="monospace">tidyverse</span> ecosystem to wrangle and explore data, and then analyze it using statistics. Remembering the different functions, how they are used, and what problems they solve requires consistent practice and review. This is why it’s important to have personal notes that explain how a function works in your own words, which can be referenced long after you’ve completed this course.</p>
<p><strong>For your portfolio</strong> — Put together an annotated list of R functions for your portfolio by adding information to the template file in your starter repository. The template file comes with a pre-filled example covering the <span class="monospace">ggplot2</span> syntax and how to use <code>geom_histogram()</code>, and it lists the other sections that are expected to be part of your notes. Use this example as inspiration for how to put together notes on the other functions in R.</p>
<p>Unless told otherwise, when writing notes for a function, please include the following:</p>
<ul>
<li><p>The function’s name</p></li>
<li><p>The important inputs</p>
<ul>
<li><p>If the input was used in class or for a homework assignment, then it’s important</p></li>
<li><p>For <span class="monospace">ggplot2</span> functions, include a separate subsection for important aesthetic mappings (see example in your template file)</p></li>
</ul></li>
<li><p>A summary — 1 or 2 sentences — of what the function does</p></li>
<li><p>An example showing how to use the function</p>
<ul>
<li><strong>Your example cannot use the same dataset as the examples found in the help documentation or from class.</strong> For example, when explaining the <span class="monospace">dplyr</span> functions you should not use the <code>presidential</code> dataset, but you <em>can</em> use a dataset from one of the homework assignments.</li>
</ul></li>
</ul>
<p>Your notes should discuss the following packages and functions:</p>
<ul>
<li><p><span class="monospace">base</span> R</p>
<ul>
<li><code>c()</code> and <code>nrow()</code></li>
</ul></li>
<li><p>The <a href="http://ggplot2.tidyverse.org/"><span class="monospace">ggplot2</span></a> package</p>
<ul>
<li><p><code>geom_histogram()</code>, <code>geom_point()</code>, <code>geom_bar()</code>, <code>geom_col()</code>, <code>geom_qq()</code>, <code>geom_smooth()</code>, <code>facet_wrap()</code>, <code>facet_grid()</code></p></li>
<li><p>Basic instructions on how to change the axes labels and give the plot a title</p></li>
</ul></li>
<li><p>The <a href="http://readr.tidyverse.org/"><span class="monospace">readr</span></a> package</p>
<ul>
<li><code>read_csv()</code></li>
</ul></li>
<li><p>The <a href="http://dplyr.tidyverse.org/"><span class="monospace">dplyr</span></a> package</p>
<ul>
<li><code>select()</code>, <code>slice()</code>, <code>rename()</code>, <code>arrange()</code>, <code>filter()</code>, <code>mutate()</code>, <code>group_by()</code>, <code>summarize()</code>, <code>count()</code></li>
</ul></li>
<li><p>The <a href="http://tibble.tidyverse.org/"><span class="monospace">tibble</span></a> package</p>
<ul>
<li>Instructions on manually creating a <code>tibble</code> using <code>data_frame()</code></li>
</ul></li>
<li><p>The <a href="http://tidyr.tidyverse.org/"><span class="monospace">tidyr</span></a> package</p>
<ul>
<li><code>gather()</code>, <code>separate()</code></li>
</ul></li>
<li><p>The <a href="https://github.com/hadley/rvest"><span class="monospace">rvest</span></a> package</p>
<ul>
<li>Instead of notes for each function, describe the procedure for using the <a href="http://selectorgadget.com/">SelectorGadget extension</a> with <code>read_html()</code>, <code>html_nodes()</code>, and <code>html_text()</code> to perform basic webscraping tasks</li>
</ul></li>
<li><p>Statistics functions</p>
<ul>
<li><p>Summary statistics functions: <code>mean()</code>, <code>median()</code>, <code>sd()</code>, <code>IQR()</code>, <code>min()</code>, and <code>max()</code></p></li>
<li><p>Percentiles: <code>quantile()</code></p></li>
<li><p>Instructions on extracting the <strong>values</strong> of the cumulative distribution function using <code>ggplot_build()</code> and <code>stat_ecdf()</code> from the <span class="monospace">ggplot2</span> package</p></li>
<li><p>Linear modeling: <code>lm()</code></p></li>
</ul></li>
<li><p>The <a href="https://github.com/andrewpbray/infer"><span class="monospace">infer</span></a> package</p>
<ul>
<li><p>Describe the inputs for the four functions: <code>specify()</code>, <code>hypothesize()</code>, <code>generate()</code>, and <code>calculate()</code></p></li>
<li><p>Instructions <strong>with an example</strong> on how you use <span class="monospace">infer</span> to conduct a hypothesis test: how you simulate the null distribution, and how you calculate the <em>p</em>-value</p></li>
<li><p>Instructions <strong>with an example</strong> on how you use <span class="monospace">infer</span> to find a confidence interval using bootstrapping, including how you would find different size intervals, for example a 90% interval or a 95% interval</p></li>
</ul></li>
<li><p>The <a href="https://github.com/tidyverse/modelr"><span class="monospace">modelr</span></a> package</p>
<ul>
<li>Instructions on how to use <code>data_grid()</code>, <code>add_residuals()</code>, <code>add_predictions()</code>, <code>seq_range()</code>, and <code>geom_ref_line()</code> to obtain the predictions and residuals of a model, as well as create visualizations that allow you to inspect the quality of the model</li>
</ul></li>
</ul>
<p>You are welcome to include notes on additional functions not listed here. Remember, these notes are ultimately for you, so think about what you would find helpful when you refer back to these.</p>
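<p>As a rough illustration of the idea behind the bootstrapped confidence intervals listed above under <span class="monospace">infer</span>, the sketch below uses only base R and the built-in <code>mtcars</code> dataset. The helper name <code>percentile_ci()</code> and the choice of dataset are assumptions made for this example only; your notes should show the <span class="monospace">infer</span> workflow itself and use a dataset that satisfies the rules above.</p>
<pre class="r"><code>set.seed(101)  # make the resampling reproducible
mpg <- mtcars$mpg

# Resample the data with replacement many times, recording the mean each time
boot_means <- replicate(2000, mean(sample(mpg, size = length(mpg), replace = TRUE)))

# Hypothetical helper (not part of infer): the middle `level` fraction of the
# bootstrap distribution gives a percentile confidence interval
percentile_ci <- function(stats, level) {
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

ci_90 <- percentile_ci(boot_means, level = 0.90)
ci_95 <- percentile_ci(boot_means, level = 0.95)</code></pre>
<p>Because both intervals are read off the same bootstrap distribution, changing the interval size only changes which percentiles are used as the cutoffs.</p>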
<h3 id="reflection-and-further-learning-response">Reflection and further learning response</h3>
<p><strong>Purpose</strong> — You’ve practiced and reviewed a lot of different skills this semester during in-class demos, group work, posts about the readings, homework assignments, and the midterm project. You’ve learned a lot! As with any discipline, there are many more topics to learn about in the data science field that we did not have enough time to cover here, and there is far more depth and complexity to the various topics we did cover. Imagine for a moment that an official sequel to this course (let’s call it <span class="caps">CDS</span>-103) was being developed, and you have been asked to provide input that will help in designing the topic list and schedule. Reflect on what you’ve learned in this course, the skills you’ve found most valuable, which data science topics we discussed that you now want to learn even more about, and the topics you wish we had covered in <span class="caps">CDS</span>-101 but did not. For this part of the portfolio, you will turn these thoughts into two concrete suggestions:</p>
<ol type="1">
<li><p>A topic that we introduced in <span class="caps">CDS</span>-101 that deserves more in-depth discussion and practice exercises</p></li>
<li><p>A data science topic that we did not cover in <span class="caps">CDS</span>-101 that should be covered in a follow-up course</p></li>
</ol>
<p><strong>For your portfolio</strong> — Your response should be at least two paragraphs in length — one paragraph per suggestion minimum — with at least 3 sentences per paragraph. While there is an element of opinion here, your suggestions must be <strong>substantive</strong> and explained/supported by referring to content from the course and additional information you look up on your own. If you refer to content from this course, mention the assignment, reading, or class we covered it in (see the calendar on <a href="http://spring18.cds101.com" class="uri">http://spring18.cds101.com</a>). If you refer to ideas you found elsewhere, provide a citation.</p>
<p>Consider the following guidelines and suggestions when completing your write-up:</p>
<ul>
<li><p>Keep in mind that the only prerequisite for a hypothetical <span class="caps">CDS</span>-103 class would be completing <span class="caps">CDS</span>-101, so while it would be a lot of fun to have in-depth coursework on neural networks, the level of math and programming skill necessary to cover this kind of topic properly is too high for a 100-level class.</p></li>
<li><p>When suggesting a topic for more in-depth coverage, explain your reasoning for why it should receive more attention. What aspects of the topic should be focused on, and how will this improve your understanding of the topic? Be precise!</p></li>
<li><p>When suggesting a new topic that should be covered, provide a reference to where you heard about it. Also explain why this is a good topic to cover, and why it should be considered as a top choice over other possible topics. In your justification, you may want to explain how learning about this topic would be useful to you. For example, if you are not a <span class="caps">CDS</span> major, does it intersect with your chosen major, and if so, how? If you are a <span class="caps">CDS</span> major, will it better prepare you for a later class, and is this topic covered in any of the other classes provided by the <span class="caps">CDS</span> department?</p></li>
<li><p>While programming is an important aspect of data science, it is ultimately just a tool and is not an end unto itself. This means that suggestions should <strong>not</strong> be of the form “learn how to <em>X</em> using R”. Suggestions that primarily focus on a programming topic need to be justified by citing data science methods that require them.</p></li>
<li><p>Do not suggest that we cover the mechanism-driven simulations (this includes agent-based simulations) described in the next section, <a href="#comparative-discussion-of-simulations"><strong>Comparative discussion of simulations</strong></a>. As discussed, these are approaches that generate their own data and predictions without the use of datasets, and so this is generally beyond the scope of a follow-up <span class="caps">CDS</span>-103 course. Additionally, the <span class="caps">CDS</span> department has several courses that touch on the methods and concepts behind these types of simulations, including <span class="caps">CDS</span>-201, <span class="caps">CDS</span>-205, <span class="caps">CDS</span>-230, and <span class="caps">CDS</span>-411.</p></li>
<li><p>If you do not know anything about the topic you are suggesting, you are expected to take the time to look up some basic information so that you can describe it competently. This is not a blog post or journal entry, so you will have to look up information in order to produce a write-up that is eligible for full credit.</p></li>
</ul>
<h3 id="comparative-discussion-of-simulations">Comparative discussion of simulations</h3>
<p><strong>Purpose</strong> — The field of computational and data sciences extends beyond the topics focused on in this course. From the <span class="caps">CDS</span>-101 perspective, we’ve used the terms <strong>model</strong> and <strong>simulation</strong> as shorthand for <em>data-driven</em> models and simulations. Yet, there is an alternative approach to model and simulation building that works in the opposite direction and is an indispensable tool in the natural sciences, engineering, and computational social sciences. This class of models and simulations <em>generates</em> predictions and data without using an underlying dataset as input. To distinguish these from their <em>data-driven</em> counterparts, we will refer to them as follows:</p>
<ul>
<li>A <em>microscopic</em> or <em>mechanism-driven</em> model or simulation is based on the known laws of nature. An example is deriving equations of motion for the planets in our solar system using Newton’s law of universal gravitation.</li>
</ul>
<p>After building this kind of model or simulation, the researcher will scan the model’s parameter space and look for trends in the predictions and outputs, which are then compared against experimental data (if available). If the model or simulation generates predictions or outputs that accord with the experimental data, then the proposed mechanism can be regarded as a plausible explanation for observed trends. However, if the predictions or outputs fail to agree with the experimental data, then the model or simulation is falsified and the proposed mechanism is rejected.</p>
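<p>To make the idea of scanning a parameter space more concrete, here is a minimal sketch of a mechanism-driven simulation in R. It is a toy written for this illustration, not the code behind the linked simulations: the function name <code>simulate_ring()</code>, the one-dimensional ring of spins, the Metropolis update rule, and every parameter value are assumptions. No dataset goes in; the rule generates the output, and the loop over temperatures is the parameter scan.</p>
<pre class="r"><code># Toy mechanism-driven simulation (hypothetical, for illustration only):
# Metropolis updates on a 1D ring of +1/-1 spins at a fixed temperature
simulate_ring <- function(temp, n_spins = 50, n_steps = 20000, coupling = 1) {
  spins <- sample(c(-1, 1), n_spins, replace = TRUE)
  for (step in seq_len(n_steps)) {
    i <- sample.int(n_spins, 1)
    left <- spins[if (i == 1) n_spins else i - 1]
    right <- spins[if (i == n_spins) 1 else i + 1]
    # Energy change from flipping spin i when E = -coupling * sum(s_i * s_{i+1})
    delta_e <- 2 * coupling * spins[i] * (left + right)
    # Always accept flips that lower the energy; otherwise accept with
    # probability exp(-delta_e / temp)
    if (delta_e <= 0 || runif(1) < exp(-delta_e / temp)) {
      spins[i] <- -spins[i]
    }
  }
  # Output: fraction of neighboring pairs that point the same way
  mean(spins == c(spins[-1], spins[1]))
}

set.seed(101)
temps <- c(0.5, 1, 2, 4)                    # the parameter being scanned
alignment <- sapply(temps, simulate_ring)   # one simulated output per temperature</code></pre>
<p>Plotting <code>alignment</code> against <code>temps</code> is the kind of trend a researcher would then compare against experimental data; in this toy, lower temperatures should tend to leave more neighbors aligned.</p>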
<p><strong>For your portfolio</strong> — You will visit the following two webpages containing short summaries of two simulations that share a common lineage:</p>
<ul>
<li><p><a href="https://jkglasbrenner.github.io/ising-model-js/">Ising Model</a> (<a href="https://jkglasbrenner.github.io/ising-model-js/" class="uri">https://jkglasbrenner.github.io/ising-model-js/</a>)</p></li>
<li><p><a href="https://jkglasbrenner.github.io/schelling-model-js/">Schelling’s Model of Segregation</a> (<a href="https://jkglasbrenner.github.io/schelling-model-js/" class="uri">https://jkglasbrenner.github.io/schelling-model-js/</a>)</p></li>
</ul>
<p>Each page also contains the simulation itself implemented in JavaScript, which runs inside the web browser. The simulations are visual and interactive, allowing you to change a small set of parameters using a simple dashboard. After becoming familiar with the different simulations and developing a basic intuition for how each one behaves, you will then compare and contrast them in a short write-up. Your comparative discussion must be at least 2 paragraphs in length (a minimum of one paragraph per simulation) and include the following:</p>
<ul>
<li><p>At least three ways in which the simulations are similar to one another</p></li>
<li><p>For each simulation, at least one way it is <em>different</em> from the other one</p></li>
<li><p>Pick one of the simulations and suggest a feature or rule that you could add to it that would change its outputs and predictions. You only need to describe this in plain language; you are not expected to write any code. Be sure to hypothesize what you think the changes will do and what they might mean. For example, how do you anticipate that your proposed change will simulate different physical mechanisms or human behavior?</p></li>
</ul>
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/2Jx2sVq0">Github repository</a> you used for your final portfolio. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Final Portfolio, FirstName LastName</span>, replacing <span class="monospace">FirstName</span> and <span class="monospace">LastName</span> with your actual first and last name.</p></li>
<li><p>In the message box, write: <span class="monospace">My final portfolio is ready for grading @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h1>Homework 5</h1>
<p>Dr. Glasbrenner</p>
<p><span class="h3"><strong>Due:</strong> May 4, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_homework5.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/nFgO_ZG6"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Homework 5</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p>For this homework assignment, you will be guided through the process of building a regression model that predicts the market value of condominiums in New York City using a dataset published by the New York City Department of Finance.</p>
<p><a href="https://classroom.github.com/a/nFgO_ZG6">Obtain the Github repository you will use to complete homework 5</a>. It contains a starter RMarkdown file named <span class="monospace">homework_5.Rmd</span>, which you will use to do your work and write-up when completing the questions below. Remember to fill in your name at the top of the RMarkdown document and be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the <a href="#how-to-submit">How to submit</a> section below to set up a Pull Request, which will be used for feedback.</p>
<h2 id="about-the-dataset">About the dataset</h2>
<p>This dataset reports on the market valuations of condominiums in New York City for Fiscal Year 2011/2012. The data<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> comes from the New York City Department of Finance and was made available to the general public on the <a href="https://opendata.cityofnewyork.us/"><span class="caps">NYC</span> OpenData website</a> (<a href="https://opendata.cityofnewyork.us/" class="uri">https://opendata.cityofnewyork.us/</a>). The official description for the dataset is as follows:</p>
<blockquote>
<p>Condominiums and cooperatives are valued as if they were residential rental apartments. Income information from similar rental properties is applied to determine value. The Department of Finance (<span class="caps">DOF</span>) chooses similar properties to value condos and coops. Properties are selected based on a combination of factors such as: land location, income levels, building age and construction and exemptions and subsidies.</p>
</blockquote>
<p>The data in the files <span class="monospace">housing_train.rds</span> and <span class="monospace">housing_test.rds</span> was adapted from the <a href="https://www.jaredlander.com/data/">cleaned and aggregated version by data scientist Jared Lander</a> (<a href="https://www.jaredlander.com/data" class="uri">https://www.jaredlander.com/data</a>).</p>
<table>
<colgroup>
<col style="width: 22%" />
<col style="width: 77%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Variable</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">boro</span></td>
<td style="text-align: left;">Borough where building is located. New York City is divided into 5 boroughs: Manhattan, The Bronx, Brooklyn, Queens, and Staten Island.</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">neighborhood</span></td>
<td style="text-align: left;">Neighborhood of building location. The neighborhood name is assigned by the New York City Department of Finance, and in most cases is the same as the neighborhood’s common name.</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">class</span></td>
<td style="text-align: left;"><a href="http://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html">Building classification code</a> assigned by the New York City Department of Finance. There are four building classifications for the condominiums in the dataset: rental, walk-up, elevator, and co-op.</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">year_build</span></td>
<td style="text-align: left;">Year the building was built</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">units</span></td>
<td style="text-align: left;">Total number of units in the building</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">sqft</span></td>
<td style="text-align: left;">Gross square footage of the building</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">value_per_sqft</span></td>
<td style="text-align: left;">Total market value per square foot of the land and building</td>
</tr>
</tbody>
</table>
<h2 id="questions">Questions</h2>
<p>The main goal of this homework assignment is to build a model that predicts the market value per square foot — the variable <code>value_per_sqft</code> — of condominiums in New York City. We should not expect to be able to construct a model with 100% precision, but we would like to uncover trends in the data. This allows us to pose the driving question for our analysis as follows:</p>
<blockquote>
<p>What are key factors that affect the overall price of condominiums in New York City?</p>
</blockquote>
<p>When building and evaluating predictive models, it is standard protocol to split your dataset into a <strong>training dataset</strong> and a <strong>test dataset</strong>. This has already been done for you, with the training dataset loaded into the variable <code>housing_train</code> and the testing dataset loaded into <code>housing_test</code>. You will be using <code>housing_train</code> for most of the homework to build and cross-validate your models. Once you’ve selected your model, as a final step you will use it to predict the <code>value_per_sqft</code> column in the dataset stored in <code>housing_test</code>.</p>
<p>For your convenience, the helper function <code>rep_kfold_cv(data, k, model, cv_reps)</code> is loaded into your R environment and will run the code that cross-validates your models.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> This function requires four inputs:</p>
<table>
<colgroup>
<col style="width: 10%" />
<col style="width: 89%" />
</colgroup>
<thead>
<tr class="header">
<th>Input</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><span class="monospace">data</span></td>
<td>The training dataset</td>
</tr>
<tr class="even">
<td><span class="monospace">k</span></td>
<td>The number of folds to use for cross-validation</td>
</tr>
<tr class="odd">
<td><span class="monospace">model</span></td>
<td>Model to cross-validate written in the <span class="monospace">lm()</span> formula syntax (<span class="monospace">money ~ work + time</span>)</td>
</tr>
<tr class="even">
<td><span class="monospace">cv_reps</span></td>
<td>Number of times to repeat cross-validation sequence to improve statistical averaging</td>
</tr>
</tbody>
</table>
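<p>To give a sense of what repeated k-fold cross-validation involves, the sketch below implements the general procedure in base R on the built-in <code>mtcars</code> dataset. It is not the course’s <code>rep_kfold_cv()</code>: the function name <code>rep_kfold_sketch()</code> is hypothetical, it reports only an averaged mean-square error, and the real helper’s implementation and output may differ. The four inputs mirror the table above.</p>
<pre class="r"><code># Hypothetical sketch of repeated k-fold cross-validation; this is NOT the
# course's rep_kfold_cv(), whose implementation and return value may differ
rep_kfold_sketch <- function(data, k, model, cv_reps) {
  response <- all.vars(model)[1]  # name of the response variable in the formula
  rep_mse <- numeric(cv_reps)
  for (r in seq_len(cv_reps)) {
    # Randomly assign every row to one of k folds
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    sq_err <- numeric(nrow(data))
    for (i in seq_len(k)) {
      held_out <- fold == i
      fit <- lm(model, data = data[!held_out, ])        # train on the other k - 1 folds
      pred <- predict(fit, newdata = data[held_out, ])  # predict the held-out fold
      sq_err[held_out] <- (data[[response]][held_out] - pred)^2
    }
    rep_mse[r] <- mean(sq_err)  # MSE over all held-out predictions in this repetition
  }
  mean(rep_mse)  # average over the repetitions
}

set.seed(101)
cv_mse <- rep_kfold_sketch(mtcars, k = 10, model = mpg ~ wt, cv_reps = 3)</code></pre>
<p>Every row is predicted by a model that never saw it during fitting, which is why the cross-validated error is a fairer estimate of predictive power than the error on the data used for training.</p>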
<hr />
<ol class="example" type="1">
<li><p>Create the following visualizations to explore the dataset:</p>
<ul>
<li><p>A histogram of <code>value_per_sqft</code> faceted over boroughs of New York City</p></li>
<li><p>Box plots of <code>units</code> (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a <code>log10()</code> scale along the y axis (see <a href="http://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale" class="uri">http://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale</a> for how to scale the axes)</p></li>
<li><p>Box plots of <code>sqft</code> (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a <code>log10()</code> scale along the y axis</p></li>
<li><p>Scatterplots of <code>value_per_sqft</code> (y axis) versus <code>units</code> (x axis) using <code>log10()</code> scaling for <code>units</code>. Facet over two variables: boroughs <strong>and</strong> condominium classification.</p></li>
</ul>
<p>Based on your plots so far, which variables (columns) in the dataset seem to have the strongest overall impact on the condominium values?</p></li>
<li><p>The box plots of <code>units</code> in the previous question should reveal extreme outliers in the plot. Since our goal is to model general trends and not precise values, fitting to these data points may skew our model in an unhelpful way. Filter the dataset to remove these outliers (there are 3 in all).</p>
<p><strong>Note: Besides these 3 extreme points, there are other potential outliers that you might consider removing.</strong> <strong>If you detect others, you are welcome to remove them to see if it helps you build your model, provided you explain why you’re removing them.</strong></p></li>
<li><p>To begin, try building univariate (single variable) models and see how they compare with each other. Use <code>value_per_sqft</code> as your response variable and then try <code>boro</code>, <code>class</code>, <code>units</code>, and <code>sqft</code> for your explanatory variable (this means you will try out 4 different models). Plug these models into the k-fold cross-validation function <code>rep_kfold_cv()</code> with <span class="monospace">k = 10</span> and <span class="monospace">cv_reps = 3</span>. Compare the mean-square error (<span class="caps">MSE</span>), both unadjusted and adjusted, and <span class="math inline">\(R^2\)</span> for these models, noting that models with better predictive power will have lower <span class="caps">MSE</span> and higher <span class="math inline">\(R^2\)</span> scores. Which model did the best so far?</p></li>
</ol>
<hr />
<p>These next three questions can be completed for extra credit, provided you’ve already completed the first three questions in full. These will guide you through building a more complicated multivariate model and making predictions with your final model on the <code>housing_test</code> dataset.</p>
<p><strong>Answering each question correctly will give you some extra credit, but you must complete them in order.</strong> <strong>This means you shouldn’t skip question 4 and just try to answer questions 5 and 6.</strong> <strong>Submissions that skip over or provide incomplete answers for these questions will not receive any extra credit.</strong></p>
<ol start="4" class="example" type="1">
<li><p>Build and cross-validate at least 3 multivariate models that predict <code>value_per_sqft</code>, using the k-fold cross-validation parameters <span class="monospace">k = 10</span> and <span class="monospace">cv_reps = 3</span>. An example of a multivariate model is <code>value_per_sqft ~ boro + units</code>. You may also want to consider including interaction terms (see <a href="http://r4ds.had.co.nz/model-basics.html#interactions-continuous-and-categorical" class="uri">http://r4ds.had.co.nz/model-basics.html#interactions-continuous-and-categorical</a> for a quick review). For example, you might try <code>value_per_sqft ~ boro + class * sqft</code>. Which of your models performs the best? Is it significantly better than your best model in the last question?</p></li>
<li><p>Now that you’ve selected your model, train it on the full training dataset:</p>
<pre class="r"><code>final_model <- lm(model_formula, data = housing_train) </code></pre>
<p>where <code>model_formula</code> is the best performing model from the previous question.</p>
<p>To predict values in the testing set, use <code>add_predictions()</code> from the <span class="monospace">modelr</span> package to put the model predictions directly into your testing dataset. Then calculate the mean-square error for the predictions:</p>
<pre class="r"><code># test_predictions is housing_test with the pred column added by add_predictions()
test_predictions %>%
  summarize(mse = sum((value_per_sqft - pred)^2) / n())</code></pre>
<p>This score is useful because it is absolute and allows you to compare how well your model performs against all other model types. Can you do better than an <span class="caps">MSE</span> score of 2030.34?</p></li>
<li><p>To wrap up, evaluate how well your model obeys the conditions for least squares linear regression, which are summarized on page 238 of the <a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf"><em>Introductory Statistics with Randomization and Simulation</em></a> textbook. Make two plots to inspect how well your model conforms to the requirements for linear regression:</p>
<ul>
<li><p>To evaluate the residual spread, make a scatterplot of <code>(value_per_sqft - pred)</code> (y axis) versus <code>pred</code> (x axis)</p></li>
<li><p>To inspect whether the residual distribution is nearly normal, make a Q-Q plot of <code>(value_per_sqft - pred)</code>.</p></li>
</ul>
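<p>As a rough illustration of the two plots described above, here is a minimal sketch on made-up data. The names <code>toy</code>, <code>toy_model</code>, <code>toy_pred</code>, and <code>resid</code> are hypothetical and stand in for your own testing set and model; only <code>add_predictions()</code>, <code>geom_point()</code>, <code>geom_hline()</code>, and <code>geom_qq()</code> come from the packages.</p>
<pre class="r"><code>library(ggplot2)
library(modelr)

# Toy data standing in for a testing set (hypothetical values, not the housing data)
set.seed(101)
toy <- data.frame(x = runif(200, min = 0, max = 10))
toy$y <- 3 + 2 * toy$x + rnorm(200, sd = 1.5)

toy_model <- lm(y ~ x, data = toy)

# add_predictions() appends a `pred` column; the residual is observed minus predicted
toy_pred <- add_predictions(toy, toy_model)
toy_pred$resid <- toy_pred$y - toy_pred$pred

# Residuals versus predictions: look for roughly constant spread around zero
resid_plot <- ggplot(toy_pred, aes(x = pred, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed")

# Q-Q plot of the residuals: points close to a straight line suggest near-normality
qq_plot <- ggplot(toy_pred, aes(sample = resid)) +
  geom_qq()</code></pre>
<p>Printing <code>resid_plot</code> and <code>qq_plot</code> displays the figures. A funnel shape in the first plot, or strong curvature in the second, would suggest that a condition is violated.</p>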
<p>Explain whether your model obeys the conditions for least squares linear regression.</p></li>
</ol>
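<p>The train-then-test workflow in the questions above can be sketched on made-up data as follows. This is only an illustration under stated assumptions: the 80/20 split and the names <code>toy</code>, <code>toy_train</code>, <code>toy_test</code>, and <code>toy_mse</code> are hypothetical, and the assignment supplies its own training and testing sets.</p>
<pre class="r"><code>library(dplyr)
library(modelr)

# Made-up data with a known linear trend (not the housing data)
set.seed(42)
toy <- tibble(x = runif(300, 0, 10), y = 1 + 0.5 * x + rnorm(300))

# Hypothetical 80/20 split into training and testing rows
train_rows <- sample(nrow(toy), size = 240)
toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Fit on the training rows only
toy_model <- lm(y ~ x, data = toy_train)

# Predict on the held-out rows, then average the squared errors;
# mean((y - pred)^2) is the same quantity as sum((y - pred)^2) / n()
toy_mse <- toy_test %>%
  add_predictions(toy_model) %>%
  summarize(mse = mean((y - pred)^2))</code></pre>
<p>Because the model never sees the testing rows during fitting, the resulting score estimates how well the model generalizes to new data.</p>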
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/nFgO_ZG6">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Homework 5, FirstName LastName</span>, replacing <span class="monospace">FirstName</span> and <span class="monospace">LastName</span> with your actual first and last name.</p></li>
<li><p>In the messagebox, write: <span class="monospace">My homework submission is ready for grading @shuaibm @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="http://spring18.cds101.com/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/ggplot2-cheatsheet.pdf"><span class="monospace">ggplot2</span> cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-transformation-cheatsheet.pdf">Data transformation cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-import-cheatsheet.pdf">Data import cheatsheet</a></p></li>
</ul>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Data set aggregated from the following sources:<br />
<a href="https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Manhattan/dvzp-h4k9" class="uri">https://data.cityofnewyork.us/Finances/<span class="caps">DOF</span>-Condominium-Comparable-Rental-Income-Manhattan/dvzp-h4k9</a><br />
<a href="https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Brooklyn-/bss9-579f" class="uri">https://data.cityofnewyork.us/Finances/<span class="caps">DOF</span>-Condominium-Comparable-Rental-Income-Brooklyn-/bss9-579f</a><br />
<a href="https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Queens-FY/jcih-dj9q" class="uri">https://data.cityofnewyork.us/Finances/<span class="caps">DOF</span>-Condominium-Comparable-Rental-Income-Queens-<span class="caps">FY</span>/jcih-dj9q</a><br />
<a href="https://data.cityofnewyork.us/Property/DOF-Condominium-Comparable-Rental-Income-Bronx-FY-/3qfc-4tta" class="uri">https://data.cityofnewyork.us/Property/<span class="caps">DOF</span>-Condominium-Comparable-Rental-Income-Bronx-<span class="caps">FY</span>-/3qfc-4tta</a><br />
<a href="https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Staten-Is/tkdy-59zg" class="uri">https://data.cityofnewyork.us/Finances/<span class="caps">DOF</span>-Condominium-Comparable-Rental-Income-Staten-Is/tkdy-59zg</a><a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>For those that are interested in seeing how you would implement k-fold cross-validation using the <code>tidyverse</code> packages, the code for the function <code>rep_kfold_cv()</code> can be found in the file <code>repeated_kfold_cross_validation.R</code> distributed with your Github repo.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</section>
Class 28: Modeling IV2018-05-03T13:30:00-04:002018-05-03T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-05-03:/materials/class-28/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class28_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Cross-validation activity2018-05-03T13:30:00-04:002018-05-03T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-05-03:/materials/cross-validation-activity/<p>Activity on building and cross-validating models that predict the total price <code>totalPr</code> of copies of <em>Mario Kart Wii</em> sold on eBay.</p>
<div class="no-bullets">
<ul>
<li><a href="https://classroom.github.com/a/Ce_Ci3HZ"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>Cross-validation activity</strong> repo</span></a></li>
</ul>
</div>
Class 27: Modeling III2018-05-01T13:30:00-04:002018-05-01T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-05-01:/materials/class-27/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class27_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Homework 42018-04-27T23:59:00-04:002018-04-27T23:59:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-27:/assignments/homework-4/For this homework assignment, you will use statistical inference to answer a question about the National Survey of Family Growth, Cycle 6 dataset published by the National Center for Health Statistics.<p><span class="h3"><strong>Due:</strong> April 27, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_homework4.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/6qDS86I0"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Homework 4</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p>For this homework assignment, you will use statistical inference to answer a question about the National Survey of Family Growth, Cycle 6 dataset published by the National Center for Health Statistics.</p>
<p><a href="https://classroom.github.com/a/6qDS86I0">Obtain the Github repository you will use to complete homework 4</a> that contains a starter RMarkdown file named <span class="monospace">homework_4.Rmd</span>, which you will use to do your work and write-up when completing the questions below. Remember to fill in your name at the top of the RMarkdown document and be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the <a href="#how-to-submit">How to submit</a> section below to setup a Pull Request, which will be used for feedback.</p>
<h2 id="about-the-dataset">About the dataset</h2>
<p>This homework uses the <em>National Survey of Family Growth, Cycle 6</em> dataset in the file <span class="monospace">2002FemPreg.rds</span>, published by the National Center for Health Statistics. Complete descriptions of all the variables can be found in the <a href="http://www.icpsr.umich.edu/nsfg6/Controller?displayPage=codeBook&fileCode=PREG&section=&subSec="><span class="caps">NSFG</span> Cycle 6: Female Pregnancy File Codebook</a>. Below are selected descriptions of variables that will be used for this homework assignment:</p>
<table>
<colgroup>
<col style="width: 21%" />
<col style="width: 78%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Variable</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">caseid</span></td>
<td style="text-align: left;">integer <span class="caps">ID</span> of the respondent</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">prglngth</span></td>
<td style="text-align: left;">integer duration of the pregnancy in weeks</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">outcome</span></td>
<td style="text-align: left;">integer code for the outcome of the pregnancy, with a 1 indicating a live birth</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">birthord</span></td>
<td style="text-align: left;">serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank</td>
</tr>
</tbody>
</table>
<h2 id="questions">Questions</h2>
<p>This homework assignment revolves around answering the following question using this dataset:</p>
<blockquote>
<p>Do first born children either arrive early or late when compared with non-first-borns?</p>
</blockquote>
<p>The questions in this assignment will guide you through the process of answering this question using statistical inference.</p>
<ol class="example" type="1">
<li><p>Addressing the question <em>“do first born children either arrive early or late when compared with non-first-borns?”</em> means that we should only consider live births in the dataset. Filter the dataset so that it only contains outcomes with live births and assign this result to the variable <code>live_births</code>.</p>
<p>Next, we need to label all births in <code>live_births</code> as either “first” or “other” so that we can easily find the children that are first borns and the ones that are not. There are a couple of different ways that you can label the births:</p>
<ul>
<li><p>Split the dataset into two parts, a <code>first_births</code> dataset and an <code>other_births</code> dataset. Do this by applying a filter to extract the first births, then use <code>mutate()</code> to create a new column called <code>birth_order</code> that labels these rows as “first”. Then repeat this process, except apply a filter to extract all other births and label those as “other” in <code>birth_order</code>. To recombine the datasets into one, use <code>bind_rows()</code>.</p></li>
<li><p>Use <code>ifelse()</code> with <code>mutate()</code> to create the <code>birth_order</code> column and the “first” and “other” labels. You can also try using <code>case_when()</code> instead of <code>ifelse()</code> to accomplish this. If you do it this way, you won’t need to use <code>bind_rows()</code>.</p></li>
</ul>
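<p>To illustrate the second approach, here is a minimal sketch on a small made-up <code>tibble</code>. The name <code>toy_births</code> and its values are hypothetical; the column names follow the variable table above.</p>
<pre class="r"><code>library(dplyr)

# Hypothetical rows mimicking the birthord column (1 = first child)
toy_births <- tibble(
  prglngth = c(39, 40, 38, 41, 39),
  birthord = c(1, 2, 1, 3, 2)
)

# Option A: ifelse() picks one of two labels for each row
labeled_ifelse <- toy_births %>%
  mutate(birth_order = ifelse(birthord == 1, "first", "other"))

# Option B: case_when() checks conditions in order; TRUE acts as the fallback
labeled_case_when <- toy_births %>%
  mutate(birth_order = case_when(
    birthord == 1 ~ "first",
    TRUE ~ "other"
  ))</code></pre>
<p>Both options produce the same <code>birth_order</code> column, so the choice is mostly a matter of readability.</p>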
<p>After labeling the births, remove the extraneous variables from the <code>tibble</code> leaving just the <code>prglngth</code> and <code>birth_order</code> columns. Assign the resulting <code>tibble</code> to a variable named <code>pregnancy_length</code>.</p></li>
<li><p>Take the <code>pregnancy_length</code> dataset and plot a probability mass function (<span class="caps">PMF</span>) histogram of the pregnancy length in weeks that shows “first” births and “other” births on the same plot. Choose a reasonable binwidth for the histogram and add <code>coord_cartesian(xlim = c(27, 46))</code> to your plot so that the window focuses where most of the data is. <strong>After creating the plot, describe the shape, center, and spread of the two distributions</strong>. Based on the visualization, do you think the data looks like it supports the statement that “first born children either arrive early or arrive late when compared with non-first-borns”?</p></li>
<li><p>Group the variable <code>prglngth</code> into “first” and “other” births and compute the summary statistics (mean, median, standard deviation, inter-quartile range, minimum, maximum) for each group. How do the different summary statistics compare between the two distributions? Does it look like there may be a notable difference between the two distributions? Explain.</p></li>
<li><p>Create theoretical quantile-quantile (q-q) plots to determine whether the pregnancy length distributions for “first” and “other” births are nearly normal. You should make <strong>separate</strong> q-q plots for “first” births and “other” births (i.e. you should end up with two plots).</p>
<p>Each q-q plot should include a reference line that shows what the theoretical normal distribution looks like so that we can observe deviations from normality. You’ll need to compute this line manually. Adapt the following code to compute the slope and intercept for the theoretical line (you will need to do this twice, once for “first” births and again for “other” births):</p>
<pre class="r"><code>qq_x <- qnorm(p = c(0.25, 0.75))
qq_y <- quantile(x = pull(dataset, variable), probs = c(0.25, 0.75), type = 1)
qq_slope <- diff(qq_y) / diff(qq_x)
qq_int <- pluck(qq_y, 1) - qq_slope * pluck(qq_x, 1)</code></pre>
<p><strong>“Adapting the code” means you will have to change the <code>dataset</code> and <code>variable</code> parameters.</strong> Once you have found <code>qq_slope</code> and <code>qq_int</code> for “first” and “other” births, use <code>geom_abline()</code> to include the theoretical reference line in your q-q plots.</p>
<p>Based on your plots, does it appear that the pregnancy length distribution is nearly normal for both “first” and “other” births? Why or why not?</p></li>
<li><p>Returning to the question of whether or not “first babies arrive early or arrive late when compared with non-first-borns”, plot the cumulative distribution functions (CDFs) of the pregnancy lengths for “first” and “other” births. Plot the <span class="caps">CDF</span> for both distributions on the same figure so that we can directly compare them. How do the distributions compare? Does it look like there is a meaningful difference between the two distributions?</p></li>
<li><p>If we want to determine whether or not the difference between two distributions is statistically significant, we need to run a hypothesis test. Formalize your analysis by writing down the null hypothesis for the question of whether first babies arrive early or arrive late when compared with non-first-borns. It should be clear from how you write your null hypothesis whether you’re conducting a <strong>one-sided</strong> or <strong>two-sided</strong> hypothesis test. You should also restate what the <strong>observed value</strong> is (this will be a number) that you will be comparing against the null distribution.</p></li>
<li><p>Use a simulation to generate the null distribution so that you can perform the hypothesis test. Do this using the functions provided in the <code>infer</code> package, <code>specify()</code>, <code>hypothesize()</code>, <code>generate()</code>, and <code>calculate()</code>. To collect enough statistics, you should set the simulation to repeat 10,000 times.</p>
<p>Once you’ve obtained the null distribution, use it to compute the <em>p</em>-value for your hypothesis test. Assuming a significance level of <span class="math inline">\(\alpha = 0.05\)</span>, state whether we can reject the null hypothesis.</p>
<p>Finally, visualize the simulated null distribution as a histogram and use <code>geom_vline()</code> to show where the <strong>observed value</strong> sits relative to it.</p></li>
<li><p>Use a bootstrap simulation to calculate the 95% confidence interval for your <strong>observed value</strong>. Do this using the following functions from the <code>infer</code> package: <code>specify()</code>, <code>generate()</code>, and <code>calculate()</code>. To collect enough statistics, you should set the bootstrap simulation to repeat 10,000 times.</p>
<p>Once you’ve obtained the bootstrap distribution, use <code>quantile()</code> to find the upper and lower bounds of the 95% confidence interval. Does the value corresponding to the null result fall within the range of the 95% confidence interval?</p>
<p>Finally, visualize the bootstrap distribution as a histogram and use two <code>geom_vline()</code>s to show the region corresponding to the 95% confidence interval.</p></li>
<li><p>In addition to hypothesis tests and confidence intervals, we should also consider the <strong>effect size</strong>, which measures the relative difference between two distributions. The effect size helps us better know how important a given result actually is, not just whether or not we can reject the null hypothesis. One measure of the effect size is called <a href="https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d">Cohen’s <em>d</em></a> (<a href="https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d" class="uri">https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d</a>), which we will use to compute the effect size between the pregnancy lengths for “first” and “other” births. The different ranges of <em>d</em> can be interpreted using the following table:</p>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Effect size</th>
<th style="text-align: left;">d</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Very small</td>
<td style="text-align: left;">0.01</td>
</tr>
<tr class="even">
<td style="text-align: left;">Small</td>
<td style="text-align: left;">0.20</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Medium</td>
<td style="text-align: left;">0.50</td>
</tr>
<tr class="even">
<td style="text-align: left;">Large</td>
<td style="text-align: left;">0.80</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Very large</td>
<td style="text-align: left;">1.20</td>
</tr>
<tr class="even">
<td style="text-align: left;">Huge</td>
<td style="text-align: left;">2.00</td>
</tr>
</tbody>
</table>
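<p>For reference, one common definition of Cohen’s <em>d</em> divides the difference between the two group means by a pooled standard deviation. The sketch below assumes that definition; the helper <code>cohens_d_pooled()</code> and the numbers are hypothetical, and the course-provided functions may use a different convention.</p>
<pre class="r"><code># Hypothetical helper: difference in means divided by the pooled standard deviation
cohens_d_pooled <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Made-up pregnancy lengths in weeks, for illustration only
first_toy <- c(38, 39, 40, 41, 42)
other_toy <- c(37, 38, 39, 40, 41)
d_toy <- cohens_d_pooled(first_toy, other_toy)</code></pre>
<p>With these made-up numbers, <em>d</em> is about 0.63, which falls between the “Medium” and “Large” rows of the table.</p>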
<p>The following set of functions should also be preloaded for you: <code>cohens_d_bootstrap()</code>, <code>bootstrap_report()</code>, and <code>plot_ci()</code>. These functions will use bootstrap simulations to compute the confidence interval for the Cohen’s <em>d</em> parameter. Run the bootstrap simulation as follows:</p>
<pre class="r"><code>cohens_d_bootstrap(data = pregnancy_length, model = prglngth ~ birth_order)</code></pre>
<p><strong>Be sure to assign the results to a variable</strong>, for example <code>bootstrap_results</code>.</p>
<p>To print a report for the bootstrap simulation, run:</p>
<pre class="r"><code>bootstrap_report(bootstrap_results)</code></pre>
<p>To visualize the bootstrap distribution and confidence interval, run:</p>
<pre class="r"><code>plot_ci(bootstrap_results)</code></pre>
<p>Using the provided table, report how large the effect size is for the difference in pregnancy lengths for “first” and “other” births.</p></li>
<li><p>Having completed your analysis, write a paragraph that summarizes your results and answers the question “do first born children arrive early or late compared to other children?” Write the summary as if you are an academic or scientific journalist, focusing on how you can answer the question clearly, precisely, and honestly.</p></li>
</ol>
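<p>The simulation-based hypothesis test described in the questions above can be sketched on made-up data as follows. This is an illustration only: the <code>toy</code> dataset, its column names, and the choice of 1,000 repetitions are assumptions made to keep the example quick, and the <em>p</em>-value is computed by hand with <code>dplyr</code> instead of relying on a helper function.</p>
<pre class="r"><code>library(dplyr)
library(infer)

# Made-up data: two groups with slightly different means (not the NSFG data)
set.seed(2018)
toy <- tibble(
  group = factor(rep(c("first", "other"), each = 50)),
  weeks = c(rnorm(50, mean = 39.0, sd = 2.5), rnorm(50, mean = 38.5, sd = 2.5))
)

# Observed statistic: mean of "first" minus mean of "other"
obs_diff <- toy %>%
  group_by(group) %>%
  summarize(avg = mean(weeks)) %>%
  summarize(diff = avg[group == "first"] - avg[group == "other"]) %>%
  pull(diff)

# Null distribution: shuffle the group labels and recompute the difference each time
null_dist <- toy %>%
  specify(weeks ~ group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("first", "other"))

# Two-sided p-value: share of shuffled differences at least as extreme as the observed one
p_value <- null_dist %>%
  summarize(p = mean(abs(stat) >= abs(obs_diff))) %>%
  pull(p)</code></pre>
<p>Comparing <code>p_value</code> against the chosen significance level then determines whether the null hypothesis is rejected.</p>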
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/6qDS86I0">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Homework 4, FirstName LastName</span>, replacing <span class="monospace">FirstName</span> and <span class="monospace">LastName</span> with your actual first and last name.</p></li>
<li><p>In the messagebox, write: <span class="monospace">My homework submission is ready for grading @shuaibm @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="http://spring18.cds101.com/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/ggplot2-cheatsheet.pdf"><span class="monospace">ggplot2</span> cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-transformation-cheatsheet.pdf">Data transformation cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-import-cheatsheet.pdf">Data import cheatsheet</a></p></li>
</ul>
Class 26: Modeling II2018-04-26T13:30:00-04:002018-04-26T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-26:/materials/class-26/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class26_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 162018-04-26T13:30:00-04:002018-04-26T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-26:/assignments/reading-16/<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>From <a href="http://r4ds.had.co.nz/model-basics.html">chapter 23</a>: section <a href="http://r4ds.had.co.nz/model-basics.html#formulas-and-model-families">23.4</a> through to the end of section <a href="http://r4ds.had.co.nz/model-basics.html#other-model-families">23.6</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading16</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, April 28th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 25: Inference and simulations IV / Modeling I2018-04-24T13:30:00-04:002018-04-24T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-24:/materials/class-25/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class25_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 152018-04-24T13:30:00-04:002018-04-24T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-24:/assignments/reading-15/<p><span class="h3"><a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Introductory Statistics with Randomization and Simulation</a></span></p>
<p>Read the following:</p>
<ul>
<li>From chapter 5: from the beginning through to the end of section 5.1.4, section 5.4.1</li>
</ul>
<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of <a href="http://r4ds.had.co.nz/model-intro.html">chapter 22</a> (short)</li>
<li>From <a href="http://r4ds.had.co.nz/model-basics.html">chapter 23</a>: section <a href="http://r4ds.had.co.nz/model-basics.html#introduction-15">23.1</a> through to the end of section <a href="http://r4ds.had.co.nz/model-basics.html#visualising-models">23.3</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading15</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Thursday, April 26th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 24: Inference and simulations III2018-04-19T13:30:00-04:002018-04-19T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-19:/materials/class-24/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class24_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 142018-04-19T13:30:00-04:002018-04-19T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-19:/assignments/reading-14/<p><span class="h3"><em>Nature</em> News Feature article</span></p>
<p>Read the following article about <em>p</em> values:</p>
<ul>
<li><a href="http://www.nature.com/news/scientific-method-statistical-errors-1.14700">“Scientific method: Statistical errors” by R. Nuzzo</a> (<a href="https://www.nature.com/polopoly_fs/1.14700!/menu/main/topColumns/topLeftColumn/pdf/506150a.pdf"><span class="caps">PDF</span> version</a>)</li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<p>Instead of posting a question as we’ve done for the other readings, please respond to the following prompts:</p>
<ol type="1">
<li><p>Had you ever heard of this situation concerning <em>p</em>-values before this class?</p>
<ul>
<li><p><strong>If this is the first time you’ve heard this</strong>, did you find this surprising, and does it affect how you feel about science? Explain.</p></li>
<li><p><strong>If you have heard about this situation before</strong>, did the article change your perspective in any way? Explain.</p></li>
</ul></li>
<li><p>Based on the article, what practical things can we do to make sure our claims are accurate and transparent? Mention any quantities that we should compute and what kinds of details we should try to include in our RMarkdown notebooks.</p></li>
</ol>
<div class="callout secondary">
<p><strong>Students that write a full and thoughtful response that addresses both prompts will receive both a question and an answer credit.</strong> A full response consists of a minimum of two paragraphs, one for the first prompt and one for the second prompt. Each paragraph must have a minimum of three full sentences, and the content must be substantive. Posts that don’t fulfill these criteria will only be eligible for a question credit.</p>
</div>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading14</code></p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 23: Inference and simulations II2018-04-17T13:30:00-04:002018-04-17T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-17:/materials/class-23/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class23_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
infer activity2018-04-17T13:30:00-04:002018-04-17T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-17:/materials/infer-activity/<p>Activity using a dataset from the <em>Mythbusters</em> television show to practice building hypothesis tests with the <code>infer</code> package</p>
<div class="no-bullets">
<ul>
<li><a href="https://classroom.github.com/a/o-KntOw5"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>infer activity</strong> repo</span></a></li>
</ul>
</div>
Reading 132018-04-17T13:30:00-04:002018-04-17T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-17:/assignments/reading-13/<p><span class="h3"><a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Introductory Statistics with Randomization and Simulation</a></span></p>
<p>Read the following:</p>
<ul>
<li><p>From chapter 2: section 2.3 through to the end of section 2.5</p></li>
<li><p>From chapter 4: section 4.5 (skip 4.5.3)</p></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading13</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, April 19th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Homework 32018-04-16T23:59:00-04:002018-04-16T23:59:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-16:/assignments/homework-3/For this homework assignment, you will practice using the SelectorGadget Chrome extension to find the <span class="caps">CSS</span> selectors needed to scrape information from a webpage and use the rvest package to scrape data from the official Mason Patriots sports website.<p><span class="h3"><strong>Due:</strong> April 16, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_homework3.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/5GSCwIPV"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Homework 3</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p>For this homework assignment, you will practice using the <a href="https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb"><span class="monospace">SelectorGadget</span> Chrome extension</a> to find the <span class="caps">CSS</span> selectors needed to scrape information from a webpage and use the <span class="monospace">rvest</span> package to scrape data from the <a href="http://gomason.com">official Mason Patriots sports website</a>.</p>
<p><a href="https://classroom.github.com/a/5GSCwIPV">Obtain the Github repository you will use to complete homework 3</a> that contains a starter RMarkdown file named <span class="monospace">homework_3.Rmd</span>, which you will use to do your work and write-up when completing the questions below. Remember to fill in your name at the top of the RMarkdown document and be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the <a href="#how-to-submit">How to submit</a> section below to setup a Pull Request, which will be used for feedback.</p>
<h2 id="part-1-selectorgadget-practice">Part 1 – <span class="monospace">SelectorGadget</span> practice</h2>
<p>One of the ways to target specific information on the webpage is through the use of <span class="caps">CSS</span> selectors, and becoming more comfortable with them will help you build more effective webscraping code. The <a href="https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb"><span class="monospace">SelectorGadget</span> Chrome extension</a> is a convenient tool for determining the <span class="caps">CSS</span> selectors you need for a webscraping task. For a refresher on how to use the extension, <a href="http://spring18.cds101.com/doc/rvest-selectorgadget-vignette.html">review the <span class="monospace">SelectorGadget</span> vignette</a>.</p>
<p>For this part of the homework, use the <span class="monospace">SelectorGadget</span> tool to figure out the <span class="caps">CSS</span> selectors needed to scrape specific data from a linked page. <strong>You only need to report the <span class="caps">CSS</span> selectors needed to get the information and do not need to write <span class="monospace">rvest</span> code to formally extract it</strong>, although you are welcome to write code to test the <span class="caps">CSS</span> selectors if you like.</p>
<ol class="example" type="1">
<li><p>Using the <span class="monospace">SelectorGadget</span> tool, find the <span class="caps">CSS</span> selectors for the following information on the <span class="caps">IMDB</span> page for the television show <em>The Office</em>, <a href="https://www.imdb.com/title/tt0386676/" class="uri">https://www.imdb.com/title/tt0386676/</a>:</p>
<ul>
<li><p>Number of episodes</p></li>
<li><p>Certificate (<span class="caps">TV</span> Rating)</p></li>
<li><p>First five plot keywords</p></li>
<li><p>Genres</p></li>
<li><p>Runtime</p></li>
<li><p>Country</p></li>
<li><p>Language</p></li>
</ul></li>
<li><p>Using the <span class="monospace">SelectorGadget</span> tool, find the <span class="caps">CSS</span> selectors for the following information on the data.gov <em>Data Catalog</em>, <a href="https://catalog.data.gov/dataset" class="uri">https://catalog.data.gov/dataset</a>:</p>
<ul>
<li><p>Number of datasets found</p></li>
<li><p>Dataset names (example: <em>Demographic Statistics By Zip Code</em>)</p></li>
<li><p>Dataset organization (example: <em>City of New York</em>)</p></li>
<li><p>Dataset description (example: “Demographic statistics broken down by zip code”)</p></li>
<li><p>Dataset type (the ribbons on the upper-right of each row, for example: <em>Federal</em>, <em>City</em>)</p></li>
</ul></li>
</ol>
<h2 id="part-2-scraping-mason-patriots-scores">Part 2 – Scraping Mason Patriots Scores</h2>
<p>Webscrapers can be used for all kinds of purposes, such as building movie review databases, tracking prices for goods and services, and analyzing how a news story is reported on different news sites. Collecting sports data is another example, which can be used to <a href="https://nyti.ms/2FmQhgC">quantify how valuable players are</a> when <a href="https://www.villanovau.com/resources/bi/how-to-use-big-data-in-fantasy-football/">putting together a fantasy sports team</a>. Actual sports teams also employ statistical methods when drafting players and developing strategies, with <a href="https://en.wikipedia.org/wiki/Sabermetrics">Sabermetrics</a> (depicted in the movie <a href="https://www.imdb.com/title/tt1210166/"><em>Moneyball</em></a>) being one of the better-known examples.</p>
<p>For this part of the homework, we will scrape the 2017 – 2018 season schedules and scores for the men’s and women’s basketball teams on the official <a href="http://gomason.com">Mason Patriots sports site</a>. For reference, the 2017 – 2018 schedule and scores page for the men’s team should look like this:</p>
<p><img src="http://spring18.cds101.com/img/gomason-2017-2018-mens-basketball.png" title="plot of chunk mens-basketball-17-18-season-page" alt="plot of chunk mens-basketball-17-18-season-page" width="70%" style="display: block; margin: auto;" /></p>
<p>and the 2017 – 2018 schedule and scores page for the women’s team should look like this:</p>
<p><img src="http://spring18.cds101.com/img/gomason-2017-2018-womens-basketball.png" title="plot of chunk womens-basketball-17-18-season-page" alt="plot of chunk womens-basketball-17-18-season-page" width="70%" style="display: block; margin: auto;" /></p>
<p>The following questions will guide you through the process of scraping this data. You are encouraged to review the examples provided in the <a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1522347635000427">Web scraping activity</a>, the <a href="http://spring18.cds101.com/doc/class16_slides.pdf">Class 16 slides</a>, and the <a href="http://spring18.cds101.com/doc/class19_slides.pdf">Class 19 slides</a> while completing this part of the homework assignment.</p>
<h3 id="mens-basketball-schedule-and-scores">Men’s basketball schedule and scores</h3>
<ol start="3" class="example" type="1">
<li><p>To start, you need to load the men’s basketball schedule and scores page into R. Do this using one line of code, and assign the accessed page data to a variable called <code>mens_bb</code>.</p></li>
<li><p>Mason’s opponent for each game is listed on the left side of each row, just after a small box that says <em><span class="caps">VS</span></em> or <em><span class="caps">AT</span></em>. Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. Assign the scraped data to a variable called <code>mens_opponents</code>. If done right, <code>mens_opponents</code> should be a character vector containing 33 teams.</p></li>
<li><p>The location for each game is listed to the right of the opponent’s name. For games played in the United States it lists the city and state, for example “Fairfax, <span class="caps">VA</span>” is the location for home games. Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. Assign the scraped data to a variable called <code>mens_locations</code>.</p></li>
<li><p>The date for each game is listed above the opponent’s name, and has the format <em>Month Day (Day of the Week)</em>. For example, the first listed game has the date <em>Nov 10 (<span class="caps">FRI</span>)</em>. Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. Assign the scraped data to a variable called <code>mens_dates</code>.</p></li>
<li><p>The time for each game is listed to the right of the game date. For example, the first listed game has the time <em>7:00 <span class="caps">P.M.</span></em> Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. Assign the scraped data to a variable called <code>mens_times</code>.</p></li>
<li><p>The score for each game is listed on the right side of each row in the format <em>Mason’s score-Opponent’s score</em>. For example, the first listed game has the score <em>67-65</em>. Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. Assign the scraped data to a variable called <code>mens_scores</code>. Your <code>mens_scores</code> vector should have exactly 33 entries; if it has more or fewer, try another <span class="caps">CSS</span> selector.</p></li>
<li><p>The <strong>W</strong> or <strong>L</strong> to the left of each game score indicates whether Mason won (<strong>W</strong>) or lost (<strong>L</strong>) the game. Use the <span class="monospace">SelectorGadget</span> tool to determine the <span class="caps">CSS</span> selector needed to scrape this information, and then write the code that scrapes this information. You’ll note that for each game you’ll actually get <span class="monospace">W,</span> or <span class="monospace">L,</span>. Pipe (<span class="monospace">%>%</span>) your scraped data into the <code>str_remove()</code> function and tell it to get rid of the comma. Assign the scraped data to a variable called <code>mens_win_loss</code>.</p></li>
<li><p>Use the <code>data_frame()</code> function to create a tibble containing your scraped data. The columns should have the following names and be in this order:</p>
<ul>
<li><span class="monospace">date</span></li>
<li><span class="monospace">time</span></li>
<li><span class="monospace">opponent</span></li>
<li><span class="monospace">location</span></li>
<li><span class="monospace">score</span></li>
<li><span class="monospace">win_loss</span></li>
</ul>
<p>Assign the tibble to a variable called <code>mens_df</code>. This table should have 33 rows.</p></li>
</ol>
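<p>The general pattern behind these steps can be sketched on a toy page. This is only an illustration under stated assumptions: the <span class="caps">HTML</span> fragment and the class names <code>.opponent</code>, <code>.result</code>, and <code>.score</code> are invented and are not the selectors used on the Mason Patriots site, so you still need <span class="monospace">SelectorGadget</span> to find the real ones. The sketch uses <code>tibble()</code>, the current name for what <code>data_frame()</code> does.</p>

```r
library(rvest)
library(stringr)

# Hypothetical page fragment; the class names are invented for illustration
toy_page <- read_html('
  <ul>
    <li><span class="opponent">Team A</span><span class="result">W,</span><span class="score">67-65</span></li>
    <li><span class="opponent">Team B</span><span class="result">L,</span><span class="score">58-70</span></li>
  </ul>')

# Select nodes with a CSS selector, then pull out their text
toy_opponents <- toy_page %>%
  html_nodes(".opponent") %>%
  html_text()

toy_scores <- toy_page %>%
  html_nodes(".score") %>%
  html_text()

# Strip the trailing comma from "W," and "L,"
toy_win_loss <- toy_page %>%
  html_nodes(".result") %>%
  html_text() %>%
  str_remove(",")

# Combine the scraped vectors into one table
toy_df <- tibble::tibble(
  opponent = toy_opponents,
  score = toy_scores,
  win_loss = toy_win_loss
)
```

<p>Each scraped vector must have the same length before the vectors can be combined into a single table, which is why the questions above ask you to check for 33 entries.</p>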
<h3 id="womens-basketball-schedule-and-scores">Women’s basketball schedule and scores</h3>
<ol start="11" class="example" type="1">
<li><p>The code you created for scraping the men’s basketball team schedule and scores should also work on the page for the women’s team with minimal changes. Copy the code you wrote in the blocks for the men’s page and paste it here. <strong>Change the prefix of each variable name from <code>mens_</code> to <code>womens_</code>, and update the code so that it loads the women’s schedule and scores page</strong>. The final result should be a tibble assigned to a variable called <code>womens_df</code>, which has the following columns in this order:</p>
<ul>
<li><span class="monospace">date</span></li>
<li><span class="monospace">time</span></li>
<li><span class="monospace">opponent</span></li>
<li><span class="monospace">location</span></li>
<li><span class="monospace">score</span></li>
<li><span class="monospace">win_loss</span></li>
</ul>
<p>This table should have 34 rows.</p></li>
</ol>
<h3 id="quick-data-exploration">Quick data exploration</h3>
<p>Collecting data doesn’t serve much of a purpose if we don’t explore or analyze it. Create the summary reports and visualizations requested below to help you better understand the data you just collected.</p>
<ol start="12" class="example" type="1">
<li><p>What was the average score for the men’s team (Mason only) when they won a game and when they lost a game? What was the average score for the women’s team (Mason only) when they won a game and when they lost a game?</p>
<p><strong>Hint: To answer this, you will need to use the <code>separate()</code> function.</strong></p></li>
<li><p>Plot the men’s histogram of scores and the women’s histogram of scores (just for the Mason teams, not the opponents), and then compare the two histograms. Which histogram is centered at a higher score? Which histogram has the larger spread? Are there any other notable differences?</p></li>
</ol>
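<p>As a hint for the hint, here is a minimal sketch of how <code>separate()</code> can split a score column and feed a grouped summary. The three games in <code>toy_games</code> are made up, and the column names <code>mason_score</code> and <code>opponent_score</code> are assumptions chosen for this illustration.</p>

```r
library(dplyr)
library(tidyr)

# Made-up games, not real Mason results
toy_games <- tibble(
  score = c("67-65", "58-70", "80-71"),
  win_loss = c("W", "L", "W")
)

toy_summary <- toy_games %>%
  # Split "67-65" into two columns; convert = TRUE turns the pieces into integers
  separate(
    score, into = c("mason_score", "opponent_score"),
    sep = "-", convert = TRUE) %>%
  group_by(win_loss) %>%
  summarize(mean_mason_score = mean(mason_score))
```

<p>Without <code>convert = TRUE</code> the new columns remain character vectors, and <code>mean()</code> cannot average them.</p>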
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/5GSCwIPV">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Homework 3, FirstName LastName</span>, replacing <span class="monospace">FirstName</span> and <span class="monospace">LastName</span> with your actual first and last name.</p></li>
<li><p>In the messagebox, write: <span class="monospace">My homework submission is ready for grading @shuaibm @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="http://spring18.cds101.com/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/ggplot2-cheatsheet.pdf"><span class="monospace">ggplot2</span> cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-transformation-cheatsheet.pdf">Data transformation cheatsheet</a></p></li>
<li><p><a href="http://spring18.cds101.com/doc/data-import-cheatsheet.pdf">Data import cheatsheet</a></p></li>
</ul>
Reading 122018-04-13T17:00:00-04:002018-04-13T17:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-13:/assignments/reading-12/<p><span class="h3"><a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Introductory Statistics with Randomization and Simulation</a></span></p>
<p>Read the following:</p>
<ul>
<li>From chapter 1: sections 1.3 (skip 1.3.4), 1.4.1, and 1.5</li>
</ul>
<p><span class="h3">Writeups</span></p>
<p>Read the following writeups that supplement the content from <a href="/assignments/reading10.html">reading 10</a>:</p>
<ul>
<li><p><a href="/materials/advanced-pmf-visualization/">An advanced example of a <span class="caps">PMF</span> visualization</a></p></li>
<li><p><a href="/materials/class-size-paradox/">Class-size paradox</a></p></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading12</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Sunday, April 15th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 22: Inference and simulations I2018-04-12T13:30:00-04:002018-04-12T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-12:/materials/class-22/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class22_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 112018-04-12T13:30:00-04:002018-04-12T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-12:/assignments/reading-11/<p><span class="h3"><a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Introductory Statistics with Randomization and Simulation</a></span></p>
<p>Read the following:</p>
<ul>
<li>From chapter 2: from the beginning through to the end of section 2.2</li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading11</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, April 14th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
An advanced example of a PMF visualization2018-04-10T15:00:00-04:002018-04-10T15:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-10:/materials/advanced-pmf-visualization/Reading showing an example of how to use PMFs to compare the difference between two data distributions.<div class="no-bullets">
<ul>
<li><a href="/doc/advanced_pmf_example.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this reading</a></li>
</ul>
</div>
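<pre class="r"><code>library(tidyverse)

# Load the county dataset. The connection is passed positionally so the call
# does not depend on the argument's name, which differs across readr releases
county_complete <- read_rds(
  url("http://spring18.cds101.com/files/datasets/county_complete.rds"))

# Keep only the counties in Iowa and Nebraska
nebraska_iowa <- county_complete %>%
  filter(state == "Iowa" | state == "Nebraska")</code></pre>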
<p><strong>The following example assumes you have already read the standard reading on <a href="/materials/representing-distributions-pmf/"><em>Probability mass functions</em></a>.</strong></p>
<p>Histograms and PMFs are useful while you are exploring data and trying to identify patterns and relationships. Once you have an idea what is going on, a good next step is to design a visualization that makes the patterns you have identified as clear as possible.</p>
<p>In our prior comparison of the average work travel times for Nebraska and Iowa, we saw that there was a large overlap in travel times between 11 and 29 minutes, but that the overlap wasn’t exact. So it makes sense to zoom in on that part of the graph, and to transform the data to emphasize differences. To do this, we need the values of the <span class="caps">PMF</span> in each of the bins. The <code>ggplot_build()</code> function allows us to extract these numerical values, although it takes a couple of steps to do. First, we need to create a histogram of the travel times in Iowa and Nebraska and assign it to a variable:</p>
<pre class="r"><code>nebraska_iowa_histogram <- nebraska_iowa %>%
  ggplot() +
  geom_histogram(
    mapping = aes(x = mean_work_travel, fill = state), binwidth = 1)</code></pre>
<p>Next, we use <code>ggplot_build()</code> to convert the figure into a <code>list()</code> of information about the plot:</p>
<pre class="r"><code>nebraska_iowa_figure_list <- ggplot_build(nebraska_iowa_histogram)</code></pre>
<p>A <code>list()</code> is a data type that we haven’t used in the course yet. It’s a convenient alternative to the <code>tibble</code> when you need to store uneven or very different kinds of information. Like the <code>tibble</code>, you can label the entries in a list. Our list <code>nebraska_iowa_figure_list</code> has several labels containing metadata about the plot:</p>
<pre class="r"><code>names(nebraska_iowa_figure_list)</code></pre>
<pre><code>## [1] "data" "layout" "plot"</code></pre>
<p>The one that we want to use is named <code>data</code>. To get the information inside of the <code>data</code> label, we use the <code>pluck()</code> function from the <code>purrr</code> package, which is loaded as part of <code>tidyverse</code>.</p>
<pre class="r"><code>nebraska_iowa_figure_df <- nebraska_iowa_figure_list %>%
  pluck("data", 1) %>%
  as_tibble()</code></pre>
<p>The <code>1</code> inside of <code>pluck()</code> is necessary to get the data table stored inside of <code>data</code> (without it, we just get a <code>list()</code> data type back, which isn’t helpful). We’ve also converted the data table into a <code>tibble</code> for convenience.</p>
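<p>To see why the extra <code>1</code> matters, here is a small sketch with a hand-built list. The names <code>toy_list</code>, <code>toy_as_list</code>, and <code>toy_table</code> are made up for this illustration and are unrelated to the objects that <code>ggplot_build()</code> returns.</p>

```r
library(purrr)

# A labeled list whose "data" entry is itself a list of data frames
toy_list <- list(
  data = list(data.frame(x = 1:3), data.frame(x = 4:6)),
  label = "toy"
)

# The labels, analogous to names(nebraska_iowa_figure_list)
names(toy_list)

# Plucking only "data" returns the inner list
toy_as_list <- pluck(toy_list, "data")
class(toy_as_list)

# Adding the position 1 returns the first data frame stored inside it
toy_table <- pluck(toy_list, "data", 1)
class(toy_table)
```

<p>The first <code>class()</code> call reports a list and the second reports a data frame, which mirrors the difference between <code>pluck("data")</code> and <code>pluck("data", 1)</code> described above.</p>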
<p>There are 17 columns in <code>nebraska_iowa_figure_df</code>. Let’s use <code>glimpse()</code> to get a list of the variable names and previews of the first few entries:</p>
<pre class="r"><code>nebraska_iowa_figure_df %>%
  glimpse()</code></pre>
<pre><code>## Observations: 38
## Variables: 17
## $ fill <chr> "#00BFC4", "#F8766D", "#00BFC4", "#F8766D", "#00BFC4"...
## $ y <dbl> 1, 1, 2, 3, 5, 7, 10, 14, 12, 19, 8, 17, 8, 19, 11, 2...
## $ count <dbl> 1, 0, 2, 1, 5, 2, 10, 4, 12, 7, 8, 9, 8, 11, 11, 17, ...
## $ x <dbl> 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 1...
## $ xmin <dbl> 10.5, 10.5, 11.5, 11.5, 12.5, 12.5, 13.5, 13.5, 14.5,...
## $ xmax <dbl> 11.5, 11.5, 12.5, 12.5, 13.5, 13.5, 14.5, 14.5, 15.5,...
## $ density <dbl> 0.01075269, 0.00000000, 0.02150538, 0.01010101, 0.053...
## $ ncount <dbl> 0.08333333, 0.00000000, 0.16666667, 0.05882353, 0.416...
## $ ndensity <dbl> 7.750000, 0.000000, 15.500000, 5.823529, 38.750000, 1...
## $ PANEL <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ group <int> 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,...
## $ ymin <dbl> 0, 1, 0, 2, 0, 5, 0, 10, 0, 12, 0, 8, 0, 8, 0, 11, 0,...
## $ ymax <dbl> 1, 1, 2, 3, 5, 7, 10, 14, 12, 19, 8, 17, 8, 19, 11, 2...
## $ colour <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ size <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5...
## $ linetype <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ alpha <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...</code></pre>
<p>For our purposes, we want the <code>x</code> (values along horizontal axis), <code>density</code> (equal to the <span class="caps">PMF</span> here because the bin width is 1), and <code>group</code> (created by <code>fill = state</code>) columns. We extract those using <code>select()</code> and then use <code>rename()</code> and <code>recode()</code> to give better names to the columns and categorical labels:</p>
<pre class="r"><code>nebraska_iowa_pmf <- nebraska_iowa_figure_df %>%
  select(x, density, group) %>%
  rename(mean_travel_time = x, pmf = density, state = group) %>%
  mutate(state = recode(state, `1` = "Iowa", `2` = "Nebraska"))</code></pre>
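<p>The use of <code>density</code> as a <span class="caps">PMF</span> can be checked on a small example. The sketch below uses a hypothetical data frame <code>toy</code> (not the county data) and compares the <code>density</code> column from <code>ggplot_build()</code> against the fraction of observations per bin computed by hand within each group. It assumes <code>binwidth = 1</code>, as in the histogram above; with a different bin width, <code>density</code> would need to be multiplied by the width first.</p>

```r
library(ggplot2)

# Hypothetical toy data: six observations in group "a", four in group "b"
toy <- data.frame(
  value = c(1, 1, 2, 3, 3, 3, 1, 2, 2, 2),
  grp = rep(c("a", "b"), times = c(6, 4))
)

toy_hist <- ggplot(toy) +
  geom_histogram(mapping = aes(x = value, fill = grp), binwidth = 1)

# Extract the binned values, as done for the Nebraska and Iowa histogram
toy_bins <- ggplot_build(toy_hist)$data[[1]]

# Fraction of each group's observations that fall in each bin
pmf_by_hand <- ave(
  toy_bins$count, toy_bins$group, FUN = function(n) n / sum(n))

# TRUE when density and the hand-computed PMF agree
all.equal(toy_bins$density, pmf_by_hand)
```

<p>Because the values are normalized within each group, the PMFs of two groups with different numbers of observations can be compared directly.</p>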
<p>With all of that work done, we can now calculate the difference in the Iowa and Nebraska PMFs. To do that, we need to <code>spread()</code> the <code>state</code> column into separate <code>Nebraska</code> and <code>Iowa</code> columns, then use <code>mutate()</code> to subtract the <code>Nebraska</code> <span class="caps">PMF</span> from the <code>Iowa</code> <span class="caps">PMF</span>:</p>
<pre class="r"><code>nebraska_iowa_percent_difference <- nebraska_iowa_pmf %>%
  spread(key = state, value = pmf) %>%
  mutate(percent_difference = 100 * (Iowa - Nebraska)) %>%
  select(-Iowa, -Nebraska)</code></pre>
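<p>As a brief aside, the reshaping step can be seen on a toy table. The numbers in <code>toy_pmf</code> are invented for illustration; only the column names follow the code above. This sketch keeps the <code>Iowa</code> and <code>Nebraska</code> columns so that the intermediate wide layout stays visible.</p>

```r
library(dplyr)
library(tidyr)

# Invented PMF values in "long" form: one row per (travel time, state) pair
toy_pmf <- tibble(
  mean_travel_time = c(15, 15, 16, 16),
  state = c("Iowa", "Nebraska", "Iowa", "Nebraska"),
  pmf = c(0.10, 0.15, 0.20, 0.05)
)

toy_difference <- toy_pmf %>%
  # One row per travel time, with a separate column for each state
  spread(key = state, value = pmf) %>%
  # Positive values mean the travel time is more common in Iowa
  mutate(percent_difference = 100 * (Iowa - Nebraska))

toy_difference
```

<p>After <code>spread()</code> the two states sit side by side in the same row, which is what makes the subtraction inside <code>mutate()</code> possible.</p>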
<p>We remove the <code>Iowa</code> and <code>Nebraska</code> columns afterward, as we no longer need them after taking the difference. Now we can create a bar chart of the differences between Nebraska and Iowa travel times, which was the goal of this procedure:</p>
<pre class="r"><code>nebraska_iowa_percent_difference %>%
  ggplot() +
  geom_col(mapping = aes(x = mean_travel_time, y = percent_difference)) +
  coord_cartesian(xlim = c(9.5, 30), ylim = c(-7, 7))</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/advanced_pmf_example-ia-ne-travel-times-percent-diff-labeled-1.svg" title="plot of chunk ia-ne-travel-times-percent-diff-labeled" alt="plot of chunk ia-ne-travel-times-percent-diff-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>The arrows indicate that a bar in the <span class="math inline">\(y > 0\)</span> region means that travel time is more common in Iowa, while a bar in the <span class="math inline">\(y < 0\)</span> region means it is more common in Nebraska; the taller the bar, the larger the difference. This figure makes the pattern clearer: longer work travel times are more common in Iowa than in Nebraska. For now we should hold this conclusion only tentatively. We used the same dataset to identify an apparent difference and then chose a visualization that makes the difference apparent. We can’t be sure this effect is real; it might be due to random variation. When we learn about statistical inference later on, we’ll have the tools necessary to better answer that question.</p>
<h2 id="credits">Credits</h2>
<div class="license">
<p>This work, <em>An advanced example of a <span class="caps">PMF</span> visualization</em>, is a derivative of <a href="http://a.co/grOJGrv">Allen B. Downey, “Chapter 3 Probability mass functions” in <em>Think Stats: Exploratory Data Analysis</em>, 2nd ed. (O’Reilly Media, Sebastopol, <span class="caps">CA</span>, 2014)</a>, used under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a>. <em>An advanced example of a <span class="caps">PMF</span> visualization</em> is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a> by James Glasbrenner.</p>
</div>
Class size paradox2018-04-10T15:00:00-04:002018-04-10T15:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-10:/materials/class-size-paradox/A practical example that shows how to use PMFs to resolve a paradox.<div class="no-bullets">
<ul>
<li><a href="/doc/class_size_paradox.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this reading</a></li>
</ul>
</div>
<pre class="r"><code>library(tidyverse)</code></pre>
<p><strong>The following example assumes you have already read the standard reading on <a href="/materials/representing-distributions-pmf/"><em>Probability mass functions</em></a>.</strong></p>
<p>Let’s consider another <span class="caps">PMF</span> computation that illustrates something that we may call the “class size paradox”.</p>
<p>At many American colleges and universities, the student-to-faculty ratio is about 10:1. But students are often surprised to discover that their average class size is bigger than 10. There are two reasons for the discrepancy:</p>
<ul>
<li>Students typically take 4–5 classes per semester, but professors often teach 1 or 2.</li>
<li>The number of students who enjoy a small class is small, but the number of students in a large class is (unsurprisingly) large.</li>
</ul>
<p>The first effect is obvious, at least once it is pointed out; the second is more subtle. Let’s look at an example. Suppose that a college offers 65 classes in a given semester, with the following distribution of sizes:</p>
<table>
<thead>
<tr class="header">
<th>size</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>5–9</td>
<td>8</td>
</tr>
<tr class="even">
<td>10–14</td>
<td>8</td>
</tr>
<tr class="odd">
<td>15–19</td>
<td>14</td>
</tr>
<tr class="even">
<td>20–24</td>
<td>4</td>
</tr>
<tr class="odd">
<td>25–29</td>
<td>6</td>
</tr>
<tr class="even">
<td>30–34</td>
<td>12</td>
</tr>
<tr class="odd">
<td>35–39</td>
<td>8</td>
</tr>
<tr class="even">
<td>40–44</td>
<td>3</td>
</tr>
<tr class="odd">
<td>45–49</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>If you ask the Dean for the average class size, he or she would construct a <span class="caps">PMF</span>, compute the mean, and report that the average class size is 23.7. Here’s the code:</p>
<pre class="r"><code>class_sizes <- tribble(
~`class size`, ~count,
7, 8,
12, 8,
17, 14,
22, 4,
27, 6,
32, 12,
37, 8,
42, 3,
47, 2)
class_sizes %>%
summarize(`Average class size` = round(weighted.mean(`class size`, count), 1))</code></pre>
<table>
<thead>
<tr>
<th style="text-align:right;">
Average class size
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
23.7
</td>
</tr>
</tbody>
</table>
<p>But if you survey a group of students, ask them how many students are in their classes, and compute the mean, you would think the average class was bigger. Let’s see how much bigger.</p>
<p>First, let’s compute the distribution as observed by students, where the probability associated with each class size is “biased” by the number of students in the class. For each class size, we multiply the probability by the number of students who observe that class size, and then divide by the sum of these products so that the probabilities again add up to 1. The result is a new <span class="caps">PMF</span> that represents the biased distribution:</p>
<pre class="r"><code>class_sizes2 <- class_sizes %>%
mutate(actual = count / sum(count)) %>%
mutate(observed = (actual * `class size`) / sum(actual * `class size`)) %>%
gather(key = distribution, value = PMF, actual:observed)</code></pre>
<p>Now we can plot the actual and observed distributions together:</p>
<pre class="r"><code>class_sizes2 %>%
ggplot(
mapping = aes(x = `class size`, y = PMF, fill = distribution,
color = distribution)) +
geom_col(position = "identity", alpha = 0.5)</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/class_size_paradox-actual-vs-observed-class-size-labeled-1.svg" title="plot of chunk actual-vs-observed-class-size-labeled" alt="plot of chunk actual-vs-observed-class-size-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>As we see in the above figure, the biased distribution corresponds to fewer small classes and more large ones. The mean of the biased distribution is,</p>
<pre class="r"><code>class_sizes2 %>%
group_by(distribution) %>%
summarize(`Average class size` = round(weighted.mean(`class size`, `PMF`), 1))</code></pre>
<table>
<thead>
<tr>
<th style="text-align:left;">
distribution
</th>
<th style="text-align:right;">
Average class size
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">
actual
</td>
<td style="text-align:right;">
23.7
</td>
</tr>
<tr>
<td style="text-align:left;">
observed
</td>
<td style="text-align:right;">
29.1
</td>
</tr>
</tbody>
</table>
<p>which is about 23% higher than the actual mean.</p>
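<p>The size of this gap follows from how the biased distribution is built: because each probability is weighted by the class size, the observed mean equals the mean of the squared class sizes divided by the actual mean. The check below is a minimal sketch that uses plain R vectors instead of the tibble above; the names <code>size</code>, <code>n_classes</code>, <code>actual_pmf</code>, and <code>observed_pmf</code> are introduced here only for illustration.</p>
<pre class="r"><code># Bin midpoints and class counts from the table above
size <- c(7, 12, 17, 22, 27, 32, 37, 42, 47)
n_classes <- c(8, 8, 14, 4, 6, 12, 8, 3, 2)

# Actual PMF and its mean (the Dean's view)
actual_pmf <- n_classes / sum(n_classes)
actual_mean <- sum(size * actual_pmf)

# Observed PMF: weight each probability by class size, then renormalize
observed_pmf <- (actual_pmf * size) / sum(actual_pmf * size)
observed_mean <- sum(size * observed_pmf)

# The observed mean equals E[size^2] / E[size] under the actual PMF
all.equal(observed_mean, sum(size^2 * actual_pmf) / actual_mean)

# Relative increase of the observed mean over the actual mean, in percent
round(100 * (observed_mean / actual_mean - 1), 1)</code></pre>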
<p>It is also possible to invert this operation. Suppose you want to find the distribution of class sizes at a college, but you can’t get reliable data from the Dean. An alternative is to choose a random sample of students and ask how many students are in their classes. The result would be biased for the reasons we’ve just seen, but you can use it to estimate the actual distribution. Here’s the code that can be used to unbias a <span class="caps">PMF</span>:</p>
<pre class="r"><code>class_sizes2 %>%
spread(key = distribution, value = PMF) %>%
mutate(
unbiased = (observed * 1 / `class size`) /
sum(observed * 1 / `class size`)) %>%
select(`class size`, count, observed, unbiased, actual)</code></pre>
<table>
<thead>
<tr>
<th style="text-align:right;">
class size
</th>
<th style="text-align:right;">
count
</th>
<th style="text-align:right;">
observed
</th>
<th style="text-align:right;">
unbiased
</th>
<th style="text-align:right;">
actual
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;">
7
</td>
<td style="text-align:right;">
8
</td>
<td style="text-align:right;">
0.0363636
</td>
<td style="text-align:right;">
0.1230769
</td>
<td style="text-align:right;">
0.1230769
</td>
</tr>
<tr>
<td style="text-align:right;">
12
</td>
<td style="text-align:right;">
8
</td>
<td style="text-align:right;">
0.0623377
</td>
<td style="text-align:right;">
0.1230769
</td>
<td style="text-align:right;">
0.1230769
</td>
</tr>
<tr>
<td style="text-align:right;">
17
</td>
<td style="text-align:right;">
14
</td>
<td style="text-align:right;">
0.1545455
</td>
<td style="text-align:right;">
0.2153846
</td>
<td style="text-align:right;">
0.2153846
</td>
</tr>
<tr>
<td style="text-align:right;">
22
</td>
<td style="text-align:right;">
4
</td>
<td style="text-align:right;">
0.0571429
</td>
<td style="text-align:right;">
0.0615385
</td>
<td style="text-align:right;">
0.0615385
</td>
</tr>
<tr>
<td style="text-align:right;">
27
</td>
<td style="text-align:right;">
6
</td>
<td style="text-align:right;">
0.1051948
</td>
<td style="text-align:right;">
0.0923077
</td>
<td style="text-align:right;">
0.0923077
</td>
</tr>
<tr>
<td style="text-align:right;">
32
</td>
<td style="text-align:right;">
12
</td>
<td style="text-align:right;">
0.2493506
</td>
<td style="text-align:right;">
0.1846154
</td>
<td style="text-align:right;">
0.1846154
</td>
</tr>
<tr>
<td style="text-align:right;">
37
</td>
<td style="text-align:right;">
8
</td>
<td style="text-align:right;">
0.1922078
</td>
<td style="text-align:right;">
0.1230769
</td>
<td style="text-align:right;">
0.1230769
</td>
</tr>
<tr>
<td style="text-align:right;">
42
</td>
<td style="text-align:right;">
3
</td>
<td style="text-align:right;">
0.0818182
</td>
<td style="text-align:right;">
0.0461538
</td>
<td style="text-align:right;">
0.0461538
</td>
</tr>
<tr>
<td style="text-align:right;">
47
</td>
<td style="text-align:right;">
2
</td>
<td style="text-align:right;">
0.0610390
</td>
<td style="text-align:right;">
0.0307692
</td>
<td style="text-align:right;">
0.0307692
</td>
</tr>
</tbody>
</table>
<p>It’s similar to before; the only difference is that it divides each probability by the class size instead of multiplying.</p>
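<p>To make the symmetry between the two operations explicit, both can be wrapped in small functions. This is only a sketch: <code>bias_pmf()</code> and <code>unbias_pmf()</code> are hypothetical helper names, not functions from any package, and they work on plain vectors rather than on the tibble used above.</p>
<pre class="r"><code># Hypothetical helpers: both take a vector of values and a PMF over those values
bias_pmf <- function(values, pmf) {
  weighted <- pmf * values   # larger values are observed more often
  weighted / sum(weighted)   # renormalize so the probabilities sum to 1
}

unbias_pmf <- function(values, pmf) {
  weighted <- pmf / values   # undo the weighting by size
  weighted / sum(weighted)   # renormalize
}

size <- c(7, 12, 17, 22, 27, 32, 37, 42, 47)
actual <- c(8, 8, 14, 4, 6, 12, 8, 3, 2) / 65

observed <- bias_pmf(size, actual)
recovered <- unbias_pmf(size, observed)

# Unbiasing the biased PMF recovers the actual PMF (up to rounding error)
all.equal(recovered, actual)</code></pre>
<p>The round trip works because the normalization constants cancel: dividing by the class size removes exactly the factor that the biasing step introduced.</p>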
<h2 id="credits">Credits</h2>
<div class="license">
<p>This work, <em>Class size paradox</em>, is a derivative of <a href="http://a.co/grOJGrv">Allen B. Downey, “Chapter 3 Probability mass functions” in <em>Think Stats: Exploratory Data Analysis</em>, 2nd ed. (O’Reilly Media, Sebastopol, <span class="caps">CA</span>, 2014)</a>, used under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a>. <em>Class Size Paradox</em> is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a> by James Glasbrenner.</p>
</div>
Comparing percentile rank2018-04-10T15:00:00-04:002018-04-10T15:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-10:/materials/comparing-percentile-rank/An example for how to use the <span class="caps">CDF</span> to compare measurements across different groups.<div class="no-bullets">
<ul>
<li><a href="/doc/comparing_percentile_rank.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this guide</a></li>
</ul>
</div>
<p><strong>The following example assumes you have already read the standard reading on <a href="/materials/representing-distributions-cdf/"><em>Cumulative distribution functions</em></a>.</strong></p>
<p>Percentile ranks are useful for comparing measurements across different groups. For example, people who compete in foot races are usually grouped by age and sex. To compare people in different age groups, you can convert race times to percentile ranks.</p>
<p>As an example, suppose a male runner in his forties completes a 10K race in 42:44, placing him 97th in a field of 1633. This means that the runner beat or tied 1537 runners out of 1633, which corresponds to a percentile rank of 94%.</p>
<p>More generally, given position and field size, we can compute percentile rank as follows: <span class="math display">\[100.0 \times \frac{\text{field}\_\text{size} - \text{position} + 1}{\text{field}\_\text{size}}\]</span> The runner belonged to the “male between 40 and 49 years of age” group, and within that grouping came in 26th out of 256. Using the above formula:</p>
<pre class="r"><code>100.0 * (256 - 26 + 1) / 256</code></pre>
<pre><code>## [1] 90.23438</code></pre>
<p>Thus the runner’s percentile rank in this age group was 90%.</p>
<p>If the runner continues to compete in ten years’ time, then he will be placed into the “male between 50 and 59 years of age” group. We can use the runner’s current percentile rank in the “40 to 49 years of age” group to estimate how he would perform in this new group, everything else remaining equal. The formula is as follows: <span class="math display">\[\text{field}\_\text{size} - \left(\text{percentile} \times \frac{\text{field}\_\text{size}}{100.0}\right) + 1\]</span> There were 171 people in the “50 to 59 years of age” group, so we need to compute:</p>
<pre class="r"><code>171 - (90.23438 * (171 / 100.0)) + 1</code></pre>
<pre><code>## [1] 17.69921</code></pre>
<p>This means that the runner would have to finish somewhere between 17th and 18th place to maintain the same percentile rank. In this particular race, the finishing time for 17th place in the “50 to 59 years of age” group was 46:05, so that’s the time that the runner needs to train for in order to maintain his percentile rank as he ages.</p>
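<p>The two formulas above are inverses of one another, which is easier to see when they are written as functions. The sketch below assumes two hypothetical helper names, <code>position_to_percentile()</code> and <code>percentile_to_position()</code>, and reuses the field sizes quoted in the text.</p>
<pre class="r"><code># Hypothetical helpers implementing the two formulas above
position_to_percentile <- function(position, field_size) {
  100.0 * (field_size - position + 1) / field_size
}

percentile_to_position <- function(percentile, field_size) {
  field_size - (percentile * field_size / 100.0) + 1
}

# Percentile rank within the 40 to 49 group (26th out of 256)
rank_40s <- position_to_percentile(position = 26, field_size = 256)
rank_40s

# Equivalent position in the 50 to 59 group (171 runners)
percentile_to_position(percentile = rank_40s, field_size = 171)

# Converting back within the original group returns the original position
percentile_to_position(percentile = rank_40s, field_size = 256)</code></pre>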
<h2 id="credits">Credits</h2>
<div class="license">
<p>This work, <em>Comparing percentile rank</em>, is a derivative of <a href="http://a.co/grOJGrv">Allen B. Downey, “Chapter 4 Cumulative distribution functions” in <em>Think Stats: Exploratory Data Analysis</em>, 2nd ed. (O’Reilly Media, Sebastopol, <span class="caps">CA</span>, 2014)</a>, used under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a>. <em>Comparing percentile rank</em> is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a> by James Glasbrenner.</p>
</div>
Class 21: Statistical distributions II2018-04-10T13:30:00-04:002018-04-10T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-10:/materials/class-21/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class21_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 102018-04-10T13:30:00-04:002018-04-10T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-10:/assignments/reading-10/<p><span class="h3">Writeups</span></p>
<p>Read the following writeups on the <em>probability mass function</em> and <em>cumulative distribution function</em>:</p>
<ul>
<li><p><a href="/materials/representing-distributions-pmf/">Representing distributions: Probability mass function</a></p></li>
<li><p><a href="/materials/representing-distributions-cdf/">Representing distributions: Cumulative distribution function</a></p></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading10</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about the reading to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive answer credit, reply to a posted question no later than 11:59pm on Thursday, April 12th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Cumulative distribution functions2018-04-05T15:00:00-04:002018-04-05T15:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-05:/materials/representing-distributions-cdf/Reading about how to use R to compute, visualize, and apply percentiles of a dataset.<div class="no-bullets">
<ul>
<li><a href="/doc/representing_distributions_cdf.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this reading</a></li>
</ul>
</div>
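<pre class="r"><code>library(tidyverse)
# The connection is passed positionally because the name of the first
# argument of read_rds() differs between readr versions
county_complete <- read_rds(
  url("http://spring18.cds101.com/files/datasets/county_complete.rds"))</code></pre>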
<h2 id="the-limits-of-probability-mass-functions">The limits of probability mass functions</h2>
<p>Probability mass functions (PMFs) work well if the number of unique values is small. But as the number of unique values increases, the probability associated with each value gets smaller and the effect of random noise increases.</p>
<p>Let’s recall that, in the previous reading, we plotted and compared PMFs of the average work travel time in Virginia and New Jersey, which resulted in this figure:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-va-nj-travel-times-pmf-1.svg" title="plot of chunk va-nj-travel-times-pmf" alt="plot of chunk va-nj-travel-times-pmf" width="60%" style="display: block; margin: auto;" /></p>
<p>What happens if we choose <code>binwidth = 0.1</code> for plotting the <code>mean_work_travel</code> distribution? The values in <code>mean_work_travel</code> are reported to the first decimal place, so <code>binwidth = 0.1</code> does not “smooth out” the data. This increases the number of distinct values in <code>mean_work_travel</code> from 41 to 304. The comparison between the Virginia and New Jersey PMFs will then look like this:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-mean-work-travel-pmf-no-rounding-1.svg" title="plot of chunk mean-work-travel-pmf-no-rounding" alt="plot of chunk mean-work-travel-pmf-no-rounding" width="60%" style="display: block; margin: auto;" /></p>
<p>This visualization has a lot of spikes of similar heights, which makes it difficult to interpret and limits its usefulness. Also, it can be hard to see overall patterns; for example, what is the approximate difference in means between these two distributions?</p>
<p>This illustrates the tradeoff when using histograms and PMFs for visualizing single variables. If we smooth things out by using larger bin sizes, then we can lose information that may be useful. On the other hand, using small bin sizes creates plots like the one above, which is of limited (if any) utility.</p>
<p>An alternative that avoids these problems is the cumulative distribution function (<span class="caps">CDF</span>), which we describe next. But before we discuss CDFs, we first have to understand the concept of percentiles.</p>
<h2 id="percentiles">Percentiles</h2>
<p>If you have taken a standardized test, you probably got your results in the form of a raw score and a <strong>percentile rank</strong>. In this context, the percentile rank is the percentage of people who scored lower than you (or the same). So if you are “in the 90th percentile”, you did as well as or better than 90% of the people who took the exam.</p>
<p>As an example, say that you and 4 other people took a test and received the following scores:</p>
<pre><code>55 66 77 88 99</code></pre>
<p>If you received the score of 88, then what is your percentile rank? We can calculate it as follows:</p>
<pre class="r"><code>test_scores <- tribble(
~score,
55,
66,
77,
88,
99)
number_of_tests <- test_scores %>%
count() %>%
pull(n)
number_of_lower_scores <- test_scores %>%
filter(score <= 88) %>%
count() %>%
pull(n)
percentile_rank <- 100.0 * number_of_lower_scores / number_of_tests</code></pre>
<p>From this, we find that the percentile rank for a score of 88 is 80. Mathematically, the calculation is <span class="math inline">\(100 \times \dfrac{4}{5} = 80\)</span>.</p>
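<p>The same calculation can be packaged so that it works for any score. The following is a minimal sketch using a plain numeric vector; <code>compute_percentile_rank()</code> is a hypothetical helper name chosen for this example.</p>
<pre class="r"><code># Hypothetical helper: percentage of scores less than or equal to your_score
compute_percentile_rank <- function(scores, your_score) {
  100.0 * sum(scores <= your_score) / length(scores)
}

scores <- c(55, 66, 77, 88, 99)

compute_percentile_rank(scores, 88)  # four of the five scores are 88 or lower
compute_percentile_rank(scores, 55)  # only the lowest score itself qualifies</code></pre>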
<p>As you can see, if you are given a value, it is easy to find its percentile rank; going the other way is slightly harder. One way to do this is to sort the scores and find the row number that corresponds to a percentile rank. To find the row number, divide the total number of scores by 100, multiply that number by the desired percentile rank, and then <em>round up</em> to the nearest integer value. The rounding up operation can be handled via the <code>ceiling()</code> function. So, for our example, the value with percentile rank 55 is:</p>
<pre class="r"><code>percentile_rank_row_number <- ceiling(55 * number_of_tests / 100)
test_scores %>%
arrange(score) %>%
slice(percentile_rank_row_number)</code></pre>
<p>The result of this calculation is called a <strong>percentile</strong>. So this means that, in the distribution of exam scores, the 55th percentile corresponds to a score of 77.</p>
<p>In R, there is a function called <code>quantile()</code> that can do the above calculation automatically, although you need to take care with the inputs. Let’s first show what happens when we aren’t careful. We might think that we can calculate the 55th percentile by running:</p>
<pre class="r"><code>test_scores %>%
pull(score) %>%
quantile(probs = c(0.55))</code></pre>
<p>We get a score of 79.2, which isn’t in our dataset. This happens because <code>quantile()</code> interpolates between the scores by default. Sometimes you will want this behavior; other times you will not. When the dataset is this small, it doesn’t make as much sense to permit interpolation, as it can be based on rather aggressive assumptions about what intermediate scores might look like. To tell <code>quantile()</code> to compute scores in the same manner as we did above, add the input <code>type = 1</code>:</p>
<pre class="r"><code>test_scores %>%
pull(score) %>%
quantile(probs = c(0.55), type = 1)</code></pre>
<p>This, as expected, agrees with the manual calculation.</p>
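<p>To check the agreement for more than one percentile rank, the manual recipe can be wrapped in a function and compared against <code>quantile()</code> with <code>type = 1</code>. This is a sketch under the assumption that the requested ranks are greater than 0 and at most 100; <code>compute_percentile()</code> is a hypothetical helper name.</p>
<pre class="r"><code># Hypothetical helper: sort the scores, then pick the row found by rounding up
# (assumes 0 < rank <= 100)
compute_percentile <- function(scores, rank) {
  index <- ceiling(rank * length(scores) / 100)
  sort(scores)[index]
}

scores <- c(55, 66, 77, 88, 99)
ranks <- c(20, 55, 80, 100)

manual <- sapply(ranks, function(r) compute_percentile(scores, r))
builtin <- unname(quantile(scores, probs = ranks / 100, type = 1))

# Both approaches only return values that appear in the sample
manual
builtin</code></pre>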
<p>It is worth emphasizing that the difference between “percentile” and “percentile rank” can be confusing, and people do not always use the terms precisely. To summarize, if we want to know the percentage of people who obtained scores equal to or lower than ours, then we are computing a percentile rank. If we start with a percentile, then we are computing the score in the distribution that corresponds with it.</p>
<h2 id="cdfs">CDFs</h2>
<p>Now that we understand percentiles and percentile ranks, we are ready to tackle the <strong>cumulative distribution function</strong> (<span class="caps">CDF</span>). The <span class="caps">CDF</span> is the function that maps from a value to its percentile rank. To find the <span class="caps">CDF</span> for any particular value in our distribution, we compute the fraction of values in the distribution less than or equal to our selected value. Computing this is similar to how we calculated the percentile rank, except that the result is a probability in the range 0–1 rather than a percentile rank in the range 0–100. For our test scores example, we can manually compute the <span class="caps">CDF</span> in the following way:</p>
<pre class="r"><code>test_scores_cdf <- test_scores %>%
arrange(score) %>%
mutate(cdf = row_number() / n())</code></pre>
<p>The visualization of the <span class="caps">CDF</span> looks like:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-test-scores-cdf-plot-1.svg" title="plot of chunk test-scores-cdf-plot" alt="plot of chunk test-scores-cdf-plot" width="60%" style="display: block; margin: auto;" /></p>
<p>As you can see, the <span class="caps">CDF</span> of a sample looks like a sequence of steps. Appropriately enough, this is called a step function, and the <span class="caps">CDF</span> of <em>any</em> sample is a step function.</p>
<p>Also note that we can evaluate the <span class="caps">CDF</span> for any value, not just values that appear in the sample. If we select a value that is less than the smallest value in the sample, then the <span class="caps">CDF</span> is 0. If we select a value that is greater than the largest value, then the <span class="caps">CDF</span> is 1.</p>
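<p>A short sketch makes this concrete for the test scores example. The helper <code>evaluate_cdf()</code> is a hypothetical name used only here; it returns the fraction of sample values that are less than or equal to the value you supply.</p>
<pre class="r"><code># Hypothetical helper: fraction of sample values less than or equal to value
evaluate_cdf <- function(sample_values, value) {
  mean(sample_values <= value)
}

scores <- c(55, 66, 77, 88, 99)

evaluate_cdf(scores, 50)   # below the smallest score
evaluate_cdf(scores, 80)   # between 77 and 88, not itself in the sample
evaluate_cdf(scores, 88)   # a score that appears in the sample
evaluate_cdf(scores, 120)  # above the largest score</code></pre>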
<h2 id="representing-cdfs">Representing CDFs</h2>
<p>While it’s good to know how to manually compute the <span class="caps">CDF</span>, R provides the <code>ecdf()</code> function, which constructs the <span class="caps">CDF</span> of a sample automatically. Let’s return to the average work travel times dataset we used when discussing the <span class="caps">PMF</span> and compute the <span class="caps">CDF</span> of the full distribution (no grouping by states). We do this as follows:</p>
<pre class="r"><code>mean_work_travel_ecdf <- county_complete %>%
pull(mean_work_travel) %>%
ecdf()</code></pre>
<p>Now we can input an arbitrary travel time and find its percentile rank, expressed as a fraction between 0 and 1. For example, the percentile rank for an average work travel time of 30 minutes is:</p>
<pre class="r"><code>mean_work_travel_ecdf(30)</code></pre>
<pre><code>## [1] 0.9045498</code></pre>
<p>Thus, 30 minutes corresponds to the 90th percentile.</p>
<p>We can also use this to create a plot. To do this, we need to generate a sequence of travel times and calculate the <span class="caps">CDF</span> for each. A recommended way to do this is to find the minimum and maximum values of the distribution using <code>min()</code> and <code>max()</code>, and then use <code>seq()</code> to generate a long list of values that sit in between the minimum and maximum. Let’s show a couple of examples so it’s clear what we’re doing. First we find the minimum and maximum:</p>
<pre class="r"><code>mean_work_travel_min <- county_complete %>%
pull(mean_work_travel) %>%
min()
mean_work_travel_max <- county_complete %>%
pull(mean_work_travel) %>%
max()</code></pre>
<p>By themselves, these correspond to two points along the horizontal axis, like so:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-min-max-mean-work-travel-plot-1.svg" title="plot of chunk min-max-mean-work-travel-plot" alt="plot of chunk min-max-mean-work-travel-plot" width="60%" style="display: block; margin: auto;" /></p>
<p>To generate horizontal values that increase by 1 between the minimum and maximum, we use <code>seq()</code> as follows:</p>
<pre class="r"><code>mean_work_travel_range1 <- seq(
from = mean_work_travel_min, to = mean_work_travel_max, by = 1)</code></pre>
<p>Visually, this would look like:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-min-max-mean-work-travel-short-range-1.svg" title="plot of chunk min-max-mean-work-travel-short-range" alt="plot of chunk min-max-mean-work-travel-short-range" width="60%" style="display: block; margin: auto;" /></p>
<p>Generally, it’s better to use a smaller incremental value inside <code>seq()</code> in order to make smoother figures. Let’s use <code>by = 0.1</code>:</p>
<pre class="r"><code>mean_work_travel_range2 <- seq(
from = mean_work_travel_min, to = mean_work_travel_max, by = 0.1)</code></pre>
<p>Next, we want to feed all these horizontal axis values into <code>mean_work_travel_ecdf()</code>, which we do as follows:</p>
<pre class="r"><code>mean_work_travel_computed_cdf <- mean_work_travel_ecdf(mean_work_travel_range2)</code></pre>
<p>To create a plot, we combine <code>mean_work_travel_range2</code> and <code>mean_work_travel_computed_cdf</code> into a <code>tibble</code>:</p>
<pre class="r"><code># tibble() replaces data_frame(), which is deprecated in the tibble package
mean_work_travel_cdf_tibble <- tibble(
  mean_work_travel = mean_work_travel_range2, cdf = mean_work_travel_computed_cdf)</code></pre>
<p>Now we can visualize it:</p>
<pre class="r"><code>mean_work_travel_cdf_tibble %>%
ggplot() +
geom_line(mapping = aes(x = mean_work_travel, y = cdf))</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-mean-work-travel-cdf-visualization-labeled-1.svg" title="plot of chunk mean-work-travel-cdf-visualization-labeled" alt="plot of chunk mean-work-travel-cdf-visualization-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>Now we can easily pick a percentile rank and read the associated average work travel time from the plot, and vice versa.</p>
<p>It takes some time to get used to CDFs, but over time it should become clear that they show more information, more clearly, than PMFs.</p>
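<p>As an aside, ggplot2 also provides <code>stat_ecdf()</code>, which computes and draws the empirical <span class="caps">CDF</span> directly from a column, without building the grid of values by hand. The sketch below is not the approach used in this reading; it uses a synthetic data frame, <code>commute_times</code>, whose values are made up purely so that the example is self-contained.</p>
<pre class="r"><code>library(ggplot2)

# Synthetic stand-in for the travel-time data (the values are made up)
set.seed(101)
commute_times <- data.frame(minutes = rnorm(200, mean = 23, sd = 5))

# stat_ecdf() computes the empirical CDF of the x aesthetic and draws it as steps
cdf_plot <- ggplot(data = commute_times, mapping = aes(x = minutes)) +
  stat_ecdf(geom = "step") +
  labs(y = "CDF")
cdf_plot</code></pre>
<p>Building the <span class="caps">CDF</span> manually, as done above, has the advantage that the computed values are available in a tibble for further calculations.</p>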
<h2 id="comparing-cdfs">Comparing CDFs</h2>
<p>CDFs are especially useful for comparing distributions. Let’s revisit the comparison we made between the average work travel times in Nebraska and Iowa. Here is the full code that converts those distributions into CDFs:</p>
<pre class="r"><code># Build an empirical CDF for each state
ia_travel_times_ecdf <- county_complete %>%
  filter(state == "Iowa") %>%
  pull(mean_work_travel) %>%
  ecdf()
ne_travel_times_ecdf <- county_complete %>%
  filter(state == "Nebraska") %>%
  pull(mean_work_travel) %>%
  ecdf()
# Evaluate both CDFs on a shared grid of travel times
mean_work_travel_range <- seq(mean_work_travel_min, mean_work_travel_max, 0.1)
# tibble() replaces data_frame(), which is deprecated in the tibble package
ia_ne_mean_work_travel_cdfs <- tibble(
  mean_work_travel = mean_work_travel_range,
  cdf_iowa = ia_travel_times_ecdf(mean_work_travel),
  cdf_nebraska = ne_travel_times_ecdf(mean_work_travel)) %>%
  # Reshape to one row per (travel time, state) pair for plotting
  gather(key = state, value = CDF, cdf_iowa:cdf_nebraska) %>%
  mutate(state = recode(state, cdf_iowa = "Iowa", cdf_nebraska = "Nebraska"))</code></pre>
<p>We can then plot the two CDFs against each other:</p>
<pre class="r"><code>ggplot(data = ia_ne_mean_work_travel_cdfs,
mapping = aes(x = mean_work_travel, y = CDF)) +
geom_line(mapping = aes(color = state))</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_cdf-compare-cdf-of-ne-ia-travel-times-plot-labeled-1.svg" title="plot of chunk compare-cdf-of-ne-ia-travel-times-plot-labeled" alt="plot of chunk compare-cdf-of-ne-ia-travel-times-plot-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>This visualization makes the shapes of the distributions and the relative differences between them much clearer. We see that Nebraska has shorter average work travel times for most of the distribution, at least until you reach an average time of 25 minutes, after which the Nebraska and Iowa distributions become similar to one another.</p>
<h2 id="credits">Credits</h2>
<div class="license">
<p>This work, <em>Cumulative distribution functions</em>, is a derivative of <a href="http://a.co/grOJGrv">Allen B. Downey, “Chapter 4 Cumulative distribution functions” in <em>Think Stats: Exploratory Data Analysis</em>, 2nd ed. (O’Reilly Media, Sebastopol, <span class="caps">CA</span>, 2014)</a>, used under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a>. <em>Cumulative distribution functions</em> is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a> by James Glasbrenner.</p>
</div>
Probability mass functions2018-04-05T15:00:00-04:002018-04-05T15:00:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-05:/materials/representing-distributions-pmf/Reading about how to connect probabilities with values in a dataset.<div class="no-bullets">
<ul>
<li><a href="/doc/representing_distributions_pmf.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this reading</a></li>
</ul>
</div>
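<pre class="r"><code>library(tidyverse)
# The connection is passed positionally because the name of the first
# argument of read_rds() differs between readr versions
county_complete <- read_rds(
  url("http://spring18.cds101.com/files/datasets/county_complete.rds"))</code></pre>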
<h2 id="statistical-distributions">Statistical distributions</h2>
<p>During the first part of the course, we learned how to use visualizations to explore a dataset. Now we will extend this approach using concepts from probability and statistics to build a scientific foundation for interpreting data distributions. We start with discussing univariate (single variable) distributions, which we’ve previously visualized as frequency histograms (by default, <code>geom_histogram()</code> sorts data into different bins and tells you how many end up in each one). Frequency histograms are useful for examining the particulars of a single variable, but have limited utility when directly comparing distributions that contain different numbers of observations. Here we introduce the normalized version of the frequency histogram, the <strong>probability mass function</strong> (<span class="caps">PMF</span>).</p>
<h2 id="example-dataset">Example dataset</h2>
<p>We use an example dataset of the average time it takes for people to commute to work across 3143 counties in the United States (collected between 2006 and 2010) to help illustrate the meaning and uses of the probability mass function. The frequency histogram for these times can be plotted using the following code snippet:</p>
<pre class="r"><code>county_complete %>%
ggplot(mapping = aes(x = mean_work_travel)) +
geom_histogram(binwidth = 1)</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-mean-travel-time-freq-hist-labeled-1.svg" title="plot of chunk mean-travel-time-freq-hist-labeled" alt="plot of chunk mean-travel-time-freq-hist-labeled" width="60%" style="display: block; margin: auto;" /></p>
<h2 id="pmfs">PMFs</h2>
<p>The <strong>probability mass function</strong> (<span class="caps">PMF</span>) represents a distribution by sorting the data into bins (much like the frequency histogram) and then associating a probability with each bin in the distribution. A <strong>probability</strong> is a frequency expressed as a fraction of the sample size <em>n</em>. Therefore we can directly convert a frequency histogram to a <span class="caps">PMF</span> by dividing the count in each bin by the sample size <em>n</em>. This process is called <strong>normalization</strong>.</p>
<p>As an example, consider the following short sample,</p>
<pre><code>1 2 2 3 5</code></pre>
<p>If we choose a binwidth of 1, then we get a frequency histogram that looks like this:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-simple-freq-hist-1.svg" title="plot of chunk simple-freq-hist" alt="plot of chunk simple-freq-hist" width="60%" style="display: block; margin: auto;" /></p>
<p>There are 5 observations in this sample. So, we can convert to a <span class="caps">PMF</span> by dividing the count within each bin by 5, getting a histogram that looks like this:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-simple-pmf-hist-1.svg" title="plot of chunk simple-pmf-hist" alt="plot of chunk simple-pmf-hist" width="60%" style="display: block; margin: auto;" /></p>
<p>The relative shape stays the same, but compare the values along the vertical axis between the two figures. You’ll find that they are no longer integers and are instead probabilities. The normalization procedure (dividing by 5) guarantees that the probabilities of all bins add up to 1. For this example, we find that the probability of drawing the number 1 is 0.2, drawing 2 is 0.4, drawing 3 is 0.2, drawing 4 is 0, and drawing 5 is 0.2. That is the biggest difference between a frequency histogram and a <span class="caps">PMF</span>: the frequency histogram maps from values to integer counts, while the <span class="caps">PMF</span> maps from values to fractional probabilities.</p>
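<p>The normalization of the short sample can also be reproduced by hand. The following is a minimal sketch using <code>count()</code> and <code>mutate()</code>; the names <code>sample_values</code> and <code>simple_pmf</code> are illustrative and are not part of the course dataset:</p>
<pre class="r"><code>library(tidyverse)

# The short sample from the text
sample_values <- tibble(value = c(1, 2, 2, 3, 5))

# Count how often each value occurs, then divide by the sample size n
simple_pmf <- sample_values %>%
  count(value) %>%
  mutate(probability = n / sum(n))

simple_pmf

# The probabilities should add up to 1
sum(simple_pmf$probability)</code></pre>
<p>The value 4 never occurs in the sample, so it has no row in this table; its probability is implicitly 0.</p>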
<h2 id="plotting-pmfs">Plotting PMFs</h2>
<p>The syntax for plotting a <span class="caps">PMF</span> using <code>ggplot2</code> is nearly identical to what you would use to create a frequency histogram. The one modification is that you need to include <code>y = ..density..</code> inside <code>aes()</code>. As a simple example, let’s take the full distribution of the average work travel times from earlier and plot it as a <span class="caps">PMF</span>:</p>
<pre class="r"><code>county_complete %>%
ggplot(mapping = aes(x = mean_work_travel, y = ..density..)) +
geom_histogram(binwidth = 1)</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-mean-travel-time-pmf-labeled-1.svg" title="plot of chunk mean-travel-time-pmf-labeled" alt="plot of chunk mean-travel-time-pmf-labeled" width="60%" style="display: block; margin: auto;" /></p>
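<p>One caveat is worth keeping in mind: <code>..density..</code> divides each count by the sample size <em>and</em> by the bin width, so the bar heights are probabilities here only because <code>binwidth = 1</code>. The sketch below shows one way to get probabilities for a single distribution regardless of bin width. It uses the short sample from earlier, the names <code>sample_values</code> and <code>pmf_plot</code> are illustrative, and it assumes ggplot2 version 3.3.0 or later, where <code>after_stat()</code> is available:</p>
<pre class="r"><code>library(tidyverse)

# The short sample from the text
sample_values <- tibble(value = c(1, 2, 2, 3, 5))

# Divide the count in each bin by the total count
pmf_plot <- sample_values %>%
  ggplot(mapping = aes(x = value, y = after_stat(count / sum(count)))) +
  geom_histogram(binwidth = 1) +
  labs(y = "probability")

pmf_plot

# Inspect the computed bar heights; they should add up to 1
sum(layer_data(pmf_plot)$y)</code></pre>
<p>This mapping normalizes over everything drawn in the layer, so it fits the single-distribution case. When comparing groups with <code>fill</code>, each group needs to be normalized separately, which <code>..density..</code> with <code>binwidth = 1</code> already does.</p>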
<p>Let’s do a comparison to show how one might use a <span class="caps">PMF</span> for analysis. For example, we could ask if two midwestern states such as Nebraska and Iowa have the same distribution of work travel times, or if there is a meaningful difference between the two. First, let’s filter the dataset to only include these two states:</p>
<pre class="r"><code>nebraska_iowa <- county_complete %>%
filter(state == "Iowa" | state == "Nebraska")</code></pre>
<p>Now let’s plot the frequency histogram:</p>
<pre class="r"><code>nebraska_iowa %>%
ggplot() +
geom_histogram(
mapping = aes(x = mean_work_travel, fill = state, color = state),
position = "identity", alpha = 0.5, binwidth = 1)</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-ia-ne-travel-times-freq-hist-labeled-1.svg" title="plot of chunk ia-ne-travel-times-freq-hist-labeled" alt="plot of chunk ia-ne-travel-times-freq-hist-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>The <code>position = "identity"</code> input overlaps the two distributions (instead of stacking them) and <code>alpha = 0.5</code> makes the distributions translucent, so that you can see both despite the overlap. At first glance, it looks like the center of the Nebraska times is lower than the center of the Iowa times, and that both have a long tail on the right-hand side. However, if we do a count summary,</p>
<pre class="r"><code>nebraska_iowa %>%
count(state)</code></pre>
<p>we find that the two states do not have the exact same number of counties, although they are close in this particular example. Nonetheless, any comparisons should be done using a <span class="caps">PMF</span> in order to account for differences in the sample size. We use the following code to create a <span class="caps">PMF</span> plot:</p>
<pre class="r"><code>nebraska_iowa %>%
ggplot() +
geom_histogram(
mapping = aes(x = mean_work_travel, y = ..density..,
fill = state, color = state),
position = "identity", alpha = 0.5, binwidth = 1)</code></pre>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-ia-ne-travel-times-pmf-labeled-1.svg" title="plot of chunk ia-ne-travel-times-pmf-labeled" alt="plot of chunk ia-ne-travel-times-pmf-labeled" width="60%" style="display: block; margin: auto;" /></p>
<p>The trend that the center of the travel times in Nebraska is slightly smaller than in Iowa continues to hold even after converting to a <span class="caps">PMF</span>.</p>
<p>To provide an example where a <span class="caps">PMF</span> is clearly necessary, what if we compare New Jersey with Virginia? Virginia has many more counties than New Jersey:</p>
<pre class="r"><code>county_complete %>%
filter(state == "New Jersey" | state == "Virginia") %>%
count(state)</code></pre>
<p>As a result, comparing their frequency histograms gives you this:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-va-nj-travel-times-freq-hist-1.svg" title="plot of chunk va-nj-travel-times-freq-hist" alt="plot of chunk va-nj-travel-times-freq-hist" width="60%" style="display: block; margin: auto;" /></p>
<p>The New Jersey distribution is dwarfed by the Virginia distribution, which makes comparisons difficult. However, if we instead compare PMFs, we get this:</p>
<p><img src="http://spring18.cds101.com/img/Materials/representing_distributions_pmf-va-nj-travel-times-pmf-1.svg" title="plot of chunk va-nj-travel-times-pmf" alt="plot of chunk va-nj-travel-times-pmf" width="60%" style="display: block; margin: auto;" /></p>
<p>So, for example, we can now make statements like “a randomly selected county in New Jersey is twice as likely as a randomly chosen county in Virginia to have an average work travel time of 30 minutes.” Note that each observation in the dataset is a county, so the probabilities describe counties rather than individual residents. The <span class="caps">PMF</span> allows for an “apples-to-apples” comparison of the average travel times.</p>
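<p>To see why normalization makes groups of different sizes comparable, consider a small sketch with made-up data. The <code>toy</code> table and its values are hypothetical and are not taken from <code>county_complete</code>; group B simply has twice as many observations as group A:</p>
<pre class="r"><code>library(tidyverse)

# Hypothetical toy data: group "B" has twice as many observations as "A"
toy <- tibble(
  group = c(rep("A", 4), rep("B", 8)),
  value = c(1, 2, 2, 3, 1, 1, 2, 2, 2, 2, 3, 3)
)

# Count each value within each group, then normalize within the group
toy_pmf <- toy %>%
  count(group, value) %>%
  group_by(group) %>%
  mutate(probability = n / sum(n)) %>%
  ungroup()

toy_pmf</code></pre>
<p>The counts in group B are double those in group A, but after dividing by each group’s own sample size the two PMFs come out the same. This is the same effect that lets the New Jersey and Virginia distributions be compared on equal footing.</p>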
<h2 id="credits">Credits</h2>
<div class="license">
<p>This work, <em>Probability mass functions</em>, is a derivative of <a href="http://a.co/grOJGrv">Allen B. Downey, “Chapter 3 Probability mass functions” in <em>Think Stats: Exploratory Data Analysis</em>, 2nd ed. (O’Reilly Media, Sebastopol, <span class="caps">CA</span>, 2014)</a>, used under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a>. <em>Probability mass functions</em> is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span> 4.0</a> by James Glasbrenner.</p>
</div>
Class 20: Statistical distributions I2018-04-05T13:30:00-04:002018-04-05T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-05:/materials/class-20/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class20_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 92018-04-05T13:30:00-04:002018-04-05T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-05:/assignments/reading-9/<p><span class="h3">Tutorials</span></p>
<p>Read the following tutorials on the <code>rvest</code> package and <code>SelectorGadget</code> Chrome extension.</p>
<p><span class="h4">Beginner’s Guide on Web Scraping in R (using rvest) with hands-on example</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/saurav_kaushik__beginners_guide_on_web_scraping_in_r_with_hands-on_example.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></p></li>
<li><p><a href="https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/"><i class="fas fa-link" data-fa-transform="grow-4"></i> <span class="card-downloads-format"><span class="caps">LINK</span></span></a></p></li>
</ul>
</div>
<p><span class="h4">SelectorGadget<br>Vignette</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/rvest-selectorgadget-vignette.html"><i class="fab fa-html5" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">HTML</span></span></a></li>
</ul>
</div>
<p><br></p>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading9</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, April 7th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Beginner’s Guide on Web Scraping in R (using rvest) with hands-on example2018-04-04T14:25:00-04:002018-04-04T14:25:00-04:00Saurav Kaushiktag:spring18.cds101.com,2018-04-04:/materials/beginners-guide-on-web-scraping-in-r/<p>A beginner’s guide on how to perform Web Scraping in R.</p>
<p><span class="h3">Download</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/saurav_kaushik__beginners_guide_on_web_scraping_in_r_with_hands-on_example.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
<hr />
<p><a href="https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/"><i class="fas fa-link"></i> <span class="card-downloads-format"><span class="caps">LINK</span></span></a></p>
SelectorGadgetVignette2018-04-04T14:25:00-04:002018-04-04T14:25:00-04:00Hadley Wickhamtag:spring18.cds101.com,2018-04-04:/materials/selectorgadget-vignette/<p>Selectorgadget is a javascript bookmarklet that allows you to interactively figure out what css selector you need to extract desired components from a page.</p>
<p><span class="h3">Download</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/rvest-selectorgadget-vignette.html"><i class="fab fa-html5" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">HTML</span></span></a></li>
</ul>
</div>
Class 19: Introduction to Web Scraping II/Principles of Data Collection2018-04-03T13:30:00-04:002018-04-03T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-04-03:/materials/class-19/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class19_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Web scraping2018-03-29T13:30:00-04:002018-03-29T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-03-29:/materials/web-scraping-activity/<p>Interactive demo on how to use the <span class="monospace">rvest</span> web-scraping tools.</p>
<div class="no-bullets">
<ul>
<li><a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1522347635000427"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>Web Scraping</strong> repo</span></a></li>
</ul>
</div>
Midterm Project2018-03-27T13:30:00-04:002018-03-27T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-03-27:/assignments/midterm-project/For the midterm, you will conduct an exploratory data analysis of the <span class="caps">U.S.</span> Department of Education’s <em>College Scorecard</em> dataset in teams.<p><span class="h3"><strong>Due:</strong> March 27, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/midterm_project.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/g/dA9B0Qw5"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Midterm Project</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p>For the midterm, you will be assigned into teams that will conduct an exploratory data analysis using the skills you’ve developed over the first half of the semester. Teams will be responsible for creating a report that summarizes their exploratory data analysis and for giving an in-class presentation of their results. Each team will motivate their exploration by formulating interesting questions that can be answered with the dataset, which are then answered by wrangling the data into a form that allows you to visualize it. The key word here is <strong>visualize</strong>; all data transformations should be in service of creating visualizations that answer your team’s questions. Teams are welcome to supplement their work by performing basic statistics calculations, but it is not a requirement. Teams may also bring in additional data to strengthen the analysis, but please note that any additional data must be documented, which includes describing how you obtained it and how you’ve integrated it with the main dataset to further your analysis.</p>
<h2 id="the-dataset">The Dataset</h2>
<p>All teams will be working with the <a href="https://collegescorecard.ed.gov">College Scorecard</a> dataset started by the Obama administration in September 2015. The dataset is available at <a href="https://collegescorecard.ed.gov/data/" class="uri">https://collegescorecard.ed.gov/data/</a>. <strong>The primary data file that you will be using is labeled as <em>Most recent data</em></strong>. The direct link to the dataset is <a href="https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv" class="uri">https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv</a>.</p>
<p><strong>The data code-book</strong> is available at <a href="https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx" class="uri">https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx</a>, which describes all of the variables that are in the dataset. You will have to look through the code-book to understand the meaning of the variables, and this should be your starting point before you start running an analysis on the dataset.</p>
<p>For further information about the dataset, consult the documentation pages at <a href="https://collegescorecard.ed.gov/data/documentation/" class="uri">https://collegescorecard.ed.gov/data/documentation/</a>.</p>
<p>Finally, when loading the dataset, you will need to specify the values that count as <code>NA</code> entries; otherwise, it will give you errors. The simplest way to load without errors is to run</p>
<pre class="r"><code>college <- read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", na = c("NA", "NULL"))</code></pre>
<p>For your convenience, this command is included in the setup block in your team’s project template.</p>
<p><strong>This is a large dataset that is over 100 megabytes in size and contains millions of individual cells.</strong> <strong>As such, there is no one right way to approach this project.</strong> <strong>There are many different avenues that you can take, so have fun with it!</strong></p>
<h2 id="submission-guidelines">Submission guidelines</h2>
<p>Your Midterm Project submission is expected to meet the following guidelines:</p>
<ul>
<li><p><strong>Your team submission will only be graded if it is on Github and in your team’s copy of the midterm project repository.</strong> <strong>Your team must also submit a Pull Request against the <span class="monospace">starting</span> branch.</strong></p></li>
<li><p>Your submission must contain the team’s RMarkdown file for the final report and a copy of your team’s presentation slides. The RMarkdown report file must knit to <span class="caps">HTML</span> and <span class="caps">PDF</span> without error.</p></li>
<li><p>The report is to have two sections, <strong>Cleaning and tidying the dataset</strong> and <strong>Exploratory data analysis</strong>. See further down the document for a description of each.</p></li>
<li><p>The data analysis must be completed using the <span class="monospace">tidyverse</span> tools described in <a href="http://r4ds.had.co.nz/"><em>R for Data Science</em></a> and the plots must be generated using <span class="monospace">ggplot2</span>.</p></li>
<li><p>Each team member is expected to formulate one question that he/she then answers in the final report.</p></li>
<li><p>Each team member must contribute substantive commits to the team repository on Github that reflect his/her own work.</p></li>
<li><p>The report represents your team’s final results and should only contain the methods used to obtain them. <strong>Do not include questions in the report that you cannot answer.</strong></p></li>
<li><p>Your R code should be clean and readable. For guidelines, take a look at <a href="https://google.github.io/styleguide/Rguide.xml">Google’s R Style Guide</a> and <a href="http://r-pkgs.had.co.nz/style.html">Hadley Wickham’s R Style Guide</a>.</p></li>
<li><p>Your work is to be documented using Markdown blocks. Each block of R code should have Markdown text above it that explains the code’s purpose and what is being done.</p></li>
<li><p><strong>Teams must do a final edit on the report so that the final submission has a coherent writing style and structure.</strong> Each student has a different writing style, and it is distracting if the writing style changes from question to question, so select an editor for the team to make it uniform before submission. In addition, the code blocks should look uniform and clean upon knitting, and any spelling and grammar errors need to be fixed.</p></li>
<li><p><strong>The report’s tone should be professional and should not read like a social media feed or personal blog</strong>. Refrain from editorializing about the project as a whole or about a specific question; this is not an opinion paper. Do not speculate. Instead, support your claims and explanations using data and analysis. Avoid self-narration or writing about how you felt or what you were thinking as you completed each question. Instead, write as if you are constructing a step-by-step tutorial for others to use.</p></li>
<li><p><strong>Late submissions for the midterm project will not be accepted and your presentation must be given on the scheduled date, no exceptions.</strong></p></li>
</ul>
<h2 id="presentation-guidelines">Presentation guidelines</h2>
<p>Your presentation is expected to meet the following guidelines:</p>
<ul>
<li><p>For teams of three, the presentation must be between 8 and 10 minutes in length. For teams of four, the presentation must be between 12 and 14 minutes in length.</p></li>
<li><p>The presentation summarizes key aspects of your report, such as what your questions are, what you needed to find that would answer them, and the choices you made that led to answers. The presentation should <strong>not</strong> be a laundry list of everything you tried to do that didn’t work. It also shouldn’t contain much R code; only include the most important snippets.</p></li>
<li><p>The presentation must include slides that you created using either PowerPoint or Google Slides. Teams that are interested in using the R package I use to create the <span class="caps">HTML5</span> slides from class can talk to me separately.</p></li>
<li><p>Each team member must speak during the presentation. Also, while it is understood that each team member will offer different contributions, every team member should be able to speak independently about the steps taken in your project and answer basic questions.</p></li>
<li><p>The team should be able to explain the reasoning for taking any particular step during the project.</p></li>
</ul>
<h2 id="grade">Grade</h2>
<p>The submitted write-up is worth 60% of your midterm project grade and your in-class presentation is worth the remaining 40%. Grading criteria for the written submission will be based on the correctness and readability of your R code, whether your write-up is structured, coherent, and free of spelling and grammar errors, and the general quality of how you answer each of your presented questions. Grading criteria for the presentation will be based on falling within the time length, how long each team member speaks during the presentation, whether the spoken content is substantive, and the general quality of how the group presents questions and shows how they arrived at their answers. Finally, your classmates will peer review your presentations using a form, and the reviews will be averaged together and factored into the presentation grade.</p>
<p>You will be graded as an individual, even though this is a team project. Any team members that are judged to have not sufficiently contributed to the final product will have their grade penalized.</p>
<p><a href="/syllabus.html#breakdown">As stated in the class syllabus</a>, this project is worth 25% of your class grade.</p>
<h2 id="cleaning-and-tidying-the-dataset-section">Cleaning and tidying the dataset section</h2>
<p>This dataset is semi-structured and at least somewhat clean, but it is likely that you will have to perform a small amount of cleaning and/or tidying. A simple example is cell entries containing “PrivacySuppressed”, which may reside in columns that otherwise contain numerical data. You may need to fix the data type for some columns after dealing with the “PrivacySuppressed” entries. Also, some of the columns don’t follow the “each variable must have its own column” rule, so you will likely have to do some data reshaping before starting your analysis.</p>
<p>Attempting to reshape the full dataset so that it is tidy would be an involved task. To keep things manageable, it is recommended that, after you’ve written down your questions and decided on the variables you will be analyzing and visualizing, you extract those columns using <code>select()</code> in order to reduce the size of the dataset. Assign the reduced dataset to a different variable (don’t overwrite the original dataset). Afterwards you can perform your tidying and cleaning operations on the <strong>reduced</strong> dataset.</p>
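<p>As a rough illustration of this workflow, the sketch below selects a couple of columns and converts “PrivacySuppressed” entries to <code>NA</code> before fixing the column type. The table <code>college_toy</code> and its column names are hypothetical stand-ins, not actual College Scorecard variables, so substitute the columns your team chooses from the code-book:</p>
<pre class="r"><code>library(tidyverse)

# Hypothetical stand-in for a few columns of the full dataset
college_toy <- tibble(
  school = c("A", "B", "C"),
  median_earnings = c("41000", "PrivacySuppressed", "52000"),
  other_column = c(1, 2, 3)
)

# Keep only the columns needed, store the result under a new name,
# then turn "PrivacySuppressed" into NA and convert the column to numeric
college_reduced <- college_toy %>%
  select(school, median_earnings) %>%
  mutate(
    median_earnings = as.numeric(na_if(median_earnings, "PrivacySuppressed"))
  )

college_reduced</code></pre>
<p>Another option that may work is to add <code>"PrivacySuppressed"</code> to the <code>na</code> argument of <code>read_csv()</code> when loading the data, so that these entries are treated as missing from the start.</p>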
<p>In your write-up, the data cleaning and tidying section should include documentation of your procedure. This would be in the form of code blocks with explanations for why the cleaning/tidying is necessary (for example, cite the tidy data rule that you’re addressing).</p>
<h2 id="exploratory-data-analysis-section">Exploratory data analysis section</h2>
<p><strong>Each team member will construct and answer 1 question about the dataset in this section</strong>. So, for example, a team of three members will have 3 questions in total. Each question must involve one or more visualizations. <strong>Your questions must be about comparing relationships between two or more variables in the dataset</strong>, which can include how a variable is distributed across several different categories. In addition, answering the question must require that you make use of both <em>data transformation</em> (<span class="monospace">dplyr</span>) and <em>data visualization</em> (<span class="monospace">ggplot2</span>). Running visualizations on different columns “out of the box” without any kind of filtering or grouping/summarizing is not sufficient for this project.</p>
<p>In the report, each question should be clearly stated and followed by the procedure used to answer it. The procedure takes the form of both code blocks and plain text. Then, after you obtain your final result in the form of a visualization, <strong>be sure to interpret it for the reader</strong>. For example, if it’s a distribution, what are its shape and center? If it’s a scatter plot, what is the trend of the points? After analyzing the various outputs, synthesize them and provide a formal answer to your stated question.</p>
<h2 id="how-to-submit">How to submit</h2>
<p>When your team is ready to submit, first confirm that everything has been saved, committed, and pushed to Github. Then, have one of your team members navigate to your <strong>team’s copy</strong> of the <a href="https://classroom.github.com/g/dA9B0Qw5">Github repository</a> you used for this project. He/she should confirm that the repository webpage contains the final version of the report and PowerPoint slides. After confirming that the files are up-to-date, follow these steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Midterm Project, Team #</span>, replacing <span class="monospace">#</span> with your assigned team number.</p></li>
<li><p>In the messagebox, write: <span class="monospace">Our team’s midterm project report is ready for grading @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
Class 16: Introduction to Web Scraping I2018-03-22T13:30:00-04:002018-03-22T13:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-03-22:/materials/class-16/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class16_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Homework 22018-03-09T23:59:00-05:002018-03-09T23:59:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-03-09:/assignments/homework-2/For your second homework assignment, you will explore a dataset about the passengers on the <em>Titanic</em>, the British passenger liner that crashed into an iceberg during its maiden voyage and sank early in the morning on April 15, 1912.<p><span class="h3"><strong>Due:</strong> March 9, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_homework2.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/5gVLthTb"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Homework 2</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p><a href="https://classroom.github.com/a/5gVLthTb">Obtain the Github repository you will use to complete homework 2</a> that contains a starter RMarkdown file named <code>homework_2.Rmd</code>, which you will use to do your work and write-up when completing the questions below. Remember to fill in your name at the top of the RMarkdown document and be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the <a href="#how-to-submit">How to submit</a> section below to set up a Pull Request, which will be used for feedback.</p>
<ul>
<li><p>Remember that the point of us using RMarkdown documents is to combine code and writeups! Each block of R code should have some sort of explanation or justification using full sentences.</p></li>
<li><p><strong>Your grade will take into account your code, your explanations, and whether your document looks nice when “knitted” to <span class="caps">HTML</span> or <span class="caps">PDF</span>.</strong></p></li>
</ul>
<h2 id="overview">Overview</h2>
<p>
<figure>
<img src="/img/titanic_photograph.jpg" width="300" style="display: block; margin: auto;" />
<figcaption style="text-align: center;">
A photograph of the <em>Titanic</em> leaving Southampton on April 10, 1912.
</figcaption>
</figure>
</p>
<p>For this homework assignment, you will be exploring a dataset about the passengers on the <em>Titanic</em>, the British passenger liner that crashed into an iceberg during its maiden voyage and sank early in the morning on April 15, 1912. The tragedy stands out as one of the deadliest commercial maritime disasters during peacetime in history. More than half of the passengers and crew died, due in large part to poor safety standards, such as not having enough lifeboats or not ensuring all lifeboats were filled to capacity during evacuation.</p>
<p>This dataset presents the most up-to-date knowledge about the passengers who were on the <em>Titanic</em>, including whether or not they survived. It is frequently used to introduce machine learning techniques that take multiple inputs and use them to predict an outcome, in this case whether a passenger is likely to have survived. While we won’t be using a machine learning model in this assignment, there is still a lot that can be learned by exploring the dataset using the <code>tidyverse</code> suite.</p>
<p>The dataset is included in your <a href="https://classroom.github.com/a/5gVLthTb">Github starter code repository</a>.</p>
<h2 id="about-the-dataset">About the dataset</h2>
<p>The following are the variable (column) descriptions for the dataset:<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<table>
<colgroup>
<col style="width: 14%" />
<col style="width: 85%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Variable</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">pclass</span></td>
<td style="text-align: left;">Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">survived</span></td>
<td style="text-align: left;">Survival (0 = No; 1 = Yes)</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">name</span></td>
<td style="text-align: left;">Name</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">sex</span></td>
<td style="text-align: left;">Sex</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">age</span></td>
<td style="text-align: left;">Age</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">sibsp</span></td>
<td style="text-align: left;">Number of Siblings/Spouses Aboard</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">parch</span></td>
<td style="text-align: left;">Number of Parents/Children Aboard</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">ticket</span></td>
<td style="text-align: left;">Ticket Number</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">fare</span></td>
<td style="text-align: left;">Passenger Fare (British pound)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">cabin</span></td>
<td style="text-align: left;">Cabin</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">embarked</span></td>
<td style="text-align: left;">Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">boat</span></td>
<td style="text-align: left;">Lifeboat</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><span class="monospace">body</span></td>
<td style="text-align: left;">Body Identification Number</td>
</tr>
<tr class="even">
<td style="text-align: left;"><span class="monospace">home.dest</span></td>
<td style="text-align: left;">Home/Destination</td>
</tr>
</tbody>
</table>
<p>Also note that the following definitions were used for <code>sibsp</code> and <code>parch</code>:</p>
<table>
<colgroup>
<col style="width: 14%" />
<col style="width: 85%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Label</th>
<th style="text-align: left;">Definition</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Sibling</td>
<td style="text-align: left;">Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic</td>
</tr>
<tr class="even">
<td style="text-align: left;">Spouse</td>
<td style="text-align: left;">Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Parent</td>
<td style="text-align: left;">Mother or Father of Passenger Aboard Titanic</td>
</tr>
<tr class="even">
<td style="text-align: left;">Child</td>
<td style="text-align: left;">Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic</td>
</tr>
</tbody>
</table>
<h2 id="questions">Questions</h2>
<ol class="example" type="1">
<li><p>When reading in the dataset using <code>read_csv(file = "titanic_dataset.csv")</code>, several of the columns are converted into inconvenient data types. Fix this so that your later analysis does not run into problems. Use the <code>col_types = cols()</code> argument within <code>read_csv()</code> (<a href="http://r4ds.had.co.nz/data-import.html#problems">see this section of <em>R for Data Science</em> for a review</a>) to change the data type defaults for the following columns:</p>
<ul>
<li><p>Convert <code>survived</code> to the logical data type</p></li>
<li><p>Convert <code>pclass</code> to the character data type</p></li>
<li><p>Convert <code>sibsp</code> to the character data type</p></li>
<li><p>Convert <code>parch</code> to the character data type</p></li>
</ul></li>
<li><p>Compute how many known passengers were on the <em>Titanic</em>. <em>Do not just print the table; use a function to count the passengers</em>.</p></li>
<li><p>A famous directive for evacuating the <em>Titanic</em> was “women and children first”. Use your <code>dplyr</code> functions to verify the first part of this statement by counting the number of men and women that survived and that died. Then, using those counts, calculate the fraction of women that survived, <span class="math display">\[\dfrac{\text{Number of female survivors}}{\text{Total number of female passengers}}\]</span> and the fraction of men that survived, <span class="math display">\[\dfrac{\text{Number of male survivors}}{\text{Total number of male passengers}}\]</span> Do your computations support the idea that women were more likely to survive? Why or why not?</p></li>
<li><p>Verify the second part of the “women and children first” directive. This will not be as straightforward as it was in the previous question, as the dataset only contains people’s ages, which can take on many values. By default, there are no columns with labels of <em>child</em> or <em>adult</em>, so you will need to create your own.</p>
<p>Create a new column named <code>child_or_adult</code> that uses the age data to label each passenger. For our purposes, we want to label anyone aged 0–9 as a “child” and anyone aged 10 and up as an “adult”. If the age cell is blank (<code>NA</code>) for a passenger, also label them as an “adult”. Assign this updated dataset to the variable <code>titanic_age_groups</code>.</p>
<p><strong>Hint:</strong> You will need to use the <code>ifelse()</code> function to complete this task. An example usage of <code>ifelse()</code> is the following:</p>
<pre class="r"><code>titanic %>%
  mutate(cheap_or_expensive = ifelse(test = fare < 15,
                                     yes = "cheap ticket",
                                     no = "not cheap"))</code></pre>
<p>To handle blank entries, you will also need to use <code>is.na()</code> somewhere inside your <code>ifelse()</code> test.</p></li>
<li><p>Using the <code>titanic_age_groups</code> dataset you created in the previous question, count the number of children that survived and the number that did not. Do your computations support the idea that children were also more likely to survive? Why or why not?</p></li>
<li><p>A passenger’s age group and sex are not the only predictors of survival. For example, social standing and wealth can also be factors. One of the variables in this dataset acts as a proxy for distinguishing between the upper and lower classes. Which variable is it? How do you know?</p></li>
<li><p>Group your dataset by <code>sex</code> and the variable you determined in question 6 and count the number that survived and the number that did not. Create a bar chart that summarizes the data, where <code>survived</code> is along the horizontal axis and the passenger counts are along the vertical axis. Use the bar chart <code>fill =</code> aesthetic to break the bar charts down by your variable from question 6. Additionally, facet over the <code>sex</code> variable. Interpret this visualization and describe any survival patterns that you notice.</p></li>
<li><p>Create two visualizations:</p>
<ul>
<li><p>The first visualization should be a bar chart displaying the fraction of the passengers that survived for different values of <code>parch</code>, <span class="math display">\[\dfrac{\text{For a given parch, the number of survivors}}{\text{Total number of passengers}}.\]</span> Doing this requires grouping your data properly, counting the number of passengers in each grouping, and then dividing this by the total number of passengers on the ship.</p></li>
<li><p>The second visualization should be a bar chart displaying the fraction of the passengers that survived for different values of <code>sibsp</code>, <span class="math display">\[\dfrac{\text{For a given sibsp, the number of survivors}}{\text{Total number of passengers}}.\]</span> Like above, doing this requires grouping your data properly, counting the number of passengers in each grouping, and then dividing this by the total number of passengers on the ship.</p></li>
</ul>
<p>Interpret the patterns that you see in the visualizations.</p></li>
<li><p>Based on your analysis, write a list of the factors that affected the chances of survival for each passenger. You should be able to identify 4 different attributes that had a noticeable impact on survival. Justify each attribute that you list by referencing back to a table or visualization you created in a previous question.</p></li>
</ol>
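<p>The column-type and labeling techniques named in the questions above can be sketched on a small made-up table rather than on the assignment data. Everything specific in this sketch is an assumption made for illustration: the column names <code>id</code>, <code>group</code>, and <code>score</code>, the cutoff of 20, and the label text are hypothetical, and the text is wrapped in <code>I()</code> so that <code>read_csv()</code> treats it as literal data instead of a file path, which recent versions of <code>readr</code> expect.</p>
<pre class="r"><code>library(readr)
library(dplyr)

# Hypothetical four-row table standing in for a real CSV file.
toy_csv <- "id,group,score\n1,a,12.5\n2,b,\n3,a,40\n4,b,8\n"

# Without col_types, read_csv() would guess that `id` is numeric;
# cols() overrides the guess for the columns it names.
toy <- read_csv(
  file = I(toy_csv),
  col_types = cols(id = col_character())
)

# is.na() inside the ifelse() test sends blank scores to the "yes" branch.
toy_labeled <- toy %>%
  mutate(score_label = ifelse(test = is.na(score) | score >= 20,
                              yes = "high or unknown",
                              no = "low"))

# count() reports how many rows fall into each combination of labels.
toy_labeled %>%
  count(group, score_label)</code></pre>
<p>Checking <code>is.na()</code> first matters because comparing <code>NA</code> with a number returns <code>NA</code>, and <code>ifelse()</code> would then leave the label blank as well.</p>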
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/5gVLthTb">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <span class="monospace">Submission: Homework 2, FirstName LastName</span>, replacing <span class="monospace">FirstName</span> and <span class="monospace">LastName</span> with your actual first and last name.</p></li>
<li><p>In the message box, write: <span class="monospace">My homework submission is ready for grading @shuaibm @jkglasbrenner</span>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
<li><p><a href="/doc/ggplot2-cheatsheet.pdf"><span class="monospace">ggplot2</span> cheatsheet</a></p></li>
<li><p><a href="/doc/data-transformation-cheatsheet.pdf">Data transformation cheatsheet</a></p></li>
<li><p><a href="/doc/data-import-cheatsheet.pdf">Data import cheatsheet</a></p></li>
</ul>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html" class="uri">http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html</a><a href="#fnref1" class="footnote-back">↩</a></p></li>
</ol>
</section>
Exploring the Medicare dataset II 2018-03-08T13:30:00-05:00 2018-03-08T13:30:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-03-08:/materials/exploring-medicare-dataset-activity-ii/ <p>Continuation of the instructor-led exploration of the Medicare inpatient payments dataset.</p>
<div class="no-bullets">
<ul>
<li><a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1520361454000399"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>Exploratory Data Analysis</strong> repo</span></a></li>
</ul>
</div>
Exploring the Medicare dataset I 2018-03-06T13:30:00-05:00 2018-03-06T13:30:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-03-06:/materials/exploring-medicare-dataset-activity-i/ <p>Instructor-led exploration of the Medicare inpatient payments dataset.</p>
<div class="no-bullets">
<ul>
<li><a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1520361454000399"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>Exploratory Data Analysis</strong> repo</span></a></li>
</ul>
</div>
Reading 8 2018-03-01T13:30:00-05:00 2018-03-01T13:30:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-03-01:/assignments/reading-8/ <p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>From <a href="http://r4ds.had.co.nz/tidy-data.html">chapter 12</a>: section <a href="http://r4ds.had.co.nz/tidy-data.html#introduction-6">12.1</a> through to the end of section <a href="http://r4ds.had.co.nz/tidy-data.html#spreading-and-gathering">12.3</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading8</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, March 3rd.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 11: Data wrangling IV 2018-02-27T17:00:00-05:00 2018-02-27T17:00:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-02-27:/materials/class-11/ <p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class11_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Tidy gradebook 2018-02-27T17:00:00-05:00 2018-02-27T17:00:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-02-27:/materials/tidy-gradebook-activity/ <p>Interactive demonstration showing how to apply the <a href="http://r4ds.had.co.nz/tidy-data.html">Tidy Data principles</a> to a typical classroom gradebook.</p>
<div class="no-bullets">
<ul>
<li><a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1519755978000359"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>Tidy Gradebook</strong> repo</span></a></li>
</ul>
</div>
Reading 7 2018-02-27T13:30:00-05:00 2018-02-27T13:30:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-02-27:/assignments/reading-7/ <p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">chapter 7</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading7</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Thursday, March 1st.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Homework 1 2018-02-23T23:59:00-05:00 2018-02-23T23:59:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-02-23:/assignments/homework-1/ Your first major assignment is a set of exercises based around a single dataset called <code>rail_trail</code>, which will provide you with practice in creating visualizations using R and <code>ggplot2</code>.<p><span class="h3"><strong>Due:</strong> February 23, 2018 @ 11:59pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_homework1.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="https://classroom.github.com/a/tHjWBa6i"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Homework 1</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p><a href="https://classroom.github.com/a/tHjWBa6i">Obtain the Github repository you will use to complete homework 1</a>, which contains a starter RMarkdown file named <code>homework_1.Rmd</code>. Use this file to do your work and write-up when completing the questions below. Be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the <a href="#how-to-submit"><strong>How to submit</strong> section</a> below to set up a Pull Request, which will be used for feedback.</p>
<ul>
<li><p>Remember that the point of us using RMarkdown documents is to combine code and writeups! Each block of R code should have some sort of explanation or justification using full sentences.</p></li>
<li><p><strong>Your grade will take into account your code, your explanations, and whether your document looks nice when “knitted” to <span class="caps">HTML</span> or <span class="caps">PDF</span></strong>.</p></li>
</ul>
<h2 id="the-rail-trail-dataset">The rail trail dataset</h2>
<p><img src="http://spring18.cds101.com/img/mass-central-logo.jpg" title="plot of chunk mass-central-logo" alt="plot of chunk mass-central-logo" width="300px" style="display: block; margin: auto;" /></p>
<p>For this homework assignment, you will be working through a set of visualization problems based on the <code>rail_trail</code> dataset. The <code>rail_trail</code> dataset was collected by the Pioneer Valley Planning Commission (<span class="caps">PVPC</span>) and counts the number of people who walked past a sensor on a <em>rail trail</em> during a ninety-day period. <a href="https://wikipedia.org/wiki/Rail_trail">A <em>rail trail</em> is a retired or abandoned railway that was converted into a walking trail</a>. The data was collected from April 5, 2005 to November 15, 2005 using a laser sensor placed at a location north of Chestnut Street in Florence, <span class="caps">MA</span>.</p>
<p><img src="http://spring18.cds101.com/img/florence-chestnut-street-rail-trail.png" title="plot of chunk florence-ma-chestnut" alt="plot of chunk florence-ma-chestnut" width="50%" style="display: block; margin: auto;" /></p>
<p>The dataset contains the following variables:</p>
<table>
<colgroup>
<col style="width: 12%" />
<col style="width: 87%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Variable</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><code>hightemp</code></td>
<td style="text-align: left;">daily high temperature (in degrees Fahrenheit)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><code>lowtemp</code></td>
<td style="text-align: left;">daily low temperature (in degrees Fahrenheit)</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><code>avgtemp</code></td>
<td style="text-align: left;">average of daily low and daily high temperature (in degrees Fahrenheit)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><code>season</code></td>
<td style="text-align: left;">indicates whether the season was Spring, Summer, or Fall</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><code>cloudcover</code></td>
<td style="text-align: left;">measure of cloud cover (in oktas)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><code>precip</code></td>
<td style="text-align: left;">measure of precipitation (in inches)</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><code>volume</code></td>
<td style="text-align: left;">estimated number of trail users that day (number of breaks recorded)</td>
</tr>
<tr class="even">
<td style="text-align: left;"><code>weekday</code></td>
<td style="text-align: left;">indicator of whether the day was a non-holiday weekday</td>
</tr>
</tbody>
</table>
<h2 id="questions">Questions</h2>
<p>When describing the contents of a visualization, follow the general ideas summarized here: <a href="http://spring18.cds101.com/materials/describing-univariate-and-bivariate-data/" class="uri">http://spring18.cds101.com/materials/describing-univariate-and-bivariate-data/</a></p>
<ol class="example" type="1">
<li><p>In the <code>rail_trail</code> dataset, how many rows are there? How many columns? Which variables in the dataset are continuous/numerical and which are categorical?</p></li>
<li><p>Create a histogram of the variable <code>volume</code> using the following code:</p>
<pre class="r"><code>ggplot(data = rail_trail) +
  geom_histogram(mapping = aes(x = volume))</code></pre>
<p>Describe the <em>shape</em> and <em>center</em> of the distribution. Afterward, try adjusting the size of the histogram bins by adding the <code>binwidth</code> input. To start with, use <code>binwidth = 21</code>. If you need help with where to place <code>binwidth</code>, read the documentation by running <code>?geom_histogram</code> in your <em>Console</em> window. Then, find a binwidth that’s too narrow and another one that’s too wide to produce a meaningful histogram.</p></li>
<li><p>Create a histogram for each of the remaining numerical variables, and describe the <em>shape</em> and <em>center</em> of each distribution. Are there any distributions that are similar in <em>shape</em> to each other?</p></li>
<li><p>Use <code>geom_point()</code> to create a scatterplot that plots <code>weekday</code> versus <code>season</code>. Why is this plot not useful?</p></li>
<li><p>Create a <code>geom_count()</code> plot (an alternative to a mosaic plot) using the same variables you considered in question 4:</p>
<pre class="r"><code>ggplot(data = rail_trail) +
  geom_count(mapping = aes(x = season, y = weekday))</code></pre>
<p>Which circle in the plot takes up the most area? Explain the meaning of the different size circles in the plot and what information it contains that is missing in the previous scatter plot.</p></li>
<li><p>Run <code>?geom_bar</code> in the <em>Console</em> window and read the documentation for <code>geom_bar()</code>, and then look at the entry for it on the <a href="/doc/ggplot2-cheatsheet.pdf">ggplot2 cheatsheet</a>. Use <code>geom_bar()</code> to reproduce the following bar chart:</p>
<p><img src="http://spring18.cds101.com/img/Assignments/homework01-seasons-bar-chart-1.svg" title="plot of chunk seasons-bar-chart" alt="plot of chunk seasons-bar-chart" width="70%" style="display: block; margin: auto;" /></p>
<p>After reproducing the plot, explain what the height of each bar means.</p></li>
<li><p>Starting from the code snippet you deduced in question 6, create two more bar charts:</p>
<ul>
<li><p>Create a bar chart by supplying the input <code>position = "dodge"</code> to <code>geom_bar()</code>.</p></li>
<li><p>Create a bar chart by supplying the input <code>position = "fill"</code> to <code>geom_bar()</code>.</p></li>
</ul>
<p>After creating the visualizations, describe the feature that <code>position</code> controls.</p></li>
<li><p>Create a bar chart that maps its aesthetic <code>aes()</code> to <code>precip > 0</code>. Interpret what this bar chart means.</p></li>
<li><p>Create a scatter plot of <code>volume</code> versus <code>hightemp</code> using <code>geom_point()</code>. Describe any trends that you see.</p></li>
<li><p>Take the code snippet you wrote for question 9 and map the <code>weekday</code> variable to <code>color</code>. Then create a second plot where, instead of mapping <code>weekday</code> to <code>color</code>, you <em>facet</em> over <code>weekday</code> using either <code>facet_wrap()</code> or <code>facet_grid()</code>. Discuss the advantages and disadvantages of faceting instead of mapping to the <code>color</code> aesthetic. How might the balance change if you had a larger dataset?</p></li>
<li><p>Take the code snippet that you wrote down in question 10 that faceted over <code>weekday</code> and create a model for each facet panel using <code>geom_smooth()</code>. Discuss the trends in the number of rail trail users that <code>geom_smooth()</code> picks up.</p></li>
<li><p>Copy the code snippet you deduced in question 11 and use the input <code>se = FALSE</code> for <code>geom_smooth()</code>. What does the <code>se</code> input option for <code>geom_smooth()</code> control?</p></li>
</ol>
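<p>The layering pattern that several of the questions above rely on can be sketched with the <code>mpg</code> dataset that ships with <code>ggplot2</code>. This is only an illustration under stated assumptions: <code>mpg</code> and its columns <code>hwy</code>, <code>displ</code>, and <code>drv</code> are stand-ins chosen for the example and are not part of <code>rail_trail</code>, and the bin width of 2 is an arbitrary value.</p>
<pre class="r"><code>library(ggplot2)

# `mpg` is a stand-in dataset that ships with ggplot2.
# binwidth is an argument of geom_histogram() itself, so it sits outside aes().
hist_plot <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Placing the mapping in ggplot() shares it with every layer that follows;
# facet_wrap() then draws one panel per value of `drv`.
facet_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ drv)

# Printing a plot object draws it.
print(hist_plot)
print(facet_plot)</code></pre>
<p>Storing each plot in a variable before printing it makes it easier to try one change at a time, such as a different bin width, and compare the results.</p>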
<h2 id="how-to-submit">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/tHjWBa6i">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:</p>
<ol type="1">
<li>Click the <em>Pull Requests</em> tab near the top of the page.</li>
<li>Click the green button that says “New pull request”.</li>
<li>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</li>
<li>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</li>
<li>Click the green button that says “Create pull request”.</li>
<li>Give the <em>pull request</em> the following title: <code>Submission: Homework 1, FirstName LastName</code>, replacing <code>FirstName</code> and <code>LastName</code> with your actual first and last name.</li>
<li>In the message box, write <code>My homework 1 submission is ready for grading @shuaibm @jkglasbrenner</code>.</li>
<li>Click “Create pull request” to lock in your submission.</li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><a href="/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></li>
<li><a href="/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></li>
<li><a href="/doc/rmarkdown-reference.pdf">RMarkdown reference</a></li>
<li><a href="/doc/ggplot2-cheatsheet.pdf"><code>ggplot2</code> cheatsheet</a></li>
</ul>
Class 10: Data wrangling III 2018-02-22T13:30:00-05:00 2018-02-22T13:30:00-05:00 Dr. Glasbrenner tag:spring18.cds101.com,2018-02-22:/materials/class-10/ <p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class10_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 62018-02-22T13:30:00-05:002018-02-22T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-22:/assignments/reading-6/<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of <a href="http://r4ds.had.co.nz/tibbles.html">chapter 10</a></li>
<li>From <a href="http://r4ds.had.co.nz/data-import.html">chapter 11</a>: sections <a href="http://r4ds.had.co.nz/data-import.html#introduction-5">11.1</a>, <a href="http://r4ds.had.co.nz/data-import.html#getting-started">11.2</a>, <a href="http://r4ds.had.co.nz/data-import.html#problems">11.4.2</a>, and <a href="http://r4ds.had.co.nz/data-import.html#writing-to-a-file">11.5</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading6</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, February 24th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 9: Data wrangling II2018-02-20T13:30:00-05:002018-02-20T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-20:/materials/class-9/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class09_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
dplyr demos II2018-02-20T13:30:00-05:002018-02-20T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-20:/materials/dplyr-demos-activity-ii/<p>Continuation of the interactive demonstration of the major features found in the <span class="monospace">dplyr</span> package.</p>
<div class="no-bullets">
<ul>
<li><a href="https://classroom.github.com/a/xku1H3sP"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>dplyr demo</strong> repo</span></a></li>
</ul>
</div>
Reading 52018-02-20T13:30:00-05:002018-02-20T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-20:/assignments/reading-5/<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of <a href="http://r4ds.had.co.nz/workflow-basics.html">chapter 4</a> (short)</li>
<li>All of <a href="http://r4ds.had.co.nz/transform.html">chapter 5</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading5</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Thursday, February 22nd.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 8: Data wrangling I2018-02-15T13:30:00-05:002018-02-15T15:50:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-15:/materials/class-8/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class08_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
dplyr demos I2018-02-15T13:30:00-05:002018-02-15T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-15:/materials/dplyr-demos-activity/<p>An interactive demonstration of the major features found in the <span class="monospace">dplyr</span> package.</p>
<div class="no-bullets">
<ul>
<li><a href="https://classroom.github.com/a/xku1H3sP"><i class="fab fa-github-alt" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><strong>dplyr demo</strong> repo</span></a></li>
</ul>
</div>
Describing univariate and bivariate data2018-02-14T15:45:00-05:002018-02-14T15:45:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-14:/materials/describing-univariate-and-bivariate-data/How to write about visualizations of univariate (one variable) and bivariate (two variables) data.<div class="no-bullets">
<ul>
<li><a href="/doc/describing_univariate_and_bivariate_data.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Download this guide</a></li>
</ul>
</div>
<p>Many of the figures that we will be creating and analyzing during the course will be representations of univariate (meaning one variable) and bivariate (meaning two variables) data. You will frequently be asked to write a description about a visualization, and you should aim to be precise and consistent in your terms. Use the short summaries below as a guide and a reminder when writing about the features contained in a univariate or bivariate plot.</p>
<h2 id="describing-univariate-data">Describing univariate data</h2>
<p>When describing the visual properties of univariate data, remember to discuss the following traits:</p>
<ul>
<li><p>shape:</p>
<ul>
<li><p>right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)</p></li>
<li><p>unimodal, bimodal, multimodal, uniform</p></li>
</ul></li>
<li><p>center: mean (<code>mean</code>), median (<code>median</code>), mode (not always useful)</p></li>
<li><p>spread: range (<code>range</code>), standard deviation (<code>sd</code>), inter-quartile range (<code>IQR</code>)</p></li>
<li><p>unusual observations</p></li>
</ul>
<p>For additional guidance, <a href="http://stattrek.com/statistics/charts/data-patterns.aspx">follow this link for a summary of what the above terms mean</a>.</p>
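<p>As a rough illustration of the center and spread functions named above, the sketch below applies them to a small made-up vector. The values in <code>x</code> are hypothetical and chosen only to produce a right-skewed sample with one unusual observation; they do not come from any course dataset.</p>

```r
# Hypothetical sample: mostly small values plus one unusually large observation
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 9, 30)

# Center: the mean is pulled toward the long right tail, the median is not
center <- c(mean = mean(x), median = median(x))

# Spread: range() returns the minimum and maximum as a length-2 vector
value_range <- range(x)
spread <- c(sd = sd(x), IQR = IQR(x))

print(center)
print(value_range)
print(spread)

# A mean noticeably larger than the median is one numerical hint of right skew
right_skew_hint <- mean(x) > median(x)
print(right_skew_hint)
```

<p>Numbers like these support a written description, but they do not replace looking at the plot: shape and unusual observations are usually easier to judge from a histogram or boxplot.</p>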
<h2 id="describing-bivariate-data">Describing bivariate data</h2>
<p>When describing the visual properties of bivariate data, you will frequently be looking at a scatterplot. When describing the shapes of scatterplots we highlight:</p>
<ul>
<li><p><strong>Direction</strong>: What direction is the data trending? Positive direction or negative direction?</p></li>
<li><p><strong>Form</strong>: This is analogous to <em>shape</em> for univariate data. Is the dataset linear? Is it curved? Does it not have a form?</p></li>
<li><p><strong>Strength</strong>: How clustered are the data points around the underlying form? Stated another way, how strong is the correlation? Typical descriptors are <em>strong</em>, <em>moderate</em>, or <em>weak</em>.</p></li>
</ul>
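<p>Direction and strength can be given a rough numerical counterpart with the correlation coefficient from <code>cor()</code>. The sketch below uses made-up data, and the cutoffs for <em>strong</em>, <em>moderate</em>, and <em>weak</em> are illustrative assumptions rather than a standard or a course rule. Note also that the Pearson correlation only measures how closely the points follow a straight line, so it says little about curved forms.</p>

```r
# Hypothetical data: y rises roughly linearly with x, plus a little noise
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(x, y)

# Direction: the sign of r
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "none"

# Strength: the size of |r|, using assumed cutoffs for illustration only
strength <- if (abs(r) >= 0.7) {
  "strong"
} else if (abs(r) >= 0.3) {
  "moderate"
} else {
  "weak"
}

cat("r =", round(r, 3), "| direction:", direction, "| strength:", strength, "\n")
```

<p>Form still has to be judged from the scatterplot itself, since a strongly curved relationship can have a small linear correlation.</p>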
Class 7: Data visualization III2018-02-13T13:30:00-05:002018-02-13T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-13:/materials/class-7/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class07_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 42018-02-13T13:30:00-05:002018-02-13T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-13:/assignments/reading-4/<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li><a href="http://r4ds.had.co.nz/data-visualisation.html">From chapter 3</a>: section 3.7 through to the end of section 3.10</li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading4</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Thursday, February 15th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Visualization mini-assignment2018-02-09T17:00:00-05:002018-02-09T17:00:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-09:/assignments/visualization-mini-assignment/Mini-assignment to practice using RStudio to run code blocks in RMarkdown files and to create visualizations using <code>ggplot2</code>.<p><span class="h3"><strong>Due:</strong> February 9, 2018 @ 5:00pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_visualization_mini-assignment.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="/doc/mini-assignment-no-code-preview.html"><i class="fab fa-html5" data-fa-transform="grow-8"></i> Offline, <span class="caps">HTML</span>-rendered copy of worksheet (code blocks left unevaluated)</a></p></li>
<li><p><a href="https://classroom.github.com/a/U5FXsZVU"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>Visualization mini-assignment</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p><a href="https://classroom.github.com/a/U5FXsZVU">Obtain the Github repository you will use to complete the mini-assignment</a>, which contains a starter file named <code>visualization_mini-assignment.Rmd</code>. The demos and exercises you will be completing are written inside <code>visualization_mini-assignment.Rmd</code>; complete them within the indicated spaces. You should knit your file to the <span class="caps">HTML</span> format to check how it looks. When you’re done, save your file, commit, push (upload) it to Github, and follow the <a href="#submit-pull-request"><strong>How to submit</strong></a> section below to set up a Pull Request, which will be used for feedback.</p>
<h2 id="submit-pull-request">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/U5FXsZVU">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <code>Submission: Visualization mini-assignment, FirstName LastName</code>, replacing <code>FirstName</code> and <code>LastName</code> with your actual first and last name.</p></li>
<li><p>In the message box, write <code>My Visualization mini-assignment is ready for grading @shuaibm @jkglasbrenner</code>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
</ul>
Class 6: Data visualization II2018-02-08T13:30:00-05:002018-02-08T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-08:/materials/class-6/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class06_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 32018-02-08T13:30:00-05:002018-02-08T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-08:/assignments/reading-3/<p><span class="h3"><a href="http://r4ds.had.co.nz/">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li><a href="http://r4ds.had.co.nz/data-visualisation.html">From chapter 3</a>: section 3.1 through to the end of section 3.6</li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading3</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, February 10th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Class 5: Introduction to data / Data visualization I2018-02-06T13:30:00-05:002018-02-06T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-06:/materials/class-5/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class05_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">PDF</span></span></a></li>
</ul>
</div>
Reading 22018-02-06T13:30:00-05:002018-02-06T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-06:/assignments/reading-2/<p><span class="h3"><a href="http://spring18.cds101.com/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Introductory Statistics with Randomization and Simulation</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of Chapter 1, except sections 1.3, 1.4, and 1.5 (within section 1.3, read only subsection 1.3.4)</li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading2</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Thursday, February 8th.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
RMarkdown mini-assignment2018-02-06T13:30:00-05:002018-02-06T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-06:/assignments/rmarkdown-mini-assignment/Mini-assignment to practice editing RMarkdown files and saving to Github.<p><span class="h3"><strong>Due:</strong> February 6, 2018 @ 1:30pm</span></p>
<div class="no-bullets">
<ul>
<li><p><a href="/doc/cds101_rmarkdown_mini-assignment.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> Offline copy of instructions</a></p></li>
<li><p><a href="/doc/rmarkdown_practice.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i> <span class="monospace">rmarkdown_practice.pdf</span></a></p></li>
<li><p><a href="https://classroom.github.com/a/LD7Hgqkj"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i> Github Classroom repo for <strong>RMarkdown mini-assignment</strong></a></p></li>
</ul>
</div>
<h2 id="instructions">Instructions</h2>
<p><a href="https://classroom.github.com/a/LD7Hgqkj">Obtain the Github repository you will use to complete the mini-assignment</a>, which contains a starter file named <code>rmarkdown_mini-assignment.Rmd</code>. Use this file to re-create the layout of <a href="/doc/rmarkdown_practice.pdf"><span class="monospace">rmarkdown_practice.pdf</span></a> using Markdown. You will knit your file to the <span class="caps">HTML</span> format. When you’re done, save your file, commit, push (upload) it to Github, and follow the <a href="#submit-pull-request"><strong>How to submit</strong></a> section below to set up a Pull Request, which will be used for feedback.</p>
<h2 id="submit-pull-request">How to submit</h2>
<p>When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to <strong>your copy</strong> of the <a href="https://classroom.github.com/a/LD7Hgqkj">Github repository</a> you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following:</p>
<ol type="1">
<li><p>Click the <em>Pull Requests</em> tab near the top of the page.</p></li>
<li><p>Click the green button that says “New pull request”.</p></li>
<li><p>Click the dropdown menu button labeled “base:”, and select the option <code>starting</code>.</p></li>
<li><p>Confirm that the dropdown menu button labeled “compare:” is set to <code>master</code>.</p></li>
<li><p>Click the green button that says “Create pull request”.</p></li>
<li><p>Give the <em>pull request</em> the following title: <code>Submission: RMarkdown mini-assignment, FirstName LastName</code>, replacing <code>FirstName</code> and <code>LastName</code> with your actual first and last name.</p></li>
<li><p>In the message box, write <code>My RMarkdown mini-assignment is ready for grading @shuaibm @jkglasbrenner</code>.</p></li>
<li><p>Click “Create pull request” to lock in your submission.</p></li>
</ol>
<h2 id="cheatsheets">Cheatsheets</h2>
<p>You are encouraged to review and keep the following cheatsheets handy while working on this assignment:</p>
<ul>
<li><p><a href="/doc/rstudio-IDE-cheatsheet.pdf">RStudio cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></p></li>
<li><p><a href="/doc/rmarkdown-reference.pdf">RMarkdown reference</a></p></li>
</ul>
Reading 12018-02-01T13:30:00-05:002018-02-01T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-02-01:/assignments/reading-1/<p><span class="h3"><a href="http://r4ds.had.co.nz">R for Data Science</a></span></p>
<p>Read the following:</p>
<ul>
<li>All of <a href="http://r4ds.had.co.nz/introduction.html">chapter 1</a></li>
<li>All of <a href="http://r4ds.had.co.nz/communicate-intro.html">chapter 26</a> (short)</li>
<li>All of <a href="http://r4ds.had.co.nz/r-markdown.html">chapter 27</a></li>
</ul>
<p><span class="h3">Reading discussion</span></p>
<div class="callout primary">
<p><strong>Discussion hashtag</strong><br />
<code>#reading1</code></p>
</div>
<div class="callout secondary">
<p>Remember to post your question about it to the <code>#5-discussion</code> channel in <a href="https://masoncds101.slack.com">Slack</a> by the due date. To receive an answer credit, reply to a posted question no later than 11:59pm on Saturday, February 3rd.</p>
</div>
<hr />
<p><a href="http://spring18.cds101.com/syllabus.html#readings"><i class="fas fa-link"></i> Posting guidelines can be found in the <strong>Readings</strong> section of the syllabus.</a></p>
Try R Tutorial2018-01-30T13:30:00-05:002018-01-30T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-30:/assignments/try-r-tutorial-mini-assignment/<p><span class="h3">Instructions</span></p>
<p>Complete all the levels of the <a href="https://www.codeschool.com/courses/try-r">Try R tutorial on codeschool.com</a> before class begins on Tuesday, January 30th. After you complete the interactive tutorial, you will receive a certificate of completion.</p>
<p>It is recommended that you sign up for an account before starting, as this will let you save your progress.</p>
<p><span class="h3">Submission</span></p>
<p>Take a desktop screenshot of the certificate (Print Screen button) and send it to Dr. Glasbrenner as a <a href="https://masoncds101.slack.com">Slack Direct Message</a>. The screenshot should show some sort of identifiable information. For example, open a small notepad window and type your name there, like this:</p>
<p><a href="/img/screenshot_example_of_identification.png"><img src="/img/screenshot_example_of_identification.png" style="width:100.0%" /></a></p>
Class 2: The data scientist’s toolbox I2018-01-25T13:30:00-05:002018-02-15T15:50:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-25:/materials/class-2/<p><span class="h3">Slides</span></p>
<div class="no-bullets">
<ul>
<li><a href="/doc/class02_slides.html"><i class="fab fa-html5" data-fa-transform="grow-16"></i> <span class="card-downloads-format"><span class="caps">HTML</span></span></a></li>
</ul>
</div>
Introduce yourself; Twitter Data Science Study; Github signup2018-01-25T13:30:00-05:002018-01-25T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-25:/assignments/introductions-and-twitter-election-mini-assignment/<p><span class="h3">Introduce yourself</span></p>
<p>Write an introduction about yourself in the <code>#3-members</code> channel on <a href="https://masoncds101.slack.com">Slack</a>. Include your name, your major, and say one thing you know or have heard about data science before starting this class (this can be the news, in your major, etc.).</p>
<p><span class="h3">Can Twitter predict election results?</span></p>
<p>Finish reading the editorial and skimming the white paper from the <a href="/materials/class-1-twitter-activity/">Can Twitter predict election results?</a> activity we started during class on January 23rd. Then, post your answer to question 1 in the <code>#5-discussion</code> channel on <a href="https://masoncds101.slack.com">Slack</a>, using the hashtag <code>#class01</code> somewhere in your message.</p>
<p><span class="h3">Github account</span></p>
<p>Sign up for an account on Github: <a href="http://github.com" class="uri">http://github.com</a> using your <code>@gmu.edu</code> email address. <em>After you sign up, send me your username in a Direct Message</em>.</p>
Spring 2018 Schedule2018-01-23T15:00:00-05:002018-05-03T18:40:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-23://index.html<table>
<colgroup>
<col style="width: 12%" />
<col style="width: 4%" />
<col style="width: 8%" />
<col style="width: 54%" />
<col style="width: 19%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Week</th>
<th style="text-align: left;">Class</th>
<th style="text-align: left;">Date</th>
<th style="text-align: left;">Topic</th>
<th style="text-align: left;">Due Dates</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Week 1</td>
<td style="text-align: left;">1</td>
<td style="text-align: left;">Jan-23</td>
<td style="text-align: left;">Overview of Computational and Data Sciences</td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">2</td>
<td style="text-align: left;">Jan-25</td>
<td style="text-align: left;"><a href="/materials/class-2/">The data scientist’s toolbox I</a> <a href="/doc/class02_slides.html"><i class="fab fa-html5" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/introductions-and-twitter-election-mini-assignment/">Twitter Study Article</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Week 2</td>
<td style="text-align: left;">3</td>
<td style="text-align: left;">Jan-30</td>
<td style="text-align: left;">The data scientist’s toolbox <span class="caps">II</span></td>
<td style="text-align: left;"><a href="/assignments/try-r-tutorial-mini-assignment/">Try R Tutorial</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">4</td>
<td style="text-align: left;">Feb-01</td>
<td style="text-align: left;">The data scientist’s toolbox <span class="caps">III</span></td>
<td style="text-align: left;"><a href="/assignments/reading-1/">Reading 1</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Week 3</td>
<td style="text-align: left;">5</td>
<td style="text-align: left;">Feb-06</td>
<td style="text-align: left;"><a href="/materials/class-5/">Introduction to data/Data visualization I</a> <a href="/doc/class05_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-2/">Reading 2</a><br> <a href="/assignments/rmarkdown-mini-assignment/">RMarkdown mini-assignment</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">6</td>
<td style="text-align: left;">Feb-08</td>
<td style="text-align: left;"><a href="/materials/class-6/">Data visualization <span class="caps">II</span></a> <a href="/doc/class06_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-3/">Reading 3</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Feb-09</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/visualization-mini-assignment/">Visualization mini-assignment</a></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 4</td>
<td style="text-align: left;">7</td>
<td style="text-align: left;">Feb-13</td>
<td style="text-align: left;"><a href="/materials/class-7/">Data visualization <span class="caps">III</span></a> <a href="/doc/class07_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-4/">Reading 4</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">8</td>
<td style="text-align: left;">Feb-15</td>
<td style="text-align: left;"><a href="/materials/class-8/">Data wrangling I</a> <a href="https://classroom.github.com/a/xku1H3sP"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a>, <a href="/doc/class08_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 5</td>
<td style="text-align: left;">9</td>
<td style="text-align: left;">Feb-20</td>
<td style="text-align: left;"><a href="/materials/class-9/">Data wrangling <span class="caps">II</span></a> <a href="https://classroom.github.com/a/xku1H3sP"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a>, <a href="/doc/class09_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-5/">Reading 5</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">10</td>
<td style="text-align: left;">Feb-22</td>
<td style="text-align: left;"><a href="/materials/class-10/">Data wrangling <span class="caps">III</span></a> <a href="/doc/class10_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-6/">Reading 6</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Feb-23</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/homework-1/">Homework 1</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Week 6</td>
<td style="text-align: left;">11</td>
<td style="text-align: left;">Feb-27</td>
<td style="text-align: left;"><a href="/materials/class-11/">Data wrangling <span class="caps">IV</span></a> <a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1519755978000359"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a>, <a href="/doc/class11_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-7/">Reading 7</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">12</td>
<td style="text-align: left;">Mar-01</td>
<td style="text-align: left;">Midterm project brainstorming session</td>
<td style="text-align: left;"><a href="/assignments/reading-8/">Reading 8</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Week 7</td>
<td style="text-align: left;">13</td>
<td style="text-align: left;">Mar-06</td>
<td style="text-align: left;">Exploratory data analysis I <a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1520361454000399"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">14</td>
<td style="text-align: left;">Mar-08</td>
<td style="text-align: left;">Exploratory data analysis <span class="caps">II</span> <a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1520361454000399"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Mar-09</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/homework-2/">Homework 2</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Mar-13</td>
<td style="text-align: left;"><strong><span class="smallcaps">Spring Break</span></strong> (no class)</td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Mar-15</td>
<td style="text-align: left;"><strong><span class="smallcaps">Spring Break</span></strong> (no class)</td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 8</td>
<td style="text-align: left;">15</td>
<td style="text-align: left;">Mar-20</td>
<td style="text-align: left;">Midterm project group conferences</td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">16</td>
<td style="text-align: left;">Mar-22</td>
<td style="text-align: left;"><a href="/materials/class-16/">Introduction to Web Scraping I</a> <a href="/doc/class16_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 9</td>
<td style="text-align: left;">17</td>
<td style="text-align: left;">Mar-27</td>
<td style="text-align: left;">Midterm project presentations I/Web scraping group activity</td>
<td style="text-align: left;"><a href="/assignments/midterm-project/">Midterm Project</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">18</td>
<td style="text-align: left;">Mar-29</td>
<td style="text-align: left;">Midterm project presentations <span class="caps">II</span>/Web scraping group activity <a href="https://masoncds101.slack.com/archives/C8WQJ0GTB/p1522347635000427"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 10</td>
<td style="text-align: left;">19</td>
<td style="text-align: left;">Apr-03</td>
<td style="text-align: left;"><a href="/materials/class-19/">Introduction to Web Scraping <span class="caps">II</span>/Principles of Data Collection</a> <a href="/doc/class19_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">20</td>
<td style="text-align: left;">Apr-05</td>
<td style="text-align: left;"><a href="/materials/class-20/">Statistical distributions I</a> <a href="/doc/class20_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-9/">Reading 9</a></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 11</td>
<td style="text-align: left;">21</td>
<td style="text-align: left;">Apr-10</td>
<td style="text-align: left;"><a href="/materials/class-21/">Statistical distributions <span class="caps">II</span></a> <a href="/doc/class21_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-10/">Reading 10</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">22</td>
<td style="text-align: left;">Apr-12</td>
<td style="text-align: left;"><a href="/materials/class-22/">Inference and simulations I</a> <a href="/doc/class22_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-11/">Reading 11</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Apr-13</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/reading-12/">Reading 12</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Apr-16</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/homework-3/">Homework 3</a></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 12</td>
<td style="text-align: left;">23</td>
<td style="text-align: left;">Apr-17</td>
<td style="text-align: left;"><a href="/materials/class-23/">Inference and simulations <span class="caps">II</span></a> <a href="https://classroom.github.com/a/o-KntOw5"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a>, <a href="/doc/class23_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-13/">Reading 13</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">24</td>
<td style="text-align: left;">Apr-19</td>
<td style="text-align: left;"><a href="/materials/class-24/">Inference and simulations <span class="caps">III</span></a> <a href="/doc/class24_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-14/">Reading 14</a></td>
</tr>
<tr class="even">
<td style="text-align: left;">Week 13</td>
<td style="text-align: left;">25</td>
<td style="text-align: left;">Apr-24</td>
<td style="text-align: left;"><a href="/materials/class-25/">Inference and simulations <span class="caps">IV</span>/Modeling I</a> <a href="/doc/class25_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-15/">Reading 15</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;">26</td>
<td style="text-align: left;">Apr-26</td>
<td style="text-align: left;"><a href="/materials/class-26/">Modeling <span class="caps">II</span></a> <a href="/doc/class26_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"><a href="/assignments/reading-16/">Reading 16</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">Apr-27</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/homework-4/">Homework 4</a></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Week 14</td>
<td style="text-align: left;">27</td>
<td style="text-align: left;">May-01</td>
<td style="text-align: left;"><a href="/materials/class-27/">Modeling <span class="caps">III</span></a> <a href="/doc/class27_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;">28</td>
<td style="text-align: left;">May-03</td>
<td style="text-align: left;"><a href="/materials/class-28/">Modeling <span class="caps">IV</span></a> <a href="/doc/class28_slides.pdf"><i class="fas fa-file-pdf" data-fa-transform="grow-8"></i></a>, <a href="https://classroom.github.com/a/Ce_Ci3HZ"><i class="fab fa-github-alt" data-fa-transform="grow-8"></i></a></td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">May-04</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/homework-5/">Homework 5</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">May-08</td>
<td style="text-align: left;"><strong><span class="smallcaps">Reading Day</span></strong></td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">May-11</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><a href="/assignments/final-portfolio/">Final Portfolio</a></td>
</tr>
<tr class="even">
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
<td style="text-align: left;">May-15</td>
<td style="text-align: left;"><strong><span class="smallcaps">Final Interview</span></strong><br> <em>Time: 1:30pm – 4:15pm</em></td>
<td style="text-align: left;"></td>
</tr>
</tbody>
</table>
CDS-001-001 Syllabus2018-01-23T13:30:00-05:002018-02-02T23:00:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-23:/<h2 id="course-description" data-magellan-target="course-description">Description</h2>
<p>During this course, students will develop basic skills for obtaining, cleaning, transforming, and visualizing real-world datasets using the R programming language and the RStudio integrated development environment. Statistical methods for analyzing, interpreting, and predicting dataset trends are then introduced and approached from a computational point of view using randomization and simulation. Additional advanced or special topics, such as cross-validation, may also be introduced. Throughout the course, emphasis is placed on documenting one’s scientific work using RStudio in conjunction with Github to fulfill the principles of reproducible research. Connections are highlighted between statistical inference and the scientific method, along with how these connections relate to both the scientific method’s power and its limitations. These tools will also be used to critically examine statistical claims reported in mass media, demonstrating how scientific literacy and a basic knowledge of statistics are indispensable for making sense of our modern world.</p>
<ul>
<li><strong>Classroom:</strong> 1004 Exploratory Hall</li>
<li><strong>Meeting times:</strong> Tues/Thurs 1:30pm – 2:45pm</li>
<li><strong>University holidays:</strong> Spring break, Mar. 12 – Mar. 16 (no class)</li>
<li><strong>Credit hours:</strong> 3.0 credit hours</li>
<li><strong>Prerequisites:</strong> None, but a background in algebra is assumed.</li>
<li><strong>Mason Core:</strong> Natural science (+lab when taken with <span class="caps">CDS</span> 102)</li>
</ul>
<h2 id="course-instructor" data-magellan-target="course-instructor">Instructor</h2>
<p><img src="https://www.gravatar.com/avatar/49802fdfa5a0e63b3d932a5179d41c1e?s=128" /></p>
<p>Dr. James K. Glasbrenner</p>
<ul>
<li><strong>Office</strong>: 253 Research Hall</li>
<li><strong>Phone:</strong> (703) 993-4512</li>
<li><strong>Email:</strong> <a href="mailto:jglasbr2@gmu.edu"><span class="monospace">jglasbr2@gmu.edu</span></a></li>
<li><strong>Office hours:</strong> Tues. 3:00pm – 4:00pm and Fri. 11:00am – 12:00pm, or by appointment.</li>
</ul>
<h2 id="course-objectives" data-magellan-target="course-objectives">Objectives</h2>
<p>By the end of the course, students will be able to:</p>
<ul>
<li>Use Github for collaborating on a reproducible research document</li>
<li>Obtain, clean, transform, and visualize a dataset using the R programming language</li>
<li>Interpret and predict dataset trends using statistical inference and models</li>
<li>Critically examine and interpret statistical claims reported in mass media</li>
</ul>
<h2 id="course-materials" data-magellan-target="course-materials">Materials</h2>
<h3 id="textbooks">Textbooks</h3>
<p>This course utilizes two free textbooks that are made available under Creative Commons licenses:</p>
<div class="textbook">
<ul>
<li>R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
<ul>
<li>Hadley Wickham</li>
<li>Garrett Grolemund</li>
</ul></li>
</ul>
<p><img src="/img/r4ds_cover.png" alt="R for Data Science textbook cover" /></p>
<p><a href="http://r4ds.had.co.nz/">Click here to read the textbook for free online.</a></p>
<p>This is the free, <a href="https://github.com/hadley/r4ds">open-source</a> data science textbook that we will use throughout the course. This textbook is available under a <a href="http://creativecommons.org/licenses/by-nc-nd/3.0/us/">Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License</a>.</p>
<p>If you would like a physical copy of the textbook, you can <a href="http://amzn.to/2aHLAQ1">order it from Amazon</a>.</p>
</div>
<div class="textbook">
<ul>
<li>Introductory Statistics with Randomization and Simulation
<ul>
<li>David M Diez</li>
<li>Christopher D Barr</li>
<li>Mine Çetinkaya-Rundel</li>
</ul></li>
</ul>
<p><img src="/img/isrs_cover.png" alt="Introductory Statistics with Randomization and Simulation textbook cover" /></p>
<p><a href="/doc/Diez_Barr_%C3%87etinkaya-Rundel_IntroductoryStatisticsWithRandomizationAndSimulation.pdf">Click here to download a free <span class="caps">PDF</span> copy of the textbook.</a></p>
<p>This is the free, <a href="https://github.com/OpenIntroOrg/randomization-and-simulation">open-source</a> statistics textbook that we will use during the second half of the course. This textbook is available under a <a href="/doc/isrs_license.txt">Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (<span class="caps">CC</span> <span class="caps">BY</span>-<span class="caps">NC</span>-<span class="caps">SA</span>)</a>.</p>
<p>If you would like a physical copy of the textbook, you can <a href="http://a.co/ixGTX7R">order it from Amazon</a>.</p>
</div>
<h3 id="software">Software</h3>
<p>During the course, we will use the following free and open-source software to manipulate data, perform computations, and write documents:</p>
<ul>
<li>Programming language: R (<a href="https://www.r-project.org" class="uri">https://www.r-project.org</a>)</li>
<li>Version control: Git (<a href="https://git-scm.com" class="uri">https://git-scm.com</a>)</li>
<li><span class="caps">PDF</span> export: LaTeX (<a href="https://www.latex-project.org" class="uri">https://www.latex-project.org</a>)</li>
<li>Programming software: RStudio (<a href="https://www.rstudio.com" class="uri">https://www.rstudio.com</a>)</li>
</ul>
<p>RStudio sits on top of R, Git, and LaTeX and provides a user-friendly interface for using these programs. You can think of R, Git, and LaTeX as plugins that activate RStudio’s main features. In practice, RStudio is the software application that we will learn to use during the course.</p>
<p><em>Students are highly encouraged to use Anaconda (<a href="https://anaconda.com" class="uri">https://anaconda.com</a>) to install the latest version of R, RStudio, and its other dependencies.</em></p>
<h3 id="platforms">Platforms</h3>
<p>The course will be administered through the following online platforms:</p>
<ul>
<li>Course website (<a href="http://spring18.cds101.com" class="uri">http://spring18.cds101.com</a>)</li>
<li>Github (<a href="https://github.com/">https://github.com</a>)</li>
<li>Slack (<a href="https://masoncds101.slack.com" class="uri">https://masoncds101.slack.com</a>)</li>
<li>Blackboard (<a href="https://mymasonportal.gmu.edu/">https://mymasonportal.gmu.edu</a>)</li>
</ul>
<p>The course website operates as the central repository for course materials, including an up-to-date schedule and posted slides and handouts. Slack is the primary communication medium, replacing email (see the <a href="#contact-policy">Contact policy</a> below) and serving as a discussion board. Github is used for storing your classroom files, distributing and collecting homework assignments, handing out example code, and for project collaborations. Blackboard will be used for posting grades.</p>
<h2 id="course-policies" data-magellan-target="course-policies">Policies</h2>
<h3 id="contact-policy">Contact policy</h3>
<p><em>All</em> correspondence is to be done using the private, invite-only Slack workspace for the course. Direct messages on Slack are to be used for contacting me instead of emails. Your Slack username <em>must</em> be registered and associated with your Mason <code>@gmu.edu</code> email address at all times.</p>
<p>My ground rules for direct messages are as follows:</p>
<ul>
<li>I check and respond to messages during normal university hours.</li>
<li>Allow up to approximately 24 hours for a response during normal hours.</li>
<li>Just because I view a message does not mean I will respond right away.</li>
<li>I generally don’t respond to messages over weekends and school holidays.</li>
<li>If your questions are involved enough, I will ask you to come to office hours or schedule a Skype chat.</li>
<li><em>On email:</em> During the first two weeks of the course, emails will be responded to, but I will send you my response on Slack. After two weeks, emails will be ignored.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></li>
</ul>
<p><strong><span class="smallcaps">Please Note:</span></strong> the <span class="caps">CDS</span>-102 labs are a separate course from <span class="caps">CDS</span>-101. I do not answer questions for the lab. All correspondence regarding labs is to be directed to the teaching assistant.</p>
<h3 id="attendance-policy">Attendance policy</h3>
<p>Students are expected to attend every class and are responsible for obtaining and understanding the material that they miss, including any examples that derive from resources not available online. Any missed work during an unexcused absence cannot be made up. Unexcused absences, frequent tardiness, or leaving class early will reduce your participation grade.</p>
<p>Absences due to religious holidays, scheduled varsity sports trips, and short-term illnesses should be brought to my attention as early as possible. Any make-up work is to be completed within 24 hours. Exemptions may be granted at my discretion.</p>
<h3 id="late-work-policy">Late work policy</h3>
<p>Unless otherwise noted, the main homework assignments are to be submitted by 11:59pm on the due date. The following penalties apply for most assignments (please note that weekends count as days):</p>
<ul>
<li>First day late, by 11:59pm: -10%</li>
<li>Second day late, by 11:59pm: -20%</li>
<li>Third day late, by 11:59pm: -30%</li>
<li>Fourth day or later: no credit</li>
</ul>
<p>The above does not pertain to the reading questions and answers, which must be submitted before the posted date and time to receive credit. The writeup for the midterm project and your final course portfolio are also exceptions and must be submitted by the due date; late submissions will not be accepted.</p>
<p>Presentations, such as for the midterm project or the final interview, must be given on the scheduled date and cannot be made up.</p>
<p>Extensions or exemptions may be granted in the case of illness or a family emergency at my discretion, and it is the student’s responsibility to inform me about these kinds of circumstances as soon as possible.</p>
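<p>As an illustration only, the day-by-day penalty schedule in this section can be written as a small function. This is a sketch in Python and not part of the course tooling: the function name <code>late_penalty_percent</code> and the convention of counting whole days past the 11:59pm deadline are assumptions made for the example. It covers only the general homework case, not the reading responses, the midterm writeup, or the final portfolio, and it ignores any extensions granted at the instructor's discretion.</p>

```python
def late_penalty_percent(days_late):
    """Return the percentage deducted for a homework submission.

    Assumption (hypothetical convention): `days_late` counts whole calendar
    days past the 11:59pm deadline, weekends included; 0 means on time.
    A return value of 100 stands for "no credit".
    """
    if days_late >= 4:
        # Fourth day or later: no credit.
        return 100
    if days_late >= 1:
        # First, second, and third day late: 10%, 20%, 30%.
        return 10 * days_late
    # Submitted on time.
    return 0
```

<p>For example, under these assumptions a homework turned in two days after the due date would lose 20% of its score.</p>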
<h3 id="regrading-appeals-policy">Regrading appeals policy</h3>
<p>Regrade appeals need to be in writing, printed out, and hand delivered to me within 48 hours of receiving back an assignment (not including weekends). Appeals via Slack or email will not be accepted, no exceptions. Appeals are only to be used for correct answers being marked as incorrect, misapplication of the grading rubric, or incorrectly tallied points. Submissions need to clearly state what you want regraded and to justify the request by citing evidence<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. The number of points that a question, exercise, or rubric category is worth, and the number of points deducted for an incorrect answer or mistake, cannot be appealed and are not up for debate or negotiation.</p>
<p>If I’m not available when you deliver a request, please see Natalie Lapidot–Croitoru (231 Research Hall) or Karen Underwood (373 Research Hall). Ask them to initial the request, write down the date and time of receipt, and hold it for my pickup.</p>
<h3 id="extra-credit-and-grading-curves-policy">Extra credit and grading curves policy</h3>
<p>Individual requests for extra credit or a grading curve will not be granted, no exceptions. Any opportunities to earn extra points will be offered to the entire class. Grading curves are handled on a per-assignment basis and are applied to all students equally.</p>
<h3 id="accommodations-policy">Accommodations policy</h3>
<p>Students with disabilities who need academic accommodations, please see me and contact the Office of Disability Services (<span class="caps">ODS</span>) at (703) 993-2474. All academic accommodations must be arranged through the <span class="caps">ODS</span>: <a href="http://ods.gmu.edu/" class="uri">http://ods.gmu.edu/</a>.</p>
<h2 id="course-grading" data-magellan-target="course-grading">Grading</h2>
<h3 id="breakdown">Breakdown</h3>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Category</th>
<th style="text-align: left;">Weight</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Participation</td>
<td style="text-align: left;">10%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Reading responses</td>
<td style="text-align: left;">15%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Homework</td>
<td style="text-align: left;">25%</td>
</tr>
<tr class="even">
<td style="text-align: left;">Midterm project</td>
<td style="text-align: left;">25%</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Final project</td>
<td style="text-align: left;">25%</td>
</tr>
</tbody>
</table>
<h3 id="schema">Schema</h3>
<p>Based on the final total score, your final grade will be determined as follows:</p>
<table>
<thead>
<tr class="header">
<th style="text-align: left;">Score Range</th>
<th style="text-align: left;">Final Grade</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">97–100</td>
<td style="text-align: left;">A+</td>
</tr>
<tr class="even">
<td style="text-align: left;">93–96</td>
<td style="text-align: left;">A</td>
</tr>
<tr class="odd">
<td style="text-align: left;">90–92</td>
<td style="text-align: left;">A-</td>
</tr>
<tr class="even">
<td style="text-align: left;">87–89</td>
<td style="text-align: left;">B+</td>
</tr>
<tr class="odd">
<td style="text-align: left;">83–86</td>
<td style="text-align: left;">B</td>
</tr>
<tr class="even">
<td style="text-align: left;">80–82</td>
<td style="text-align: left;">B-</td>
</tr>
<tr class="odd">
<td style="text-align: left;">77–79</td>
<td style="text-align: left;">C+</td>
</tr>
<tr class="even">
<td style="text-align: left;">73–76</td>
<td style="text-align: left;">C</td>
</tr>
<tr class="odd">
<td style="text-align: left;">70–72</td>
<td style="text-align: left;">C-</td>
</tr>
<tr class="even">
<td style="text-align: left;">65–69</td>
<td style="text-align: left;">D</td>
</tr>
<tr class="odd">
<td style="text-align: left;">&lt;65</td>
<td style="text-align: left;">F</td>
</tr>
</tbody>
</table>
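<p>To make the arithmetic concrete, the Breakdown and Schema tables above can be combined into a short calculation. This is a hedged sketch in Python, not an official grade calculator: the names <code>WEIGHTS</code>, <code>CUTOFFS</code>, <code>final_score</code>, and <code>letter_grade</code> are made up for the example, each category score is assumed to be on a 0–100 scale, and fractional totals are assumed to fall into the range whose lower bound they meet or exceed. The syllabus itself does not say how fractional totals are rounded.</p>

```python
# Category weights from the Breakdown table, as fractions of the final score.
WEIGHTS = {
    "participation": 0.10,
    "reading_responses": 0.15,
    "homework": 0.25,
    "midterm_project": 0.25,
    "final_project": 0.25,
}

# Lower bound of each score range in the Schema table, highest first.
CUTOFFS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (65, "D"),
]


def final_score(scores):
    """Weighted total of the category scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(total):
    """Map a total score to a letter grade.

    Assumption: a total belongs to the highest range whose lower bound it
    meets or exceeds; anything below 65 is an F.
    """
    for cutoff, letter in CUTOFFS:
        if total >= cutoff:
            return letter
    return "F"


# Hypothetical student, for illustration only.
example = {
    "participation": 100,
    "reading_responses": 90,
    "homework": 80,
    "midterm_project": 85,
    "final_project": 90,
}
print(letter_grade(final_score(example)))
```

<p>For the hypothetical scores shown, the weighted total is 10 + 13.5 + 20 + 21.25 + 22.5 = 87.25, which falls in the 87–89 range.</p>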
<h2 id="course-expectations" data-magellan-target="course-expectations">Expectations</h2>
<h3 id="participation">Participation</h3>
<p>Students are expected to come to class prepared. Most course topics will require some amount of programming, which is a skill that’s best learned by doing. In-class activities and exercises will be provided to reinforce these concepts, which at times may be completed in pairs or in groups. Students are expected to give their full effort while completing an in-class exercise and to complete it within a reasonable timeframe. A combination of speed of completion and quality of work will be factored into your participation grade. A student’s number of absences during the semester will also factor into their participation grade.</p>
<h3 id="readings">Readings</h3>
<p>Reading assignments will be regularly scheduled during the semester. They will be posted on the course schedule with reminders sent via Slack. Students are to complete the reading assignments by the specified date and be able to discuss them in class.</p>
<p>To help foster discussion of the reading assignments, students will be required to post one question per reading on Slack. Posting more than one question per reading is encouraged. Students are also required to reply to at least 10 questions during the semester and provide an answer. These replies need to be spread out so that questions are answered throughout the course. As such, a bare minimum of 5 answers need to be posted between January 23rd and March 9th, and another minimum of 5 answers need to be posted between March 19th and May 4th.</p>
<p>Some additional rules about the reading posts:</p>
<ul>
<li>All questions and answers for the reading are to be posted in the <code>#4-discussion</code> channel.</li>
<li>Your question and answer posts <strong>must</strong> be labeled to tell us which assignment they’re for.
<ul>
<li>For example, use <code>#reading2</code> if you’re posting about the 2nd assigned reading.</li>
<li>Include <code>#question</code> for a question post and <code>#answer</code> for an answer post.</li>
</ul></li>
<li>When responding to a question, use “reply”.</li>
<li>Questions must be posted by the due date to be eligible for credit.</li>
<li>Answers can be posted up to 48 hours after the “read-by” due date.</li>
<li>Question and answer quality matters.</li>
<li>Avoid Q/A duplications.</li>
</ul>
<h3 id="homework">Homework</h3>
<p>There will be 5 main homework assignments to complete during the semester. Smaller, shorter mini-assignments may also be distributed from time to time. Assignments will be submitted to Github as a new pull request.</p>
<p>Assignment submissions are documents consisting of interwoven segments of writing and code blocks. It is expected that you will write in full sentences and use proper grammar and punctuation. You will be expected to explain what you are doing with each chunk of code and to interpret the meaning of what you calculate. The document’s style and formatting will also be taken into account during grading and should follow the course’s style guides for R and RMarkdown.</p>
<p>Students will be permitted to resubmit 1 of their assignments during the semester within a week of receiving their grade on the original assignment. The resubmissions are eligible to receive full credit and replace the lower grade. The updated assignment is to be submitted as a new pull request with a title that begins with the word <strong>Resubmission</strong>.</p>
<p>Students may discuss homework assignments using the <code><i class="fas fa-lock"></i>sp18-homework</code> Slack channel. Take care to ensure that the final submission is in your own words. See the academic integrity section for specifics.</p>
<h3 id="midterm-project">Midterm project</h3>
<p>Students will complete a midterm project in assigned groups during the first half of the semester. For this project, you will perform an exploratory data analysis on a dataset, focusing on “wrangling” the data so that you can produce meaningful visualizations and interpret basic trends. More detailed information about the project will be provided in the coming weeks.</p>
<h3 id="final-project">Final project</h3>
<p>In lieu of a traditional final exam, students will be building a portfolio containing both comprehensive notes on the major R functions used during the course and an overview of accomplishments demonstrating a student’s learning and growth over the semester. Students will also be given a couple of simulations to run, and then compare and contrast. Templates for the format and content of the notes portion of the portfolio will be provided in the first few weeks and will be gradually built throughout the semester. Students will then submit their portfolio about one week before the scheduled final exam.</p>
<p>During the scheduled final exam, I will conduct a brief 5-10 minute interview with each student based on their portfolio submission. The interview will also contain a short exercise to complete. More information on both the portfolio and the final interview will be provided in the coming weeks.</p>
<h2 id="course-conduct" data-magellan-target="course-conduct">Conduct</h2>
<h3 id="academic-integrity">Academic integrity</h3>
<blockquote>
<p>Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work.<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a></p>
</blockquote>
<p>Students may discuss non-group work outside of class; however, in all instances your submitted responses to assignments must be written in your own words. Do not duplicate or paraphrase another person’s material or ideas and represent them as your own. Content that comes from a resource or another student should be properly cited.</p>
<p>A note on sharing or reusing code found on other Github repos or on websites like <em>Wikipedia</em> or <em>Stack Overflow</em>: I am aware that there are solution sets, sample snippets of code, etc. that can be of use while working on your assignments, projects and exercises during the course. It’s common knowledge that researchers in both industry and academia will use search engines while writing code. Being able to search for existing solutions so that you don’t “reinvent the wheel” is a useful skill. Therefore, unless I specify otherwise, you are permitted to use these resources as long as you provide a citation.</p>
<p>Exceptions to this rule are:</p>
<ul>
<li>For individual assignments, you cannot reuse another student’s code</li>
<li>For group assignments, you cannot reuse another group’s code</li>
<li>You are not permitted to use solution sets for the two main textbooks we’re using during the course</li>
</ul>
<p><strong><span class="smallcaps">Any material that is taken in whole or in part from another source and not properly cited will be treated as a violation of Mason’s Academic Honor Code.</span></strong></p>
<p>Other violations of Mason’s Honor Code will be treated similarly. Suspected violations will be reported to the Office of Academic Integrity. Please see the Honor Code page for details.</p>
<h3 id="decorumdiscourse">Decorum/discourse</h3>
<p>Students are expected to be civil in their classroom conduct and respectful of their fellow classmates and the instructor for the duration of the course. Examples of expected behavior include, but are not limited to:</p>
<ul>
<li>Showing up to class on time</li>
<li>Not interrupting your classmates or the instructor</li>
<li>Silencing your cell phone</li>
<li>Refraining from texting/messaging</li>
<li>Refraining from using devices for anything other than coursework<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a></li>
<li>Removing ear-buds/headphones and sunglasses when class begins</li>
</ul>
<p>The expectations of civil and respectful behavior still apply to all online discussions. Students are expected to follow proper grammar and punctuation in online posts and to refrain from using internet slang, abbreviations, and sarcasm.</p>
<p>I will address violations of classroom decorum on a case-by-case basis and reserve the right to enact grade-based penalties for highly disruptive or repeat violations. Penalties for decorum violations cannot be negotiated or appealed.</p>
<h3 id="mason-diversity-statement">Mason diversity statement</h3>
<p>George Mason University promotes a living and learning environment for outstanding growth and productivity among its students, faculty and staff. An emphasis upon diversity and inclusion throughout the campus community is essential to achieve these goals. Diversity is broadly defined to include such characteristics as, but not limited to, race, ethnicity, gender, religion, age, disability, and sexual orientation. Diversity also entails different viewpoints, philosophies, and perspectives. Attention to these aspects of diversity will help promote a culture of inclusion and belonging, and an environment where diverse opinions, backgrounds and practices have the opportunity to be voiced, heard and respected.</p>
<h2 id="support-services">Support services</h2>
<p>The Math Tutoring Center is in 344 Johnson Center; <a href="http://math.gmu.edu/tutor-center.php" class="uri">http://math.gmu.edu/tutor-center.php</a>. The Math Department also maintains a list of persons that have identified themselves as math tutors: <a href="http://math.gmu.edu/tutor-list.php" class="uri">http://math.gmu.edu/tutor-list.php</a></p>
<p>Mason’s Writing Center is in A114 Robinson Hall; (703) 993-1200; <a href="http://writingcenter.gmu.edu/" class="uri">http://writingcenter.gmu.edu/</a></p>
<p>George Mason provides Counseling and Psychological Services (<span class="caps">CAPS</span>) for students. Contact them at (703) 993-2380 or <a href="http://caps.gmu.edu/" class="uri">http://caps.gmu.edu/</a>.</p>
<h2 id="course-disclaimer" data-magellan-target="course-disclaimer">Disclaimer</h2>
<p>The instructor reserves the right to modify this syllabus at any time during the course to improve the learning experience and classroom environment. These changes will either be announced during class or via Slack. The digital version of the syllabus on the course website will be updated to reflect the changes. The pacing of the course and the list of covered topics may also be altered in response to student progress.</p>
<p>The course objectives reflect what a student is expected to understand by the end of the course after putting in the necessary time and effort both inside and outside the classroom and completing all assignments. These outcomes are not a guarantee, and students will get more out of the course the more they put into it. Any acquired skills and knowledge can fade over time if not reviewed or practiced after the course concludes.</p>
<!-- Footnotes -->
<!-- Implicit links -->
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>If there are special circumstances requiring that we communicate via email, it is your responsibility to inform me about it as soon as possible.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>Acceptable evidence includes in-class notes (provide date of class), a reading passage (provide full citation), or another valid source (textbooks, official publications, etc).<a href="#fnref2" class="footnote-back">↩</a></p></li>
<li id="fn3"><p>Office for Academic Integrity. <em>2017-2018 Honor Code and Honor System.</em> Web. 27 Aug. 2017.<a href="#fnref3" class="footnote-back">↩</a></p></li>
<li id="fn4"><p>The term “devices” is meant to be broad and includes classroom computers, laptops, cell phones, tablets and e-readers, smart watches, etc. Exceptions can be made in cases of family or personal emergencies; please speak to me before class.<a href="#fnref4" class="footnote-back">↩</a></p></li>
</ol>
</section>
Can Twitter predict election results?2018-01-23T13:30:00-05:002018-01-23T13:30:00-05:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-23:/materials/class-1-twitter-activity/An introductory activity about a data science study that used Twitter data to predict election outcomes.<h2 id="appearances-of-candidates-names-on-twitter-can-help-predict-election-results">Appearances of candidates’ names on Twitter can help predict election results</h2>
<p>The promise of using data science to predict election results has received a lot of attention over the last decade. Much of it has focused on using polls to create models and predict results, such as what Nate Silver does over at <a href="http://fivethirtyeight.com">fivethirtyeight.com</a>. Besides polls, other kinds of data have also been used to make predictions about elections. Consider the following 2013 study by a group of researchers at Indiana University. They wanted to see if the appearances of candidates’ names on Twitter can help predict election results.</p>
<p>The Washington Post picked this up in an editorial, written by one of the study’s authors:</p>
<blockquote>
<p>New research in computer science, sociology and political science shows that data extracted from social media platforms yield accurate measurements of public opinion. It turns out that what people say on Twitter or Facebook is a very good indicator of how they will vote.</p>
</blockquote>
<blockquote>
<p>How good? […] co-authors Joseph DiGrazia, Karissa McKelvey, Johan Bollen and I show that Twitter discussions are an unusually good predictor of <span class="caps">U.S.</span> House elections. Using a massive archive of billions of randomly sampled tweets stored at Indiana University, we extracted 542,969 tweets that mention a Democratic or Republican candidate for Congress in 2010. For each congressional district, we computed the percentage of tweets that mentioned these candidates. We found a strong correlation between a candidate’s tweet share and the final two-party vote share, especially when we account for a district’s economic, racial and gender profile. In the 2010 data, our Twitter data predicted the winner in 404 out of 406 competitive races.</p>
</blockquote>
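<p>To make the idea of “tweet share” concrete, here is a minimal sketch of the calculation described in the quote: a per-district percentage of candidate mentions, followed by a Pearson correlation against vote share. The district names, tweet counts, and vote shares are invented for illustration and are not taken from the study, and the helper functions <code>tweet_share</code> and <code>pearson</code> are hypothetical names. The study itself fit regression models, so treat this only as a rough illustration of the underlying quantities.</p>

```python
from math import sqrt

# Hypothetical toy data (NOT from the study): tweets mentioning each party's
# candidate and the Republican two-party vote share (%) in four made-up districts.
districts = {
    "A": {"rep_tweets": 120, "dem_tweets": 80, "rep_vote_share": 58.0},
    "B": {"rep_tweets": 30, "dem_tweets": 90, "rep_vote_share": 35.0},
    "C": {"rep_tweets": 200, "dem_tweets": 200, "rep_vote_share": 51.0},
    "D": {"rep_tweets": 45, "dem_tweets": 15, "rep_vote_share": 66.0},
}


def tweet_share(rep_tweets, dem_tweets):
    """Percentage of candidate-mentioning tweets that name the Republican."""
    return 100.0 * rep_tweets / (rep_tweets + dem_tweets)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# One tweet share and one vote share per district, in the same order.
shares = [tweet_share(d["rep_tweets"], d["dem_tweets"]) for d in districts.values()]
votes = [d["rep_vote_share"] for d in districts.values()]
r = pearson(shares, votes)

for name, share in zip(districts, shares):
    print(f"District {name}: Republican tweet share = {share:.1f}%")
print(f"Correlation between tweet share and vote share: {r:.2f}")
```

<p>The toy numbers were chosen so that tweet share and vote share move together. Real data would be far noisier, and a high correlation alone says nothing about whether tweets influence votes.</p>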
<h2 id="data-science">Data science</h2>
<p>This is a true data science research project, in that:</p>
<ul>
<li><p>The data being analyzed was scraped from the Internet, not collected from a controlled, randomized trial. Typical statistical assumptions about random sampling, etc. do not apply!</p></li>
<li><p>The research question is addressed by combining <em>domain knowledge</em> (i.e. knowledge of how Congressional races work) with a data source (Twitter) that has no obvious relevance.</p></li>
<li><p>A <em>large</em> amount of data (500 million tweets!) was collected. [Note: only 500,000 tweets were analyzed.] In this case, the data was big enough that the Center for Complex Networks and Systems Research had to get involved!</p></li>
<li><p>The project was undertaken by a team of researchers from different fields (e.g. sociology, computing) working in different departments, and bringing different skills to the table.</p></li>
</ul>
<h2 id="read-the-article-like-a-scientist">Read the article like a scientist</h2>
<p>Spend a few minutes reading the <a href="http://www.washingtonpost.com/opinions/how-twitter-can-predict-an-election/2013/08/11/35ef885a-0108-11e3-96a8-d3b921c0924a_story.html">Rojas editorial</a> and skimming the <a href="/doc/DiGrazia2013_SSRN_MoreTweetsMoreVotes-id2235423.pdf">actual paper</a>. Be sure to consider Figure 1 and Table 1 carefully, and address the following questions.</p>
<ol type="1">
<li><p>Write a sentence summarizing the findings of the paper.</p></li>
<li><p>Discuss Figure 1 with the people near you. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?</p></li>
<li><p>Discuss within your team the differences between the Bivariate model and the Full Model. Which one do you think does a better job of predicting the outcome of an election? Which one do you think best addresses the influence of tweets on an election?</p></li>
<li><p>Why do you suppose that the coefficient of RepublicanTweetShare is so much larger in the Bivariate model? How does this reflect on the influence of tweets in an election?</p></li>
<li><p>Do you think the study holds water? Why or why not? What are the shortcomings of this study?</p></li>
</ol>
<h2 id="study-replication">Study replication</h2>
<p>Imagine that you work on a team whose boss, who does not have advanced technical skills or knowledge, has asked you to reproduce the study you just read. Discuss the following with your neighbors.</p>
<ol type="1">
<li><p>What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.</p></li>
<li><p>What computational tools would you use for each task?</p></li>
</ol>
<h2 id="credits">Credits</h2>
<p>This classroom activity is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. Adapted by James Glasbrenner for <span class="caps">CDS</span>-101 from <a href="https://github.com/mine-cetinkaya-rundel/sta112_f15/blob/8e70667fc8db04f3a65b1c01a57ff747a6c9ed0a/application-exercises/app_Twitter_election_results.Rmd">the following activity</a> by <a href="https://stat.duke.edu/~mc301">Mine Çetinkaya-Rundel</a>. Credit for the dataset example belongs to <a href="http://www.math.smith.edu/~bbaumer/">Ben Baumer</a>.</p>
Datasets2018-01-23T13:30:00-05:002018-05-02T15:30:00-04:00Dr. Glasbrennertag:spring18.cds101.com,2018-01-23:/<table>
<colgroup>
<col style="width: 12%" />
<col style="width: 9%" />
<col style="width: 77%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Name</th>
<th style="text-align: center;">Formats</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><span class="monospace">rail_trail</span></td>
<td style="text-align: center;"><a href="/files/datasets/rail_trail.csv">csv</a><br><a href="/files/datasets/rail_trail.rds">rds</a></td>
<td style="text-align: left;">Dataset counting the number of people that pass through a sensor on a rail trail <br> Dataset for <a href="/assignments/homework-1/">Homework 1</a></td>
</tr>
<tr class="even">
<td style="text-align: center;"><span class="monospace">titanic</span></td>
<td style="text-align: center;"><a href="/files/datasets/rail_trail.csv">csv</a></td>
<td style="text-align: left;">Dataset about the passengers on the Titanic <br> Dataset for <a href="/assignments/homework-2/">Homework 2</a></td>
</tr>
<tr class="odd">
<td style="text-align: center;"><span class="monospace">county-complete</span></td>
<td style="text-align: center;"><a href="/files/datasets/county_complete.rds">rds</a></td>
<td style="text-align: left;">Dataset of selected variables for 3,143 counties taken from the United States Census <br> Dataset for <a href="/materials/representing-distributions-pmf/"><span class="caps">PMF</span></a> and <a href="/materials/representing-distributions-cdf/"><span class="caps">CDF</span></a> writeups</td>
</tr>
<tr class="even">
<td style="text-align: center;"><span class="monospace">yawn</span></td>
<td style="text-align: center;"><a href="/files/datasets/yawn.csv">csv</a></td>
<td style="text-align: left;">Dataset of yawning experiment performed on <em>Mythbusters</em> television show</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><span class="monospace">2002FemPreg</span></td>
<td style="text-align: center;"><a href="/files/datasets/2002FemPreg.rds">rds</a></td>
<td style="text-align: left;">Dataset from the <em>National Survey of Family Growth, Cycle 6</em> <br> Dataset for <a href="/assignments/homework-4/">Homework 4</a></td>
</tr>
<tr class="even">
<td style="text-align: center;"><span class="monospace">mariokart</span></td>
<td style="text-align: center;"><a href="/files/datasets/mariokart.rds">rds</a></td>
<td style="text-align: left;">Dataset from ebay.com for the game Mario Kart for the Nintendo Wii. This data was collected in early October, 2009.</td>
</tr>
<tr class="odd">
<td style="text-align: center;"><span class="monospace">housing</span></td>
<td style="text-align: center;"><a href="/files/datasets/housing_train.rds">training rds</a><br><a href="/files/datasets/housing_test.rds">testing rds</a></td>
<td style="text-align: left;">Dataset from the <span class="caps">NYC</span> Department of Finance reporting market valuations of condominiums for Fiscal Year 2011/2012. Split into a testing and training set. <br> Dataset for <a href="/assignments/homework-5/">Homework 5</a></td>
</tr>
</tbody>
</table>