ISYE 6414

Datasets

Each individual student must meet the 3d+3d+ requirement (at least 3 datasets coming from at least 3 different data sources). The joined datasets may be the same, or not, across all members of a group. Picking good data sources that can be combined in interesting ways is essential to analysis that leads to valuable insights.

Requirements

  • You need at least 3 distinct datasets from at least 3 different data sources (3d+3d+ requirement)
    A dataset is a collection of data, while a data source is the organization/entity that aggregates, assembles, and publishes the data. These datasets must come from different data sources (organizations). For example, the Bureau of Labor Statistics has many datasets: use as many as you like from them, but together they count as only 1 data source. Part of your job is to combine these sources. For example, you might combine inflation data from the U.S. Bureau of Labor Statistics with homelessness data from the U.S. Department of Housing and Urban Development and chronic absenteeism data from the U.S. Department of Education.
  • At least 1 of the datasets must consist of many observations
    In the example above, perhaps your 'many observations' dataset contains chronic absenteeism by school district. You might decide to aggregate this by state or region: that's okay. The main point is that the dataset wasn't trivial to begin with. Simplifying things for the sake of understandability is … understandable, as it's part of the job of a data scientist.
  • Be bold: don't worry about finding weak predictors/coefficients
    Your group will be graded on the work you do: the writing, analysis, etc. You will not be graded on whether you find what you hoped or expected to find. Go out on a limb … maybe you'll find something interesting that will make the TAs' day.
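
To make the requirements above concrete, here is a minimal Python sketch of the workflow they describe: checking the dataset/source count, aggregating a district-level dataset to the state level, and joining it with two state-level datasets. It is an illustration under stated assumptions, not a required method. The function names (meets_requirement, aggregate_to_state, inner_join), the record layout, the unweighted average, and every number are hypothetical placeholders; none of the values are real figures.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical catalog: one entry per dataset, tagged with its data source.
catalog = [
    {"name": "inflation", "source": "Bureau of Labor Statistics"},
    {"name": "homelessness", "source": "Dept. of Housing and Urban Development"},
    {"name": "chronic absenteeism", "source": "Dept. of Education"},
]

# Hypothetical toy records; every number below is made up for illustration.
# The "many observations" dataset: absenteeism rate by school district.
absenteeism_by_district = [
    {"state": "GA", "district": "A", "absent_rate": 0.20},
    {"state": "GA", "district": "B", "absent_rate": 0.30},
    {"state": "AL", "district": "C", "absent_rate": 0.10},
]
# Two state-level datasets from the other two sources.
inflation_by_state = {"GA": 3.1, "AL": 2.9}
homelessness_by_state = {"GA": 10.0, "AL": 7.5, "FL": 12.0}


def meets_requirement(datasets, minimum=3):
    """Check for >= minimum datasets drawn from >= minimum distinct sources."""
    sources = {d["source"] for d in datasets}
    return len(datasets) >= minimum and len(sources) >= minimum


def aggregate_to_state(rows):
    """Average the district-level rate within each state (unweighted)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["state"]].append(row["absent_rate"])
    return {state: mean(rates) for state, rates in grouped.items()}


def inner_join(*tables):
    """Keep only the states present in every table; one tuple per state."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {state: tuple(t[state] for t in tables) for state in sorted(shared)}


absenteeism_by_state = aggregate_to_state(absenteeism_by_district)
combined = inner_join(absenteeism_by_state, inflation_by_state, homelessness_by_state)
print(meets_requirement(catalog))
print(combined)
```

Note that two datasets from the same organization would add to the dataset count but not to the source count, so the check would still fail with fewer than 3 distinct sources. The inner join silently drops any state missing from one of the tables (here, the state that appears only in the homelessness data), which is worth reporting in your write-up. With a dataframe library, the same steps would typically be a group-by followed by merges.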