ISYE 6414
Students will be given a login for this site after Final Project groups are finalized on January 27.
Checkpoints, the Final Report and Peer Reviews will be submitted via this account.
Your individual code will be submitted via your group's GitHub repository, which we'll invite you to after your account is created.

Datasets

Each student must use at least 3 datasets from 3 different data sources. Group members may all use the same joined datasets, or each may use different ones. Picking good data sources that can be combined in interesting ways is essential to an analysis that yields valuable insights.

Requirements

  • You need at least 3 datasets from 3 different data sources
    A dataset is a collection of data, while a data source is the organization/entity that aggregates, assembles, and publishes the data. These datasets must come from different data sources (organizations). For example, the U.S. Bureau of Labor Statistics has many datasets: use as many as you like from them, but together they count as only 1 data source. Part of your job is to combine these sources: for example, you might combine inflation data from the U.S. Bureau of Labor Statistics with homelessness data from the U.S. Department of Housing and Urban Development and chronic absenteeism data from the U.S. Department of Education.
    Common Question: What counts as a 'data source'?
    Multiple datasets from the same organization (e.g., several CSV files from the Bureau of Labor Statistics) count as only 1 data source. You need 3 different organizations/providers.
  • At least 1 of the datasets must contain many observations
    In the example above, perhaps your many-observation dataset contains chronic absenteeism by school district. You might decide to aggregate it by state or region: that's okay. The main point is that the dataset wasn't trivial to begin with. Simplifying things for the sake of understandability is … understandable, as it's part of the job of a data scientist.
  • Be bold: don't worry about finding weak predictors/coefficients
    Your group will be graded on the work you do: the writing, analysis, etc. You will not be graded on your ability to find what you hoped or expected to find. Go out on a limb … maybe you'll find something interesting that will make the TAs' day.
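The dataset versus data source distinction in the first requirement can be checked mechanically. The sketch below is only an illustration, not an official grading tool: the inventory list and the `meets_requirement` helper are hypothetical names introduced here, and the dataset names are placeholders.

```python
from collections import Counter

# Hypothetical inventory: (dataset name, publishing organization).
datasets = [
    ("CPI inflation", "Bureau of Labor Statistics"),
    ("Unemployment rate", "Bureau of Labor Statistics"),
    ("Homelessness counts", "Department of Housing and Urban Development"),
    ("Chronic absenteeism", "Department of Education"),
]


def meets_requirement(datasets, min_datasets=3, min_sources=3):
    """Return True if there are enough datasets AND enough distinct sources."""
    # A set keeps each organization once, however many datasets it publishes.
    sources = {source for _, source in datasets}
    return len(datasets) >= min_datasets and len(sources) >= min_sources


# How many datasets each organization contributes.
per_source = Counter(source for _, source in datasets)
print(per_source)

print(meets_requirement(datasets))      # 4 datasets from 3 organizations
print(meets_requirement(datasets[:3]))  # 3 datasets, but only 2 organizations
```

The second call fails the check because two of its three datasets come from the same organization, which is exactly the case described in the Common Question above.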

Visual: The 3 Data Sources Requirement

flowchart TB
    subgraph Source1["<b>Data Source 1</b><br/>e.g., Bureau of Labor Statistics"]
        D1["Dataset A<br/>(Employment Data)"]
    end
    subgraph Source2["<b>Data Source 2</b><br/>e.g., Census Bureau"]
        D2["Dataset B<br/>(Demographics)"]
    end
    subgraph Source3["<b>Data Source 3</b><br/>e.g., Dept. of Education"]
        D3["Dataset C<br/>(Education Stats)"]
    end
    D1 --> JOIN["Combined Dataset<br/>(merged on shared keys)"]
    D2 --> JOIN
    D3 --> JOIN
    JOIN --> ANALYSIS["Your Individual Analysis"]
    style Source1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Source2 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style Source3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style JOIN fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px
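The flow in the diagram (aggregate the many-observation dataset, then merge all three sources on a shared key) might look like the following in pandas. This is a minimal sketch under stated assumptions: the course does not prescribe pandas, the column names are hypothetical, the state codes are placeholders, and every value is made up for illustration.

```python
import pandas as pd

# All values below are invented for illustration only.
# Source 3 (education): the "many observations" dataset, one row per district.
absenteeism = pd.DataFrame({
    "state": ["AA", "AA", "AA", "BB", "BB"],
    "district": ["d1", "d2", "d3", "d4", "d5"],
    "absent_rate": [0.10, 0.20, 0.30, 0.15, 0.25],
})
# Sources 1 and 2: already one row per state.
inflation = pd.DataFrame({"state": ["AA", "BB"], "cpi_change": [3.1, 3.4]})
homelessness = pd.DataFrame({"state": ["AA", "BB"], "homeless_per_10k": [10.0, 12.0]})

# Aggregate district-level rows up to the state level so the grain matches.
absent_by_state = absenteeism.groupby("state", as_index=False).agg(
    mean_absent_rate=("absent_rate", "mean"),
    n_districts=("district", "count"),
)

# Merge on the shared key; validate= raises if a key is unexpectedly duplicated.
combined = (
    inflation
    .merge(homelessness, on="state", how="inner", validate="one_to_one")
    .merge(absent_by_state, on="state", how="inner", validate="one_to_one")
)
print(combined)
```

Keeping `n_districts` alongside the mean records how many raw observations sit behind each aggregated row, which helps show that the original dataset was not trivial before it was simplified.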