ISYE 6414
User accounts will be set up and emailed by June 1.
Your Analysis Plan, Final Report, and Peer Reviews will be submitted via this account.
Your individual code will be submitted via your group's GitHub repository, which we'll invite you to after your account is created.

Datasets

Key Definitions

Dataset: A collection of data, typically a single file or table (e.g., a CSV file with unemployment rates by state).
Data Source: The organization or entity that aggregates, assembles, and publishes data (e.g., the Bureau of Labor Statistics).

Requirements

  • Your group needs at least 3 datasets from at least 3 different data sources, joined together
    Multiple datasets from the same organization count as only 1 source.
  • At least 1 dataset must have 10,000+ rows before filtering
    • This is the "core" dataset for your analysis.
    • Filter and clean it thoughtfully, but retain at least a few thousand rows afterward — you need enough data to split into training, validation, and test sets and still model meaningfully.
    • Datasets you join don't need similar cardinality — joining a smaller reference table (e.g., country-level median income with <1,000 rows) to your core dataset is fine.
    • You may aggregate granular data (e.g., school district → state) for your analysis.
  • At least 10 candidate predictors per model
    Your combined data must support 10+ predictors per model (a categorical variable counts as 1 regardless of its number of levels) so you can perform meaningful variable selection.
  • Don't worry if your predictors turn out to be weak
    You're graded on your analysis quality, not on finding strong correlations.
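
As a quick self-check, the requirements above can be sketched as a small Python function. This is only an illustrative sketch, not an official grading tool: the `DatasetInfo` class, the `check_requirements` function, and all names and counts in the example are hypothetical, and you supply the predictor counts yourself (counting each categorical variable as 1).

```python
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    name: str        # hypothetical label for the dataset
    source: str      # organization that publishes it
    rows: int        # row count before filtering
    predictors: int  # candidate predictors it contributes (a categorical counts as 1)


def check_requirements(datasets):
    """Return a list of unmet requirements (an empty list means every check passed)."""
    problems = []
    if len(datasets) < 3:
        problems.append("need at least 3 datasets")
    # Multiple datasets from the same organization count as only 1 source.
    if len({d.source for d in datasets}) < 3:
        problems.append("need at least 3 different data sources")
    if not any(d.rows >= 10_000 for d in datasets):
        problems.append("need at least 1 dataset with 10,000+ rows before filtering")
    if sum(d.predictors for d in datasets) < 10:
        problems.append("need 10+ candidate predictors in the combined data")
    return problems


# Made-up example values, for illustration only.
example = [
    DatasetInfo("core", "Source A", 50_000, 8),
    DatasetInfo("reference_1", "Source B", 3_000, 2),
    DatasetInfo("reference_2", "Source C", 900, 3),
]
print(check_requirements(example))
```

An empty list means the sketch found nothing missing. It does not check that the datasets can actually be joined on shared keys, so that still needs to be verified by hand.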
Rules of Thumb
  1. The problem shouldn't be trivial — your core dataset should be large enough with a decent number of predictors such that you're performing variable selection.
  2. The datasets you join to your core dataset should add meaningful information (but even just 1 additional predictor is fine).

Visual: The 3 Data Sources Requirement

flowchart TB
    subgraph Source1["<b>Data Source 1</b>"]
        D1["Dataset A<br/>(Employment Data)<br/><i>e.g., Bureau of Labor Statistics</i>"]
    end
    subgraph Source2["<b>Data Source 2</b>"]
        D2["Dataset B<br/>(Demographics)<br/><i>e.g., Census Bureau</i>"]
    end
    subgraph Source3["<b>Data Source 3</b>"]
        D3["Dataset C<br/>(Education Stats)<br/><i>e.g., Dept. of Education</i>"]
    end
    D1 --> JOIN["Combined Dataset<br/>(merged on shared keys)"]
    D2 --> JOIN
    D3 --> JOIN
    JOIN --> ANALYSIS["Your Individual Analysis"]
    style Source1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Source2 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style Source3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style JOIN fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px

Example

A Group's Combined Dataset

Your group assembles one combined dataset by joining at least 3 datasets drawn from at least 3 different data sources. For example:

Dataset                       | Data Source                | Role
Housing prices (50,000+ rows) | Zillow Research            | Core dataset
Unemployment rates            | Bureau of Labor Statistics | Joined by region + date
Population data               | US Census Bureau           | Joined by region
✓ Valid: 3 datasets from 3 distinct sources, joined, with a 10,000+ row core dataset
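
The joins in this example could be sketched in plain Python as shown below. This is a minimal illustration under stated assumptions, not a required workflow: the column names (`region`, `date`, `price`, `unemp_rate`, `population`), the `left_join` helper, and every value are made up, and in practice you would likely use your analysis tool's own merge function.

```python
# Core dataset: one row per (region, date). All values are made-up placeholders.
housing = [
    {"region": "GA", "date": "2023-01", "price": 310_000},
    {"region": "GA", "date": "2023-02", "price": 312_000},
    {"region": "TX", "date": "2023-01", "price": 295_000},
]
unemployment = [
    {"region": "GA", "date": "2023-01", "unemp_rate": 3.1},
    {"region": "GA", "date": "2023-02", "unemp_rate": 3.2},
    {"region": "TX", "date": "2023-01", "unemp_rate": 3.9},
]
population = [
    {"region": "GA", "population": 10_900_000},
    {"region": "TX", "population": 30_000_000},
]


def left_join(left, right, keys):
    """Left-join two lists of dicts on the given key columns.

    Assumes each key combination appears at most once in `right`.
    Rows of `left` without a match are kept and simply gain no new columns.
    """
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        joined.append({**match, **row})  # left values win on name clashes
    return joined


# Join unemployment by region + date, then population by region.
combined = left_join(housing, unemployment, ["region", "date"])
combined = left_join(combined, population, ["region"])
print(combined[0])
```

A left join keeps every row of the core dataset, which makes it easy to compare row counts before and after joining and to spot keys that failed to match.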