ISYE 6414
ISYE 6414 Final Project

Datasets

Key Definitions

Dataset: A collection of data, typically a single file or table (e.g., a CSV file with unemployment rates by state).
Data Source: The organization or entity that aggregates, assembles, and publishes data (e.g., the Bureau of Labor Statistics).

Requirements

  • Each group member needs at least 3 datasets from 3 different data sources
    Multiple datasets from the same organization count as only 1 source.
  • At least 1 dataset must have 10,000+ rows
    • This is the "core" dataset for your analysis.
    • Datasets you join don't need a similar number of rows: joining a smaller reference table (e.g., country-level median income with <1,000 rows) to your core dataset is fine.
    • You may aggregate granular data (e.g., school district → state) for your analysis.
  • Don't worry if your predictors turn out to be weak
    You're graded on the quality of your analysis, not on finding strong correlations.

Rules of Thumb

  1. The problem shouldn't be trivial: your core dataset should be large enough, with enough predictors, that variable selection is a meaningful step in your analysis.
  2. The datasets you join to your core dataset should add meaningful information (but even just 1 additional predictor is fine).
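The aggregation and join described in the requirements above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a prescribed workflow. All column names (`state`, `district`, `grad_rate_pct`, `median_income`) and all values are hypothetical placeholders, and the three rows stand in for a core dataset that would really have 10,000+ rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical granular rows standing in for a 10,000+ row core dataset.
district_rows = [
    {"state": "GA", "district": "A", "grad_rate_pct": 80},
    {"state": "GA", "district": "B", "grad_rate_pct": 90},
    {"state": "AL", "district": "C", "grad_rate_pct": 70},
]

# Hypothetical small reference table from a different data source, keyed by state.
state_income = {"GA": 61000, "AL": 52000}


def aggregate_to_state(rows):
    """Average the district-level grad_rate_pct within each state."""
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["state"]].append(row["grad_rate_pct"])
    return {state: mean(values) for state, values in by_state.items()}


def left_join(state_means, reference):
    """Attach the reference value to each state; None marks a missing key."""
    return [
        {"state": s, "mean_grad_rate_pct": m, "median_income": reference.get(s)}
        for s, m in sorted(state_means.items())
    ]


state_means = aggregate_to_state(district_rows)
combined = left_join(state_means, state_income)
```

Here the join adds a single predictor (`median_income`) from a second source, which is the minimum that rule 2 allows. Using `reference.get(s)` keeps states without a match in the result, so gaps in the reference table stay visible instead of silently dropping rows.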

Visual: The 3 Data Sources Requirement

flowchart TB
    subgraph Source1["<b>Data Source 1</b>"]
        D1["Dataset A<br/>(Employment Data)<br/><i>e.g., Bureau of Labor Statistics</i>"]
    end
    subgraph Source2["<b>Data Source 2</b>"]
        D2["Dataset B<br/>(Demographics)<br/><i>e.g., Census Bureau</i>"]
    end
    subgraph Source3["<b>Data Source 3</b>"]
        D3["Dataset C<br/>(Education Stats)<br/><i>e.g., Dept. of Education</i>"]
    end
    D1 --> JOIN["Combined Dataset<br/>(merged on shared keys)"]
    D2 --> JOIN
    D3 --> JOIN
    JOIN --> ANALYSIS["Your Individual Analysis"]
    style Source1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Source2 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style Source3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style JOIN fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px

Example Scenarios

Example 1: All Members Share the Same Datasets

A 3-person group decides to use the same 3 datasets from 3 different sources:

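Dataset            | Data Source                | Alice | Bob | Carol
-------------------|----------------------------|-------|-----|------
Unemployment rates | Bureau of Labor Statistics | ✓     | ✓   | ✓
Housing prices     | Zillow Research            | ✓     | ✓   | ✓
Population data    | US Census Bureau           | ✓     | ✓   | ✓

✓ Valid: Each member uses 3 datasets from 3 sources

Example 2: Members Use Different Datasets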

A 2-person group where each member explores different aspects of a topic:

Dataset             | Data Source                | David | Emma
--------------------|----------------------------|-------|-----
Inflation rates     | Bureau of Labor Statistics | ✓     | ✓
Homelessness counts | HUD                        | ✓     |
School funding      | Dept. of Education         | ✓     |
Crime statistics    | FBI UCR                    |       | ✓
Healthcare access   | CDC                        |       | ✓

✓ Valid: David uses 3 sources (BLS, HUD, DoE), Emma uses 3 sources (BLS, FBI, CDC)
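
As a quick self-check, the "3 datasets from 3 different data sources" rule can be expressed in a few lines of Python. The sketch below is illustrative only, not an official grading script: the function name `meets_source_requirement` and the data layout are assumptions, and the assignments are transcribed from Example 2. It does not check the 10,000+ row requirement, since the example gives no row counts.

```python
# Assignments transcribed from Example 2: member -> list of (dataset, data source).
assignments = {
    "David": [
        ("Inflation rates", "Bureau of Labor Statistics"),
        ("Homelessness counts", "HUD"),
        ("School funding", "Dept. of Education"),
    ],
    "Emma": [
        ("Inflation rates", "Bureau of Labor Statistics"),
        ("Crime statistics", "FBI UCR"),
        ("Healthcare access", "CDC"),
    ],
}


def meets_source_requirement(datasets, minimum=3):
    """True if a member has >= minimum datasets from >= minimum distinct sources.

    Hypothetical helper: sources are compared by exact name, so multiple
    datasets from the same organization count as only one source.
    """
    distinct_datasets = {name for name, _ in datasets}
    distinct_sources = {source for _, source in datasets}
    return len(distinct_datasets) >= minimum and len(distinct_sources) >= minimum


results = {member: meets_source_requirement(ds) for member, ds in assignments.items()}
```

A member with three datasets drawn from only two organizations (say, two from the Bureau of Labor Statistics and one from the Census Bureau) would fail this check, which matches the rule that several datasets from one organization count as a single source.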