Datasets
Picking good data sources that can be combined in interesting ways is essential to analysis that leads to valuable insights.
Requirements for the Final Project
- You need 3 distinct data sources
Part of your job is to combine these sources: for example, you might combine inflation data from the
US Bureau of Labor Statistics with homelessness data from the
U.S. Department of Housing and Urban Development and chronic absenteeism from the
U.S. Department of Education.
- 1 of the datasets needs to consist of many observations
In the example above, perhaps your 'many observations' dataset contains chronic absenteeism by school district. You might decide to aggregate this by state or region: that's okay. The main point is that the dataset wasn't trivial to begin with; simplifying things for the sake of understandability is … understandable, as that's part of the job of a data scientist.
- Be bold: don't worry about finding weak predictors/coefficients.
Your group will be graded on the work you do: the writing, the analysis, and so on. You will not be graded on your ability to find what you hoped or expected to find, so go out on a limb … maybe you'll find something interesting that will make the TAs' day.
How do we combine datasets?
Combining data from different sources is a crucial skill in data science because it allows you to answer broader questions and gain deeper insights. In practice, this typically means identifying a shared column or key among your datasets (such as a date, region, or other unique identifier) and then using join or merge operations to bring them together. If you're working in Python, you can use pandas' merge() or concat() functions; in R, you can do something similar with dplyr functions like left_join(), right_join(), or full_join(). It's common to run into small hurdles during this process, like inconsistent column names or mismatched data types. Address these by renaming columns, transforming date formats, or mapping categorical variables so that everything aligns correctly.
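As a small sketch of what this looks like in pandas (all column names and values here are made up for illustration, not taken from the real BLS or HUD datasets):

```python
import pandas as pd

# Hypothetical inflation figures keyed by state and year.
inflation = pd.DataFrame({
    "state": ["CA", "TX", "NY"],
    "year": [2022, 2022, 2022],
    "cpi_change_pct": [6.1, 5.8, 5.5],
})

# Hypothetical homelessness counts; note the differently named key column,
# a typical small hurdle when combining sources.
homeless = pd.DataFrame({
    "State": ["CA", "TX", "NY"],
    "year": [2022, 2022, 2022],
    "homeless_count": [171000, 24000, 74000],
})

# Align the key columns first, then merge on the shared keys.
homeless = homeless.rename(columns={"State": "state"})
combined = inflation.merge(homeless, on=["state", "year"], how="left")
print(combined)
```

A left join keeps every row of the first dataset even when the second has no match; an inner join would instead drop unmatched rows, which may or may not be what you want.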
If one of your datasets has many observations (for example, chronic absenteeism at the school-district level) and another dataset only has figures at the state or national level, you’ll need to decide on a common level of aggregation. Sometimes that means summing or averaging data to match the granularity of the other source. The key is being consistent: if you aggregate your "big" dataset to the state level, do the same for any other datasets meant to join on that state identifier. Ultimately, your goal is a unified dataset that lets you explore potential relationships across all your variables, so don’t be afraid to experiment: this is precisely the kind of problem real data scientists tackle every day.
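The aggregate-then-join step might look like the sketch below. The data is invented, and the enrollment-weighted average is just one reasonable choice; a plain mean or a sum may suit your question better.

```python
import pandas as pd

# Hypothetical district-level absenteeism: the "many observations" dataset.
districts = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "district": ["D1", "D2", "D3", "D4"],
    "chronic_absent_rate": [0.18, 0.22, 0.15, 0.25],
    "enrollment": [10000, 5000, 8000, 2000],
})

# Aggregate to the state level with an enrollment-weighted average,
# so large districts count proportionally more than small ones.
districts["weighted"] = districts["chronic_absent_rate"] * districts["enrollment"]
state_absent = districts.groupby("state").agg(
    weighted_sum=("weighted", "sum"),
    total_enrollment=("enrollment", "sum"),
).reset_index()
state_absent["chronic_absent_rate"] = (
    state_absent["weighted_sum"] / state_absent["total_enrollment"]
)
state_absent = state_absent[["state", "chronic_absent_rate"]]

# Hypothetical state-level dataset to join against, now at the same granularity.
state_stats = pd.DataFrame({
    "state": ["CA", "TX"],
    "homeless_count": [171000, 24000],
})

merged = state_stats.merge(state_absent, on="state", how="inner")
print(merged)
```

Once both tables share the same unit of analysis (here, the state), the join itself is the easy part; deciding how to aggregate is where the judgment lives.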
Finding Sources
Professor Horng has a number of suggested datasets linked, so we'll point you
there instead of copying their links.
A different approach would be to start with a topic, conduct your
literature review, and see what sources people used in the papers you read … and in the papers cited by those papers (which you can, and should, read recursively; that's essentially how a literature review works, and eventually you reach a saturation point in a particular field).