Each project skeleton gives you most of what you need, but it leaves two things to you: finding additional datasets, and deciding exactly what you want to model, predict, or explore.
Keep in Mind
These skeletons are not complete: you need to expand upon them.
Logistic Regression: Predicting Recidivism Rates Based on Demographic and Criminal History
Background
The criminal justice system aims to reduce recidivism, but many individuals re-offend within a few years of release. Understanding the factors that contribute to recidivism can inform policy changes and rehabilitation programs.
Possible Research Questions
What factors (e.g., age, prior convictions, education level) are associated with a higher likelihood of reoffending?
Does participation in rehabilitation programs reduce recidivism?
Are certain types of offenses more predictive of recidivism?
Key Variables
Dependent Variable: Recidivism (binary: 1 = reoffended, 0 = did not reoffend)
Independent Variables: Age at release, number of prior convictions, type of offense, education level, participation in job training or rehabilitation, race/ethnicity, etc.
Methods
Perform logistic regression to classify individuals as likely or unlikely to reoffend.
Examine the odds ratio (the exponentiated coefficient) for each predictor to see how it changes the odds of reoffending.
Evaluate classification performance using the ROC curve (and its AUC) and a confusion matrix.
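The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated (not real recidivism records), the variable names and effect sizes are made up, and the model is fit with a hand-written Newton-Raphson loop so that the example has no dependencies beyond NumPy. In a real project you would more likely use a library such as statsmodels or scikit-learn, and evaluate on held-out data rather than the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, purely illustrative data. These are NOT real recidivism records;
# the variables and effect sizes below are assumptions made for this sketch.
n = 2000
age = rng.uniform(18, 60, n)          # hypothetical: age at release (years)
priors = rng.poisson(2, n)            # hypothetical: number of prior convictions
X = np.column_stack([np.ones(n), (age - 40) / 10, priors])  # intercept, age (decades, centered), priors
true_beta = np.array([-0.5, -0.4, 0.3])                     # used only to simulate the outcome
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))       # 1 = reoffended, 0 = did not


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson (equivalent to IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))      # predicted probabilities
        grad = X.T @ (y - mu)                 # gradient of the log-likelihood
        hess = X.T @ (X * (mu * (1 - mu))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def confusion_matrix(y_true, y_pred):
    """Rows are actual class (0, 1); columns are predicted class (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for actual in (0, 1):
        for predicted in (0, 1):
            cm[actual, predicted] = np.sum((y_true == actual) & (y_pred == predicted))
    return cm


def auc(y_true, score):
    """AUC as the probability that a random positive outscores a random negative."""
    pos, neg = score[y_true == 1], score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


beta_hat = fit_logistic(X, y)
odds_ratios = np.exp(beta_hat[1:])            # one odds ratio per predictor (intercept excluded)

prob = 1 / (1 + np.exp(-X @ beta_hat))
cm = confusion_matrix(y, (prob >= 0.5).astype(int))   # 0.5 cutoff is an arbitrary choice
auc_value = auc(y, prob)

print("Odds ratios (age per decade, priors):", odds_ratios)
print("Confusion matrix:\n", cm)
print("In-sample AUC:", auc_value)
```

An odds ratio above 1 means the predictor is associated with higher odds of reoffending, and one below 1 means lower odds. The 0.5 classification cutoff is a choice, not a rule: changing it moves counts between the cells of the confusion matrix, which is exactly the trade-off the ROC curve traces out.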
Multiple Linear Regression: Predicting Student Loan Debt Based on Socioeconomic Factors
Background
The rising cost of education has led to an increasing reliance on student loans in the U.S. Some students accumulate significantly higher debt than others, leading to disparities in financial burden post-graduation. This study will analyze factors influencing student loan debt levels using multiple linear regression.
Possible Research Questions
What socioeconomic factors (e.g., parental income, tuition costs, major choice, school type) significantly impact the total student loan debt upon graduation?
Is working during college associated with lower overall debt?
Do in-state and out-of-state students differ significantly in loan amounts?
Key Variables
Dependent Variable: Total student loan debt at graduation (continuous)
Independent Variables: Family income, tuition costs, financial aid received, major category, institution type, employment status, etc.
Methods
Perform multiple linear regression to predict total student loan debt.
Check for multicollinearity, heteroscedasticity, and normality of residuals.
Compare models with and without interaction terms.
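A minimal sketch of these steps, again on simulated data, might look like the following. Everything about the data is an assumption made for illustration (the variable names, units, and effect sizes are hypothetical, not taken from any real loan dataset). The sketch fits ordinary least squares with NumPy, computes variance inflation factors (VIFs) for multicollinearity, a Breusch-Pagan style statistic for heteroscedasticity, and compares models with and without an interaction term using adjusted R-squared. It does not cover the normality check, for which a Q-Q plot of the residuals is the usual starting point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated, purely illustrative data (NOT real student loan records).
n = 500
income = rng.normal(70, 20, n)        # hypothetical: family income, in $1000s
tuition = rng.normal(25, 8, n)        # hypothetical: annual tuition, in $1000s
private = rng.binomial(1, 0.4, n)     # hypothetical: 1 = private institution
debt = (10 - 0.1 * income + 0.8 * tuition + 5 * private
        + 0.3 * tuition * private + rng.normal(0, 4, n))  # assumed data-generating process


def ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, residuals, R^2."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2


def adjusted_r2(r2, n_obs, n_predictors):
    """R^2 penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def vif(X):
    """Variance inflation factor for each column: 1 / (1 - R^2 on the other columns)."""
    factors = []
    for j in range(X.shape[1]):
        _, _, r2_j = ols(np.delete(X, j, axis=1), X[:, j])
        factors.append(1 / (1 - r2_j))
    return np.array(factors)


X_main = np.column_stack([income, tuition, private])
X_inter = np.column_stack([X_main, tuition * private])   # adds tuition x private interaction

_, _, r2_main = ols(X_main, debt)
_, resid_inter, r2_inter = ols(X_inter, debt)
adj_main = adjusted_r2(r2_main, n, X_main.shape[1])
adj_inter = adjusted_r2(r2_inter, n, X_inter.shape[1])

vifs = vif(X_main)

# Breusch-Pagan style LM statistic: n * R^2 from regressing squared residuals
# on the predictors. Large values suggest heteroscedasticity.
_, _, r2_aux = ols(X_inter, resid_inter ** 2)
bp_stat = n * r2_aux

print("VIFs (income, tuition, private):", vifs)
print("Adjusted R^2 without / with interaction:", adj_main, adj_inter)
print("Breusch-Pagan LM statistic:", bp_stat)
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic multicollinearity. The Breusch-Pagan statistic is compared against a chi-squared distribution with degrees of freedom equal to the number of predictors in the auxiliary regression.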
Poisson Regression: Modeling Traffic Accident Counts Based on Road Conditions and Demographics
Background
Traffic accidents are a major public safety concern, and many factors contribute to accident frequency. This study will use Poisson regression to analyze how road conditions, weather, and demographics influence accident counts.
Possible Research Questions
Do certain road conditions (e.g., wet roads, construction zones) lead to higher accident frequencies?
How do driver age and experience affect accident counts?
Does population density correlate with accident frequency in urban vs. rural areas?
Possible Data Sources
National Highway Traffic Safety Administration (NHTSA) Crash Data
Federal Highway Administration (FHWA) Traffic Volume Reports
Local Department of Transportation Data
Key Variables
Dependent Variable: Number of accidents at a given location (count variable)
Independent Variables: Weather conditions, road type, speed limits, driver demographics, traffic volume, urban vs. rural, etc.
Methods
Use Poisson regression to model accident frequency as a function of predictor variables.
Check for overdispersion (variance larger than the mean), and consider a negative binomial model if it is present.
Assess goodness of fit using the deviance, and compare candidate models using AIC/BIC.
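The workflow above can be sketched as follows. The data are simulated and deliberately overdispersed (a gamma-Poisson mixture), and the predictors and coefficients are hypothetical; none of this comes from NHTSA or FHWA data. The Poisson model is fit with a hand-written Newton-Raphson loop to keep the example dependent only on NumPy, whereas a real analysis would more likely use a GLM routine from a statistics library.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Simulated, purely illustrative data (NOT real crash records).
n = 1000
wet = rng.binomial(1, 0.3, n)              # hypothetical: 1 = wet road surface
log_traffic = rng.normal(0, 0.5, n)        # hypothetical: centered log traffic volume
X = np.column_stack([np.ones(n), wet, log_traffic])
mu_true = np.exp(X @ np.array([0.5, 0.4, 0.8]))   # assumed effects, used only to simulate
# Multiplying the mean by gamma noise (mean 1) makes the counts overdispersed.
y = rng.poisson(mu_true * rng.gamma(shape=2.0, scale=0.5, size=n))


def fit_poisson(X, y, n_iter=50):
    """Fit a log-link Poisson regression by Newton-Raphson (equivalent to IRLS)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())             # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)              # gradient of the log-likelihood
        hess = X.T @ (X * mu[:, None])     # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta


beta_hat = fit_poisson(X, y)
mu_hat = np.exp(X @ beta_hat)
n_params = X.shape[1]

# Pearson dispersion statistic: values well above 1 indicate overdispersion.
dispersion = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - n_params)

# Residual deviance; the y * log(y / mu) term is defined as 0 when y == 0.
safe_y = np.where(y > 0, y, 1)
deviance = 2 * np.sum(np.where(y > 0, y * np.log(safe_y / mu_hat), 0.0) - (y - mu_hat))

# Poisson log-likelihood, then AIC and BIC.
log_factorial = np.array([math.lgamma(count + 1) for count in y])
log_lik = np.sum(y * np.log(mu_hat) - mu_hat - log_factorial)
aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n) - 2 * log_lik

print("Rate ratios (wet road, log traffic):", np.exp(beta_hat[1:]))
print("Pearson dispersion:", dispersion)
print("Deviance:", deviance, "AIC:", aic, "BIC:", bic)
```

Because the simulated counts were generated with extra variance, the dispersion statistic should come out noticeably above 1, which is the situation in which a negative binomial model is worth fitting and comparing by AIC/BIC. If locations differ in exposure (for example, traffic volume), including log exposure as an offset is a common modeling choice.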