Insights With Patalee
Home
Quick Read
Data Analytics in Health Care
About
Beyond work
View My GitHub Repo


Insights into the U.S. Market for Cereal Containing Hemp: Evidence from Scanner Data

Abstract:

As hemp re-emerges in the U.S. food system under a modern regulatory framework, understanding its integration into consumer diets is increasingly important. This study investigates consumer adoption of hemp-inclusive cereals, that is, products incorporating hemp seeds or hemp-derived ingredients. Using household-level scanner panel data from 2017–2019, we examine purchasing behavior across households, focusing on spending patterns, demographic drivers, and post-purchase behavior.

To classify consumer engagement with hemp-inclusive cereal, we used Random Forest classification, a non-parametric model capable of capturing non-linear relationships between household characteristics and purchasing decisions.

By exploring consumer pathways and decision-making dynamics, this research contributes to understanding how novel, plant-based food products like hemp-inclusive cereal can better penetrate mainstream markets. The insights offer practical implications for food marketers, retailers, and policymakers aiming to promote sustainable and functional foods in today’s evolving food landscape.
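The abstract names Random Forest classification as the engagement model. The study's scanner data are proprietary, so the snippet below is only a minimal sketch on synthetic household data: the features (income bracket, age, household size) and the rule generating the labels are hypothetical stand-ins, not the paper's variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
# Hypothetical household features: income bracket, head-of-household age, household size
X = np.column_stack([
    rng.integers(1, 10, n),   # income bracket (1-9)
    rng.integers(20, 80, n),  # age of household head
    rng.integers(1, 6, n),    # household size
])
# Synthetic label: purchase probability rises with income and falls with age
p = 1 / (1 + np.exp(-(0.3 * X[:, 0] - 0.04 * X[:, 1] + 1.0)))
y = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("feature importances:", clf.feature_importances_)
```

Because the model is non-parametric, the importance scores (rather than coefficients) indicate which household characteristics drive the classification.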

Read More

Estimating Mean and Variance Functions of a Random Variable Dependent on Two Independent Variables

Abstract:

The mean and variance of a random variable play pivotal roles in statistical analysis and predictive modeling, offering insights into central tendency and data dispersion. In this study, we develop predictive models for estimating the mean and variance of a random variable Y that depends on two independent variables, X1 and X2. Leveraging 200 realizations of Y across various (X1, X2) pairs, we employ data mining and machine learning techniques to predict the mean and variance as functions of X1 and X2. The methodology applies a range of non-linear predictive models, including Generalized Additive Models, Support Vector Machines (SVM), Local Smoothing (LOESS), K-Nearest Neighbors (KNN), Neural Networks, Random Forests (RF), and boosting. Parameter tuning via cross-validation and grid search optimizes model performance, with KNN emerging as the top-performing model. Validation results demonstrate KNN's superior accuracy in predicting both the mean and the variance, as evidenced by lower RMSE, MAE, and RSS values compared to the other models. In conclusion, the KNN model offers the most accurate estimates of the mean and variance, providing valuable insights for decision-making across various domains.
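The abstract does not spell out how KNN yields both moments, so the sketch below shows one common two-stage approach on simulated data: fit a KNN regressor to Y for the conditional mean, then fit a second KNN to the squared residuals for the conditional variance. The data-generating functions and the choice k = 25 are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the data: Y | (X1, X2) has mean sin(X1) + X2
# and standard deviation 0.5 + 0.2 * |X2|
n = 2000
X = rng.uniform(-2, 2, size=(n, 2))
mu = np.sin(X[:, 0]) + X[:, 1]
sd = 0.5 + 0.2 * np.abs(X[:, 1])
Y = rng.normal(mu, sd)

# Stage 1: KNN estimate of the conditional mean E[Y | X1, X2]
knn_mean = KNeighborsRegressor(n_neighbors=25).fit(X, Y)
mu_hat = knn_mean.predict(X)

# Stage 2: KNN on squared residuals estimates the conditional variance
knn_var = KNeighborsRegressor(n_neighbors=25).fit(X, (Y - mu_hat) ** 2)

grid = np.array([[0.0, 1.0]])  # evaluate at (X1, X2) = (0, 1)
print("predicted mean:", knn_mean.predict(grid))
print("predicted variance:", knn_var.predict(grid))
```

At (X1, X2) = (0, 1) the true mean in this simulation is sin(0) + 1 = 1, so the first prediction should land nearby, while the second stays positive by construction.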

Access Full Paper Here
GitHub Code

Decoding Cardiovascular Disease: Predictors and Insights

Abstract:

This project aims to build prediction models for assessing the presence of cardiovascular disease (CVD) in patients and to analyze factors that can predict the presence of cardiovascular disease. The goal is to derive insights from an existing dataset of 70,000 patient records to identify which features may have the most significant influence on cardiovascular disease and identify potential avenues for further research. Machine learning algorithms, including Linear Discriminant Analysis, Naive Bayes, Logistic Regression, and Random Forests, were used to build predictive models of CVD. We found that systolic blood pressure was most significantly associated with the presence of CVD, followed by age (CVD was more likely to be present in older populations). Factors such as cholesterol, glucose levels, weight, and height had moderate to low significance. By leveraging these results and expanding on them, healthcare professionals can enhance preventive strategies for individuals at higher risk of developing CVD, ultimately leading to improved patient outcomes and reduced healthcare burdens. 

Note: I led the main data analytics and modeling for this project, while exploratory data analysis was conducted by other team members.
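The project reports systolic blood pressure and age as the strongest predictors of CVD. As a hedged illustration (not the project's actual data or code), the sketch below fits one of the listed models, logistic regression, on synthetic patient records whose risk is constructed to depend mainly on systolic pressure; standardized coefficients then rank feature influence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 5000
# Hypothetical patient features echoing the dataset's fields
ap_hi = rng.normal(130, 20, n)  # systolic blood pressure
age = rng.uniform(30, 70, n)    # age in years
chol = rng.integers(1, 4, n)    # cholesterol category (1-3)

# Synthetic outcome: CVD risk driven mainly by systolic BP, then age
logit = 0.05 * (ap_hi - 130) + 0.03 * (age - 50) + 0.1 * (chol - 2)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Standardizing first makes the coefficient magnitudes directly comparable
X = StandardScaler().fit_transform(np.column_stack([ap_hi, age, chol]))
model = LogisticRegression().fit(X, y)
print(dict(zip(["ap_hi", "age", "chol"], model.coef_[0].round(2))))
```

In this toy setup, the standardized coefficient for systolic pressure comes out largest, mirroring the ordering the paper reports.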

Access Full Paper Here
GitHub Code

Chasing Votes: A Markov Chain Monte Carlo Forecast for the 2024 U.S. Election

Abstract:

Markov chain Monte Carlo (MCMC) methods are essential tools for solving many modern statistical and computational problems (Calderhead, 2014). The name "Monte Carlo" originally referred to the casino at Monte Carlo but soon became a technical term for the simulation of random processes (Geyer, 2011). A Markov chain is a stochastic process consisting of random variables that transition between states according to specific probabilistic rules. The process relies on the Markov property, which states that the future state of the system is determined solely by its current state, without any influence from past states. This makes Markov chains a powerful tool for modeling sequences in which each step depends on nothing but the most recent one (Lateef, 2019). This paper explains MCMC in detail and presents an exploratory analysis that uses it to simulate the 2024 U.S. Presidential Election.
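To make the MCMC mechanics concrete, here is a minimal random-walk Metropolis sampler, not the paper's model: it draws a candidate's popular-vote share from the posterior implied by a single hypothetical poll (520 of 1,000 respondents favoring candidate A) under a uniform prior. The poll numbers, proposal step size, and burn-in length are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical poll: 520 of 1,000 respondents favor candidate A.
# With a uniform prior, the posterior for the vote share p is Beta(521, 481);
# here we sample it with random-walk Metropolis rather than directly.
k, n = 520, 1000

def log_post(p):
    """Unnormalized log posterior of the vote share p."""
    if not 0 < p < 1:
        return -np.inf
    return k * np.log(p) + (n - k) * np.log(1 - p)

samples, p = [], 0.5
for _ in range(20000):
    prop = p + rng.normal(0, 0.02)  # random-walk proposal
    # Metropolis acceptance: move with probability min(1, post(prop)/post(p))
    if np.log(rng.random()) < log_post(prop) - log_post(p):
        p = prop
    samples.append(p)  # current state depends only on the previous one

draws = np.array(samples[5000:])  # discard burn-in
print("posterior mean vote share:", draws.mean().round(3))
print("P(A wins popular vote):", (draws > 0.5).mean().round(3))
```

The chain's next state depends only on its current state, which is exactly the Markov property described above; the retained draws approximate the posterior, so the share of draws above 0.5 estimates the win probability.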

Access Full Paper Here
GitHub Code

Forecasting Petty Crimes During Major Holidays in the United States

Abstract:

Holidays present unique challenges for police departments in preparing duty rosters and personnel assignments. Law enforcement agencies must decide how many officers to keep on duty during times when individuals prefer to be at home (Cohn and Rotton, 2003). It is commonly assumed that crime rates, particularly for petty crimes, increase on certain holidays such as New Year's Eve, owing to factors such as increased alcohol consumption and a higher likelihood of aggressive behavior. However, empirical research on holiday crime patterns remains limited, with many existing studies being outdated or inconclusive.

Given the limited and conflicting evidence, this study aims to forecast the occurrence of petty crimes during major holidays in the United States. By leveraging historical crime data and predictive modeling techniques, this project seeks to provide insights that can assist law enforcement agencies in resource allocation and crime prevention strategies during holiday periods. 

Note: In this collaborative petty-crime forecasting project, I developed SARIMA and XGBoost models to forecast crime trends at the national level and for Los Angeles. I also performed anomaly detection using the Z-score method. Boston-specific forecasting and CUSUM-based anomaly detection were conducted by other team members.
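The Z-score method mentioned in the note can be sketched in a few lines. The series below is synthetic (a weekly cycle plus noise, with an injected New Year's Eve spike), not the project's crime data, and the 3-standard-deviation threshold is one conventional choice.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic daily petty-crime counts: weekly cycle + noise + a Dec 31 spike
days = 365
t = np.arange(days)
counts = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, days)
counts[364] += 60  # injected holiday anomaly (New Year's Eve)

# Z-score anomaly detection: flag days more than 3 SDs from the mean
z = (counts - counts.mean()) / counts.std()
anomalies = np.where(np.abs(z) > 3)[0]
print("anomalous days:", anomalies)
```

The injected holiday spike is flagged because it sits far outside the typical day-to-day variation; in practice one would compute Z-scores on residuals from a seasonal model (such as the SARIMA fits) so that ordinary weekly swings are not mistaken for anomalies.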

Access Full Paper Here
GitHub Code