Insights With Patalee

This study aims to predict whether a country engages in international trade using several classification models. The predictor variables include common language, legal system similarity, shared religion, border proximity, trading distance, free trade agreements, colonial ties, currency union, World Trade Organization (WTO) membership, island status, and landlocked status. Random forest (RF) and boosting are the primary models, supplemented by linear discriminant analysis (LDA), Naïve Bayes, logistic regression, and a generalized additive model (GAM). RF achieves the lowest testing error, followed by boosting. Variable importance analysis highlights the same three predictors in both models, though in a different order: shared religion, WTO membership, and distance for RF, and distance, shared religion, and WTO membership for boosting. Cross-validation further confirms RF's superior performance, with lower testing errors than the other models. Boosting has slightly higher testing error rates but still performs well, which suggests potential for further improvement through fine-tuning. In conclusion, RF proves to be the most effective model for predicting international trade in this study, offering high accuracy and generalization. Overall, the study provides insight into the application of classification models to predicting international trade patterns.

Disclaimer: The analysis presented in this document is based on publicly available datasets, which have been previously published and are cited accordingly. The research aims to apply different analytical techniques to address distinct research questions. This work is intended for demonstration and discussion and is not part of a formal publication.

Please Click Here for the Full Paper

GitHub Repository URL

Estimating Mean and Variance Functions of a Random Variable Dependent on Two Independent Variables

Abstract:

The mean and variance of a random variable play pivotal roles in statistical analysis and predictive modeling, offering insights into central tendency and data dispersion. In this study, we aim to develop predictive models for estimating the mean and variance of a random variable Y that depends on two independent variables, X1 and X2. Using 200 realizations of Y at each of various (X1, X2) pairs, we apply data mining and machine learning techniques to predict the mean and variance as functions of X1 and X2. The methodology employs a range of nonlinear predictive models, including Generalized Additive Models (GAM), Support Vector Machine (SVM), Local Smoothing (LOESS), K-Nearest Neighbors (KNN), Neural Networks, Random Forest (RF), and boosting. Model parameters are tuned via cross-validation and grid search, with KNN emerging as the top-performing model. Validation results demonstrate KNN's superior accuracy in predicting both mean and variance, as evidenced by lower RMSE, MAE, and RSS values than the other models. In conclusion, the KNN model offers the most accurate estimates of the mean and variance among the models considered, providing valuable insights for decision-making across various domains.
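To make the idea concrete, here is a minimal Python sketch of one way a KNN estimate of the mean and variance functions could look. It is an illustration under stated assumptions, not the code from the full paper: the function names `summarize` and `knn_predict`, the toy data, and the choice k=3 are all hypothetical, and the cross-validated tuning of k mentioned in the abstract is not shown.

```python
import math
from statistics import mean, variance

def summarize(replicates):
    """Map each (x1, x2) point to the sample mean and sample variance of its Y draws."""
    return {pt: (mean(ys), variance(ys)) for pt, ys in replicates.items()}

def knn_predict(summary, query, k=3):
    """Average the per-point mean and variance over the k points nearest to `query`."""
    # Euclidean distance on raw inputs; in practice X1 and X2 may need rescaling first.
    nearest = sorted(summary, key=lambda pt: math.dist(pt, query))[:k]
    mu_hat = mean(summary[pt][0] for pt in nearest)
    var_hat = mean(summary[pt][1] for pt in nearest)
    return mu_hat, var_hat

# Toy data (made up for illustration): two Y realizations per (x1, x2) pair.
replicates = {
    (0.0, 0.0): [1.0, 3.0],
    (1.0, 0.0): [2.0, 6.0],
    (0.0, 1.0): [4.0, 4.0],
    (5.0, 5.0): [10.0, 30.0],
}

summary = summarize(replicates)
mu_hat, var_hat = knn_predict(summary, (0.2, 0.2), k=3)
print(mu_hat, var_hat)
```

In this sketch the query point (0.2, 0.2) is estimated from its three closest design points only, so the distant point (5.0, 5.0) does not affect the result. A larger k smooths the estimates more heavily, which is why k is normally chosen by validation error rather than fixed in advance.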

Please Click Here for the Full Paper

GitHub Repository URL

Forecasting Petty Crimes During Major Holidays in the United States

Abstract:

Holidays present unique challenges for police departments in preparing duty rosters and personnel assignments. Law enforcement agencies must decide on the appropriate number of officers to have on duty at times when individuals prefer to be at home (Cohn and Rotton, 2003). It is commonly assumed that crime rates, particularly for petty crimes, increase on certain holidays, such as New Year’s Eve, due to factors such as increased alcohol consumption and a higher probability of aggressive behavior. However, empirical research on holiday crime patterns remains limited, with many existing studies being outdated or inconclusive.

Early research provides mixed findings on crime trends during holidays. Lester (1979) found that homicides in the U.S. occurred disproportionately on major holidays. Similarly, Templer, Brooner, and Corgiat (1983) reported an increase in police service calls on national and local holidays. Conversely, other studies, such as Rotton and Frey (1985), failed to establish a consistent pattern indicating higher crime rates on holidays.

Several sources have sought to raise awareness of crime spoiling the holiday spirit. Lloyd (2022) shared stories of victims who suffered car and workplace break-ins, as well as the case of a volunteer organization that was robbed while attempting to bring the Christmas spirit to the community by decorating houses for free. Kaufman (knklawfirm.com) cited desperation and opportunity as reasons for increased levels of crime, while Justice (2015/17) and nnw.org provided tips to lower risk and resources for victims.

Given the limited and conflicting evidence, this study aims to forecast the occurrence of petty crimes during major holidays in the United States. By leveraging historical crime data and predictive modeling techniques, this project seeks to provide insights that can assist law enforcement agencies in resource allocation and crime prevention strategies during holiday periods.

Please Click Here for the Full Paper

GitHub Repository URL

Chasing Votes: A Markov Chain Monte Carlo Forecast for the 2024 US Presidential Election

Abstract:

Markov chain Monte Carlo (MCMC) methods are essential tools for solving many modern-day statistical and computational problems (Calderhead, 2014). The name “Monte Carlo” comes from the casino at Monte Carlo, but it soon became a technical term for the simulation of random processes (Geyer, 2011). A Markov chain is a type of stochastic process that consists of random variables transitioning between states according to specific probabilistic rules. The process relies on the Markov property, which states that the probability distribution of the system's future state depends only on its current state, not on the states that preceded it. This makes Markov chains a powerful tool for modeling sequences in which each step depends only on the most recent one (Lateef, 2019). This paper explains MCMC in detail and provides an exploratory analysis that uses it to simulate the 2024 US Presidential Election.
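The Markov property described above can be illustrated with a short Python sketch. This is a generic two-state chain, not the election model from the paper: the state labels "A" and "B", the transition probabilities, and the function name `simulate_chain` are all assumptions chosen for illustration.

```python
import random

# Hypothetical transition probabilities: each row gives P(next state | current state).
TRANSITIONS = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

def simulate_chain(transitions, start, n_steps, rng):
    """Return a path in which each next state is drawn using only the current state."""
    path = [start]
    for _ in range(n_steps):
        options = transitions[path[-1]]  # Markov property: only the last state matters
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(nxt)
    return path

rng = random.Random(0)  # fixed seed so the run is reproducible
path = simulate_chain(TRANSITIONS, "A", 20000, rng)

# Long-run share of time spent in state "A".
share_a = path.count("A") / len(path)
print(round(share_a, 3))
```

For these assumed probabilities the stationary distribution puts 5/6 of its mass on "A" (it solves 0.1 · π_A = 0.5 · π_B with π_A + π_B = 1), so the simulated share should land close to 0.83 for a long enough run. MCMC methods build on the same mechanism, constructing a chain whose long-run behavior matches a target distribution of interest.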

Please Click Here for the Full Paper

GitHub Repository URL

Decoding Cardiovascular Disease

Abstract:

This project aims to build prediction models for assessing the presence of cardiovascular disease (CVD) in patients and to analyze factors that can predict the presence of cardiovascular disease. The goal is to derive insights from an existing dataset of 70,000 patient records to identify which features may have the most significant influence on cardiovascular disease and identify potential avenues for further research. Machine learning algorithms, including Linear Discriminant Analysis, Naive Bayes, Logistic Regression, and Random Forests, were used to build predictive models of CVD. We found that systolic blood pressure was most significantly associated with the presence of CVD, followed by age (CVD was more likely to be present in older populations). Factors such as cholesterol, glucose levels, weight, and height had moderate to low significance. By leveraging these results and expanding on them, healthcare professionals can enhance preventive strategies for individuals at higher risk of developing CVD, ultimately leading to improved patient outcomes and reduced healthcare burdens.

Note on Authorship and Contributions:

This project was completed as part of a group assignment for ISYE 7406 [Data Mining and Statistical Learning] at Georgia Institute of Technology. The full report (linked below) represents the joint effort of all team members: Buddhika Patalee, Meenakshi Janardhanan, Griffin Fess, Isacar Racine, and Roxana Shahryari. I was responsible for conducting all modeling and analysis (e.g., classification modeling, validation, and interpretation), while the exploratory data analysis (EDA) section was completed by other team members. The GitHub repository below contains only the code I personally developed for this project.

Please Click Here for the Full Paper

GitHub Repository URL