Big Data in Health Care

Big data in health care is transforming the way we understand and deliver medical services. By bringing together massive volumes of patient records, clinical notes, medical images, genomic data, and information from wearables, health systems can uncover patterns that improve decision-making and patient outcomes. Advanced analytics and AI make it possible to detect disease trends, predict outbreaks, and personalize treatments based on individual risk factors. At the same time, big data helps hospitals and providers improve efficiency, reduce costs, and enhance patient safety by anticipating complications before they occur. Beyond individual care, policymakers and researchers also rely on health data to design effective public health programs, evaluate interventions, and address disparities, making big data a cornerstone of modern, evidence-based health care.

Frameworks such as Hadoop (with its MapReduce programming model) and Apache Spark are widely used to store, process, and analyze large health datasets across clusters of machines, which makes it practical to work with data too large or too complex for a single machine.

This project implements a full pipeline for mortality prediction using clinical data from the MIMIC database. It covers data preprocessing, feature engineering, and predictive modeling with Logistic Regression, SVM, and Decision Trees, along with cross-validation and custom model development. The code demonstrates how big data techniques can be applied to healthcare analytics to generate insights from complex patient event data.

GitHub Repository URL

Predicting Patient Mortality Using ICU Event Data

Tools: PySpark, Python, NumPy, SciPy, scikit-learn

This project applies big-data processing and machine learning to ICU clinical event data to predict 30-day patient mortality after discharge. Using PySpark on Google Colab, I implemented an end-to-end data-analytics pipeline that included:

Data Engineering: Extracted, cleaned, and aggregated millions of patient events (diagnoses, drugs, lab results) into feature vectors using Spark transformations.
Feature Construction: Defined 2000-day observation and 30-day prediction windows; normalized and converted data into SVMLight format for ML training.
Modeling: Implemented stochastic gradient-descent logistic regression from scratch to classify patient outcomes and evaluated model performance using ROC curves and AUC.
Outcome: Demonstrated scalable, interpretable predictive modeling on healthcare big data to support early risk detection and hospital resource planning.
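
As a rough illustration of the modeling step above, the sketch below trains a logistic regression with stochastic gradient descent on sparse feature vectors and scores it with AUC. It is not the project's code: the names train_sgd, predict_proba, and auc, the eta and l2 parameters, and the toy data are all assumptions made for this example, and it runs in plain Python rather than on Spark.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, features):
    # features is a sparse vector: {feature_id: value}, as in one SVMLight row.
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return sigmoid(z)


def train_sgd(samples, eta=0.1, l2=0.0, epochs=20, seed=0):
    # samples: list of (features, label) pairs with label in {0, 1}.
    # eta (learning rate) and l2 (regularization strength) are hypothetical names.
    rng = random.Random(seed)
    weights = {}
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)
        for features, label in order:
            error = label - predict_proba(weights, features)
            # Update only the weights of features present in this sample.
            for f, v in features.items():
                w = weights.get(f, 0.0)
                weights[f] = w + eta * (error * v - l2 * w)
    return weights


def auc(labels, scores):
    # Probability that a random positive scores above a random negative;
    # ties count as half. Assumes both classes are present.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): feature 1 appears only for deceased patients,
# feature 2 only for survivors.
toy = [({1: 1.0}, 1), ({1: 0.8}, 1), ({2: 1.0}, 0), ({2: 0.7}, 0)]
model = train_sgd(toy)
toy_scores = [predict_proba(model, feats) for feats, _ in toy]
toy_auc = auc([y for _, y in toy], toy_scores)
```

The pairwise AUC here is quadratic in the number of samples, which is fine for a sketch; on real data a library routine such as scikit-learn's roc_auc_score would be the usual choice.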

GitHub Repository URL

Patient Phenotyping and Clustering Using PySpark

Tools: Python, PySpark (RDD API, MLlib), scikit-learn

This project implements both rule-based and unsupervised phenotyping methods to identify Type 2 Diabetes (T2DM) cases and controls from large-scale EHR data.

Rule-Based Phenotyping: Implemented a simplified PheKB algorithm in PySpark to classify patients as case, control, or unknown using diagnostic codes, lab results, and medication records.
Feature Construction: Built distributed ETL pipelines to aggregate diagnosis counts, medication counts, and average lab values into patient-level feature vectors.
Unsupervised Learning: Applied K-Means and Gaussian Mixture Models (GMM) to cluster patients and discover latent phenotypes based on clinical similarities.
Evaluation: Computed cluster purity and compared clustering results against rule-based ground truth to assess consistency and separability of phenotypes.
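
To make the evaluation step concrete, here is a minimal sketch of cluster purity: each cluster is credited with its most common ground-truth label, and purity is the fraction of patients covered by those majority labels. The function name purity and the toy inputs are assumptions for illustration, and the code is plain Python rather than the Spark implementation used in the project.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    # Purity = (1/N) * sum over clusters of the count of the
    # most frequent true label inside that cluster.
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one assignment is required")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example (made up): six patients in three clusters, labeled by the
# rule-based phenotype. Majority counts are 2 + 2 + 1, so purity is 5/6.
clusters = [0, 0, 0, 1, 1, 2]
phenotypes = ["case", "case", "control", "control", "control", "unknown"]
example_purity = purity(clusters, phenotypes)
```

Purity rises toward 1.0 as the number of clusters grows, so it is most informative when K-Means and GMM are compared at the same number of clusters.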

This project demonstrates the integration of big-data ETL, clustering, and evaluation techniques in Spark to extract clinically meaningful insights from EHR data.

GitHub Repository URL