Heart Disease Machine Learning Analysis

Skills

Scikit Learn, Pandas, Matplotlib, Seaborn

Timeline

3 weeks

Team

Zoe Teoh, Matthew Kern

Overview

As the leading cause of death in the United States, heart disease claims over 23% of American lives. Because it is a manageable condition (through lifestyle changes and sometimes medication), we hope that identifying people with heart disease will help them live longer, healthier lives. Specifically, a model that accurately determines whether a person has heart disease could help them discover the issue and get the care they need.

Steps

Initial Data Discovery and Exploration
We will determine what data is available and perform a brief exploratory data analysis to learn more about blank values, correlations, and other interesting factors. Since our data are spread across several .data files, we will need to combine them, work out the column names, and convert each column to the appropriate data type.
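
As a rough sketch of the combining step, the snippet below reads several header-less files into one pandas DataFrame. The file layout (comma-separated, `?` for blanks), the file names, the column names, and the `load_data_files` helper are all assumptions made for illustration; the real .data files may need different parsing.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data_files(paths, column_names, missing_marker="?"):
    """Read several header-less, comma-separated files into one DataFrame.

    Assumes every file has the same columns in the same order (hypothetical).
    """
    frames = []
    for path in paths:
        frame = pd.read_csv(
            path,
            header=None,               # the files carry no header row
            names=column_names,        # so we supply the names ourselves
            na_values=missing_marker,  # treat the marker as a blank value
        )
        frame["source"] = Path(path).stem  # remember which file a row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demo with two made-up files standing in for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "site_a.data"
    second = Path(tmp) / "site_b.data"
    first.write_text("63,1,233\n67,1,?\n")
    second.write_text("41,0,204\n")
    demo = load_data_files([first, second], ["age", "sex", "chol"])

print(demo.dtypes)        # inferred data type per column
print(demo.isna().sum())  # blank values per column
```

Keeping a `source` column is optional, but it makes it easy to check later whether one file contributes most of the blank values.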

Visualise the Data
We hope to better understand the variables in the dataset by creating pair plots between them. Since there are 75 variables, we want to see whether any of them are related and to determine whether we need to narrow down the variables we use.
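
With this many variables, a numeric screen can help decide which columns are worth plotting together. The sketch below lists strongly correlated column pairs; the `highly_correlated_pairs` helper, the 0.9 threshold, and the demo columns are illustrative assumptions, not part of the project data.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic demo: age_months is deliberately redundant with age.
rng = np.random.default_rng(0)
age = rng.normal(55, 9, size=200)
demo = pd.DataFrame({
    "age": age,
    "age_months": age * 12,
    "chol": rng.normal(240, 50, size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

A handful of columns selected this way could then be passed to `seaborn.pairplot` for a visual check, which is far more readable than a grid over all 75 variables.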

Data Cleaning
Remove any blank values, augment or translate values with known information, and prepare the data for use by machine learning algorithms. Remove the `name` column, which has recently been filled with dummy values, as well as columns that have been marked “not used”. For continuous variables, we plan to scale them by standardising the data. For categorical variables, there is no clear ordinal relationship between the values, so we plan to use one-hot encoding.
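
The cleaning plan above could look roughly like the following with scikit-learn's `ColumnTransformer`. The tiny DataFrame and the split into continuous (`age`, `chol`) and categorical (`cp`) columns are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for the real data.
raw = pd.DataFrame({
    "name": ["name"] * 5,             # dummy values, carries no information
    "age": [63, 67, 41, 56, None],    # one blank value
    "chol": [233, 286, 204, 236, 250],
    "cp": [1, 4, 2, 2, 3],            # assumed to be a categorical code
})

continuous = ["age", "chol"]
categorical = ["cp"]

# Drop the dummy column, then drop rows that still contain blanks.
cleaned = raw.drop(columns=["name"]).dropna()

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), continuous),  # zero mean, unit variance
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(cleaned)
print(features.shape)
```

In practice the transformer would be fitted on the training folds only, for example by placing it in a `Pipeline` with the classifier, so that scaling statistics do not leak from the held-out data.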

Initial Machine Learning Exploration
Test multiple machine learning algorithms to see which ones perform best.
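
One way to run this comparison is to score each candidate with the same cross-validation split. The three classifiers and the synthetic data below are placeholders chosen for illustration, not the final shortlist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbours": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```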

Fine Tuning
Tune hyperparameters or combine multiple models to build the best solution, using grid search and/or random search.
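
A minimal grid-search sketch with scikit-learn's `GridSearchCV` follows. The random forest, the parameter grid, and the synthetic data are assumptions; the actual model and grid would depend on the earlier exploration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared features and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in this grid is tried: 2 x 2 = 4 settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation per setting
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of settings (`n_iter`) instead of trying them all, which is the random-search variant mentioned above.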

Analyze Results & Present Findings
Draw conclusions and summarize results (we will be presenting our work in a Jupyter notebook and via a short video).

Machine Learning Strategy

The problem we plan to solve is predicting whether a person has heart disease from variables such as age, gender, cholesterol level, and smoking status. Since the outcome is whether or not a person has heart disease, this is a binary classification problem.

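We frame this in terms of learning from experience: a program learns if its performance P on a task T improves when its experience E grows by ΔE:

P(T, E+ΔE) > P(T, E)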

In our case, the Experience would be the data that we train our machine learning model with; the Task would be classifying whether a person has heart disease or not; and the Performance measure would be the classification accuracy, evaluated with a confusion matrix so that we can distinguish between type I and type II errors.
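
To make the link between the confusion matrix and the two error types concrete, here is a small sketch on made-up labels. It assumes that 1 means "has heart disease" and that the null hypothesis is "no disease", so a type I error is a false positive and a type II error is a false negative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = has heart disease, 0 = does not (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"type I errors (false positives): {fp}")
print(f"type II errors (false negatives): {fn}")
print(f"accuracy: {accuracy:.2f}")
```

In a screening setting the two error types carry different costs: a false negative is a missed case of heart disease, while a false positive leads to follow-up tests for a healthy person.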

Motivation

Tasked with using machine learning for a practical application, we were most drawn to options that had the potential to help many people. This project stood out not only because of the availability of a rich dataset, but also because the impact of such research really could improve lives. Moreover, as heart disease is so prevalent, the number of people we could help is huge. Thus, although we considered many project options, this one seemed both feasible and impactful.

Evaluation of Success

As we have one large dataset, we will use cross-validation to measure the accuracy of predicting which patients have heart disease. Since this is a classification problem, we think a confusion matrix is the most appropriate way to evaluate success. We hope that our properly tuned model will have a higher accuracy score (and fewer misclassifications in the confusion matrix) than the models we initially test.
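
One possible way to get a single confusion matrix out of cross-validation is to collect the out-of-fold predictions with `cross_val_predict`. The logistic regression model and the synthetic data here are stand-ins, not the project's actual model or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the prepared heart-disease features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# Each sample is predicted by a model that never saw it during training.
predicted = cross_val_predict(model, X, y, cv=5)

matrix = confusion_matrix(y, predicted, labels=[0, 1])
accuracy = accuracy_score(y, predicted)

print(matrix)
print(f"cross-validated accuracy: {accuracy:.3f}")
```

Running the same steps for an initial model and for the tuned one gives two matrices that can be compared cell by cell, alongside the two accuracy scores.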