Initial Data Discovery and Exploration
We will determine what data is available and perform a brief exploratory data analysis to learn more about blank values, correlations, and other notable features. Because the data are spread across several .data files, we will need to combine them, identify the column names, and convert each column to the appropriate data type.
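One way the combine-and-convert step might look is sketched below with pandas. The file layout is an assumption: the sketch treats each .data file as comma-separated with no header row and `?` as the missing-value marker, which may not match the real files. The helper names `load_data_files` and `coerce_types`, the `source_file` column, and the column lists passed in are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_data_files(paths, column_names, missing_markers=("?",)):
    """Read several files that share one layout and stack them into one DataFrame."""
    frames = []
    for path in paths:
        frame = pd.read_csv(
            path,
            header=None,                      # assumption: no header row in the files
            names=column_names,               # hypothetical names supplied by the caller
            na_values=list(missing_markers),  # assumption: "?" marks a blank value
        )
        frame["source_file"] = Path(path).name  # record which file each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def coerce_types(df, numeric_columns, categorical_columns):
    """Return a copy with the listed columns cast to numeric or category dtype."""
    out = df.copy()
    for column in numeric_columns:
        # Entries that cannot be parsed as numbers become NaN.
        out[column] = pd.to_numeric(out[column], errors="coerce")
    for column in categorical_columns:
        out[column] = out[column].astype("category")
    return out
```

Keeping the originating file name in its own column makes it possible to check later whether blank values are concentrated in one source.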
Visualise the Data
We hope to better understand the variables in the dataset by creating pair plots between them. Since there are 75 variables, we want to see whether any of them are related and whether we need to narrow down the set we use.
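A pair plot of all 75 variables would be hard to read, so one possible approach is to rank column pairs by correlation first and plot only a handful. The sketch below is an assumption about how that could be done, not a settled method: `strongest_pairs` and `plot_pairs` are hypothetical helpers, Pearson correlation is one choice among several, and the plotting helper relies on seaborn being installed.

```python
import pandas as pd


def strongest_pairs(df, top_n=10):
    """Rank pairs of numeric columns by absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:  # visit each unordered pair once
            pairs.append((first, second, corr.loc[first, second]))
    ranked = pd.DataFrame(pairs, columns=["first", "second", "abs_corr"])
    ranked = ranked.sort_values("abs_corr", ascending=False)
    return ranked.head(top_n).reset_index(drop=True)


def plot_pairs(df, columns):
    """Draw a pair plot for a small, caller-chosen subset of columns."""
    import seaborn as sns  # imported here so the ranking helper works without seaborn

    return sns.pairplot(df[columns].dropna())
```

The columns that appear in the top-ranked pairs are natural candidates to pass to `plot_pairs`, and highly correlated pairs also hint at which variables might be redundant.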
Data Cleaning
Remove any blank values, augment or translate values with known information, and prepare the data for use by machine learning algorithms. Remove the `name` column, which has recently been filled with dummy values, along with any columns that have been marked “not used”. We plan to scale continuous variables by standardising them. The categorical variables have no clear ordinal relationship between their values, so we plan to use one-hot encoding for them.
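The cleaning plan above could be sketched with pandas and scikit-learn as follows. This is a minimal illustration under assumptions: the helper names `clean` and `build_preprocessor` are hypothetical, the lists of unused, continuous, and categorical columns have to be supplied by the caller, and rows with blanks are simply dropped rather than imputed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def clean(df, unused_columns):
    """Drop `name`, the caller-supplied unused columns, and rows with blanks."""
    to_drop = ["name", *unused_columns]
    trimmed = df.drop(columns=to_drop, errors="ignore")  # skip columns not present
    return trimmed.dropna().reset_index(drop=True)


def build_preprocessor(continuous_columns, categorical_columns):
    """Standardise continuous columns and one-hot encode categorical ones."""
    return ColumnTransformer(
        transformers=[
            ("scale", StandardScaler(), continuous_columns),
            ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        ],
        sparse_threshold=0,  # always return a dense array
    )
```

Fitting the preprocessor on the training split only, then applying it to the test split, keeps the scaling statistics from leaking information between the two.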
Initial Machine Learning Exploration
Test multiple machine learning algorithms to see which ones perform best.
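A simple way to compare several algorithms is to score each with the same cross-validation splits. The sketch below assumes a classification task scored by accuracy, which the plan does not state. The three candidate models are illustrative choices only, `compare_models` is a hypothetical helper, and the data are synthetic so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def compare_models(models, X, y, cv=5):
    """Return {model name: mean cross-validated accuracy}, best first."""
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Illustrative candidates; the plan does not name specific algorithms.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "k_nearest_neighbours": KNeighborsClassifier(),
}

# Synthetic stand-in for the cleaned feature matrix and target.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
results = compare_models(candidates, X, y)
```

The returned dictionary is ordered from highest to lowest mean score, so the first few entries are the models worth carrying into the tuning stage.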
Fine Tuning
Tune model hyperparameters or combine multiple models to build the best solution, using grid search and/or random search.
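Grid search and random search are both available in scikit-learn as `GridSearchCV` and `RandomizedSearchCV`. The sketch below shows the two side by side. The random forest model, the parameter values, and the synthetic data are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the cleaned feature matrix and target.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Grid search: evaluates every combination in the grid (2 x 2 = 4 here).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

# Random search: samples a fixed number of combinations (n_iter) from the space.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [25, 50, 100], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)
```

After fitting, `best_params_` and `best_score_` on either object give the winning settings and their mean cross-validated score. Random search tends to be the cheaper option when the parameter space is large, since its cost is fixed by `n_iter` rather than by the size of the grid.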
Analyze Results & Present Findings
Draw conclusions and summarize results (we will be presenting our work in a Jupyter notebook and via a short video).