Source Code: GitHub
Competition Statement
The goal of this competition is to predict if a person has any of three medical conditions. You are being asked to predict if the person has one or more of any of the three medical conditions (Class 1), or none of the three medical conditions (Class 0). You will create a model trained on measurements of health characteristics.
To determine if someone has these medical conditions requires a long and intrusive process to collect information from patients. With predictive models, we can shorten this process and keep patient details private by collecting key characteristics relative to the conditions, then encoding these characteristics.
Your work will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.
Section 1: Final Approach
Project General Approach
The first step in analyzing this data was some preprocessing. The following preprocessing steps were used; these techniques, or similar alternatives, are standard practice when preparing data for machine learning models (a short code sketch follows the list).
- Check the data for missing values and impute the values using a random forest imputer
- Understand how much of the data is categorical or continuous
- Normalize the data
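A minimal sketch of these steps, assuming the training table is loaded as a pandas DataFrame from a file named train.csv with the target in a Class column; the file name, column name, and the specific choice of IterativeImputer with a random forest estimator are illustrative assumptions, not a copy of my original notebook.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")
y = df["Class"]                      # target: 1 = has a condition, 0 = none
X = df.drop(columns=["Class"])

# Separate categorical from continuous columns before imputing/normalizing.
categorical_cols = X.select_dtypes(include="object").columns
continuous_cols = X.select_dtypes(exclude="object").columns

# Impute missing continuous values with a random-forest-based imputer.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    random_state=42,
)
X[continuous_cols] = imputer.fit_transform(X[continuous_cols])

# Normalize the continuous features.
X[continuous_cols] = StandardScaler().fit_transform(X[continuous_cols])
```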
The next step after preprocessing was to decide whether I wanted to embed the normalized data in any way. One appealing property of embedding the data is that it can sometimes be more robust to the covariate shift that arises when the test data differ from the training data.
I didn’t end up using any embeddings for my features, but after the competition closed, I saw that the second-place winner chose to use UMAP embeddings. To this, I point to the No Free Lunch Theorem for machine learning, which in short states that no single embedding or model will always produce the best results. It always depends, which is why it’s wise to create multiple models and continually monitor their performance. Anyway, back to the competition workflow.
After deciding that I would not embed my data, I had to consider which machine learning model to use. At the time of this competition, I had read a lot about gradient boosting and the great successes it has had in Kaggle competitions, so I thought: let’s try it out with XGBoost!
I began by creating a function that performed K-fold cross-validation for my XGBoost model. Since my dataset was relatively small (~600 data points), I explored the hyperparameter space using a Bayesian optimization Python package. The code below sets up the Bayesian optimization. The pbounds dictionary holds the parameter bounds: each key names a hyperparameter under investigation, and its tuple gives the range the Bayesian optimizer is allowed to explore.
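Here is a minimal sketch of that setup using the bayes_opt (bayesian-optimization) package, assuming a fully numeric, preprocessed feature matrix X_train with labels y_train. Aside from the 200 to 1000 n_estimators range mentioned in Section 2, the bounds, fold count, and scoring metric shown here are illustrative assumptions.

```python
import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.model_selection import StratifiedKFold, cross_val_score

def xgb_cv(n_estimators, learning_rate, max_depth, subsample):
    """Objective: mean cross-validated score for one hyperparameter setting."""
    model = xgb.XGBClassifier(
        n_estimators=int(n_estimators),
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        subsample=subsample,
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # BayesianOptimization maximizes, so use a "higher is better" score.
    return cross_val_score(model, X_train, y_train, cv=cv,
                           scoring="neg_log_loss").mean()

# Keys are the hyperparameters under investigation; tuples are the ranges
# the Bayesian optimizer is allowed to explore.
pbounds = {
    "n_estimators": (200, 1000),
    "learning_rate": (0.01, 0.3),
    "max_depth": (3, 10),
    "subsample": (0.5, 1.0),
}

optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, random_state=1)
optimizer.maximize(init_points=5, n_iter=25)
```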
The Bayesian optimizer then runs its optimization and prints the performance of each model it evaluates to the Jupyter notebook output.
After the Bayesian optimizer had run for a set number of iterations, I took the best parameters from it and used them to train a final model. Once the model was trained, I created a confusion matrix to inspect its predictions.
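A sketch of that final step, continuing from the optimizer above; X_valid and y_valid are assumed names for a held-back validation split, not variables from the original notebook.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Best hyperparameters found by the Bayesian optimizer.
best = optimizer.max["params"]
final_model = xgb.XGBClassifier(
    n_estimators=int(best["n_estimators"]),
    learning_rate=best["learning_rate"],
    max_depth=int(best["max_depth"]),
    subsample=best["subsample"],
)
final_model.fit(X_train, y_train)

# Confusion matrix on the validation split.
preds = final_model.predict(X_valid)
cm = confusion_matrix(y_valid, preds)
ConfusionMatrixDisplay(cm).plot()
```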
Final Competition Outcome
At the final competition evaluation, the hosts re-evaluated everyone’s model on a hold-out test set that was not included in the submission scores; the hold-out set consisted of 400 test data points. When the final submission went through, I placed 4045/6431. I wouldn’t say I’m proud of this placement, but it did help me understand why my models weren’t performing as well as expected. The main conclusion I drew was that I was overfitting my training data. Gradient boosting algorithms can do this easily, especially as the number of boosting rounds increases. Section 2 briefly explores alternative approaches to increase performance, although the Kaggle environment became problematic for notebook submissions to this competition.
Section 2: Alternative Approaches
After the competition closed, I decided to explore two different avenues.
- Different model algorithms
- Different model embeddings
Model Algorithms
Sadly, when exploring this avenue, I ran into multiple problems with Kaggle’s environment, since the competition submission had to be a Jupyter Notebook. I incorporated the following algorithms, but the notebook would not submit successfully to the competition (a sketch of the comparison I had in mind follows the list).
- Random Forest Classifier
- Support Vector Machine Classifier
- MLPClassifier
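This is roughly what that comparison looks like, scored with the same kind of K-fold split as the XGBoost model. The hyperparameters shown are defaults or illustrative choices, not tuned values, and X_train/y_train are the preprocessed features and labels from earlier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
    "svm": SVC(probability=True, random_state=42),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="neg_log_loss")
    print(f"{name}: {scores.mean():.4f}")
```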
I pinned the issue on the imports of these packages from scikit-learn, despite the notebook running when I saved it. So, sadly, I reverted to the original XGBoost model, but this time I changed the hyperparameter windows: n_estimators (originally set to a range of 200 to 1000), the learning rate, and the max depth were all given smaller ranges.
Submitting the new optimizations with smaller ranges for some parameters did not improve performance. So I decided to explore more regularization parameters and added reg_lambda and reg_alpha to the hyperparameter optimization, which control L2 and L1 regularization respectively. From a quick exploration of these parameters with Bayesian optimization, performance did not increase. Again, I think this points toward a different model being the wiser choice.
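An illustrative version of that narrowed search space with the two regularization terms added, reusing the imports and data from the earlier sketch; the exact bounds here are assumptions for the example, not the values from my notebook.

```python
def xgb_cv_reg(n_estimators, learning_rate, max_depth, subsample,
               reg_lambda, reg_alpha):
    model = xgb.XGBClassifier(
        n_estimators=int(n_estimators),
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        subsample=subsample,
        reg_lambda=reg_lambda,   # L2 regularization
        reg_alpha=reg_alpha,     # L1 regularization
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=cv,
                           scoring="neg_log_loss").mean()

pbounds = {
    "n_estimators": (100, 400),    # narrowed from (200, 1000)
    "learning_rate": (0.01, 0.1),  # smaller learning rates
    "max_depth": (2, 6),           # shallower trees
    "subsample": (0.5, 1.0),
    "reg_lambda": (0.0, 10.0),
    "reg_alpha": (0.0, 10.0),
}

optimizer = BayesianOptimization(f=xgb_cv_reg, pbounds=pbounds, random_state=1)
optimizer.maximize(init_points=5, n_iter=25)
```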
Model Embeddings
I wanted to pursue the following embeddings to see how they would perform, but it became obvious that certain package imports were problematic in Kaggle’s environment (a short sketch is shown after the list).
- TruncatedSVD
- PCA
- UMAP
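A quick sketch of what applying each embedding to the normalized feature matrix would look like; the number of components is an illustrative choice, and X_train is the preprocessed feature matrix from earlier.

```python
from sklearn.decomposition import PCA, TruncatedSVD
import umap  # from the umap-learn package

n_components = 10  # illustrative choice

X_svd = TruncatedSVD(n_components=n_components, random_state=42).fit_transform(X_train)
X_pca = PCA(n_components=n_components, random_state=42).fit_transform(X_train)
X_umap = umap.UMAP(n_components=n_components, random_state=42).fit_transform(X_train)
```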
Conclusion
My model workflow would’ve benefitted from exploring different models, embedding types, and levels of regularization to see if I could produce a better model. Ultimately, this competition gave me a glimpse of what it may be like to work with patient data.