Thanks to Jason, our community manager and my friend, I encountered DataTribe Collective unexpectedly one day. I had no expectations and was essentially unaware of what this community could offer. When Eevamaija mentioned the Datathon challenge hosted by Stanford University's Women in Data Science (WiDS) Foundation on Kaggle, I was uncertain of what to anticipate. Nevertheless, I chose to participate.
I already knew basic data analysis practices thanks to my master’s thesis, in which I studied the impact of a legislative change on university patenting in Turkey using logistic regression analysis. But machine learning? Me? I was intimidated. I felt like I had said “yes” to something I didn’t know whether I could manage, which is not like me at all.
I’m a person who loves formal education practices, like taking courses and deep diving into the background and foundations of concepts.
Ever since I came to Finland and started spending time in the startup scene, I have realised that my way feels a bit long, and hence time-consuming. So I decided to find a middle ground for this challenge: I was going to learn the basics, and I would still look at some of the concepts, but in a more generalised way. After all, I could learn the models with proper background knowledge after the challenge.
The challenge and first taste of machine learning
During the challenge period, WiDS announced some workshops to help people learn the concepts and implement them in the challenge. I only had time for one workshop, but I felt it gave a good basis for building the model. It was the first time I had heard about concepts like data preprocessing, label encoding, and one-hot encoding. It turns out I already had an idea of these things; for example, label and one-hot encoding are basically dummy variables. After this workshop, I decided to run the same code on my own computer to understand the process, and I googled almost everything I was not familiar with, which helped a lot.
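For anyone curious, here is a minimal sketch of what those two encodings look like in pandas and scikit-learn. The toy `insurance` column is invented purely for illustration and is not from the WiDS dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column, invented for illustration only
df = pd.DataFrame({"insurance": ["private", "medicaid", "private", "medicare"]})

# Label encoding: each category becomes a single integer (0, 1, 2, ...)
df["insurance_label"] = LabelEncoder().fit_transform(df["insurance"])

# One-hot encoding: one 0/1 column per category, i.e. classic dummy variables
dummies = pd.get_dummies(df["insurance"], prefix="insurance")

print(df)
print(dummies)
```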
Challenge: To predict the duration of time it takes for patients to receive a metastatic cancer diagnosis
After data preprocessing, I needed to find a way to select the independent variables (or features) for the model. In the meantime, I was bombarding the DataTribe Slack with lots of questions (thank you to my teammates for their patience). I decided to check out some courses online, like a crash course in ML, and I found Kaggle’s courses. I took the Intro to Machine Learning course, which was a perfect match. It gave me enough information to start with the basics of ML and a sort of blueprint for approaching challenges like this.
Approaching a solution
The course gave me a generalised idea of how to approach and implement a model. But I still felt I needed a way to select the independent variables for this challenge. So I used a similar approach to my master’s thesis: I researched the literature on this topic, mainly the relationship between socioeconomic indicators and breast cancer diagnosis. Luckily I found a lot of Open Access articles around this topic, which helped a ton and gave me an idea of what types of variables I needed to select. After I decided on the features I wanted to use in the model, I made a plan:
Use some of the data preprocessing tools and methods mentioned in the WiDS workshop to understand the dataset.
With the help of journal articles, make a list of features that will be used in the model and preprocess them specifically.
Implement the model.
However, when I tried to do this, I ran into a lot of errors and difficulties, and I couldn’t understand why. Thankfully, the timing was amazing: OpenAI had just made GPT-4o available in the free version of ChatGPT, and I decided to use it. GPT gave me a basic idea of how to find a solution, and I asked the bot questions like “Why did you use this model?” and “Why did you select random state 42?”. I must say, compared to 3.5, this version helped me a LOT, especially in integrating the climate columns. It gave me four different ideas for including these columns: aggregation, feature engineering, dimensionality reduction, and time series analysis. I chose aggregating the columns by using averages. It may not be an ideal solution, but it is a solution.
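For illustration, this is roughly what averaging the climate columns looks like in pandas. The column names here are hypothetical stand-ins, not the actual ones in the challenge dataset.

```python
import pandas as pd

# Hypothetical climate columns sharing a common prefix (not the real names)
df = pd.DataFrame({
    "temp_jan": [30.1, 45.2],
    "temp_feb": [32.4, 47.8],
    "temp_mar": [41.0, 55.3],
})

# Aggregate all climate columns into a single average-temperature feature
climate_cols = [c for c in df.columns if c.startswith("temp_")]
df["avg_temp"] = df[climate_cols].mean(axis=1)

# Drop the original monthly columns to keep the feature set small
df = df.drop(columns=climate_cols)
print(df)
```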
First drafts and submission
When I finally found a solution for the features (for myself, at least), I implemented a Random Forest Regressor, which was the only model I knew at that time. And all of a sudden, I realised: this is my first ML model! It truly felt like a milestone, and I was super happy! I later celebrated this accomplishment by going to see the Marimekko Fashion Show in Esplanadi 😎.
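In spirit, that first model looked something like the minimal sketch below. The file name, feature list, and target column are placeholders for illustration, not the exact ones in the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# "train.csv", the feature list, and the target name are placeholders,
# not the exact columns used in the actual notebook.
data = pd.read_csv("train.csv")
features = ["median_household_income", "unemployment_rate", "avg_temp", "patient_age"]
target = "metastatic_diagnosis_period"

X_train, X_valid, y_train, y_valid = train_test_split(
    data[features], data[target], random_state=42
)

# The first (and at the time, only) model I knew: a Random Forest Regressor
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Root mean squared error, the metric the challenge results are scored on
rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
print(f"Validation RMSE: {rmse:.2f}")
```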
But before that, I shared the notebook with my teammates. We decided to have a meeting with Eevamaija and Julia on a Sunday morning (yes, you read that right 😅), and I basically told them what I had done so far. They loved the work and gave me some feedback and ideas for improvement. I implemented the changes they suggested; in particular, the correlation heatmap for the unemployment rate, poverty, and median household income came in handy.
According to this heatmap, the poverty variable was not that useful, and if it had been used it could have created a multicollinearity problem. So median household income and unemployment rate were more useful in this case.
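For reference, a correlation heatmap like that takes only a few lines with seaborn. Again, the file name and column names below are illustrative placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "train.csv" and the column names below are placeholders for the real data
data = pd.read_csv("train.csv")
socio_cols = ["unemployment_rate", "poverty", "median_household_income"]

# Pairwise correlations between the candidate socioeconomic features
corr = data[socio_cols].corr()

# A heatmap makes highly correlated (redundant) features easy to spot
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of socioeconomic indicators")
plt.tight_layout()
plt.show()
```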
After this, Julia also shared her notebook, which was very detailed and informative. Our team was superb; we kept asking each other questions, discussing different methods, and learning along the way.
We ended up submitting two separate notebooks, one by me and one by Julia. Even though my result (root mean squared error) is not that great, it is still something. If you want to check it out, you can find it here.
What’s next?
I started this challenge with zero knowledge of ML, and it was scary. But I managed while being scared, and it worked, at least a bit. Now I am even more scared of deep diving into ML, but I am capable of doing it while I’m scared. Because in the end, if you don’t feel that tension (or whatever you want to call it), are you really passionate about what you are doing? This challenge made me realise that I am still passionate, and I still crave learning.
Let’s see what the new challenges will be like in the future. But in the meantime, I’m going to enjoy understanding jokes about ML and AI.
Written by Irem Corum Aktas, Data Analyst & Social Scientist
LinkedIn: https://www.linkedin.com/in/irem-corum-aktas-618367b2/