Hey guys! How are you doing? This time I thought of bringing something more advanced but with a beginner-friendly approach, so even if you don’t have much knowledge in Data Science you can start modeling right away! Because we all were beginners at some point (and I still happily am one too).
I’m going to go a little fast so we don’t have to explain every little detail, in case you know something already. In case you don’t (and that’s ok), I’m going to leave a hyperlink to each concept that I think might be helpful to learn a little bit more about.
So my idea here is to introduce some simple concepts of prediction and modeling with a simple Ordinary Least Squares (OLS) model, which is the most common way of fitting a linear regression. It’s a mathematical method used to estimate the unknown based on previously provided info (data). The way it works in basic terms is that a line is drawn through the data (upwards if the trend is ascending, downwards if it’s descending) so that the squared distances between the line and the points are as small as possible. You provide variables to use as predictors of a certain continuous variable (meaning it’s a number that can take any value in a range, instead of a category, a.k.a. a discrete variable).
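To make the "line is drawn" idea concrete, here is a minimal sketch with plain NumPy. The five points are made up for illustration (they are not from any dataset in this post), and the variable names are my own.

```python
import numpy as np

# Made-up points that roughly follow y = 2x + 1 (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: a column of ones (the intercept) next to the predictor.
X = np.column_stack([np.ones_like(x), x])

# OLS picks the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope is the "upwards" line and a negative slope would be the "downwards" one.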
So as an example we could provide a bunch of information about a person to try to predict their weight. Let’s say growth hormone (GH), cortisol, BMI, insulin level, gender, height, body fat percentage, skeletal muscle mass, percentage of water, age, shoe size, etc., all that to find their weight. So how close do you think we would get? Pretty close, right? That’s probably because we would have some “data leakage”: some of those variables (for example body fat percentage, BMI and skeletal muscle mass) are calculated from the weight or tied very tightly to it, so they pass the answer along directly instead of being a tool to predict it. This is a bit complicated, but if you want to look into it, you should also check condition number and multicollinearity. A more “clean” case for our example would be something like height, insulin, gender and shoe size, which, depending on the number of entries in our data, would probably be pretty good already.
But let’s take a real example and something concrete to show you guys:
- In the first cell we have the import statements that we are going to use in this tutorial.
- In the second one we assign the “penguins” dataset, which lives in the seaborn library, to a variable called “penguins” (you can check my brief tutorial on seaborn here).
- In the third cell we call .head() on our dataset. Note that we specify 3, so it returns only the first 3 rows of the table. Two other well-known ways to take a first look at your data are .sample() and .tail().
The .head() call and some more stuff we are about to do are known as Exploratory Data Analysis (EDA). It is a way to… well, you know, analyze and explore your data. It’s really important because you normally do this when you first encounter your data, so you can get an idea of how to use, manipulate and even clean it.
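The three ways of peeking at a table can be sketched like this. To keep the snippet runnable without downloading anything, it uses a tiny made-up stand-in with the same column names as the penguins dataset. The values are invented, and the real data would presumably be loaded with something like sns.load_dataset("penguins").

```python
import pandas as pd

# Made-up stand-in with the same column names as seaborn's penguins dataset.
# The values are invented for illustration; they are NOT the real data.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Dream", "Dream", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, 37.8, 46.5, 47.3, None],
    "bill_depth_mm": [18.7, 18.3, 17.9, 14.2, None],
    "flipper_length_mm": [181.0, 174.0, 192.0, 215.0, None],
    "body_mass_g": [3750.0, 3400.0, 3500.0, 5200.0, None],
    "sex": ["Male", "Female", "Female", "Male", None],
})

first_rows = penguins.head(3)                      # first 3 rows
last_rows = penguins.tail(2)                       # last 2 rows
random_rows = penguins.sample(2, random_state=0)   # 2 random rows, reproducible

print(first_rows)
```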
- In the first cell we check for duplicates. .duplicated() gives us a Boolean (True or False) for each row telling us whether it repeats an earlier row, and the second part, .sum(), adds up the results. Since False is equal to 0 and True is equal to 1, a sum of 0 means that we have no duplicates.
- The second cell works in the same way, but for NaN values. It gives us the count for each column and is really useful in case you want to understand the data even further and decide how to replace these values. (more on NaN values here, and more on replacing them here).
- The third cell is a method to get rid of the rows that have a NaN value. “inplace=True” means that we want to keep the changes we are making on the dataframe. (Another way would be “penguins = penguins.dropna()” ).
- In the last cell we are checking which type of variable each column has. In our case we have four that are numbers (Float) and three that are Object (Strings).
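The four checks above can be sketched as follows. The rows are made up (one duplicate and one missing value on purpose) so that the counts are not all zero. This is just a small illustration, not the real penguins data.

```python
import numpy as np
import pandas as pd

# Made-up rows (not the real penguins data): row 1 repeats row 0,
# and the last row has a missing body mass.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 3750.0, 5200.0, np.nan],
    "sex": ["Male", "Male", "Female", "Female"],
})

n_duplicates = penguins.duplicated().sum()   # True counts as 1, False as 0
nan_per_column = penguins.isna().sum()       # one NaN count per column

penguins.dropna(inplace=True)                # same effect as penguins = penguins.dropna()

print(n_duplicates)
print(nan_per_column)
print(penguins.dtypes)                       # the type of each column
```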
So what is the correlation between the float variables? How do they relate to each other? A visual way to understand is a matrix, the less cool and more real kind:
- The cell shows a pandas function called plotting.scatter_matrix. It draws a scatter plot for every pair of numeric columns in the dataset (to limit the columns, select them from the dataframe before passing it in), and on the diagonal (in blue) it shows the distribution of each column. So we can see some pretty clear patterns.
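A minimal sketch of the scatter matrix, assuming synthetic numbers generated with NumPy (not the real penguins measurements). The matplotlib.use("Agg") line only makes it run without a screen, and you would drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed in a notebook
import numpy as np
import pandas as pd

# Synthetic numbers for illustration only (not the real penguins data).
rng = np.random.default_rng(0)
flipper = rng.normal(200, 14, size=50)
toy = pd.DataFrame({
    "flipper_length_mm": flipper,
    "body_mass_g": 50 * flipper - 5800 + rng.normal(0, 300, size=50),
    "bill_depth_mm": rng.normal(17, 2, size=50),
})

# Choose the columns by subsetting the DataFrame before plotting.
cols = ["flipper_length_mm", "body_mass_g"]
axes = pd.plotting.scatter_matrix(toy[cols], figsize=(6, 6))

# The numeric counterpart of the picture: pairwise Pearson correlations.
corr = toy[cols].corr()
print(corr)
```

The plot shows the shape of each relationship, while .corr() puts a single number on it, so the two are nice to look at together.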
We can also visualize the impact of the animal’s species on its weight!
- In the only cell we have, we use seaborn (sns) to plot the graph (.violinplot), and inside the parentheses we specify the data and the columns we want for x and y. On the second line we are just getting rid of the lines around the graph.
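Something along these lines would produce that kind of plot. The weights are made up, and I am assuming the "second line" is sns.despine(), which is the usual seaborn way to remove the top and right border lines.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed in a notebook
import pandas as pd
import seaborn as sns

# Made-up weights per species, for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Chinstrap"] * 4 + ["Gentoo"] * 4,
    "body_mass_g": [3400.0, 3700.0, 3800.0, 3950.0,
                    3300.0, 3650.0, 3800.0, 4050.0,
                    4700.0, 5000.0, 5200.0, 5500.0],
})

# One violin per species, showing how body mass is distributed in each.
ax = sns.violinplot(data=toy, x="species", y="body_mass_g")

# Assumption: this is the call that removes the top and right border lines.
sns.despine()
```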
So how about we try to predict the weight of a penguin based on the other information we have about it?
So from here we would have 2 options: either not use the columns that are Object, or transform them into numbers. The thing is, computers don’t like words as much as we do, and many models simply don’t work with them. Two widely used tools for this are scikit-learn’s OneHotEncoder, which is very useful but takes a bit more setup, and pandas’ get_dummies, which is perfect for our case. It transforms a column with n unique values into n columns of 0’s and 1’s, where in each row only one column has a 1, indicating the value that the row actually has. Let’s take a look:
- In the first three cells we are applying get_dummies, which lives in the pandas library, to our dataset, and assigning each result to a variable named after the column + “_dummies” so we can tell them apart in a moment. Another thing to note is drop_first=True (the default is False). It is related to the concept of multicollinearity that we mentioned earlier. We basically don’t want more than one column saying “the same thing”.
- In the last cell we show what one of the get_dummies results looks like. If we had not set drop_first=True, we would have one more column, for the first category, and the rows that now have only 0’s would have a 1 there.
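Here is a small sketch of get_dummies with and without drop_first, on a made-up “island” column. The prefix and dtype=int arguments are my additions to get readable column names and plain 0/1 values. The original cells may not have used them.

```python
import pandas as pd

# Made-up column with three unique values (illustrative only).
penguins = pd.DataFrame({"island": ["Torgersen", "Dream", "Biscoe", "Dream"]})

# drop_first=True drops the first category in sorted order ("Biscoe").
island_dummies = pd.get_dummies(
    penguins["island"], prefix="island", drop_first=True, dtype=int
)

# For comparison: the default keeps one column per unique value.
all_columns = pd.get_dummies(penguins["island"], prefix="island", dtype=int)

print(island_dummies)
print(all_columns)
```

In island_dummies the Biscoe row is all 0’s, and that is exactly how the dropped category is still encoded.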
And now we want to add those new columns to our data, but we also want to drop the old string columns, so that, once again, two columns don’t tell us the same thing:
- In the first cell we use .drop(columns=[‘name_of_the_column’]) to drop the original columns that we used to create the dummy values, since they carry the same information.
- In the last three cells we join the dummy values that we created with our dataset, creating a new dataset.
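The drop-then-join step could look like the sketch below. The rows are made up, and I am assuming .join() was used to attach the dummies (pd.concat along the columns would do the same job). The names penguins_numeric and penguins_final are hypothetical.

```python
import pandas as pd

# Made-up rows, for illustration only.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "sex": ["Male", "Female", "Female", "Male"],
    "body_mass_g": [3750.0, 4800.0, 3400.0, 5500.0],
})

species_dummies = pd.get_dummies(
    penguins["species"], prefix="species", drop_first=True, dtype=int
)
sex_dummies = pd.get_dummies(
    penguins["sex"], prefix="sex", drop_first=True, dtype=int
)

# Drop the original string columns, then attach the dummies by row index.
penguins_numeric = penguins.drop(columns=["species", "sex"])
penguins_final = penguins_numeric.join(species_dummies).join(sex_dummies)

print(penguins_final)
```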
And this is what it looks like:
Now with our final dataset complete, we end our data manipulation and can head to our modeling!
So now to start our modeling we make our First Simple Model (FSM). The FSM works as a starting point for the rest of our work. We throw in some stuff that is likely not to work well, since we aren’t looking too closely at what we’re including. Once we start paying more attention to our variables, we can improve on our first model.
Here the code starts to get a little complicated, but that’s ok, take your time:
- In the first cell we do all the work. In its first line we assign the formula to penguins_1, in the form “target ~ predictor_1 + predictor_2 + … + predictor_n”.
- In the second line we fit the model: we call ols, provide the formula and the dataset we are using, and then call .fit() on the whole thing.
- And in the last line we store the summary of the model we fitted.
- In the second and last cell we just display the summary to see what it looks like.
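A sketch of those three lines, assuming the ols function from statsmodels.formula.api and synthetic data generated on the spot (the real post uses the penguins table). The predictors in the formula and the name penguins_1_summary are my guesses, not necessarily the ones from the original cell.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only (not the real penguins data).
rng = np.random.default_rng(42)
n = 100
flipper = rng.normal(200, 14, size=n)
bill = rng.normal(44, 5, size=n)
toy = pd.DataFrame({
    "flipper_length_mm": flipper,
    "bill_length_mm": bill,
    "body_mass_g": 50 * flipper + 5 * bill - 6000 + rng.normal(0, 350, size=n),
})

# Line 1: the formula, "target ~ predictor_1 + predictor_2".
penguins_1 = "body_mass_g ~ flipper_length_mm + bill_length_mm"
# Line 2: build the model from the formula and the data, then fit it.
penguins_1_mod = ols(formula=penguins_1, data=toy).fit()
# Line 3: keep the summary table.
penguins_1_summary = penguins_1_mod.summary()

print(penguins_1_summary)
```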
So as you can see, there is a lot of information that we can gather from a simple model. Some of the parameters to pay attention to are AIC and BIC, the residuals, Log-Likelihood… but for us, R-squared is the most straightforward measure of the predictive power of a model: it is the share of the variation in the target that the model explains. The higher the better, but as a rule of thumb, when it goes above 0.90 you should start to pay attention to overfitting. Still talking about overfitting, we should pay attention to the Condition Number: a large value points to multicollinearity, and in most cases it is not worth having your R-squared go up if your condition number is going up too. You can look it up, but a short explanation would be: predictors that repeat the same information make the coefficients unstable, so you are artificially making your R-squared higher.
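To get a feeling for what the condition number reacts to, here is a rough NumPy sketch with invented columns. It is only meant to show the idea. The number in the statsmodels summary is computed on the model’s own design matrix, so the values will not match.

```python
import numpy as np

# Invented predictors, for illustration only.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
shoe_size = rng.normal(42, 2, size=200)
# Almost an exact copy of height: it "says the same thing".
almost_height = height + rng.normal(0, 0.01, size=200)

def standardize(col):
    """Center to mean 0 and scale to standard deviation 1."""
    return (col - col.mean()) / col.std()

independent = np.column_stack([standardize(height), standardize(shoe_size)])
collinear = np.column_stack([standardize(height), standardize(almost_height)])

# Ratio of the largest to the smallest singular value of each matrix.
cond_independent = np.linalg.cond(independent)
cond_collinear = np.linalg.cond(collinear)

print(f"unrelated predictors:    {cond_independent:.1f}")
print(f"near-duplicate predictor: {cond_collinear:.1f}")
```

The second number comes out far larger than the first, which is the warning sign the summary is giving you.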
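Another important thing that is worth visualizing is the Q-Q plot. It compares the distribution of your model’s residuals (its mistakes) with a normal distribution, so you can see where the model is making mistakes that are more than random noise:
- The cell above is the qqplot function, which takes the residuals of the fitted model (penguins_1_mod.resid) and plots them in blue against the quantiles of a normal distribution. The red line (line = ‘q’) is a reference line fitted through the quartiles; the closer the blue dots follow it, the closer the residuals are to normal.
What the Q-Q plot tells us is that our model gets the actual value wrong mostly in the upper quantiles.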
And to improve the model we are going to add 2 of the dummy variables that we created, the species and the sex:
And that gets us a much better R-squared!
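The comparison between the two models can be sketched like this, again on synthetic data where I built in a species and a sex effect on purpose. The column names species_Gentoo and sex_Male and the names model_1 and model_2 are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with a built-in group effect (illustrative only).
rng = np.random.default_rng(1)
n = 150
species_Gentoo = rng.integers(0, 2, size=n)   # 0/1 dummy column
sex_Male = rng.integers(0, 2, size=n)         # 0/1 dummy column
flipper = rng.normal(195, 8, size=n) + 20 * species_Gentoo
toy = pd.DataFrame({
    "flipper_length_mm": flipper,
    "species_Gentoo": species_Gentoo,
    "sex_Male": sex_Male,
    "body_mass_g": 20 * flipper + 900 * species_Gentoo + 500 * sex_Male
                   + rng.normal(0, 250, size=n),
})

# First model: numeric predictor only.
model_1 = ols("body_mass_g ~ flipper_length_mm", data=toy).fit()
# Second model: same predictor plus the two dummy columns.
model_2 = ols(
    "body_mass_g ~ flipper_length_mm + species_Gentoo + sex_Male", data=toy
).fit()

print(f"R-squared, model 1: {model_1.rsquared:.3f}")
print(f"R-squared, model 2: {model_2.rsquared:.3f}")
print(f"Adj. R-squared, model 2: {model_2.rsquared_adj:.3f}")
```

One thing to keep in mind: plain R-squared never goes down when you add a predictor to an OLS model, so the adjusted R-squared in the summary is the fairer number to compare.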
And a better Q-Q plot!
So that’s it for today. I hope I can help someone out there. I made it as simple as I could, always thinking that it would be nice if I found something like this when I was trying to learn.
This is a super useful and practical tool and can be used in so many different ways on the same dataset. There are advantages and disadvantages, but as a starter modeling tool, and to predict targets that follow a linear trend, it is a great instrument. Understanding everything in the output takes a lot; I haven’t got there yet, but we are on the right track!