A Conceptual Guide to Regression Analysis

November 27, 2018
scikit-learn ml regression

An introduction to regression analysis, the intuition behind it, how to execute it in Python, and how to interpret the results

Outline

In this post I’m going to walk through the following important regression analysis topics and demonstrate how to implement these powerful analysis tools in Python:

  1. The intuition behind regression
  2. Implementing a regression model on toy data
  3. Interpreting regression results
  4. Principles of model specification (and misspecification)
  5. Collinearity, heteroscedasticity, and serial correlation

The Intuition Behind Regression

In this section, I’ll try to build an intuitive understanding of regression analysis, both for the simple and multiple cases. The first thing that usually comes to mind when a mathematical concept like regression is mentioned is its formula. However, developing a fuller understanding of complex subjects often begins with a deep, intuitive grasp of the concepts first. I have found this approach - concepts first, math later - to be extremely helpful, especially in data science and machine learning.

Now back to regressions. In simple terms, regression is simply a way of explaining one thing using other things. You might remember from Math and Science classes that most phenomena are a function of other things. For example, the famous equation \(F = ma\) simply states that force is a function of mass and acceleration. What this means is that force is very well-described by these two variables, mass and acceleration. Let’s look at a more relatable example: wine tasting. If you were asked to describe the flavour of five different wines using only one characteristic (acidity, for instance), you would most likely find that this limitation makes the task quite difficult. After all, wine is a complex chemical compound with many more characteristics besides acidity. To truly explain the flavour of the wines, more variables are required.

This example also demonstrates another important concept in regression: that things can be explained in varying degrees (while you might be able to explain the flavour of wine using only its acidity, a better degree of explanation could be achieved by using additional variables). This is the key difference between simple linear regression and multiple linear regression. In simple linear regression, one variable adequately describes another. Multiple regressions are used when multiple variables are needed to adequately describe a phenomenon.

Getting Slightly Technical…

We now know that regressions are used to explain one thing using one or more variables. With this intuitive understanding in the bag (hopefully), let’s dig a little deeper. What’s actually taking place when we describe the wine flavour?

What we are actually doing, mathematically, is _taking a weighted combination of one or more independent variables that explains the variability of the dependent variable_. The generalized equation of multiple regression is:

\[ Y_i = \beta_0 + \beta_1\mathbf{X_{1i}} + \ldots + \beta_k\mathbf{X_{ki}} + \epsilon_i \]

where:

- \(Y_i\) is the dependent variable (what you are trying to explain)
- \(\mathbf{X_{1i}}, \ldots, \mathbf{X_{ki}}\) are the independent variables
- \(\beta_1, \ldots, \beta_k\) are the weights corresponding to each independent variable (\(\beta_0\) is the intercept)
- \(\epsilon_i\) is the error term. This is needed because nothing can be perfectly described in the real world.
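To make this concrete, here is a minimal numerical sketch of the weighted combination, using made-up weights and a single made-up observation (none of these numbers come from a real model):

```python
import numpy as np

# Hypothetical weights (betas) and one observation of three
# independent variables - invented numbers, for illustration only
beta_0 = 1.0                        # intercept
betas = np.array([0.5, 2.0, -1.0])  # beta_1 ... beta_k
x_i = np.array([4.0, 1.5, 2.0])     # X_1i ... X_ki

# The deterministic part of the model is just a weighted sum
y_hat = beta_0 + np.dot(betas, x_i)
print(y_hat)  # 1.0 + 2.0 + 3.0 - 2.0 = 4.0
```

Add a draw from a noise distribution for \(\epsilon_i\) and you have the full story the equation is telling.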

Multiple Regression: A Pizza Ranking Example

At first glance, the multiple linear regression equation above might seem a little intimidating. In this section, we’ll go through an example to conceptually explain the equation above and demonstrate that it really is rather straightforward.

Suppose I want to rank my own favourite pizza on a scale of 1 to 10. Suppose further that I have collected data on all the pizzas that I have eaten and ranked them with corresponding attributes.

While this isn’t overly scientific, it will serve the example well. Now, people have different preferences for how they like their pizza - suppose we have some variables that explain how likely I am to like a pizza. These are the attributes that I have been using to rank my theoretical pizza consumption.

We can then re-write our regression equation to look something like this:

\[ Pizza Score = \beta_0 + \beta_1 THICKNESS + \beta_2 CRISP + \beta_3 SPSI + \beta_4 CPSI + \epsilon_i \]

What does this mean? Well, it means that the score I give a pizza will be explained - hopefully, and at least partially - by a weighted combination of the thickness, the crisp, the amount of sauce per square inch (SPSI), and the amount of cheese per square inch (CPSI).

Our \(\beta\) factors are weights - because some of these factors might be more important than others. Perhaps the amount of sauce on my pizza isn’t as important to me as the amount of cheese - the weights for each independent variable will account for this.

\(\epsilon\) is the error term, because there is likely some variability in this equation (it is, after all, very subjective) that won’t be accounted for in the variables.

The term \(\beta_0\) is a constant factor. In this case, maybe I really love pizza, and no matter what I’m always going to give a pizza a baseline score of 3 out of 10 because I like pizza so much. This constant represents the portion of the dependent variable that doesn’t depend on the independent variables (thickness, crisp, etc…).

You will see this constant factor in every regression - and it can be very important. A lot of advanced investing strategies actually rely on “hedging” out all of the other independent variables so that you’re only left with this constant.

What if other things affect how much I like pizza?

In most real-world situations, we won’t be able to perfectly describe the dependent variable - in this case the Pizza Score. There may be other factors that affect the score - like how long the pizza has sat in the box, the number of toppings on it, the type of cheese being used, the acidity or sweetness of the sauce (or any number of other tangible factors you might be able to think of). There may even be intangible factors like how hungry you are when you eat the pizza.

This is actually a fundamental problem in data science and machine learning - and addressing it is called feature engineering. Feature engineering is the creation of quantifiable data points that can improve regressions or other predictive models. I’m not going to go too deep on that here - but keep it in mind. Later, when we talk about the coefficient of determination, you’ll see why missing variables can hurt a regression.
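As a tiny illustration of what feature engineering looks like in practice, here’s a sketch that derives “per square inch” attributes from hypothetical raw measurements (the column names and values are invented for this example and are not part of the toy dataset used below):

```python
import pandas as pd

# Hypothetical raw measurements for two pizzas - invented numbers
raw = pd.DataFrame({
    'sauce_ml':    [120.0, 90.0],
    'cheese_g':    [200.0, 150.0],
    'diameter_in': [16.0, 12.0],
})

# Engineer "per square inch" features from the raw measurements
area = 3.14159 * (raw['diameter_in'] / 2) ** 2
raw['SPSI'] = raw['sauce_ml'] / area
raw['CPSI'] = raw['cheese_g'] / area
print(raw[['SPSI', 'CPSI']].round(3))
```

The derived columns quantify something a rater actually experiences (sauce density), which raw volume alone does not.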

Hopefully this gives you an intuitive grasp of what we’re trying to achieve with regression. Now we’re going to dive deeper.

Implementing the Pizza Score Model in scikit-learn

In this section, we are going to simulate the pizza ranking model described in the prior sections. The scikit-learn library provides a handy way of creating toy datasets for testing various statistical models (regressions, classifications, clustering …).

Generate Toy Data

We mentioned above that our regression is going to have four variables (THICKNESS, CRISP, SPSI, and CPSI), and one dependent variable (Pizza_Score).

We’re going to call our four variables, collectively, pizza_features, and our dependent variable pizza_score. Using the make_regression function, we’ll create a dataset with the following features:

  1. 1,000 samples
  2. 4 features
  3. 2 informative features (only two of the four actually drive the score)

I seed the random number generator in numpy so that if I have to rerun this notebook the same dataset will be generated.

# import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
import statsmodels.formula.api as smf

import warnings


#notebook settings
sns.set_context('notebook')
plt.style.use('fivethirtyeight')
#plt.style.use('ggplot')
plt.rcParams['lines.linewidth'] = 1.5
warnings.filterwarnings('ignore')
%matplotlib inline

#generating data
np.set_printoptions(precision=2)
np.random.seed(813)

pizza_features, pizza_score = make_regression(
    n_samples = 1000,
    n_features = 4,
    n_informative = 2
)

Data Processing

For the purposes of this example, we want our pizza score to be between 1 and 10. However, the output from the random dataset generator spans a much wider range. Let’s look at our maximum and minimum pizza scores…

print(max(pizza_score))
print(min(pizza_score))
74.43087414338704
-83.54014698616761

We can easily scale this data so that it’s between 0 and 1, and then multiply the entire array by 10. We can then confirm that our pizza scores are all contained within 0 and 10.

pizza_score = (pizza_score-min(pizza_score)) /(max(pizza_score) - min(pizza_score)) * 10
print(max(pizza_score))
print(min(pizza_score))
10.0
0.0

Just for fun, we can combine the features and the scores into a single dataframe so that we can see our pizza scores alongside all of the attributes…

pizza_attributes = ['THICKNESS','CRISP','SPSI','CPSI']
pizza_features_df = pd.DataFrame(pizza_features,
                               columns = pizza_attributes)
pizza_score_df = pd.DataFrame(pizza_score,
                              columns = ['Pizza_Score'])
pizza_df = pizza_features_df.merge(pizza_score_df,
                                   left_index = True,
                                   right_index = True)
print(pizza_df.head())
   THICKNESS     CRISP      SPSI      CPSI  Pizza_Score
0   0.473132  0.587506  1.901269  2.888218    10.000000
1   0.051947 -0.048655 -0.296745 -0.059480     4.910707
2   1.169596 -1.001573 -0.696988  0.296086     4.799838
3   0.942445  1.874017 -2.061068  0.565154     3.560019
4   0.421759  0.817018 -1.436081 -0.494829     3.271640

Something is still a little funny: all of our attribute scores should be transformed to be between 0 and 10 as well. Let’s go ahead and do that. This time we’ll use the MinMaxScaler from sklearn - MinMaxScaler does the same thing we did in the formula above to transform the pizza score!

I’m not transforming these scores for any reason other than that a scale of 1 to 10 seems to be the de facto scoring scale for pizza! Also, this will do something funny to our data that we’ll explore later.

scaler = MinMaxScaler()
pizza_features_df = scaler.fit_transform(pizza_features_df) * 10
pizza_features_df = pd.DataFrame(pizza_features_df,
                                 columns = pizza_attributes)
print(pizza_features_df.head())
   THICKNESS     CRISP      SPSI      CPSI
0   5.415908  6.673049  7.900437  9.432805
1   4.800505  5.684973  4.499813  5.446018
2   6.433526  4.204913  3.880584  5.926924
3   6.101631  8.671241  1.770168  6.290840
4   5.340845  7.029524  2.737107  4.857205

Running the Regression

Next we’re going to run the regression itself. Before we do, we’re going to add the constant term to our dataframe. Recall our formula from above:

\[ Pizza Score = \beta_0 + \beta_1 THICKNESS + \beta_2 CRISP + \beta_3 SPSI + \beta_4 CPSI + \epsilon_i \]

The term we’re adding to the dataframe is \(\beta_0\). Once we add the constant, our dataframe is going to look like this:

pizza_features_constant = sm.add_constant(pizza_features_df)
print(pizza_features_constant.head())
   const  THICKNESS     CRISP      SPSI      CPSI
0    1.0   5.415908  6.673049  7.900437  9.432805
1    1.0   4.800505  5.684973  4.499813  5.446018
2    1.0   6.433526  4.204913  3.880584  5.926924
3    1.0   6.101631  8.671241  1.770168  6.290840
4    1.0   5.340845  7.029524  2.737107  4.857205
model = sm.OLS(pizza_score_df, pizza_features_constant)
lin_reg = model.fit() #Fit the regression line to the data
print(lin_reg.summary()) #output the summary of the regression
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Pizza_Score   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 9.098e+31
Date:                Thu, 17 Dec 2020   Prob (F-statistic):               0.00
Time:                        08:06:51   Log-Likelihood:                 32237.
No. Observations:                1000   AIC:                        -6.446e+04
Df Residuals:                     995   BIC:                        -6.444e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.9389   5.46e-16  -3.55e+15      0.000      -1.939      -1.939
THICKNESS   7.633e-17    5.3e-17      1.441      0.150   -2.76e-17     1.8e-16
CRISP      -2.776e-17   4.89e-17     -0.568      0.570   -1.24e-16    6.81e-17
SPSI           0.7040   4.85e-17   1.45e+16      0.000       0.704       0.704
CPSI           0.6760   5.57e-17   1.21e+16      0.000       0.676       0.676
==============================================================================
Omnibus:                       13.996   Durbin-Watson:                   0.249
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.195
Skew:                           0.277   Prob(JB):                     0.000827
Kurtosis:                       3.181   Cond. No.                         76.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation: what do these results mean?

The regression spits out the table above. Let’s walk through these terms and talk about what they mean.

Now scroll down a bit to where you see coef, std err, etc…

The “coef” column holds the beta coefficients for our equation. Observe that the coefficients for THICKNESS and CRISP are tiny (effectively zero), and that only two of the coefficients are non-zero: SPSI and CPSI. This is because when we generated our toy dataset we set the number of informative variables to two. The formula for our pizza score now becomes:

\[ Pizza Score = -1.9389 + 0.704 \times SPSI + 0.676 \times CPSI + \epsilon_i \]

What this means is that for the individual in question - represented by the random pizza score dataset we created - the rating this person would give a pizza would abide by this formula.
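Given these fitted coefficients, scoring a hypothetical new pizza is just a matter of plugging values into the formula (the SPSI and CPSI values below are made up):

```python
# Coefficients taken from the regression summary above
beta_0, beta_spsi, beta_cpsi = -1.9389, 0.704, 0.676

# A hypothetical new pizza: lots of sauce, a moderate amount of cheese
spsi, cpsi = 8.0, 5.0

predicted_score = beta_0 + beta_spsi * spsi + beta_cpsi * cpsi
print(round(predicted_score, 4))  # 7.0731
```

This is exactly what `lin_reg.predict` would do for us given the same inputs.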

Visualizing the important pizza factors

By plotting an important factor and an unimportant factor against pizza score, we can visually see the correlations. Here we plot SPSI (an important factor) and CRISP (an unimportant factor).

plt.figure(figsize = (20,10))
plt.scatter(pizza_score_df['Pizza_Score'],
            pizza_features_df['SPSI'],
            label = 'SPSI (important)')
plt.scatter(pizza_score_df['Pizza_Score'],
            pizza_features_df['CRISP'],
            alpha = 0.5,
            label = 'CRISP (unimportant)')
plt.xlabel('Pizza Score')
plt.ylabel('Attribute score')
plt.title("Important pizza factors versus unimportant pizza factors")
plt.legend();


Principles of Model Specification (and what happens when you “misspecify”)

There are some compelling philosophies on model specification, in general, that I think are much easier to grasp within the context of our pizza example. They are as follows:

  1. The model should be grounded in logic: We should be able to supply reasoning behind the choice of variables, and the reasoning should make sense. For example, the color of the pizza box is likely an unreasonable factor to include in our pizza regression, but perhaps the “fluffiness” of the crust should be. Note that model specification, and feature engineering in general, often relies on the imagination of the people building the model. If there is something you can think of that might affect a person’s pizza score, it’s worth testing if it makes sense! This is the essence of science - develop a hypothesis and then test it.

  2. The law of parsimony: This means that fewer, high impact variables are better than a lot of variables. Each variable within a regression should play an important and impactful role. In our regression, sauce per square inch and cheese per square inch are definitely important and impactful.

  3. Violations of the regression assumptions should be examined before accepting the model: We haven’t discussed them yet, but the presence of heteroscedasticity, multicollinearity, and serial correlation can call into question the validity of the regression model. We’ll discuss these shortly!

  4. The model should be useful on “real world data”: It’s all well and good to examine and explore the strength of relationships within a sample of data - but in order to test our pizza regression model we would want to test it on new pizzas to see if it has a reasonable ability to “predict” whether or not this individual is bound to like the pizza - based on the sauce per square inch and cheese per square inch. This applies to all other statistical models as well - the “out-of-sample” test is the gold standard in determining whether or not a model is valid.
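The out-of-sample idea can be sketched with a simple train/test split. This uses scikit-learn’s LinearRegression on freshly generated toy data (rather than the statsmodels fit above), holding out 20% of the “pizzas” that the model never sees during fitting:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(813)
X, y = make_regression(n_samples=1000, n_features=4, n_informative=2)

# Hold out 20% of the data as an "out-of-sample" test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R^2 on data the model has never seen
print(model.score(X_test, y_test))
```

Because the toy data is generated without noise, the out-of-sample R² is essentially perfect here; on real data, this score is where weak models get exposed.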

Misspecifying a model

The concept of misspecification is relatively straightforward. In our example, our regression results told us (by design) that there are two key variables that fully describe the variability in this individual’s pizza score. If we were to remove one of these variables from the regression, we would no longer be able to explain the full variability in the dependent variable. It would be like trying to fully describe the flavour of a wine using only “acidity” as a descriptor, or Newton trying to fully explain force using only mass. To fully explain certain real-world phenomena, there is typically a minimum number of variables that one needs to use - anything less than that would be a “misspecification”. It’s up to the data scientist to discover what these variables are.

Let’s try this on our pizza data to see what happens to the regression results. Let’s remove the “sauce per square inch” score and see what happens…

misspecified_df=pizza_features_constant.drop(['SPSI'],axis=1)
print(misspecified_df.head())
   const  THICKNESS     CRISP      CPSI
0    1.0   5.415908  6.673049  9.432805
1    1.0   4.800505  5.684973  5.446018
2    1.0   6.433526  4.204913  5.926924
3    1.0   6.101631  8.671241  6.290840
4    1.0   5.340845  7.029524  4.857205
model2 = sm.OLS(pizza_score_df, misspecified_df)
lin_reg2 = model2.fit()
print(lin_reg2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Pizza_Score   R-squared:                       0.422
Model:                            OLS   Adj. R-squared:                  0.420
Method:                 Least Squares   F-statistic:                     242.0
Date:                Thu, 17 Dec 2020   Prob (F-statistic):          6.39e-118
Time:                        08:15:20   Log-Likelihood:                -1524.7
No. Observations:                1000   AIC:                             3057.
Df Residuals:                     996   BIC:                             3077.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6459      0.224      7.354      0.000       1.207       2.085
THICKNESS     -0.0206      0.024     -0.844      0.399      -0.068       0.027
CRISP         -0.0047      0.022     -0.209      0.835      -0.049       0.039
CPSI           0.6896      0.026     26.942      0.000       0.639       0.740
==============================================================================
Omnibus:                        0.468   Durbin-Watson:                   2.035
Prob(Omnibus):                  0.791   Jarque-Bera (JB):                0.518
Skew:                           0.050   Prob(JB):                        0.772
Kurtosis:                       2.950   Cond. No.                         60.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Can you see what happened here? By removing SPSI, a key independent variable, we forced the regression to try and explain the pizza score using only CPSI (THICKNESS and CRISP carry no information). This resulted in a few negative changes:

  1. The R-squared dropped from 1.000 to 0.422 - the model now explains less than half of the variability in the pizza score.
  2. The log-likelihood, AIC, and BIC all deteriorated dramatically.
  3. The constant and the CPSI coefficient shifted, as the regression compensates for the omitted variable.

Model misspecification can be a really difficult thing to identify. Imagine we had run this regression first - would it be obvious that we had “omitted” a variable?

Collinearity, heteroscedasticity, and serial correlation

We have mentioned these three terms several times in this tutorial. These are, collectively, the “trifecta of no-no’s” when it comes to regression. It took me a while to wrap my head around these three things - mostly because they’re often presented poorly and in an unnecessarily academic way (or maybe I’m just a dummy!). In any case, I hope I do a better job of explaining them here.

Heteroscedasticity

To illustrate this, let’s do a thought experiment. Imagine you’re standing on a beach and you have a basket of tennis balls. You drop the first tennis ball directly in front of your feet. You then record on a sheet of paper how far you think the ball is from your feet (in inches).

You then throw the second ball a little farther than the first, and again record the estimated distance down on your sheet of paper. You repeat this until you have to throw the balls as far as you possibly can down the beach - each time estimating how far they are from your feet in inches.

To finish the experiment, you mark your feet in the sand and then measure the exact distance of each ball from where your feet were. Do you think the error in the estimate for the final ball will be the same as the error in the estimate of the first ball? Likely not - it’s easy to estimate distance when something is very close. It’s much harder (more prone to error) when it’s far away.

If you were to plot the difference between the estimated distance and actual distance for each ball, it would likely increase in magnitude as the balls get further away. This is heteroscedasticity.

The presence of heteroscedasticity can invalidate the results of a regression because regression assumes that the variance of the errors is constant. Make sense?

Below is a plot illustrating what heteroscedastic variance looks like visually. The shaded area, in this case, represents the potential error of our distance estimate. The solid red line is the actual distance. With this plot we’re illustrating the example above - that estimating the distance of the ball from your feet becomes harder the farther away the ball is. The error variance, +/- 10% of the ball distance, has been chosen arbitrarily.

x = np.linspace(1, 50)
y = x
upper = 1 + 1.10*x
lower = 1 + 0.90*x
plt.figure(figsize=(15,7.5))
plt.plot(x, y)
plt.fill_between(x, upper, lower, alpha=0.5)
plt.xlabel('Number of balls tossed')
plt.ylabel('Actual distance of ball from feet')
plt.title("Example of Heteroscedastic error variance.");


Collinearity and Multicollinearity

This refers to the situation where any of the independent variables are significantly correlated with other independent variables.

For example, imagine in our pizza regression we introduced another sauce attribute called “Sauce volume”. We already have sauce per square inch as a scoring attribute. It makes sense that “sauce volume” and “sauce per square inch” are likely related. In fact, they might be perfectly correlated - or close to it. That is, if we increase sauce volume it has to go somewhere (spread around the pizza!).

Collinearity throws off regressions because when independent variables are correlated it can become impossible to determine how important each one is. Another word for collinearity is just “redundancy”. If you have two variables that move the same way, only one of them is required to explain the dependent variable - not both. To correct for collinearity, you just need to drop one of the variables. There are also libraries in Python that can detect correlated variables and drop them automatically - although in most applications it’s important for the scientist to stay in the loop.

Serial correlation (auto-correlation)

Serial correlation refers to the values of a timeseries being _correlated with a “lagged” version of itself_. For example, imagine you boiled a pot of water and left it on the counter for a couple of minutes. Your best estimate for the temperature now is the temperature one period ago - in other words, the temperature of the pot from one second to the next isn’t random. Likewise, your best estimate for the temperature of the pot one second from now is the temperature of the pot now, plus some error term.

In regression, one of the key assumptions is that the error terms (or residuals) are not serially correlated.
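statsmodels reports exactly this check as the Durbin-Watson statistic in the regression summaries above: values near 2 indicate no serial correlation in the residuals, while values near 0 indicate strong positive serial correlation. A quick sketch on artificial series:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)    # independent values
random_walk = np.cumsum(white_noise)  # each value depends on the last

print(round(durbin_watson(white_noise), 2))  # near 2: no serial correlation
print(round(durbin_watson(random_walk), 2))  # near 0: strongly serially correlated
```

Notice that our first pizza regression reported a Durbin-Watson of 0.249 - a hint that its residuals were not independent.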

To visualize this, consider the sinusoid shown below with an upward sloping mean. If the mean is our “line of best fit” from the regression, the difference between the sinusoid and the mean line are the “residuals”. It’s clear that the residuals here aren’t random - they clearly follow the sinusoidal function. In this case, the residuals are “serially correlated”.

x = np.arange(300) # the points on the x axis for plotting
# compute the value (amplitude) of the sine wave for each sample
y = np.sin(0.2*x) + 0.01*x   # sinusoid with an upward sloping mean
z = 0.01*x                   # the mean ("line of best fit")

# plot the sinusoid and its mean line
plt.figure(figsize=(15,5))
plt.plot(x, y)
plt.plot(x, z);


Summary
