Simple & Multiple Linear Regression

Simple Linear Regression

Imagine we have data relating the number of employees to the number of tickets processed through sales. The relationship appears to be linear, since a straight line can be drawn through the data. If we know the equation of that line, we can predict the number of tickets for a given number of employees. The equation for a line is:

y = mx + b

Where:

Y = Target Variable

X = Predictor Variable

m = Slope of the line

b = Y-intercept

 

Target Variable: The target variable is the variable we are trying to predict. It is also referred to as the dependent variable. In our example, we are trying to predict Y, or the average number of tickets.

Predictor Variable: Predictor variables are used to try to predict the target variable and are also known as independent variables. In the example, there is just one predictor variable, X, or the number of employees. It is used to predict the number of tickets.
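
To make this concrete, here is a minimal sketch of fitting a simple linear regression in Python with scikit-learn. The employee and ticket numbers are made-up values for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: number of employees (x) and average tickets processed (y)
X = np.array([[2], [4], [6], [8], [10]])  # predictor (independent) variable
y = np.array([40, 85, 120, 165, 210])     # target (dependent) variable

model = LinearRegression().fit(X, y)
m, b = model.coef_[0], model.intercept_   # slope and y-intercept of the fitted line

print(f"y = {m:.2f}x + {b:.2f}")
print("Predicted tickets for 7 employees:", model.predict([[7]])[0])
```

Once the slope m and intercept b are estimated, predicting tickets for a new number of employees is just y = mx + b.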

What if we have more than one predictor variable?

Multiple Linear Regression

Multiple Linear Regression builds on a simple linear model by adding additional variables to the model.

Important: Linear regression models assume that numerical predictor variables have a linear relationship with the target variable. It's good practice to examine each predictor's relationship with the target, for example with a scatterplot, before fitting the linear regression model.

What about categorical variables? Due to the nature of categorical variables, a scatterplot cannot be used to see if a linear relationship exists. So let's talk about what happens when you add a categorical variable to the mix of predictor variables. Here's a general regression equation with two predictor variables.

Y = β0 + β1X1 + β2X2

As discussed, the X's represent the values for each variable; these come directly from the data. The β's come from the linear regression model: β0 is the intercept, and each remaining β represents the relationship between its predictor variable and the target variable Y.
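
As a sketch, the same two-predictor equation can be fit with scikit-learn. The second predictor used here (overtime hours) is a hypothetical addition, not part of the original scenario.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: X1 = number of employees, X2 = overtime hours
X = np.array([[2, 5], [4, 3], [6, 8], [8, 2], [10, 6]])
y = np.array([45, 80, 135, 150, 215])

model = LinearRegression().fit(X, y)
print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2):", model.coef_)
```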

Now let's say you add a third variable that is categorical. Putting the actual value of a category into an equation wouldn't work because you can't do math with string variables, so we have to transform the variable.

This is where dummy variables come into play. A dummy variable can take on only two values, generally zero or one. You add one fewer dummy variable than the number of unique values in the categorical variable: if the variable is binary, you add one dummy; if there are four categories, you add three dummy variables. The reason one fewer dummy variable is created is that the equation needs a baseline category that is not coded into a dummy variable. That left-out category becomes the one the others are compared to.
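
Here is a minimal sketch of creating dummy variables with pandas, assuming a hypothetical categorical column called "region" with four categories. Setting drop_first=True keeps three dummies and leaves the fourth category as the baseline.

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West", "South"]})

# drop_first=True creates k-1 dummies for k categories; the dropped category
# ("East", first alphabetically) becomes the baseline the others are compared to.
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies)
```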

Validation

Now that we've performed the analysis and run the Linear Regression Model, we need to validate the model's results. In other words, is there a way to measure how good the model is? Or, in this case, is the linear expression we calculated a good fit for our data?

Step 1: Correlation

The correlation coefficient measures the strength and direction of the linear relationship between x and y and is often referred to as r. The range of r is from -1 to +1. The closer r is to plus or minus 1, the stronger the correlation between x and y.
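
A minimal sketch of computing r in Python, reusing the hypothetical employee/ticket numbers from earlier:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([2, 4, 6, 8, 10])
y = np.array([40, 85, 120, 165, 210])

r, _ = pearsonr(x, y)  # r ranges from -1 to +1
print("r =", r)
```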

Step 2: Calculate r-squared

While a strong correlation is good, we want to know how well the line fits our data. Fortunately, we can get a sense of how well the formula approximates the data by calculating the coefficient of determination, or r-squared. R-squared is a value between 0 and 1 and is interpreted as the percentage of variance in the target variable that is explained by the model, i.e., the model's explanatory power. An r-squared value close to 1 means that nearly all of the variance in the target variable is explained by the model; an r-squared value close to 0 means that almost none of it is.
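
A minimal sketch of computing r-squared for a fitted model, using the same hypothetical values as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2], [4], [6], [8], [10]])
y = np.array([40, 85, 120, 165, 210])

model = LinearRegression().fit(X, y)
print("R-squared:", model.score(X, y))  # share of variance in y explained by the model
```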

Caution about interpreting R-squared: How you interpret R-squared depends heavily on the problem you're trying to model and the data you use. For tough problems, a very low R-squared may be acceptable. Also, a high R-squared may result from a flawed model. However, in general, the higher the R-squared, the better, especially as you add and remove predictor variables to determine the most robust predictive model.

R-Squared vs Adjusted R-Squared: The adjusted r-squared value should be used with multiple linear regression because of a phenomenon that occurs when variables are added to the model: the more variables included, the higher the r-squared value will be, even if the additional variables have no relationship with the target variable. The adjusted r-squared corrects for this by penalizing the model for each added predictor, so we use it instead.
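
For reference, a sketch of the standard adjustment: adjusted r-squared shrinks r-squared based on the number of predictors p relative to the number of observations n, so adding an uninformative variable can lower it. The numbers plugged in below are hypothetical.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical example: R-squared of 0.90 from a model with 3 predictors and 50 observations
print(adjusted_r_squared(0.90, n=50, p=3))
```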

Step 3: P-value

The p-value is the probability of obtaining a coefficient estimate at least as large as the one observed if there were actually no relationship between the predictor and the target variable, that is, if the true coefficient were zero. The lower the p-value, the stronger the evidence that a relationship exists between the predictor and the target variable. If the p-value is high, we should not rely on the coefficient estimate. When a predictor variable has a p-value below 0.05, the relationship between it and the target variable is conventionally considered statistically significant.
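
A minimal sketch of obtaining coefficient p-values with statsmodels OLS; the data and the second predictor are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

X = np.array([[2, 5], [4, 3], [6, 8], [8, 2], [10, 6], [12, 4]])
y = np.array([45, 80, 135, 150, 215, 240])

X_with_const = sm.add_constant(X)      # add the intercept term (beta_0)
model = sm.OLS(y, X_with_const).fit()
print(model.pvalues)                   # p-value for the intercept and each predictor
```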

Statistical Significance - "Statistical significance is a result that is not likely to occur randomly, but rather is likely to be attributable to a specific cause." - Investopedia.

 
