Linear regression is used to make sense of the data we have at hand by unearthing the relation between target values and features of the data. Once we know this relation, we can make predictions about data we haven’t seen before; in other words, we can infer the target value from the feature values. Let’s exemplify this:
Suppose we want to understand how a particular company called X decides what to pay its employees.
Many factors could go into that decision, so we go around and ask most of the employees who work there.
After a lot of prying and sneaking around, it turns out that the decision mostly comes down to three factors:

- Qualifications
- Length of Service
- Getting along with the Boss

These three indicators seem to be the major ones, so we’ll be using them in our implementation.
Now, with the information we have gathered, we want to understand the underlying relationship between these factors and the current salary paid to the employees. We come up with this oversimplified equation:
SALARY = (? x Qualifications) + (? x Length of Service) + (? x Getting along with the Boss)
We can see from the equation above that the salary is affected by 3 attributes. These attributes, also called features, affect the salary according to their own weight which is depicted in the equation as question marks simply because we don’t actually know what these weights are.
Now, let’s imagine what would happen if we knew these weights exactly. Then, if we had an employee whose salary we didn’t know, we could use the features (qualifications, length of service, etc.) of that employee to predict their salary; that is, we would understand how these features and the target value (salary) are related.
Turns out, linear regression is used to do exactly that! It is used to get a good estimate of these weights so that they can be used to predict the target value of unseen data. In the machine learning world, these weights are often called parameters, hence from now on, we’ll adopt that term here as well.
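In code, that prediction is just a weighted sum of the feature values. Here is a minimal Python sketch (the function and variable names are our own, purely for illustration):

# Each feature value is multiplied by its weight (parameter),
# and the products are summed to produce the predicted target value.
def predict(feature_values, weights):
    return sum(w * x for w, x in zip(weights, feature_values))

Linear regression’s job is to find the weights that make this function’s output match the actual target values as closely as possible.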
Gradient Descent Algorithm
Now that we have a grasp of what linear regression is, we can come to the how part. How does this algorithm work? How can we figure out these parameters for linear regression?
In machine learning, there is another famous algorithm called gradient descent that is widely used, not only for estimating the parameters for linear regression but for other optimization problems as well. In the gradient descent algorithm, the parameters of the model are changed iteratively at each step starting with the initial values of the parameters.
To remind us once more, parameters (weights) are the numerical values that determine how much each feature affects the target value.
We want to know these parameter values exactly, but in real life, this is not possible because there may be other features (hence, parameters of those features as well) affecting the target value.
However, we want them to produce predictions that are as close as possible to the actual target values. Since the question marks in the above equation represent the parameter values, we can replace them with values like this:
SALARY = (1000 x Qualifications) + (200 x Length of Service) + (500 x Getting along with the Boss)
Here, it is obvious that the qualifications feature affects salary more than the other features because its parameter value is higher than the rest.
Keep in mind that we have chosen these parameter values intuitively and we will be using them as our initial parameter values, yet these initial values will change at every step of the algorithm toward their optimal values.
Going along with our analogy, suppose we have an initial estimate for the parameters of these features, and we go around and ask the first employee we can find three questions:

1. How long have you been working at company X?
2. How qualified are you for your job?
3. How well do you get along with your boss?
For the first question, we told the employee that we would accept an answer in years (1 year, 2 years, 5 years, etc.).
For the second question, we told the employee that the answer would be any number from 1 to 10 (1 being the least qualified and 10 the most).
For the last question, the answer would be a number from -5 to 5. Here, minus represents the negativity of the relationship between the employee and the boss. Therefore, -5 means that the boss quite dislikes the employee, 0 could mean that the boss doesn’t even know the employee and/or there is no interaction between the two and +5 means that the two of them get along just great.
When we asked the employee these questions, these were the answers we got:

1. 9 years
2. 10 (the most qualified)
3. -4 (she and the boss really don’t get along)
Remember that we want to predict the salary based on the parameters we have chosen and the answers we’ve got from the employee. After predicting what the salary would be based on only these answers, we ask the employee what her actual salary is.
The difference between the predicted and actual values determines how successful our estimates for these parameters (weights) are. The gradient descent algorithm’s job is to make this difference (predicted - actual) as small as possible!
Let’s go ahead and call this difference error since it denotes how different the actual value is from the predicted value. Now, let’s plug the numbers we’ve got from the first employee’s answers into our equation:
SALARY = (1000 x 10) + (200 x 9) + (500 x -4) = 10000 + 1800 - 2000
Hence, this shows that our prediction for the salary is:
Predicted SALARY = 9800
Now, after getting the value of the actual salary from the employee, we calculate the error between the actual and predicted value:
Actual SALARY = 6800
Error = Predicted SALARY - Actual SALARY = 9800 - 6800 = 3000
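We can sanity-check this arithmetic with a few lines of Python, reusing the predict function sketched earlier (the numbers are just our running example):

weights = [1000, 200, 500]   # qualifications, length of service, getting along with the boss
answers = [10, 9, -4]        # the first employee's answers, in the same order

predicted = predict(answers, weights)
error = predicted - 6800     # actual salary
print(predicted, error)      # 9800 3000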
We see that our error is 3000. We want to make this error as small as possible by tweaking the parameter values appropriately. But how do we do that?
How do we decide what the correct way of changing the parameter values is?
Obviously, we can make intuitive guesses and change the parameter values (increase or decrease them) to make the error small enough. However, this won’t be practical if we have one hundred features instead of only three. One hundred features mean one hundred parameter values, remember?
Obviously, we have to find a better way than this.
Moreover, there is another issue to consider: this error cannot be based on a single employee. In other words, we cannot tune the parameter values for just one employee, since we want this model to be representative of all the employees who work at company X.
We have to get the answers from each employee and plug those values into the equation, find the error and change the parameters accordingly.
The error function that we have used here (Error = Predicted - Actual) is one of the most basic ones in machine learning, and it has certain limitations; for instance, positive and negative errors can cancel each other out when summed over many employees. Consequently, while implementing linear regression, we will use a more robust version called the sum of squared errors (SSE), which is simply the sum of the squared differences between the actual and predicted values.
Now, a quick change of notation is in order. The term cost is often used instead of error, and we’ll adopt it here as well, since it costs us to miss the actual value by some amount. If the predicted value were equal to the actual value, the cost would be zero.
As a result, a cost function is used to measure how wrong the model is in terms of its ability to estimate the relationship between feature and target values.
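Under these definitions, a minimal sketch of the SSE cost in Python could look like this (the names and data layout are our own assumptions):

# Sum of squared differences between predicted and actual salaries,
# accumulated over every employee in the dataset.
def sse_cost(weights, employees, actual_salaries):
    total = 0.0
    for features, actual in zip(employees, actual_salaries):
        predicted = predict(features, weights)
        total += (predicted - actual) ** 2
    return total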
After establishing which cost function to use, we can now move on. The whole point of the gradient descent algorithm is to minimize the cost function. When we minimize the cost function, we are actually ensuring the lowest possible error while increasing the accuracy of our model. We go over our dataset iteratively, updating the parameters at each step.
Back to our analogy: remember that we had three parameters (for qualifications, length of service, and getting along with the boss) that we wanted to change in a direction that minimizes the cost function. Checking each data point in our dataset basically means asking each employee who works at company X the three questions we formed, plugging the values we extract from the answers into the cost function, and deciding in which direction the next step should be taken in order to minimize the cost function.
Now, how do we decide which direction we should move toward to make the total cost lower?
Calculus comes to our help here. When we want to minimize a function, we take the derivative of the function with respect to a variable and use that derivative to decide which direction to go.
In our analogy, the parameters that we have chosen are actually the variables of our cost function because the cost function varies as each parameter varies (hence, variable, duh!).
We have to take the derivative with respect to each parameter and update the parameters using those derivative values. In the picture below, we can see the graph of the cost function against just one parameter (Length of Service).
Now, when we calculate the partial derivative of the cost function with respect to this parameter only, we get the direction we need to move in for this parameter in order to reach the local minimum, where the slope equals 0.
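For the SSE cost we chose above, this partial derivative has a simple closed form. Written in the same style as our earlier equations (we state it without the full derivation):

Partial Derivative w.r.t. a parameter = Sum over all employees of [ 2 x (Predicted SALARY - Actual SALARY) x (the feature value that the parameter multiplies) ]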
When we take the derivative with respect to each parameter and find the direction we need to move towards, we update each parameter simultaneously:
Updated Length of Service = Old Length of Service - (Learning Rate x Partial Derivative w.r.t. Length of Service)
This update rule is applied to all of the parameters using their respective partial derivatives.
When we apply the update rule once, we have effectively gone over the dataset once, because each partial derivative sums over all the employees.
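Putting the pieces together, one such batch update could be sketched like this in Python (again a sketch under our running assumptions, not a definitive implementation):

# One gradient descent step over the whole dataset:
# accumulate the partial derivative of the SSE cost for each weight,
# then update every parameter simultaneously using the rule above.
def gradient_descent_step(weights, employees, actual_salaries, learning_rate):
    gradients = [0.0] * len(weights)
    for features, actual in zip(employees, actual_salaries):
        predicted = predict(features, weights)
        for j, x in enumerate(features):
            gradients[j] += 2 * (predicted - actual) * x
    return [w - learning_rate * g for w, g in zip(weights, gradients)]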
Here, the learning rate, also called the learning step, is the amount by which the parameters are updated during learning. The learning rate is a configurable hyperparameter, often chosen between 0.0 and 1.0, that defines the rate or speed at which the model learns.
Obviously, tuning this hyperparameter is of great importance in machine learning!
So one iteration means asking each employee those three questions only once (or going over the dataset once) and updating the parameter values accordingly.
After many iterations over the dataset, we stop when we reach a point where the cost is low enough for us to decide that we can halt the algorithm and use the parameter values as they stand. Then, we can use those optimized values to predict new target values from new feature values.
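A complete (if simplified) training loop just repeats that step a fixed number of times. Here is a minimal sketch using the helpers above; the dataset, learning rate, and iteration count are made up for illustration:

# Hypothetical answers from three employees and their actual salaries.
employees = [[10, 9, -4], [6, 3, 2], [8, 12, 5]]
actual_salaries = [6800, 4000, 7500]

weights = [1000, 200, 500]   # our intuitive initial parameters
learning_rate = 0.0001

for _ in range(1000):
    weights = gradient_descent_step(weights, employees, actual_salaries, learning_rate)

# Predict the salary of an unseen employee with the learned parameters.
print(predict([7, 4, 1], weights))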
What do we mean by optimized here? Well, after we have found good parameter values for our three features, they can predict new target values with the lowest possible error.
Hence, we have optimized the parameters of our model. This is where the learning in machine learning actually happens: we learn the parameters that minimize our cost function.
At TAZI, we have a linear regression implementation that anyone, with or without prior coding knowledge, can use through an easy-to-understand interface.