Logistic regression is an algorithm that computes the probability that an input observation belongs to a particular class. Given an input observation x, represented as a vector of features [x_{1}, x_{2}, x_{3}, …, x_{n}], the algorithm determines the probability that x is of class y, written *P*(y|x), where y is an element of a binary set (such as 0 or 1, positive or negative, true or false).

The probability *P*(y|x) is determined by using weights and a bias. A weight expresses how much a feature contributes to the classification: a positive weight pushes the prediction toward the positive class, a negative weight pushes it toward the negative class, and a weight near zero means the feature has little influence. Each feature in [x_{1}, x_{2}, …, x_{n}] is multiplied by its corresponding weight, and the sum of these products is added to the bias term. At this point, it starts to look like linear regression:

*z* = *w* · *x* + *b* = *w*_{1}*x*_{1} + *w*_{2}*x*_{2} + … + *w*_{n}*x*_{n} + *b*

However, this equation is not enough to produce a probability. Since the weights (and features) can be negative, *z* can take any real value, which cannot be interpreted as a probability. As the next step, logistic regression passes *z* through a sigmoid function, which maps any real number into the range (0, 1). The sigmoid function looks like this:

*σ*(*z*) = 1 / (1 + *e*^{−*z*})

Recall that logistic regression finds the probability of an observation being in a particular class. Thus, by convention, σ(z) gives the probability that x is of the positive class, or *P*(y=1|x). To get the probability that x is of the negative class, we simply subtract from 1:

P(y = 1 | x) = a

P(y = 0 | x) = 1 – a

A classification problem requires the output to be a class (0 or 1). Logistic regression uses a **decision boundary**: a threshold above which the computed probability is mapped to the positive class. For example, we can set the decision boundary at 0.8. This means that if the calculated probability P(y=1|x) is greater than 0.8, the algorithm outputs 1 (positive class); otherwise, it outputs 0 (negative class). (A common default threshold is 0.5.) We represent the output of the algorithm as ŷ (y-hat).
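The steps so far can be sketched in a few lines of Python. This is a minimal illustration, not a library API; the function names and the default 0.8 threshold simply mirror the text.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias, threshold=0.8):
    """Return (probability, predicted class) for one observation.

    `x` and `weights` are plain lists of floats; the default 0.8
    threshold mirrors the decision boundary used in the text.
    """
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    a = sigmoid(z)
    y_hat = 1 if a > threshold else 0
    return a, y_hat
```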

##### Sample Case

We’ll take the Iris dataset as an example and simplify the problem such that we are only looking for the likelihood that an observation is of class “setosa”, given the features: [sepal_length, sepal_width, petal_length, petal_width]. Suppose that we take one observation *x*, and after learning the weights and bias term (I’ll discuss how this learning takes place in the next section), we come up with the following:

| feature | description | value | weight *w* |
|---|---|---|---|
| x_{1} | sepal_length | 0.2 | 0.05 |
| x_{2} | sepal_width | 0.3 | 0.2 |
| x_{3} | petal_length | 0.2 | −1.0 |
| x_{4} | petal_width | 1.0 | 0.5 |

Let’s assume that bias *b* = 0.

From the figures above, we can calculate the probability that x is of class “setosa” (positive class):

*P*(*y*=1∣*x*) = *a* = *σ*(*w* · *x* + *b*)

= *σ*([0.05, 0.2, −1.0, 0.5] · [0.2, 0.3, 0.2, 1.0] + 0)

= *σ*(0.01 + 0.06 − 0.20 + 0.50) = *σ*(0.37)

= **0.59**

We then use this to calculate the probability that x is *not* of class “setosa” (negative class), by subtracting it from 1.

*P*(*y*=0∣*x*) = 1 − *P*(*y*=1∣*x*) = 1 − *a*

= 1 − 0.59 = **0.41**
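As a quick check, the arithmetic above can be reproduced in a few lines of Python (plain lists, no ML library; the values come from the table above):

```python
import math

# Features, learned weights, and bias from the table above
x = [0.2, 0.3, 0.2, 1.0]
w = [0.05, 0.2, -1.0, 0.5]
b = 0.0

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum plus bias
a = 1.0 / (1.0 + math.exp(-z))                 # sigmoid
```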

For logistic regression to be an effective tool for learning, it has to do two more things after calculating the probabilities. First, it must calculate the error in its prediction against the actual label. Logistic regression is a supervised learning method, which means that it is provided with the correct labels during training. Second, it must run an algorithm that adjusts the weights and bias to reduce the error the next time it makes a prediction. The first step involves a loss function, which we will see in the next section.

##### Calculating the Loss

The loss function (or cost function) expresses the difference between the prediction and the actual value. For a binary classification problem (1 or 0), we want a function that rewards a high predicted probability P(y=1|x) when the actual label is 1. We also want the function to penalize the prediction (i.e., yield a higher error) the farther the probability P(y=1|x) is from the actual label. Logistic regression uses the **cross-entropy loss function** to calculate the loss.

In information theory, the amount of information contained in an event can be quantified by the probability of it occurring. This is expressed by the following function I(x), the amount of information given x (using the natural logarithm):

*I*(*x*) = −log *P*(*x*)

In this sense, the more likely an event is to occur, the less information it carries; the less likely it is to occur, the more information it carries. We use the negative logarithm to ensure that I(x) is always non-negative (since the log of a number between 0 and 1 is negative).

| P(x) | I(x) | Interpretation |
|---|---|---|
| 100% or 1 | −log(1) = 0 | No information |
| 0.50 | −log(0.50) ≈ 0.69 | Some information |
| 0 | −log(0) = ∞ | Infinite information |
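The information function is easy to sketch in Python. A minimal version (natural logarithm, so values are in nats; the function name is mine):

```python
import math

def information(p):
    """Information content (surprisal) of an event with probability p,
    in nats (natural logarithm)."""
    return -math.log(p)

certain = information(1.0)    # a certain event carries no information
coin_flip = information(0.5)
rare = information(0.01)      # a rare event carries a lot of information
```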

Cross-entropy means measuring the information content of one event’s probabilities using the probability distribution of another event. We have two distributions: *y*, representing the actual label, and *a*, representing the predicted label. We can measure the information content of the predicted label probabilities weighted by the actual label’s probability distribution.

| | P(y=1) | P(y=0) |
|---|---|---|
| Actual label (y) probability distribution | y | 1 − y |
| Predicted label (a) information | −log(a) | −log(1 − a) |

From the values in the table, we derive the cross-entropy loss function L(*a*, *y*), which is the cross-entropy loss of *a* with respect to *y*. The leading negative sign ensures that the loss is always a non-negative number.

*L*(*a*, *y*) = − [*y log*(*a*) + (1 − *y*)*log*(1 − *a*)]
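L(*a*, *y*) translates directly into Python. A sketch (the function name is mine; it uses the natural logarithm, matching the equation above):

```python
import math

def cross_entropy_loss(a, y):
    """Cross-entropy loss between the predicted probability `a` of the
    positive class and the actual label `y` (0 or 1), using the
    natural logarithm as in L(a, y) above."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))
```

For a confident, correct prediction (say a = 0.9 when y = 1) the loss is small; for the same prediction when y = 0 it is large.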

Let’s use a sample case to understand how the function works. Let *y* be the actual label of x for the class “setosa”; it can only be 1 or 0. Let *a* be the predicted probability that x is of class “setosa”. To measure the cross-entropy loss, we weight the information content of the prediction by the probabilities in *y*.

In our previous sample case, P(y=1|x) = **0.59** and P(y=0|x) = **0.41**. Let’s say that the observation actually is “setosa”, or y=1. We plug the values into the table as such:

| Case 1 | P(y=1) | P(y=0) |
|---|---|---|
| Actual label (y) probability distribution | 1 | 0 |
| Predicted label (a) information | −log(0.59) | −log(0.41) |

Applying the formula to the values in the table, we get the cross-entropy loss of *a* from *y* as −[1 × log(0.59) + 0 × log(0.41)] = −(−0.53) = **0.53**. The loss is therefore **0.53**.

If, instead, the observation was actually *not* “setosa”, or y=0, the table would look like this:

| Case 2 | P(y=1) | P(y=0) |
|---|---|---|
| Actual label (y) probability distribution | 0 | 1 |
| Predicted label (a) information | −log(0.59) | −log(0.41) |

Applying the formula, we get a cross-entropy loss of −[0 × log(0.59) + 1 × log(0.41)] = −(−0.89) = **0.89**, which is noticeably higher than the loss when the actual label was 1.

We can see that the cross-entropy function is convenient as it satisfies the requirements stated initially for the loss function:

- In Case 1, since the actual label is 1, the term for the opposite label P(y=0) is zeroed out, so the loss depends only on the predicted probability of the correct label P(y=1). Similarly in Case 2: the term for the opposite label P(y=1) is zeroed out, leaving only the predicted probability of the correct label P(y=0).
- The incorrect prediction is penalized with higher loss. In Case 1, the actual value is 1 and the predicted probability 0.59 is closer to 1, so the loss is not so large. However, in Case 2, when the actual value is 0, the prediction 0.59 is farther from 0, so the loss is much larger.
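Both properties can be checked numerically. This standalone sketch redefines the loss locally and shows it growing as the prediction drifts away from the actual label:

```python
import math

def cross_entropy(a, y):
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# When the actual label is 1, loss grows as the prediction moves away from 1
losses_y1 = [cross_entropy(a, 1) for a in (0.9, 0.5, 0.1)]
# When the actual label is 0, loss grows as the prediction moves away from 0
losses_y0 = [cross_entropy(a, 0) for a in (0.1, 0.5, 0.9)]
```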

##### Learning the Weights and Biases Based on the Loss

So far, we have calculated the probability *a* and measured its difference from the actual label *y* using the loss function L(*a*, *y*).

The “learning” aspect of logistic regression is the algorithm that adjusts the weights and bias to reduce the loss the next time a prediction is made. This is called the optimization algorithm, and the most common one used with logistic regression is **stochastic gradient descent**. It works backwards from the loss, taking the derivative of the loss with respect to each intermediate quantity in the computation.

First, we get the derivative of the loss *L* with respect to *a*. From there, we can compute the derivative of the loss with respect to *z*. Feel free to derive the equations if you have some background in differential calculus; for simplicity, I’ll head straight to the results:

*da* = d*L*/d*a* = −*y*/*a* + (1 − *y*)/(1 − *a*)

*dz* = d*L*/d*z* = *a* − *y*

We can then get the derivative of the loss with respect to each weight *w* and the bias *b*:

*dw*_{i} = *x*_{i} · *dz*

*db* = *dz*

Since our example has four features and four corresponding weights, we compute the derivative for each weight.

a = 0.59

*dz* = a − y = 0.59 − 1 = −0.41

*dw*_{1} = 0.2 × (−0.41) = −0.082

*dw*_{2} = 0.3 × (−0.41) = −0.123

*dw*_{3} = 0.2 × (−0.41) = −0.082

*dw*_{4} = 1.0 × (−0.41) = −0.41

*db* = *dz* = −0.41

We have just calculated the *gradients*. To adjust each weight, we subtract its gradient, scaled by a learning rate *α*, from the current value. The learning rate (here *α* = 0.2) controls how far we descend on each step and helps avoid overshooting the optimum.

w_{1} = 0.05 − (0.2)(−0.082) = 0.0664

w_{2} = 0.2 − (0.2)(−0.123) = 0.2246

w_{3} = −1.0 − (0.2)(−0.082) = −0.9836

w_{4} = 0.5 − (0.2)(−0.41) = 0.582

b = 0 − (0.2)(−0.41) = 0.082

Thus, the new weights will be [0.0664, 0.2246, −0.9836, 0.582] and the new bias will be 0.082. These will be used on the next iteration of prediction.
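The whole update can be sketched as a single function (a minimal single-sample gradient step; the name `sgd_step` is mine, not a library API):

```python
import math

def sgd_step(x, w, b, y, lr=0.2):
    """One gradient-descent update for logistic regression on a single
    observation. Returns the updated weights and bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = 1.0 / (1.0 + math.exp(-z))      # predicted probability
    dz = a - y                          # dL/dz
    dw = [xi * dz for xi in x]          # dL/dw_i
    db = dz                             # dL/db
    w_new = [wi - lr * dwi for wi, dwi in zip(w, dw)]
    b_new = b - lr * db
    return w_new, b_new

# The sample case: features, weights, bias = 0, actual label y = 1
w_new, b_new = sgd_step([0.2, 0.3, 0.2, 1.0], [0.05, 0.2, -1.0, 0.5], 0.0, 1)
```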

So far, I’ve presented a case with only one data sample. What if there were *m* examples? In real-world training with logistic regression, the dataset is typically divided into *m* batches of *n* samples each. For each batch *m*_{i}, the algorithm is applied *n* times while accumulating the loss and the gradients *dw* and *db*. The averages of the gradients are used to adjust the weights and bias, which are then passed on to the next batch *m*_{i+1}. The algorithm would look like this:

```
### Pseudocode ###
def logistic_regression(train_x, train_y, weights, bias, learning_rate, m, n):
    for i = 1 to m:                     # loop over batches
        X = train_x[i]                  # the i-th batch of observations
        Y = train_y[i]                  # the i-th batch of labels
        loss = 0
        dw = 0
        db = 0
        for j = 1 to n:                 # loop over samples in the batch
            z = weights . X[j] + bias   # . means dot product
            a = sigmoid(z)
            loss += -[Y[j]*log(a) + (1 - Y[j])*log(1 - a)]
            dz = a - Y[j]
            dw += X[j] * dz             # accumulate dw over the batch
            db += dz                    # accumulate db over the batch
        avg_loss = loss / n
        dw_avg = dw / n
        db_avg = db / n
        # Adjust the weights and bias using the averaged gradients
        weights = weights - (learning_rate * dw_avg)
        bias = bias - (learning_rate * db_avg)
```
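For reference, here is a runnable (if simplistic) Python version of the same idea, trained full-batch on a toy one-dimensional dataset. All names and the toy data are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, n_features, learning_rate=1.0, epochs=2000):
    """Full-batch gradient descent for logistic regression on plain lists.
    Returns the learned weights and bias."""
    w = [0.0] * n_features
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        dw = [0.0] * n_features
        db = 0.0
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            a = sigmoid(z)
            dz = a - y
            for k in range(n_features):
                dw[k] += x[k] * dz
            db += dz
        # average the gradients over the batch, then take one step
        w = [wi - learning_rate * dwi / n for wi, dwi in zip(w, dw)]
        b -= learning_rate * db / n
    return w, b

# Toy data: the class is 1 when the single feature exceeds 0.5
xs = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys, n_features=1)
```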