
Interpreting logits: Sigmoid vs Softmax

Table of contents
  1. The humble sigmoid
  2. Binary Classification
  3. Multi-class classification
  4. The mighty softmax
  5. Convergence
  6. More than one class?
  7. PyTorch Implementation

Neural networks produce raw output scores for each of the classes (Fig 1). Recall that in the case of a probabilistic classifier (for definitions, notation and problem setup, check out my other post on some unifying notation), we place priors on the parameters of the model and obtain the posterior distribution over the classes.

But what do these scores indicate? How do we interpret them? Class A has a score of \(5.0\) while Class B has \(-2.1\). There’s no clear way to understand how these scores translate back to the original problem, i.e. which class the given input (or data instance) belongs to. One common course of action is to say that input \(\mathbf{x}\) belongs to the class with the highest raw output score.

Sure! This works. Continuing with the example from before, Class A is the right class then.

But wait a second, what if Class B had a score of \(4.999\) instead? Or \(-200.0\)? Is it exactly the same situation as before, since Class A is the “right” answer in all cases?

Instead of relying on ad-hoc rules and metrics to interpret the output scores (also known as logits or \(z(\mathbf{x})\); see the blog post some unifying notation), a better method is to convert these scores into probabilities! Probabilities come with ready-to-use interpretability: if the output probability score of Class A is \(0.7\), it means that with \(70\%\) confidence, the “right” class for the given data instance is Class A.

Great! Now how do we convert output scores into probabilities?

The humble sigmoid

Enter the sigmoid function \(\sigma: \mathbb{R}\to [0,1]\)

\[\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}\]

This is a mathematical function that converts any real-valued scalar to a point in the interval \([0,1]\). How is this a probability score?

Remember that for a value \(p\) to be the probability score for an event \(E\):

  • \(p \geq 0\) and \(p \leq 1\)
  • \(\mathbf{prob}(E^c) = 1-p\) where \(E^c\) is the complement of \(E\)

Does the sigmoid satisfy the above properties in the scenario we care about? Yes. The first condition is easy: \(\sigma(z) \geq 0\) and \(\sigma(z) \leq 1\) by its mathematical definition. The second condition is a little trickier, since we need to define what \(E\) and \(E^c\) are. This depends on our scenario.
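
As a quick numerical sanity check of the first condition, here is a minimal sketch; the scores below (including the extreme \(-200.0\)) are arbitrary illustrations:

import torch

# Arbitrary raw scores, including very large and very small ones
scores = torch.tensor([-200.0, -2.1, 0.0, 4.999, 5.0])
print(torch.sigmoid(scores))
# ≈ tensor([0.0000, 0.1091, 0.5000, 0.9933, 0.9933]) -- always within [0, 1]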

Binary Classification

In a binary classification setting, where the two classes are Class A (also called the positive class) and Not Class A (the complement of Class A, also called the negative class), we have a clear-cut definition of \(E\) and \(E^c\), and the sigmoid can now be interpreted as a probability.

Thus, \(\sigma (z(\mathbf{x}) )\) is the probability that \(\mathbf{x}\) belongs to the positive class and \(1 - \sigma(z(\mathbf{x}))\) is the probability that \(\mathbf{x}\) belongs to the negative class.

Note that the negative class is the complement of the positive class; thus the two are mutually exclusive and exhaustive, i.e. an input instance can belong to either class but not both, and their probabilities sum to \(1\). The output prediction is simply the class with the larger confidence (probability). Or, in other words, threshold the output (typically at \(0.5\)) and pick the class that beats the threshold.
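
Here is a minimal PyTorch sketch of this binary setup; the single-logit values below are hypothetical stand-ins for \(z(\mathbf{x})\), not outputs of a trained network:

import torch

# Hypothetical logits z(x) for a batch of 4 inputs
logits = torch.tensor([3.2, -1.7, 0.05, -6.0])

probs_positive = torch.sigmoid(logits)        # prob(x belongs to the positive class)
probs_negative = 1.0 - probs_positive         # prob(x belongs to the negative class)

# Thresholding at 0.5 is equivalent to picking the class with the larger probability
predictions = (probs_positive >= 0.5).long()  # 1 = positive class, 0 = negative class
print(probs_positive)                         # ≈ tensor([0.9608, 0.1545, 0.5125, 0.0025])
print(predictions)                            # tensor([1, 0, 1, 0])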

Awesome! Are we done then? Not quite.

Figure 1: Binary classification using a sigmoid

Multi-class classification

What happens in a multi-class classification problem with \(C\) classes? How do we convert the raw logits to probabilities? If only there were a vector extension of the sigmoid …

Oh wait, there is!

The mighty softmax

Presenting the softmax function \(S:\mathbb{R}^C \to {[0,1]}^C\)

\[S(\mathbf{z})_i = \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^C e^{\mathbf{z}_j}} = \frac{e^{\mathbf{z}_i}}{ e^{\mathbf{z}_1} + \dots + e^{\mathbf{z}_i} + \dots + e^{\mathbf{z}_C}}\]

This function takes a vector of real values and converts each entry into a corresponding probability. In a \(C\)-class classification problem, where \(k \in \{1,2,...,C\}\), it naturally lends itself to the interpretation

\[\mathbf{prob}(y=k|\mathbf{x}) = \frac{e^{\mathbf{z}(\mathbf{x})_k}}{\sum_{j=1}^C e^{\mathbf{z}(\mathbf{x})_j}}\]

Again, as in the case of the sigmoid above, the classes are considered mutually exclusive and exhaustive, i.e. an input instance can belong to only one of these classes, not more, and their probabilities sum to \(1\). The output prediction is again simply the class with the largest confidence.
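
As a concrete sketch, here is the conversion in PyTorch for \(C = 3\); the logits below are made-up illustrations:

import torch

# Hypothetical logits z(x) for a batch of 2 inputs over C = 3 classes
logits = torch.tensor([[5.0, -2.1, 1.3],
                       [0.2,  0.1, 0.3]])

probs = torch.softmax(logits, dim=1)   # softmax over the class dimension
print(probs)                           # each row is a probability distribution
print(probs.sum(dim=1))                # tensor([1., 1.])
print(probs.argmax(dim=1))             # predicted class = largest probability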

Figure 2: Multi-class classification using a softmax

Convergence

Note that when \(C = 2\) and one of the logits is fixed to \(0\), the softmax is identical to the sigmoid.

\[\mathbf{z}(\mathbf{x}) = [z, 0]\]
\[S(\mathbf{z})_1 = \frac{e^z}{e^z + e^0} = \frac{e^z}{e^z + 1} = \sigma(z)\]
\[S(\mathbf{z})_2 = \frac{e^0}{e^z + e^0} = \frac{1}{e^z + 1} = 1-\sigma(z)\]
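
This equivalence is easy to verify numerically; a small sketch with an arbitrary scalar logit:

import torch

z = torch.tensor(1.7)                              # arbitrary scalar logit
two_class_logits = torch.stack([z, torch.tensor(0.0)])

softmax_scores = torch.softmax(two_class_logits, dim=0)
print(softmax_scores[0], torch.sigmoid(z))         # both ≈ 0.8455
print(softmax_scores[1], 1 - torch.sigmoid(z))     # both ≈ 0.1545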


Perfect! We found an easy way to convert raw scores into probabilities, both in a binary classification and a multi-class classification setting.

We must be done then, right? Nope.

More than one class?

What if input data can belong to more than one class in a multi-class classification problem? For instance, genre classification of movies (a movie can fall into multiple genres) or classification of chest X-rays (a given chest X-ray can show more than one disease). Such problems are referred to as multi-label classification problems. In these settings, the classes are NOT mutually exclusive.

The most common approach to modelling such problems is to transform them into independent binary classification problems, i.e. train a binary classifier independently for each class. This can be done easily by applying the sigmoid function to each of the raw scores. Note that the output probabilities will NOT sum to \(1\). The output predictions are those classes that beat a probability threshold.
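
A minimal multi-label sketch (the four genre logits below are made up, and \(0.5\) is just one possible threshold):

import torch

# Hypothetical logits for one movie over 4 genres: [action, comedy, drama, horror]
logits = torch.tensor([2.3, -0.4, 1.1, -3.0])

probs = torch.sigmoid(logits)   # independent per-class probabilities
print(probs)                    # ≈ tensor([0.9089, 0.4013, 0.7503, 0.0474])
print(probs.sum())              # ≈ 2.1079 -- does NOT sum to 1
print(probs >= 0.5)             # predicted labels: every class above the threshold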

Figure 3: Multi-label classification using multiple sigmoids

PyTorch Implementation

Here’s how to get the sigmoid scores and the softmax scores in PyTorch. Note that sigmoid scores are element-wise, while softmax scores depend on the specified dimension.

The following classes will be useful for computing the loss during optimization:

  • torch.nn.BCELoss takes sigmoid probabilities as inputs
  • torch.nn.BCEWithLogitsLoss takes logits as inputs
  • torch.nn.CrossEntropyLoss takes logits as inputs (it applies log_softmax internally)
  • torch.nn.NLLLoss is like cross entropy but takes log-probabilities (log-softmax values) as inputs

A small sketch checking these correspondences follows the example output below.

import torch

def getSoftmaxScores(inputs, dimen):
	''' Get the softmax scores '''
	print('---Softmax---')
	print('---Dim = ' + str(dimen) + '---')
	softmaxFunc = torch.nn.Softmax(dim = dimen)
	softmaxScores = softmaxFunc(inputs)
	print('Softmax Scores: \n', softmaxScores)
	sums_0 = torch.sum(softmaxScores, dim=0)
	sums_1 = torch.sum(softmaxScores, dim=1)
	print('Sum over dimension 0: \n', sums_0)
	print('Sum over dimension 1: \n', sums_1)

def getSigmoidScores(inputs):
	''' Get the sigmoid scores: they are element-wise '''
	print('---Sigmoid---')
	sigmoidScores = torch.sigmoid(inputs)
	print('Sigmoid Scores: \n', sigmoidScores)

logits = torch.randn(2, 3)*10 - 5   # random logits for a batch of 2 instances over 3 classes
print('Logits: ', logits)
getSigmoidScores(logits)
getSoftmaxScores(logits, 0)
getSoftmaxScores(logits, 1)
# Output
Logits:  tensor([[-10.6383,  -3.3302,  10.4895],
        [ -7.0935,  10.9497, -24.0366]])
---Sigmoid---
Sigmoid Scores: 
 tensor([[2.3979e-05, 3.4550e-02, 9.9997e-01],
        [8.2977e-04, 9.9998e-01, 3.6393e-11]])
---Softmax---
---Dim = 0---
Softmax Scores: 
 tensor([[2.8065e-02, 6.2851e-07, 1.0000e+00],
        [9.7194e-01, 1.0000e+00, 1.0127e-15]])
Sum over dimension 0: 
 tensor([1., 1., 1.])
Sum over dimension 1: 
 tensor([1.0281, 1.9719])
---Softmax---
---Dim = 1---
Softmax Scores: 
 tensor([[6.6729e-10, 9.9584e-07, 1.0000e+00],
        [1.4585e-08, 1.0000e+00, 6.3917e-16]])
Sum over dimension 0: 
 tensor([1.5252e-08, 1.0000e+00, 1.0000e+00])
Sum over dimension 1: 
 tensor([1., 1.])
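
To make the correspondence between those loss classes concrete, here is a small sketch (random logits and targets, purely for illustration) checking that the “with-logits” versions match applying the activation manually:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                       # batch of 4 instances, C = 3 classes

# Binary / multi-label style targets in {0, 1}
binary_targets = torch.randint(0, 2, (4, 3)).float()
loss_a = torch.nn.BCEWithLogitsLoss()(logits, binary_targets)
loss_b = torch.nn.BCELoss()(torch.sigmoid(logits), binary_targets)
print(loss_a, loss_b)                            # same value

# Multi-class targets: one class index per instance
class_targets = torch.randint(0, 3, (4,))
loss_c = torch.nn.CrossEntropyLoss()(logits, class_targets)
loss_d = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), class_targets)
print(loss_c, loss_d)                            # same value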


If you want to use parts of the text, any of the figures or share the article, please cite it as:

@article{ nanbhas2020sigmoidsoftmax,
  title   = "Interpreting logits: Sigmoid vs Softmax",
  author  = "Bhaskhar, Nandita",
  journal = "Blog: Roots of my Equation (web.stanford.edu/~nanbhas/blog/)",
  year    = "2020",
  url     = "https://web.stanford.edu/~nanbhas/blog/sigmoid-softmax/"
}