
Some unifying notation: Probabilistic Classifiers

Table of contents
  1. Notation
  2. Setting
  3. Definitions

One big challenge while learning, well… anything really, lies in demystifying jargon. And in a field like Machine Learning, which has roots in mathematics, statistical theory, probability theory (Bayesian and frequentist) and computer science, this becomes even harder. Each sub-field comes with its own jargon even when talking about the exact same thing!

Here’s a starting point to unify some notation and definitions when dealing with Probabilistic Classifiers.

Notation

  • \(\mathbf{x} \in \mathbb{R}^d\) : feature inputs or covariates (usually high dimensional), described by the random variable \(X\)
  • \(y \in \{1,2,...,C\}\) : classes (labels), described by the random variable \(Y\)

  • Each instance of data is drawn from a joint probability distribution \(p^*(X,Y)\)

  • \(p^*(\mathbf{x},y)\) : True distribution of \(\mathbf{x}\) and \(y\). Also denoted by \(p^*(X,Y)\) or \(p^*(X= \mathbf{x}, Y = y)\)

  • \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n = 1}^{N}\) : Dataset \(\mathcal{D}\) with \(N\) i.i.d. (independent and identically distributed) samples from \(p^*\) (see the sketch below Figure 1)

Figure 1: Common probabilistic classification notation
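
To make the notation concrete, here is a minimal NumPy sketch of drawing a dataset \(\mathcal{D}\) of \(N\) i.i.d. pairs from a synthetic \(p^*(X,Y)\). The specific prior and Gaussian class-conditionals are illustrative assumptions, not part of the notation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, C = 1000, 2, 3  # samples, feature dimension, number of classes

# A synthetic p*(x, y): sample y from a prior p*(Y), then x from a
# class-conditional Gaussian p*(X | Y). Classes are indexed 0..C-1
# here for convenience (the post uses 1..C).
prior = np.array([0.5, 0.3, 0.2])            # p*(Y = c)
means = np.array([[0.0, 0.0],
                  [3.0, 0.0],
                  [0.0, 3.0]])               # per-class means in R^d

y = rng.choice(C, size=N, p=prior)           # y_n ~ p*(Y)
x = means[y] + rng.standard_normal((N, d))   # x_n ~ p*(X | Y = y_n)

D = list(zip(x, y))  # the dataset D: N i.i.d. (x_n, y_n) pairs
```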

Setting

  • Classification — the task of assigning a class to a given instance of data, defined by a set of features
  • Probabilistic Classification — the stricter task of assigning, to a given instance (its features), a probability of belonging to each possible class. These probabilities indicate the “confidence” that each class is correct for the given instance

    Note: Below, we refer to probabilistic classification simply as classification unless explicitly specified otherwise

  • \(C\)-class Classification problem setting:
    • true distribution is assumed to be a discrete distribution over \(C\) classes
    • observed \(y\) is a sample from conditional distribution \(p^*(y \vert \mathbf{x})\) or \(p^*(Y \vert X=\mathbf{x})\)
  • Neural networks (discriminative classifiers) try to estimate \(p_\theta(y \vert \mathbf{x})\) by fitting \(\theta\) using \(\mathcal{D}\) (the training dataset); see the sketch below this list
    • During deployment, the NN is evaluated using a dataset \(\mathcal{T}\), sampled from a distribution \(q(\mathbf{x},y)\) or \(q(X,Y)\)
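
As a toy stand-in for a neural network, here is a hedged sketch of fitting \(p_\theta(y \vert \mathbf{x})\) on \(\mathcal{D}\) with scikit-learn’s multinomial logistic regression, which is itself a one-layer discriminative classifier with a softmax output. It reuses the synthetic `x`, `y` arrays from the sketch after Figure 1.

```python
from sklearn.linear_model import LogisticRegression

# Fit theta by maximum likelihood on D. Multinomial logistic
# regression is the simplest discriminative model of p_theta(y | x).
clf = LogisticRegression(max_iter=1000)
clf.fit(x, y)

# p_theta(y | x) for a new instance: a length-C vector of
# probabilities that sums to 1.
x_new = [[2.5, 0.5]]
p = clf.predict_proba(x_new)[0]  # e.g. array([p_1, p_2, p_3])
```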

Figure 2: A neural network with a softmax classifier

Definitions

  • Logits: the activations from the last layer of the NN are termed “logits”, \(z(\mathbf{x}) \in \mathbb{R}^C\)
    • They are fed into a function that converts the \(C\) logits into \(C\) probabilities, \(p_i\)
    • Let’s assume softmax (Fig 2) here.

For other such functions, check out my post on sigmoid vs softmax.

  • Confidence: the maximum \(p_i\) is the “confidence” associated with the prediction
  • Prediction: the class corresponding to the maximum \(p_i\) is the prediction (see the sketch below)
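
Here is a minimal sketch tying logits, confidence, and prediction together; the logit values below are made-up numbers standing in for the last-layer activations \(z(\mathbf{x})\) of some trained network.

```python
import numpy as np

def softmax(z):
    """Map C logits z(x) to C probabilities p_i (numerically stable)."""
    z = z - np.max(z)  # shifting logits leaves softmax unchanged
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])  # illustrative logits for C = 3 classes
p = softmax(z)                  # probabilities; p.sum() == 1

confidence = p.max()     # max_i p_i, the model's confidence
prediction = p.argmax()  # argmax_i p_i, the predicted class
```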

  • Components of the joint probability distribution \(p(X, Y)\):
    • Evidence: \(p(X)\)
    • Likelihood: \(p(X \vert Y)\)
    • Prior: \(p(Y)\)
    • Posterior: \(p(Y \vert X)\)
\[p(Y=y_i|X) = \frac{p(X|Y=y_i)\, p(Y=y_i)}{p(X)} = \frac{p(X|Y=y_i)\, p(Y=y_i)}{\sum_{j=1}^{C} p(X|Y=y_j)\, p(Y=y_j)}\]
  • Bayes’ optimal classifier: A classifier that predicts the true posterior probability distribution, \(p(Y \vert X=\mathbf{x})\), for every input instance \(\mathbf{x}\) is a Bayes’ optimal classifier (a toy numeric example follows)
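
A tiny worked example of the posterior computation above, with made-up numbers for a 2-class problem and a single observed \(\mathbf{x}\):

```python
import numpy as np

prior = np.array([0.7, 0.3])       # p(Y = y_j)
likelihood = np.array([0.2, 0.9])  # p(X = x | Y = y_j), made-up values

evidence = (likelihood * prior).sum()      # p(X = x), the normalizer
posterior = likelihood * prior / evidence  # p(Y = y_j | X = x)

# posterior ≈ [0.341, 0.659]: a Bayes' optimal classifier would output
# exactly this vector and predict the second class, with confidence 0.659.
```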


If you want to use parts of the text or any of the figures, or share the article, please cite it as:

@article{ nanbhas2020NNnotation,
  title   = "Some unifying notation: Probabilistic Classifiers",
  author  = "Bhaskhar, Nandita",
  journal = "Blog: Roots of my Equation (web.stanford.edu/~nanbhas/blog/)",
  year    = "2020",
  url     = "https://web.stanford.edu/~nanbhas/blog/some-unifying-notation/"
}