TL;DR: This post explores methods for calculating lower and upper bounds on the softmax function’s output and discusses why this is crucial for neural network verification.
Many neural network architectures rely on the softmax function, whether in the attention layers of transformers or simply as an output layer that produces a probability-like output. The field of neural network verification has largely stayed away from such networks, in part because of the difficulties the softmax raises. Verification methods often use a bound propagation procedure in which bounds over the input of the network are propagated through each layer. Because the naive bounds on the softmax are relatively loose, the bounds become unusable after only a few layers as the looseness compounds. However, with softmax-based architectures becoming more prevalent, there has been work on tightening these bounds. Here I summarise three papers that formulate tighter bounds for the softmax, alongside a brief overview of the softmax and its uses.
Two crucial properties of the softmax function make it useful in neural networks. The first is that it transforms its input (e.g. the logits of a classifier) into a probability-like distribution: the outputs are non-negative and sum to one, so they can be interpreted as a distribution over the classes. The second is that it is differentiable, which is essential for backpropagation: gradients can be calculated through it so that weights can be updated during training.
For input $x \in \mathbb{R}^{K}$, the output $p$ of the softmax is given by:
\[p_j = \frac{e^{x_j}}{\sum_{j'=1}^{K} e^{x_{j'}}} = \frac{1}{1 + \sum_{j' \neq j} e^{x_{j'} - x_j}}, \quad j = 1, \ldots, K.\]

Interactive visualisations in 1D and 2D [1] highlight interesting properties of the softmax function. For example, adding a constant to each input does not change the output probabilities, and scaling the inputs affects the ‘spread’ or ‘temperature’ of the probabilities.
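To make these properties concrete, here is a minimal NumPy sketch of the softmax; the `temperature` argument is just an illustrative knob for scaling the inputs, not something taken from the papers discussed below.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax; `temperature` scales the inputs before normalising."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()            # shift invariance: subtracting a constant leaves the output unchanged
    e = np.exp(z)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))                     # ~[0.090, 0.245, 0.665]
print(softmax(x + 10.0))              # identical: adding a constant does not change the output
print(softmax(x, temperature=0.5))    # sharper distribution (more probability on the largest input)
print(softmax(x, temperature=5.0))    # flatter distribution
```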
A notable application of the softmax function, denoted as $\sigma$, is in the attention mechanism of transformers, where it helps model interactions between different elements of the input:[2]
\[\sigma \left( \frac{X W_Q W_K^{\top} X^{\top}}{\sqrt{d_k}} \right) X W_V\]
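As a rough illustration of where the softmax sits in this expression, below is a toy single-head attention computation in NumPy. The dimensions and random weight matrices are made up for the example; a real transformer layer adds biases, multiple heads, masking, and so on.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # numerical stability via shift invariance
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes only: sequence length n, model width d, key/value width d_k.
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)   # pairwise interaction scores
attention = softmax(scores, axis=-1)              # each row is a probability distribution over positions
output = attention @ (X @ W_V)                    # weighted combination of the value vectors
```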
Approach 1: Composite Operations [3]

Shi et al. (2019) decompose the softmax into constituent operations that can each be linearly bounded in sequence, with the bounds of each operation passed on to the next. Specifically, working from the second form of the softmax given above, they bound the pairwise differences $x_{j'} - x_j$, the exponential, the summation, and finally the reciprocal, composing the resulting linear bounds.
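To give a feel for the composite idea, here is a rough interval-arithmetic sketch that walks through the second form of the softmax (differences, exponentials, sum, reciprocal) one step at a time. It is only a stand-in for the real method: Shi et al. propagate linear bounds through each step, which is considerably tighter than the plain intervals used here, and the function name and example inputs are my own illustration.

```python
import numpy as np

def softmax_bounds_interval(l, u, j):
    """Crude interval bounds on softmax output j, given elementwise input bounds l <= x <= u."""
    l, u = np.asarray(l, float), np.asarray(u, float)
    others = [k for k in range(len(l)) if k != j]

    # Step 1: bound the differences x_k - x_j for k != j.
    diff_lo = l[others] - u[j]
    diff_hi = u[others] - l[j]

    # Step 2: the exponential is monotone increasing, so bounds pass straight through.
    exp_lo, exp_hi = np.exp(diff_lo), np.exp(diff_hi)

    # Step 3: bound the denominator 1 + sum_k exp(x_k - x_j).
    sum_lo, sum_hi = 1.0 + exp_lo.sum(), 1.0 + exp_hi.sum()

    # Step 4: the reciprocal is monotone decreasing on positive inputs.
    return 1.0 / sum_hi, 1.0 / sum_lo

# Small perturbation around x = [1, 2, 3]; the true value ~0.665 lies inside the bounds.
print(softmax_bounds_interval([0.9, 1.9, 2.9], [1.1, 2.1, 3.1], j=2))   # ~(0.62, 0.71)
```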
Approach 2: Log-Sum-Exp Reformulation [4]
Wei et al. (2023) reformulate the previous decomposition by instead taking the logarithm of the softmax and then exponentiating it, bounding the resulting operations compositely as before.
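Concretely, the reformulation rests on writing the softmax output through the log-sum-exp function:

\[\log p_j = x_j - \mathrm{LSE}(x), \qquad \mathrm{LSE}(x) = \log \sum_{j'=1}^{K} e^{x_{j'}}, \qquad p_j = \exp\big(x_j - \mathrm{LSE}(x)\big).\]

Roughly speaking, since $\mathrm{LSE}$ is convex and $\exp$ is convex and increasing, bounds on $\mathrm{LSE}(x)$ over the input region translate into bounds on $p_j$; this is what makes the reformulated composition convenient to bound.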
Shown below are the bounds on the softmax (with two inputs) for the naive linear bounds, the exponential-reciprocal bounds (Shi et al.), and the log-sum-exp bounds (Wei et al.).
In their experiments they use inputs whose softmax outputs concentrate differing amounts of probability on a single component. The left/right columns in the figure below refer to the upper/lower bounds on the softmax output, whilst the top/bottom rows correspond to the case where the output shown is the component with high/low concentration.
Approach 3: Accounting for Dependencies [5]
This approach is based on the insight that prior works ignore dependencies among the inputs of the softmax. The authors call it the $n$-dimensional relation approach and claim it achieves certified radii 3.24 times larger than the state of the art. Rather than decomposing the softmax into simple functions and relaxing them sequentially, which they argue lets errors in earlier relaxations adversely affect subsequent stages, they adopt one-stage relaxations that treat the softmax as an $n$-dimensional function without decomposition.
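As a generic statement of the linear-relaxation goal (my paraphrase, not the paper's exact formulation), for an input region such as a box $[l, u]$ the aim is to find linear functions that sandwich each softmax output directly:

\[\underline{a}_j^{\top} x + \underline{b}_j \;\le\; \sigma(x)_j \;\le\; \overline{a}_j^{\top} x + \overline{b}_j \quad \text{for all } x \in [l, u],\]

with the coefficients computed in a single stage over the full $n$-dimensional input rather than by chaining relaxations of intermediate operations.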
By providing tighter bounds, these methods may allow bound-propagation-based verification methods to handle the softmax function, and hence to be applied to popular architectures such as transformers.
[1] Elliot Whaite, Softmax Function Explained In Depth with 3D Visuals, YouTube.
[2] Vaswani et al., Attention Is All You Need, NeurIPS, 2017.
[3] Shi, Z., Zhang, H., Chang, K.-W., Huang, M., & Hsieh, C.-J., Robustness Verification for Transformers, International Conference on Learning Representations, 2019.
[4] Wei et al., Convex Bounds on the Softmax Function with Applications to Robustness Verification, Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
[5] Zhang, Y., Shen, L., Guo, S., & Ji, S., GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers, Proceedings of the AAAI Conference on Artificial Intelligence, 38(19), 2024.