LayerNorm is an operation that shifts and rescales a vector so that its mean is zero and its variance is one, then re-scales the result with two learned parameter vectors.
This lets the magnitudes of one vector (the activations) be rescaled to match another vector before the two are mixed.
When applied to a vector of length $n$, LayerNorm contributes $2n$ learned parameters to a model everywhere it is used: one scale and one bias per dimension.
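As a concrete illustration of that parameter count (the hidden size 768 here is just an example width, not something from the text):

```python
n = 768                       # example hidden size (assumption for illustration)
scale_params = n              # one learned scale per dimension
bias_params = n               # one learned bias per dimension
params_per_layernorm = scale_params + bias_params
print(params_per_layernorm)   # 2n = 1536 parameters per LayerNorm
```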
Mathematically

Given an input vector $x$ of dimension $n$, LayerNorm is:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where

- $\mu$ is the mean of the input vector
- $\sigma$ is the standard deviation of the input vector
- $\gamma$ and $\beta$ are the learned per-dimension scale and bias
- $\epsilon$ is a small numerical-stability constant
In plain terms: subtract the mean, divide by the standard deviation, then apply a learned elementwise scale and shift.
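The formula above can be sketched in a few lines of plain Python (a minimal illustration, not a production implementation; real frameworks vectorize this and learn `gamma` and `beta` by gradient descent):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mu = sum(x) / n                              # mean of the input vector
    var = sum((v - mu) ** 2 for v in x) / n      # (biased) variance
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4   # learned scale, typically initialized to ones
beta = [0.0] * 4    # learned bias, typically initialized to zeros
y = layer_norm(x, gamma, beta)
```

With `gamma` at ones and `beta` at zeros, the output is just the normalized input: it has mean zero and standard deviation (almost exactly) one.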