This article is a translation of "Numerical Stability in TensorFlow". If there is any infringement, please contact me to have it removed; it is intended for academic exchange only, not commercial use. If there are any errors, please contact me and point them out.

When using any numerical computation library, such as NumPy or TensorFlow, it is worth remembering that writing mathematically correct code does not guarantee a correct result. You also need to make sure the whole computation is numerically stable.

Let’s start with an example. In elementary school we learned that, for any non-zero number y, x times y divided by y equals x. But let’s see whether this holds in practice:

import numpy as np

x = np.float32(1)

y = np.float32(1e-50)  # y would be stored as zero
z = x * y / y

print(z)  # prints nan

The reason for the error is that the value of y is too small to be represented as a float32, so it is stored as zero, and the computation becomes 0 / 0 = nan. A similar problem occurs when y is too large:

y = np.float32(1e39)  # y would be stored as inf
z = x * y / y

print(z)  # prints nan

The smallest positive value that the float32 type can represent is 1.4013e-45; anything below this value will be stored as zero. In addition, any number beyond 3.40282e+38 will be stored as inf.

print(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45
print(np.finfo(np.float32).max)  # prints 3.40282e+38

To ensure the stability of your computations, you should avoid values with very small or very large absolute values. This may sound obvious, but these problems can make your programs very hard to debug, especially when doing gradient descent in TensorFlow. This is because you need to make sure that all values are within the valid range of the data type not only in the forward pass, but also in the backward pass (when the gradients are computed).
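For instance, a function can produce a perfectly representable value in the forward pass while its gradient overflows in the backward pass. Here is a quick illustrative sketch using tf.sqrt at zero:

import tensorflow as tf

x = tf.constant(0.0)
y = tf.sqrt(x)           # forward pass: sqrt(0) = 0, a perfectly valid float32
g = tf.gradients(y, x)   # backward pass: d/dx sqrt(x) = 1 / (2 * sqrt(x)) = inf at x = 0

print(tf.Session().run([y, g]))  # prints [0.0, [inf]]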

Let’s look at a real example. We want to compute the softmax over a vector of logits. A naive implementation would look like this:

import tensorflow as tf

def unstable_softmax(logits):
    exp = tf.exp(logits)
    return exp / tf.reduce_sum(exp)

tf.Session().run(unstable_softmax([1000., 0.]))  # prints [nan, 0.]

Note that exponentiating even moderately large logits produces numbers outside the range of float32. For our naive softmax implementation, the largest valid logit is ln(3.40282e+38) ≈ 88.7; anything beyond that leads to a nan result.
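We can check this threshold directly with a quick NumPy sketch:

import numpy as np

print(np.exp(np.float32(88.7)))  # prints roughly 3.3e+38, still representable
print(np.exp(np.float32(89.0)))  # overflows float32 and prints inf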

But how can we make it more stable? The solution is fairly simple. It is easy to see that exp(x - c) / ∑ exp(x - c) = exp(x) / ∑ exp(x). Therefore, we can subtract any constant from the logits and the result stays the same. We choose this constant to be the maximum of the logits. That way, the domain of the exponential function is limited to [-inf, 0], so its range is [0.0, 1.0], which is desirable:

import tensorflow as tf

def softmax(logits):
    exp = tf.exp(logits - tf.reduce_max(logits))
    return exp / tf.reduce_sum(exp)

tf.Session().run(softmax([1000., 0.]))  # prints [1., 0.]

Let’s look at a more complicated case. Suppose we have a classification problem and use the softmax function to produce probabilities from our logits. We then define a cross-entropy loss between the true and predicted distributions. Recall that the cross entropy for a categorical distribution can simply be defined as xe(p, q) = -∑ p_i log(q_i), so a naive implementation of cross entropy looks like this:

def unstable_softmax_cross_entropy(labels, logits):
    logits = tf.log(softmax(logits))
    return -tf.reduce_sum(labels * logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = unstable_softmax_cross_entropy(labels, logits)

print(tf.Session().run(xe))  # prints inf

Note that in this code, as the softmax output approaches zero, its logarithm approaches negative infinity, which makes our computation unstable. We can rewrite it by expanding the softmax and simplifying: log(softmax(x)_i) = log(exp(x_i - c) / ∑_j exp(x_j - c)) = (x_i - c) - log ∑_j exp(x_j - c), which is exactly what tf.reduce_logsumexp lets us compute stably:

def softmax_cross_entropy(labels, logits):
    scaled_logits = logits - tf.reduce_max(logits)
    normalized_logits = scaled_logits - tf.reduce_logsumexp(scaled_logits)
    return -tf.reduce_sum(labels * normalized_logits)

labels = tf.constant([0.5, 0.5])
logits = tf.constant([1000., 0.])

xe = softmax_cross_entropy(labels, logits)

print(tf.Session().run(xe))  # prints 500.0

We can also verify that the gradient calculation is correct:

g = tf.gradients(xe, logits)
print(tf.Session().run(g))  # prints [0.5, -0.5]

Again, be careful when doing gradient descent to make sure that your loss function and the gradients at each layer stay within a valid range. Exponential and logarithmic functions should also be used with great care, because they can map small numbers to huge ones and vice versa.
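For example (a quick illustrative sketch), a value that underflows to zero turns into -inf as soon as it passes through a logarithm, and that -inf then propagates through the rest of the computation:

import numpy as np

p = np.float32(1e-46)  # too small for float32, stored as 0.0
print(np.log(p))       # prints -inf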