(Choose 1 answer)
You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively
large values, using np.random.randn(.....)*1000. What will happen?
A. It doesn't matter. So long as you initialize the weights randomly, gradient descent is not affected by whether the weights are large or small.
B. This will cause the inputs of the tanh to also be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning.
C. This will cause the inputs of the tanh to also be very large, causing the units to be "highly activated" and thus speed up learning compared to if the weights had to start from small values.
D. This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
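
To reason about the options, it can help to look at how the derivative of tanh, 1 - tanh(z)^2, behaves as the pre-activation z grows. The following is a minimal sketch, not part of the original question: the layer shape (3 hidden units, 4 inputs), the 0.01 scale used for comparison, the fixed seed, and the helper name tanh_grad are all illustrative assumptions.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh with respect to its input: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

# Hypothetical layer: 4 input features feeding 3 hidden tanh units.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))

# Same random draw, scaled two ways (0.01 is an assumed "small" scale).
W_base = rng.standard_normal((3, 4))
W_small = W_base * 0.01
W_large = W_base * 1000

for name, W in [("small", W_small), ("large", W_large)]:
    z = W @ x                      # pre-activations of the hidden units
    g = tanh_grad(z)               # local gradient of each tanh unit
    print(f"{name} weights: max |z| = {np.abs(z).max():.4g}, "
          f"mean tanh'(z) = {g.mean():.4g}")
```

Comparing the two printed lines shows how the local gradient of each unit changes when the same weights are scaled up by a large factor, which is the quantity backpropagation multiplies through at every tanh unit.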