(Choose 2 answers)
A. (I)
B. (II)
C. (III)
D. (IV)
function mapping from the inputs to the hidden units) and $\Theta^{(2)}$ (controlling the mapping from the hidden units to the outputs). If we set all the elements of $\Theta^{(1)}$ to be 0, and all the elements of $\Theta^{(2)}$ to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input.
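The symmetry-breaking claim in this statement can be checked numerically. The following is a minimal sketch in plain Python, assuming sigmoid activations, a single output unit, and a cross-entropy cost (none of which are stated in the text). The layer sizes, the sample input, and the helper names `gradients` and `train` are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(theta1, theta2, x, y):
    """Backpropagation for one example in a 3-layer sigmoid network.

    theta1: one row per hidden unit, each of length len(x) + 1 (bias weight first).
    theta2: list of length (number of hidden units + 1), bias weight first.
    Assumes one output unit and a cross-entropy cost.
    """
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a3 = sigmoid(sum(w * a for w, a in zip(theta2, a2)))
    delta3 = a3 - y
    delta2 = [theta2[j + 1] * delta3 * a2[j + 1] * (1.0 - a2[j + 1])
              for j in range(len(theta1))]
    grad1 = [[d * a for a in a1] for d in delta2]
    grad2 = [delta3 * a for a in a2]
    return a2[1:], grad1, grad2

def train(theta1, theta2, x, y, lr=0.5, steps=20):
    """Plain gradient descent on a single example (illustration only)."""
    for _ in range(steps):
        _, g1, g2 = gradients(theta1, theta2, x, y)
        theta1 = [[w - lr * g for w, g in zip(row, grow)]
                  for row, grow in zip(theta1, g1)]
        theta2 = [w - lr * g for w, g in zip(theta2, g2)]
    return theta1, theta2

# Hypothetical sizes and data: 3 inputs, 4 hidden units, one training example.
n_inputs, n_hidden = 3, 4
theta1 = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]  # first layer: all zeros
theta2 = [1.0] * (n_hidden + 1)                             # second layer: all ones
x, y = [0.2, -1.3, 0.7], 1.0

hidden, grad1, _ = gradients(theta1, theta2, x, y)
theta1_trained, theta2_trained = train(theta1, theta2, x, y)

# Do the hidden units ever start computing different functions of the input?
rows_identical = all(row == theta1_trained[0] for row in theta1_trained)
print("initial hidden activations:", hidden)
print("first-layer rows identical after training:", rows_identical)
```

Under these assumptions, every hidden unit starts with the same activation and receives the same gradient, so the rows of the first-layer weight matrix remain identical to one another after training. This is worth weighing when judging whether the initialization described above really breaks symmetry.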
random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).
(II) Suppose that the parameter $\Theta^{(1)}$ is a square matrix (meaning the number of rows equals the number
(IV) If we are training a neural network using gradient descent, one reasonable "debugging" step to