20
(Choose 1 answer)
Here are the update equations for the GRU.
GRU
B. D
C. A
D. C
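$$\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\left[\Gamma_r * c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right)$$
$$\Gamma_u = \sigma\left(W_u\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right)$$
$$\Gamma_r = \sigma\left(W_r\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_r\right)$$
$$c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle}$$
$$a^{\langle t \rangle} = c^{\langle t \rangle}$$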
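As a rough illustration, the update equations above can be sketched as a single NumPy step. The parameter dictionary, the `init_params` helper, and the `fix_update` / `fix_reset` switches (which pin a gate to 1) are assumptions made for this sketch and are not part of the question itself.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_c, n_x, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    shape = (n_c, n_c + n_x)
    return {
        "Wc": rng.normal(scale=0.1, size=shape), "bc": np.zeros(n_c),
        "Wu": rng.normal(scale=0.1, size=shape), "bu": np.zeros(n_c),
        "Wr": rng.normal(scale=0.1, size=shape), "br": np.zeros(n_c),
    }


def gru_step(c_prev, x_t, params, fix_update=False, fix_reset=False):
    """One GRU step following the update equations.

    fix_update / fix_reset are illustrative switches (an assumption of
    this sketch) that replace the update or reset gate with all ones.
    """
    concat = np.concatenate([c_prev, x_t])

    # Update gate Gamma_u and reset gate Gamma_r.
    if fix_update:
        gamma_u = np.ones_like(c_prev)
    else:
        gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])
    if fix_reset:
        gamma_r = np.ones_like(c_prev)
    else:
        gamma_r = sigmoid(params["Wr"] @ concat + params["br"])

    # Candidate memory: the reset gate scales c_prev before it is used.
    gated = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(params["Wc"] @ gated + params["bc"])

    # New memory: per-element blend of the candidate and the old memory.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    a_t = c_t
    return a_t, c_t
```

Calling `gru_step(c_prev, x_t, init_params(3, 2))` with a length-3 `c_prev` and a length-2 `x_t` returns the new activation and memory cell, which are equal by the last equation.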
Alice proposes to simplify the GRU by always removing $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
a. Alice's model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
A. B