(Choose 1 answer)
GRU
Here are the update equations for the GRU.
$\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$
Alice proposes to simplify the GRU by always removing the $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing the $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
A. Alice's model (removing $\Gamma_u$), because if $\Gamma_u = 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
B. Betty's model (removing $\Gamma_r$), because if $\Gamma_u = 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
C. Betty's model (removing
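$\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$
$a^{<t>} = c^{<t>}$
$\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$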
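
To make the two proposals concrete, here is a minimal sketch of one GRU step following the update equations above. It assumes a scalar state and scalar input, so each weight matrix reduces to two numbers. The function names, the `params` layout, and the `use_update_gate` / `use_reset_gate` flags are hypothetical and introduced only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used for both gates."""
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(c_prev, x, params, use_update_gate=True, use_reset_gate=True):
    """One scalar GRU step (illustrative sketch, not a library API).

    params is a hypothetical dict with keys "c", "u", "r"; each value is a
    tuple (weight on c<t-1>, weight on x<t>, bias). Setting a use_* flag to
    False fixes the corresponding gate at 1, i.e. "removes" it.
    """
    wu_c, wu_x, bu = params["u"]
    wr_c, wr_x, br = params["r"]
    wc_c, wc_x, bc = params["c"]

    # Gamma_u and Gamma_r: sigmoid gates, or the constant 1 when removed.
    gamma_u = sigmoid(wu_c * c_prev + wu_x * x + bu) if use_update_gate else 1.0
    gamma_r = sigmoid(wr_c * c_prev + wr_x * x + br) if use_reset_gate else 1.0

    # Candidate value: the reset gate scales c<t-1> before it enters tanh.
    c_tilde = math.tanh(wc_c * (gamma_r * c_prev) + wc_x * x + bc)

    # New memory cell: blend of the candidate and the previous cell.
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c  # a<t> = c<t>
```

Under these assumptions, calling `gru_step(..., use_update_gate=False)` corresponds to Alice's variant and `gru_step(..., use_reset_gate=False)` to Betty's.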