Hi, thanks for sharing this impressive project.
I wonder how you came up with the update rule. Specifically, why is the average of the attention scores before softmax a suitable choice for beta_t? Was it derived theoretically, or found by trial and error over many experiments? Could you share some insights? Thanks!