The PowerNorm paper states: "We find that there are clear differences in the batch statistics of NLP data versus CV data. In particular, we observe that batch statistics for NLP data have a very large variance throughout training." Is this still true for ViT? I see most implementations of ViT using LayerNorm.
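For context on what "variance of the batch statistics" means here, a toy sketch (my own illustration, not from the paper, and not using real ViT or NLP activations): draw many batches, compute each batch's mean, and measure how much that per-batch mean fluctuates. A heavy-tailed (log-normal) distribution stands in for NLP-like activations and a Gaussian for CV-like ones; the distribution choices are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_stat_variance(sampler, n_batches=200, batch_size=32, dim=64):
    # Collect the per-batch mean for each feature, then measure how much
    # that statistic fluctuates from batch to batch -- the quantity the
    # PowerNorm paper reports as the variance of the batch statistics.
    means = np.stack([sampler((batch_size, dim)).mean(axis=0)
                      for _ in range(n_batches)])
    return means.var(axis=0).mean()

# CV-like stand-in: roughly Gaussian activations, stable batch statistics.
cv_like = lambda shape: rng.normal(0.0, 1.0, shape)
# NLP-like stand-in: heavy-tailed activations, so batch means jump around.
nlp_like = lambda shape: rng.lognormal(0.0, 2.0, shape)

print(batch_stat_variance(cv_like))   # small
print(batch_stat_variance(nlp_like))  # much larger
```

If the claim transferred to ViT, running the analogous measurement on actual ViT patch-token activations would show similarly unstable batch statistics; whether it does is exactly what the question asks.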