ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Paper • arXiv:2102.03334 • Published Feb 5, 2021