Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022-01-05T00:00:00.000000Z)