A simple and compact ViT architecture called UViT is proposed. It achieves strong performance on COCO object detection and instance segmentation, and it comes with a scaling rule that optimizes the model's trade-off between accuracy and computation cost or model size.