Abstract: Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers (ViTs) being the primary choice due to their good scalability ...