In the visual encoding step, the image is divided into discrete patches. Unlike the classic ViT, in the proposed Swin Transformer architecture, the image is divided into patches of size 4 × 4 and ...
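To make the patch division concrete, here is a minimal sketch in plain Python. It assumes an image stored as a nested height × width × channels list with three (RGB) channels, and sides that are multiples of the patch size. The function name `partition_patches` and the toy 8 × 8 input are hypothetical, chosen only for illustration. Each non-overlapping 4 × 4 patch is flattened into one vector of 4 · 4 · 3 = 48 raw values. Later processing of these vectors is not shown here.

```python
from typing import List

# Assumed layout: image[y][x] is a list of channel values (height x width x channels).
Image = List[List[List[float]]]


def partition_patches(image: Image, patch_size: int = 4) -> List[List[float]]:
    """Split an image into non-overlapping square patches and flatten each one."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    tokens = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch: rows, then pixels within a row, then channels.
            token = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Toy 8 x 8 "RGB" image whose pixels store their own (y, x) coordinates.
image = [[[float(y), float(x), 0.0] for x in range(8)] for y in range(8)]
tokens = partition_patches(image)
print(len(tokens), len(tokens[0]))  # 4 48
```

With a 4 × 4 patch size, an H × W image yields (H/4) · (W/4) patch vectors, so the 8 × 8 toy input produces 4 vectors of length 48.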