A previous issue of PTK described the Vision Transformer (ViT), which applies the Transformer architecture, originally used for language processing tasks, to computer vision problems. A research team from NAVER AI has iterated on this architecture and released an open-source implementation of the Pooling-based Vision Transformer (PiT). PiT extends ViT by adopting pooling layers, which reduce spatial resolution with depth and are credited with improving the generalization and expressiveness of CNNs. PiT outperforms ViT on several tasks, including image classification and object detection, as well as on robustness benchmarks.
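To make the pooling idea concrete, here is a minimal sketch of shrinking a grid of patch tokens between Transformer stages. It is not the NAVER AI implementation: the function name `pool_tokens`, the row-major token layout, and the use of plain average pooling are all assumptions chosen for readability. The released PiT code may use a learned, strided operation and may also change the embedding width.

```python
from typing import List

Token = List[float]  # one embedding vector per image patch


def pool_tokens(tokens: List[Token], height: int, width: int, stride: int = 2) -> List[Token]:
    """Average non-overlapping stride x stride blocks of a token grid.

    Illustrative stand-in only: average pooling is an assumption made for
    simplicity, not necessarily the operation PiT uses.
    """
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")

    dim = len(tokens[0])
    pooled: List[Token] = []
    for row in range(0, height, stride):
        for col in range(0, width, stride):
            # Gather the tokens in this block (row-major layout).
            block = [
                tokens[(row + dr) * width + (col + dc)]
                for dr in range(stride)
                for dc in range(stride)
            ]
            # Element-wise mean over the block.
            pooled.append([sum(t[i] for t in block) / len(block) for i in range(dim)])
    return pooled


# A 4x4 grid of 2-dimensional tokens becomes a 2x2 grid.
grid = [[float(i), float(2 * i)] for i in range(16)]
pooled = pool_tokens(grid, height=4, width=4)
```

With a stride of 2, each pooling step quarters the number of tokens, so later attention layers operate on a shorter sequence, much as deeper CNN layers operate on smaller feature maps.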