Psychological research suggests that humans interpret their environment by parsing visual scenes into part-whole hierarchies. In this paper, Geoffrey Hinton proposes using capsules to represent the part-whole hierarchy in a neural net as a dynamically constructed parse tree of objects and their parts. He describes the GLOM architecture, in which embedding vectors at the same level in different columns interact to produce islands of identical embeddings, and these islands represent the nodes of the parse tree. GLOM has not been implemented. As a design, it combines transformers, neural fields, contrastive learning, capsule networks, denoising autoencoders, and RNNs, and it could unlock new computer vision use cases such as generalizing shape recognition to new viewpoints.
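The island-forming interaction described above can be illustrated with a toy example. Since GLOM itself has not been implemented, the following is only a minimal sketch under stated assumptions: it models a single level, uses attention weights computed from cosine similarity, and leaves out every other part of the architecture. The function names (`attention_step`, `form_islands`), the sharpness parameter `beta`, and the toy data are all hypothetical and chosen for illustration.

```python
import numpy as np

def attention_step(emb, beta=10.0):
    """One same-level update (assumed form): each column's embedding becomes
    the attention-weighted average of all columns' embeddings at this level."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                                   # cosine similarity between columns
    w = np.exp(beta * (sim - sim.max(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                     # softmax over columns
    return w @ emb

def form_islands(emb, steps=20, beta=10.0):
    """Repeat the same-level update so that similar embeddings agree."""
    for _ in range(steps):
        emb = attention_step(emb, beta)
    return emb

# Hypothetical toy data: 6 columns with 4-dimensional embeddings.
# Columns 0-2 start near one prototype, columns 3-5 near another.
rng = np.random.default_rng(0)
proto = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
emb0 = np.repeat(proto, 3, axis=0) + 0.05 * rng.standard_normal((6, 4))

emb = form_islands(emb0)

# Pairwise cosine similarity after the updates; inspect it to see the two islands.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(np.round(unit @ unit.T, 3))
```

In this sketch, columns whose embeddings start out similar pull one another toward a shared vector, while the sharp softmax keeps dissimilar groups from merging. Each resulting group of near-identical vectors plays the role of one island, that is, one node of the parse tree at this level.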