Although the relations between words in a sentence are hierarchical, sequence-to-sequence (seq2seq) models typically fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations (e.g., transforming declarative sentences into questions). To overcome this weakness, Mueller et al. propose pre-training seq2seq models on natural language data. They find that pre-trained seq2seq models can generalize hierarchically when performing syntactic transformations, though only after exposure to very large corpora of natural language. In contrast, models without pre-training fall back on linear/positional heuristics to perform syntactic transformations.
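As a toy illustration (not taken from Mueller et al.), the sketch below shows why a linear/positional heuristic fails at question formation: fronting the *first* auxiliary in the string breaks down once a relative clause intervenes, whereas the correct, hierarchy-sensitive rule fronts the *main-clause* auxiliary.

```python
AUXILIARIES = {"is", "can", "will"}

def linear_question(tokens):
    """Linear/positional heuristic: front the FIRST auxiliary in the string."""
    i = next(j for j, t in enumerate(tokens) if t in AUXILIARIES)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

declarative = "the boy who is sleeping can swim".split()

# The linear heuristic grabs the embedded auxiliary "is", yielding the
# ungrammatical *"is the boy who sleeping can swim".
print(" ".join(linear_question(declarative)))

# A hierarchy-sensitive rule would instead front the main-clause
# auxiliary "can": "can the boy who is sleeping swim".
```

On simple sentences with a single auxiliary, the two rules coincide, which is exactly why a learner exposed only to such sentences can latch onto the linear heuristic.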