These networks represent an image as a pooled outer product of features derived from two CNNs, capturing localized feature interactions in a translationally invariant manner. They can be trained from scratch on the ImageNet dataset and offer consistent improvements over the baseline architecture.
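As a rough illustration of the pooled outer product, the sketch below sums the outer product of two per-location feature vectors over all image locations. It is a minimal pure-Python sketch under stated assumptions, not the networks' actual implementation: the function name `bilinear_pool`, the list-of-vectors layout, the toy feature values, and the choice of sum pooling are all assumptions made for illustration. Real CNN features would come from convolutional layers rather than hand-written lists.

```python
def bilinear_pool(feats_a, feats_b):
    """Sum-pool the outer product of two feature maps over all locations.

    feats_a: list of per-location feature vectors from the first network (length M each)
    feats_b: list of per-location feature vectors from the second network (length N each)
    Returns an M x N matrix as a list of lists.
    (Hypothetical helper; names and layout are assumptions.)
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature maps must cover the same locations")
    m, n = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * n for _ in range(m)]
    for fa, fb in zip(feats_a, feats_b):
        # Outer product at one location: every pairwise interaction fa[i] * fb[j].
        for i in range(m):
            for j in range(n):
                pooled[i][j] += fa[i] * fb[j]
    return pooled


# Toy features (assumed values): 3 locations, M = 2 and N = 3 channels.
feats_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
feats_b = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
descriptor = bilinear_pool(feats_a, feats_b)

# Moving the same features to different locations leaves the descriptor unchanged.
order = [2, 0, 1]
shuffled = bilinear_pool([feats_a[k] for k in order],
                         [feats_b[k] for k in order])

print(descriptor)
print(shuffled == descriptor)
```

Because the pooling step sums over locations, the descriptor does not depend on where a given pair of features occurs, which is one way to read the invariance claim. Each entry of the matrix still records an interaction between features observed at the same location.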