This work proposes an attention-based approach to multimodal image patch matching in which a Transformer encoder attends to the feature maps of a multiscale Siamese CNN. It also introduces an attention-residual architecture, in which a residual connection bypasses the encoder.
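
To make the data flow concrete, the following is a minimal NumPy sketch of one way a residual connection could bypass an attention stage in a Siamese descriptor branch. It is an illustration under stated assumptions, not the authors' implementation: the CNN feature maps are replaced by random arrays, the multiscale aspect is not modeled, the encoder is reduced to a single self-attention layer without layer normalization or a feed-forward block, and the names `self_attention` and `describe`, the mean pooling, and the additive merge are all hypothetical choices.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (n_tokens, dim).
    # Assumption: stands in for the full Transformer encoder.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def describe(feature_map, params):
    # feature_map: (C, H, W) array standing in for a CNN feature map.
    c, h, w = feature_map.shape
    tokens = feature_map.reshape(c, h * w).T   # one token per spatial location
    attended = self_attention(tokens, *params)
    bypass = tokens.mean(axis=0)               # residual path that skips the encoder
    encoded = attended.mean(axis=0)            # pooled encoder output
    desc = bypass + encoded                    # assumed additive merge
    return desc / np.linalg.norm(desc)         # unit-length patch descriptor


rng = np.random.default_rng(0)
C = 8
# The same parameters are used for both patches, which is what makes the branches Siamese.
params = tuple(rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))

fmap_a = rng.normal(size=(C, 4, 4))  # placeholder features for a patch in modality A
fmap_b = rng.normal(size=(C, 4, 4))  # placeholder features for a patch in modality B

desc_a = describe(fmap_a, params)
desc_b = describe(fmap_b, params)
distance = np.linalg.norm(desc_a - desc_b)  # smaller distance means a more likely match
```

In this sketch the bypass path lets the pooled CNN features reach the descriptor directly, so the attention stage only has to contribute a correction on top of them.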