Temporal and cross-modal attention for audio-visual zero-shot learning