Beyond the Field-of-View: Enhancing Scene Visibility and Perception With Clip-Recurrent Transformer (2022-11-21)

TL;DR

This paper proposes online video inpainting for autonomous vehicles as a way to expand the field of view, thereby enhancing scene visibility, perception, and system safety. It introduces the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation.

Abstract

Vision sensors are widely applied in vehicles, robots, and roadside infrastructure. However, due to limitations in hardware cost and system size, the camera Field-of-View (FoV) is often restricted and may not provide sufficient coverage. Nevertheless, from a spatiotemporal perspective, it is possible to obtain information beyond the camera's physical FoV from past video streams. In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view, thereby enhancing scene visibility, perception, and system safety. To achieve this, we introduce the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation. FlowLens offers two key features: 1) It includes a newly designed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated over time. 2) It integrates a multi-branch Mix Fusion Feed Forward Network (MixF3N) to enhance the precise spatial flow of local features. To facilitate training and evaluation, we derive a dataset from KITTI360 with various FoV masks, covering both outer- and inner-FoV expansion scenarios. We also conduct both quantitative assessments and qualitative comparisons of beyond-FoV semantics and beyond-FoV object detection across different models. We illustrate that employing FlowLens to reconstruct unseen scenes even enhances perception within the field of view by providing reliable semantic context. Extensive experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance.
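
To make the clip-recurrent idea in the abstract more concrete, the following is a minimal sketch, not the FlowLens implementation. It assumes plain scaled dot-product cross attention without learned projections, and it does not reproduce the 3D decoupling of DDCA or the MixF3N branch. The names `cross_attention` and `ClipRecurrentCache`, and the choice to cache the fused tokens of the previous clip, are hypothetical and chosen only for illustration. The sketch shows how tokens of the current clip can query tokens cached from an earlier clip, so that only past information is used, as an online setting requires.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Scaled dot-product cross attention (assumption: no learned projections).

    queries: tokens of the current clip, a list of d-dimensional vectors.
    memory:  tokens to attend over; used as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, memory))
                    for j in range(d)])
    return out


class ClipRecurrentCache:
    """Hypothetical helper that carries tokens from one clip to the next."""

    def __init__(self):
        self.prev_tokens = None  # nothing cached before the first clip

    def step(self, clip_tokens):
        # Attend over the cached previous clip plus the current clip.
        # On the first clip there is no cache, so attend over the clip itself.
        if self.prev_tokens is None:
            memory = clip_tokens
        else:
            memory = self.prev_tokens + clip_tokens
        fused = cross_attention(clip_tokens, memory)
        # Assumption: cache the fused tokens so information accumulates over time.
        self.prev_tokens = fused
        return fused


# Usage: two consecutive clips processed strictly in temporal order.
hub = ClipRecurrentCache()
out_a = hub.step([[1.0, 0.0], [0.0, 1.0]])  # first clip, no cache yet
out_b = hub.step([[0.5, 0.5]])              # second clip attends to cached tokens
```

Because each output token is a convex combination of the memory tokens, the second clip's output here depends on what was cached from the first clip. A real model would add learned query, key, and value projections and would operate on spatiotemporal feature maps rather than toy vectors.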

