Multimodal association is the task of linking observations from multiple modalities or data types in time series analysis, such as sensor readings, images, audio, and text. The goal is to integrate these heterogeneous sources to improve understanding and prediction of the underlying time series. For example, in a smart home application, readings from temperature, humidity, and motion sensors can be combined with camera images to monitor residents' activities; analyzing the modalities jointly can reveal anomalies or patterns that are not visible in any single modality alone. Multimodal association can be achieved with a range of techniques, including deep learning models, statistical models, and graph-based models, which are trained on the multimodal data to learn the associations and dependencies between the different data types.
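As an illustration of the deep-learning route, the sketch below fuses two modalities (a sensor time series and a sequence of image features) into a single association score. It is a minimal, self-contained example; the module names, feature dimensions, and fusion strategy are assumptions for illustration, not taken from any paper listed on this page.

```python
# Minimal late-fusion sketch (illustrative only): combine a sensor time series
# with per-frame image features to score whether the two streams belong together.
# All shapes and module names are assumptions, not from any specific paper.
import torch
import torch.nn as nn

class LateFusionAssociator(nn.Module):
    def __init__(self, sensor_dim=6, image_dim=512, hidden=64):
        super().__init__()
        # Encode each modality separately, then fuse the summary vectors.
        self.sensor_enc = nn.GRU(sensor_dim, hidden, batch_first=True)
        self.image_enc = nn.GRU(image_dim, hidden, batch_first=True)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, sensor_seq, image_seq):
        # sensor_seq: (batch, T, sensor_dim); image_seq: (batch, T, image_dim)
        _, h_s = self.sensor_enc(sensor_seq)   # final hidden state per stream
        _, h_i = self.image_enc(image_seq)
        fused = torch.cat([h_s[-1], h_i[-1]], dim=-1)
        return self.scorer(fused).squeeze(-1)  # association score per pair

model = LateFusionAssociator()
score = model(torch.randn(4, 50, 6), torch.randn(4, 50, 512))
print(score.shape)  # torch.Size([4])
```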
In this paper, we present Vi-Fi, a multi-modal system that leverages a user's smartphone WiFi Fine Timing Measurements (FTM) and inertial measurement unit (IMU) sensor data to associate the user detected in camera footage with their corresponding smartphone identifier (e.g., WiFi MAC address). Our approach uses a recurrent multi-modal deep neural network that exploits FTM and IMU measurements, along with the distance between the user and the camera (depth information), to learn affinity matrices. As a baseline for comparison, we also present a traditional non-deep-learning approach that uses bipartite graph matching. To facilitate evaluation, we collected a multi-modal dataset comprising camera videos with depth information (RGB-D), WiFi FTM, and IMU measurements for multiple participants in diverse real-world settings. Using association accuracy as the key metric for evaluating the fidelity of Vi-Fi in associating human users on camera feed with their phone IDs, we show that Vi-Fi achieves between 81% (real-time) and 91% (offline) association accuracy.
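The bipartite-matching baseline mentioned in the abstract can be sketched with an off-the-shelf assignment solver: given an affinity matrix between camera-detected people and phones, pick the one-to-one assignment with maximum total affinity. The affinity values below are random placeholders, not Vi-Fi's actual FTM/IMU/depth-derived scores.

```python
# Bipartite matching between camera tracks (rows) and phones (columns):
# maximize total affinity under a one-to-one assignment constraint.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
affinity = rng.random((4, 4))              # 4 people on camera, 4 phones (placeholder scores)

# linear_sum_assignment minimizes cost, so negate the affinities to maximize them.
rows, cols = linear_sum_assignment(-affinity)
for person, phone in zip(rows, cols):
    print(f"camera track {person} -> phone {phone} "
          f"(affinity {affinity[person, phone]:.2f})")
```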
This work introduces WinoGAViL, an online game of vision-and-language associations used as a dynamic evaluation benchmark, and shows that the collected associations require diverse reasoning skills, including general knowledge, common sense, and abstraction.
In this paper, we present ViTag to associate user identities across multimodal data, particularly those obtained from cameras and smartphones. ViTag associates a sequence of vision-tracker-generated bounding boxes with Inertial Measurement Unit (IMU) data and Wi-Fi Fine Time Measurements (FTM) from smartphones. We formulate the problem as association by sequence-to-sequence (seq2seq) translation. In this two-step process, our system first performs cross-modal translation using a multimodal LSTM encoder-decoder network (X-Translator) that translates one modality to another, e.g., reconstructing IMU and FTM readings purely from camera bounding boxes. Second, an association module finds identity matches between the camera and phone domains by matching the translated modality against the observed data from the same modality. In contrast to existing works, our proposed approach can associate identities in multi-person scenarios where all users may be performing the same activity. Extensive experiments in real-world indoor and outdoor environments demonstrate that online association on camera and phone data (IMU and FTM) achieves an average Identity Precision Accuracy (IDP) of 88.39% over a 1 to 3 second window, outperforming the state-of-the-art Vi-Fi (82.93%). A further study on modalities within the phone domain shows that FTM improves association performance by 12.56% on average. Finally, results from our sensitivity experiments demonstrate the robustness of ViTag under different noise and environment variations.
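The translation-then-match idea can be sketched roughly as follows; this is not the authors' code, and the dimensions, architecture details, and the distance-based matching rule are assumptions made for illustration.

```python
# Rough sketch of cross-modal translation followed by matching: an LSTM
# encoder reads a bounding-box sequence, an LSTM decoder emits a same-length
# IMU+FTM sequence, and the camera track is matched to the phone whose
# observed readings are closest to the reconstruction.
import torch
import torch.nn as nn

class XTranslatorSketch(nn.Module):
    def __init__(self, bbox_dim=4, imu_ftm_dim=10, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(bbox_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, imu_ftm_dim)

    def forward(self, bbox_seq):
        # bbox_seq: (batch, T, 4) camera bounding boxes for one tracked person
        enc_out, _ = self.encoder(bbox_seq)
        dec_out, _ = self.decoder(enc_out)
        return self.head(dec_out)              # (batch, T, imu_ftm_dim)

model = XTranslatorSketch()
translated = model(torch.randn(1, 30, 4))      # reconstruct IMU+FTM from boxes
observed = torch.randn(5, 30, 10)              # readings from 5 candidate phones

# Match the camera track to the phone with the smallest reconstruction error.
dists = ((observed - translated) ** 2).mean(dim=(1, 2))
print("matched phone:", int(dists.argmin()))
```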