Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM (2022-01-03T00:00:00.000000Z)

TL;DR

This paper describes an efficient, highly parallel GPU implementation of EmbedSOM designed to provide interactive results on large datasets, and presents BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points.

Abstract

Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
Dimensionality reduction algorithms emerged as indispensable utilities that enable various forms of intuitive data visualization, providing insight that in turn simplifies rigorous data analysis. Various algorithms have been proposed for graphs and high-dimensional point-cloud data, and many different types of datasets that can be represented with a graph structure or embedded into vector spaces. The development has benefited especially the life sciences, where algorithms like t-SNE [21] reshaped the accepted ways of interpreting many kinds of measurements, such as genes, single-cell phenotypes and development pathways, and behavioral patterns [34, 6].
Performance of the non-linear dimensionality reduction algorithms becomes a concern if the analysis pipeline is required to scale or when the results are required in a limited amount of time, such as in clinical settings. The most popular methods, typically based on neighborhood embedding computed by stochastic descent, force-based layout, or neural autoencoders, reach applicability limits when the dataset size is too large. To tackle the limitations, we have previously developed EmbedSOM [15], a dimensionality reduction and visualization algorithm based on self-organizing maps (SOMs) [13]. EmbedSOM provided an order-of-magnitude speedup on datasets typical for single-cell cytometry data visualization while retaining competitive quality of the results. The concept has proven useful for interactive and high-performance workflows in cytometry [16, 14], and easily applies to many other types of datasets. Despite that, the parallelization potential of the extremely data-regular design of the EmbedSOM algorithm has remained mostly untapped. Our contribution in this paper is a natural continuation of the development: we describe an efficient, highly parallel GPU implementation of EmbedSOM designed to provide interactive results on large datasets. The implementation is sufficiently fast to provide real-time visualizations of datasets larger than 10^6 individual data points on off-the-shelf hardware, while maintaining a smooth video-like frame rate. We demonstrate that the result gives an unprecedented, controllable view of the details of specific high-dimensional datasets. The instant feedback available to the user opens possibilities for partial supervision of the visualization process, allowing user-intuitive resolution of possible visualization ambiguities as well as natural exploration of new datasets. We demonstrate some of the achievable results on two realistic datasets. The resulting software, called BlosSOM, is published as free and open source.
BlosSOM can be readily utilized to reproduce our results and explore more datasets; additionally, it contains support for working with data formats (mainly the FCS standard [31]) that make it immediately useful for visualization of existing and new biological data. In the paper, we briefly describe the EmbedSOM algorithm (Section 2.1), and show an extension of its generalized form that dynamically mixes user feedback into the learning process, thus enabling semi-supervised learning (Section 2.2). We specifically detail the CUDA-based GPU implementation of the algorithm in Section 3, and report the achieved performance improvements (Section 4.1). Finally, we showcase the achievable results on biological data, and discuss possible future enhancements and applications that would aid data analysis (Sections 4.2, 4.4).
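
The landmark-based idea mentioned above can be illustrated with a small sketch. This is not the EmbedSOM projection described in the paper; it swaps in a much simpler technique (an inverse-distance weighted average over the k nearest landmarks) only to show how each point's 2D position can follow a small set of landmarks, and how moving a landmark in 2D (a stand-in for user steering) moves the points near it. The function name `embed_point`, the parameters `k` and `eps`, and the toy data are all assumptions made for illustration.

```python
import math


def embed_point(point, landmarks_hd, landmarks_2d, k=3, eps=1e-9):
    """Place one high-dimensional point in 2D using its k nearest landmarks.

    Simplified stand-in (NOT the actual EmbedSOM projection): the result is
    the inverse-distance weighted average of the 2D positions of the k
    landmarks that are nearest in the high-dimensional space.
    """
    # Euclidean distance from the point to every high-dimensional landmark.
    dists = [math.dist(point, lm) for lm in landmarks_hd]
    # Indices of the k nearest landmarks.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Closer landmarks get larger weights; eps avoids division by zero.
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    x = sum(w * landmarks_2d[i][0] for w, i in zip(weights, nearest)) / total
    y = sum(w * landmarks_2d[i][1] for w, i in zip(weights, nearest)) / total
    return (x, y)


# Hypothetical toy data: four landmarks in 3D and their 2D layout positions.
landmarks_hd = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
landmarks_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# A data point close to the second landmark lands near that landmark in 2D.
before = embed_point((0.9, 0.1, 0.0), landmarks_hd, landmarks_2d)

# "User steering": drag the second landmark elsewhere in the 2D layout;
# re-embedding the same point makes it follow the moved landmark.
steered_2d = list(landmarks_2d)
steered_2d[1] = (5.0, 5.0)
after = embed_point((0.9, 0.1, 0.0), landmarks_hd, steered_2d)
```

Because every point is embedded independently of all others, a computation of this shape is naturally data-parallel, which is the general property that makes a GPU implementation attractive.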

References (37 items)

OMIP‐080: 29‐Color flow cytometry panel for comprehensive evaluation of NK and T cells reconstitution after hematopoietic stem cells transplantation

Initialization is critical for preserving global data structure in both t-SNE and UMAP

Detailed Analysis and Optimization of CUDA K-means Algorithm

GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Bringing UMAP Closer to the Speed of Light with GPU Acceleration

Systems-Level Immunomonitoring from Acute to Recovery Phase of Severe COVID-19

t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections

ShinySOM: graphical SOM-based analysis of single-cell cytometry data

Generalized EmbedSOM on quadtree-structured self-organizing maps

Visualizing structure and transitions in high-dimensional biological data

Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets

TriMap: Large-scale Dimensionality Reduction Using Triplets

How much can k-means be improved by using better initialization and repeats?

Embedding to reference t-SNE space addresses batch effects in single-cell classification

Quantitative Comparison of Conventional and t-SNE-guided Gating Analyses

Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations

Dimensionality reduction for visualizing single-cell data using UMAP

Survey of GPU Based Sorting Algorithms

T-SNE-CUDA: GPU-Accelerated T-SNE and its Applications to Modern Data

A comparison of single-cell trajectory inference methods: towards more accurate and robust tools

Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data

Optogenetic dissection of descending behavioral control in Drosophila

Visual analysis of mass cytometry data by hierarchical stochastic neighbour embedding reveals rare cell types

Interpretable dimensionality reduction of single cell transcriptome data with deep generative models

Employing GPU architectures for permutation-based indexing

Flow cytometry: basic principles and applications

Mass Cytometry: Single Cells, Many Features

Automated Mapping of Phenotype Space with Single-Cell Data

Comparative Analysis of Single-Cell RNA Sequencing Methods

Optimizing Sorting and Top-k Selection Steps in Permutation Based Indexing on GPUs

Data File Standard for Flow Cytometry, version FCS 3.1

The self-organizing map

GPU-Assisted Scatterplots for Millions of Call Events

Visualizing Data using t-SNE

The API reference for CUB

A survey of ow cytometry data analysis methods

A survey of ow cytometry data analysis methods