This paper describes an efficient, highly parallel GPU implementation of EmbedSOM designed to provide interactive results on large datasets, and presents BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points.
Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.

arXiv:2201.00701v1 [cs.LG] 3 Jan 2022

Dimensionality reduction algorithms emerged as indispensable utilities that enable various forms of intuitive data visualization, providing insight that in turn simplifies rigorous data analysis. Various algorithms have been proposed for graphs and high-dimensional point-cloud data, and for many different types of datasets that can be represented with a graph structure or embedded into vector spaces. The development has benefited especially the life sciences, where algorithms like t-SNE [21] reshaped the accepted ways of interpreting many kinds of measurements, such as genes, single-cell phenotypes and development pathways, and behavioral patterns [34, 6].
Performance of the non-linear dimensionality reduction algorithms becomes a concern if the analysis pipeline is required to scale, or when the results are required in a limited amount of time, such as in clinical settings. The most popular methods, typically based on neighborhood embedding computed by stochastic descent, force-based layout or neural autoencoders, reach applicability limits when the dataset size is too large. To tackle the limitations, we have previously developed EmbedSOM [15], a dimensionality reduction and visualization algorithm based on self-organizing maps (SOMs) [13]. EmbedSOM provided an order-of-magnitude speedup on datasets typical of single-cell cytometry data visualization while retaining competitive quality of the results. The concept has proven useful for interactive and high-performance workflows in cytometry [16, 14], and easily applies to many other types of datasets. Despite that, the parallelization potential of the extremely data-regular design of the EmbedSOM algorithm has remained mostly untapped.

Our contribution in this paper is a natural continuation of the development: we describe an efficient, highly parallel GPU implementation of EmbedSOM designed to provide interactive results on large datasets. The implementation is sufficiently fast to provide real-time visualizations of datasets with millions of individual data points on off-the-shelf hardware, while maintaining a smooth, video-like frame rate. We demonstrate that the result gives an unprecedented, controllable view of the details of specific high-dimensional datasets. The instant feedback available to the user opens possibilities for partial supervision of the visualization process, allowing user-intuitive resolution of possible visualization ambiguities as well as natural exploration of new datasets. We demonstrate some of the achievable results on two realistic datasets. The resulting software, called BlosSOM, is published as free and open source.
BlosSOM can be readily utilized to reproduce our results and explore more datasets; additionally, it contains support for working with data formats (mainly the FCS standard [31]) that make it immediately useful for visualization of existing and new biological data. In the paper, we briefly describe the EmbedSOM algorithm (Section 2.1), and show an extension of its generalized form that dynamically mixes user feedback into the learning process, thus enabling semi-supervised learning (Section 2.2). We specifically detail the CUDA-based GPU implementation of the algorithm in Section 3, and report the achieved performance improvements (Section 4.1). Finally, we showcase the achievable results on biological data, and discuss possible future enhancements and applications that would aid data analysis (Sections 4.2, 4.4).
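
To illustrate what a data-regular, landmark-based embedding can look like, the following minimal Python sketch places every point in 2D from its k nearest landmarks. It is a deliberately simplified stand-in and not the EmbedSOM projection itself (which is described in Section 2.1): the inverse-distance weighting, the function names `embed_point` and `embed_all`, and the parameter `k` are assumptions made only for this illustration. The property it is meant to show is that each point is processed independently against a small, shared, read-only set of landmarks, which is the kind of workload that maps naturally to one GPU thread per point.

```python
import math


def embed_point(point, landmarks_hi, landmarks_lo, k=3):
    """Place one high-dimensional point in 2D using its k nearest landmarks.

    Simplified illustration: the 2D position is an inverse-distance-weighted
    average of the 2D positions of the k nearest landmarks. This is an
    assumption for the sketch, not the projection used by EmbedSOM.
    """
    # Distance from the point to every landmark in the original space.
    dists = [math.dist(point, lm) for lm in landmarks_hi]
    # Indices of the k nearest landmarks (stable sort keeps ties in order).
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Inverse-distance weights; the small epsilon avoids division by zero
    # when the point coincides with a landmark.
    weights = [1.0 / (dists[i] + 1e-9) for i in nearest]
    total = sum(weights)
    x = sum(w * landmarks_lo[i][0] for w, i in zip(weights, nearest)) / total
    y = sum(w * landmarks_lo[i][1] for w, i in zip(weights, nearest)) / total
    return (x, y)


def embed_all(points, landmarks_hi, landmarks_lo, k=3):
    """Embed all points; no iteration depends on any other iteration."""
    return [embed_point(p, landmarks_hi, landmarks_lo, k) for p in points]


# Hypothetical toy data: three landmarks in 3D with fixed 2D positions.
landmarks_hi = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
landmarks_lo = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
embedding = embed_all(points, landmarks_hi, landmarks_lo, k=2)
```

In this toy setup, a point that coincides with a landmark lands essentially on that landmark's 2D position, and a point halfway between two landmarks lands halfway between their 2D positions. Moving a landmark in 2D changes the output only for points that have it among their nearest landmarks, which hints at how landmark positions could serve as a handle for user steering.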