Efficient interpretable variants of online SOM for large dissimilarity data

Open archive

Mariette, Jérôme, J. | Olteanu, Madalina | Vialaneix, Nathalie

Published by CCSD; Elsevier

International audience. Self-organizing maps (SOM) are a useful tool for exploring data. In its original version, the SOM algorithm was designed for numerical vectors. Since then, several extensions have been proposed to handle complex datasets described by (dis)similarities. Most of these extensions represent each prototype by a list of (dis)similarities with the entire dataset and suffer from several drawbacks: their complexity increases (it becomes quadratic instead of linear), their stability is reduced, and the interpretability of the prototypes is lost. In the present article, we propose and compare two extensions of the stochastic SOM for (dis)similarity data: the first takes advantage of the online setting to maintain a sparse representation of the prototypes at each step of the algorithm, while the second uses dimension reduction in a feature space defined by the (dis)similarity. Our contributions to the analysis of (dis)similarity data with topographic maps are thus twofold. First, we present a new version of the SOM algorithm that ensures a sparse representation of the prototypes through online updates. Second, this approach is compared on several benchmarks to a standard dimension reduction technique (K-PCA), which is itself adapted to large datasets with the Nyström approximation. Results demonstrate that both approaches reduce the dimensionality of the prototypes while providing accurate results in a reasonable computational time. The choice between these two strategies depends on the dataset size, the need to easily interpret the results, and the computational facilities available. The conclusion provides some recommendations to help the user make this choice.
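The dense prototype representation that the abstract criticizes can be illustrated with a minimal sketch of a standard online relational SOM step. This is not the sparse variant proposed in the article. It assumes a matrix `D` of squared dissimilarities, a one-dimensional map, and simple linearly decreasing schedules for the learning rate and neighbourhood radius. All function names and parameters below are hypothetical. Each prototype is stored as a vector of convex-combination coefficients over all n observations, so the quadratic term in the distance costs O(n^2) per prototype.

```python
import math
import random


def relational_distance(d_row, beta, D):
    """Squared dissimilarity between one observation and one prototype.

    The prototype is the convex combination sum_j beta[j] * x_j, so the
    distance is sum_j beta[j] * D[i][j] - 0.5 * beta^T D beta, where
    d_row is the row D[i] of the observation.
    """
    n = len(beta)
    cross = sum(beta[j] * d_row[j] for j in range(n))
    # This double loop over all coefficients is the quadratic cost.
    quad = sum(beta[j] * beta[k] * D[j][k] for j in range(n) for k in range(n))
    return cross - 0.5 * quad


def online_relational_som(D, n_units, n_iter, seed=0):
    """Return one coefficient vector per unit of a 1-D map (a sketch)."""
    rng = random.Random(seed)
    n = len(D)
    # Each prototype starts on one randomly chosen observation.
    betas = []
    for _ in range(n_units):
        beta = [0.0] * n
        beta[rng.randrange(n)] = 1.0
        betas.append(beta)
    for t in range(n_iter):
        i = rng.randrange(n)  # draw one observation at random
        # Assignment step: find the best matching unit.
        bmu = min(
            range(n_units),
            key=lambda u: relational_distance(D[i], betas[u], D),
        )
        # Assumed schedules: both decrease linearly with t.
        mu = 0.5 * (1.0 - t / n_iter) + 0.01
        radius = max(0.5, (n_units / 2.0) * (1.0 - t / n_iter))
        # Representation step: move each prototype towards observation i.
        for u in range(n_units):
            h = math.exp(-((u - bmu) ** 2) / (2.0 * radius ** 2))
            step = mu * h
            betas[u] = [(1.0 - step) * b for b in betas[u]]
            betas[u][i] += step
    return betas
```

Because each update is a convex combination of the previous coefficients and an indicator vector, the coefficients stay non-negative and sum to one. In this dense form a coefficient that becomes non-zero never returns exactly to zero, which is what motivates maintaining sparsity explicitly.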

Suggestions

By the same author

Kernel and dissimilarity methods for exploratory analysis in a social context

Open archive | Mariette, Jérôme, J. | CCSD

International audience. While most statistical methods for prediction or data mining have been built for data made of independent observations of a common set of p numerical variables, many real-world application...

Unsupervised multiple kernel learning for heterogeneous data integration

Open archive | Mariette, Jérôme, J. | CCSD

International audience. Motivation: Recent high-throughput sequencing advances have expanded the breadth of available omics datasets and the integrated analysis of multiple datasets obtained on the same samples has ...

Des noyaux pour les omiques (Kernels for omics)

Open archive | Mariette, Jérôme, J. | CCSD

International audience. The development of high-throughput sequencing techniques generates a rapidly growing volume of data at relatively low cost. These data are often very high-dimensional, h...
