HAL open archive - Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

Ghislain Durif (1, 2, 3), Laurent Modolo (4, 2, 5), Jeff Mold (5), Sophie Lambert-Lacroix (6), Franck Picard (2)
1 Thoth - Learning models from massive data
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann
2 High-dimensional statistics for genomics
PEGASE - PEGASE Department [LBBE]
6 TIMC-IMAG-BCM - Computational and Mathematical Biology
TIMC-IMAG - Techniques de l'Ingénierie Médicale et de la Complexité - Informatique, Mathématiques et Applications, Grenoble - UMR 5525
Abstract:

Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time and level. These emerging patterns are very challenging from a statistical point of view, especially when the goal is a summarized representation of single-cell expression data. PCA is a powerful tool for high-dimensional data representation: it searches for the latent directions that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data.

Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It jointly builds a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. We compare our method against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression (scRNA-seq) data.

Availability: Our work is implemented in the pCMF R package.
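To make the generative idea behind a Gamma-Poisson factor model more concrete, the following minimal sketch simulates a count matrix whose Poisson rates are products of Gamma-distributed cell factors and gene loadings, with entries zeroed at random to mimic dropout events. It is an illustration only, not the authors' pCMF implementation: it performs no variational EM inference, and every name and value in it (`simulate_counts`, `sample_poisson`, the Gamma shape and scale, the dropout probability, the matrix sizes) is an assumption chosen for the example.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(n_cells=60, n_genes=40, n_factors=2,
                    shape=2.0, scale=1.0, dropout_prob=0.3, seed=0):
    """Simulate counts X[i][j] ~ Poisson(sum_k U[i][k] * V[j][k]) with dropout.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Gamma-distributed latent factors: U for cells, V for genes.
    U = [[rng.gammavariate(shape, scale) for _ in range(n_factors)]
         for _ in range(n_cells)]
    V = [[rng.gammavariate(shape, scale) for _ in range(n_factors)]
         for _ in range(n_genes)]
    X = []
    for u in U:
        row = []
        for v in V:
            rate = sum(uk * vk for uk, vk in zip(u, v))
            count = sample_poisson(rng, rate)
            # Assumed dropout mechanism: zero the entry with fixed probability.
            if rng.random() < dropout_prob:
                count = 0
            row.append(count)
        X.append(row)
    return X, U, V


X, U, V = simulate_counts()
zero_fraction = sum(c == 0 for row in X for c in row) / (len(X) * len(X[0]))
print(len(X), len(X[0]), round(zero_fraction, 2))
```

A matrix simulated this way is non-negative, integer-valued, and contains many zeros, which are the properties the abstract cites as making Euclidean methods such as classical PCA a poor fit.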
https://hal.archives-ouvertes.fr/hal-01649275
Contributor: Ghislain Durif
Submitted on: Tuesday, March 12, 2019 - 11:06:59
Last modified on: Tuesday, September 24, 2019 - 16:22:05

File

article_pCMF_2019.pdf
Files produced by the author(s)

Citation

Ghislain Durif, Laurent Modolo, Jeff Mold, Sophie Lambert-Lacroix, Franck Picard. Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis. Bioinformatics, Oxford University Press (OUP), In press, pp.1-41. ⟨10.1093/bioinformatics/btz177⟩. ⟨hal-01649275v3⟩

Metrics

Record views: 249

File downloads: 248