Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction

Open archive

Phan, Han | Brouard, Céline | Mourad, Raphaël

Published by CCSD; Oxford University Press (OUP)

International audience.

Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and the amount of such data is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) method based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence in the pseudo-labeled data used for pre-training, which yielded improvements for transcription factors with very few binding sites (very small training data). The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models, and in most cases shows strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
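
As an illustration of the general idea only, the following is a minimal sketch of generic pseudo-labeling with confidence weighting, not the authors' implementation. It assumes a toy logistic-regression "model" on one-hot encoded DNA, a teacher whose own predicted probability serves as the confidence score, and a hypothetical `threshold` parameter; all function names are invented for this example. The student is pre-trained on confident pseudo-labeled sequences and then fine-tuned on the labeled set.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a flattened one-hot vector (order A, C, G, T)."""
    idx = np.array(["ACGT".index(base) for base in seq])
    return np.eye(4)[idx].ravel()

def train_logreg(X, y, w=None, weights=None, lr=0.5, epochs=200):
    """Weighted logistic regression by gradient descent.

    Passing `w` warm-starts training from existing parameters, which is how
    the pre-trained student is fine-tuned below.
    """
    n, d = X.shape
    w = np.zeros(d + 1) if w is None else w.copy()
    sw = np.ones(n) if weights is None else weights
    Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (sw * (p - y))) / sw.sum()
    return w

def predict_proba(w, X):
    """Predicted probability of the positive class for each row of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def pseudo_label_pipeline(X_lab, y_lab, X_unlab, threshold=0.8):
    """Teacher -> pseudo-labels -> student pre-training -> fine-tuning."""
    # 1. Train a teacher on the small labeled set.
    teacher = train_logreg(X_lab, y_lab)
    # 2. Pseudo-label the unlabeled sequences and score confidence.
    #    (Assumption: confidence is the teacher's own max class probability.)
    p = predict_proba(teacher, X_unlab)
    conf = np.maximum(p, 1.0 - p)
    keep = conf >= threshold
    # 3. Pre-train the student on confident pseudo-labels, weighted by confidence.
    w0 = None
    if keep.any():
        pseudo_y = (p[keep] >= 0.5).astype(float)
        w0 = train_logreg(X_unlab[keep], pseudo_y, weights=conf[keep])
    # 4. Fine-tune the student on the true labels.
    student = train_logreg(X_lab, y_lab, w=w0)
    return student, keep
```

In this sketch, calling `pseudo_label_pipeline(X_lab, y_lab, X_unlab)` returns the fine-tuned student weights and a boolean mask of the unlabeled sequences that passed the confidence filter. A real application would replace the logistic regression with a neural architecture and would need a confidence estimate suited to the data.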

Suggestions

By the same author

Should we really use graph neural networks for transcriptomic prediction?

Open archive | Brouard, Céline | CCSD

International audience. The recent development of deep learning methods has undoubtedly led to great improvement in various machine learning tasks, especially in prediction tasks. This type of methods have also bee...

NMFProfiler: a multi-omics integration method for samples stratified in groups

Open archive | Mercadié, Aurélie | CCSD

International audience. Motivation: The development of high-throughput sequencing enabled the massive production of “omics” data for various applications in biology. By analyzing simultaneously paired datasets collec...

ProA and ProB repeat sequences shape genome organization, and enhancers open domains

Open archive | Bonnet, Konstantinn Acen | CCSD

Summary: There is a growing awareness that repeat sequences (RepSeq) - the main constituents of the human genome - are also prime players in its organization. Here we propose that the genome should be envisioned as a supersystem wi...
