Virus Pop—Expanding Viral Databases by Protein Sequence Simulation

Open archive

Kende, Julia | Bonomi, Massimiliano | Temmam, Sarah | Regnault, Béatrice | Pérot, Philippe | Eloit, Marc | Bigot, Thomas

Published by CCSD; MDPI

International audience. Improving our knowledge of the virosphere, which includes unknown viruses, is a key area in virology. Metagenomics tools, which perform taxonomic assignment from high-throughput sequencing datasets, are generally evaluated with datasets derived from biological samples or with in silico spiked samples containing known viral sequences present in public databases. As a result, their capacity to detect novel or distant viruses cannot be evaluated. Simulating realistic evolutionary directions is therefore key to benchmarking and improving these tools. Additionally, expanding current databases with realistic simulated sequences can improve the capacity of alignment-based search strategies to find distant viruses, which could lead to a better characterization of the “dark matter” of metagenomics data. Here, we present Virus Pop, a novel pipeline for simulating realistic protein sequences and adding new branches to a protein phylogenetic tree. The tool generates simulated sequences with substitution rate variations that depend on protein domains and are inferred from the input dataset, allowing for a realistic representation of protein evolution. The pipeline also infers ancestral sequences corresponding to multiple internal nodes of the phylogenetic tree of the input data, enabling new sequences to be inserted at various points of interest in the group studied. We demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences, taking the spike protein of sarbecoviruses as an example. Virus Pop also succeeded in creating sequences that resemble real sequences not included in the databases, which facilitated the identification of a novel pathogenic human circovirus absent from the input database. In conclusion, Virus Pop is helpful for challenging taxonomic assignment tools and could help improve databases to better detect distant viruses.
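To make the idea of domain-dependent substitution rates concrete, the sketch below evolves a descendant protein sequence from an ancestral one using per-site relative rates. It is an illustration only and not the Virus Pop implementation: the function name `simulate_descendant`, the equal-rates (Poisson) amino acid model, and the example rates are all assumptions chosen for brevity, whereas the abstract states that the actual pipeline infers rates and ancestral sequences from the input dataset.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate_descendant(ancestor, site_rates, branch_length, rng):
    """Evolve `ancestor` along one branch with per-site relative rates.

    Assumption: an equal-rates (Poisson) model over 20 amino acids, where
    branch_length is in expected substitutions per site at relative rate 1.
    Hypothetical helper, not part of Virus Pop.
    """
    if len(ancestor) != len(site_rates):
        raise ValueError("ancestor and site_rates must have the same length")
    descendant = []
    for residue, rate in zip(ancestor, site_rates):
        # Probability that the site ends in a different state under the
        # 20-state equal-rates model.
        p_change = (19 / 20) * (1 - math.exp(-(20 / 19) * rate * branch_length))
        if rng.random() < p_change:
            # Pick one of the 19 other amino acids uniformly.
            others = [aa for aa in AMINO_ACIDS if aa != residue]
            descendant.append(rng.choice(others))
        else:
            descendant.append(residue)
    return "".join(descendant)


if __name__ == "__main__":
    rng = random.Random(42)
    ancestor = "MKTAYIAKQR"
    # Assumed example: a conserved domain (rate 0.1) followed by a
    # fast-evolving region (rate 3.0).
    rates = [0.1] * 5 + [3.0] * 5
    print(simulate_descendant(ancestor, rates, 0.5, rng))
```

Sites with a relative rate of zero never change in this sketch, and higher-rate sites diverge faster along the same branch, which is the qualitative behaviour that domain-dependent rate variation is meant to capture.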

Suggestions

By the same author

Bat coronaviruses related to SARS-CoV-2 and infectious for human cells

Open archive | Temmam, Sarah | CCSD

International audience. The animal reservoir of SARS-CoV-2 is unknown despite reports of SARS-CoV-2-related viruses in Asian Rhinolophus bats (refs. 1-4), including the closest virus from R. affinis, RaTG13 (refs. 5,6), and ...

De nouveaux coronavirus de chauve-souris similaires à SARS-CoV-2 éclairent l'origine de la pandémie de COVID-19 [New bat coronaviruses similar to SARS-CoV-2 shed light on the origin of the COVID-19 pandemic]

Open archive | Temmam, Sarah | CCSD

International audience

Horseshoe bats from Southeast Asia host a high diversity of Sarbecoviruses, including close ancestors of SARS-CoV-2

Open archive | Temmam, Sarah | CCSD

International audience. Background: Bats are a major reservoir for zoonotic viruses, including coronaviruses. Since the emergence of SARS-CoV, considerable efforts have been directed towards describing the diversity...
