vALId: validation of protein sequence quality based on multiple alignment data.

Open archive

Bianchetti, Laurent | Thompson, Julie Dawn | Lecompte, Odile | Plewniak, Frederic | Poch, Olivier

Published by CCSD; World Scientific Publishing

International audience. Validating sequences is essential for accurate phylogeny and structure/function analysis. However, among the thousands of protein sequences available in public databases, most have been predicted in silico and have not systematically undergone quality verification. It has recently become evident that they often contain sequence errors. To address the problem of automatic protein quality control, we have developed vALId, an interactive, web-based software tool. Taking advantage of high-quality multiple alignments of complete protein sequences (MACS), vALId first warns about the presence of suspicious insertions and deletions (indels) and divergent segments and, second, proposes corrections based on transcripts and genome contigs. In a first evaluation test, hundreds of indels and divergent segments were randomly generated in a manually refined MACS. The sensitivity (Sn) and specificity (Sp) of indel detection were excellent (0.96), while the mean Sn (0.49) and Sp (0.56) of divergent segment delineation depended on the percent identity between sequence neighbors. In a second test, 6195 sequences in 100 MACS corresponding to different functional and structural protein families were analyzed. Of these sequences, 65% were in silico predictions, and 44% of predicted eukaryotic proteins were partially incorrect, with at least one suspicious indel or divergent segment.
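The abstract reports Sn and Sp without defining them. As a minimal sketch, assuming the convention common in sequence-prediction benchmarks (Sn = TP / (TP + FN), Sp = TP / (TP + FP)), the two scores could be computed from detection counts as follows. The `Counts` class, the function names, and the example counts are hypothetical illustrations, not taken from the paper, and the authors may have used a different definition of Sp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Detection counts for one evaluation run (hypothetical container)."""
    tp: int  # artificial errors that were correctly flagged
    fp: int  # flagged regions that were not real errors
    fn: int  # artificial errors that were missed


def sensitivity(c: Counts) -> float:
    """Sn = TP / (TP + FN): fraction of true errors that were detected."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def specificity(c: Counts) -> float:
    """Sp = TP / (TP + FP): fraction of flagged regions that were true errors.

    Assumption: this is the prediction-benchmark convention; the abstract
    does not say which definition of Sp was used.
    """
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


# Illustrative counts only, not data from the paper.
example = Counts(tp=8, fp=2, fn=2)
print(sensitivity(example), specificity(example))
```

Under this convention, a low Sn means many introduced errors go unnoticed, while a low Sp means many warnings point at regions that are actually correct.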

View online

Suggestions

By the same author

PipeAlign: A new toolkit for protein family analysis

Open archive | Plewniak, Frederic | CCSD

International audience. PipeAlign is a protein family analysis tool integrating a five-step process ranging from the search for sequence homologues in protein and 3D structure databases to the definition of the hier...

Multiple alignment of complete sequences (MACS) in the post-genomic era

Open archive | Lecompte, Odile | CCSD

Multiple alignment, since its introduction in the early seventies, has become a cornerstone of modern molecular biology. It has traditionally been used to deduce structure / function by homology, to detect conserved motifs and in ...
