HoloOLIGO corpus, a manually annotated text dataset supporting schema-based relational information extraction for mammalian milk oligosaccharide diversity pattern comprehension

Open archive

Rumeau, Mathilde | Bossy, Robert | Sauvion, Clara | Loux, Valentin | Ba, Mouhamadou | Knudsen, Christelle | Combes, Sylvie | Nédellec, Claire | Deléger, Louise

Published by CCSD

International audience. Research on milk oligosaccharides (MO) has gained pace in recent years due to growing evidence of their numerous health benefits. Many studies have assessed MO composition in a wide range of mammalian species. They have shown that the oligosaccharide structure diversity pattern varies considerably between species [1], between individuals of the same species [2], within the same individual over the course of lactation [3], and with other factors such as maternal physiological state and geographical location. Data on associated or causal variability of the MO structure pattern are spread across a growing volume of scientific articles that must be considered to understand the implications of such diversity for health. On the basis of these observations, the HoloOLIGO ANR project intends to build an accurate and exhaustive reference database to query, find, visualize, compare and analyze MO patterns among mammalian species by automatically exploring the literature and extracting relevant information on MO diversity within or across species. However, the many typographic variants and chemical nomenclature conventions used for MO compound names across the literature make it difficult to automatically retrieve textual information from bibliographic databases [4]. Moreover, identifying the relationships between the various factors influencing MO composition is challenging because of the high information density of the documents. Once extracted from documents, the information stored in the future database must be normalized with respect to semantic references so that it can be easily found and reused. Information extraction by natural language processing methods is an appealing solution in such complex cases [5]. Its application requires a manually annotated text corpus.
To that end, we developed the HoloOLIGO gold standard for training and evaluating named entity recognition and relation extraction tools meant to extract MO composition and its related factors from scientific literature. The HoloOLIGO gold standard is a collection of 30 abstracts and extracts from full-text scientific articles. A relational information annotation schema was created with 14 concepts, 10 binary relations and multiple n-ary relations relevant to MO composition. External semantic resources were chosen to assign a unique identity to entities, and 4 proprietary glossaries were created to overcome the lack of reference sources. Annotation guidelines were developed to instruct annotators; they contain definitions of the entities, semantic references and relations, together with annotation rules and examples. Each concept and relation has been annotated in the articles, resulting in the markup of MO qualities and quantities, female metadata and experimental characteristics. The originality of the HoloOLIGO gold standard corpus lies in its complex annotation schema, which offers a relevant and valuable basis for modeling the oligosaccharide diversity pattern of milk to an unprecedented degree of detail and for better understanding the role of MO in health. Corpus revision is almost complete, and the corpus will be made available shortly. The corpus contains 3,642 entities, 2,488 binary relations and 2,191 n-ary relations. The most frequent entities are oligosaccharide name (25.6%), species (20.9%) and sample type (17.3%). The most frequent binary relation types are composed of (38.1%), has produced (31.8%) and found in quantity (9.9%). These relations are at the center of our annotation schema, which underlines its relevance. In future work, the corpus will be used to train and evaluate automatic extraction models.

References
1. Urashima T, Messer M, Oftedal OT. Oligosaccharides in the Milk of Other Mammals. In: Prebiotics and Probiotics in Human Milk [Internet]. Elsevier; 2017 [cited 26 Jan 2023]. p. 45-139. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9780128027257000038
2. Bode L, Jantscher-Krenn E. Structure-Function Relationships of Human Milk Oligosaccharides. Advances in Nutrition. 1 May 2012;3(3):383S-391S.
3. Azad MB, Robertson B, Atakora F, Becker AB, Subbarao P, Moraes TJ, et al. Human Milk Oligosaccharide Concentrations Are Associated with Multiple Fixed and Modifiable Maternal Characteristics, Environmental Factors, and Feeding Practices. J Nutr. 1 Nov 2018;148(11):1733-42.
4. Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform. Dec 2015;7(S1):S2.
5. Dunn A, Dagdelen J, Walker N, Lee S, Rosen AS, Ceder G, et al. Structured information extraction from complex scientific text with fine-tuned large language models [Internet]. arXiv; 2022 [cited 13 May 2024]. Available from: http://arxiv.org/abs/2212.05238
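
As a rough illustration of the annotation schema described in the abstract, the Python sketch below models entities, binary relations and n-ary relations as plain data classes and computes the percentage share of each entity type, in the spirit of the corpus statistics reported above. The class names, field names, identifiers and sample annotations are all hypothetical; only the concept and relation labels (oligosaccharide name, species, sample type, found in quantity) are taken from the abstract, and the actual storage format of the corpus is not specified here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str        # hypothetical annotation identifier
    concept: str   # one of the schema concepts, e.g. "species"
    text: str      # surface form found in the article


@dataclass(frozen=True)
class BinaryRelation:
    type: str      # e.g. "found in quantity"
    source: str    # id of the first argument entity
    target: str    # id of the second argument entity


@dataclass(frozen=True)
class NaryRelation:
    type: str
    arguments: tuple  # (role, entity_id) pairs; role names are assumptions


def type_shares(items, key):
    """Return the percentage share of each label, rounded to one decimal."""
    counts = Counter(key(item) for item in items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Invented sample annotations, for illustration only.
entities = [
    Entity("T1", "oligosaccharide name", "2'-fucosyllactose"),
    Entity("T2", "oligosaccharide name", "lacto-N-tetraose"),
    Entity("T3", "species", "Bos taurus"),
    Entity("T4", "sample type", "colostrum"),
]

# The argument order of this relation is an assumption, not the corpus schema.
relations = [BinaryRelation("found in quantity", "T1", "T4")]

shares = type_shares(entities, key=lambda e: e.concept)
print(shares)
```

Keeping entities and relations as separate records that refer to each other by identifier is one common design for annotated corpora; it lets the same `type_shares` helper be reused for relation types by passing `key=lambda r: r.type`.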
