Balkan Journal of Medical Genetics

METADATA MANAGEMENT AND SEMANTICS IN MICROARRAY REPOSITORIES
Kocabaş F1,2,*, Can T3, Baykal N1
*Corresponding Author: Fahri Kocabaş, NATO HQ C3S, Blvd Leopold III B, 1110 Brussels, Belgium; Tel.: +32-2-707-5533; Fax: +32-2-707-5834; E-mail: FK:fahri@ii.metu.edu.tr; f.kocabas@hq.nato.int
page: 49

MATERIALS AND METHODS

The Solution – MAdmf (Microarray Discovery Metadata Framework). The GEO repository is one of the main submission areas and a primary information resource for biomedical inquiries. Submitters supply three record types to GEO: Platform, Sample, and Series. A GEO Series (GSExxx) record summarizes an experiment by linking a group of related samples. The GEO curator reassembles these data (one or more GSE records) into a GEO Dataset (GDSxxx), which represents samples processed using the same platform [4]. GEO provides an XML file (MINiML) for each submitted record. In this study, our focus has been on the MINiML file, which includes both data (such as summary, platform, and sample data) and metadata (such as title, description, and contact information). The MINiML file should serve as a metadata card, but it is neither named nor designed as such. We propose a framework, MAdmf, which includes a format for metadata in microarray results to address the listed issues. The metadata card, the semantic net (SemNet), and the metadata registry are the key elements of this framework. The metadata card is an index card that stores basic data elements about specific domain information. It gives readers the information they need to decide whether the record(s) might suit their needs. A SemNet is a small data model that represents domain-specific information. The metadata cards and SemNets are encoded in RDF/XML (a format for metadata and knowledge representation). Syntax encoding schemes are used in SemNets. The metadata registry is a shareable repository for metadata and its related SemNet(s). The framework has four components, as depicted in Table 1. First, we provide a metadata card (MAdmc, Microarray Discovery Metadata Card) that includes common exchange elements in a standard format in accordance with metadata standards. Thus, discoverability, semantic interoperability, and integration operations are supported.
The format and structure of MAdmc extend MINiML [16] and are based on DC (Dublin Core) and the Metadata Registry Standard [17]. Second, SemNets are developed for the experimenters and for the results of related experiments. Third, queries in SPARQL (SPARQL Protocol and RDF Query Language) [18] format have been developed for information access and discovery operations. Finally, these products (MAdmc, SemNets, and associated queries) are stored in a common reference area for further use. They can also be exchanged among microarray repositories. Such an exchange may reduce the need for multiple submissions and undesired redundancy, while the raw data remain in their original location. The metadata card and its associated SemNet(s) may hold frequently accessed data patterns as well as previously hidden or unavailable content in a structured format. Thus, much more of the processing can be automated. They can be queried without a dedicated application because they are represented in RDF/XML, which is extensible, integrable, and queryable. The proposed framework organizes and structures microarray metadata in both syntax and semantics. With such machine-processable metadata cards and their related SemNet(s), the user can perform complex queries and backlogs can be reduced. Microarray analysis has already evolved into microarray informatics. We believe that such architectural solutions are needed in the microarray domain. The goal of shared semantics and common understanding can be reached by applying data management principles to structured and semantically enriched data. This metadata framework makes two main contributions. First, the experimenter can submit more contextual data. Second, machine-interpretable content is promoted, which supports curation and analysis work. The expressive power gained is twofold.
The producer is encouraged to include more of the experimental findings, and implicit or previously unavailable data become discoverable by consumers, who get the intended meaning. The life cycle management of the records is important. The experiment and its publication, together with some updates on specific databases, constitute the first part of the activities in the lifetime of a record. The biomedical community has been successful in this part. However, the important part, which has largely been overlooked, follows this first part and ends when the record is deleted. This second part involves validation, modification, and knowledge discovery operations (for example, developing research hypotheses in meta-analysis). The weakness lies here, as highlighted in several publications [19]. This study addresses this second part to make the results visible, understandable, and usable. MAdmf will require additional resources, but such an effort will pay off in data-centric operations. We enforced data management by organizing and structuring data in a way that would improve the quality of microarray data analysis. Data management must be built into the process from the beginning to support information system development. It is a knowledge-interoperable development that allows domain experts to build or contribute to a separate data layer, which can then be incorporated into knowledge-based design [20]. For example, the domain expert may create a SemNet to include the information “P53 gene-related experiments that find relevance of arsenite and apoptosis in breast cancer, as verified by RT-PCR, published in a peer-reviewed journal, with citation count >10, curated into a GDS record, and entered into a specialized repository (such as GO or a pathway database, Reactome [21]) in the last decade,” provided that the metadata cards contain it. We used tools from W3C resources in the development of these products.
The respective concepts and techniques are borrowed from the semantic web (SemWeb), data management, structured reporting, electronic business management, configuration management, and metadata standards. We propose that shareable metadata cards, semantically powered by semantic nets, can be a solution. The framework presented in this study can be used in any high-throughput repository as well as on third-party platforms.
MAdmc (Microarray Discovery Metadata Card). MAdmc is a metadata card for a microarray experiment. The metadata card is a stable concept that is used for resource discovery. In our framework, it facilitates not only visibility but also usability and common understanding. With that goal in mind, we extended the structure, organization, and syntax of the MINiML file to produce MAdmc. The overall syntax of MAdmc is a format layout for the content. We propose standardizing the metadata in the MINiML file by including DC elements and by introducing the metadata card concept. The metadata card has administrative, descriptive, structural, and semantic elements. Dublin Core is a standard (ISO 15836) for cross-domain resource description. The use of DC elements in metadata definition also promotes structured entry. Thus, it becomes easy to find and understand information resources. MINiML seems to serve this purpose, but its structure and content are not appropriate to support this function. Structuring the records and making structured entries for data elements within the records are closely related and complementary paradigms. Structured entry for the values is enforced by selecting a value from a controlled vocabulary or by entering a value dictated by a pattern in the schema file. Microarray records carry more meaning when analyzed in a batch and placed in a biological context.
Since the experimental settings, samples, methods, tools, and formats differ widely, it is a challenging task for microarray repositories to offer such an analysis in an efficient manner. We introduced layers into the organization of metadata elements and employed data and syntax encoding schemes. Repeatability and structural relationships between elements were defined. For example, the title may be repeated (alternative title), or the use of an element may depend on a condition of another one. The life cycle management concept was introduced through versioning and modification status information. Life cycle management covers the period from submission until retirement, thus bringing up the living record concept. It is implemented based on the relation element, which may include the values ‘is version of,’ ‘replaces,’ or ‘part of.’ Thus, this becomes a part of the microarray data rather than the software code. Human or automated users can modify, annotate, and verify a record several times throughout its lifetime. We developed an XML application (the MAdmc program) with which the user selects elements from the MINiML document and adds new ones from the DC Metadata Set, along with attributes from the Metadata Registry standard, to create the MAdmc. The DC Metadata Set includes 15 information elements. In MAdmc, we added four new information elements (three in the Security layer, one in the Format Specification layer) and detailed each element with four attributes, including an obligation category. We then organized them into four layers, as shown in Table 2. The details of the metadata card definition are given in the MAdmc.xsd file (Figure 1). The user can reference this schema file to create his/her own instance document (metadata card). The experimenter or curator can create the MAdmc file by using the MINiML file and the MAdmc program, as explained in the Case Study section. The structure of MAdmc can also be extended by employing associations among the tags.
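As a rough illustration of the card structure described above, the following Python sketch builds a small card from Dublin Core elements and checks obligation and repeatability rules. It is not the MAdmc program or the MAdmc.xsd schema: the `madmc` namespace URI, the layer names, the obligation assignments, and the helper names `build_card` and `check_card` are all assumptions made for this example; only the Dublin Core namespace and the element names `identifier`, `title`, and `relation` are standard.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
MADMC = "http://example.org/madmc#"       # hypothetical namespace, for illustration only

ET.register_namespace("dc", DC)
ET.register_namespace("madmc", MADMC)

# Illustrative element definitions: name -> (layer, obligation, repeatable).
# Layer names and obligation assignments are assumptions, not the paper's Table 2.
DEFINITIONS = {
    "identifier": ("Descriptive", "Mandatory", False),
    "title":      ("Descriptive", "Mandatory", True),
    "relation":   ("LifeCycle",   "Optional",  True),
}


def build_card(values):
    """Build a card element from a list of (element_name, text) pairs."""
    card = ET.Element(f"{{{MADMC}}}card")
    for name, text in values:
        layer, obligation, _ = DEFINITIONS[name]
        el = ET.SubElement(card, f"{{{DC}}}{name}")
        el.set(f"{{{MADMC}}}layer", layer)
        el.set(f"{{{MADMC}}}obligation", obligation)
        el.text = text
    return card


def check_card(card):
    """Return a list of violations of the obligation and repeatability rules."""
    problems = []
    for name, (_, obligation, repeatable) in DEFINITIONS.items():
        count = len(card.findall(f"{{{DC}}}{name}"))
        if obligation == "Mandatory" and count == 0:
            problems.append(f"missing mandatory element: {name}")
        if not repeatable and count > 1:
            problems.append(f"element not repeatable: {name}")
    return problems


card = build_card([
    ("identifier", "GSExxx"),
    ("title", "Example series title"),
    ("title", "Alternative title"),        # a repeated (alternative) title
    ("relation", "is version of GSExxx"),  # life cycle link via dc:relation
])
print(check_card(card))                    # an empty list means no violations
print(ET.tostring(card, encoding="unicode"))
```

Keeping the rules in a data table rather than in code mirrors the point made in the text: versioning and constraints become part of the metadata definition, so they can change without touching the program.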
The associations can be represented in EBNF (Extended Backus-Naur Form) syntax and defined in the schema file, as was the case for the structured messaging system at NATO (North Atlantic Treaty Organization). For example, an element may occur several times; information elements such as the title, location, and organization may have alternate contents; information elements are labelled with one of the categories ‘Mandatory,’ ‘Optional,’ or ‘Conditional’; and the requirement or prohibition of use under a condition (e.g., mutual exclusivity) may be enforced. The rules are encoded in XPath expressions [22]. Although it is an optional extension, this topic could be revisited once the metadata concept is recognized. The layers (segmentation), repetition, and structural constraints in the mark-up tags can be designed to enhance the structure and meaning of the metadata card.
Semantic Nets – Micro Formats. Different parts of the metadata card can be detailed with SemNets. Such work is analogous to that performed by domain experts on the data layer in knowledge-based systems. SemNets can be generated for each GEO record, for a group of related records, or for the whole repository, depending on the contextual requirements. The SemNets accompany their related metadata cards, and they can all be integrated into a related RDF store. The RDF store can be coupled with any platform and can then be used for ontology development, database modeling, and any semantic task. Data and syntax encoding schemes are used for information elements such as experimenters, address, description, and summary. The data encoding schemes could be controlled vocabularies [e.g., code lists (ISO 3166 country codes), classifications (ICD), subject headings (MeSH)], formal notations such as ISO 8601 (date and time) or ISO 639 (language), or the use of a specific name space. Friend of a Friend (FOAF) and Rule Mark-up Language (RuleML) syntaxes are used for encoding relevant data into a SemNet.
FOAF is a SemWeb language that describes relationships among people in RDF, forming an ontology of its own [23]. RuleML is a mark-up language for publishing and sharing rule bases. It is based on a deductive reasoning engine and its statements can be embedded in knowledge-based systems [24]. In this study, the experimenter and summary parts are extended with SemNets, in accordance with the relevant syntax, to add meaning and to build semantic expressiveness. The experimenters are modeled using FOAF syntax, and the result part is modeled using RuleML Datalog syntax. Online tools in the public domain, as suggested by W3C, are used in the development of the SemNets. The human concept in the microarray record should be structured. There are types (human, automated); categories (scheduled, unscheduled); statuses (novel, experienced); roles (producer, consumer); and actors (submitter, contact, contributor, author of publication, publisher, curator, funding agency representative, government official, meta-analyst, verifier, system developer, reviewer, etc.). Such a detailed definition may hold valuable information for a potential consumer. Data sets are at different maturity levels in terms of structure and content. One person’s data may be labeled as metadata or information by someone else, and today’s information may become data later in its lifetime. An experimenter may need to search on the human element to make decisions about experiment design. There are mature formats such as hCard [25], vCard [26], or W3C’s PIM (Personal Information Management) [27] whose information can be included in the FOAF model to form a coalition of complementary vocabularies. The summary information has been a frequently accessed area. This portion of the microarray record should also have a machine-understandable structure and content. For that reason, we employed an encoding process for the statements to create a SemNet.
We included free-text statements, the encoded format, and annotations, all in RDF notation. More data are stored in RDF format today to create linked data. RDF files can be integrated into a persistent RDF store to form connected graphs. In our study, the properties and relationships of information resources are described within RDF graphs for the SemNets [experimenter net (in FOAF) and result net (in RuleML Datalog)]. These are associated with each MAdmc record or group of related MAdmc records, according to the specific knowledge that is represented. Thus, Experimenter and Result SemNets can be packaged with metadata cards while ontology use is in place. SemNets are data models that are easy to create for specific domain information and that can support both ontology development and database design. Ontology extensions can subsequently be built from these SemNets. For example, describing a person in an ontology may eventually converge to a FOAF model. A new vocabulary and ontology extension can be generated from the RDF resources. The RDF triples for information objects may become instances of existing Web Ontology Language (OWL) classes, or they may trigger the creation of new classes for specific concepts. Ontology terms should clearly be used as the tokens in a SemNet. Ontology is used for annotation, but we encode data and metadata with syntax systems in SemNets. There is a proliferation of ontologies, and there are interoperability problems among them. The Ontology for Biomedical Investigations (OBI) standardization initiative focuses on upper ontology development, whereas lower-level ontology remains in the realm of domain-specific ontologies such as the MGED Ontology. An ontology is a conceptual model that may not map to physical data sources, whereas a SemNet does. A semantic net can serve as a basis for bottom-up ontology development. Ontology is monotonic, in that new statements should not falsify previous conclusions [28].
Regarding microarray experiments, there are conflicting results as well as supporting ones, and SemNets may include such non-monotonic statements.
Queries. Some frequently asked queries can be materialized in SPARQL within the framework and posted to a shared registry; SPARQL is similar to Structured Query Language (SQL) and is the de facto standard RDF query language. Answers to specific queries that are currently difficult to obtain, such as the following, become possible when MAdmf is employed: 1) list submitters within organization X who have worked on the effect of Tamoxifen on breast cancer in humans and whose records have been curated to GDS; 2) list breast cancer records that have been published in SCI journals with citation counts >10, have been verified, and have been included in special databases; 3) list all facts and hypotheses from records related to the P53 gene between 2000 and 2009; 4) list the versions, states (modified, retired, etc.), types (comparative, collaborative, validation, etc.), and modification details of BRCA1- and BRCA2-related records; 5) list super GSE records and their child records that relate to experiments on the ATM gene that find relevance to apoptosis in breast cancer, by submitters from the USA in the last decade. The metadata card and SemNets can hold data to answer these questions in a knowledge representation format. One sample query and its result are demonstrated in the Case Study section.
MAdmr (Microarray Discovery Metadata Registry). MAdmr will be the key element to enforce a data strategy by facilitating the visibility, usability, and understandability of data assets. The submission package to this ebXML (Electronic Business using XML)-based shared space may include the MAdmc, SemNet, schema file, query file, and a guidance document (Figure 2). MAdmr can be either GEO or another repository.
A federated system of microarray repositories can also assume a metadata registry role to host microarray discovery data. Different users (such as a submitter, reviewer, or web services program) can subscribe to such a registry, and producer(s) can make modifications and create new versions on the metadata registry throughout the lifetime of the microarray records, until their retirement.
The Case Study. The GEO records (Series, Platform, and Sample) and contact data were downloaded, stored in an OpenOffice BASE database, and examined with a domain specialist in terms of structure and semantics. For the case study, we accessed 677 breast cancer experiment results (677 GSE records, 89 GDS records) among more than 22,000 Series records. We developed the metadata card using our MAdmc program (Figure 3). Then, two sets of SemNets were created per record(s) using the RDF editor Protégé [29], the online W3C XML Schema Validation tool [30], and RDF Validation tools [31]. SemNets (RDF graphs) in Protégé are queried using SPARQL. The first SemNet was for experimenters in FOAF/RDF (not included for brevity), and the second one was for the result section (Tables 3 and 4). Note that the examples of these SemNets are given for proof of concept only. Two statements encoded in RuleML Datalog (clausal first-order logic) are given in Table 3. We show an entry-level encoding in Table 3 to give an insight. The encoding could have gone further with deeper mark-up, as demonstrated in Table 3, a.2. The statements could have been further categorized as experimental, statistical, or computational, or their status could be labeled as verified, challenged, withdrawn, or modified. The goal is to highlight the elements of MAdmf; thus, we do not claim to present the optimal representation. Here we demonstrate that the results can be formatted in a syntax encoding scheme such as RuleML Datalog. This structured set of statements can then be shared and processed by automated means.
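To make the idea of an entry-level encoding concrete, the Python sketch below pairs a free-text statement with a Datalog-style ground atom and with the category and status labels mentioned above. It is only a simplified stand-in: it does not reproduce Table 3 and does not emit RuleML XML. The `Statement` class, the relation name `associated_with`, and the constants `GENE_A`, `GENE_C`, and `PROCESS_B` are placeholders invented for the example and carry no biological claim.

```python
from dataclasses import dataclass

# Label sets taken from the categories and statuses named in the text.
CATEGORIES = {"experimental", "statistical", "computational"}
STATUSES = {"verified", "challenged", "withdrawn", "modified"}


@dataclass(frozen=True)
class Statement:
    original: str   # free-text sentence from the record summary
    relation: str   # predicate name of the encoded atom
    args: tuple     # constant arguments of the atom
    category: str
    status: str

    def __post_init__(self):
        # Structured entry: reject labels outside the controlled lists.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")

    def to_datalog(self):
        """Render the atom as a ground Datalog fact, e.g. rel("a", "b")."""
        quoted = ", ".join(f'"{a}"' for a in self.args)
        return f"{self.relation}({quoted})."


# Placeholder statements; the text and constants are hypothetical.
statements = [
    Statement("Placeholder finding text 1", "associated_with",
              ("GENE_A", "PROCESS_B"), "experimental", "verified"),
    Statement("Placeholder finding text 2", "associated_with",
              ("GENE_C", "PROCESS_B"), "statistical", "challenged"),
]

# Automated processing: keep only the facts whose status is "verified".
verified = [s.to_datalog() for s in statements if s.status == "verified"]
print(verified)
```

Once statements carry explicit labels like these, a consumer can filter or group them mechanically instead of re-reading each summary.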
The individual statements for each of these 677 breast cancer GEO records can form a semantic net that is associated with the relevant MAdmc. There may also be global statements about meaningful findings for a specific sub-group of records or for the whole set of breast cancer records. SemNets can be expressed in different representations, such as triple notation and graph diagrams as well as RDF/XML format. We include three elements in this encoding of the SemNet: the original statements, the encoded format, and annotations. The annotation part of this package provides contextual information and may indicate whether: 1) there is a related publication; 2) the results are posted somewhere else, such as GO or a pathway database; 3) there are other versions; 4) it is a fact or a hypothesis; 5) it is verified or challenged. Relevant name space declarations such as “MAdmc” can be included in a MAdmc schema file to support the additional definitions (Table 4). A sample Result SemNet is given in RDF/XML format in Table 4, and its graphical output from the RDF Validator is given in Figure 4. There may be a different level of encoding for each record, based on the availability of relevant information. We recommend entry-level encoding at the beginning; as acceptance and experience grow, the encoding may become more sophisticated. There are platforms in that direction, such as jDREW [32], built on RuleML Datalog. We not only encode and represent the free-text result section but also open the way for triggering derivations from an already stored rule base. In fact, this is the job of a rule-based system; we demonstrate the capability. Rules can extend OWL, as included in the Semantic Web architecture. In that regard, for example, SWRL (Semantic Web Rule Language) combines RuleML (Horn-like rules) with OWL (axioms) [33], and the RIF (Rule Interchange Format) mechanism allows different representations to be grouped for further use [34]. The metadata card and SemNets can also be queried using the online SPARQL tool [35].
The query file in Figure 5 can be attached to the related SemNet file.
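The kind of lookup such a query performs can be illustrated with plain triple-pattern matching, the operation underlying a SPARQL basic graph pattern. The Python sketch below is not a SPARQL engine and does not reproduce the query in Figure 5: the record identifiers, the `madmc:` property names, and the `match` helper are hypothetical and exist only for this example.

```python
# A SemNet held as a set of (subject, predicate, object) triples.
# All identifiers and property names are hypothetical placeholders.
TRIPLES = {
    ("rec:1", "madmc:topic", "breast cancer"),
    ("rec:1", "madmc:status", "verified"),
    ("rec:1", "madmc:curatedTo", "GDSxxx"),
    ("rec:2", "madmc:topic", "breast cancer"),
    ("rec:2", "madmc:status", "challenged"),
}


def match(patterns, triples, bindings=None):
    """Yield variable bindings that satisfy every triple pattern (a conjunctive query)."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(bindings)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable binds on first use and must agree afterwards.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# Roughly the pattern a SPARQL WHERE clause would express:
# verified breast cancer records that have been curated to a GDS record.
QUERY = [
    ("?rec", "madmc:topic", "breast cancer"),
    ("?rec", "madmc:status", "verified"),
    ("?rec", "madmc:curatedTo", "?gds"),
]

results = sorted((b["?rec"], b["?gds"]) for b in match(QUERY, TRIPLES))
print(results)
```

In practice an RDF store with a SPARQL endpoint would do this work; the sketch only shows why annotations stored as triples make such questions answerable without a dedicated application.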
