METADATA MANAGEMENT AND SEMANTICS IN MICROARRAY REPOSITORIES
Kocabaş F1,2,*, Can T3, Baykal N1
*Corresponding Author: Fahri Kocabaş, NATO HQ C3S, Blvd Leopold III B, 1110 Brussels, Belgium;
Tel.: +32-2-707-5533; Fax: +32-2-707-5834; E-mail: FK:fahri@ii.metu.edu.tr; f.kocabas@hq.nato.int
Page: 49
MATERIALS AND METHODS
The Solution – MAdmf (Microarray Discovery
Metadata Framework). The GEO repository is one of
the main submission areas and a primary information
resource for biomedical inquiries. There are three records
(Platform, Sample, and Series) that are supplied
by submitters on GEO. A GEO Series (GSExxx) record
summarizes an experiment by linking a group of related
samples. The GEO curator reassembles this data (one
or more GSE records) into a GEO Dataset (GDSxxx),
which represents samples processed using the same
platform [4]. GEO provides an XML file (MINiML)
for each submitted record. In this study, our focus has
been on the MINiML file, which includes both data (such
as summary, platform, and sample data) and metadata
(such as title, description, and contact information).
The MINiML file should serve as a metadata card, but
it is neither named nor designed as such.
We propose a framework, MAdmf, which includes
a format for metadata in microarray results to address
the listed issues. The metadata card, semantic net (SemNet),
and metadata registry are the key elements of this framework.
The metadata card is an index card for storing
basic data elements about specific domain information.
It gives the reader enough information to decide
whether the record(s) might suit his/her needs. A SemNet
is a small data model that represents domain-specific
information. The metadata cards and SemNets are
encoded in RDF/XML (an XML syntax for RDF, a metadata
and knowledge representation format). Syntax encoding
schemes are used in SemNets. The metadata registry
is a shareable repository for metadata cards and their related
SemNet(s). The framework has four components, as
depicted in Table 1.
First, we provide a metadata card (MAdmc, Microarray
Discovery Metadata Card) that holds common
exchange elements in a standard format in accordance
with metadata standards. Thus, discoverability,
semantic interoperability, and integration operations are
supported. The format and structure of MAdmc are
an extension of MINiML [16] and are based on Dublin Core (DC)
and the Metadata Registry Standard [17]. Second, SemNets
are developed for the experimenters and for the results of related
experiments. Third, queries in SPARQL (SPARQL Protocol
and RDF Query Language) [18] have
been developed for information access and discovery
operations. Finally, these products (MAdmc, SemNets,
and associated queries) are stored in a common reference
area for further use. They can also be exchanged
among microarray repositories. Such exchange or
sharing may reduce the need for multiple submissions
and undesired redundancy, while the raw data remains in
its original place.
The metadata card and its associated SemNet(s)
may hold frequently accessed data patterns as well as
previously hidden or unavailable content in a structured
format. Thus, much more of the processing
can be automated. They can be queried without
a dedicated application because they are represented
in RDF/XML, which is extensible, integrable,
and queryable. The proposed framework organizes
and structures microarray metadata in both
syntax and semantics. With such machine-processable
metadata cards and their related SemNet(s), the user
can perform complex queries and backlogs can be
reduced. Microarray analysis has already
evolved into microarray informatics. We believe that
such architectural solutions are needed in the microarray
domain. The goal to reach shared semantics and
common understanding can be realized by applying
data management principles over structured and semantically
enriched data.
The proposed metadata framework makes two main
contributions. First, the experimenter
can submit more contextual data. Second,
machine-interpretable content is promoted, which
supports curation and analysis work. The expressive
power gained is twofold: the producer is encouraged to
include more of the experimental findings, and implicit
or previously unavailable data becomes discoverable
by consumers, who get the intended meaning.
The life cycle management of the records is important.
The experimentation and its publication together
with some updates on specific databases constitute
the first part of the activities in the lifetime of
the record. The biomedical community has been successful
in this part. However, the important part, which
has largely been overlooked, follows this first part and
ends when the record is deleted. This second part involves
validation, modification, and knowledge discovery
(for example, developing research hypotheses
in meta-analysis) operations. The weakness lies here,
as highlighted in several publications [19]. This study
addresses this part in order to make the results visible,
understandable, and usable.
MAdmf will require additional resources, but such
an effort will pay off in data-centric operations. We enforced
data management by organizing and structuring
data in a way that would improve the quality of microarray data
analysis. Data management must be built into the process
from the beginning to support information system
development. It is a knowledge-interoperable development
that allows domain experts to build or contribute
to a separate data layer, which can then be incorporated
into knowledge-based design [20]. For example, the
domain expert may create a SemNet to include the information
“P53 gene-related experiments that find
relevance to arsenite and apoptosis in breast cancer, as
verified by RT-PCR, published in a peer-reviewed journal,
with citation count >10, curated into a GDS record and
submitted to a specialized repository (such as GO or
a pathway database, e.g., Reactome [21]) in the last decade,”
provided that the metadata cards contain it.
We used tools from W3C resources in the development
of these products. The relevant concepts and
techniques are borrowed from the Semantic Web (SemWeb),
data management, structured reporting, electronic
business management, configuration management,
and metadata standards. We argue that shareable
metadata cards that are semantically enriched by
semantic nets can be a solution. The framework presented
in this study can be used in any high-throughput
repository as well as on third-party platforms.
MAdmc (Microarray Discovery Metadata
Card). MAdmc is a metadata card for a microarray
experiment. The metadata card is a stable concept
used for resource discovery. In our framework, it
facilitates not only visibility but also usability and
common understanding. With that goal in mind, we
extended the structure, organization, and syntax of the
MINiML file to produce MAdmc. The overall syntax
of MAdmc serves as a format layout for the content.
We propose the standardization of metadata in the
MINiML file by including DC elements and by introducing
the metadata card concept. The metadata card
has administrative, descriptive, structural, and semantic
elements. Dublin Core is a standard (ISO 15836) for
cross-domain resource description. The use of DC elements
in metadata definition also promotes structured
entry. Thus, it becomes easy to find and understand information
resources. The MINiML file seems to serve this
purpose, but its structure and content are not appropriate
to support this function. Structuring the records and
making structured entries for data elements within the
records are closely related and complementary paradigms.
Structured entry of values is enforced
by selecting a value from a controlled vocabulary or entering
a value dictated by a pattern in the schema file.
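The two kinds of structured entry can be illustrated with a minimal Python sketch. It is not the MAdmc schema itself: the vocabulary excerpt, the accession pattern, and the function names are assumptions made for illustration, and in the framework these checks would live in the schema file rather than in program code.

```python
import re
from datetime import date

# Small excerpt of ISO 639-1 language codes, used here as a controlled vocabulary.
LANGUAGE_CODES = {"en", "fr", "de", "tr"}

# Assumed pattern: a GEO Series accession is "GSE" followed by digits.
SERIES_ACCESSION = re.compile(r"GSE\d+")


def valid_language(value):
    """Accept only values drawn from the controlled vocabulary."""
    return value in LANGUAGE_CODES


def valid_accession(value):
    """Accept only values that match the accession pattern in full."""
    return SERIES_ACCESSION.fullmatch(value) is not None


def valid_date(value):
    """Accept only values that parse as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

A value such as `30/06/2009` is rejected by `valid_date`, whereas `2009-06-30` is accepted, which is the kind of uniformity that makes later batch queries over many records feasible.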
Microarray records carry more meaning when
analyzed in a batch and placed in a biological context.
Since the experimental settings, samples, methods,
tools, and formats differ widely, it is a challenging task
for microarray repositories to offer such an analysis in
an efficient manner. We introduced layers into the
organization of metadata elements and employed data
and syntax encoding schemes. Repeatability and structural
relationships between elements were defined. For
example, the title may be repeated (alternative title),
or the use of an element may depend on a condition
on another one. The life cycle management concept was
introduced with the use of versioning and modification
status information. Life cycle management covers
the period from submission until retirement,
thus bringing up the living record concept. It is implemented
through the relation element, which may take
the values ‘is version of,’ ‘replaces,’ or ‘part of.’
Thus, life cycle information becomes a part of the microarray data rather
than of the software code. Human or automated users
can modify, annotate, and verify a record several times
throughout its lifetime.
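How the relation element can carry life cycle information may be sketched as follows. This is only an illustration under stated assumptions: the record identifiers are made up, the relation values are those named in the text, and the status labels `current` and `superseded` are hypothetical.

```python
# Each card lists (relation, target) pairs; identifiers are invented for illustration.
RELATIONS = {
    "card-v1": [("part of", "super-series-1")],
    "card-v2": [("replaces", "card-v1"), ("is version of", "card-v1")],
    "card-v3": [("replaces", "card-v2"), ("is version of", "card-v1")],
}


def latest_version(record_id, relations):
    """Follow 'replaces' links forward until no newer record exists."""
    replaced_by = {
        target: source
        for source, links in relations.items()
        for relation, target in links
        if relation == "replaces"
    }
    seen = {record_id}
    while record_id in replaced_by:
        record_id = replaced_by[record_id]
        if record_id in seen:  # guard against a cyclic 'replaces' chain
            raise ValueError("cyclic 'replaces' chain")
        seen.add(record_id)
    return record_id


def status(record_id, relations):
    """Hypothetical status label derived from the relation data alone."""
    if latest_version(record_id, relations) == record_id:
        return "current"
    return "superseded"
```

Because the version chain is stored in the records themselves, any consumer can reconstruct it without access to the repository's software.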
We developed an XML application (the MAdmc program)
with which the user selects elements from the
MINiML document and adds new ones from the DC
Metadata Set, together with attributes from the Metadata Registry
standard, to create the MAdmc. The DC Metadata
Set includes 15 information elements. In MAdmc, we
added four new information elements (three in the Security
layer, one in the Format Specification layer) and detailed
each element with four attributes,
including an obligation category. We then organized
them into four layers, as shown in Table 2.
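A card built from DC elements with an obligation category can be sketched in Python as below. This is not the MAdmc program: the root element name `MAdmc`, the attribute name `obligation`, and the sample values are assumptions, and the four added elements and the layers of Table 2 are deliberately left out because their definitions are given in the schema file, not here.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_card(values, obligations):
    """Build a card; `values` maps a DC element name to a list of strings."""
    card = ET.Element("MAdmc")  # hypothetical root element name
    for name, contents in values.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for text in contents:  # a list allows repeated elements
            element = ET.SubElement(card, f"{{{DC_NS}}}{name}")
            # hypothetical attribute carrying the obligation category
            element.set("obligation", obligations.get(name, "Optional"))
            element.text = text
    return card


card = build_card(
    {"title": ["Example series"], "identifier": ["GSExxx"], "language": ["en"]},
    {"title": "Mandatory", "identifier": "Mandatory"},
)
card_xml = ET.tostring(card, encoding="unicode")
```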
The metadata card definition is given in detail in the
MAdmc.xsd file (Figure 1). The user can reference this
schema file to create his/her own instance document
(metadata card). The experimenter or curator can create
the MAdmc file by using the MINiML file and the MAdmc
program, as explained in the Case Study section.
The structure of MAdmc can also be extended by
employing associations among the tags. The associations
can be represented in EBNF (Extended Backus-Naur
Form) syntax and defined in the schema file, as
was the case for the structured messaging system at
NATO (North Atlantic Treaty Organization). For example,
an element may occur several times; information
elements such as the title, location, and organization
may have alternate contents; information elements are
labelled with a category such as ‘Mandatory,’
‘Optional,’ or ‘Conditional’; and the requirement or prohibition
of use under a condition (e.g., mutual exclusivity)
may be enforced. The rules are encoded in XPath
expressions [22]. Although this is an optional extension,
the topic could be revisited once the
metadata concept is recognized. The layers (segmentation), repetition,
and structural constraints in the mark-up tags can be
designed to enhance the structure and meaning of the
metadata card.
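The structural rules described above (occurrence limits, mutual exclusivity, and conditional use) can be checked with simple path lookups. The sketch below is an illustration in Python, not the rule set of the MAdmc schema: every element name and rule in it is hypothetical, and the paths use only the small XPath subset supported by the standard library.

```python
import xml.etree.ElementTree as ET

# (path, minimum occurrences, maximum occurrences or None for unbounded)
OCCURRENCE_RULES = [
    ("title", 1, None),     # mandatory and repeatable (alternative titles)
    ("identifier", 1, 1),   # mandatory, exactly once
    ("description", 0, 1),  # optional
]
# Hypothetical pair of elements that must not appear together.
EXCLUSIVE_PAIRS = [("embargo", "release")]
# Hypothetical conditional rule: the first element requires the second.
REQUIRES = [("version", "relation")]


def validate(card):
    """Return a list of rule violations; an empty list means the card passes."""
    errors = []
    for path, low, high in OCCURRENCE_RULES:
        count = len(card.findall(path))
        if count < low:
            errors.append(f"{path}: at least {low} expected, found {count}")
        if high is not None and count > high:
            errors.append(f"{path}: at most {high} expected, found {count}")
    for first, second in EXCLUSIVE_PAIRS:
        if card.find(first) is not None and card.find(second) is not None:
            errors.append(f"{first} and {second} are mutually exclusive")
    for element, companion in REQUIRES:
        if card.find(element) is not None and card.find(companion) is None:
            errors.append(f"{element} requires {companion}")
    return errors


good = ET.fromstring(
    "<card><title>Main</title><title>Alternative</title>"
    "<identifier>GSExxx</identifier></card>"
)
bad = ET.fromstring(
    "<card><identifier>a</identifier><identifier>b</identifier>"
    "<embargo/><release/><version>2</version></card>"
)
```

Keeping the rules as data rather than as code mirrors the intent of defining them in the schema file: they can be exchanged and revised without touching the program that applies them.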
Semantic Nets – Micro Formats. Different parts
of the metadata card can be detailed with SemNets.
Such work is analogous to that performed by domain
experts on the data layer in knowledge-based systems.
SemNets can be generated for each GEO
record, for a group of related records, or for the whole repository,
depending on the contextual requirements.
The SemNets accompany their related metadata cards
and they can all be integrated into a related RDF store.
The RDF store can be coupled with any platform and
can then be used for ontology development, database
modeling, and for any semantic task.
Data and syntax encoding schemes are used for
information elements such as experimenters, address,
description, and summary. The data encoding schemes
could be controlled vocabularies [e.g., code lists (ISO
3166 country codes), classifications (ICD), subject
headings (MeSH)], formal notations such as ISO
8601 (date and time) or ISO 639 (language), or the use
of a specific namespace. Friend of a Friend (FOAF)
and Rule Mark-up Language (RuleML) syntaxes are
used for encoding relevant data into SemNets.
FOAF is a SemWeb vocabulary that describes relationships
among people in RDF, forming an ontology in its
own right [23]. RuleML is a mark-up language for publishing
and sharing rule bases. It is based on a deductive
reasoning engine, and its statements can be embedded
in knowledge-based systems [24]. In this study, the experimenter
and the summary parts are extended with SemNets in
accordance with the relevant syntax to add meaning and to
build semantic expressiveness. The experimenters
are modeled using FOAF syntax, and the
result part is modeled using RuleML Datalog syntax.
Online tools in the public domain, as suggested by
W3C, are used in the development of the SemNets.
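An experimenter net in FOAF can be sketched as a small RDF/XML document. The Python below is a minimal illustration and not the SemNet of the case study: the people, the mailbox, and the helper name `add_person` are invented, while the namespaces and the `foaf:Person`, `foaf:name`, `foaf:mbox`, and `foaf:knows` terms come from the RDF and FOAF vocabularies.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("foaf", FOAF_NS)


def add_person(graph, node_id, name, mbox=None, knows=()):
    """Append one foaf:Person, identified by a blank-node label, to the graph."""
    person = ET.SubElement(
        graph, f"{{{FOAF_NS}}}Person", {f"{{{RDF_NS}}}nodeID": node_id}
    )
    ET.SubElement(person, f"{{{FOAF_NS}}}name").text = name
    if mbox:
        ET.SubElement(
            person, f"{{{FOAF_NS}}}mbox",
            {f"{{{RDF_NS}}}resource": f"mailto:{mbox}"},
        )
    for other in knows:  # link to another person by blank-node label
        ET.SubElement(
            person, f"{{{FOAF_NS}}}knows", {f"{{{RDF_NS}}}nodeID": other}
        )
    return person


graph = ET.Element(f"{{{RDF_NS}}}RDF")
add_person(graph, "submitter", "Example Submitter",
           mbox="submitter@example.org", knows=["contributor"])
add_person(graph, "contributor", "Example Contributor")
experimenter_net = ET.tostring(graph, encoding="unicode")
```

The resulting string can be pasted into an RDF validator to obtain the triples and a graph view of the experimenter relationships.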
The human concept in the microarray record
should be structured. There are types such as human and
automated; categories such as scheduled and unscheduled;
statuses such as novel and experienced; roles such as
producer and consumer; and actors such as submitter, contact,
contributor, author of publication, publisher, curator,
funding agency representative, government official,
meta-analyst, verifier, system developer, reviewer, etc.
Such a detailed definition may hold valuable information
for a potential consumer. Data sets are at different
maturity levels in terms of structure and content. One person’s
data may be labeled as metadata or information by
someone else, and today’s information may become
data later in its lifetime. An experimenter may
need to search on the human element to make
decisions about experiment design. There are mature
formats such as hCard [25], vCard [26], or W3C’s
PIM (Personal Information Management) [27] that can be used to include
this information in the FOAF model, forming a
coalition of complementary vocabularies.
The summary information has been a frequently
accessed area. This portion of the microarray record
should also have a machine-understandable structure
and content. For that reason, we employed an encoding
process for the statements to create a SemNet. We
included free-text statements, the encoded format, and
annotations, all in RDF notation. More and more data
are stored in RDF format today to create linked data.
The RDF files can be integrated into a persistent
RDF store to form connected graphs.
In our study, the properties and relationships of information
resources are described within RDF graphs for SemNets
[experimenter net (in FOAF) and result net (in
RuleML Datalog)]. These are associated
with each MAdmc record, or with a group of related records,
according to the specific knowledge that is represented.
Thus, Experimenter and Result SemNets can be
packaged with metadata cards while ontology use remains in
place. SemNets are data models that are easy to create
for specific domain information and that can support
both ontology development and database design. Ontology
extensions can subsequently be built from these
SemNets. For example, describing a person in an ontology
may eventually converge on a FOAF model. A new
vocabulary and ontology extension can be generated
from the RDF resources. The RDF triples for information
objects may become instances of existing Web Ontology
Language (OWL) classes, or they may trigger
the creation of new classes for specific concepts.
Ontology terms should be used as the tokens
in a SemNet. Ontology is used for annotation, whereas
we encode data and metadata with syntax encoding schemes in
SemNets.
There is a proliferation of ontologies, and there
are interoperability problems among them. The Ontology
for Biomedical Investigations (OBI) standardization
initiative focuses on upper ontology development,
whereas lower-level ontology remains in the realm of
domain-specific ontologies such as the MGED Ontology.
An ontology is a conceptual model that may not map to
physical data sources, whereas a SemNet does. A semantic
net can serve as a basis for bottom-up ontology
development. An ontology is monotonic, in that new statements
should not falsify previous conclusions [28].
In microarray experiments, there are conflicting
results as well as supporting ones, and SemNets
may include such non-monotonic statements.
Queries. Some frequently asked queries can be
materialized in SPARQL within the framework and
posted to a shared registry. SPARQL is similar to
Structured Query Language (SQL) and is the de facto
standard RDF query language. Answers to
queries whose results are currently difficult to
obtain, such as the following,
become possible when MAdmf is employed: 1) list
submitters who have worked on the effect of
Tamoxifen on breast cancer in humans within X organization
and whose records have been curated to GDS; 2)
list breast cancer records that have been published in
SCI journals with citation numbers >10, verified,
and included in special databases; 3) list
all facts and hypotheses from records related to the
P53 gene between 2000 and 2009; 4) list the versions,
states (modified, retired, etc.), types (comparative, collaborative,
validation, etc.), and modification details
of BRCA1- and BRCA2-related records; 5) list super
GSE records and their child records that are related to
experimentation on the ATM gene that finds relevance to
apoptosis in breast cancer by submitters from the USA in
the last decade. The metadata card and SemNets can
hold the data to answer these questions in a knowledge
representation format. One sample query and its result
are demonstrated in the Case Study section.
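The shape of such a query can be illustrated without an RDF store. The Python sketch below matches a list of triple patterns against a small set of triples, which is what the WHERE clause of a SPARQL query does for a basic graph pattern. The triples, the predicate names, and the `match` helper are all invented for illustration; the actual sample query is the one given in the Case Study section.

```python
# Invented triples: (subject, predicate, object).
TRIPLES = [
    ("rec1", "subject", "breast cancer"),
    ("rec1", "submitter", "alice"),
    ("rec1", "curatedInto", "gds1"),
    ("rec2", "subject", "breast cancer"),
    ("rec2", "submitter", "bob"),
    ("alice", "memberOf", "orgX"),
    ("bob", "memberOf", "orgX"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    break  # variable already bound to a different value
            elif term != value:
                break  # constant term does not match
        else:
            yield from match(rest, triples, candidate)


# Submitters in orgX whose breast cancer records were curated into a GDS.
QUERY = [
    ("?rec", "subject", "breast cancer"),
    ("?rec", "curatedInto", "?gds"),
    ("?rec", "submitter", "?who"),
    ("?who", "memberOf", "orgX"),
]
submitters = sorted({b["?who"] for b in match(QUERY, TRIPLES)})
```

With the invented data, only the submitter of the curated record satisfies all four patterns, so `submitters` contains a single name.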
MAdmr (Microarray Discovery Metadata Registry).
MAdmr will be the key element for enforcing a data
strategy by facilitating visibility, usability, and understandability
of data assets. The submission package to
this ebXML (Electronic Business using XML) based
shared space may include the MAdmc, SemNet, schema
file, query file, and a guidance document (Figure 2).
Either GEO or another repository can act as MAdmr. A
federated system of microarray repositories can also
assume the metadata registry role to host microarray discovery
data.
Different users (such as a submitter, reviewer, or
web services program) can subscribe to such a registry,
and producer(s) can make modifications and create
new versions on the metadata registry throughout the lifetime
of the microarray records, until retirement.
The Case Study. The GEO records (Series, Platform,
and Sample) and contact data were downloaded,
stored in an OpenOffice Base database, and
examined with a domain specialist in terms of structure
and semantics. For the case study, we accessed 677 breast cancer
experiment results (677 GSE records, 89 GDS records)
among more than 22,000 Series records.
We developed the metadata card by using our MAdmc
program (Figure 3).
Then, two sets of SemNets were created per
record(s) using the RDF editor Protégé [29] and the online W3C
XML Schema Validation [30] and RDF Validation tools
[31]. SemNets (RDF graphs) in Protégé are queried
using SPARQL. The first SemNet was for experimenters
in FOAF/RDF (not included here for brevity), and the
second one was for the result section (Tables 3 and
4). Note that the examples of these SemNets are given
as proof of concept only. Two statements encoded
in RuleML Datalog (clausal first-order logic) are
given in Table 3.
We show an entry-level encoding in Table 3 to give
an insight. The encoding could have gone further with
deeper mark-up, as demonstrated in Table 3, a.2. The
statements could have been further categorized, for example as
experimental, statistical, or computational, and their status
could be labeled as verified, challenged, withdrawn,
or modified. The goal is to highlight the elements of
MAdmf; we do not claim to present the optimal
representation. We demonstrate here that the results can
be formatted in a syntax encoding scheme such as RuleML
Datalog. This structured set of statements can then be
shared and processed by automated means.
The individual statements for each of these 677
breast cancer GEO records can form a semantic net
that is associated with the relevant MAdmc. There may
also be global statements about meaningful findings for
a specific sub-group of records or for all breast cancer
records. SemNets can be expressed in different representations,
such as triple notation and graph diagrams as well as
RDF/XML format. We include three elements in this
encoding of the SemNet: the original statements, the
encoded format, and annotations. The annotation part
of this package provides contextual information and
may indicate whether: 1) there is a related publication; 2)
the results are posted somewhere else, such as GO or a
pathway database; 3) there are other versions; 4) it is
a fact or a hypothesis; 5) it is verified or challenged.
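The three-part package (original statement, encoded format, annotations) can be pictured as a simple record. The Python sketch below is an assumption-laden illustration: the class name, the annotation keys, and both example statements are invented and do not come from any GEO record.

```python
from dataclasses import dataclass, field


@dataclass
class ResultStatement:
    original: str   # free-text statement as written by the experimenter
    encoded: tuple  # encoded form, here a Datalog-style atom
    annotations: dict = field(default_factory=dict)  # contextual information


def select(statements, **required):
    """Return the statements whose annotations contain all required values."""
    return [
        s for s in statements
        if all(s.annotations.get(key) == value for key, value in required.items())
    ]


# Invented examples for illustration only.
STATEMENTS = [
    ResultStatement(
        "Gene A is up-regulated in treated samples.",
        ("upregulated", "gene_a", "treated"),
        {"kind": "fact", "status": "verified", "has_publication": True},
    ),
    ResultStatement(
        "Gene B may mediate the response.",
        ("mediates", "gene_b", "response"),
        {"kind": "hypothesis", "status": "challenged", "has_publication": False},
    ),
]

verified_facts = select(STATEMENTS, kind="fact", status="verified")
```

Keeping the free text next to its encoding lets a curator check the encoding against the source sentence, while the annotations drive filtering.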
Relevant namespace declarations such as “MAdmc”
can be included in the MAdmc schema file to support
the additional definitions (Table 4). A sample Result
SemNet is given in RDF/XML format in Table 4, and
its graphical output from the RDF Validator is given in
Figure 4.
There may be a different level of encoding for
each record based on the availability of relevant information.
We recommend entry-level encoding at the
beginning; as acceptance and experience grow, the
encoding may become more sophisticated. There are platforms
such as jDREW [32] for RuleML Datalog in that
direction. We not only encode and represent the free-text
result section but also open the way for triggering
derivations from an already stored rule base. In fact,
this is the job of a rule-based system; we only demonstrate
the capability. Rules can extend OWL, as provided for
in the Semantic Web architecture. In that regard, for
example, SWRL (Semantic Web Rule Language) combines
RuleML (Horn-like rules) with OWL (axioms)
[33], and the RIF (Rule Interchange Format) mechanism
allows different representations to be grouped for
further use [34]. The metadata card and SemNets can
also be queried using the online SPARQL tool [35].
The query file in Figure 5 can be attached to the related
SemNet file.
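What triggering derivations from a stored rule base means can be shown with a naive forward-chaining sketch over Datalog-style facts. This Python code is an illustration only and is unrelated to jDREW or to the encoding in Table 3: the predicates, the facts, and the rule are invented, and a real rule engine would be far more efficient.

```python
def is_var(term):
    """Datalog convention assumed here: variables start with an uppercase letter."""
    return term[:1].isupper()


def unify(pattern, fact, binding):
    """Extend `binding` so that `pattern` equals `fact`, or return None."""
    if len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding


def solve(body, facts, binding):
    """Yield every binding that satisfies all atoms in the rule body."""
    if not body:
        yield binding
        return
    for fact in facts:
        extended = unify(body[0], fact, binding)
        if extended is not None:
            yield from solve(body[1:], facts, extended)


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Materialize the bindings first so the set is not changed mid-iteration.
            for binding in list(solve(body, facts, {})):
                derived = tuple(binding.get(term, term) for term in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# Invented facts and rule:
# responds_to(Gene, Agent) :- upregulated(Gene, Sample), treated_with(Sample, Agent).
FACTS = {
    ("upregulated", "gene_a", "sample_1"),
    ("treated_with", "sample_1", "agent_x"),
}
RULES = [
    (("responds_to", "Gene", "Agent"),
     [("upregulated", "Gene", "Sample"), ("treated_with", "Sample", "Agent")]),
]
closure = forward_chain(FACTS, RULES)
```

The derived atom `("responds_to", "gene_a", "agent_x")` was never stated in any record; it follows from two encoded statements and one stored rule, which is the kind of derivation the encoded result section makes possible.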