
METADATA MANAGEMENT AND SEMANTICS IN MICROARRAY REPOSITORIES
Kocabaş F1,2,*, Can T3, Baykal N1
*Corresponding Author: Fahri Kocabaş, NATO HQ C3S, Blvd Leopold III B, 1110 Brussels, Belgium;
Tel.: +32-2-707-5533; Fax: +32-2-707-5834; E-mail: fahri@ii.metu.edu.tr; f.kocabas@hq.nato.int
page: 49
RESULTS AND DISCUSSION
The volume of microarray data is rising. The
challenge is whether we can give this information
space meaning, as well as structure and syntax, so that
it can be processed by automated means.
The summary parts of the records in microarray
repositories are not synchronized with the related
publications and are not appropriately structured. They
are in free-text format. The statements are usually
incomplete and ambiguous, and thus not easily comparable
with those of similar studies. The results should be
visible, understandable, and usable throughout their
life cycles; this is an information management principle.
Once we structure (MAdmc) and encode (SemNet) the
contextual data, operations such as discovery and
exchange become feasible, and hidden and previously
unavailable facts may also be extracted from the
structured and encoded data sets. In addition to
annotation via an ontology within a SemNet, the
structured entry paradigm can also be enforced.
A search of MAdmr (MAdmc and SemNets) for
domain-specific information will be more efficient than
a search of GEO as it stands at present. This is
comparable to sorting data first so that it can be
searched efficiently afterwards: the data are linked
once their resources, properties, and relationships
have been identified. MAdmf introduces an overhead,
but future benefits will justify this start-up cost.
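The sorting analogy can be made concrete with a minimal sketch. It assumes a hypothetical list of accession-like strings; the function names and sample identifiers are illustrative only and are not part of MAdmf or GEO. An unprepared list has to be scanned end to end, whereas a list that was sorted once (the start-up cost) can be probed by binary search.

```python
from bisect import bisect_left


def linear_find(records, key):
    """Scan every record until the key is found (no preparation needed)."""
    for position, value in enumerate(records):
        if value == key:
            return position
    return -1


def prepare(records):
    """One-off start-up cost: sort the records so they can be searched quickly."""
    return sorted(records)


def sorted_find(sorted_records, key):
    """Binary search over records that were sorted beforehand."""
    position = bisect_left(sorted_records, key)
    if position < len(sorted_records) and sorted_records[position] == key:
        return position
    return -1


# Hypothetical identifiers, used only for illustration.
raw = ["GSE300", "GSE120", "GSE210", "GSE050"]
index = prepare(raw)
```

The linear scan needs a number of comparisons proportional to the number of records, while the binary search needs a number proportional to its logarithm. The preparation step is paid once and amortized over all later searches.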
Data can best be described in a structured manner
in a database, but the microarray information
space includes several microarray repositories,
experimenter web sites, publications, and specialized
databases. In practice, they cannot all be stored in a
single database or easily be federated. If all parties
agreed to use the MAGE-OM object model and the MAGE-ML
exchange platform, there would be no format, exchange,
or integration issues. This is unlikely, however, and
there will always be different implementations
that cause exchange and interoperability
problems. Note that metadata cards and semantic nets
can also be used in a MAGE-OM/MAGE-ML based
repository.
The microarray domain includes semi-structured
data, which can best be managed with
SemWeb technology. SemWeb emphasizes the use of
metadata standards and connected data to support
data-centric operations. The proposed framework, MAdmf,
follows the SemWeb paradigm. The microarray community
should adopt such a data-centric approach because
its operations are data intensive. Data management is
the vehicle for data-centric initiatives, and an IT system
is only as strong as its data management. In future-proof
applications, the data layer is built separately from
the business logic layer. MAdmf relates to the data layer:
it promotes data standardization in microarray
repositories. Any modelling or application development
effort can then follow from its use.
We examined the MINiML file and introduced
an extended format for a metadata card in this study.
We created domain-specific SemNets and proposed
posting them to an ebXML-based metadata registry, which
provides a shared information space. Thus, in the proposed
framework: 1) the producer can add structured
data and the consumer can get the conveyed meaning
(what has been received is limited to what has been
understood); 2) because more automation is possible,
the backlog in curation work is reduced (from submitted
records to GEO Series, from GEO Series to GEO
Datasets, or from GEO Datasets to ArrayExpress records);
3) ambiguity and redundancy are reduced through a standard
format and additional semantics; 4) a data-centric approach
is adopted, and the quality and expressiveness
of data are promoted, with the data layer kept separate
from the business logic; 5) consumers reach data that are
otherwise unavailable (new entries in the descriptive
information and the semantic layer); 6) the concept of
life cycle management (lifetime modification and living
data sets) is introduced; 7) visibility, understandability,
and usability are enforced; 8) users can use W3C and
public-domain tools to extract data; 9) controlled
vocabularies (Countries, Date/Time Group, Names)
are used not only to annotate but also to encode the
metadata and data; 10) the produced metadata card and
its associated SemNet(s) are extendable, integrable,
queryable, and exchangeable; and 11) microarray records
and subsequent entries (publications, specialized databases)
can be synchronized.
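Point 9 above, encoding values against controlled vocabularies, can be illustrated with a minimal sketch. It assumes a three-entry excerpt of the ISO 3166-1 alpha-2 country codes and ISO 8601 dates as the target encodings; the actual vocabularies and encoding schemes used in MAdmc are not reproduced here, and all function and field names are hypothetical.

```python
from datetime import datetime

# Assumption: a tiny excerpt of a country vocabulary (ISO 3166-1 alpha-2).
COUNTRY_CODES = {"turkey": "TR", "belgium": "BE", "united states": "US"}


def encode_country(free_text):
    """Map a free-text country name to its controlled-vocabulary code."""
    key = free_text.strip().lower()
    if key not in COUNTRY_CODES:
        raise ValueError(f"country not in controlled vocabulary: {free_text!r}")
    return COUNTRY_CODES[key]


def encode_date(free_text, input_format="%d/%m/%Y"):
    """Re-encode a day/month/year date string as an ISO 8601 date."""
    return datetime.strptime(free_text.strip(), input_format).date().isoformat()


def encode_entry(entry):
    """Encode the hypothetical 'country' and 'submitted' fields of one entry."""
    return {
        "country": encode_country(entry["country"]),
        "submitted": encode_date(entry["submitted"]),
    }
```

For example, `encode_entry({"country": " Belgium ", "submitted": "05/03/2010"})` yields a country code of `BE` and a date of `2010-03-05`. A value outside the vocabulary is rejected instead of being stored as ambiguous free text.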
The extension of the MINiML file has three aspects.
First, the content of the summary and experimenter
parts is described in more detail. Second, the format
is given concrete form through the use of data and
syntax encoding schemes, and the organization and
structure are improved with the
introduction of layers, additional metadata elements,
and attributes. Third, the process is extended with
new concepts such as life cycle management, metadata
registry use, and structured entry. In this manner,
the MINiML file has been transformed into a metadata
card, and its semantics are extended with SemNets.
Both can then be used in any similar data center.
The proposed framework provides a foundation for
linking people, experiment, and result data.
Thus, for example, a meta-analyst can obtain a consolidated
summary of the result part of all breast cancer
data sets with a SPARQL query. The originator, the
curator, developers, and other experimenters may
benefit from this framework. We give the specification
and present the key products in a case study that
serves as a proof of concept.
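The kind of query mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the study's SemNet schema: the predicate names (`disease`, `resultSummary`), the `ex:` namespace, the data set identifiers, and the placeholder result strings are all hypothetical. A small in-memory list of triples stands in for an RDF store, and two helper functions stand in for the two triple patterns of the SPARQL query shown in the string.

```python
# Hypothetical SPARQL query that the Python functions below imitate.
EQUIVALENT_SPARQL = """
PREFIX ex: <http://example.org/madmf#>
SELECT ?dataset ?result WHERE {
  ?dataset ex:disease "breast cancer" .
  ?dataset ex:resultSummary ?result .
}
"""

# Made-up (subject, predicate, object) triples; not real repository data.
TRIPLES = [
    ("ds1", "disease", "breast cancer"),
    ("ds1", "resultSummary", "placeholder result 1"),
    ("ds2", "disease", "lung cancer"),
    ("ds2", "resultSummary", "placeholder result 2"),
    ("ds3", "disease", "breast cancer"),
    ("ds3", "resultSummary", "placeholder result 3"),
]


def subjects(triples, predicate, obj):
    """All subjects that have the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All objects attached to the given subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def consolidated_results(triples, disease):
    """Collect the result summaries of every data set about one disease."""
    return {
        dataset: objects(triples, dataset, "resultSummary")
        for dataset in subjects(triples, "disease", disease)
    }
```

Calling `consolidated_results(TRIPLES, "breast cancer")` gathers the result entries of `ds1` and `ds3` only. In a real deployment the same join would be expressed in SPARQL and evaluated by an RDF store.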
MAGE-ML and MINiML appear to be alternative
structures, but in reality they are not. MINiML
is an intermediary data structure, whereas MAGE-ML
is a structure on which an application can be developed.
MAdmc and SemNet are two distinct and complementary
contributions that support MINiML on its way towards
a format and exchange standard. They do not replace
any existing work. However, if adopted, they can be
a focus for discovery, integration, and exchange.
SemNets can also be created for other parts of a microarray
record, in addition to the experimenter and summary
data. Note also that this study can easily be adapted
to other microarray repositories or high-throughput
repositories.
In recent years, the number of records at GEO has
increased by up to 3% per month. For varying reasons,
there is a backlog of up to 20% in Series records.
There is also a serious backlog of 80% in the Dataset
transformation (GSE to GDS) tasks performed by GEO
curators. This backlog is likely to grow because the
amount of data and its complexity are on the rise (Table 5).
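To put the monthly growth figure in perspective, the short calculation below compounds it over a year and derives a doubling time. It is a back-of-the-envelope sketch that assumes the 3% upper bound holds as a constant rate every month, which the text does not claim; the function names are illustrative.

```python
import math


def annual_growth(monthly_rate):
    """Growth over 12 months when the monthly rate compounds."""
    return (1 + monthly_rate) ** 12 - 1


def doubling_time_months(monthly_rate):
    """Months needed for the record count to double at a constant monthly rate."""
    return math.log(2) / math.log(1 + monthly_rate)


yearly = annual_growth(0.03)            # about 0.43, i.e. roughly 43% per year
doubling = doubling_time_months(0.03)   # between 23 and 24 months
```

Under this assumption, the repository would grow by roughly 43% per year and double in a little under two years, which is consistent with the expectation that a manual curation backlog will grow.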
An RDF-enabled database that provides both reasoning
and ontology modelling capabilities may consume
the metadata card and SemNets. Another consumer could
be a semantic platform that connects heterogeneous
data contained in microarray repositories and related
publications. One can combine people, location,
organization, and date information with experimentation
results across the microarray information space to
formulate complex inquiries over SemNets and metadata
cards. Moreover, this mode of operating on data
facilitates the development of knowledge-interoperable
systems with a separate data layer.
Equally, rule-based systems can make use of the summary
portion of a microarray record once it is structured
and encoded.
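How a rule-based system might work on a structured summary can be sketched as follows. The field names (`organism`, `results`, `country`) and the three rules are hypothetical and serve only to show the mechanism; they are not taken from the MAdmc specification.

```python
# Each rule is a (name, condition) pair; the condition receives the
# structured summary as a dict and returns True when the rule fires.
RULES = [
    ("missing_organism", lambda s: not s.get("organism")),
    ("no_result_statement", lambda s: not s.get("results")),
    # Assumes countries are encoded as two-letter codes.
    ("unencoded_country", lambda s: len(s.get("country", "")) != 2),
]


def apply_rules(summary, rules=RULES):
    """Return the names of all rules whose condition holds for the summary."""
    return [name for name, condition in rules if condition(summary)]
```

A complete, encoded summary such as `{"organism": "Homo sapiens", "results": ["x"], "country": "TR"}` fires no rule, while a summary with an empty organism, no results, and a free-text country fires all three. Checks of this kind are not possible on a free-text summary without first parsing it.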
Standardization studies like this one, which promote
machine understandability and semantic interoperability,
are required. This study not only brings the metadata
card and semantic net concepts into a format
standard approach but also highlights the importance
of the life cycle management, data management, and
structured entry concepts. Such a study will be beneficial,
especially for producers, curators, future experimenters,
and system developers, whether they employ
manual or automated means. The experimental data,
encoded formats, and program can be requested from
the corresponding author.