METADATA MANAGEMENT AND SEMANTICS IN MICROARRAY REPOSITORIES
Kocabaş F1,2,*, Can T3, Baykal N1
*Corresponding Author: Fahri Kocabaş, NATO HQ C3S, Blvd Leopold III B, 1110 Brussels, Belgium;
Tel.: +32-2-707-5533; Fax: +32-2-707-5834; E-mail: FK:fahri@ii.metu.edu.tr; f.kocabas@hq.nato.int
page: 49
INTRODUCTION
The volume of experimental data held in microarray
repositories becomes unmanageable as the number
and size of submissions grow. Annotations
and metadata added to microarray records increase
their content further. However, these contextual data
are not appropriately structured and do not conform
to defined standards. The biomedical community has
an interest in interpreting the results of investigations
in which microarrays are used, yet there are serious
backlogs and data cannot be exchanged between the
repositories.
Several standardization initiatives in the microarray
community have progressed. For example,
MIAME (Minimum Information About a Microarray
Experiment) focuses on content [1]. Others include:
minimum dataset checklist, MIBBI (Minimum Information
for Biological and Biomedical Investigations);
object model, MAGE OM (Microarray Gene Expression
Object Model); exchange platform, MAGE-ML
(Microarray Gene Expression Mark-up Language);
ontology, MGED (Microarray Gene Expression Data)
Ontology [2]. These initiatives and their developments
have been presented in review articles [3]. The three
primary microarray repositories are: NCBI GEO (National
Center for Biotechnology Information Gene Expression
Omnibus) [4], EBI (European Bioinformatics
Institute) ArrayExpress [5], and CIBEX (Center for
Information Biology Gene Expression Database) [6].
Microarray repositories not only host the experimental
data but also provide tools for querying and
analyzing microarray records. Public-domain software
has been developed to extend the functionality of the
GEO repository, such as GEOmetadb [8] on the
BioConductor platform [7], and to implement MAGE OM,
such as the Sequence Analysis and Management System
(SAMS) [9]. However, it is difficult for laboratories
with limited bioinformatics support to implement these
applications. Thus, exchange and common understanding
of data among disparate repositories continue to
be an issue, even though mediating software
is available [10]. The MINiML (MIAME Notation
in Mark-up Language) and MAGE-TAB (Microarray
Gene Expression Tabular) formats, which were developed to
address these problems [11], lack standard
syntax and semantics. The solution lies in standards
and can be provided through data management discipline
using architectural frameworks.
The GEO repository has been selected for this
study. We detected the following flawed and ambiguous
entries on GEO records. (1) Inconsistent, incomplete,
and incorrect entries for the same information
element. For example, there are seven different spellings
(United States of America, United States, USA,
US, U.S., U.S.A., U.S.A) in address data for the country
name ‘USA’. There are city names in the country
field. There are different patterns for the names of the
same person, organization and date. (2) Three versions
of the MINiML file exist for the same Series record, each
with different content: i) the MINiML format of the
HTML Series record, ii) the MINiML_family link within
the HTML Series record, and iii) programmatically
extracted Series data for the whole database. For example,
one of the contributors is missing from Series record
GSE362 in version i), and the Summary, PubMed ID, and
Overall Design fields are not available in
version iii). (3) Related experiments (super Series and sub
Series records) are not visible. A super Series record
includes individually submitted subset records, all of
which belong to one experiment. Since some Series
records about an experiment are submitted separately
without stating if they are related, it is difficult to trace
records for such an experiment. For example, Vijay G.
Sankaran submitted three Series records (GSE13283,
GSE13284, and GSE13285) on 5 December 2008,
which did not appear to be part of a single experiment.
However, they prove to be connected: GSE13285 is a
super Series record that includes the subset Series
GSE13283 and GSE13284. (4)
The MIAME guideline [1] that the summary of
a microarray experiment record and the abstract of its
publication should be the same is not followed. For
example, the summaries of GSE3570 and GSE15808 differ
from the abstracts of their publications.
This is a data integrity issue. Similarly, GSE5546 was
submitted to GEO in 2006 and still has no citation
information, although its related publication appeared in
2008 (PMID 18271932).
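The inconsistent country spellings noted under (1) illustrate how a small controlled vocabulary could catch such entries at submission time. The following is a minimal sketch, not part of GEO or of the framework reported here: the table COUNTRY_VARIANTS and the function normalize_country are hypothetical names, and the vocabulary lists only the seven spellings quoted above.

```python
import re

# Hypothetical controlled vocabulary: canonical name -> spellings seen in records.
# Only the seven variants quoted in the text are listed.
COUNTRY_VARIANTS = {
    "USA": [
        "United States of America", "United States", "USA",
        "US", "U.S.", "U.S.A.", "U.S.A",
    ],
}


def _key(value: str) -> str:
    """Reduce a spelling to a comparison key: lowercase ASCII letters only."""
    return re.sub(r"[^a-z]", "", value.lower())


# Reverse lookup table, built once from the vocabulary.
_LOOKUP = {
    _key(variant): canonical
    for canonical, variants in COUNTRY_VARIANTS.items()
    for variant in variants
}


def normalize_country(value: str):
    """Return the canonical country name, or None if the value is not recognized."""
    return _LOOKUP.get(_key(value))
```

A value that returns None, such as a city name entered in the country field, would be flagged for correction rather than stored as submitted.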
Several areas of GEO data management have room for
improvement. The microarray
repositories are not connected, so records
held in one repository are not visible from another.
MIAME is a content standard that lists the minimum
content without format guidance. The type, content,
format, and availability of data and metadata vary
between repositories. Therefore,
the regular exchange of data that occurs among DNA
repositories does not take place. The ArrayExpress
staff import GEO records (approximately
10% of GEO records) on a weekly basis.
However, the two are not synchronized: if a record
in GEO is updated, the change is not automatically reflected
in the corresponding ArrayExpress entry [12].
The metadata about the records are not structured
in accordance with the DC (Dublin Core) metadata
standard [13]. There are entry anomalies, inconsistent
terminology and even incorrect entries within metadata,
e.g., in contact information (names, organizations,
country names, date) or in the summary. This can be
handled with structured data entry based on a
controlled vocabulary and ontology. Mandatory patterns
could also be included in a relevant schema file,
as tested in OpenSDE projects [14]. With a structured
format, the experimenter could enter more of the
experimental findings, including metadata on contributors,
experiment settings, biomaterials, and data analyses,
and especially the result/summary section.
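Structured entry of this kind can be illustrated with a short sketch that renames submission fields to Dublin Core element names and checks mandatory patterns. The element names (title, creator, contributor, description, date, identifier) are Dublin Core elements; everything else is an assumption made for illustration: the submission field names, the choice of patterns, and the functions to_dublin_core and validate are hypothetical and do not come from GEO, OpenSDE, or the framework reported here.

```python
import re

# Hypothetical submission field -> Dublin Core element name.
DC_MAPPING = {
    "series_title": "title",
    "submitter": "creator",
    "contributors": "contributor",
    "summary": "description",
    "submission_date": "date",
    "accession": "identifier",
}

# Illustrative mandatory patterns for two of the elements.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO 8601 calendar date
    "identifier": re.compile(r"GSE\d+"),       # GEO Series accession
}


def to_dublin_core(record: dict) -> dict:
    """Rename known submission fields to their Dublin Core elements."""
    return {DC_MAPPING[k]: v for k, v in record.items() if k in DC_MAPPING}


def validate(dc_record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for element in DC_MAPPING.values():
        if not dc_record.get(element):
            problems.append(f"missing element: {element}")
    for element, pattern in PATTERNS.items():
        value = dc_record.get(element)
        if value and not pattern.fullmatch(value):
            problems.append(f"bad format for {element}: {value!r}")
    return problems
```

In this sketch a free-text date such as "5 December 2008" would be rejected until it is entered as 2008-12-05, which is the kind of pattern a schema file could mandate.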
The quality and state of a record are not clearly
labeled at submission or throughout its lifetime.
Quality metrics (values such as “verified” and “citation
>10”) and states (values such as “incomplete” or “retired”)
can add important meaning to the records. For
example, some experiments are published in a high-citation
publication, performed by respected scientists,
verified with RT-PCR (real-time polymerase
chain reaction), and repeated with success. Conversely, a
record may be identified as a poor study if it is contradicted
by experiments of high quality. There are also
comparability issues between different platforms, as
pointed out by the MAQC (MicroArray Quality Control)
project [15].
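The quality metrics and states described above could be carried on each record as explicit labels. The sketch below is one possible shape, offered as an assumption rather than as the design used in this work: the class names State and RecordLabel are hypothetical, the "complete" state is added for illustration, and only the values “verified”, “citation >10”, “incomplete” and “retired” are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # "incomplete" and "retired" appear in the text; "complete" is assumed.
    INCOMPLETE = "incomplete"
    COMPLETE = "complete"
    RETIRED = "retired"


@dataclass
class RecordLabel:
    accession: str
    state: State = State.INCOMPLETE
    verified: bool = False   # e.g. confirmed with RT-PCR
    citations: int = 0       # citations of the related publication

    def quality_tags(self) -> list:
        """Return the quality metric values that apply to this record."""
        tags = []
        if self.verified:
            tags.append("verified")
        if self.citations > 10:
            tags.append("citation >10")
        return tags
```

Keeping the state separate from the quality tags lets a record move through its life cycle (for example from incomplete to retired) without losing the evidence on which its quality was judged.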
Microarray records, related publications, and relevant
data fed into databases such as gene and biological
pathways should be consistent. The microarray
repository should be the reference for other platforms.
Semantics are not addressed in the design of microarray
repositories. Thus, understandability and usability
are weak, and life cycle management, including
version and change management, is not available.
More automation would address the slow curation
work and the growing backlog.
For example, GEO is experiencing a significant backlog
in the creation of curated Datasets (GEO Data Set: GDS),
and most of the submitted Series records (GEO
Series: GSE) do not have a corresponding Dataset.
Analysis tools operate on GDS records. At present,
there are about 2721 GDS records and 22677 Series
records (two GSE in one GDS on average). There are
more than 15,000 GSE records yet to be curated. This
amounts to an 80% backlog. Also, 20% of submitted
Series records have not yet been published due to ongoing
curation work. The number of GDS records has
been unchanged since last year.
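The backlog figures above can be checked with a short calculation. The sketch below uses only the counts given in the text and assumes that every GDS covers exactly two GSE records, which the text states only as an average, so the result is an estimate; the function name backlog is hypothetical.

```python
GDS_RECORDS = 2721    # curated Datasets, from the text
GSE_RECORDS = 22677   # submitted Series, from the text
GSE_PER_GDS = 2       # stated in the text as an average; treated here as exact


def backlog(gds: int, gse: int, gse_per_gds: int):
    """Return (uncurated Series count, uncurated share of all Series)."""
    curated = gds * gse_per_gds
    uncurated = gse - curated
    return uncurated, uncurated / gse


uncurated, share = backlog(GDS_RECORDS, GSE_RECORDS, GSE_PER_GDS)
print(f"{uncurated} uncurated Series records ({share:.0%} of all Series)")
```

Under this assumption the estimate is 17235 uncurated Series records, roughly three quarters of all Series, which agrees with the "more than 15,000" quoted above and is of the same order as the rounded 80% figure.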
Here we report on a framework, MAdmf (Microarray
Discovery Metadata Framework), that addresses
these issues, and on its application to a case study.