METADATA MANAGEMENT AND SEMANTICS IN MICROARRAY REPOSITORIES
Kocabaş F1,2,*, Can T3, Baykal N1
*Corresponding Author: Fahri Kocabaş, NATO HQ C3S, Blvd Leopold III B, 1110 Brussels, Belgium; Tel.: +32-2-707-5533; Fax: +32-2-707-5834; E-mail: FK:fahri@ii.metu.edu.tr; f.kocabas@hq.nato.int
page: 49

INTRODUCTION

The volume of experimental data in microarray repositories becomes unmanageable as the number and content of submissions grow. Annotations and metadata added to microarray records increase their content further, yet these contextual data are not appropriately structured and do not conform to defined standards. The biomedical community has an interest in interpreting the results of investigations in which microarrays are used, but there are serious backlogs and exchange between the repositories cannot take place.
Several standardization initiatives in the microarray community have progressed. For example, MIAME (Minimum Information About a Microarray Experiment) focuses on content [1]. Others include a minimum dataset checklist, MIBBI (Minimum Information for Biological and Biomedical Investigations); an object model, MAGE OM (Microarray Gene Expression Object Model); an exchange format, MAGE-ML (Microarray Gene Expression Mark-up Language); and an ontology, the MGED (Microarray Gene Expression Data) Ontology [2]. These initiatives and their developments have been presented in review articles [3].
The three primary microarray repositories are NCBI GEO (National Center for Biotechnology Information Gene Expression Omnibus) [4], EBI (European Bioinformatics Institute) ArrayExpress [5], and CIBEX (Center for Information Biology Gene Expression Database) [6]. Microarray repositories not only host the experimental data but also provide tools for querying and analyzing microarray records. Public-domain software has been developed both to extend the functionality of the GEO repository, such as GEOmetadb [8] on the BioConductor platform [7], and to implement MAGE OM, such as the Sequence Analysis and Management System (SAMS) [9]. However, it is difficult for laboratories with less bioinformatics support to implement these applications.
Thus, the exchange and common understanding of data among disparate repositories remain an issue, even though mediating software is available [10]. The MINiML (MIAME Notation in Mark-up Language) and MAGE-TAB (Microarray Gene Expression Tabular) formats, which were developed to provide solutions to these problems [11], lack standard syntax and semantics. The solution is standards-related and can be provided through data management discipline using architectural frameworks. The GEO repository was selected for this study. We detected the following flawed and ambiguous entries in GEO records:
(1) Inconsistent, incomplete, and incorrect entries for the same information element. For example, the address data contain seven different spellings of the country name ‘USA’ (United States of America, United States, USA, US, U.S., U.S.A., U.S.A). There are city names in the country field. The same person, organization, or date is written in different patterns.
(2) Three versions of the MINiML file exist for the same Series record, each with different content: i) the MINiML format of the HTML Series record, ii) the MINiML_family link within the HTML Series record, and iii) the programmatically extracted Series data for the whole database. For example, one of the contributors to Series record GSE362 is missing in version i), and the Summary, PubMed ID, and Overall Design fields are not available in version iii).
(3) Related experiments (super Series and sub Series records) are not visible. A super Series record includes individually submitted subset records, all of which belong to one experiment. Since some Series records about an experiment are submitted separately without stating whether they are related, it is difficult to trace the records for such an experiment. For example, Vijay G. Sankaran submitted three Series records (GSE13283, GSE13284, and GSE13285) on 5 December 2008, which did not appear to be part of a single experiment. However, they prove to belong to one experiment: GSE13285 is a super Series record that includes the subset Series GSE13283 and GSE13284.
(4) The MIAME guideline [1] that the summary part of a microarray experiment record and the abstract of its publication should be the same is not followed. For example, the summaries of GSE3570 and GSE15808 differ from the abstracts of their publications. This is a data integrity issue. GSE5546 was submitted to GEO in 2006 and still has no citation information, although its related publication appeared in 2008 (PMID 18271932).
Some areas of GEO data management that have room for improvement are as follows. The microarray repositories are not connected, so records held in different repositories are not visible to one another. MIAME is a content standard that lists the minimum content without format guidance. The type, content, format, and availability of data and metadata vary between repositories. Therefore, the regular exchange of data that occurs among DNA repositories does not happen. There is an initiative by the ArrayExpress staff to import GEO records (approximately 10% of GEO records) on a weekly basis. However, the two repositories are not synchronized: if a record in GEO is updated, the change is not automatically reflected in the corresponding ArrayExpress entry [12]. The metadata about the records are not structured in accordance with the DC (Dublin Core) metadata standard [13]. There are entry anomalies, inconsistent terminology, and even incorrect entries within the metadata, e.g., in contact information (names, organizations, country names, dates) or in the summary. This can be handled with structured data entry based on a controlled vocabulary and ontology. Mandated patterns could also be included in a relevant schema file, as tested in OpenSDE projects [14].
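As an illustration of structured entry against a controlled vocabulary, the following Python sketch maps the country spellings reported above onto one canonical value and checks a date field against a mandated pattern. The vocabulary table, the function names, and the choice of an ISO 8601 date pattern are assumptions made for this example; they are not part of GEO, MINiML, or OpenSDE.

```python
import re
from typing import Optional

# Hypothetical controlled vocabulary: canonical name -> accepted variants.
# The variants are the seven spellings reported for GEO address data.
COUNTRY_VOCABULARY = {
    "USA": [
        "United States of America", "United States", "USA", "US",
        "U.S.", "U.S.A.", "U.S.A",
    ],
}


def _key(value: str) -> str:
    """Reduce a free-text entry to a comparison key (lowercase ASCII letters only)."""
    return re.sub(r"[^a-z]", "", value.lower())


# Reverse lookup table, built once from the vocabulary.
_LOOKUP = {
    _key(variant): canonical
    for canonical, variants in COUNTRY_VOCABULARY.items()
    for variant in variants
}


def normalize_country(value: str) -> Optional[str]:
    """Return the canonical country name, or None if the entry is not in the vocabulary."""
    return _LOOKUP.get(_key(value))


# Assumed mandated pattern for a date field: ISO 8601 calendar date (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """Check that a date entry matches the mandated pattern exactly."""
    return DATE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for entry in ["U.S.A.", "united states", "Boston"]:
        print(entry, "->", normalize_country(entry))
    print(is_valid_date("2008-12-05"), is_valid_date("5 December 2008"))
```

An entry that returns None, such as a city name typed into the country field, would be rejected at submission or routed to a curator instead of being stored as free text.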
If a structured format were available, the experimenter could enter more of the experimental findings, including metadata on contributors, experiment settings, biomaterials, and data analyses, and especially on the result/summary section. The quality and state of a record are not clearly labeled at submission or throughout its lifetime. Quality metrics (values such as “verified” and “citation >10”) and states (values such as “incomplete” or “retired”) can add important meaning to the records. For example, some experiments are published in a high-citation publication, performed by respected scientists, verified with RT-PCR (real-time polymerase chain reaction), and repeated with success. Conversely, a record may be identified as a poor study if it is contradicted by experiments of high quality. There are also comparability issues between different platforms, as pointed out by the MAQC (MicroArray Quality Control) project [15].
Microarray records, related publications, and the relevant data fed into other databases, such as gene and biological pathway databases, should be consistent. The microarray repository should be the reference for other platforms. Semantics are not addressed in the design of microarray repositories. Thus, understandability and usability are weak, and life cycle management, including version and change management, is not available.
More automation would address the slow curation work and the growing backlog. For example, GEO is experiencing a significant backlog in the creation of curated Datasets (GEO Data Set: GDS), and most of the submitted Series records (GEO Series: GSE) do not have a corresponding Dataset. Analysis tools operate on GDS records. At present, there are about 2721 GDS records and 22677 Series records (two GSE in one GDS on average). More than 15,000 GSE records are yet to be curated, which amounts to a backlog of about 80%. Also, 20% of submitted Series records have not yet been published due to ongoing curation work.
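The backlog estimate can be reproduced from the record counts quoted above. The short Python sketch below is only a back-of-the-envelope check; the function name is hypothetical, and it assumes that every GDS covers exactly the quoted average of two GSE records.

```python
GDS_RECORDS = 2721    # curated Datasets, as quoted in the text
GSE_RECORDS = 22677   # submitted Series, as quoted in the text
GSE_PER_GDS = 2       # average number of Series per Dataset, as quoted in the text


def curation_backlog(gds: int, gse: int, gse_per_gds: int) -> tuple:
    """Return (number of Series without a Dataset, that number as a fraction of all Series)."""
    curated = gds * gse_per_gds   # Series already covered by a Dataset
    uncurated = gse - curated     # Series still waiting for curation
    return uncurated, uncurated / gse


uncurated, fraction = curation_backlog(GDS_RECORDS, GSE_RECORDS, GSE_PER_GDS)
print(f"{uncurated} Series without a Dataset ({fraction:.0%} of all Series)")
# prints: 17235 Series without a Dataset (76% of all Series)
```

Under this assumption the count is in line with the statement that more than 15,000 Series await curation, and the proportion is of the same order as the rounded figure of about 80%.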
The number of GDS records has been unchanged since last year. Here we report on a framework, MAdmf (Microarray Discovery Metadata Framework), which addresses these issues, and on its application to a case study.



 Copyright © Balkan Journal of Medical Genetics 2006