These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as Ca BIG.They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research.
This report details the creation and study of a road map of the TCGA’s HTTP repository, created to enable the use of this unprecedented biomolecular data resource in the creation of Web 3.0 applications, and enhance the reproducibility of biomolecular research delivered as elements of a computational ecosystem.In a previous report (Deus et al., 2010), the authors have identified a Resource Description Framework (RDF) data model describing the contents of TCGA file repository.Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at Contact: - The Cancer Genome Atlas (TCGA) is a joint project of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to comprehensively apply genome analysis technology to the study of the biomolecular basis of cancer (NCI Wiki, 2011).Concretely, this project has analyzed tumor and normal samples from over 6000 patients, which resulted in the collection and public availability of 37 types of genomic and clinical data for 33 cancers.More importantly, in 2011, there was a momentous change in the level of data interoperability of the TCGA data repository: data files are now available directly through HTTP calls to a central directory, located at Jac.
This opens entirely new opportunities for interactive reproducible data analysis and visualization.
Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0).
Specifically, this engine uses Java Script in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory.
Simultaneous to the expansion of the TCGA, the tooling required for enabling computational ecosystems for data-driven medical genomics (Almeida, 2010) is maturing rapidly, to the point that tools operating within and providing such ecosystems are beginning to appear (Almeida et al., 2012b).
The concern that the web browser is computationally inefficient for advanced numerical procedures has also been amply overcome, as we found when identifying sequence analysis procedures making use of the Map Reduce (Dean and Ghemawat, 2008) distributed computing template (Almeida et al., 2012a; Vinga et al., 2012).
The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages.