An open-source community begins to gel at the eTRIKS/tranSMART Developer’s Workshop

Summary of workshop

A tranSMART developer’s workshop was held at Imperial College London, organised by Prof Yi-Ke Guo. Prof Guo is the academic lead of the European Translational Information and Knowledge Management Services (eTRIKS) project and was recently appointed CTO of the tranSMART Foundation. 40 attendees from 12 organisations participated in the 3-day meeting.

Prof Guo described tranSMART as a great opportunity for all stakeholders. We are in the midst of a transition from a vendor-owned, pharmaceutical-company-centric application to an open platform centred on the common needs of all translational research stakeholders.

The vision, mission and goals of the tranSMART Foundation are as follows:

Vision: Realisation of the promise of translational research.

Mission: Effective sharing, integration, standardisation and analysis of heterogeneous data from collaborative translational research using the tranSMART platform.

Goals:

  • Establishment and maintenance of tranSMART as the preferred data sharing and analytics platform for translational research.
  • Facilitation of collaborative research across academic, non-profit and corporate communities.
  • Alignment and growth of a vibrant developer network around the scientific goals of the tranSMART community.
  • Reduction of barriers to entry through use of advanced technologies and an active marketplace.

Development efforts need to converge in order to achieve a stable and uniform core, a commonly agreed development process, a disciplined quality control process and a shared vision of the development roadmap.

The aim of the workshop was “productive convergence for progressive divergence” by achieving:

  • A common understanding of the state of the art.
  • A collectively designed and agreed development process and platform.
  • An agreed development roadmap for the next 6 and 12 months resulting in a stable tranSMART 1.x core system.
  • A shared vision of tranSMART’s future direction beyond 1.x.
  • An organisational mechanism for delivering the core system.

Knowledge Sharing & Current State of the Art

Jinlei Liu, Director of Engineering at Recombinant by Deloitte, described the current status of tranSMART development. Adoption of tranSMART in pilot, pharmaceutical company, academic medical centre/government and non-profit organisation projects has been increasing over the last 2 years. As a consequence, there is now an emerging community (Figure 1).

Figure 1. tranSMART Adoption and Emerging community

 

Liu went on to say that the core problem of collaborative analysis of medical research data sets is that it is not scalable today. This is due to a lack of standard integration between data sets.

He then provided an overview of the current system architecture, the data categories accommodated and their storage, alongside an explanation of the existing tables in the tranSMART database.

Liu’s vision of future code improvements encompasses a data storage redesign (including moving a large part of the experiment data to a NoSQL database for high-performance, scalable management), curation and ETL enhancements, analytics integration via an R plug-in, and a move to a service- and plug-in-based, multilayer (N-tier) architecture model.

He envisions the tranSMART open-source project growing to resemble Drupal[1], which currently has 20,585 modules and a community of 23,921 developers.

Paul Avillach from the Pompidou University Hospital in Paris (APHP - HEGP) described the current technology infrastructure at HEGP, where tranSMART is used for storage of clinical and tumoral omics data (mRNA, miRNA and SNP). tranSMART is also used to produce Kaplan–Meier survival analysis plots to assess the impact of mutations on patient survival, with the resulting plots being comparable to published figures. Avillach also highlighted the adoption of tranSMART by the CARPEM (CAncer Research and PErsonalized Medicine) and EMIF (European Medical Information Framework)[2] projects.
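For readers unfamiliar with the method, the Kaplan–Meier estimate behind such plots can be sketched in a few lines. This is a minimal illustration only, not the code used by tranSMART or at HEGP; the function name `kaplan_meier` and the sample data are assumptions made for the example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    Hypothetical helper for illustration; not part of tranSMART.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove both events and censored patients
    return curve


# Made-up follow-up times (months) and event flags for five patients.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for time, prob in curve:
    print(f"t={time}: S(t)={prob:.2f}")
```

Plotting one such step curve per mutation status gives the kind of comparison described above.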

Stephen Larson from One Mind for Research introduced the organisation’s vision and described the Neuroscience Information Framework (NIF)[3] and how the TRACK-TBI initiative[4] is using an Amazon Web Services (AWS) instance of tranSMART in a case study.

Mike Westaway and Ian Dix from AstraZeneca (AZ) and eTRIKS presented an AZ tranSMART proof of concept study. Issues encountered included the onerous curation process, inadequate analysis capabilities, only basic visualisation and a lack of cross-study analysis. Nevertheless, the study concluded that AZ would actively monitor tranSMART developments and review its position in future.

Florian Guitton from Imperial College London introduced the eTRIKS project. He related his experience of installing and debugging the PostgreSQL version of tranSMART on the Imperial College and CNRS cloud platforms. There are a number of latent challenges, which break down as follows:

  • Design
    • Database
    • APIs
    • Heterogeneous construction
    • Potential for parallelism
    • Installation
      • No package, no scripts
      • Security

Serge Eifes from the University of Luxembourg/eTRIKS presented the loading of the Gene Expression Atlas[5] data into the tranSMART Search App as a public data resource. He detailed the data curation workflow and highlighted potential changes in the tranSMART architecture that would optimise curation. Data capture would be improved by the incorporation of CDISC standards. There is also a need for minimal information templates for omics / lab data and a need for APIs.

Terry Weymouth from the University of Michigan provided an overview of the tranSMART PostgreSQL migration effort by reporting on the contributors, timeline, code affected, installation process and open issues.

Hackathon

This workshop was very much about forming the tranSMART open source development community. Accordingly, a ‘hackathon’ took place as a means of establishing the development process.

Attendees were separated into 4 groups:

  • Architecture[6]
  • Data Model
  • ETL[7]
  • Tests

Each group carried out discussions on its topic and worked on issue resolution. The results of the Architecture group discussion can be found on the tranSMART wiki[6], and those of the Data Model group are included in the attached presentation. The ETL[7] and Tests group results were not documented, but 2 of the 26 failing unit tests were resolved by the group.

On Twitter: Kees van Bochove @keesvanbochove I’m chairing the #tranSMARTHackathon today at eTRIKS meeting. So great to see an open source community form live! Results are presented now

Roadmap

As a prelude to the roadmap definition, features being developed in private or forked projects were presented.

  • J. Cornibe (JnJ) – “Extending tranSMART: Developing a Faceted Search interface for data mining”: described J&J’s new interface.
  • T. Weymouth (UMICH) – “The Umich development effort: NCIBI tools added to tranSMART”: described the UMich efforts in building smart plug-ins for tranSMART.
  • C. Raillere / D. Peyruc (Sanofi) – “Translational Research at Sanofi - FC&L4tranSMART”: described Sanofi’s private version of the system and the research on MongoDB-based management of experiment data and analysis results.
  • K. van Bochove (The Hyve/TraiT) – “Using tranSMART for copy number variation analysis”: described the effort in building a new data structure to support CNV.

Limited Feature List:

  1. Core
    1. Support for unified search
    2. i2b2 / tranSMART decomposition
    3. Access Control (Single sign on)
    4. New data modalities (CNV, proteomics …)
    5. Data Storage / Access
      1. ETL / Ontology
      2. Rich clinical data sources such as OpenClinica / REDCap
      3. CDISC standards
      4. API Development to support the integration of the following developed components
        1. JnJ Faceted Search
        2. NCIBI Tools
        3. Sanofi custom UI
        4. Raw file storage (NoSQL – MongoDB[8])
        5. Metadata extraction and storage in MongoDB
        6. Galaxy

These features will be enriched over the next 12 months in the 1.x releases.

The key decision coming out of the meeting is the revision of the core of tranSMART into a set of pluggable components (Legos). Different projects and users will build up their own tranSMART instances based on their specific requirements. To achieve this, development will focus on the following three areas:

1)     ETL and Ontology Plug-in: a well-developed ETL plug-in will be crucial to the immediate use of many tranSMART deployments.

2)     Core architecture re-engineering based on a plug-in model.

3)     Establishing a well-documented API.
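As an illustration of what a core built from pluggable components might look like, here is a minimal sketch. It is not tranSMART code: the class names (`Plugin`, `Core`, `EtlPlugin`, `FacetedSearchPlugin`) and the `run` interface are hypothetical, chosen only to show a small core that knows nothing about its plug-ins beyond a shared, documented interface.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Hypothetical plug-in contract: a unique name and a single entry point."""

    name: str

    @abstractmethod
    def run(self, payload: dict) -> dict:
        """Process a request and return a result."""


class Core:
    """Hypothetical core: only registers plug-ins and dispatches calls."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin: Plugin) -> None:
        if plugin.name in self._plugins:
            raise ValueError(f"plug-in already registered: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def installed(self) -> list:
        return sorted(self._plugins)

    def call(self, name: str, payload: dict) -> dict:
        return self._plugins[name].run(payload)


class EtlPlugin(Plugin):
    """Toy ETL step: counts the rows it was asked to load."""

    name = "etl"

    def run(self, payload: dict) -> dict:
        return {"loaded": len(payload["rows"])}


class FacetedSearchPlugin(Plugin):
    """Toy search: case-insensitive substring match over documents."""

    name = "faceted-search"

    def run(self, payload: dict) -> dict:
        term = payload["term"].lower()
        return {"hits": [d for d in payload["documents"] if term in d.lower()]}


# Each deployment assembles only the components it needs.
core = Core()
core.register(EtlPlugin())
core.register(FacetedSearchPlugin())
print(core.installed())
print(core.call("etl", {"rows": [{"id": 1}, {"id": 2}]}))
```

In this arrangement a project that does not need faceted search simply does not register that plug-in, and the core itself stays unchanged.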

The workshop concluded with the definition of the immediate (3-month) roadmap and the planning of the short-term (6-month) development roadmap. Attendees agreed to a further “rendez-vous” in early June in the Netherlands to review the 3-month progress, resolve pressing issues and cement the 6-month roadmap.


[1] http://drupal.org/

[2] http://www.imi.europa.eu/content/emif

[3] http://www.neuinfo.org/

[4] http://www.brainandspinalinjury.org/research.php?id=189

[6] http://transmartproject.org/wiki/display/TSMTGPL/Architecture

[7] http://transmartproject.org/wiki/display/TSMTGPL/Data+ETL

[8] http://www.mongodb.org/