Analysis pipelines for very rich data sets

The Challenge

OncoTrack analyses deep data sets from cancer patients to discover new markers for colon cancer. Its goal is to identify and characterize biomarkers that improve our understanding of the variable make-up of tumours and how this affects the way patients respond to treatment. This knowledge can then guide appropriate therapy choices for each individual patient. The data sets created are extremely rich, containing clinical data, animal xenograft data and a wide range of genomic information, including genome-wide association study (GWAS) and next-generation sequencing (NGS) data. Hundreds of terabytes of data have already been collected and are ready to be analysed. Data sets of this size are difficult to navigate, understand and analyse.

The eTRIKS solution

Comparing these large data sets requires cooperation between clinical scientists and data informaticians. eTRIKS and OncoTrack have worked together to create a data model that encompasses all the different types of data available, and have developed standard data analysis pipelines that allow scientists to repeat complex data analysis tasks easily. Together, we have also pioneered the “datafest” approach, in which OncoTrack scientists work with their data in the newest version of tranSMART with close support from eTRIKS curation and data experts.

The details

  • tranSMART 1.1 installed on OncoTrack’s own server
  • tranSMART 1.2 installed on an eTRIKS-hosted server
  • Ontology tree developed for phenotypic data from in vivo, in vitro and in silico experiments
  • Data support requirements collected for xenograft and cell line data
  • Data structures defined for NGS data, covering VCF standards and integration of a genomic browser
  • Analysis pipeline developed to link cohorts from different data trees and provide easy access to Galaxy capabilities
  • Data transferred from tranSMART 1.1 to fully hosted tranSMART 1.2
  • Training for users and curators
  • “Datafest” day hosted for hands-on data analyses

