You know that you are on the threshold of something special when you see experts feverishly working away on something they want to test live. Data integration sounds like a relatively dull activity, but nothing could be further from the truth. Is there anything more exhilarating than discovering something that has not been seen before? Of course, discovering something new and impactful is easier said than done, but Keith Elliston, CEO of the tranSMART Foundation, has turned this into his bread and butter. Keith wants to do more than make new discoveries; he wants to change the very philosophy behind the way science is conducted.

People see value in hiding their data, but it is actually more valuable when you share it.

The tranSMART Foundation oversees the development of an open source platform (the tranSMART platform) that enables the upload, integration and exploration of very large biomedical datasets, and organizes a scientific and development community around the platform and translational research. The purpose of the platform is to enable translational scientists to identify the telltale markers of disease. From these, new hypotheses can be generated that tackle the mechanisms behind disease and guide the development of new therapies.

Data integration enables treatment of disease.

A disease can manifest itself in many ways, but still produce symptoms that are indistinguishable between patients. As a result, only a few patients may end up benefiting from the generic treatment they are administered. What translational researchers do is sub-type, personalize or stratify disease domains such as asthma, encephalopathies and cancer down to the molecular level by studying integrated datasets taken from a variety of sources. Once new biomarkers of disease (indicators of subtype) have been discovered, the development of treatments that specifically target a patient’s sub-condition can begin. This is the basis of precision medicine.

Keith wants to push the open source philosophy as far as it can go.

When you listen to the community, there is a view that research and treatment development is not simply the responsibility of the pharmaceutical companies or of governments; it is the responsibility of the whole of society. Keith wants to pass some of the responsibility for medical research on to the general public through open source and open data.

“Sooner or later, we could see a seventeen-year-old discover a key medical biomarker whilst playing around with data on his personal home computer.”

Is this a joke? Not according to Keith. By giving the public access to the tools and the research data, we can tap into a huge resource, which promises to provide amazing new discoveries. But we’re not there yet!

The tranSMART Foundation provides the open source tool that enables the public to get to work, but what of the data?

In previous years, we focused our efforts on developing the tranSMART open source platform that enables the integration and exploration of datasets. Having done that, we now focus our efforts on open content, and after a lot of effort we now have access to two hundred datasets. Most recently the Michael J. Fox Foundation, a not-for-profit research foundation dedicated to finding a cure for Parkinson’s disease, has provided access to its data. With access to these data, the tranSMART Foundation wanted to test the principle that if you give a group of people access to datasets, and the platform needed to analyse those datasets, something exciting can happen. With the data provided, the tranSMART Foundation decided to conduct a datathon on neurodegenerative diseases.

A datathon is an event that brings together cross-disciplinary experts to analyse data: in this case, data scientists, neuroscientists and biostatisticians. The experts are split up into groups, and specific challenges are suggested to those groups, though they are free to choose their own as well. They are given access to the platform, access to the data and a short period of time (3 days in total).

From June 30th to July 2nd 2015, a serious attempt to discover biomarkers of Parkinson’s and Alzheimer’s disease was made using the tranSMART platform. Datasets were provided by ADNI (Alzheimer’s Disease Neuroimaging Initiative), PPMI (Parkinson’s Progression Markers Initiative), LRRK2 (Leucine Rich Repeat Kinase 2, a Michael J. Fox Foundation dataset) and BioFIND (a Parkinson’s disease study). These were combined with 10 publicly available gene expression datasets that were curated by the University of Luxembourg. Specifically, we wanted to see if we could discover similarities and differences across neurodegenerative diseases.

Each team selected a specific challenge:

  1. Identify novel biomarkers that predict the progression of Parkinson’s disease.
  2. Build disease profiles for Parkinson’s disease and Alzheimer’s disease through the discovery of biomarker signatures.
  3. Investigate whether gene expression analyses of PD and AD blood samples have anything in common.
  4. Identify a gene signature that differentiates Parkinson’s disease from healthy normal subjects and compare the findings with Alzheimer’s disease.
  5. Compare differential gene expression for PD versus controls using microarray data and check the effects of those genes in AD.
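To make the last few challenges a little more concrete, here is a minimal Python sketch of the general idea: find genes that are expressed differently in patients versus controls for each disease, then see which genes turn up in both lists. This is not the method the datathon teams used and it does not touch the tranSMART platform. The gene names, expression values, function names and the fold-change threshold are all hypothetical, made up purely for illustration.

```python
from math import log2
from statistics import mean


def log2_fold_changes(cases, controls):
    """Return {gene: log2(mean case expression / mean control expression)}."""
    return {
        gene: log2(mean(cases[gene]) / mean(controls[gene]))
        for gene in cases
        if gene in controls
    }


def differential_genes(cases, controls, threshold=1.0):
    """Genes whose absolute log2 fold change meets the (assumed) threshold."""
    changes = log2_fold_changes(cases, controls)
    return {gene for gene, lfc in changes.items() if abs(lfc) >= threshold}


# Toy expression values, invented for illustration (not from any real dataset).
pd_cases = {"GENE_A": [8.0, 8.0], "GENE_B": [2.0, 2.0], "GENE_C": [4.0, 4.0]}
pd_controls = {"GENE_A": [2.0, 2.0], "GENE_B": [2.0, 2.0], "GENE_C": [1.0, 1.0]}
ad_cases = {"GENE_A": [6.0, 10.0], "GENE_B": [1.0, 1.0], "GENE_C": [3.0, 3.0]}
ad_controls = {"GENE_A": [2.0, 2.0], "GENE_B": [4.0, 4.0], "GENE_C": [3.0, 3.0]}

pd_hits = differential_genes(pd_cases, pd_controls)  # differential in PD
ad_hits = differential_genes(ad_cases, ad_controls)  # differential in AD
shared = pd_hits & ad_hits                           # differential in both

print(sorted(shared))
```

A real analysis would replace the simple fold-change cut-off with proper statistical testing and correction for multiple comparisons, and would work on thousands of genes across many samples rather than three toy genes.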

The datathon was a smash hit

The design and outcomes of the datathon have been published, with a number of significant discoveries made, including the identification of many biomarkers, one or two of which have diagnostic potential. The groups brought together were very encouraged, and plan to follow up the work conducted during the datathon.

The platform proved highly successful.

Not everything was plain sailing

We found barriers with the open data provided to us, in that it wasn’t as open as we first thought.

We learned that:

  1. Open access to data is not easy.
  2. There are strong restrictions on transformative access to data.
  3. Open source doesn’t readily translate into open data.

The NIH advertises open datasets, but somehow access to the data proved quite difficult. Happily, a deal was made which enabled us to provide access to the data, at least for the duration of the datathon.

We have hosted a number of hackathons and we have this process nailed down pretty well. We have even had some variations on the hackathon theme, for example the testathon, where we saw teams focus on different aspects of the platform and test its performance, functionality and scalability. Everything we could think of, we tested. These types of events require a relatively low activation energy, whereas the datathon requires a high activation energy.

Regulations are geared towards providing access to data in the USA, but this is not as easy as it sounds. Flat files can be made available to individual scientists, but these types of datasets are not useful in that form, and we are prevented from reformatting them in ways that make them useful, for example by integrating them within the tranSMART platform. Open data is defined by Wikipedia as “the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.” In the US, openness is generally a well-received concept and there is a lot of rhetoric in favour of it, but the reality is quite different.

What of the future?

We are focusing future efforts on scalability and the integration of new data types as we develop the tranSMART Platform. The new 1.3 version of tranSMART (release 16.2) will see great new features and improved stability. After this, we will continue our efforts to produce a commercial quality product. We want to continue our focus on precision medicine and become the ultimate enabler of translational research through our efforts in open source, open data and open science.

We believe open source and open data are critical to innovation and the development of new technology in the life sciences.

Software tools are powerful, but access to them is complicated by the high costs of development, and this of course limits the size of the community that can adopt quality tools. Open source invites the community to collaboratively define the problems and offers them an opportunity to collaboratively address those problems. The tranSMART Foundation fills the role of community activator and organizer here. We like to think that the tranSMART platform provides an enabling platform that defines the need for open data, giving scientists access to high-end technology without the need for a big budget. This prospect has been borne out by the adoption of tranSMART by a growing group of non-profit research foundations, including the Michael J. Fox Foundation and OneMind4Research, amongst others.

With this in mind, the prospect of a seventeen-year-old with access to powerful tools like tranSMART making a major scientific discovery doesn’t feel like a joke. It feels very possible.