A sad truth about all the translational research data that is being collected.

Remember when farmers were producing too many crops?

Everyone stood by and watched a lot of crops rot in barns. There was no way to organize, ship and appropriately store the unused crops; it was simply too difficult to get them where they were most urgently needed. The same thing is happening again.

This time with data!

The more money is invested in research, the more data is produced. And that is exactly what is happening: more data, a lot more data. Data might not fill empty bellies, but it holds the potential to produce answers to a lot of woes.

Unlike crops, data does not rot. In fact, data as old as 64 years can still be useful. There are plenty of ‘data storage barns’, and more importantly there are well-established means of analyzing and integrating data sets, effectively multiplying their value. However, you first have to separate the useful data from the chaff. Is there a machine for that? Something like those wheat harvesters you see in iconic farming photos?

Unfortunately no, but there are methods: methods for cleaning data, methods for describing data and methods for loading data. These methods require the effort of bioinformaticians, which is all well and good, but there is another problem.

Datasets must be universally recognizable before they can be universally exploited.

Unfortunately, a lot of data is filed away without much consideration given to format or future use. This data could be re-used. One of the greatest weaknesses of any translational research study is a lack of sufficient data to draw definitive conclusions. Old data that cannot be compared with new data is an opportunity lost.

Scientists now have too much choice when it comes to data formats. In fact, it is quite common for researchers to invent a new format for each new technique, and sometimes for each experiment. This makes the work of integrating large datasets significantly more difficult: lots of bioinformaticians toiling away in dark cubicles. This is at the very least inefficient, if not an outright insurmountable barrier. It could be better.

It would be better if there were a set format for each data type, all stored using a uniform set of standards. Coming up with standards is not an easy task. A standard implies consensus, and consensus is often hard, if not impossible, to reach. Is it even worth the effort to try?

Gina Kolata writes in a recent New York Times article:

“the fear is that this avalanche of genetic and clinical data about people and how they respond to treatments will be hopelessly fragmented and impede the advance of medical science.”

She goes on to point out that there are currently no standards for genetic and clinical data. Quoting Brad Margus, founder of the A-T Children’s Project, she also points out that having big datasets will enable researchers to be proactive.

This is exactly the value of robust translational research knowledge management: it allows you to perform data-driven exploratory analyses. Even harmonizing across just a few studies can add significant value. So while aiming to create a set of global standards is a gallant effort, even having a single set of studies under one standard is immensely valuable.
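To make “harmonizing across a few studies” concrete, here is a minimal sketch assuming two hypothetical studies that record the same measurement under different column names and units. The schema, identifiers and conversion are purely illustrative, not a prescription for how any particular project does it:

```python
# A minimal harmonization sketch: two hypothetical studies record body weight
# under different column names and units, and are mapped onto one shared schema.
import pandas as pd

# Study A reports weight in kilograms; Study B reports it in pounds.
study_a = pd.DataFrame({"subject_id": ["A01", "A02"], "weight_kg": [70.0, 82.5]})
study_b = pd.DataFrame({"PatientID": ["B01", "B02"], "WT_LBS": [154.0, 181.5]})

# Rename variables to a common schema (column names here are invented).
harmonized_a = study_a.rename(columns={"subject_id": "USUBJID", "weight_kg": "WEIGHT_KG"})
harmonized_b = study_b.rename(columns={"PatientID": "USUBJID"})
harmonized_b["WEIGHT_KG"] = harmonized_b.pop("WT_LBS") * 0.453592  # pounds to kilograms

# Once both studies share one schema, pooling them is a trivial concatenation.
pooled = pd.concat([harmonized_a, harmonized_b], ignore_index=True)
print(pooled)
```

Most of the effort hides in the renaming and unit-conversion step; a shared standard means that step is done once, up front, instead of by every analyst who touches the data.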

Harmonizing under shared standards is indeed what eTRIKS is aiming to do for European translational research projects. One of the eTRIKS partners, CDISC, is already achieving some success in this regard for clinical data. CDISC is an international non-profit organization that is working to form consensus around clinical data standards. The list of those adopting CDISC standards is growing rapidly. However, this only partially fulfills the need.
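To give a flavor of what a clinical data standard looks like in practice, here is a small, hypothetical sketch of a demographics table laid out in the style of CDISC’s SDTM DM (Demographics) domain. The study and subject identifiers are invented, and this is not a complete or validated SDTM dataset:

```python
# An illustrative table in the style of the CDISC SDTM DM (Demographics) domain.
# Values and identifiers are invented; real SDTM datasets carry many more variables.
import pandas as pd

dm = pd.DataFrame({
    "STUDYID": ["STUDY01", "STUDY01"],           # study identifier
    "DOMAIN":  ["DM", "DM"],                     # SDTM domain abbreviation
    "USUBJID": ["STUDY01-001", "STUDY01-002"],   # unique subject identifier
    "AGE":     [54, 61],
    "SEX":     ["F", "M"],
})

# Because conformant studies name these variables the same way,
# pooling demographics across studies reduces to a concatenation.
print(dm)
```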

Translational research also includes a whole host of non-clinical data, ranging from patient genomic data to laboratory assays, all collected in the effort to advance medical care through research. Non-clinical data is where formats multiply: each new technique creates a new data format, and each machine collects and stores results in a slightly different way.

Enter people like Susanna Sansone.

Susanna is Associate Director and Principal Investigator at the University of Oxford e-Research Centre, and Honorary Academic Editor of Scientific Data, the online, open-access data publication platform under development by Nature Publishing Group. The NPG Scientific Data team has developed a roadmap aimed at steering researchers towards non-clinical data unification. In the roadmap, NPG describes four steps you can take to become part of the scientific data community:

“1. Get in contact

We encourage community representatives to [contact us]. Please explain your motivations and who you represent. This helps us to gauge interest within a community, build a list of expert contacts, and plan ahead for the next phases.

2. Register your community standards and databases

When relevant, we strongly encourage those developing or maintaining open, community standards, and/or implementing them at community repositories, to publish and register their initiatives at BioSharing.  This helps us to monitor the development and uptake of standards.  First check if your standards and/or database are listed, or register to submit or claim one or more records to update them, if needed.

3. Help us create improved templates for your community

We will invite designated community representatives to enrich our existing generic ISA configurations and Word and Excel templates to help maximize compliance to community-developed minimum information requirements.  These templates will be vetted through a community feedback process and ultimately released on the Scientific Data website to help authors meet community standards.

4. Get integrated!

We will work with existing data repositories, service providers and data producers to implement direct pipelines, using the ISA framework, to minimize authors’ work and streamline information flow.  This could be used to help build direct submission pipelines from other data management systems or repositories – so that authors will only have to write complex experimental metadata descriptions once.”
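As a rough illustration of the kind of tabular metadata the ISA framework in step 4 builds on, here is a small, hand-written approximation of an ISA-Tab style study sample table. The column choices are illustrative only; real submissions would follow the community-specific ISA configurations and templates described in the roadmap:

```python
# A hand-written approximation of an ISA-Tab style study sample table
# (tab-delimited). Columns and values are illustrative, not a validated template.
import csv

rows = [
    ["Source Name", "Characteristics[organism]", "Protocol REF", "Sample Name"],
    ["patient 1",   "Homo sapiens",              "blood draw",   "sample 1"],
    ["patient 2",   "Homo sapiens",              "blood draw",   "sample 2"],
]

with open("s_example_study.txt", "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)
```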

You know a topic is important when Nature decides to focus an entire journal on it. You might be saying to yourself, “A journal about data? What’s the value?”

The value is that having a place to publish descriptions of datasets is an important communication tool in the drive towards a uniform set of standards. It also elevates the data scientist to a deserved level of prominence in the field of translational research.

In short, there is a growing effort to make sure translational research data does not figuratively ‘rot’, and the eTRIKS consortium looks to be very much a part of that effort. With CDISC on board, the clinical standards are well underway. The next big challenge is the non-clinical standards. eTRIKS is convening a panel of experts to come up with an initial set of recommendations and is planning an ongoing effort to drive this forward. This, however, will only be a small step. We need a community-wide, persistent effort. We hope you will join us.

  • Luciana Cavalini

    Our research group solved the problem of semantic AND syntactic interoperability of Translational Research databases. See our code on https://github.com/mlhim/SemanticMedWeb and join our G+ community on https://plus.google.com/b/114637547900683999300/communities/100734838664245160263

  • tgarrett

    Thanks Luciana. Perhaps you could share some thoughts here on how you see the field developing…

    • Luciana Cavalini

      Thank you Mr Garrett. Our research group, the Multilevel Healthcare Information Modeling (MLHIM) Laboratory, has developed a set of specifications that allow distributed, independently developed databases to send syntactically and semantically valid data back and forth. For that, we had to build a completely new type of software, since our research proved that conventional software is syntactically and semantically non-interoperable by design. But this new type of software is not a newborn; it sits on the shoulders of giants that have been developing healthcare interoperability standards for more than 20 years. By adopting a pure open source mindset, we took the best from HL7, openEHR and ISO 13606 to build the MLHIM specifications, combined with XML technologies, which are de facto industry standards with a huge toolkit for software development and validation. The Semantic Web link was made through an innovative way of combining RDF and OWL in XML Schema 1.1 code. Our source code is 100% validated and it can produce databases and biomedical applications now. See our code on https://github.com/mlhim/SemanticMedWeb and join our G+ community on https://plus.google.com/b/114637547900683999300/communities/100734838664245160263

      • Tim Cook

        What Luciana left out of her reply is that the Lab is currently implementing the contents of the CDE collection as MLHIM Concept Constraint Definitions (CCDs) in order to make them computable. The current CDE collection represents an enormous number of well-defined concepts. The downside is that they are basically trapped as documentation: computers cannot traverse this and make direct inferences without a computable context model that can be shared between applications.

        We have a weekly Q&A starting 25 June 2013, where you can learn more about MLHIM and how it represents a harmonization of HL7 and openEHR (and ISO/CEN 13606).

        https://plus.google.com/events/c8i6u3huq6nvhg583dv5sqprbtg