Remember when farmers were producing too many crops?
Then everyone stood by and watched a lot of crops rot in barns. There was no way to organize, ship, and properly store the unused crops. It was simply too difficult to get the crops where they were most urgently needed. The same thing is happening again.
This time with data!
The more money is invested in research, the more data is produced. And that is exactly what is happening: more data, a lot more data. Data may not fill empty bellies, but it holds the potential to produce answers to a lot of woes.
Unlike crops, data does not rot. In fact, data as old as 64 years can be useful. There are plenty of ‘data storage barns’, and, more importantly, there are well-established means of analyzing and integrating datasets, effectively multiplying their value. However, you first have to separate the useful data from the chaff. Is there a machine for that? Something like those wheat harvesters you see in iconic farming photos?
Unfortunately no, but there are methods: methods for cleaning data, methods for describing data, and methods for loading data. Methods that require the effort of bioinformaticians, which is all well and good, but there is another problem.
Datasets must be universally recognizable before they can be universally exploited.
Unfortunately, a lot of data is filed away without much consideration given to format or future use. This data could be reused. One of the greatest weaknesses of any translational research study is a lack of sufficient data to draw definitive conclusions. Old data that cannot be compared to new data is a lost opportunity.
Scientists now have too much choice when it comes to data formats. In fact, it is quite common for researchers to invent a new format for each new technique, and sometimes for each experiment. This makes the work of integrating large datasets significantly more difficult, leaving lots of bioinformaticians toiling away in dark cubicles. This is at the very least inefficient, if not an outright insurmountable barrier. It could be better.
It would be better if there were a set format for each data type, all stored under a uniform set of standards. Coming up with standards is not an easy task. A ‘standard’ requires consensus, and consensus is often hard, if not impossible, to reach. Is it even worth the effort to try?
Gina Kolata writes in a recent New York Times article:
“the fear is that this avalanche of genetic and clinical data about people and how they respond to treatments will be hopelessly fragmented and impede the advance of medical science.”
She goes on to point out that there are currently no standards for genetic and clinical data. Quoting Brad Margus, founder of the A-T Children’s Project, she also points out that having big datasets will enable researchers to be proactive.
This is exactly the value of robust translational research knowledge management: it allows you to perform data-driven exploratory analyses. Even just harmonizing across a few studies can add significant value. So, while aiming to create a set of global standards is a gallant effort, even having a single set of studies under one standard is immensely valuable.
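To make the idea of harmonization concrete, here is a minimal sketch of pooling two studies that record the same measurement under different field names. All names here (studies, fields, the mapping itself) are hypothetical illustrations, not drawn from CDISC or any real standard.

```python
# Two hypothetical studies storing the same clinical observation differently.
study_a = [{"subj_id": "A001", "sys_bp_mmhg": 120},
           {"subj_id": "A002", "sys_bp_mmhg": 135}]
study_b = [{"patient": "B010", "systolicBP": 118}]

# A per-study mapping onto one shared schema plays the role of the "standard".
FIELD_MAPS = {
    "study_a": {"subj_id": "subject_id", "sys_bp_mmhg": "systolic_bp"},
    "study_b": {"patient": "subject_id", "systolicBP": "systolic_bp"},
}

def harmonize(records, field_map, study):
    """Rename each record's fields to the shared schema and tag its origin."""
    out = []
    for rec in records:
        row = {field_map[key]: value for key, value in rec.items()}
        row["study"] = study
        out.append(row)
    return out

combined = (harmonize(study_a, FIELD_MAPS["study_a"], "study_a")
            + harmonize(study_b, FIELD_MAPS["study_b"], "study_b"))
# combined now holds all three records under one schema, ready for pooled analysis.
```

The point of the sketch is that the expensive part is agreeing on the shared schema and the mappings; once those exist, pooling studies is mechanical.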
This is indeed what eTRIKS is aiming to do for European translational research projects. One of the eTRIKS partners, CDISC, is already achieving some success in this regard for clinical data. CDISC is an international non-profit organization working to form consensus around clinical data standards, and the list of those adopting CDISC standards is growing rapidly. However, this only partially fulfills the need.
Translational research includes a whole host of nonclinical data, from patients’ genomic data to laboratory assays, all collected in the effort to advance medical care through research. Nonclinical data is where a multitude of standards exists: each new technique creates a new data format, and each machine collects and stores results in a slightly different way.
Enter people like Susanna Sansone.
Susanna is Associate Director and Principal Investigator at the University of Oxford e-Research Centre, and Honorary Academic Editor of Scientific Data, the online, open-access data publication platform under development by Nature Publishing Group. The NPG Scientific Data team has developed a roadmap aimed at steering researchers towards nonclinical data unification. In the roadmap, NPG describes four steps you can take to become part of the scientific data community:
“1. Get in contact
We encourage community representatives to get in touch. Please explain your motivations and who you represent. This helps us to gauge interest within a community, build a list of expert contacts, and plan ahead for the next phases.
2. Register your community standards and databases
When relevant, we strongly encourage those developing or maintaining open, community standards, and/or implementing them at community repositories, to publish and register their initiatives at BioSharing. This helps us to monitor the development and uptake of standards. First check if your standards and/or database are listed, or register to submit or claim one or more records to update them, if needed.
3. Help us create improved templates for your community
We will invite designated community representatives to enrich our existing generic ISA configurations and Word and Excel templates to help maximize compliance to community-developed minimum information requirements. These templates will be vetted through a community feedback process and ultimately released on the Scientific Data website to help authors meet community standards.
4. Get integrated!
We will work with existing data repositories, service providers and data producers to implement direct pipelines, using the ISA framework, to minimize authors’ work and streamline information flow. This could be used to help build direct submission pipelines from other data management systems or repositories – so that authors will only have to write complex experimental metadata descriptions once.”
You know a topic is important when Nature decides to focus an entire journal on it. You might be saying to yourself “A journal about data? What’s the value?”
The value is that having a place to publish descriptions of datasets is an important communication tool in the drive towards a uniform set of standards. The data scientist is elevated to a deserved level of prominence in the field of translational research.
In short, there is a growing effort to make sure translational research data does not figuratively ‘rot’. The eTRIKS consortium looks to be very much a part of that effort. With CDISC on board, the clinical standards are well underway. The next big challenge is the nonclinical standards. eTRIKS is convening a panel of experts to come up with an initial set of recommendations and is planning an ongoing effort to drive this forward. This will, however, be only a small step. We need a community-wide, persistent effort. We hope you will join us.