Like gold, the form that data takes directly affects the ease with which you can extract, export and shape it into something more valuable. The more data you have, the greater the potential value, and everywhere the eye can see data prospectors are feverishly extracting volumes of data from patients. We are living through a data rush, and when you stand still and watch it, an incontrovertible truth emerges: transforming gold nuggets into bars is easy; semantically integrating different forms of health data to create a true wealth of knowledge is not.
This is translational research
There are many clinical research studies taking place globally, and each produces large quantities of data. To get the best out of each dataset we need to align them and find patterns that lead to improved treatments for patients. The more data you align, the greater the chance a pattern will be found. However, before you can harmonise data you must first ensure a certain level of data quality through curation and standardisation, so that it can be meaningfully integrated with all other available data. Aligned data enables high-quality data storage, access, extraction and analysis. It is this type of data management that provides the fertile soil for the development of new treatments.
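To make the alignment idea concrete, here is a minimal sketch (not an eTRIKS tool) of harmonising two hypothetical study extracts that record the same measurement under different field names and units. The study names, field names and values are invented for the example; only once records share a schema can they be pooled and queried uniformly.

```python
# Two hypothetical studies record blood glucose under different names and units.
study_a = [{"patient": "A-001", "glucose_mg_dl": 90.0}]
study_b = [{"subj_id": "B-117", "glucose_mmol_l": 5.2}]

MGDL_PER_MMOLL = 18.016  # approximate conversion factor for glucose

def harmonise(records, id_key, value_key, to_mg_dl=1.0):
    """Map heterogeneous records onto a shared (subject_id, glucose_mg_dl) schema."""
    return [
        {"subject_id": r[id_key], "glucose_mg_dl": round(r[value_key] * to_mg_dl, 1)}
        for r in records
    ]

combined = (
    harmonise(study_a, "patient", "glucose_mg_dl")
    + harmonise(study_b, "subj_id", "glucose_mmol_l", to_mg_dl=MGDL_PER_MMOLL)
)
# Once aligned, the pooled records can be analysed together.
print(combined[1]["glucose_mg_dl"])  # 5.2 mmol/L ≈ 93.7 mg/dL
```

Real harmonisation involves controlled terminologies and far richer metadata, but the principle is the same: agree the target schema first, then map every source into it.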
Standardising data is challenging; standardising all health data is socially, technically, economically and politically very challenging. However, non-standardised data runs the very real risk of becoming old and forgotten. It will get lost.
Who is best placed to address the data standards issue?
The European Medicines Agency and the US Food and Drug Administration work closely with organisations that establish data standards, for example the International Organization for Standardization (ISO) and the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). These groups provide a data standards regulatory framework through consultation with key pharmaceutical companies. The Joint Initiative Council (JIC) was set up in 2007 to interface six of the main SDOs in the industry, to determine which projects fall within the remit of each SDO and how standards can be integrated ‘end to end’ to provide seamless integration from data capture to analysis. To a lesser extent, publishers also play a role in data standardisation by setting stringent requirements that must be met prior to publication in prestigious journals.
Clinical and Omics data
For clinical data types (surgery, laboratory and hospital analysis), Standards Development Organisations (SDOs) such as CDISC and HL7 work with the regulatory agencies and have, to date, provided a wealth of standards. Standardisation of the omics fields (proteomics, lipidomics, transcriptomics, metabolomics, etc.) is also rapidly evolving, but standardisation efforts here are largely driven by communities of interest rather than formal SDOs. Individual initiatives are often funded as part of research grants, which have finite lifetimes. When these projects conclude, ongoing support for the standards they developed is difficult to obtain. Continued funding to maintain and evolve these standards is hard to secure once the novelty fades, and as a consequence initially valuable standards can quickly become neglected and outdated.
How much data has been standardized?
Perhaps 10% of the world’s health data is standardised, and even this figure could be optimistic. Data standards are already available and recommended, for example the multiple suites of clinical trial standards recommended by CDISC. Adoption of these standards by the pharmaceutical industry is progressing rapidly, and the pace of adoption by academic institutions is picking up.
What makes the uptake and implementation of data standards so slow?
Landscape Blockers – The complexity of the environment is a key roadblock: there are tens of thousands of groups producing health data for many different purposes. Ontologies (structured descriptions of knowledge domains) typically have a defined purpose and scope, and work well within that scope. However, there is frequently a need to adapt, modify or extend existing ontologies and terminologies for new applications. In the absence of rigorous yet efficient processes for adjusting existing terminologies, the creation of new competing or overlapping terminologies frequently becomes the short-term solution, which ultimately compounds the difficulties in the drive for broader improvement of data quality and standardisation.
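The overlapping-terminology problem can be sketched in a few lines. In this illustrative example (the terms and codes are invented, not from any real terminology), several local synonyms resolve to one shared concept code, and anything outside the agreed vocabulary is flagged rather than silently given a new, competing code.

```python
# Invented local-term -> standard-code mapping for illustration only.
STANDARD_CODES = {
    "myocardial infarction": "MI-001",
    "heart attack": "MI-001",   # synonym mapped to the same shared concept
    "hypertension": "HTN-001",
}

def to_standard(term):
    """Return the standard code for a local term, or flag it for curation."""
    code = STANDARD_CODES.get(term.strip().lower())
    return code if code else "UNMAPPED"  # unmapped terms need curator review

print(to_standard("Heart Attack"))           # MI-001
print(to_standard("raised blood pressure"))  # UNMAPPED
```

The design point is the `UNMAPPED` path: extending the shared mapping through a review process, rather than minting a new local code on the spot, is what keeps terminologies from fragmenting.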
Technical Blockers – The standards themselves need to be defined, agreed, adopted and made machine-readable. There are no specific blockers here, as the technology is available. Technology today is largely modular, and interestingly it is the sheer breadth of available technology that has become the blocker; the specific issue is one of module compatibility. To apply standards efficiently you need good-quality expertise to curate the data, and of course time. A highly modular environment requires vast curation effort to link everything together. This is counterproductive, and we need to slim down the use of modules.
Automated curation is being worked on, but it will never be 100% accurate: language is too complex, and new data types are constantly being produced, so there will always be a need for a human hand. Currently a great deal of work is being done on vast volumes of legacy data. This requires a lot of focused resource, again preventing timely progress in the field.
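One common pattern for this kind of semi-automated curation is to auto-map only confident matches and route everything else to a human queue. The sketch below, with an invented drug vocabulary and an arbitrary similarity threshold, uses simple string matching from Python's standard library to illustrate the split; real pipelines use far more sophisticated matching, but the human fallback remains.

```python
import difflib

# Invented controlled vocabulary for illustration only.
VOCAB = ["aspirin", "paracetamol", "ibuprofen"]

def curate(raw_terms, threshold=0.8):
    """Auto-map close matches; queue anything ambiguous for a curator."""
    auto, needs_human = {}, []
    for term in raw_terms:
        match = difflib.get_close_matches(term.lower(), VOCAB, n=1, cutoff=threshold)
        if match:
            auto[term] = match[0]          # confident enough to map automatically
        else:
            needs_human.append(term)       # too ambiguous: human review required
    return auto, needs_human

auto, queue = curate(["Asprin", "ibuprofin", "blue tablet, round"])
print(auto)   # typos resolved automatically
print(queue)  # free text left for a curator
```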
Mindset Blockers – I think the penny has dropped, but commitment to the uptake of standards is lagging because many organisations and institutions fear the costs without fully appreciating the longer-term time savings and efficiency gains. It is a complex business with a steep learning curve, but with many expensive legacy data standardisation projects being carried out, the value of historical data is clearly not being ignored; recognition of the advantages and benefits of standards has long since begun, with many governments and foundations leading the way.
The most expensive misapprehension is that standards exist only to retrofit existing data. In fact, standards are designed to apply right from clinical trial protocol design and the point of data gathering through to data analysis. The cost savings from applying standards all the way through the lifecycle of the data far outweigh the costs, and you are also creating data that can go on to be compared with similar studies, giving greater gains from these valuable medical studies; studies where patients have often given their blood, some sweat, but hopefully not any tears for the greater good.
How does eTRIKS contribute to the data standards landscape?
Work Package 3 of the eTRIKS project (standards research and coordination) is led by Michael Braxenthaler of Roche, Paul Houston of CDISC and Philippe Rocca-Serra of the University of Oxford. Their team works with global standards organisations on new standards development, and they regularly engage a world-class advisory board to ensure a cutting-edge view of the data standards landscape. The team establishes and maintains interaction with IMI and non-IMI translational research projects concerning their data standard and data interchange requirements. Through CDISC, among others, they promote gold standards for translational research knowledge sharing, and these standards are being applied by eTRIKS client projects.
We also provide recommendations for omics standards. The current efforts are not about enforcement; rather, they offer recommendations as a guideline for standards decisions. With the recent inclusion of the University of Oxford’s e-Research Centre as an eTRIKS partner, bringing highly valuable expertise in biomedical standards, we are now in a position not only to select standards for recommendation but also to begin providing critically important tools to support the application of terminology standards in the experiment and trial design phase, as well as in the data acquisition and curation phases.
Finally, we are defining a metadata registry and repository approach to enable consistent application of standards. In close collaboration with other eTRIKS work packages, namely the curation work package (WP4) and the tranSMART platform development work package (WP2), we are focusing on ensuring the highest possible quality of translational data as a basis for developing new treatments and options to improve patients’ lives.
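The core idea behind a metadata registry can be sketched very simply: data element definitions are registered once, and incoming records are then validated against those shared definitions instead of against ad hoc local rules. The element names, types and units below are invented for the example and are not taken from the eTRIKS registry.

```python
# Minimal illustrative registry: element name -> expected type and units.
registry = {}

def register(name, dtype, units=None):
    """Record a data element definition once, for reuse by every dataset."""
    registry[name] = {"dtype": dtype, "units": units}

def validate(record):
    """Check that a record only uses registered elements with the right types."""
    errors = []
    for key, value in record.items():
        element = registry.get(key)
        if element is None:
            errors.append(f"unregistered element: {key}")
        elif not isinstance(value, element["dtype"]):
            errors.append(f"wrong type for {key}")
    return errors

register("age_years", int, units="years")
register("weight_kg", float, units="kg")

print(validate({"age_years": 54, "weight_kg": 71.5}))  # []
print(validate({"age_years": "54"}))                   # ['wrong type for age_years']
```

A production registry adds versioning, permissible-value lists and governance, but the payoff is the same: every contributing study validates against one shared set of definitions, which is what makes later integration consistent.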