Data Standards in IDEA-FAST

By Chengliang Dai

One of the main goals of the IDEA-FAST project is to identify novel digital measures for fatigue and sleep disturbances. To achieve this goal, a large amount of data, including clinical data (e.g., demographic data of the recruited patients) and device data (e.g., data collected by sensors and wearable devices for monitoring fatigue, sleep and selected activities of daily living), is needed for analysis. Given the scale of the data involved in IDEA-FAST, adoption of data standards brings a few obvious benefits.

First, data standards create efficiency by introducing consistent naming of individuals, events, etc., in the IDEA-FAST dataset. A standardised dataset therefore enables better data exchange among the systems used by different work packages (WPs) and makes the data easier to reuse, maximising its impact and exploitation.

Secondly, standardising the dataset underpins higher data quality. Errors are more likely to be caught during data acquisition and entry, which reduces the time spent cleaning and transforming the data before analysis.

Last but not least, adopting industry-recognised data standards raises awareness among industrial partners interested in developing solutions to the problems encountered in the IDEA-FAST project.

CDISC Standards for IDEA-FAST

Data standards are normally developed by either government agencies or a group of people and organisations. In IDEA-FAST, we adopted CDISC standards, which are a set of well-defined standards popular in both industry and academia for standardising clinical research data.

CDISC stands for Clinical Data Interchange Standards Consortium. It is an open, multidisciplinary, non‐profit standards organisation formed in 1997. CDISC standards are supported by over 150 member companies, including pharmaceutical companies, biotech companies, contract research organisations/service providers, and technology providers. Moreover, the US Food and Drug Administration (FDA) has adopted CDISC standards as the mandatory standards for data submissions since 2016.

CDISC has developed different models and standards for the different stages of the research process. WP5 has mainly focused on the standard for the stage at which the research data in IDEA-FAST are organised.

The CDISC standard for data organisation is called the Study Data Tabulation Model (SDTM), which provides a standard for organising and formatting data to streamline processes in collection, management, analysis and reporting [1]. As the team responsible for developing the data management strategy, the data standards and the Data Management Platform (DMP), WP5 is developing SDTM-based standards for all the IDEA-FAST datasets stored on the DMP.

Other CDISC standards and models include the Protocol Representation Model (PRM), which defines a standard for planning and designing a research protocol [2]; Clinical Data Acquisition Standards Harmonization (CDASH), which is the standard used at the data collection stage [3]; and the Analysis Data Model (ADaM), which defines the dataset and metadata standards that support efficient generation, replication, and review of clinical trial statistical analyses, as well as traceability among analysis results, analysis data, and data represented in the SDTM [4].

Another important part of the CDISC standards is Define-XML, which describes the metadata for all submitted data structures (SDTM and ADaM).

CDISC standards and models for different stages of the research

Domains in IDEA-FAST Dataset

In SDTM, data and observations are grouped into a series of standardized domains. Each domain is a collection of logically related observations and contains standard structures, variable names, variable attributes and controlled terminology.
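To illustrate how controlled terminology can help catch entry errors early, here is a minimal sketch in Python. The allowed values are only an illustrative subset rather than the official CDISC codelists, and the function name, subject identifiers and record values are hypothetical.

```python
# Illustrative subset of allowed terms for two Demographics variables.
# This is NOT the full official CDISC controlled terminology.
CONTROLLED_TERMS = {
    "SEX": {"M", "F", "U"},
    "AGEU": {"YEARS"},
}


def find_terminology_errors(record):
    """Return (variable, value) pairs whose value is not an allowed term."""
    errors = []
    for variable, allowed in CONTROLLED_TERMS.items():
        if variable in record and record[variable] not in allowed:
            errors.append((variable, record[variable]))
    return errors


# Hypothetical records: the second one uses free text instead of a coded term.
good = {"USUBJID": "IF-001", "SEX": "F", "AGE": 54, "AGEU": "YEARS"}
bad = {"USUBJID": "IF-002", "SEX": "female", "AGE": 61, "AGEU": "YEARS"}

print(find_terminology_errors(good))
print(find_terminology_errors(bad))
```

A check of this kind can run at data entry, so that a free-text value such as "female" is flagged before it reaches the analysis stage.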

Most of the data collected in IDEA-FAST belong to the following domains:

  • Demographics (DM)
    • Age, gender, etc.
  • Medical History (MH)
  • Vital Signs (VS)
    • Blood pressure, height, weight, etc.
  • Questionnaires (QS)
    • EQ-5D, MFI, etc.
  • Functional Tests (FT)
    • Actigraphy data, etc.
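
The grouping in the list above can be sketched as a simple lookup. The domain codes come from the list, while the raw item names (such as "blood_pressure") and the function name are hypothetical and only serve to illustrate the idea.

```python
from collections import defaultdict

# Domain codes and names as listed above.
DOMAINS = {
    "DM": "Demographics",
    "MH": "Medical History",
    "VS": "Vital Signs",
    "QS": "Questionnaires",
    "FT": "Functional Tests",
}

# Hypothetical raw item names mapped to the domain they would belong to.
ITEM_TO_DOMAIN = {
    "age": "DM",
    "gender": "DM",
    "blood_pressure": "VS",
    "height": "VS",
    "weight": "VS",
    "eq5d": "QS",
    "mfi": "QS",
    "actigraphy": "FT",
}


def group_by_domain(items):
    """Group raw item names under their domain code."""
    grouped = defaultdict(list)
    for item in items:
        grouped[ITEM_TO_DOMAIN[item]].append(item)
    return dict(grouped)


for code, names in group_by_domain(["age", "weight", "mfi", "height"]).items():
    print(code, DOMAINS[code], names)
```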

Examples of Standardised IDEA-FAST Clinical Data

Demographics data of subjects A, B and C
Standardised EQ-5D questionnaire data of subject A
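
As a rough sketch of what such standardised records might look like in code, the example below builds one Demographics row and a set of questionnaire rows. The variable names (STUDYID, DOMAIN, USUBJID, SUBJID, AGE, AGEU, SEX, QSSEQ, QSCAT, QSTESTCD, QSORRES) are standard SDTM variables, but the study identifier, the test codes "ITEM01" and "ITEM02", and all values are invented placeholders, not actual IDEA-FAST data or official CDISC codes.

```python
STUDY_ID = "IDEAFAST"  # hypothetical study identifier


def to_dm_row(raw):
    """Map a raw demographic record to a Demographics (DM) style row."""
    subject = raw["subject"]
    return {
        "STUDYID": STUDY_ID,
        "DOMAIN": "DM",
        "USUBJID": f"{STUDY_ID}-{subject}",  # unique subject identifier
        "SUBJID": subject,
        "AGE": raw["age"],
        "AGEU": "YEARS",
        "SEX": raw["sex"],
    }


def to_qs_rows(subject, answers):
    """Map questionnaire answers to Questionnaires (QS) rows, one per item."""
    rows = []
    for seq, (test_code, response) in enumerate(answers.items(), start=1):
        rows.append({
            "STUDYID": STUDY_ID,
            "DOMAIN": "QS",
            "USUBJID": f"{STUDY_ID}-{subject}",
            "QSSEQ": seq,              # sequence number within the subject
            "QSCAT": "EQ-5D",          # questionnaire category
            "QSTESTCD": test_code,     # placeholder item code
            "QSORRES": str(response),  # original result, stored as text
        })
    return rows


# Placeholder input values for subject A.
dm_row = to_dm_row({"subject": "A", "age": 50, "sex": "F"})
qs_rows = to_qs_rows("A", {"ITEM01": 2, "ITEM02": 1})
print(dm_row)
for row in qs_rows:
    print(row)
```

Storing one row per questionnaire item, rather than one wide row per subject, is what allows the same structure to hold EQ-5D, MFI and other questionnaires side by side.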

Conclusion

In IDEA-FAST, data come from highly diverse sources, so data standards play an important role in improving data quality and reducing the time spent on data cleaning and transformation. A standardised dataset also facilitates data exchange across different studies, potentially increasing the impact of the IDEA-FAST project. WP5 will continue developing and maintaining data standards for the IDEA-FAST data stored on the DMP.

References

[1] https://www.cdisc.org/standards/foundational/sdtm, accessed 30 April, 2021
[2] https://www.cdisc.org/standards/foundational/protocol, accessed 30 April, 2021
[3] https://www.cdisc.org/standards/foundational/cdash, accessed 30 April, 2021
[4] https://www.cdisc.org/standards/foundational/adam, accessed 30 April, 2021


About the author

Chengliang Dai is a research associate in the Data Science Institute at Imperial College London. He works in the data management team of the IDEA-FAST project, where his main duty is to develop data standards for the research data collected in the project. He previously worked on a project to develop a translational platform for the analysis of brain imaging datasets held in the UK Biobank. He received his PhD in computer science from the University of York in 2016 and joined the IDEA-FAST project in 2021.

His research interests include biomedical data processing, time series analysis and deep learning in medical image analysis.