Requirements specification: data analytics software package

This document specifies the preliminary requirements for the data analysis software package in the IDEA-FAST project. It was agreed that the package will run in the analytical environment, which is part of the Data Management Platform (DMP) developed in the project. The document is therefore written for: 1) the developers of the analytical environment, so that they understand the specific needs of the analysis process; and 2) researchers and analysts, so that they become familiar with the data characteristics and the analysis methods developed in the project so far.

The DMP has been developed to store and manage the data collected throughout the project. The analytical environment (AE) will allow users to analyse the project datasets without downloading and keeping a copy of the dataset on their local computers, which reduces the risk of data breaches. The AE provides a web-based UI for accessing remote computing resources. Users access the AE via a web browser and log in securely with their email address and password. Scientific applications such as Jupyter, MATLAB, TensorBoard and RStudio will be available via the AE; the primary programming language is Python. Access to the GitHub repositories that host the analysis code is also supported.

The main aim of the analysis is to predict fatigue and sleep disturbances. The analysis methods presented in this document are based on the analysis pipeline described in detail in deliverables D4.1 and D4.3. They focus on device-specific data processing methods, divided into four Concepts of Interest (COI): Activity, Physiology, Sleep and Social/Cognitive. Devices in the Activity, Sleep and Physiology COIs share similar characteristics: they collect data continuously at sampling rates of 25-250 Hz. The features of interest (step count, movement magnitude, heart rate and heart rate variability) are calculated using basic signal processing methods available in Python libraries such as SciPy and NumPy and in the MATLAB Signal Processing Toolbox. For the Social/Cognitive COI, the data consists of responses to cognitive tests conducted twice a day on a tablet, or of mobile phone usage logs. The amount of data in these cases is rather low, and it is usually compressed to daily aggregates for further analysis.
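As an illustration of the kind of basic signal processing referred to above, the following is a minimal sketch of deriving movement magnitude and a step count from tri-axial accelerometer data with NumPy and SciPy. It is not the project pipeline from D4.1 or D4.3. The function names, the 0.5-3 Hz pass band, the peak threshold and the synthetic 50 Hz signal are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks


def movement_magnitude(acc_xyz):
    """Euclidean norm of each tri-axial sample (same units as the input)."""
    return np.linalg.norm(acc_xyz, axis=1)


def count_steps(acc_xyz, fs, band=(0.5, 3.0), min_height=0.05):
    """Count steps as peaks in the band-pass filtered magnitude signal.

    band and min_height are illustrative assumptions, not project values.
    """
    mag = movement_magnitude(acc_xyz)
    # Zero-phase band-pass filter removes gravity (DC) and high-frequency noise.
    b, a = butter(2, band, btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, mag)
    # Peaks must be at least one period of the fastest allowed cadence apart.
    min_distance = max(int(fs / band[1]), 1)
    peaks, _ = find_peaks(filtered, height=min_height, distance=min_distance)
    return len(peaks)


# Synthetic test signal (assumption): 20 s at 50 Hz, 2 steps per second,
# modelled as a sinusoid on top of 1 g of gravity on the vertical axis.
fs = 50
t = np.arange(0, 20, 1 / fs)
vertical = 1.0 + 0.3 * np.sin(2 * np.pi * 2.0 * t)
acc = np.column_stack([np.zeros_like(t), np.zeros_like(t), vertical])

n_steps = count_steps(acc, fs)
print("Estimated steps:", n_steps)  # 20 s at 2 steps/s, so a count near 40
```

Real device data would need device-specific calibration and tuned thresholds; the sketch only shows that the listed libraries cover the filtering and peak-detection steps.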

The actual prediction of fatigue and sleep disturbances is based on association and multivariate analysis. Features derived from the device data are aggregated into time windows and then compared with each other and with subjective fatigue- and sleep-related ratings (patient-reported outcomes, PROs). The general methods to be used are data normalisation, repeated measures correlation and regressor investigations. These methods can be implemented using, for example, the following Python libraries: pandas, numpy, scipy, pingouin, statsmodels and sklearn. Finally, the analysis results are typically reported as tables or graphs, produced with e.g. the matplotlib library.
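The aggregation, normalisation and association steps described above could be sketched as follows with pandas, numpy and scipy. This is a simplified illustration under stated assumptions: the data are synthetic, the column names `steps` and `fatigue` are hypothetical, and a plain Pearson correlation for a single participant stands in for the repeated measures correlation, which would be the closer match when several participants are analysed together (for example with the `rm_corr` function in pingouin).

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical minute-level step counts for one participant over 14 days.
index = pd.date_range("2024-01-01", periods=14 * 24 * 60, freq="min")
steps = pd.Series(rng.poisson(5, size=len(index)), index=index, name="steps")

# Aggregate the device feature into daily time windows.
daily = steps.resample("D").sum().to_frame()

# Hypothetical daily fatigue rating, constructed to fall as activity rises.
activity_z = (daily["steps"] - daily["steps"].mean()) / daily["steps"].std(ddof=0)
daily["fatigue"] = 5.0 - activity_z + rng.normal(0.0, 0.5, size=len(daily))

# Normalise both columns to zero mean and unit variance (z-scores).
normalised = (daily - daily.mean()) / daily.std(ddof=0)

# Association between the windowed device feature and the subjective rating.
r, p_value = pearsonr(normalised["steps"], normalised["fatigue"])
print(f"r = {r:.2f}, p = {p_value:.3f}")
```

The resulting table (`daily` or `normalised`) can then be reported directly or plotted, for instance with `normalised.plot()` when matplotlib is installed.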