DEBATE-statistical analysis plans for observational studies

Background All clinical research benefits from transparency and validity. Transparency and validity of studies may increase by prospective registration of protocols and by publication of statistical analysis plans (SAPs) before data have been accessed to discern data-driven analyses from pre-planned analyses. Main message Like clinical trials, recommendations for SAPs for observational studies increase the transparency and validity of findings. We appraised the applicability of recently developed guidelines for the content of SAPs for clinical trials to SAPs for observational studies. Of the 32 items recommended for a SAP for a clinical trial, 30 items (94%) were identically applicable to a SAP for our observational study. Power estimations and adjustments for multiplicity are equally important in observational studies and clinical trials as both types of studies usually address multiple hypotheses. Only two clinical trial items (6%) regarding issues of randomisation and definition of adherence to the intervention did not seem applicable to observational studies. We suggest to include one new item specifically applicable to observational studies to be addressed in a SAP, describing how adjustment for possible confounders will be handled in the analyses. Conclusion With only few amendments, the guidelines for SAP of a clinical trial can be applied to a SAP for an observational study. We suggest SAPs should be equally required for observational studies and clinical trials to increase their transparency and validity.


Signatures
We the undersigned, certify that we read this SAP and approve it as adequate in scope of the mainanalyses of the SICS-I.

Background and rationale
About one-third of all critically ill patients suffer from circulatory shock, which places them at increased risks of multi-organ failure, long-term morbidity, and mortality (1,2). Combinations of clinical, haemodynamic and biochemical variables are recommended for establishing the diagnosis and instigation of treatment (3,4). If necessary, more advanced and sequential haemodynamic assessments using critical care ultrasound (CCUS) as preferred modality are recommended (3)(4)(5)(6).
Clinical examination in the critically ill comprises frequent measurement of heart rate, blood pressure, body temperature, skin perfusion, urine output and mental status (3). Daily use of clinical examination (in any patient) for diagnostic purposes contrasts with the limited number and quality of studies, so that the level of evidence for use of clinical examination in the critically ill is considered best practice (3). Previous studies have suggested different prognostic or diagnostic variables and many studies have analysed single or dual variable associations, while no research has evaluated their additional value on top of the accepted predictors (7). The reason for inconsistency of results in these studies potentially originate from several methodological flaws, including improper research design, lack of confirmation cohorts, and power and sample size issues.
The additive diagnostic and prognostic value of combinations of clinical, biochemical and haemodynamic variables remains to be established with a higher quality of evidence. These variables have never been evaluated collectively in a large, broad, prospective cohort of critically ill patients.
Therefore, we established the Simple Intensive Care Studies I (SICS-I) with the aim to evaluate the diagnostic and prognostic value of a comprehensive selection of clinical and haemodynamic variables in the critically ill (7).
Prospective registration of protocols of observational studies are promoted to prevent outcome reporting bias (8,9). Likewise, prospective publication of a detailed statistical analysis plan (SAP) is encouraged to prevent data-driven analyses (9-11).

Objectives and research questions
The objective of the SICS-I study was to establish a cohort with a dual aim: to evaluate the (1) diagnostic and (2) prognostic value of a comprehensive selection of clinical examination, haemodynamic and biochemical variables in the critically ill. More specific, the two research questions of the basic study were (1): which combination of clinical examination findings is associated with cardiac index measured with CCUS? And (2): which combination of clinical examination, haemodynamic and biochemical variables is associated with 90-day mortality?
In the basic study of the SICS we collected a broad number of clinical examination, haemodynamic and biochemical variables, and used CCUS to only measure cardiac output. The infrastructure and design enabled (temporarily) addition of sub-studies in which additional variables were collected. Research questions of the sub-studies all address the overall aim of the SICS-I cohort (table 1).

Hypotheses
The hypothesis of research question 1, i.e. the diagnostic study, are:

Scope
This SAP will be the guiding document for the analyses that will be conducted in the basic study. We intend to present the results of the two primary aims in separate manuscripts. All the aims and research questions of the sub-studies will be included in the appendix of this SAP, and we aim to present the SAPs of the sub-studies as an addendum in the future.

General study design and plan
The SICS-I is a prospective cohort study which is conducted in the department of critical care of the University Medical Center Groningen (UMCG). The entire study was purely observational in design; no interventions were applied as part of the study protocol.
The protocol of this study was published on the website of the department of critical care of our hospital before the start of the study (Project number: 201500144) and registered at clinicaltrials.gov (NCT02912624). This analysis plan has been written while data collection was ongoing, but before full access to the study database. For our design paper, we only extracted baseline data and we did not have access to the validated outcome data.

Sample size, power and detectable difference
There are no previous studies which included combinations of clinical examination, haemodynamic and biochemical variables into one model for estimation of cardiac output and mortality. This makes it difficult to calculate sample size based on previous literature. Alternatively, we made an estimation of the power of our multivariable models given the set sample size of our cohort.

Diagnostic study
We are planning our main analysis with 1075 patients in which we will regress their values of cardiac index against clinical examination findings. For this power calculation, we used the clinical examination variables urine output and capillary refill time an example (7). In our design paper, the SD of cardiac index was 0.99 and the SD of urine output was 0.98, with a slope estimate of 0.040 obtained when cardiac index was regressed against urine output. The SD of capillary refill time was 2.4, with a slope estimate of -0.045 obtained when cardiac index was regressed against capillary refill time. If the true slope of the line obtained by regressing cardiac index against urine output or capillary refill time is 0.10, we will be able to reject the null hypothesis that this slope equals zero with probability (power) 0.81 for urine output, and 1.00 for capillary refill time. The type I error probability associated with this test of this null hypothesis is 0.015 (see paragraph 6.6 below).

Prognostic study
We are planning our main analysis with 1075 patients and take skin mottling as example for our power calculation: in our design paper, 46% of the patients has skin mottling. If we assume a similar proportion in our total cohort, we have 495 patients with skin mottling and 580 without. Pilot data from our design paper indicate that the 30-day mortality proportion among controls is 0.18 (7). If the true mortality proportion for patients with skin mottling is 0.27, we will be able to reject the null 7 hypothesis that the mortality proportion for patients with skin mottling and control patients are equal with probability (power) 0.84. The type I error probability associated with this test of this null hypothesis is 0.015. We will use an uncorrected chi-squared statistic to evaluate this null hypothesis.

Timing of final analysis
Data cleansing and CCUS image validation will be performed upon completion of the 90-day follow-up of the last patient included in the study. The final analysis will be conducted hereafter.
This statistical analysis plan was added to the study protocol at clinicaltrials.gov, before closure of the database and before any analyses had been conducted. Independent study monitoring was conducted in adherence to the Good Clinical Practice guidelines (10).

Timing of outcome assessments
Follow-up on all-cause mortality was conducted at 1 November 2017, i.e. 90-days after the inclusion of the last patient.

Multiplicity
The diagnostic and prognostic basic study and each sub-study consist of one primary outcome and one or more secondary, exploratory outcomes. We will encounter multiplicity issues due to the multiple primary outcomes that are tested for significance in the same cohort. The SICS-I cohort addresses six different primary outcomes (table 2, last column); two primary outcomes of the basic study (diagnostic and prognostic), and four additional outcomes in the nine sub-studies.
We will apply an adjustment for multiplicity based on the total numbers of different primary outcomes tested. Our cohort mainly addresses haemodynamic research questions, so that most outcomes will probably be positively correlated. Therefore, a Bonferroni adjustment of the P-value might be too conservative. We chose for multiplicity adjusted thresholds by following the pragmatic approach stated by Jakobsen et al. (12). The authors suggested that the 'true' threshold lies somewhere between the unadjusted threshold (most often 0.05) and the Bonferroni adjusted threshold. Where in this interval the 'true' threshold is placed is dependent on the correlation between the outcomes, if two outcomes are perfectly correlated (a correlation coefficient of 1) no adjustment of the threshold for statistical significance is needed, if two outcomes are totally independent (a correlation coefficient of 0) a full Bonferroni adjustment is needed. Therefore, Jakobsen et al suggest "dividing the pre-specified P-value threshold with the value halfway between 1 (no adjustment) and the number of primary outcome comparisons (full Bonferroni adjustment)" as such an adjustment will come closer to the 'true' statistical significance level than the 'extreme thresholds' in a majority of situations (12). The corresponding formula is: Wherein: • α is the unadjusted threshold for significance, usually 0.05 • m is the number of primary outcomes or tests (used) in the same cohort (in this case: six) The threshold for significance in the SICS-I cohort will be:

Statistical significance and confidence interval
As calculated in 4.1, we will consider a P < 0.015 as statistically significant for our primary outcomes.
When we find a 0.015 ≤ P ≤ 0.05, we will consider this association of dubious significance and will emphasise the increased chance of a type I error. Results will be presented with their values (e.g. regression coefficients, odds ratios, etc.) with 98.5% confidence intervals.

Definitions of protocol deviations
Protocol deviations are defined when the activities on a study diverge from the local institutional review board-approved protocol, however without significant consequences (13).

Protocol deviations to be summarised
We opted for the following subgroup analyses in our study protocol: different types of shock (distributive, obstructive, hypovolemic, cardiogenic), CVVH, heart failure by any cause, myocardial infarction, atrial fibrillation or surgery versus no-surgery patient groups.
We decided to replace these proposed subgroup analyses by the subgroups described in section 6.3.
We believe that these subgroups agree better with the perspective from which a physician approaches a critically ill patient.

Screening data
Eligible patients who were not included will be compared to included patients by comparing their general characteristics (age, sex), and SAPS-II and APACHE-IV scores.

Eligibility
All eligible patients were included on their first day of ICU admission. Inclusion in the basic study consisted of a protocolised clinical examination and subsequent CCUS. The attending ICU physician estimated the expected duration of ICU treatment. Patients expected to stay beyond 24 hours who were eventually discharged within 24 hours were included in our main analyses.

Inclusion criteria
• Emergency admission

Recruitment
A flow diagram will be used to visualise the flow of patients. In this flow diagram, we will report the population from which the eligible patients were selected, reasons for exclusion of eligible patients, and how many CCUS images and measurements were validated (diagnostic study) or how many patients died (prognostic study). See figure 1 for an example of our flow diagram.

Level and timing of withdrawal
The withdrawal rate of the SICS-I is below 2%, since it was an observational study in which no interventions were applied.
Following hospital regulations, patients or their legal representatives were informed and were excluded if they refused to participate. Withdrawal from our study occurred when informed consent was obtained from the patient's legal representative, but the patient refused at a later time.

Reasons and details of withdrawal
Reasons for withdrawal or lost to follow-up will be reported in the manuscript and/or flow diagram.
Our observational study consists of a one-time measurement (snapshot), so that drop-outs or lost to follow-up reasons are unrelated to the study.
Patients who were lost to follow-up are considered alive until their last outpatient visit or hospital discharge, and were censored thereafter.

Collected baseline patient characteristics
The cohort study was designed to register a set of clinical examination, biochemical and haemodynamic variables in each included patient. We extracted baseline demographic data from the Dutch National Intensive Care Evaluation (NICE) registry and collected clinical data by protocolised clinical examination and CCUS. We obtained the biochemical values from arterial blood gas analyses closest to study inclusion. Table 2 provides an overview of all collected variables and indicates for each variable whether it is categorised as a clinical examination, haemodynamic, or biochemical variable.

Descriptive summarization of baseline patient characteristics
We will list general patient characteristics in a baseline characteristics table. Data will be presented as mean with standard deviation (SD) when normally distributed or as median with interquartile range in case of skewed data. Dichotomous and categorical data will be presented in proportions. Normality of the data will be assessed using P-P plots, Q-Q plots, and histograms. Linearity will be assessed using scatter plots. Differences between continuous variables will be assessed using Student's t-tests or

Assumed confounding covariates
The majority of variables measured in our study are inevitably correlated, as most relate to the haemodynamic status of a patient. While definitions have been recorded in the protocol, the values of the variables can be confounded by unmeasured factors, such as environmental, genetic, or psychological influences. Therefore, we provide an example of possible confounding variables and categorise these into 'measured' and 'unmeasured'.
• Cardiac output and clinical examination of the central haemodynamics (i.e. heart rate, blood pressures, central venous pressure) are assumed to be confounded by: ○ Measured: body surface area (therefore we will use cardiac index), quality of measurements (therefore data will be validated), distributive shock as the underlying pathology, administration of inotropes and/or vasopressors, administration of propofol (negative inotropic effect), mechanical ventilation including ventilation pressures ○ Unmeasured: stress, pain, anxiety • Urine production is assumed to be confounded by: ○ Measured: history of chronic renal failure, distributive shock as the underlying pathology ○ Unmeasured: total amounts of fluids administered • AVPU score is assumed to be confounded by ○ Measured: sedation (propofol, midazolam) • Clinical examination of peripheral perfusion (i.e. mottling, peripheral capillary refill times, peripheral temperatures) are assumed to be confounded by: ○ Measured: heating blankets, distributive shock as the underlying pathology, cardiac output, administration of inotropes and/or vasopressors, administration of propofol (vasodilation) ○ Unmeasured: regular blankets, environmental temperature, peripheral arterial disease.
• Mortality proportion is assumed to be confounded by: ○ Measured: age, comorbidities, and several other variables that are all embedded in the simplified acute physiology score II (SAPS-II) score ○ Unmeasured: cause of mortality (e.g., death due to multi-organ failure or failure to wean, a patient's or family's personal wishes regarding the extent of ICU treatment. This will, however, always be a mix of causes).
We acknowledge that there will be residual confounding in our dataset due to the presence of unmeasured confounding, some of which is listed above. However, the actual measured variables reflect daily practice and so, is assumed to reflect similar confounding in daily assessments of the haemodynamic status of ICU patients.

Primary and secondary outcomes
The research questions and the design of the study have been published (7). We elaborate on the outcomes of the basic study below and described the primary outcomes of each sub-study in appendix 1.
The outcomes of research question 1, i.e. the diagnostic study, are: • Primary: the association of a single or combination of clinical examination findings with cardiac index measured by CCUS At a later time, the images and measurements were validated by technicians from an independent core laboratory, whom were blinded for all other measurements and outcomes.
We used cardiac index instead of cardiac output for interindividual comparisons. Cardiac index is the cardiac output adjusted for body surface area: Where body surface area was calculated with the DuBois formula (14): Cut-offs for a low cardiac index for critically ill patients are inconsistent (15). Haemodynamic criteria to diagnose cardiogenic shock vary from a cardiac index of 1.8 to 2.5 L/min/m 2 (16)(17)(18)(19). A cardiac index below 2.2 L/min/m 2 is often used to diagnose a low cardiac output syndrome after cardiac surgery (20), whereas a large clinical trial used a cut off below 2.5 L/min/m 2 in patients with acute lung injury (19). These criteria, however, apply to patients with heart failure or after cardiac surgery. There currently is no consensus on how much cardiac index is low, normal or high for the critically ill patient.
In the secondary outcome of our diagnostic study, we will both use a cut-off of 2.2 and 2.5 L/min/m 2 for a low, and a cut-off of 4.0 and 4.5 L/min/m 2 for a high cardiac index.
For the prognostic study we obtained follow-up on all-cause mortality from the municipal personal records database. Analysis of mortality will be performed using time-to-event data (patients were censored at 90-days of follow-up).

Correlations
We will use the Pearson r correlation and Spearman correlation coefficient rho (r) to evaluate the degree of relationship between variables. For normally distributed variables, we will use the Pearson r correlation with 98.5% confidence interval upon checking for linearity and homoscedasticity, while in case of skewed, ordinal data, the degree of association between variables will be quantified using Spearman correlation coefficient rho (r). We will use Cohen's d to evaluate the correlation coefficient and assess the strength (or effect size) of the relationships, where a correlation coefficient between 0.10 and 0.29 will represent a small association, between 0.30 and 0.49 a medium association, and a coefficient of 0.50 and above a large association.

Continuous and dichotomous outcomes (diagnostic study)
We will conduct a least-squares linear regression analysis for continuous dependent variables and a logistic regression for dichotomous dependent variables. A univariable regression analysis will be conducted on 17 clinical examination findings with cardiac index as the dependent variable. A univariable regression analysis will be conducted on all variables and a p < 0.25 will be used for inclusion in the multivariable model. As there are no previous studies that include (a combination of) all available clinical examination variables into one model estimating cardiac index, we will not include any variable on a theory driven basis. We will construct the multivariable model using forward stepwise regression by adding blocks of variables. In case of a multivariable linear regression model, we will construct a kernel density plot to assess normality of the residuals and check the homogeneity of variance by plotting the residuals versus the fitted values. We will use the variance inflation factor (VIF) to check for multicollinearity; as a rule of thumb, we will assume multicollinearity when a variable has a VIF-value greater than 10. If the assumptions are not met, we will use an ordinal regression analysis.

Time-to-event data (prognostic study)
Analysis of mortality will be performed using time-to-event data (patients were censored at 90-days of follow-up). Categorical variables will be analysed using the log-rank test and continuous variables will be assessed using a univariable Cox proportional hazard regression analysis. Analysis of mortality proportion will be presented by Kaplan-Meier survival curves when independent variables are dichotomous or categorical. A univariable Cox regression analysis will be conducted on 22 clinical examination, haemodynamic and biochemical variables with 90-day mortality as the dependent variable. Covariates with a of p < 0.25 in the univariable analysis will be included in the multivariable model. We will include the SAPS-II score in our multivariable Cox regression model of 90-day mortality.
The SAPS-II is a predictive score for in-hospital mortality and includes seventeen covariates among which age, haemodynamic and biochemical variables, and presence of metastatic or haematological cancer (21). The proportionality assumption will be tested by using the Schoenfeld and scaled Schoenfeld residuals. We will test the proportionality of the model as a whole and the proportionality for each predictor and reject proportionality in case of significant test findings (p<0.05). We will plot graphs of the scaled Schoenfeld assumption, where horizontal lines in the graph indicate that the proportionality assumption is not violated. We will also construct log-log plots, where two parallel lines indicate that proportionality was not violated. The goodness of fit of the final model will be evaluated by using the Cox-Snell residuals. In our Cox regression models, at least 15 events are necessary for each variable included in the final model (22,23).

Validation (both)
Because of the fixed sample size of our study, we will use bootstrapping validation to assess the accuracy of our model. We will randomly select 1075 individuals from the dataset with replacement, build our model on the bootstrapped sample and validate it on the original data. We will repeat this process 100 times and include a variable into our final model if it was significant in at least 80 models.
We will still emphasise the hypothesis-generating aspect of our findings. We aim to validate our findings in an independent cohort in the future.

Sensitivity and subgroup analyses
In the diagnostic study, we will conduct a sensitivity analyses on the subgroup of patients in which the quality of the cardiac output measurements by CCUS by the core laboratory is considered 'good'.
In the prognostic study, we will conduct a sensitivity analyses on different follow-up times of mortality: we will also use mortality at 7 and 30 days.
If the sample size permits, we will conduct subgroup analysis in different subpopulations. We will create the following subgroups in the basic study and test both our prognostic and diagnostic hypotheses on: • Subgroup 1: subdivide the population into three groups: no shock, shock associated with a low cardiac output, shock associated with a high cardiac output.
• Subgroup 2: subdivide the population by underlying pathologies that could influence the haemodynamic measurements in a patient: Patients admitted due to cardiac arrest, heart failure, after liver transplantation or liver failure, central nervous system pathologies, and septic shock. We will identify clinical subgroups by using variables such as APACHE-IV admission diagnoses, confirmed infection, cardiac arrest from the Dutch National Intensive Care Evaluation (NICE) registry. The NICE registry contains good quality and complete data due to several assurance quality procedures (24). Patients with septic shock are identified using the quick Sequential Organ Failure Assessment (qSOFA) score combined with a confirmed infection according to the latest definition (25).

Reasons for missing data
In our design paper, we extracted invalidated baseline data and obtained some insight in the missingness of our variables (7). We expect to have no missing data for the variables blood pressure, heart rate, urine output, central temperature, arterial haemoglobin and lactate levels. Some data will be unobtainable for the variables mottling score, capillary refill times and peripheral temperatures.
Reasons for unobtainability included a dark or icteric skin colour (mottling and capillary refill times) and compression stockings (capillary refill time at the knee and peripheral temperature at the dorsum of the foot

Imputation method
If our missing values are missing at random, primary analyses will be performed with imputation for missing data using multiple imputations (MI). A threshold of up to 50% missing data will be considered acceptable for use of MI. Robustness of conclusions will be checked by secondary sensitivity analyses including available data and imputation of worst-best and best-worst case scenarios covering also missing not at random (MNAR) scenarios. We will use multiple imputation using the MI impute chained equation command in SPSS. We will compare the imputed values with the observed values to establish the validity of the imputed data; we will check whether the imputed values are realistic or if they require a cut-off to avoid unrealistic negative values. The imputation will be repeated 20 times and Rubin's rule will be used to combine variable estimates and standard errors (27).
If our missing values are MCAR or missingness is confined to the outcome variable, we will use complete case analysis for our primary analyses.

.1. Relationship between clinical examination signs and cardiac index
There is currently no consensus on how much cardiac index is sufficient for the critically ill patient.
Different studies used differing cut-offs for a low cardiac index (15). Therefore, we will use 2 cut-offs for both a low and high cardiac index: Based on the multivariable logistic regression findings, we will construct a risk score for cardiac index based on the coefficients in the model. Scores will be calculated by dividing all coefficients by the lowest coefficient or by transforming all coefficients so that the score is easy to count.

Diagnostic test evaluation
Depending on the optimal and number of cut-offs (i.e. low or normal cardiac index or low, normal and high cardiac index), we will display our diagnostic test in a 2 x 2 table, 2 x 3, or a 3 x 3 table (28,29). An example table can be found below. In a 3 x 3 table, data will be ordinal and we will give an indication of the agreement with the weighted kappa (30). In addition, we will also calculate the 98.5% confidence intervals. Below we interpret each diagnostic test parameter for our diagnostic study.

Cardiac index cut-offs
• True positive: the number of patients with a low cardiac index who also had clinical examination signs indicating hypoperfusion.
• False positive: the number of patients without a low cardiac index, but with clinical examination signs indicating hypoperfusion.
• True negative: the number of patients without a low cardiac index and with clinical examination signs indicating a normal perfusion.
• False negative: the number of patients with a low cardiac index, but with clinical examination signs indicating a normal perfusion.
• Sensitivity: the probability of the presence of a low cardiac index when there were clinical examination signs indicating hypoperfusion.
• Specificity: the probability of the absence of a low cardiac index when clinical examination signs indicating a normal perfusion.
• Positive predictive value: the probability that a low cardiac index is present given the clinical examination signs indicate hypoperfusion.
• Negative predictive value: the probability that a low cardiac index is absent given the clinical examination signs indicate a normal perfusion.
• Positive likelihood ratio: the ratio between the probability of clinical examination signs indicating hypoperfusion among the patients with a low cardiac index, relative to the patients with the same test result but a normal cardiac index.
• Negative likelihood ratio: the ratio between the probability of clinical examination signs indicating normal perfusion among the patients with a low cardiac index, relative to the patients with the same test result but a normal cardiac index.
• ROC curve: "The receiver operating characteristic" curve. Grayscale image sensitivity as a function of (1 -specificity) for each possible cut-off. Most useful for comparison of two methods.
• Area under the ROC curve: displays the accuracy of the test and will be classified according to the following point system: We will use machine learning (ML) algorithms to generate hypotheses, validate observations of conventional models, and to unravel heterogeneity. Predictive modelling using ML algorithms requires the original data be split into two smaller sets, one for training and one for testing. We randomly split the original data into two groups with, respectively, 70 to 80% and 20 to 30% of the individuals. The split before further division of the training data into several folds for cross-validation virtually assures no information leakage is possible, making the training method virtually completely unbiased.
The training set will then be further divided into k similarly sized partitions (k-folds). By doing this, we divide the training set each time into k parts and then use each of these k parts once as testing dataset Three algorithms will primarily be used to develop the models: Gradient Boosted Machine, Support Vector Machine, and Random Forest. A summary of the necessary hyper-parameters for each of the algorithms is provided in table 3. Finally, an Ensemble Model will be built consisting of the best models build with each of the three algorithms. The choice for these three models is based on them having previously been shown to have similar, high performances in datasets with sparse data, despite the different structure of each algorithm (31). All three algorithms are not very sensitive to over-fitting (i.e. tend not to over-fit), but achieve this in different ways, where SVM is a disadvantage compared to the other two, since it attempts to minimize over-fitting for each kernel (i.e. type of model), but the user is still left to determine which kernel best fits the data, which can be error prone (32). In addition, both GBM and Random Forest are tree-based algorithms which allow for almost unprocessed data to be analysed. While SVM requires significant pre-processing, and is computationally slower than the other two, its high performance and the possibility to adapt a Radial Basis Function to almost all highdimensionality problems makes it an easy-to-use, rather interpretable algorithm.

Gradient Boosting Machine
The Gradient Boosting Machine (GBM) algorithm is perhaps the strongest and the one with potentially the most interesting clinically-oriented properties. In GBM, new models are consecutively fitted to the training data set in order to provide a more accurate estimate of the outcome variable (33). By combining multiple decision trees, and increasingly weighting the "difficult to predict" events to a greater degree, GBM will fit k models (one per fold of the cross-validation process we defined earlier) to compute the error estimate, before making a final model using all of the data.
Our GBM model was tuned during training for four hyper-parameters: n.trees (the number of iterations to be generated by the algorithm), n.minobsinnode (the minimum number of observations in the terminal nodes of a tree (see Fig. 5)), interaction depth (defines the number of terminal nodes or leaves of a tree), and shrinkage (34). This parameter controls the learning rate of the algorithm by controlling the rate at which the boosting algorithm descends the error surface (35,36) .

Support Vector Machine
A Support Vector Machine (SVM) is a class of supervised learning algorithms often used in classification problems such as this. It classifies data points into two different classes (e.g. "Alive" or "Deceased") by taking these points in a multidimensional space and separating them by means of the hyperplane that best differentiates between the two groups (37). Due to the great variation in range of our numeric predictor variables in datasets with patient data, the data will be scaled and centred, to prevent attributes with a greater range dominating those with smaller ranges. Since SVMs don't allow for categorical variables, further processing is needed to encode all categorical variables into dummy variables (binary variables (0 or 1) which indicate the absence or presence of the effect of some parameter). After this, variables with zero or near-zero variance are excluded.
The SVM model will be tuned during training for two hyper-parameters: sigma (σ) and cost (C) (34).
The best values for various pairs of exponentially growing C and sigma values will be determined during training by means of a "grid-search" using cross-validation (37,38). In addition, due to the expected dimensionality of the classification problem, we will use a Radial Basis Function (RBF) kernel to allow the hyperplane boundary between classes to be non-linear. Tuning of sigma is required in this context to determine the influence of a single training example on the overall prediction. For instance, an excessively large value would constrain the model back to linearity (37). Similarly, cost will control the misclassification tolerance by forcing the SVM towards a harder margin or allowing a smoother (i.e. softer) decision boundary and an increased probability of misclassification.

Random Forest (figure 3)
A Random Forest (RF) is an ensemble-based technique that attempts to minimise the limitations of classical decision trees by building multiple trees from a random subset of the original training data and considering only a random number of predictor variables at each split, instead of trying all the variables at every split, before aggregating their results (39-41).
As a learning method, it is more robust to overfitting than normal decision trees, which is especially important in relatively small training sets, and shows good predictive performance despite considerable noise (37). Furthermore, it requires virtually no pre-processing, running efficiently even on datasets with a large number of input variables, categorical and continuous.
Hyper-parameter optimization is done by defining a parameter-value grid with a wide range of trees and multiple values for mtry. The best model is selected either by comparison of the AUROCs obtained for each combination of these two parameters, or according to Kuhn and Johnson's threshold-based approach for datasets with imbalanced outcome classes (42).

Ensemble models
Lastly, an ensemble model will be built that combined the best model of each of the three base learner algorithms. Using an ensemble model is expected to increase model robustness, when compared to using individual models, which is achieved by incorporating the predictions from all the base learners.
The final predictions are then given twice, one label per type of ensemble: for the weighted ensemble, based on the weights defined for each model, the weighted predicted probabilities are calculated, and a label (e.g. "Alive" or "Deceased") is given to a case based on the defined threshold (of 0.50 in this case); for the majority vote ensemble, the label, and not the predicted probabilities, of each model are taken, and the class with at least two-our-of-three votes is attributed to a certain patient.

Model testing
All final models (for each algorithm or for the ensemble model) will be fitted to the testing dataset, resulting in a prediction of each patient's individual probabilities of belonging to either of the outcome classes ("Alive" or "Deceased"). Based on a defined threshold, these probabilities will then be converted to a binary label, and the AUROC and other additional performance measures for the models will be calculated for the testing dataset.

Principal Component Analysis
The robust implementation of all three algorithms used to model ICU mortality prediction in the SICS-I database in the caret package used in R, allows us to obtain an importance-based ranking of the variables, with respect to their ability to predict the outcome variable. This feature is potential applicable in clinical practice and provides an intuitive way to reduce the dimensionality (i.e. number of input variables). However, Principal Component Analysis (PCA) can be an even better tool to shed light on parameter prioritization for data-driven studies in the ICU.
PCA is an adaptive data analysis technique for reducing the dimensionality of large datasets, thus increasing interpretability, while simultaneously minimizing information loss. It identifies the largest sources of variation in a dataset and constructs a lower dimensional subspace of the data by creating new uncorrelated variables, or principal components (PCs), that successively maximize variance (43).
To form each additional PC, it seeks a second linear combination that can explain the maximum of the remaining variance. For each PC, variable loadings are then provided, which represent the contribution of each original variable in explaining the variance in each of the PCs. Unlike predictive modelling, PCA is a descriptive tool, rather than inferential or predictive, which can partly substantiate or explain the predicted findings.
With iterative PCA, a variation on normal PCA, scores for both variables and individuals of complete or incomplete datasets are returned, providing an idea of the contribution of a certain variable, or a certain individual, to the total variance in that dataset (44). PPCA and BPCA are both iterative techniques with a probabilistic basis, which have been shown to perform better than traditional PCA when applied to data with missing values, especially as the missingness level increases (43). Using the The R packages factoextra (45) and factoMineR (46), the PCA results are presented graphically by means of a graphic showing the distribution of individuals across a plane containing two PCs, a correlation circle graphic with the most explanatory variables or factors for each PC, and the PCA biplot showing a combination of both for clarity. The squared cosine (cos2) of the most relevant individuals and variables is also given, as is their percentual contribution (47).
Due to the heterogeneity in variable scaling (for example, the maximal urine per hour was 600 ml/hour, the highest value for haematocrit being 0.53, and 11.5 mmol/L for haemoglobin), all variables were centred and scaled (i.e. subtracting the variable mean, and dividing the value by the standard deviation), so as to standardise the variance of the dataset, allowing for comparability between variables with different scales and units (48).

Statistical software
Except for the predictive and PC analyses, statistical analyses will be performed using SPSS version 23

Discussion
The overall aims of the SICS-I study are to address the diagnostic and prognostic value of clinical examination findings, haemodynamic variables, and biochemical values. This broad aim resulted in multiple research questions and hypotheses, which results in many variables to be tested repetitively.
We drafted this SAP to avoid outcome reporting bias and data-driven results. This plan only addresses the hypotheses of the basic study which is, according to our sample size calculations, able to detect a 5% difference for the primary outcomes. Results of the sub-studies will be published as separate manuscripts and SAPs will be written before analysing their hypotheses.
We strive to eliminate inflated type I error rates by adjusting our primary outcomes for different confounders. We also strive to reduce the chance of a type I error due to multiplicity by pragmatically adjusting the P-value as proposed by Jakobsen et al. (12). This adjustment for multiplicity is based on the number of different primary outcomes of the basic study and sub-studies. In each manuscript, we will identify and discuss dubious significant findings. Although we will adjust for multiple testing, we will emphasise the hypothesis-generating aspect of results.
Our diagnostic hypotheses will be tested in two steps: first, we will investigate the association of cardiac index with clinical examination variables and second, we will investigate at which cardiac index the clinical examination findings of hypoperfusion become apparent. We aim to present a cut-off for low cardiac index relevant to critically ill patients in our secondary outcome, as current cut-offs for low cardiac index are based on patients with acute lung injury, heart failure or after cardiac surgery (16)(17)(18)(19)(20). Diagnostic test accuracies will be described for each finding or combinations of findings for our proposed cut-off of cardiac index. This secondary analysis, as well as the predefined subgroup analyses, will be considered exploratory and we will emphasise the need for validation in an independent cohort.

Conclusion
This SAP presents the principles of analysis of the SICS-I cohort and discusses its major methodologic and statistical concerns. We hope that the results of the SICS-I will be as transparent and robust as possible, so that we minimised the risk of outcome reporting bias and data-driven results.  The circles represent the nodes, with the red and green circles signalling the terminal nodes, and the arrows show the "optimal split" computed for each node of each tree. The aggregated vote of all (n) trees is then combined and expressed as a final classification, for example, "alive".

Pulmonary ultrasound
What is the diagnostic accuracy of pulmonary oedema measured with pulmonary ultrasonography and auscultation for pulmonary crackles compared to pulmonary oedema diagnosed on a chest radiograph?
Is there a statistically and clinically significant difference in cardiac output between patients with and without a B-profile?

Repeated measurements
Are changes in clinical examination findings over 24 hours associated with changes in cardiac output?
Are changes in clinical examination, haemodynamic and biochemical variables over 24 hours associated with 90-day mortality?
Cardiac output

RV-function & AKI
Is RV-volume overload measured by tricuspid insufficiency and RV-diameters associated with acute kidney injury?
Are clinical examination, biochemical, and haemodynamic variables associated with the development of AKI in patients without known pre-existent chronic kidney disease? Acute kidney injury

Fluid responsiveness
What is the diagnostic accuracy of fluid responsiveness assessed by changes in end-tidal carbon dioxide (EtCO2), heart rate and blood pressure compared to fluid responsiveness assessed by the passive leg raising (PLR) test?
What is the diagnostic accuracy of fluid responsiveness assessed by a PLR test without lowering the head of the bed compared to fluid responsiveness assessed by the standard PLR test?   Defines the minimum number of observations in the terminal nodes of a tree.
Defines the number of terminal nodes or leaves of a tree.
Controls the learning rate of the algorithm.

ntree mtry
Defines the number of trees built by the algorithm.
Defines the number of candidate variables randomly selected and tried at each split of a tree.
Defines how much a single training example influences the model (a higher sigma constrains the model towards linear).

APPENDIX 1: sub-studies
This appendix elaborates on the study methods of the sub-studies of the Simple Intensive Care Studies-I (SICS-I). Per sub-study, we elaborate on the research question, variable of interest, timing of the measurement in the basic study, additional exclusion criteria, sample size calculations, and the analyses. on the sub-study procedure, additional exclusion criteria, and primary and secondary outcomes.
The conclusions based on the analyses in the sub-studies are explorative in nature and must all be confirmed in separate (external) validation cohorts.

Research question
Which clinical examination, biochemical, and haemodynamic variables are associated with tissue (muscle) oxygen saturation (StO2) measured by near-infrared spectroscopy (NIRS)?

Timing of measurement
During inclusion in the basic study (one-time snapshot).

Additional exclusion criteria
Liver failure (increased bilirubin levels cause interferences with StO2 measurements), patients with dark skin complexion (large quantities of melanin interfere with the near-infrared light emanated by the probe).

Sample size, power and detectable difference
We will conduct analysis of this sub-study with 32 patients, in which we will regress their values of NIRS against clinical examination findings such as central-to-toe temperature difference. Prior data indicate that the standard deviation of central-to-toe temperature is 4.1 with and estimated standard deviation of the regression errors of 23. If the true slope of the line obtained by regressing NIRS against central-to-toe temperature is 3.406, we will be able to reject the null hypothesis that this slope equals zero with probability (power) 80%. The type I error probability associated with this test of this null hypothesis is 0.015.

Primary analysis
The association between clinical, biochemical and haemodynamic variables and tissue (muscle) StO2 measured by NIRS at the knee.

Secondary analyses
The association between clinical, biochemical and haemodynamic variables and tissue (muscle) StO2 measured at the thenar muscle.
The association between tissue (muscle) StO2 measured by NIRS and 90-day mortality.

Research question
What is the diagnostic accuracy of pulmonary oedema measured with pulmonary ultrasonography and auscultation for pulmonary crackles compared to pulmonary oedema diagnosed on a chest radiograph?

Variables of interest
Pulmonary ultrasound was conducted with the cardiac probe M3S of M4S with default cardiac imaging and maximal frequency (3.6 MHz) setting of the General Electric Vivid-S6 mobile ultrasound machine.
We measured the presence or absence of B-lines at the six locations specified in the BLUE-protocol.
The presence of a B-profile was defined by three or more B lines observed in at least three of the six BLUE points, or in two of the four lower BLUE points.
Pulmonary oedema was diagnosed by the radiologist who reviewed chest radiographs as part of daily care. The radiologist was blinded for the variables collected in our study.

Timing of measurement
During inclusion in the basic study (one-time snapshot).

Additional exclusion criteria
Patients after pulmonary transplantation.

Sample size, power and detectable difference
Due to a set sample size we were able to estimate the precision of our estimate (i.e. the maximum marginal error). We calculated the maximum marginal error with the formulas of Hajian-Tilaki (49)

Primary analysis
The diagnostic test accuracy of a B-profile measured with pulmonary ultrasonography compared to pulmonary oedema diagnosed by chest radiography.
The diagnostic test accuracy of pulmonary crackles assessed with auscultation compared to pulmonary oedema diagnosed by chest radiography.

Secondary analysis
The statistically and clinically significant difference in CO between patients with and without a Bprofile.

Research question
Is an increase in positive end-expiratory pressure (PEEP) associated with a decrease in cardiac output?

Variable of interest
During the PEEP-challenge, an additional 10 cm H2O of PEEP was temporarily applied when supervised by the treating ICU physician. The PEEP was elevated for a maximum duration of 5 minutes during which the changes in cardiac output, heart rate, blood pressures and central venous pressure were recorded.

Timing of measurement
During inclusion in the basic study and after temporary PEEP application (repeated measurements).

Additional exclusion criteria
The attending ICU physician was not interested in fluid responsiveness at time of inclusion.

Sample size, power and detectable difference
We are planning a study with paired measurements in 25 patients. Prior pilot data indicate that the difference in the cardiac output of matched pairs is normally distributed with standard deviation 0.60.
If the true difference in the cardiac output before and after application of PEEP is 0.42, we will be able to reject the null hypothesis that this response difference is zero with probability (power) 80%. The type I error probability associated with this test of this null hypothesis is 0.015.

Primary analysis
The association between PEEP increase and cardiac output.

Secondary analysis
None.

Research question
Is right ventricular (RV)-function measured by the tricuspid annular plane systolic excursion (TAPSE) and peak tissue Doppler systolic velocity in the tricuspid annulus (RV s') independently associated with 90-day mortality?

Timing of measurement
During inclusion in the basic study (one-time snapshot).

Additional exclusion criteria
None.

Sample size, power and detectable difference
In our design paper, we have 112 patients with an impaired TAPSE and 279 patients with a normal TAPSE (controls). Pilot data indicate that the 90-day mortality proportion among controls is 0.28 7 . We will be able to detect true relative risks of 0.46 or 1.62 in patients with impaired TAPSE relative to controls with probability (power) 80%. The type I error probability associated with this test of the null hypothesis that this relative risk equals 1 is 0.015. We will use an uncorrected chi-squared statistic to evaluate this null hypothesis.

Primary analysis
The association between RV-function measured by TAPSE or RV s' and 90-day mortality.

Secondary analysis
The association between RV-function measured by TAPSE or RV s' and clinical examination findings and cardiac output.

Research question
Is peripheral blood flow measured with CCUS associated with cardiac output?

Timing of measurement
During inclusion in the basic study (one-time snapshot).

Additional exclusion criteria
Obstruction of the desired window of peripheral artery flow (e.g. by a central venous or dialysis catheter, postsurgical incisions, wounds, etc.).

Sample size, power and detectable difference
We will conduct analyses of this sub-study with 59 patients, in which we will regress their values of cardiac output against peripheral flow measurements such as common carotid artery flow. Our pilot data indicate that the standard deviation of common carotid artery flow is 7.9, with an estimated standard deviation of the regression errors of 97. If the true slope of the line obtained by regressing cardiac output against common carotid artery flow is 5.4, we will be able to reject the null hypothesis that this slope equals zero with probability (power) 80%. The type I error probability associated with this test of this null hypothesis is 0.015.

Primary analysis
The association between peripheral blood flow measured at the common carotid, subclavian, and common femoral arteries and cardiac output.

Secondary analysis
The association between a proxy for abdominal organ blood flow and acute kidney injury (AKI) or 90day mortality.

Research question
What is the level of agreement between cardiac output measured by the FloTrac compared to cardiac output measured with CCUS?

Timing of measurement
Paired cardiac output measurements were conducted after admission every hour for 4 hours and once 24 hours after admission.

Additional exclusion criteria
Patients not requiring vasopressors and/or inotropes, an inadequate acoustic window for cardiac output measured by CCUS, absence of an arterial line, atrial fibrillation during inclusion, aortic and/or mitral valve disease.

Sample size, power and detectable difference
We are planning a study with paired measurements in 55 patients. Prior pilot data indicate that the difference in the cardiac output of both measurements devices is normally distributed with standard deviation 0.86. If the true difference in the cardiac output of both measurements devices is 0.39, we will be able to reject the null hypothesis that this response difference is zero with probability (power) 80%. The type I error probability associated with this test of this null hypothesis is 0.015.

Primary analysis
The level of agreement between cardiac output measured by the FloTrac and cardiac output measured with CCUS.

Secondary analysis
The changes in levels of agreement when factors that might influence FloTrac measurements are present.

Research question
Are changes in clinical examination findings over 24 hours associated with changes in cardiac output?

Variable of interest
We repeated the measurements of the variables collected in the basic study, sub-study 2, and substudy 4. We performed these measurements 24 hours (minimum 22 to maximum 26 hours) after the first measurement and calculated differences. The sign of the variable indicates whether a variable has either increased (positive number) or decreased (negative number).

Timing of measurement
During inclusion in basic study and 24 hours (minimum 22 hours to maximum 26 hours) thereafter.

Additional exclusion criteria
Patients not requiring vasopressors and/or inotropes, expected stay less than 48 hours.

Sample size, power and detectable difference
We are planning to conduct this sub-study with 100 patients and we will regress their values of difference in cardiac output changes over 24 hours against clinical examination findings such as central-to-toe temperature difference. Prior data indicate that the standard deviation of central-totoe temperature is 4.1 with an estimated standard deviation of the regression errors of 1.35. If the true slope of the line obtained by regressing changes in cardiac output against central-to-toe temperature differences is 0.16, we will be able to reject the null hypothesis that this slope equals zero with probability (power) 80%. The type I error probability associated with this test of this null hypothesis is 0.015.

Primary analysis
The association between changes in clinical examination findings over 24 hours and changes in cardiac output.

Secondary analysis
The association between changes in clinical examination, biochemical, and haemodynamic variables over 24 hours and 90-day mortality.

Research questions
Is RV-volume overload measured by tricuspid insufficiency and RV-diameters associated with acute kidney injury (AKI)?
Are clinical examination, biochemical, and haemodynamic variables associated with the development of AKI in patients without known pre-existent chronic kidney disease?

Variable of interest
Right ventricle diameters and tricuspid regurgitation velocity have been measured with the cardiac probe M3S of M4S with default cardiac imaging setting of the General Electric Vivid-S6 mobile ultrasound machine. The measurements were obtained in the AP4CH view with a right ventricle centred view. AKI was established and classified following the kidney disease: improving global outcomes (KDIGO) criteria. Urine output and serum creatinine measurements from the first 72 hours of inclusion were analysed to establish and classify AKI for each patient.

Timing of measurement
During inclusion in basic study (one-time snapshot).

Additional exclusion criteria
None.

Sample size, power and detectable difference
In our design paper, we have 489 patients who developed AKI and 418 patients with a normal renal function (controls). Pilot data from our design paper indicate that the AKI proportion among controls is 0.54. We will be able to detect true relative risks of 0.80 or 1.20 in exposed patients relative to unexposed patients with probability (power) 0.80. The type I error probability associated with this test of the null hypothesis that this relative risk equals 1 is 0.015.

Primary analyses
The association between RV-volume overload measured by tricuspid insufficiency and RV-diameters and AKI.
The association between clinical, biochemical, and haemodynamic variables and the development of AKI.

Secondary analyses
The association between RV-volume overload measured by tricuspid insufficiency and RV diameters and 90-day mortality.
The association between clinical, biochemical, and haemodynamic variables and the development of AKI regardless of the presence of pre-existent chronic kidney disease.

Research question
What is the diagnostic accuracy of fluid responsiveness assessed by changes in end-tidal carbon dioxide (EtCO2), heart rate and blood pressure compared to fluid responsiveness assessed by the passive leg raising (PLR) test?
What is the diagnostic accuracy of fluid responsiveness assessed by a PLR test without lowering the head of the bed compared to fluid responsiveness assessed by the standard PLR test?

Variable of interest
During the fluid responsiveness study, two different PLR tests were applied when supervised by the treating ICU physician. Every passive leg raising manoeuvre was conducted for a maximum duration of 60 seconds during which the changes in cardiac output, heart rate, blood pressures, central venous pressure, and EtCO2 were recorded. Fluid responsiveness was diagnosed when cardiac output increased with 15% after the PLR-test. The PEEP-challenge was conducted in a similar manner as described in sub-study 3.

Timing of measurement
During inclusion in basic study (one-time snapshot).

Additional exclusion criteria
The attending ICU physician was not interested in fluid responsiveness at time of inclusion.

Sample size, power and detectable difference
Due to a set sample size we were able to estimate the precision of our estimate (i.e. the maximum marginal error). In our pilot data, the prevalence of fluid responsiveness was 45%, and EtCO2 increase of 0.3 had a sensitivity of 83.3% and a specificity of 71.4%. Based on 20 included patients who underwent the PLR-test, a type-I error of 0.015 (with a corresponding Z-value of 2.17), and the abovementioned prevalence, sensitivity and specificity, the estimated maximal marginal error for sensitivity is 11.9% and 2.7% for specificity.

Primary analyses
The diagnostic accuracy of fluid responsiveness assessed by changes in EtCO2, heart rate and blood pressure compared to the PLR test.

Primary analysis
The diagnostic accuracy of a B-profile assessed with pulmonary ultrasonography compared to bilateral consolidations assessed on chest radiography for the diagnosis of ARDS.

Secondary analysis
The association between B-lines measured with pulmonary ultrasonography and clinical examination, biochemical and haemodynamic variables.

Research question
Are left ventricular (LV) and RV-myocardial strain measured with tissue Doppler imaging associated with 90-day mortality?

Timing of measurement
During inclusion in basic study (one-time snapshot).

Additional exclusion criteria
None.

Sample size, power and detectable difference
In a pilot study we included 51 patients of which 9 died. Pilot data from our design paper indicate that the 90-day mortality proportion among controls is 0.28 7 . We will be able to detect a true relative risk of 3.0 in diseased patients subjects relative to patients who survived with probability (power) 0.80. The Type I error probability associated with this test of the null hypothesis that this relative risk equals 1 is 0.015.