A comparison of methods for multiple degree of freedom testing in repeated measures RNA-sequencing experiments

Background As the cost of RNA-sequencing decreases, complex study designs, including paired, longitudinal, and other correlated designs, become increasingly feasible. These studies often include multiple hypotheses and thus multiple degree of freedom tests, or tests that evaluate multiple hypotheses jointly, are often useful for filtering the gene list to a set of interesting features for further exploration while controlling the false discovery rate. Though there are several methods which have been proposed for analyzing correlated RNA-sequencing data, there has been little research evaluating and comparing the performance of multiple degree of freedom tests across methods. Methods We evaluated 11 different methods for modelling correlated RNA-sequencing data by performing a simulation study to compare the false discovery rate, power, and model convergence rate across several hypothesis tests and sample size scenarios. We also applied each method to a real longitudinal RNA-sequencing dataset. Results Linear mixed modelling using transformed data had the best false discovery rate control while maintaining relatively high power. However, this method had high model non-convergence, particularly at small sample sizes. No method had high power at the lowest sample size. We found a mix of conservative and anti-conservative behavior across the other methods, which was influenced by the sample size and the hypothesis being evaluated. The patterns observed in the simulation study were largely replicated in the analysis of a longitudinal study including data from intensive care unit patients experiencing cardiogenic or septic shock. Conclusions Multiple degree of freedom testing is a valuable tool in longitudinal and other correlated RNA-sequencing experiments. Of the methods that we investigated, linear mixed modelling had the best overall combination of power and false discovery rate control. Other methods may also be appropriate in some scenarios. Supplementary Information The online version contains supplementary material available at (10.1186/s12874-022-01615-8).


edgeR
The edgeR package employs a negative binomial generalized linear model (GLM) framework to address overdispersion 4,5 . Let Y gi be the expression level of gene g in the ith sample, with E(Y gi ) = µ gi . In order to account for differences in library size, an offset, N i , is calculated for each sample using the trimmed mean of M-values (TMM) method 4 . Then, the data can be modeled using Y gi ∼ N B(µ gi , α g ) (1) log(µ gi ) = X i β g + log(N i ) (2) where β g is a vector of fixed effect regression coefficients for gene g, and the X i are the fixed effects covariates for subject i. The genewise dispersion parameter α g is estimated using an empirical Bayes approach 5 .

DESeq2
The methodology behind the DESeq2 package is very similar to that of edgeR, using the same model outlined in Equations 1-2 with the exception of the term N i which is replaced with a size factor, s i , which is calculated using the median ratio method 6,7 . The size factor is computed usinĝ where y gi is the read counts for the gene g, subject i, and n is the total number of subjects. The fraction in this expression is the ratio of a single observation from gene g to the geometric mean across all samples for that gene. Then, for any subject i, the size factor is the median ratio across all genes. In the case that y gi = 0, the corresponding gene is not used in the calculation of the median. Furthermore, zero valued counts are excluded from the product in the denominator to avoid division by zero. For each raw count, y gi , a corresponding scaled value, r gi is then computed by dividing the raw count by its respective size factor such that r gi = ygî s i . The dispersion parameter α g is estimated using a similar shrinkage estimation technique as edgeR which is outlined by Love, Huber, & Anders 7 .

Generalized Estimating Equations (GEE)
Suppose Y gij is the expression level of gene g in participant i at observation j. Then, like in GLM, in GEE modeling the relationship between the marginal mean, E(Y gij ) = µ gij and a row vector of covariates X ij can be represented using a link function h such that: where β g is, as before, a column vector of unknown regression coefficients. The variance in this model, V ar(Y gij ), does not have to be fully specified, but instead is of the form V ar(Y gij ) = v(µ gij )ϕ g , where v is a known function and ϕ g is a scale parameter, often unknown 8 . In order to account for the correlation between observations, a structure detailing the association between measurements within a subject must be specified. This structure, R gi , is called a working correlation structure and can take the same form as covariance structures used in other types of longitudinal modeling such as an exchangeable structure (covariance parameter is assumed to be the same between all observations within a subject), or an independent structure (covariance parameters between each pair of observations are estimated independently). For a subject i with m observations, if we let A gi = diag(v(µ gi1 ), ..., v(µ gim )), then the covariance matrix for Y gi is specified by Estimates for the regression coefficientsβ g can be obtained using µ g and V g . Let D gi = ∂µ gi /∂β t g . Then estimates for the regression coefficients,β g can be found by solving the following equation (known as the generalized estimating equation): where n represents the number of subjects or clusters. The covariance matrix forβ g is typically estimated using robust (sandwich) estimators so that the estimates are robust to misspecifications of the working correlation matrix. If we estimate the covariance matrix of Y gi using then the sandwich estimates for the covariance matrix forβ g are found using where For this study we use a Poisson distribution to specify the mean of the data with the variance multiplied by a scale parameter to account for overdispersion. Thus, then E(Y gij ) = µ gij and V ar(Y gij ) = ϕ g µ gij . Often, it is assumed that the correlation between each pair of observations is the same and so an exchangeable working correlation structure is used such that:

Wang and Long's Small Sample Estimators
When applying GEE, the estimated covariance matrix for Y gi is typically based only on data from the ith subject. Liang and Zeger 9 noted that given a common correlation structure across all subjects, then an estimated common correlation matrix R gc can be found using Pan 10 suggested that if we can assume that there is common correlation structure across all subjects and that the variance for Y gi is correctly specified, then an alternate estimate for the covariance matrix of Y gi is given by This estimate is more efficient than the original covariance matrix estimator because it uses information from all subjects instead of just one. Therefore, when Cov P (Y gi ) is used in place of Cov(Y gi ) when estimating the sandwich estimators, the resulting estimators for the covariance matrices of β g are also more efficient. This is particularly true for small sample sizes.
In proposing a small sample estimator for the covariance matrix ofβ g , Mancl and DeRouen 11 presented a method for correcting the bias in the covariance estimator for Y gi . The covariance estimator in Equation 7 is based on the residuals r gi = Y gi −μ gi . However,μ gi is generally closer to Y gi than µ gi , resulting in biased residuals. Thus, using the originally proposed estimator for Cov(Y gi ) when estimating the covariance matrix ofβ g results in estimates that are too small. This problem becomes exaggerated in small sample sizes.
To adjust for this bias, first note that the expected value of Cov(Y gi ) = r gi r T gi can be approximated using where I i is the m × m identity matrix for subject i with m observations, and It follows that the covariance matrix for Y i can be estimated using When Cov M (Y gi ) is used in place of Cov(Y gi ) in Equation 9, the resulting covariance estimates forβ g are less biased, especially in small sample sizes. Wang and Long 8 sought to draw from the work of both Pan 10 and Mancl and DeRouen 11 by creating an estimator for the covariance of Y gi that both corrected for bias and polled information from all subjects/clusters. In Equation 13 the covariance of Y gi is estimated using information from all subjects/clusters, but, the bias of the residuals is not accounted for. This can be accomplished by replacing (Y gi − µ)(Y gi − µ) T with Mancl's biased corrected estimator in Equation 16. Thus, the covariance estimator would be to generate a modified estimator for the covariance ofβ g .

Negative Binomial Mixed Models (NBMM)
GLMM is an extension of generalized linear modeling (GLM) which allows the outcome variable to have a distribution other than the normal distribution, while also allowing for the inclusion of random effects. A link function h, such as a log function, is applied to the expected value of the outcome in order to make it continuous, and h(µ), where µ is the expected value of the outcome, can be modeled linearly with fixed and random effects. Because RNA-Seq data are over-dispersed counts, they are often modeled using a negative binomial distribution. Thus, in using a GLMM to model correlated RNA-Seq data, a negative binomial mixed model (NBMM) can be used. If we assume, as previously, that we have n subjects and m observations per subject and that Y gij is the expression value for gene g from subject i at observation j, then using a log link function, we can write where α g is the dispersion parameter such that V ar(Y gij ) = µ gij + α g µ 2 gij . All other notation is the same as outlined for equations 28-30.

Maximum Likelihood Approach: Laplace Approximation
Because NBMM's are not linear, the likelihood function often involves high dimensional integrals to which there are no closed form solutions 12 . Thus, in using a maximum likelihood approach, we must use approximation techniques. One such technique is automatic differentiation (AD), which utilizes a Laplace approximation to estimate model parameters. Fournier et al. 13 constructed an AD model builder (ADMB), a series of tools built specifically for likelihood based parameter estimation and model building for non-linear models.

Maximum Likelihood Approach: Adaptive Gaussian Quadrature
Adaptive Gaussian Quadrature is an alternative to Laplace approximation which still uses a maximum likelihood estimation approach. This method is often more accurate, but also more computationally expensive than the Laplace method. 14 . More details on this approximation technique and its application to RNA-seq data can be found in the recent work of Tsonaka & Spitali 14 .

Pseudo-Likelihood Approach
Estimating negative binomial mixed model (NBMM) parameters using a pseudo-likelihood approach (NBMM-PL) involves computing linearized "pseudo-data" and model weights and then iteratively alternating between fitting a weighted LMM using the pseudo-data and recomputing the pseudo-data and weights until model convergence.
The basis of this approach relies on the fact that the negative binomial model can be approximated by a weighted normal model 15 . Suppose Y gij is the expression level of gene g in subject i at observation j with E(Y gij ) = µ gij . An NBMM can be specified by: where X ij and Z ij are row vectors of fixed random effects respectively, β g is a column vector the gene specific fixed effect coefficients, and b gi is a column vector of random effects for gene g and subject i. We assume that for each gene g, the subject-specific random effects have a multivariate normal distribution with mean 0 and variance-covariance matrix Σ bg . Finally, ρ ij is an offset for subject i at observation j. Then where log(µ gij ) = η gij = X ij β g + Z ij b gi and w gij is a weight for gene g, subject i, and observation j. If P g ∼ N (η g , w g −1 σ 2 g ) represents a set of "pseudo-data" for gene g, and P g is modeled using a weighted LMM such that then this model can be used to approximate a NBMM for Y g ∼ N B(µ g , α g ).
In the weighted LMM model, both the calculation of the pseudo-data (P g ) and the weights (w g ) depend on the parameters η g and α g , which are both unknown. Thus, in order to fit an approximate model using the pseudo-data and weights, the following iterative approach is used: 1. A negative binomial GLM is fit in order to obtain initial estimates for η g and α g .
2. Pseudo-data (P g ) and the weights (w g ) are calculated using the estimates for η g and α g .

3.
A weighted LMM is fit using the current values of P g and w g .

4.
Using the results from the weighted LMM, a new estimate for η g is computed as well as a new estimate for α g , which is found using a Newton-Raphson algorithm.

Linear Mixed Models (LMM)
Assume we have RNA-Seq count data from n subjects and m timepoints. Let Y gij be the expression value for gene g from subject i at time point or observation j, then where µ gij is the mean expression value for gene g in subject i at observation j and σ 2 g is the variance for gene g. β g is a column vector of fixed effect regression coefficients for gene g, X ij is a row vector of fixed effects covariates for subject i at observation j, Z ij is a row vector of random effects covariates for subject i at observation j, and b gi is a column vector of random effects for gene g and subject i. We assume that for each gene g, the subject-specific random effects have a multivariate normal distribution with mean 0 and variance-covariance matrix Σ bg .
Of course, this model assumes a normally distributed, continuous outcome, but RNA-Seq data are overdispersed counts. Furthermore, we see from Equation 28 that for any given gene g, the variance σ 2 g is constant across all subjects and observations. However, there is typically a mean-variance relationship in RNA-Seq data, with larger mean expression values corresponding to larger variance. One approach to mitigate these issues is applying a transformation to the RNA-Seq counts. One such transformation is the variance-stabilizing transformation in the DESeq2 package 7 . This transformation both scales RNA-Seq counts and makes the variance approximately independent of the mean 6 . To scale the data to account for differences in sequencing depth between samples, a size factor,ŝ ij , for subject i, timepoint j is computed using 3.
Using the scaled data, an additional transformation is performed to eliminate the mean-variance relationship in the data. If we let v(µ gij ) be the mean variance relationship estimated by DESeq2, then the variance stabilizing transformation for r gij is given by Given that there are not a large number of zero or low count genes, the VST transformation generally results in data that is roughly normal and with a variance that is approximately the same across all values 6,16 . However, if there are a large number of zero or low count genes, the variance may not stabilize even after performing VST and thus a linear model may not be appropriate for the transformed data 17 . Assuming that the model assumptions are met for the VST counts, we can use the LMM outlined in equations 28-30 to model the transformed data.

Satterthwaite Degrees of Freedom
The Satterthwaite method uses Satterthwaite's method of moments technique to approximate the DF 18,19 . This method can be used to approximate DF for t-tests and denominator DF for F-tests 20,21,22 . Suppose that we are testing the hypothesis H 0 : Lβ = 0, where L is a contrast matrix and β is a vector of fixed effect parameters for a LMM. If Rank(L) = 1, then a t-test is appropriate, and otherwise an F-test can be used. In this context, the covariance matrix of β is where X is a vector of fixed effects covariates and V(θ) is the variance of the outcome for the LMM with θ representing the parameters for the fixed effect variance and residual error variance. This covariance matrix can be estimated byĈ = C(θ). If Rank(L) = 1, the t-statistic for this hypothesis would be If t is distributed according to a t-distribution with q DF, then, based on the relationship between the t-distribution and χ 2 distribution, the distribution of the statistic can be approximated using a χ 2 q distribution. Using Satterthwaite's method of moments technique, we can equate the variance of the test statistic, χ 2 , to the theoretical variance of a χ 2 q distribution and then solve for q. Thus The process for finding the DF when Rank(L) > 1 is similar to the process for when Rank(L) = 1. Suppose that p = Rank(L). Then, the distribution of the statistic can be approximated by an F p,q distribution, where p and q are the numerator and denominator DF respectively. Then using Satterthwaite's method of moments technique, the approximation for q can be found by setting the estimated expected value of F equal to the theoretical expectation for an F p,q distribution: We also use Satterthwaite DF approximation for the NBMM-PL models.

P-value Distribution
Difference between groups at any timepoint Difference between any timepoints in the treatment group Any significant interaction effect Any significant coefficient   Figure 5: Heatmap of predicted gene expression (row scaled) across the three study timepoints for genes that were significant in a test for differential expression between any two timepoints in the SS group. Predicted values and significance results came from the LMM analysis. Genes are clustered using a correlation distance metric and complete linkage clustering methods and are split into seven clusters indicated by the color bars along the rows.