Design and analysis of outcomes following SARS-CoV-2 infection in veterans
BMC Medical Research Methodology volume 23, Article number: 81 (2023)
Understanding how SARS-CoV-2 infection impacts long-term patient outcomes requires identification of comparable persons with and without infection. We report the design and implementation of a matching strategy employed by the Department of Veterans Affairs’ (VA) COVID-19 Observational Research Collaboratory (CORC) to develop comparable cohorts of SARS-CoV-2 infected and uninfected persons for the purpose of inferring potential causative long-term adverse effects of SARS-CoV-2 infection in the Veteran population.
In a retrospective cohort study, we identified VA health care system patients who were and were not infected with SARS-CoV-2 on a rolling monthly basis. We generated matched cohorts within each month utilizing a combination of exact and time-varying propensity score matching based on electronic health record (EHR)-derived covariates that can be confounders or risk factors across a range of outcomes.
From an initial pool of 126,689,864 person-months of observation, we generated final matched cohorts of 208,536 Veterans infected between March 2020-April 2021 and 3,014,091 uninfected Veterans. Matched cohorts were well-balanced on all 37 covariates used in matching after excluding patients for: no VA health care utilization; implausible age, weight, or height; living outside of the 50 states or Washington, D.C.; prior SARS-CoV-2 diagnosis per Medicare claims; or lack of a suitable match. Most Veterans in the matched cohort were male (88.3%), non-Hispanic (87.1%), white (67.2%), and living in urban areas (71.5%), with a mean age of 60.6, BMI of 31.3, Gagne comorbidity score of 1.4 and a mean of 2.3 CDC high-risk conditions. The most common diagnoses were hypertension (61.4%), diabetes (34.3%), major depression (32.2%), coronary heart disease (28.5%), PTSD (25.5%), anxiety (22.5%), and chronic kidney disease (22.5%).
This successful creation of matched SARS-CoV-2 infected and uninfected patient cohorts from the largest integrated health system in the United States will support cohort studies of outcomes derived from EHRs and sample selection for qualitative interviews and patient surveys. These studies will increase our understanding of the long-term outcomes of Veterans who were infected with SARS-CoV-2.
The U.S. faces continued infections with Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) following earlier waves in 2020 and 2021. Numerous studies have examined short-term symptoms, hospitalization, and death [1, 2] among patients infected with SARS-CoV-2 both before and after vaccine availability. The coronavirus disease 2019 (COVID-19) pandemic caused unprecedented disruptions across a wide range of domains relevant to health and health care including to health systems and care processes, informal family support, long-term care, safety net programs, social organizations, and economic stability, making it challenging to isolate the direct impact of infection with SARS-CoV-2 on individual health. Thus, more work is needed to characterize the direct long-term health effects of SARS-CoV-2 infection distinct from these systemic disruptions.
To research the long-term health consequences of COVID-19, the U.S. Department of Veterans Affairs (VA) provided resources in May 2021 to create the COVID-19 Observational Research Collaboratory (CORC). The main purposes of this program are to investigate long-term health outcomes associated with SARS-CoV-2 infection using national electronic health record (EHR) data and to conduct survey research assessing factors not well captured by the EHR. The focus of this paper is the study design used to facilitate EHR- and survey-based research through construction of cohorts of Veterans with SARS-CoV-2 infection and well-matched controls, with matching on a wide range of covariates potentially associated with the exposure (SARS-CoV-2 infection) and outcomes. The creation of these populations will enable study of outcomes associated with this infection and potential mediating factors using observational research methods applied to both retrospective EHR and prospective survey-based data.
To best identify potential causal associations, we selected the target trial emulation approach to evaluate the causal effect of SARS-CoV-2 infection on long-term health-related outcomes. This approach is intended to minimize sources of bias, including observed and unobserved confounding and immortal time bias in estimating the effect of SARS-CoV-2 infection [3, 4]. To address observed confounding and selection bias, we matched Veterans infected with SARS-CoV-2 to similar contemporaneous uninfected Veterans before comparing a broad array of outcomes available in the VA’s comprehensive longitudinal EHR.
Analyses based on the EHR will be supplemented with a prospective longitudinal survey in which a subset of matched cohorts of infected and uninfected persons will be invited to participate. Longitudinal survey responses will provide more detailed information on patient-reported outcomes that will complement outcomes ascertained from the EHR. Selection and matching of survey samples will be conducted in such a way as to ensure that covariate balance is maintained in the survey subsample. In addition, source data from the EHR and survey will be used to guide purposive sampling for a prospective qualitative study to understand the diversity of experiences of Veterans infected with SARS-CoV-2.
Designing a matched cohort study to address a wide array of EHR-based outcomes and embedded survey subsets requires a more inclusive consideration of confounding than when estimating the effect of SARS-CoV-2 infection on a single outcome. This protocol paper describes the design and methodological approach to identify a matched cohort of comparable patients infected and not infected with SARS-CoV-2. This matched cohort will be used in future research to analyze the effect of SARS-CoV-2 infection on clinical, functional, and economic outcomes among Veterans. This work could also inform and support efforts by other groups interested in creating matched cohorts to address a wide range of unanswered questions related to SARS-CoV-2 infection.
Study design and data
We designed a retrospective cohort study of EHR-based outcomes with a non-equivalent comparator of uninfected Veterans. To facilitate measurement of patient-reported outcomes, this retrospective cohort is paired with an embedded smaller post-only survey-based prospective cohort study. In both components, comparator non-equivalence was reduced by generating matched cohorts.
As described previously, we assembled a cohort of VA enrollees who tested positive for SARS-CoV-2 RNA in a respiratory specimen within the VA system based on polymerase chain reaction (PCR) tests, as well as those with evidence of SARS-CoV-2 infection identified outside the VA but documented in VA records, as identified by the VA National Surveillance Tool between March 1, 2020 and April 30, 2021. The earliest date of a documented positive test was taken as each patient’s date of infection. We included only those Veterans who had an assigned VA primary care team (e.g., Patient Aligned Care Team) or at least one VA primary care clinic visit in the two-year period prior to infection to minimize missingness in EHR-based covariates that are generated from health system interaction. Cohorts were identified sequentially on a monthly basis, with assignment to a particular month for cases based on the date of the positive test or documentation in notes of non-VA evidence of infection. VA-enrolled Veterans without a positive test prior to or during the month who met the same inclusion criteria were considered uninfected potential comparators for that month. The uninfected control group members were eligible for repeated sampling and matching with replacement until they had a positive test. To avoid misclassification of first infection date based on a positive test, infected Veterans with COVID-19-related diagnostic codes (ICD-10: B97.29, U07.1, U09.9, J12.82, Z86.16) listed in fee-for-service Medicare claims 15 or more days before their VA test were excluded. In addition, Veterans from the uninfected comparator group with any such diagnostic codes were excluded from sampling for matching in the month the COVID-19-related code arose and any months thereafter.
We developed 14 separate monthly patient cohorts, one for each month from March 2020 through April 2021, for the purpose of defining index dates and matching covariates. For example, the March 2020 cohort included all VA enrollees with an initial positive test during March 2020 and all VA enrollees who were alive as of March 1, 2020 and had not been infected prior to April 1, 2020. SARS-CoV-2-infected patients were included as potential comparator patients in months before infection. In a given month, uninfected Veterans could be matched to multiple infected Veterans in that same month, and uninfected Veterans could be included in multiple month-specific cohorts as long as they remained uninfected and continued to meet other eligibility criteria. To minimize immortal time bias, the index date was defined as the date of the earliest positive test for SARS-CoV-2-infected Veterans and as the 1st day of the relevant month for uninfected Veterans. Each patient’s index date served as the anchor for defining matching covariates (with covariate construction starting 14 days prior to the positive test date for infected patients), based on EHR data from the prior two years.
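The monthly cohort and index-date rules above can be illustrated with a minimal Python sketch. The function name `index_date` and the tuple it returns are hypothetical and not part of the study's code; the sketch also leaves out the other eligibility checks described in the text (being alive on the first of the month, primary care engagement, Medicare-claims exclusions).

```python
from datetime import date
from typing import Optional, Tuple


def next_month_start(year: int, month: int) -> date:
    """First day of the month after (year, month)."""
    return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)


def index_date(first_positive: Optional[date], year: int, month: int) -> Optional[Tuple[str, date]]:
    """Return (role, index date) for one Veteran in one monthly cohort.

    Returns None when the Veteran was already infected before the month began
    and so belongs to neither group for that month.
    """
    start = date(year, month, 1)
    end = next_month_start(year, month)
    if first_positive is None or first_positive >= end:
        # No positive test before the next month begins: potential comparator,
        # indexed on the 1st day of the month.
        return ("uninfected", start)
    if first_positive >= start:
        # First positive test falls inside the month: infected, indexed on the test date.
        return ("infected", first_positive)
    return None


# A Veteran first testing positive on 2 November 2020 is a potential comparator
# in March 2020 and an infected case in November 2020.
print(index_date(date(2020, 11, 2), 2020, 3))
print(index_date(date(2020, 11, 2), 2020, 11))
```

Anchoring comparators on the first day of the month, rather than on a later date chosen with knowledge of their follow-up, is what keeps their person-time from being conditioned on future survival.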
Our goal was to conduct many-to-one matching that would maximize retention of infected patients for external validity and covariate balance for internal validity. A priori, we defined a suitable matching strategy as one that would result in < 5% attrition of the infected cohort and achieve covariate balance among the selected covariates for matching based on standardized differences < 0.1.
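The balance criterion can be made concrete with a short Python sketch of the standardized difference. The text does not spell out the formula used, so the commonly used definition (difference in means or proportions divided by the square root of the average of the two group variances) is an assumption here, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def smd_continuous(infected, uninfected):
    """Standardized difference for a continuous covariate (assumed definition)."""
    pooled_sd = math.sqrt((variance(infected) + variance(uninfected)) / 2)
    return (mean(infected) - mean(uninfected)) / pooled_sd


def smd_binary(p_infected, p_uninfected):
    """Standardized difference for a binary covariate, given the two prevalences."""
    pooled_sd = math.sqrt(
        (p_infected * (1 - p_infected) + p_uninfected * (1 - p_uninfected)) / 2
    )
    return (p_infected - p_uninfected) / pooled_sd


def is_balanced(smd, threshold=0.1):
    """Apply the a priori criterion: absolute standardized difference below 0.1."""
    return abs(smd) < threshold


# Illustrative prevalences only (not study data): 30% vs. 28%.
print(is_balanced(smd_binary(0.30, 0.28)))
```

Because the standardized difference is scaled by the covariate's spread rather than by sample size, it remains a usable balance diagnostic in cohorts as large as this one, where significance tests would flag trivially small differences.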
Coarsened exact matching (CEM) was initially attempted. Covariates used for matching were derived iteratively at a single point in time (summer 2021) with the understanding that the evidence base about causes and consequences of COVID-19 was (and is) evolving rapidly. In collaboration with clinician-investigators (see left column, Appendix 1), we identified a broad list of demographic, clinical, and health care utilization measures hypothesized to be either risk factors for pre-specified outcomes alone (e.g., survival, depression, total VA costs, disability, healthcare-related financial strain due to high out-of-pocket costs) or confounders associated with both infection and outcomes.
To minimize sample loss when attempting to match on many covariates in CEM, the five physician principal investigators then worked together to prioritize covariates for the final matching specification (see right column, Appendix 1). Modified coarsened exact matching was then implemented using this prioritized set of covariates. However, a suitable exact match could not be identified for 53.7% of infected Veterans, so we turned instead to a form of combined exact and calendar time-specific propensity score matching, with cohorts identified by index month.
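To illustrate why exact matching on many coarsened covariates can leave infected patients unmatched, the following Python sketch coarsens two continuous covariates and requires an identical signature on every field. The covariates, bin widths, and function names are hypothetical and do not reproduce the study's CEM specification.

```python
from collections import defaultdict


def coarsen(person):
    """Reduce a person's covariates to a coarsened signature (illustrative bins)."""
    age_bin = person["age"] // 10 * 10  # 10-year age bands
    bmi = person["bmi"]
    bmi_bin = "under_25" if bmi < 25 else "25_to_30" if bmi < 30 else "30_plus"
    return (person["sex"], person["state"], age_bin, bmi_bin)


def cem_match(infected, uninfected):
    """Match each infected person to all uninfected people sharing the same signature."""
    strata = defaultdict(list)
    for control in uninfected:
        strata[coarsen(control)].append(control["id"])
    matches, unmatched = {}, []
    for case in infected:
        controls = strata.get(coarsen(case), [])
        if controls:
            matches[case["id"]] = controls
        else:
            unmatched.append(case["id"])
    return matches, unmatched


infected = [
    {"id": "i1", "sex": "M", "state": "NC", "age": 63, "bmi": 31.0},
    {"id": "i2", "sex": "F", "state": "WA", "age": 45, "bmi": 24.0},
]
uninfected = [
    {"id": "u1", "sex": "M", "state": "NC", "age": 68, "bmi": 33.5},
    {"id": "u2", "sex": "F", "state": "WA", "age": 52, "bmi": 24.5},
]
print(cem_match(infected, uninfected))
```

Every covariate added to the signature multiplies the number of possible strata, so the share of infected patients with no comparator in their stratum tends to grow quickly as the covariate list lengthens.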
In a two-step process, infected patients were exact matched to uninfected controls based on index month, sex, immunosuppressive medication use (binary), state of residence, and COVID-19 vaccination status (effective in January-April 2021 cohorts only) because these covariates were strong potential confounders. In the second step, a total of 37 binary, categorical, and continuous covariates were included in the propensity score model, including immunosuppressive medication use (binary), nursing home residence any time in the prior two years, vaccination status (January-April 2021 cohorts), and diagnosed CDC high-risk conditions: coronary heart disease, cancer (excluding non-metastatic skin cancers), chronic kidney disease, congestive heart failure, pulmonary-associated conditions (including asthma, COPD, interstitial lung disease, and cystic fibrosis), dementia, diabetes, hypertension, liver disease, sickle cell/thalassemia, solid organ or blood stem cell transplant, stroke/cerebrovascular disorders, substance use disorder, anxiety disorder, bipolar disorder, major depression, PTSD, and schizophrenia.
Other categorical variables in the propensity score model included sex, race, ethnicity, rurality of the Veteran’s home ZIP code, state of residence, smoking status, and categorization of two comorbidity scores (CAN, Nosos). Continuous covariates included age, body mass index (BMI), comorbidity score via Gagne index, distance from a Veteran’s home to nearest VA hospital, and four VA utilization measures (inpatient admissions, primary care visits, specialty care visits, mental health visits in the prior 2 years).
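The first (exact-matching) step can be sketched in Python as grouping patients into strata keyed on the exact-match variables, with vaccination status entering the key only for the January-April 2021 cohorts. The record layout, field names, and function names below are hypothetical.

```python
from collections import defaultdict

# Months in which vaccination status is part of the exact-match key.
VACCINE_MONTHS = {(2021, 1), (2021, 2), (2021, 3), (2021, 4)}


def exact_match_key(person):
    """Build the exact-match key; vaccination status is ignored before January 2021."""
    year_month = person["index_month"]  # (year, month) tuple
    vaccinated = person["vaccinated"] if year_month in VACCINE_MONTHS else None
    return (year_month, person["sex"], person["immunosuppressed"], person["state"], vaccinated)


def stratify(people):
    """Group patient ids into exact-match strata, split by infection status."""
    strata = defaultdict(lambda: {"infected": [], "uninfected": []})
    for person in people:
        role = "infected" if person["infected"] else "uninfected"
        strata[exact_match_key(person)][role].append(person["id"])
    return dict(strata)
```

Propensity score matching would then be carried out only among infected and uninfected patients who share a stratum, so balance on the exact-match variables holds by construction.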
A caliper of 0.2 times the pooled estimate of the standard deviation of the logit of the propensity score was used to bound which uninfected patients could be matched to each infected patient. To provide the survey team a sufficiently deep pool of matched controls to account for survey non-participation, the 25 matched uninfected patients closest in propensity score were retained for each infected patient. Infected patients with fewer than 25 matched uninfected patients had all their comparator patients selected as eligible matches. Matching was performed by the PSMATCH procedure from SAS/STAT 15.1 in SAS® 9.4M6 via the VA Informatics and Computing Infrastructure (VINCI) platform.
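The caliper rule can be sketched as follows. This Python illustration is a simple nearest-neighbor search with replacement inside a single exact-match stratum; it is not a reproduction of the SAS PSMATCH algorithm, and the function names, input layout (patient id mapped to propensity score), and example scores are assumptions.

```python
import math
from statistics import variance


def logit(p):
    """Log-odds of a propensity score strictly between 0 and 1."""
    return math.log(p / (1 - p))


def caliper_width(ps_infected, ps_uninfected, multiplier=0.2):
    """0.2 x pooled standard deviation of the logit of the propensity score."""
    var_infected = variance([logit(p) for p in ps_infected])
    var_uninfected = variance([logit(p) for p in ps_uninfected])
    return multiplier * math.sqrt((var_infected + var_uninfected) / 2)


def match_with_caliper(infected, uninfected, caliper, max_controls=25):
    """For each infected id, keep up to max_controls closest uninfected ids within the caliper.

    Controls are reused across infected patients (matching with replacement).
    """
    matches = {}
    for case_id, case_ps in infected.items():
        target = logit(case_ps)
        distances = [(abs(logit(ps) - target), ctrl_id) for ctrl_id, ps in uninfected.items()]
        within = sorted(pair for pair in distances if pair[0] <= caliper)
        matches[case_id] = [ctrl_id for _, ctrl_id in within[:max_controls]]
    return matches


infected = {"a": 0.5, "b": 0.9}
uninfected = {"c1": 0.5, "c2": 0.52, "c3": 0.1, "c4": 0.88}
width = caliper_width(infected.values(), uninfected.values())
print(match_with_caliper(infected, uninfected, width))
```

An infected patient whose list comes back empty has no comparator within the caliper and would count toward the unmatched group.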
Outcomes comparisons to be conducted
The EHR-based clinical outcomes that we intend to compare between matched cohorts are mortality, depression, suicide, onset of new clinical diagnoses, exacerbation of prevalent conditions, development of COVID-19 sequelae, and health care use and VA health care costs. The survey-based outcomes to be compared between matched cohorts include disability, healthcare-related financial strain, and health-related quality of life. Our default approach to analyses will be “per-protocol”, such that uninfected patients who cross over to become infected will be censored at the time of infection. Future analyses will account for this potentially informative censoring via inverse probability of censoring weights and/or censoring of the entire matched strata at time of censoring. The study team discussed inclusion of negative control outcomes, but an outcome expected to be null between comparators could not be identified due to the ubiquitous effects of SARS-CoV-2 infection and the conditioning of negative control outcomes on health care utilization that might be differential between comparators.
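The per-protocol censoring rule for a time-to-event outcome such as mortality can be sketched in Python as follows. The function name, argument names, and the fixed administrative end-of-study date are hypothetical, and the sketch does not implement the inverse probability of censoring weights mentioned above.

```python
from datetime import date
from typing import Optional, Tuple


def follow_up(
    index: date,
    end_of_study: date,
    death: Optional[date] = None,
    later_infection: Optional[date] = None,
) -> Tuple[int, bool]:
    """Return (days of follow-up, death observed) for one matched comparator.

    An uninfected comparator who later tests positive is censored on the
    infection date, so a death after that date is not counted as an event.
    """
    censor_at = end_of_study if later_infection is None else min(later_infection, end_of_study)
    if death is not None and death <= censor_at:
        return ((death - index).days, True)
    return ((censor_at - index).days, False)


# Comparator indexed 1 March 2020 who tests positive on 31 March 2020:
# follow-up stops at the infection date with no event recorded.
print(follow_up(date(2020, 3, 1), date(2021, 3, 1), later_infection=date(2020, 3, 31)))
```

Because crossover to infection is unlikely to be independent of prognosis, follow-up truncated this way is the potentially informative censoring that the planned weighting is meant to address.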
From a sampling frame of 231,160 Veterans who had documentation of at least one SARS-CoV-2 infection between March 2020 and April 2021, and 9,291,822 Veterans without evidence of infection over the same time period, we excluded patients who had neither a CAN score (i.e., were not assigned a PACT team) nor primary care use in the 24 months prior to index, or who had missing or implausible height, weight, or age (Fig. 1). We also excluded Veterans with missing ZIP codes or ZIP codes outside of Washington, D.C. or the 50 states, patients who were uninfected on the 1st of each month but became infected later in the same month (for the uninfected group), or had a prior infection documented in Medicare. Lastly, we excluded 776 (0.4%) of 209,312 infected patients who did not have a suitable match, which generated final matched cohorts of 208,536 infected and 3,014,091 uninfected Veterans (comprising 5,173,400 total person-months of follow-up because of matching with replacement). Unmatched infected patients are described in Appendix Table 2 and exhibited greater rates of missing information than those with suitable matches.
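As a quick arithmetic check, the matching-stage counts reported above can be compared against the a priori attrition criterion of < 5%. The variable names in this Python snippet are hypothetical; the counts are those given in the text.

```python
infected_before_matching = 209_312  # infected patients eligible for matching
unmatched_infected = 776            # infected patients without a suitable match

matched_infected = infected_before_matching - unmatched_infected
attrition = unmatched_infected / infected_before_matching

print(matched_infected)       # size of the final matched infected cohort
print(f"{attrition:.1%}")     # share of infected patients lost at the matching step
print(attrition < 0.05)       # a priori criterion: < 5% attrition
```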
As expected, the cohorts prior to matching were imbalanced in many covariates (Appendix Table 3). After matching, the cohorts were well-balanced on all covariates, based on standardized mean differences (SMD) < 0.1 (Table 1). The cohorts included Veterans from all 50 states and Washington, D.C. Most Veterans in the matched cohorts were male (88.3%), non-Hispanic (87.1%), white (67.2%), and living in urban areas (71.5%), with a mean (standard deviation, SD) age of 60.6 (16.4), BMI of 31.3 (6.6), and mean (SD) straight-line distance to the closest VA medical center of 35.8 (35.2) miles. A minority were current smokers (12.6%), 39.3% had never smoked and 42.5% were former smokers. Comorbidity was assessed several ways, including Gagne score (mean = 1.4, SD = 2.2), count of CDC high-risk conditions (mean = 2.3, SD = 1.9) and count of 5 mental health conditions prevalent in Veterans (mean = 0.9, SD = 1.0). The most common diagnoses were hypertension (61.4%), diabetes (34.3%), major depression (32.2%), coronary heart disease (28.5%), PTSD (25.5%), anxiety (22.5%), and chronic kidney disease (22.5%). Approximately 10% of matched cohort members had been prescribed one or more immunosuppressive medications within 24 months before the index date (qualifying medications listed in the Appendix Table 4). Of the 34.0% of the cohort with index dates between January-April 2021 when vaccinations became available, 3,153 Veterans (1.5% of the entire infected cohort) received at least one dose of a vaccine before their first positive COVID-19 test result.
In the 24 months prior to the index date, cohort members had a mean (SD) of 8.3 (10.2) primary care visits, 13.4 (14.9) specialty care visits, and 7.8 (21.9) mental health visits in VA. Over one-half (53.3%) of infected patients were drawn from three of the 14 months in the observation period (November 2020, December 2020, and January 2021).
Despite the very large sample size available for this research with 231,160 infected and just over 9 million uninfected Veterans, a strategy of bias reduction based on coarsened exact matching resulted in lack of an exact match for 53.7% of cases. Sample loss on this scale would have reduced statistical power and generalizability, since the exposure-disease associations may have differed between successfully matched and unmatched populations. The combined exact matching and propensity score approach, on the other hand, resulted in a much lower failure-to-match frequency of only 0.4%, with a high rate of success as assessed by SMDs < 0.1 for all matching covariates. The work performed by the CORC to identify important covariates on which to match cases and uninfected controls using propensity score methodology should facilitate the performance of causal research on long COVID-19 etiology in this population. Matched cohorts will be updated from May 2021 forward to be able to generate evidence on Veteran experience after April 2021. Given the ever-changing environment of variants, vaccination status, immunogenicity from prior exposure, and tests and treatments available, future analyses will consider period-specific effects and include individuals with antigen test-detected infections in these future cohorts.
As the nation’s largest integrated, publicly financed health system, the VA has the unique ability to track long-term outcomes among individuals infected with SARS-CoV-2 because it has a well-established comprehensive EHR that was developed around the mission of providing lifelong care for Veterans. In addition, Veterans are historically reliant on VA for care if they engage with the health system.
Our matching strategy defines the specific effect that will be estimated from our results. We considered historical controls of Veterans receiving care in the VA before the pandemic, but that would estimate the effect of individual SARS-CoV-2 infection combined with all the many other social and systemic disruptions that accompanied the pandemic. We also considered comparing Veterans hospitalized with SARS-CoV-2 infection to Veterans hospitalized with other conditions (e.g., influenza), which would be analogous to a randomized clinical trial with an active comparator. Such a comparator group asks whether COVID-19 hospitalization is worse than other sorts of hospitalizations. We reasoned that, for most Veterans, had they not developed COVID-19, they may not have been hospitalized with another condition that same month (although we did not exclude those hospitalized, so they do occur at whatever their natural frequency is in the comparator group).
We also did not wish to restrict to only hospitalized COVID-19 patients, as we took as a scientific question the relationship between initial severity of SARS-CoV-2 infection and subsequent outcomes—rather than presuming it by conditioning on initial severity. We also considered Veterans infected with other non-SARS-CoV-2 viruses. However, we noted the substantial body of evidence on sepsis and pneumonia—much of it viral in origin—that suggested such patients also have adverse long-term outcomes caused by non-SARS-CoV-2 viruses, including influenza. As such, we reasoned such comparators might underestimate the total individual effects survivors of COVID-19 would face and health systems would need to support. Each of these comparators may be of great interest to other research groups; they were not, however, our primary focus. Our goal was to generate matched cohorts to support cohort studies of EHR-derived outcomes and sample selection for qualitative interviews and patient surveys.
The retrospective cohort study described here is subject to several limitations. First, cohort matching results in sample loss that may reduce generalizability of results compared to weighting methods, although we were able to retain > 99% of the infected patients in the sample after matching. Second, there is likely some contamination of the uninfected comparator group with Veterans with undiagnosed SARS-CoV-2 infection or who tested positive for SARS-CoV-2 with test results not available from private insurers, Medicare Advantage plans, Medicaid, or other community sources. Third, covariate specification for matching is based on our understanding of risk factors and confounders of SARS-CoV-2 infection as of summer 2021, and we are unable to measure all risk factors via administrative data. Specifically, unmeasured confounders such as employment, income, or other social vulnerability indicators may be imbalanced between matched groups and could confound the association between infection and outcomes. Fourth, results may not generalize to Veterans who became infected after April 2021 or to non-Veterans. CORC will update matched cohorts from May 2021-March 2022, and that work is ongoing.
Our understanding of the long-term outcomes of Veterans who were infected with SARS-CoV-2 will be gleaned from qualitative interviews, population-based surveys, and cohort studies of outcomes derived from EHRs. This study will explore all these approaches, all framed in the context of the matched cohorts generated from EHR data from the largest integrated health system in the U.S. Due to Veterans’ reliance on VA for care and eligibility for care once enrolled, we will be able to evaluate clinical and economic outcomes following their acute SARS-CoV-2 infection, as long-term outcomes two years after the onset of the pandemic are now being realized.
The datasets generated and/or analyzed during the current study are not publicly available due to Department of Veterans Affairs data restrictions prohibiting sharing. Contact the corresponding author, Dr. Matthew Maciejewski, for data requests.
CAN: Care Assessment Need
CDC: Centers for Disease Control and Prevention
COPD: Chronic obstructive pulmonary disease
CHD: Coronary heart disease
HIV: Human immunodeficiency virus
CKD: Chronic kidney disease
CHF: Congestive heart failure
SUD: Substance use disorder
SMI: Serious mental illness
PTSD: Post-traumatic stress disorder
VHA: Veterans Health Administration
CBOC: VA community-based outpatient clinic
Al-Aly Z, Xie Y, Bowe B. High-dimensional characterization of post-acute sequelae of COVID-19. Nature. 2021;594(7862):259–64.
Groff D, Sun A, Ssentongo AE, Ba DM, Parsons N, Poudel GR, et al. Short-term and long-term rates of Postacute Sequelae of SARS-CoV-2 infection: a systematic review. JAMA Netw Open. 2021;4(10):e2128568.
Hernan MA, Alonso A, Logan R, Grodstein F, Michels KB, Willett WC, et al. Observational studies analyzed like randomized experiments: an application to postmenopausal hormone therapy and coronary heart disease. Epidemiology. 2008;19(6):766–79.
Hernan MA, Robins JM. Using Big Data to emulate a target Trial when a Randomized Trial is not available. Am J Epidemiol. 2016;183(8):758–64.
Ioannou GN, Ferguson JM, O’Hare AM, Bohnert ASB, Backus LI, Boyko EJ, et al. Changes in the associations of race and rurality with SARS-CoV-2 infection, mortality, and case fatality in the United States from February 2020 to March 2021: a population-based cohort study. PLoS Med. 2021;18(10):e1003807.
Suissa S. Immortal time bias in pharmaco-epidemiology. Am J Epidemiol. 2008;167(4):492–9.
Austin PC. A Tutorial and Case Study in Propensity score analysis: an application to estimating the Effect of In-Hospital Smoking Cessation Counseling on Mortality. Multivar Behav Res. 2011;46(1):119–51.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Sturmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163(12):1149–56.
Ripollone JE, Huybrechts KF, Rothman KJ, Ferguson RE, Franklin JM. Evaluating the utility of coarsened exact matching for Pharmacoepidemiology using real and simulated Claims Data. Am J Epidemiol. 2020;189(6):613–22.
Mack CD, Glynn RJ, Brookhart MA, Carpenter WR, Meyer AM, Sandler RS, et al. Calendar time-specific propensity scores and comparative effectiveness research for stage III colon cancer chemotherapy. Pharmacoepidemiol Drug Saf. 2013;22(8):810–8.
Centers for Disease Control and Prevention. Underlying Medical Conditions Associated with Higher Risk for Severe COVID-19: Information for Healthcare Professionals. Atlanta, GA: Centers for Disease Control and Prevention; 2022. Available from: https://www.cdc.gov/coronavirus/2019-ncov/hcp/clinical-care/underlyingconditions.html.
Wang L, Porter B, Maynard C, Evans G, Bryson C, Sun H, et al. Predicting risk of hospitalization or death among patients receiving primary care in the Veterans Health Administration. Med Care. 2013;51(4):368–73.
Wagner TH, Upadhyay A, Cowgill E, Stefos T, Moran E, Asch SM, et al. Risk Adjustment Tools for Learning Health Systems: a comparison of DxCG and CMS-HCC V21. Health Serv Res. 2016;51(5):2002–19.
Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat. 2011;10(2):150–61.
Buchanan AL, Hudgens MG, Cole SR, Lau B, Adimora AA, Women’s Interagency HIV Study. Worth the weight: using inverse probability weighted Cox models in AIDS research. AIDS Res Hum Retroviruses. 2014;30(12):1170–7.
The study was supported by the U.S. Department of Veterans Affairs HSR&D grant C19 21–278 and C19 21–279. MM and DMH were also supported by a senior Research Career Scientist award from the Department of Veterans Affairs (RCS 10–391 to M.M. and RCS 21–136 to D.M.H.) and by the Durham VA Center of Innovation to Accelerate Discovery and Practice Transformation (CIN 13–410).
Ethics approval and consent to participate
Initial and continuing reviews were approved by the Durham Veterans Affairs Institutional Review Board and Research and Development Committee. All methods were carried out in accordance with relevant guidelines and regulations. No informed consent was obtained because the project is a secondary analysis of data. The Durham VA Institutional Review Board granted waivers for informed consent, HIPAA information, and HIPAA research on decedents.
Consent for publication
The authors declare that they have no competing interests.
Role of the sponsor
The Health Services Research and Development Service, Department of Veterans Affairs had no role in the design, conduct, collection, management, analysis, or interpretation of the data; or in the preparation, review, or approval of the manuscript. The opinions expressed are those of the authors and not necessarily those of the Department of Veterans Affairs, the United States Government, Duke University, the University of Washington, the University of Michigan, Oregon Health & Science University (OHSU), Portland State University, and Oregon State University.
The original online version of this article was revised: the authors requested to update the number of covariates in Abstract section from 39 to 37 covariates. In section “Matching specification”, 39 has been changed to 37 (fourth paragraph), and “count of CDC high-risk conditions, count of mental health conditions,” has been removed (fifth paragraph).
Cite this article
Smith, V.A., Berkowitz, T.S.Z., Hebert, P. et al. Design and analysis of outcomes following SARS-CoV-2 infection in veterans. BMC Med Res Methodol 23, 81 (2023). https://doi.org/10.1186/s12874-023-01882-z