Psychometric properties of measures of substance use: a systematic review and meta-analysis of reliability, validity and diagnostic test accuracy

Background Synthesis of psychometric properties of substance use measures to identify patterns of use and substance use disorders remains limited. To address this gap, we sought to systematically evaluate the psychometric properties of measures to detect substance use and misuse. Methods We conducted a systematic review and meta-analysis of literature on measures of substance classes associated with HIV risk (heroin, methamphetamine, cocaine, ecstasy, alcohol) that were published in English before June 2016 and that reported at least one of the following psychometric outcomes of interest: internal consistency (alpha), test-retest/inter-rater reliability (kappa), sensitivity, specificity, positive predictive value, and negative predictive value. We used meta-analytic techniques to generate pooled summary estimates for these outcomes using random effects and hierarchical logistic regression models. Results Findings across 387 papers revealed that overall, 65% of pooled estimates for alpha were in the range of fair-to-excellent; 44% of estimates for kappa were in the range of fair-to-excellent. In addition, 69, 97, 37 and 96% of pooled estimates for sensitivity, specificity, positive predictive value, and negative predictive value, respectively, were in the range of moderate-to-excellent. Conclusion We conclude that many substance use measures had pooled summary estimates that were in the fair/moderate-to-excellent range across different psychometric outcomes. Most scales were administered in English, within the United States, highlighting the need to test and validate these measures in more diverse settings. Additionally, the majority of studies had high risk of bias, indicating a need for more studies with higher methodological quality.


Background
Substance use, including illicit drug use and alcohol, is prevalent worldwide, with about 5% of adults using illicit substances [1] and 40% of adults consuming alcohol in the past year [2]. Moreover, the number of people with drug use disorders was estimated at 62 million, while the number of individuals with alcohol use disorders was estimated at 100.4 million in 2016 [3]. Substance use disorders are associated with substantial morbidity and mortality globally. Illicit drug use disorders accounted for 20 million disability-adjusted life years (DALYs) lost [4], while alcohol use disorders accounted for 85 million DALYs lost in 2012 [5]. Specific classes of substances also play an important role in HIV risk, including needle sharing and sexual risk behaviors, and have been linked to HIV incidence [6-15]. Among people living with HIV (PLWH), substance use disorders may lead to less optimal HIV care outcomes because of their associations with a lower likelihood of being linked to HIV care, being retained in care, receiving antiretroviral therapy (ART), having high ART adherence, and having an undetectable HIV viral load [9,10,16-18].
Given the role of substance use in the global burden of disease and the overlap between use of specific substances and HIV, it is important for clinicians and researchers to have tools with high reliability, validity, and diagnostic accuracy [19]. Yet too few clinicians and researchers use measures with known psychometric properties when assessing substance use. Currently, there are a myriad of standardized questionnaires used to screen for substance use and misuse that require patients to self-report patterns of use and substance-related problems. Examples such as the Alcohol Use Disorders Identification Test and the Drug Use Disorders Identification Test [20,21] provide scores that correspond with severity of substance use and related problems. There are still no biological measures that define a substance use disorder; existing biological measures are considered to be indirect correlates of use disorders [22]. Examples include alcohol biomarkers like Carbohydrate-Deficient Transferrin (CDT) and Gamma Glutamyl Transferase (GGT), which are used to screen for alcohol dependence and heavy drinking, respectively [22]. There is a great need to evaluate the psychometric performance of these measures and markers across studies in settings of HIV to elucidate their overall validity, reliability, and diagnostic accuracy.
One approach to informing the use of psychometric measures in research and clinical care is to pool the psychometric characteristics of measures across studies using meta-analytic techniques, which generate summary estimates of the validity, reliability, and diagnostic accuracy of different questionnaires [23-27]. However, synthesis of psychometric properties of substance use measures to identify patterns of use and substance use disorders remains limited, with few exceptions [21,28,29]. One meta-analysis focused on the accuracy of self-reported assessments to diagnose alcohol and cannabis use disorders found that instruments had a pooled sensitivity of 0.88 and a pooled specificity of 0.90 among pediatric emergency department patients [28]. Another meta-analysis observed that studies with single questions to identify alcohol use disorders in primary care had a pooled sensitivity of 0.54 and a pooled specificity of 0.87, while two-question measures had a pooled sensitivity of 0.87 and a pooled specificity of 0.80 [29]. More commonly, however, reviews on substance use measures present psychometric data in a descriptive fashion [19,30,31]. Therefore, more rigorous efforts to systematically pool the psychometric properties of substance use measures are needed to establish the overall performance and accuracy of these tools and point toward their utility in future research.
To address these gaps, we conducted a systematic review and meta-analysis of literature to identify studies that have reported validity and reliability of substance use measures and pooled these measures using meta-analytic techniques. For the purposes of this review, we targeted our search for measures of substance classes previously associated with HIV risk. Specifically, we focused our review on measures for the following: alcohol, methamphetamine and amphetamine, cocaine, heroin, and ecstasy, regardless of whether the study was conducted among a population at high risk for HIV. Additionally, we included measures that evaluated substance use in general (i.e., measures that did not differentiate between classes of substances) as long as those measures were inclusive of our targeted substance classes. This study's review question is: What are the summary reliability and validity (as measured by alpha and kappa coefficients) and diagnostic accuracy (as measured by sensitivity, specificity, positive predictive value, and negative predictive value) of various substance and alcohol measures used to screen for use and use disorders?

Search strategy
We conducted a systematic review of studies published prior to June 2016 on substance use measures indexed in electronic databases including PubMed, PsycINFO, and EMBASE. We developed Boolean search terms to capture substance use measures that have been previously associated with HIV risk, in consultation with a reference librarian from the University of California San Francisco who holds a master's degree in library and information science (MLIS). The following substance classes were included: alcohol, methamphetamine and amphetamine, cocaine, heroin, and 3,4-methylenedioxy-methamphetamine (MDMA; "ecstasy"). Because the focus of this study was to pool psychometric properties of measures, we also included search terms related to validity, reliability, and diagnostic accuracy (i.e., alpha, kappa, sensitivity, specificity, positive predictive value, negative predictive value). Search terms included MeSH headings related to our research question, general terms related to substance use and psychometric properties of interest, as well as specific terms referencing the names of well-known substance use measures. The search terms used are provided in the appendix. This review was registered in PROSPERO, the International Prospective Register of Systematic Reviews (study number: CRD42017058813).

Primary outcomes
We aimed to estimate the pooled summary estimates for the following psychometric outcomes: Cronbach's alpha, kappa, sensitivity, specificity, positive predictive value, and negative predictive value. We recognize that there are a number of measure characteristics that relate to validity [32]. However, to focus our review and facilitate the feasibility of completing this study, we have decided to restrict the scope of our validity measures to Cronbach's alpha. Descriptions for these outcomes are provided below:
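As an illustration of how these six outcomes are conventionally computed within a single study, the following minimal Python sketch uses the standard textbook formulas. It is not the authors' code; the function names, the 2x2 cell labels, and the example numbers are all hypothetical.

```python
from statistics import variance


def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table of
    index-test results against a reference (gold) standard."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among reference positives
        "specificity": tn / (tn + fp),  # true negatives among reference negatives
        "ppv": tp / (tp + fp),          # reference positives among test positives
        "npv": tn / (tn + fn),          # reference negatives among test negatives
    }


def cronbach_alpha(items):
    """Cronbach's alpha (internal consistency).
    `items` is a list of per-item score lists, one score per respondent."""
    k = len(items)
    item_variance_sum = sum(variance(scores) for scores in items)
    totals = [sum(respondent) for respondent in zip(*items)]
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def cohen_kappa(both_pos, first_only, second_only, both_neg):
    """Cohen's kappa for two binary ratings (two raters or test-retest)."""
    n = both_pos + first_only + second_only + both_neg
    observed = (both_pos + both_neg) / n
    # Chance agreement from the marginal totals of each rating.
    expected = (
        (both_pos + first_only) * (both_pos + second_only)
        + (second_only + both_neg) * (first_only + both_neg)
    ) / n ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(diagnostic_accuracy(tp=45, fp=10, fn=5, tn=90))
    print(cronbach_alpha([[1, 2, 3, 4], [2, 3, 4, 5]]))
    print(cohen_kappa(40, 10, 5, 45))
```

Each function returns a study-level estimate; the pooling of such estimates across studies is a separate step.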

Eligibility criteria
We searched for relevant publications that met all of the following inclusion criteria: 1) studies that reported one or more of the psychometric outcomes of interest; 2) studies that examined one or more substance use measures related to our substance classes of interest (i.e., alcohol, methamphetamine and amphetamine, cocaine, heroin, and ecstasy) or substance use in general (i.e., some measures do not differentiate between multiple substances or assess classes of substances all together); 3) publications written in English (note: studies that administered measures that were not in English were eligible as long as the publication was written in English). We excluded publications using the following exclusion criteria: 1) reporting insufficient information on reliability, validity and diagnostic accuracy for substance use measures/assessments (i.e., no numeric information on our psychometric outcomes, sample size); 2) articles that provide psychometric data for a measure/assessment that is not related to substance use (e.g., a study on internal consistency data on a depression scale among substance users); 3) articles and/or secondary data analyses that report reliability and validity data from a primary outcome paper that was already included in the review; 4) reviews, commentaries, case report studies and other publications with insufficient reporting of data; 5) substance use measures/assessments that focus on aspects other than actual substance consumption, dependence or substance use disorder (e.g., a study reporting validity of a self-efficacy scale for resisting substance use; a study that examines the underlying mechanisms of substance use among those who already have a substance use disorder); and 6) studies with psychometric properties that focus on substance classes outside the scope of our review (e.g., marijuana or tobacco).

Screening procedures
All citations (including their titles and abstracts) captured by the search strategy were imported into Covidence.org (Melbourne, Victoria), which allowed research team members to independently review and screen citations using a centralized, online database. Each title/abstract was screened by two members of a team comprising master-, doctoral-, and post-doctoral-level researchers trained in the study protocol (co-authors PP, DH, RC, DS, CM, PM, and FC), and citations that were coded as eligible by both reviewers were moved to the full-text review phase. The same process was then repeated for full-text articles. In the event of discrepancies between reviewers in either the title and abstract phase or the full-text phase, a third team member (GMS) reviewed the relevant documents and helped reconcile the differences. Articles that were deemed eligible in the full-text review stage were included in the data extraction phase described below.

Data extraction
Team members extracted data on the psychometric properties, scale and study characteristics, sample size, study sample characteristics/co-factors of interest (country where the study was conducted, number of sites, language in which the scale was administered, gender of participants included), cut-offs used, comparison measure/gold standard used, and other information relevant to the study, including information on study quality [33]. Some papers reported multiple data points for psychometric outcomes from different study populations (e.g., data disaggregated by sex or by research site). These data points were extracted as separate records only if the paper did not provide a single overall measure for the psychometric outcomes for the entire study sample, consistent with other analyses [24].

Assessment of bias risk
For studies reporting diagnostic measures (e.g., sensitivity and specificity), reviewers rated study quality using the Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies, QUADAS-2, guidelines [33], which includes quality rating questions on the study's patient selection, index test, reference standard, and flow and timing. For studies that did not include diagnostic accuracy measures, only relevant domains of QUADAS-2 were assessed, as appropriate (i.e., rating regarding the reference standard was not conducted). All extracted data were entered into an electronic questionnaire programmed in Qualtrics, and checked by another researcher (conducted by the same co-authors who screened citations, as well as co-author BK) to verify accuracy.

Data analyses
We calculated separate pooled summary estimates for each of the 37 substance use measures and also fitted separate models for each of the six psychometric outcomes for validity, reliability, and accuracy. For alpha, kappa, PPV and NPV, we pooled data across studies using DerSimonian-Laird random effects models, implemented in STATA version 13 (College Station, TX) [34]. Random effects meta-analysis models, as opposed to fixed-effects models, are preferred for pooling data from diagnostic accuracy tests since heterogeneity is presumed to exist across these studies [35]. Random effects models, which are considered the default models used in meta-analyses for diagnostic accuracy tests, synthesize the psychometric outcomes from separate studies into a weighted average effect size (pooled summary estimate), using inverse variance weighting, based on sample size, while taking into account the extent of the variability of the effect sizes observed in separate studies [35]. Additionally, for sensitivity and specificity, we used hierarchical logistic regression models, implemented using the metandi command in STATA, to account for the correlation between the two measures (i.e., trade-off between sensitivity and specificity) [36][37][38]. Since metandi requires a minimum of four observations to conduct a meta-analysis, we pooled measures with fewer than four records for sensitivity and specificity outcomes using the random effects models described for other outcomes, and noted this alternate approach in the results, as appropriate.
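The DerSimonian-Laird random effects approach described above can be sketched as follows. This is an illustrative Python version of the standard method-of-moments estimator, not the STATA routine used in the analysis; the function name, the use of a normal-approximation 95% interval, and the example inputs are assumptions, and the within-study variances are taken as given.

```python
import math


def dersimonian_laird(estimates, variances):
    """Pool study-level estimates with a DerSimonian-Laird random effects model.

    estimates: list of study-level estimates (e.g., alpha or PPV values)
    variances: list of within-study variances, same order
    Returns (pooled estimate, (lower, upper) 95% CI, tau^2).
    """
    # Fixed-effect (inverse variance) weights and pooled mean.
    weights = [1.0 / v for v in variances]
    weight_sum = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / weight_sum

    # Cochran's Q and the method-of-moments between-study variance tau^2.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = weight_sum - sum(w ** 2 for w in weights) / weight_sum
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random effects weights add tau^2 to each within-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_sum = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / re_sum
    se = math.sqrt(1.0 / re_sum)
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2


if __name__ == "__main__":
    # Hypothetical inputs: two studies with equal within-study variance.
    print(dersimonian_laird([0.7, 0.9], [0.01, 0.01]))
```

When the study estimates agree closely, tau^2 is truncated at zero and the result coincides with the fixed-effect inverse variance average.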

Classification and evaluation of pooled estimates
Qualitatively, pooled summary estimates for alpha and kappa were classified as "excellent" for estimates that were > 0.89, "good" for estimates that were between 0.85-0.89, "moderate" for estimates that were between 0.80-0.84, "fair" for estimates that were between 0.75-0.79, or "unsatisfactory" for estimates below 0.75, consistent with other studies [24,39].
Pooled summary estimates for sensitivity, specificity, positive predictive value and negative predictive value were classified as "excellent" for estimates that were > 0.89, "good" for estimates that were between 0.8-0.89, "moderate" for estimates that were between 0.6-0.79, and "low" for estimates that were < 0.6 [24,40].
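The two classification schemes above can be expressed as simple threshold functions. The sketch below is illustrative only: the function names are hypothetical, and rounding each estimate to two decimal places before applying the cut-offs is an assumption made here because the stated bands are given to two decimals.

```python
def classify_reliability(estimate):
    """Qualitative band for a pooled alpha or kappa estimate."""
    value = round(estimate, 2)  # assumption: bands apply to two-decimal values
    if value > 0.89:
        return "excellent"
    if value >= 0.85:
        return "good"
    if value >= 0.80:
        return "moderate"
    if value >= 0.75:
        return "fair"
    return "unsatisfactory"


def classify_accuracy(estimate):
    """Qualitative band for pooled sensitivity, specificity, PPV or NPV."""
    value = round(estimate, 2)  # assumption: bands apply to two-decimal values
    if value > 0.89:
        return "excellent"
    if value >= 0.80:
        return "good"
    if value >= 0.60:
        return "moderate"
    return "low"


if __name__ == "__main__":
    print(classify_reliability(0.96))  # excellent
    print(classify_reliability(0.84))  # moderate
    print(classify_accuracy(0.93))     # excellent
    print(classify_accuracy(0.74))     # moderate
```

Keeping the two schemes in separate functions mirrors the text: "fair" and "unsatisfactory" apply only to alpha and kappa, while "low" applies only to the diagnostic accuracy outcomes.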
For each pooled psychometric summary estimate, we calculated the I² statistic, which represents the percentage of total variation across studies that is attributable to heterogeneity rather than chance, to assess heterogeneity. We interpreted heterogeneity using I² benchmarks of 25% for low, 50% for moderate, and 75% for high heterogeneity [41]. We did not use standard meta-analysis tests for publication bias given the limitations of these tests for diagnostic test accuracy studies and the characteristics of our psychometric outcomes (e.g., truncated measures cannot fall below zero) [42]. As indicated in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy, using these tests is inappropriate because they will likely lead to a high false-positive rate for publication bias [35].
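For reference, the I² statistic is conventionally derived from Cochran's Q as I² = max(0, (Q - df) / Q) x 100. A minimal Python sketch of that calculation follows; the function name and example inputs are hypothetical, and this is not the STATA output used in the analysis.

```python
def i_squared(estimates, variances):
    """I^2 (in percent) from study-level estimates and within-study variances."""
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs: two studies with equal within-study variance.
    print(i_squared([0.7, 0.9], [0.01, 0.01]))
```

Because Q has an expected value of df when there is no true heterogeneity, I² is truncated at zero whenever Q falls below df.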

Screening and study inclusion
Study screening and inclusion are summarized in Fig. 1. In brief, in the identification stage, we identified 7555 references in the initial search, of which 208 were excluded as duplicates. In the title and abstract review phase, reviewers excluded 5854 studies that were deemed ineligible. Full-text reviews were conducted for the 1493 articles that were deemed eligible from title and abstract review. Of the full-text reviewed articles, 1105 studies were excluded for not meeting eligibility criteria.
The most common reasons for exclusion were: scales or measures that were outside the scope of the review (n = 386), lack of psychometric data on scales of interest (n = 140), lab or methods papers that were outside the scope of the review (n = 130), non-English language publications (n = 110), duplicate studies (n = 98), and psychometric outcomes that were outside the scope of the review (n = 79). In total, there were 387 unique studies included in the data extraction phase containing sufficient data on the outcomes for 37 scales (Table 1). Table 2 presents characteristics of the studies included in this meta-analysis. As mentioned, studies published in English were included in this review, regardless of the language in which the scales were administered. Among the 387 studies included, the most common language in which the scale/measure was administered was English (63%), followed by Spanish (9%), French (5%), Portuguese (3%), and Chinese (2%). A large proportion of studies were conducted in the United States (40%).

Study characteristics
The median sample size was 286 (range = 9-50,049). The vast majority of studies (83%) included men and women (n = 323). Additionally, 11% (n = 42) of the studies included samples comprising only men, while 5% (n = 20) included samples comprising only women. Most studies were published after 1999 (66%), with studies published between 2000 and 2009 accounting for 38% (n = 148) of the studies meta-analyzed, and studies published between 2010 and 2017 accounting for 28% (n = 110). Most studies involved a single study site (61%), while 39% were multi-site studies. Additionally, 72% of the studies involved convenience samples, 20% included random or probability-based samples, and 7% had other or unclear sampling strategies.

Assessment of bias in study quality
The risk of bias in the four QUADAS-2 domains for each study included in this meta-analysis is presented in Supplementary Table 1. The distribution of the QUADAS-2 domains for the entire study is summarized in Fig. 2. Of the studies included, 58% had a low risk of bias with respect to the patient population; 57% had a low risk of bias in the index test domain, 48% had a low risk of bias in the reference standard domain, and 72% had a low risk of bias for flow and timing. Overall, only 16% of studies had a low risk of bias across all four of these QUADAS-2 domains.

Table 1
Measure | Abbreviation | Studies (n) | Description

SELF-REPORTED MEASURES
Alcohol Dependence Scale | ADS | 3 | The Alcohol Dependence Scale (ADS) is an alcohol screening and assessment tool that provides a quantitative index of the severity of alcohol dependence. Developed with respect to the concept of alcohol dependence syndrome, the ADS comprises 25 items that assess withdrawal symptoms, alcohol tolerance, awareness of dependence, ability to control drinking, and the salience of drink-seeking behavior.
Addiction Severity Index | ASI | 4 | The Addiction Severity Index (ASI) is a structured interview for assessing alcohol and drug dependence. The ASI comprises 200 items across seven scales assessing past 30-day and lifetime alcohol use, drug use, medical problems, employment/support problems, legal problems, family/social problems, and psychological problems.
Addiction Severity Index-Alcohol (alcohol sub-scale) | ASI-A | 22 | The Addiction Severity Index-Alcohol (ASI-A) is the alcohol sub-scale of the Addiction Severity Index. It assesses frequency of past 30-day and lifetime alcohol use and intoxication, alcohol-related problems including withdrawal symptoms, and treatment experiences.
Alcohol Use Disorders Identification Test | AUDIT | | The Alcohol Use Disorders Identification Test (AUDIT) is a ten-question test developed by a World Health Organization-sponsored collaborative project to identify persons with hazardous and harmful patterns of alcohol consumption or alcohol dependence. It comprises questions on amount and frequency of alcohol consumed, alcohol dependence, and alcohol-related problems.
Alcohol Use Disorders Identification Test-Question 3 | AUDIT-3 | 16 | The Alcohol Use Disorders Identification Test-3 (AUDIT-3) is a brief alcohol screening instrument. Derived from the third question of the ten-item AUDIT developed by the World Health Organization, it consists of a single-item measure assessing heavy episodic drinking.
Alcohol Use Disorders Identification Test-C | AUDIT-C | 42 | The Alcohol Use Disorders Identification Test-Concise (AUDIT-C) is a brief alcohol screening instrument derived from the first three questions of the ten-item AUDIT developed by the World Health Organization. It assesses frequency of alcohol consumption, number of standard drinks consumed on a typical drinking day, and frequency of consumption of six or more drinks on one occasion.
Self-Administered Alcoholism Screening Test | SAAST | 4 | The Self-Administered Alcoholism Screening Test (SAAST) is a 35-item questionnaire to screen for alcohol dependence. It assesses problems related to alcohol in the following domains: loss of control, occupational and social disruption, physical consequences, emotional consequences, concern on the part of others, and family members with alcohol problems.
Semi-Structured Assessment for Drug Dependence and Alcoholism | SSADDA | | The Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA) is a screening instrument that assesses alcohol/drug abuse and dependence as well as other DSM-IV disorders throughout the lifetime. It was developed from the Semi-Structured Assessment for the Genetics of Alcoholism, and therefore includes questions on the onset and recency of individual alcohol/drug abuse and dependence symptoms, allowing temporal assessment of symptom clusters. Its format as a semi-structured interview lists questions to be read verbatim, but also allows the interviewer to add follow-up questions.
Severity of Dependence | SDS | 9 | The Severity of Dependence Scale (SDS) is a 5-item questionnaire used to measure the degree of dependence on different classes of drugs, with a focus on the psychological components of dependence.
Tolerance-Annoyance Cut Down Eye Opener | TACE | 9 | Tolerance-Annoyance Cut Down Eye Opener (TACE) is a 4-item screening tool to identify maternal prenatal problematic alcohol use.
Timeline Followback | TLFB | 5 | The Timeline Followback (TLFB) is a method that involves the use of a timeline (e.g., calendar) to ask individuals to estimate their daily alcohol and/or drug consumption retrospectively (e.g., 7 days, 2 years).

BIOMARKERS
Alanine aminotransferase | ALT | | Alanine aminotransferase (ALT) is a biomarker that indicates liver damage from different types of diseases and conditions. It is used as a clinical screening and monitoring tool to check for chronic alcohol use. ALT is an enzyme found mostly in the cells of the liver and kidney. When the liver is damaged, ALT is released into the blood. Elevated ALT in laboratory tests is indicative of heavy alcohol consumption and is often used to detect relapses.
Aspartate transaminase | AST | 31 | Aspartate transaminase (AST) is a biomarker that indicates liver damage from different types of diseases and conditions. It is used as a clinical screening and monitoring tool to check for chronic alcohol use. The concentrations of AST in the serum are normally low. However, if the liver is damaged, the liver cell (hepatocyte) membrane becomes more permeable and some of the enzymes leak out into the blood circulation. Elevated AST in laboratory tests is indicative of chronic alcohol abuse.
Aspartate transaminase, Alanine transaminase ratio | AST/ALT | 5 | AST and ALT are considered to be two of the most important tests to detect liver injury. The AST/ALT ratio is normally less than 1, as it is in other conditions, but becomes greater than unity during liver injury. Elevated AST/ALT in laboratory tests is indicative of chronic alcohol abuse.
Blood alcohol concentration | BAC | 5 | Blood Alcohol Concentration (BAC) levels represent the percentage of a person's blood that is alcohol. BAC is most commonly used as a metric of alcohol intoxication for legal or medical purposes. Its primary goal is to determine if alcohol has been consumed.
Carbohydrate deficient transferrin | CDT | 9 | Carbohydrate-deficient transferrin (CDT) is an alcohol biomarker that is used as a clinical screening and monitoring tool to identify heavy drinking. Transferrin is a glycoprotein produced in the liver that normally has 3-5 carbohydrate sidechains. Heavy alcohol use, however, inhibits the enzymes that regulate these sidechains, causing the transferrin to be carbohydrate deficient. Laboratory blood tests can detect elevated levels of CDT (%CDT), which are indicative of heavy alcohol consumption and are often used to detect relapses.
CDTect | CDTect | 35 | CDTect is a common method of using carbohydrate-deficient transferrin (CDT) to screen for heavy alcohol use.
Carbohydrate deficient transferrin + Mean corpuscular volume | CDT + MCV | 5 | Carbohydrate-deficient transferrin (CDT) and mean corpuscular volume (MCV) are two biomarkers commonly used to screen for heavy drinking. MCV is the average volume of red blood cells, which increase in size after 4 to 8 weeks of excessive alcohol intake. CDT is transferrin, a glycoprotein produced in the liver, that has become carbohydrate deficient. Heavy alcohol use prevents enzymes from properly regulating the carbohydrate side chains in transferrin, thus increasing the value of carbohydrate-deficient transferrin. Using the combined biomarkers of CDT and MCV, a patient must exceed the cut-offs of both biomarkers to be screened positive.
Gamma-Glutamyl Transferase | GGT | 68 | Gamma-Glutamyl Transferase (GGT) is an enzyme that, when elevated in serum, is reflective of liver damage. Consequently, clinical laboratory GGT tests are commonly used to detect and monitor excessive alcohol consumption. Elevated GGT levels typically correspond with continuous and chronic alcohol abuse as opposed to episodic heavy drinking.
Gamma-Glutamyl Transferase + Mean corpuscular volume | GGT + MCV | 10 | Gamma-Glutamyl Transferase (GGT) and mean corpuscular volume (MCV) are two biomarkers commonly used in screening for heavy alcohol intake. GGT is a type of enzyme that, when elevated in serum, is reflective of liver damage. MCV is the average volume of red blood cells, which increases after 4 to 8 weeks of excessive drinking. Used in combination, a patient must exceed the cut-offs for both GGT and MCV in order to be screened positive.
Ethyl glucuronide | EtG | 5 | Ethyl glucuronide (EtG) is a byproduct of the body's metabolization of alcohol, and can be detected in the hair for up to 90 days. Compared to a blood or urine analysis, a hair analysis for EtG provides a much longer

Pooled summary estimates: overall findings
The pooled summary estimates of psychometric properties of substance use measures (which are described in Table 1) are quantitatively and qualitatively summarized in Tables 2 and 3, respectively. Overall, 65% of pooled estimates for alpha were in the range of fair-to-excellent; 44% of estimates for kappa were in the range of fair-to-excellent. In addition, 69, 97, 37 and 96% of pooled estimates for sensitivity, specificity, positive predictive value, and negative predictive value, respectively, were in the range of moderate-to-excellent (Fig. 3).

Pooled summary estimates, by substance use measure
The pooled estimates and 95% confidence intervals for alpha, kappa, sensitivity, specificity, positive predictive value, and negative predictive value are shown in Table 2. Below we summarize the results of the pooled summary estimates alphabetically for each of the 37 substance use measures, grouping self-reported measures and biomarkers separately. The list of references for the studies meta-analyzed for each scale/measure is presented in Supplementary Table 2.

Self-reported measures
Alcohol dependence scale (ADS) The pooled alpha estimate for ADS (3 data points) was good: 0.90 (95%CI = 0.80-0.99) and there was high heterogeneity between studies (I² = 98.9%). The pooled sensitivity estimate for ADS (2 data points) was excellent: 0.95 (95%CI = 0.90-1.00) and there was low heterogeneity between studies (I² = 0%). The pooled specificity estimate (2 data points) was moderate: 0.64 (95%CI = 0.52-0.77) and there was moderate heterogeneity between studies (I² = 60.1%). There was insufficient data to calculate the pooled PPV and NPV estimates for ADS.
Addiction Severity Index (ASI) The pooled alpha estimate for ASI (3 data points) was good: 0.84 (95%CI = 0.81-0.87) and there was moderate heterogeneity between studies (I² = 38.5%). There was insufficient data to calculate pooled kappa, sensitivity, specificity, PPV, and NPV estimates.

Table 1 (continued)
Mean corpuscular volume | MCV | | Mean corpuscular volume (MCV) is the average volume of red blood cells, measured by multiplying a volume of blood by the proportion of the blood that is cellular, and then dividing the product by the number of red blood cells within that sample. The size of red blood cells increases after 4 to 8 weeks of excessive alcohol intake, making MCV effective as an alcohol biomarker. MCV is neither very sensitive as a standalone measure nor specific in detecting alcohol relapse, as it is slow to return to a normal value. It is, however, an easy and affordable method of testing.
Phosphatidylethanol | PEth | 8 | Phosphatidylethanol (PEth), a commonly used alcohol biomarker, is an abnormal group of phospholipids that are formed in red blood cells only in the presence of alcohol. Clinical laboratory tests can identify the presence of PEth in blood, which is indicative of alcohol abuse. PEth testing is a popular detection tool for heavy alcohol consumption because it is considered a direct biomarker for ethanol and has 99% sensitivity.
Note: a Some studies contributed more than one data point or comprised more than one study population. References for studies, by scale/measure, are presented in Supplementary Table 2.

There was insufficient data to calculate the pooled estimate for sensitivity, specificity, PPV, and NPV for the original CIDI. The pooled sensitivity estimate for CIDI version 2.1 (3 data points) was fair: 0.75 (95%CI = 0.69-0.81) and there was low heterogeneity between studies (I² = 0.0%). The pooled specificity estimate for CIDI version 2.1 (3 data points) was good: 0.84 (95%CI = 0.69-1.00) and there was high heterogeneity between studies (I² = 98.7%). There was insufficient data to calculate the pooled estimate for kappa, PPV, and NPV for CIDI version 2.1.
Semi-structured assessment for drug dependence and alcoholism (SSADDA)
There are no alpha coefficients associated with semi-structured assessments such as SSADDA. The pooled kappa estimate for SSADDA (8 data points) was moderate: 0.84 (95%CI = 0.77-0.91) and there was high heterogeneity between studies (I² = 0.97). There was insufficient data to calculate the pooled sensitivity, specificity, PPV and NPV estimates for SSADDA.

The chemical use, abuse, and dependence (CUAD)
The pooled alpha estimate for CUAD (3 data points) was excellent: 0.96 (95%CI = 0.94-0.98) and there was high heterogeneity between studies (I² = 95%). There was insufficient data to calculate the pooled estimate for kappa, sensitivity, specificity, PPV, and NPV for CUAD.

Carbohydrate deficient transferrin with mean corpuscular volume (CDT with MCV)
There are no alpha and kappa coefficients associated with biomarkers such as CDT and MCV. The pooled sensitivity estimate for CDT with MCV (8 data points) was moderate: 0.74 (95%CI = 0.60-0.88) and there was high heterogeneity between studies (I² = 98%). The pooled specificity estimate for CDT with MCV (4 data points) was excellent: 0.93 (95%CI = 0.91-0.95) and there was low heterogeneity between studies (I² = 0%). The pooled PPV estimate for CDT with MCV (4 data points) was moderate: 0.74 (95%CI = 0.51-0.97) and there was high heterogeneity between studies (I² = 98%). The pooled NPV estimate for CDT with MCV (4 data points) was excellent: 0.92 (95%CI = 0.83-1.00) and there was high heterogeneity between studies (I² = 95%).

Phosphatidylethanol (PEth)
There are no alpha and kappa coefficients associated with biomarkers such as PEth. The pooled sensitivity estimate for PEth (7 data points) was good: 0.87 (95%CI = 0.79-0.96) and there was high heterogeneity between studies (I² = 94%). The pooled specificity estimate for PEth (4 data points) was excellent: 0.94 (95%CI = 0.91-0.97) and there was moderate heterogeneity between studies (I² = 31%). There was insufficient data to calculate the pooled estimate for PPV and NPV for PEth.
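
Each result above pairs a pooled estimate and confidence interval with a heterogeneity statistic. The sketch below shows one common way such quantities can be obtained. It assumes a DerSimonian-Laird random-effects estimator applied to untransformed estimates with known within-study variances. The review states only that random effects and hierarchical logistic regression models were used, so this may not match the authors' exact procedure. The function name and inputs are hypothetical.

```python
import math


def pool_random_effects(estimates, variances):
    """DerSimonian-Laird random-effects pooling with the I^2 statistic.

    estimates: per-study estimates (e.g., alpha or sensitivity values).
    variances: the corresponding within-study variances.
    """
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / total_w

    # Cochran's Q and its degrees of freedom.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled estimate and standard error.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))

    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2_percent": i2,
    }


# Hypothetical inputs: three studies with made-up estimates and variances.
result = pool_random_effects([0.82, 0.85, 0.86], [0.0004, 0.0006, 0.0005])
```

In practice, proportions such as sensitivity are often pooled on a transformed scale (for example, the logit), which this simplified sketch does not do.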

Discussion
In this systematic review and meta-analysis, we identified 387 unique papers that have published data on the validity, reliability and diagnostic accuracy of 37 scales for substance classes that are associated with HIV risk. Based on the meta-analyzable data available, we observed that fourteen of the thirty-seven measures/scales (38%) had all pooled estimates consistently meeting criteria for acceptability (e.g., ranging between fair/moderate-to-excellent), which included the following self-reported measures:
Taken together, our findings highlight the availability of a promising range of tools for researchers and practitioners when assessing substance use, particularly those working with classes of substances associated with HIV risk, such as heroin, methamphetamine, cocaine, ecstasy, and alcohol. Nevertheless, further research is needed to determine why some substance use measures do not consistently have acceptable psychometric properties across different studies.
Overall, while most of the self-reported scales had acceptable validity, most did not have acceptable reliability: 65% of pooled estimates for alpha were in the range of fair-to-excellent, though only 44% of estimates for kappa were in the range of fair-to-excellent. Moreover, a greater proportion of the scales we identified and meta-analyzed were better at correctly identifying individuals who are truly not using substances/not problematic users among those truly without these conditions (specificity: 97% of summary estimates) and among those who were deemed as not having this condition by the scale (negative predictive value: 96%). In contrast to specificity and negative predictive value estimates, fewer scales had pooled estimates for sensitivity and positive predictive value that were in the moderate-to-excellent range (69 and 37%, respectively). These differences may have implications for the application of these measures in different settings. For example, in the criminal justice system, it may be better to use measures that have high specificity and negative predictive properties if the priority is to avoid false-positive results. However, in health settings, it may be preferable to use measures with better sensitivity and positive predictive value to better capture individuals who may require further assessment for substance use disorders and treatment referrals, as appropriate.
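
The four diagnostic accuracy outcomes compared above all derive from the same 2x2 table of scale results against a reference standard. The sketch below uses hypothetical counts (not data from the included studies) to show how they are computed, and how a scale with unchanged sensitivity and specificity can yield a much lower positive predictive value when the condition is less common in the sample. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from cross-classifying a scale against a reference standard."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int


def diagnostic_accuracy(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV and NPV for a 2x2 table."""
    return {
        # Among those who truly have the condition, the share flagged.
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        # Among those truly without the condition, the share not flagged.
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        # Among those flagged by the scale, the share who truly have it.
        "ppv": c.true_pos / (c.true_pos + c.false_pos),
        # Among those not flagged, the share who truly do not have it.
        "npv": c.true_neg / (c.true_neg + c.false_neg),
    }


# Hypothetical sample with 50% prevalence (100 cases, 100 non-cases).
balanced = diagnostic_accuracy(
    ConfusionCounts(true_pos=80, false_pos=10, false_neg=20, true_neg=90))

# Hypothetical sample with 5% prevalence (50 cases, 950 non-cases),
# same sensitivity (0.80) and specificity (0.90) as above.
low_prevalence = diagnostic_accuracy(
    ConfusionCounts(true_pos=40, false_pos=95, false_neg=10, true_neg=855))
```

In the second sample the positive predictive value falls to roughly 0.30 while the negative predictive value rises to about 0.99, even though sensitivity and specificity are identical in both.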
Overall, the studies identified in this review had administered scales in English, were conducted within the United States, and were less commonly tested among exclusively-women samples (there were twice as many exclusively-men samples in comparison). These findings highlight the general lack of diversity in terms of language, setting, and study population for the studies reporting validity, reliability, and diagnostic accuracy on substance use measures. Given the high morbidity and mortality associated with substance use globally and for different risk populations, greater effort is needed to further evaluate the psychometric properties of substance use measures in such samples. This study also found that few papers on substance use psychometric properties were "low risk" across all QUADAS-2 domains (16%). This finding highlights the need to further study the validity, reliability, and diagnostic accuracy of substance use measures using studies designed with better methodological rigor to reduce risk of bias.
The present study has several limitations. First, our inclusion criteria may have excluded some potentially relevant studies on the psychometric properties of substance use measures that were not published in English. Hence, although we included measures that were not administered in English as long as they were published in English, our findings may not necessarily be generalizable to the psychometric properties of non-English measures that were not published in English. It should also be noted that our eligibility criteria likely favored the inclusion of studies that were conducted in settings where English proficiency was higher, which is correlated with countries with higher gross national income per capita [43]. Moreover, while our search strategy was developed to try to identify all the relevant studies, many publications that have calculated our psychometric properties of interest may not have language referencing the specific key words/terms in our strategy in their titles and/or abstracts. In particular, this may occur because the psychometric data of scales may not be considered a "primary outcome" of a study, and thus not be highlighted in the title or abstract (i.e., the relevant data are embedded within the full-text only). Additionally, while we did not specifically seek out studies only among HIV-risk populations, per se, our study did focus on substance classes that have been associated with HIV risk, namely alcohol, stimulants (methamphetamine, amphetamine, cocaine, ecstasy), and heroin. Hence, our search may have missed studies on more general substance use measures that did not explicitly name our targeted substance classes.
Furthermore, we were unable to calculate pooled estimates for some psychometric outcomes of several measures due to lack of published data or insufficient data, including for some widely used assessments previously shown to be valid and reliable, such as the DSM-IV diagnostic modules used in the US National Surveys of Drug Use and Health, the Diagnostic Interview Schedule, and the AUDADIS [44][45][46]. Another limitation in our meta-analysis is related to our narrow definition of validity, which focused on internal validity as measured by Cronbach's alpha values. We acknowledge that there is a range of other characteristics that examine validity that we did not include in our analysis, such as criterion validity, predictive validity, and other psychometric properties [32]. Further research is needed to fill our gaps in knowledge on the psychometric properties of these substance use measures to enable pooled summary estimate calculations. In addition, we recognize the limitation from pooling alpha and kappa statistics from clinical and epidemiologic/community samples given how these statistical measures are margin-sensitive. Moreover, with respect to the synthesis of data on sensitivity and specificity, we acknowledge that some studies may have used imperfect gold standards, which may lead to distorted values for the individual estimates for sensitivity and specificity. Therefore, it may be appropriate to refer to results as co-positivity and co-negativity, as suggested by Buck and Gart [47]. Finally, we also recognize that disease spectrum severity and prevalence can affect test performance for sensitivity and specificity [48,49]. Our results should be interpreted with these limitations in mind.
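
The margin sensitivity of kappa noted above can be illustrated with a short sketch. It computes Cohen's kappa for two hypothetical 2x2 agreement tables (the counts are invented for illustration) that share the same observed agreement but differ in how common the condition is, as might happen when comparing a community sample with a clinical one.

```python
def cohens_kappa(both_yes: int, yes_no: int, no_yes: int, both_no: int) -> float:
    """Cohen's kappa for two raters (or two occasions) and a binary outcome.

    both_yes / both_no: counts where the two ratings agree.
    yes_no / no_yes: counts where the two ratings disagree.
    """
    n = both_yes + yes_no + no_yes + both_no
    observed = (both_yes + both_no) / n
    # Marginal proportion of "yes" ratings for each rater.
    rater1_yes = (both_yes + yes_no) / n
    rater2_yes = (both_yes + no_yes) / n
    # Agreement expected by chance given those margins.
    expected = rater1_yes * rater2_yes + (1 - rater1_yes) * (1 - rater2_yes)
    return (observed - expected) / (1 - expected)


# Both hypothetical tables have 90% observed agreement among 100 people.
balanced_margins = cohens_kappa(both_yes=45, yes_no=5, no_yes=5, both_no=45)
skewed_margins = cohens_kappa(both_yes=85, yes_no=5, no_yes=5, both_no=5)
```

With balanced margins kappa is 0.80, whereas with 90% of ratings positive it drops to about 0.44, despite identical raw agreement.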
To our knowledge, this is the first systematic review and meta-analysis involving the synthesis of psychometric data across different measures of substances that are associated with HIV risk. As mentioned, limited research has been conducted with respect to quantitatively pooling the psychometric characteristics of substance use measures. Our findings highlight the general strengths of many substance use measures with respect to their validity, reliability, and diagnostic accuracy across multiple studies/samples. To facilitate the dissemination of these findings, and provide researchers with a resource to identify validated, reliable, and accurate measures for substance use, we collaborated with members of the HIV Prevention Trials Network (HPTN) Substance Use Scientific Committee to develop a web-based tool, with the results of the pooled summary estimates presented in this study. The tool, named the "Substance Use Measure Identification (SUMI) Tool", is available as a free resource on the HPTN's website (URL: https://www.hptn.org/researchtools).

Conclusion
In summary, researchers in the field of substance use should endeavor to conduct more validity, reliability, and diagnostic accuracy studies on measures to identify substance use and use disorders among more diverse settings and populations, and with more rigorous study designs. Ultimately, accurate identification of substance users and problematic substance use is a critical step in identifying individuals for substance use treatment and evaluating the effectiveness of treatment strategies. Hence, further evaluation of substance use measures is of great importance not only to the field of substance use research, but also to substance use treatment. Given the substantial contribution of substance use to the global burden of disease [5], having robust data on the psychometric properties of substance use measures can help researchers identify the best tools to use in research studies, further enhancing the collection of more valid, reliable, and accurate data to inform evidence-based responses to substance use.

Supplementary information
execution of the search strategy. We also thank the members of the HPTN Substance Use Scientific Committee for the feedback they provided on this project.