Social media has led to fundamental changes in the way that people look for and share health related information. There is increasing interest in using the spontaneously generated online patient experience (SGOPE) data from these platforms as a data source for health research. The aim was to summarise the state of the art regarding how and why SGOPE data has been used in health research. We determined the sites and platforms used as data sources, the purposes of the studies, the tools and methods being used, and any identified research gaps.
A scoping umbrella review was conducted of review papers published from 2015 to January 2021 that studied the use of SGOPE data for health research. Using keyword searches we identified 1759 papers, from which we included 58 relevant studies in our review.
Data was used from many individual general or health specific platforms, although Twitter was the most widely used data source. The most frequent purposes were surveillance based, tracking infectious disease, adverse event identification and mental health triaging. Despite the developments in machine learning, the reviews included many small qualitative studies. Most NLP used supervised methods for sentiment analysis and classification. Method development is at a very early stage, and the methods used are often not clearly explained. There are disciplinary differences, with some work concentrating on improving accuracy and other work on application. There is little evidence of any work that either compares the results of both approaches on the same data set or brings the ideas together.
Tools, methods, and techniques are still at an early stage of development, but strong consensus exists that this data source will become very important to patient centred health research.
The rapid growth of social media (SM) has led to fundamental changes in the way that people look for and share health related information. Of a global population of 7.6 billion, almost half (3.7 billion) are classified as active (more than once a month) social media users, with 72% of US adults using it for health purposes, either as a first or second-line health information source or exchange resource [4, 5]. Restrictions and local lockdowns due to the global COVID-19 pandemic are likely to have led to even greater health-related online use, as individuals may have avoided personal visits to clinicians or been unable to access treatments. Posts written by individuals on social media platforms are creating vast resources of spontaneously generated online patient experience (SGOPE) data in the form of unstructured text.
As the number of individuals using the internet for health-related purposes continues to rise, there has been a corresponding increase in interest in exploring this online user generated content as a data source for health research. The potential benefits of social media as a research resource for healthcare include reducing research costs, improving patient empowerment, engagement and health communication. Despite the methodological complexities of analysing large volumes of unstructured natural language text, there has been increased interest from both commercial and academic researchers in methods of generating knowledge from it, and new methods are developing rapidly [12,13,14]. Although the use of health-related social media as a data source is a relatively new subject area, it is being actively researched across many other disciplines, including computer science, sociology, philosophy, and business. The volume of published literature is growing rapidly and includes both academic and grey sources, but as yet there is little literature bringing together the developments in the area [15, 16]. A review of reviews from 2018 looking at the potential uses, benefits, harms and tools was inconclusive in terms of the effectiveness and uses of SM as a data source for mental health research, concluding that better research designs were needed. As far as we are aware, this is the first review of this type since then. It summarises the current state of the art of how and why SGOPE data is being utilised in health research through a scoping umbrella review of the recent literature.
Aims & objectives
This review examines how SGOPE data is currently being used within health research. Our main research question for this review is “How and for what purposes is SGOPE data currently being utilised within health research?”. Sub-questions include:
Which sites / platforms are being used as data sources?
What purposes is SGOPE data used for?
What tools and methods are being used in the studies?
What are the knowledge gaps and areas of future research needed?
Study design, reason & justification
This study is an umbrella scoping review. An umbrella review is a form of knowledge synthesis which, by summarising existing review papers, aims to describe the subject area and what is currently known about it, and to identify the gaps in knowledge. Scoping reviews are particularly useful when looking to get a broad overview of an emerging area, drawing together the key concepts and what it encompasses. We chose this combined novel method for two main reasons. Firstly, comparing existing reviews gives a wide overview of the subject area, highlighting existing evidence and illustrating how researchers across the various disciplines are exploring the topic. By avoiding the repetition of searches, screening of individual papers, and the resynthesizing of existing studies, it provides an overall picture of the current state of the art that can be used as a broad base to build from. Secondly, as the literature base is so varied, a scoping review enables the inclusion of any other relevant review literature that would not otherwise be included within a systematic review, such as grey or opinion literature. Much of the most current research around natural language processing (NLP) is interdisciplinary rather than being published solely in the mainstream health based journals. Although not always subject to the peer review process of the more traditional journals, these other sources can be an important source of information for a review on such a rapidly evolving subject area. Widening the scope adds both to the depth and breadth of the literature as well as reducing the potential for any publication bias. Searching across disciplines, especially in an area where the terminology varies and evolves rapidly, means that it is difficult to use tightly defined search terms. Many relevant keywords relating to SM have yet to be indexed into the MeSH system, so it was not possible to rely purely on MeSH terms for searching [7, 21].
Scoping review methods do not require a formal critical appraisal of the literature. We conducted this umbrella review following the methodology suggested by the Joanna Briggs Institute.
We searched the following databases: Medline, Embase, PubMed, PsycINFO, Web of Science, ACM and IEEE Xplore. We also searched Google Scholar, Twitter, Google and other text or opinion literature. Additional literature, both published and ‘grey’, such as conference reports, was added from reference lists and an existing bibliography previously compiled by JW. Final searches were conducted on January 21st, 2021.
The original search terms were based on keywords, first based on input from a research librarian, then agreed by JW and CD, clustered around the main areas of setting, analysis, content, usage, and methods (Fig. 1). We combined the keywords in various options and then conducted the searches as an iterative process, repeated as further search terms were identified to optimise the efficiency and targeting of the search process. Wildcards (*) were used where possible to pick up multiple word endings or ambiguities over hyphen usage.
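As an illustration of how clustered keywords with wildcards can be combined into a search string, the following is a minimal sketch only. The cluster contents are hypothetical placeholders (the actual terms are those in Fig. 1), and the Boolean and phrase syntax shown is an assumption, since it varies between databases.

```python
# Hypothetical keyword clusters for illustration; not the terms used in the review.
CLUSTERS = {
    "setting": ["social media", "online communit*", "patient forum*"],
    "content": ["patient experience*", "user generated content"],
    "methods": ["natural language processing", "text mining", "sentiment analys*"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(clusters):
    """OR the terms inside each cluster, then AND the clusters together."""
    groups = []
    for terms in clusters.values():
        groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)


query = build_query(CLUSTERS)
print(query)
```

Re-running `build_query` after adding newly identified terms to a cluster mirrors the iterative refinement described above.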
Inclusion/ exclusion criteria
Papers were included if they were any type of review (systematic, scoping, literature, or general) that included analysis of SGOPE data for health research. Non review papers, those not referring to SGOPE usage, not health related, entirely mathematical or statistical, not in English or published before 2015 were excluded.
After duplicate removal, the initial screening followed a three-phase approach by two reviewers (JW and CD). Using the Rayyan tool to assess the reviews, the remaining papers were initially classified independently as include, unsure or exclude based on the title or headline. Both JW and CD then read the abstracts or first paragraph of those not excluded. Full texts of relevant papers were retrieved. Initial agreement rate was 86%. Results were compared and disagreements resolved through discussion, comparing perspectives on the inclusion and exclusion criteria against the full text of the papers until agreement was reached. Although critical appraisal is not required for a scoping review, the reviews were informally assessed using the questions from the Confidence in the Evidence from Reviews of Qualitative research (CERQual) appraisal tool (Table 1) to ensure their suitability for inclusion.
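For readers unfamiliar with how an agreement rate between two screeners is obtained, the sketch below computes simple percent agreement. It also computes Cohen's kappa, which corrects for chance agreement; kappa is added here for illustration and is not reported in this review. The screening decisions are invented example data.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of items on which both reviewers made the same decision."""
    if len(a) != len(b) or not a:
        raise ValueError("decision lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both reviewers used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Invented decisions for seven hypothetical records.
reviewer_1 = ["include", "exclude", "unsure", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "include", "exclude", "include"]

print(f"agreement: {percent_agreement(reviewer_1, reviewer_2):.2f}")
print(f"kappa:     {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```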
The final selection of included papers was collated into a marked list on the Web of Science database for basic bibliometric analysis.
Data extraction/ analysis
The data extraction form was designed by JW and FG. The extracted data from each included paper comprised title, author(s), date of publication, journal, keywords, review type, objectives, research questions (where stated), number and type of included studies, settings and population studied, data sources, date ranges of included studies, key findings, future research needed if identified, and strengths and limitations if included. This enabled us to analyse the reviews in line with the research questions. Frequency analysis was performed on the author generated words for each included paper, and a word cloud generated.
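The keyword frequency analysis can be sketched as follows. This is a minimal illustration with invented keyword lists, not the extraction data. It assumes keywords are compared case-insensitively and counted once per paper, which is one reasonable choice among several.

```python
from collections import Counter


def keyword_frequencies(papers):
    """Count in how many papers each author keyword appears."""
    counts = Counter()
    for keywords in papers:
        # Normalise case and whitespace; a set counts each keyword once per paper.
        counts.update({k.strip().lower() for k in keywords if k.strip()})
    return counts


# Invented author keywords for three hypothetical papers.
papers = [
    ["Twitter", "NLP"],
    ["twitter ", "Sentiment analysis"],
    ["NLP", "twitter"],
]

frequencies = keyword_frequencies(papers)
for keyword, count in frequencies.most_common():
    print(keyword, count)
```

The resulting counts are the kind of input a word cloud tool would take as term weights.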
Of the 1759 records initially identified from the searches, 58 were included in the final review. Details and characteristics of the included papers are summarised in Table 2. The PRISMA flow diagram (Fig. 2) shows the number of papers included and excluded at each stage of the process.
The 58 included reviews covered the period 2015 to 2020 with the reported number of associated papers published increasing each year, especially since 2017 [27, 34, 54, 59, 63, 73]. This is illustrated in the breakdown of included review papers by publication year (Fig. 3).
The included studies came from a wide range of journals, although only six journals provided more than one review. Journal of Medical Internet Research (JMIR) contributed 11/58 (19%), Yearbook of Medical Informatics 7/58 (12%), and IEEE 3/58 (5%), while PLOS ONE, Journal of Biomedical Informatics and the International Journal of Qualitative Methods each provided 2/58 (3%). The other 31 papers originated from individual journals from a wide range of research areas (Table 3).
The interdisciplinary nature of the topic is reflected in the tree map of the research areas of the included papers, as defined by the Web of Science database bibliometric analysis (Fig. 4). Furthermore, one of the larger reviews (414 papers) analysed the discipline of each of the first authors, finding that in 90 papers (22%) they were from either a non-health or an unspecified background.
The characteristics of each of the 58 included reviews (aims, health condition of interest, data sources, review type and number of included papers) are shown in Table 2. In terms of the methodology of the included reviews, almost half (27/58) were described as systematic, with 15/58 being general, 12/58 scoping, and one each of narrative, critical, survey and bibliometric. The number of studies in the included reviews ranged from 5 [55, 65] to 3419, with an average of 118.
In line with the inclusion criteria the included reviews cover two main areas; how spontaneously generated data is used within health research and the methods and tools that are used to analyse it. Just over half (34/58) of the reviews cover both questions while 13/58 were primarily focused on uses and 11/58 mainly focused on the methods used (Table 4).
The word cloud of the individual author generated keywords illustrates the range and frequency of the intended purposes of the included reviews (Fig. 5).
RQ1: which SGOPE sites and platforms are used as data sources?
Twitter has been by far the most utilised data source, although a wide variety of general social networks and disease specific communities have also been used. Six (10%) of the reviews looked entirely at Twitter based studies [15, 37, 46, 49, 63, 72]. A further eighteen (31%) reviews included a wider range of sites but reported that Twitter was the most frequently used source [20, 24, 25, 28, 29, 33, 39,40,41, 45, 48, 54, 59, 60, 62, 67, 74]. Both general health sites and disease specific communities covering a wide range of conditions were also widely accessed [20, 23, 24, 26, 31, 33, 35, 42, 45, 51, 58, 74]. One review focused entirely on looking at the potential of blogs as a qualitative data source. Six (10%) reviews used both SGOPE and electronic health records (EHRs) [30, 32, 44, 55, 66, 70], while Abbe used a combination of online posts, EHRs, biomedical literature and qualitative studies.
Four reviews highlighted that although there were exceptions, most of the individual papers within the reviews used data from a single source [20, 41, 43, 67].
RQ2: what purposes is SGOPE data used for?
The identified use cases for SGOPE data extended from improving public health at a population level to fine grained understanding of patient perspectives. We summarised the varied aims, outcomes and key findings from the included reviews in Table 5.
In terms of the specific health topic of interest, 22/58 papers included any health condition [7, 15, 16, 21, 30, 32, 34, 36, 37, 39, 44, 45, 47, 52, 54, 55, 62, 65, 69, 72,73,74]. Twelve focused on mental health conditions [20, 23, 24, 28, 42, 48, 50, 53, 58, 64, 66, 71], nine on adverse drug reactions (ADRs) [31, 43, 46, 51, 57, 67, 68, 70, 75], four on infectious diseases [25, 29, 40, 41], and two each on chronic disease [26, 56], substance misuse [49, 60], public health [27, 59] and breast cancer [33, 38], with one each on symptom identification, use of complementary and alternative medicine (CAM) therapies and the reasons for existing use by health researchers.
As a retrospective surveillance tool SGOPE has been used to capture public reaction to health events in terms of emotions [20, 42], fears, knowledge, attitudes and behaviours [27, 41]. Karmegam looked specifically at studies evaluating the potential of SM data to understand the emotional and psychological impact of unforeseen natural disasters in a community. Several reviews focus on using SGOPE data to monitor behaviours, communication patterns and spread of health related concepts, particularly relating to infectious diseases [25, 27, 29, 40, 41]. Both the speed and accuracy of tracking are seen to improve on existing surveillance and signal detection systems, although most conclude that SM surveillance should currently be complementary to existing systems rather than replace them [25, 29, 31, 67]. Analysis of SGOPE data has been used to understand the various network mechanisms of information spread, the topics that are discussed, and to identify trends or patterns within the conversations [23, 24, 26, 33, 34, 41, 43, 48,49,50, 54, 61, 63, 67, 73].
A study on chronic disease collated qualitative studies exploring how people shared knowledge within the communities to show how the distinct characteristics of online spaces helped patients self-manage their long term conditions in ways that are difficult to replicate offline, and how these spaces were filling an unmet need for information and/or emotional support [26, 36].
One of the most frequent specific use cases was as a new source for identifying adverse drug events or reactions [16, 31, 43, 51, 57, 67, 68, 70, 75]. Identified advantages of SGOPE data over existing sources include earlier identification of ADRs [31, 43, 68, 70], the reduction of associated economic costs and potential fatality numbers, and the highlighting of ‘mild’ adverse events that may not be seen as serious enough to report through existing routes. Golder found that the prevalence of adverse event reporting on SM ranged from 0.2% to 8% of the posts, with ‘mild’ events being over represented, while ‘serious’ ones were under represented as compared to other ADE discovery methods. Comparisons between the data sources show that SGOPE data is generally in concordance with other regulatory sources for most adverse events [43, 67], but that at this early stage of method development it should be used in conjunction with other existing methods [68, 70]. Combining SGOPE with EHRs and omics data is seen as an essential method of detecting and predicting ADRs. The additional context from the patient experience narrative adds to existing post-marketing surveillance of interventions [16, 67].
Two reviews looked at the misuse of prescription medicines [49, 60]. Kim used findings from existing Twitter analysis to create a typology of SM big data analysis on the topic based on the four conceptual dimensions of poster characteristics, communication characteristics, predictors and mechanism for the discussion of problematic use, and the psychological or behavioural consequences of discussing it on social media.
Other use areas included assessing the opportunities, benefits, challenges and limitations that using SGOPE data might offer healthcare providers and researchers [23, 25, 36, 37, 42, 45, 52, 57, 62, 64, 71, 73]. Benefits identified included providing a new channel for hearing patient perspectives of their health experiences [23, 33, 42], faster data collection and reduced costs [25, 33], and improved support for self-management of health conditions. Zhang categorised SM based papers by their role in public health, with the most frequent use case being as an interactive intervention tool aimed at modifying risky health factors. Classifying studies into five categories (education, disease modification, diagnosis, support and management), Patel evaluated the impact of social media use on outcomes across a range of chronic conditions, concluding that few studies suggested any harm from its use and that as a data source it had tremendous potential to improve patient care. Drewniak looked at the risks and benefits of using patient narratives for patients, relatives and HCPs, finding that they were a promising way of improving patient understanding of their health conditions, capable of impacting behaviours and outcomes.
Ru concluded that with improvements in analysis methods, findings from SGOPE would be able to generate new research questions around effectiveness, ADRs and health related quality of life.
Vilar evaluated SGOPE as a method of identifying drug-drug interactions (DDI). They conclude that existing DDI resources such as DrugBank, Micromedex and DDI Corpus, although good as knowledge or evaluation bases, show little consistency, and that SGOPE had the potential to be instrumental in creating knowledge sets and identifying unknown DDIs.
Calvo looked at the ways and levels at which NLP could be used within mental health, including triaging people at risk and diagnosis of specific conditions. At a post level, emotions and risks can be identified; at the author level, temporal changes can be tracked; and at a population level, general trends in sentiment and attitudes can be established.
One review identified how language markers, such as higher use of pronouns, can be indicators of altered mental state or suicide ideation. Using predefined semantic vocabularies allowed the identification of posts indicating both medium and severe mental illness.
Yin looked specifically at SGOPE data as a route to understanding poster experience of health issues, concluding that it gave insights into health factors that often were not recorded in EHR systems. They summarised 103 papers into 5 research categories: those characterising health issues and patients, prediction of events such as suicide, the correlation between SM posts and existing data collection methods, those characterising drug usage/ adverse events/ misuse and detecting sentiment about major health events such as post-partum depression and how this impacted on posting behaviours. Recognising that symptom discussion is a large component of SGOPE data, one review focused on papers for symptom extraction. Understanding how symptoms cluster is a recognised knowledge gap. While pain and fatigue were the most common symptoms that were identified, many of the included papers in this review identified symptoms from 10 of the 12 symptom categories the review authors had previously defined, concluding that SGOPE data could help with faster diagnosis and understanding issues such as the recent opioid crisis and pain management.
RQ3: analysis methods identified by the reviews
Analysis methods have varied widely as new tools and techniques have been developed, and the reviews reflect this. Eleven of the reviews focused on the methods utilised to analyse this type of online data, while 34 looked at both uses and methods (Table 4). A study covering 2003 to 2017 highlighted the absence of specific trends in either approach, evaluation or performance.
This review found that many papers, even recent ones, are still using traditional qualitative [26, 56, 69, 73] or mixed methods [41, 46, 50, 72] of analysis on small quantities of data. Of the 42 papers in the Patel review, only 3 analysed over 1000 posts, with 26/42 analysing fewer than 100 texts. Other reviews included papers using a mix of manual and machine learning methods [29, 38, 40, 49, 51, 61]. Abbe argued that while the debates about qualitative and quantitative analysis continue, the exploratory yet highly automated approach of natural language processing (NLP) can bridge the gap, offering the best of both worlds.
Among the analysis methods used, sentiment analysis was the most commonly utilised [15, 16, 20, 32, 37, 42,43,44, 48, 49, 53, 61, 62, 64, 71, 74]. Our review found that much early sentiment analysis was performed on small volumes of text, using qualitative or content analysis methods [46, 60]. Developed originally as a marketing tool for businesses to understand consumer opinion towards their products, sentiment analysis has frequently been used to identify emotions that can signify a poster's thinking and mood when trying to identify potential suicide risk [15, 20, 28, 53], to track ADRs and to interpret patient reviews of health care services. Simple automated content analysis has used lexicon based keyword techniques such as the LIWC (Linguistic Inquiry and Word Count) text analysis tool to count the frequency of keywords within the text or compute the percentage of positive or negative emotional terms in a text [20, 29].
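The lexicon based approach described above can be illustrated with a minimal sketch that computes the percentage of positive and negative terms in a post. The two word lists are tiny invented examples, not the LIWC dictionaries, and the tokeniser is a deliberately simple assumption.

```python
import re

# Toy lexicons for illustration only; real tools use much larger curated lists.
POSITIVE = {"better", "relief", "hope", "helped", "good"}
NEGATIVE = {"worse", "pain", "scared", "awful", "tired"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def emotion_percentages(text):
    """Return the share of tokens (in percent) found in each lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return {"positive": 0.0, "negative": 0.0}
    positive = sum(t in POSITIVE for t in tokens)
    negative = sum(t in NEGATIVE for t in tokens)
    return {
        "positive": 100 * positive / len(tokens),
        "negative": 100 * negative / len(tokens),
    }


print(emotion_percentages("Pain is worse but hope helped"))
```

A keyword counter of this kind ignores negation and context ("not better" still counts as positive), which is one reason such methods are regarded as simple.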
Machine learning methods have a myriad of different algorithms and techniques of varying levels of complexity in various stages of development. At a basic level they can be divided into either supervised (classification) or unsupervised (clustering) methods. Classification methods were often rule based, looking for predefined words or patterns of text, and the accuracy of the model is heavily dependent on the initial parameters in the choice of words or expressions. The majority of the machine learning studies to date have used supervised methods. Common classification algorithms include Support Vector Machines (SVM), Naive Bayes (NB), Decision Trees (DT) and Random Forest (RF). All these and others are frequently mentioned in the method discussions, although SVM is the most popular [25, 47, 59, 71, 74]. Gupta noted that SVM was the most promising method for binary classification tasks. Unsupervised techniques using topic modelling, which do not require large amounts of labelled data, are beginning to become more prevalent, especially for identifying themes and topics within large quantities of text [29, 38], but were less frequently utilised. A comparison of all data mining techniques found that they all had various strengths and weaknesses and that research objective and data should guide the choice of method.
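To make the idea of supervised classification concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written in plain Python. It is a toy illustration rather than any method used in the included reviews. The four training posts and the labels `adr` and `other` are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented labelled posts: does the post describe a possible adverse reaction?
TEXTS = [
    "this drug gave me a terrible headache and nausea",
    "felt dizzy and sick after the new tablets",
    "picked up my prescription from the pharmacy today",
    "my appointment with the doctor is next week",
]
LABELS = ["adr", "adr", "other", "other"]

model = NaiveBayes().fit(TEXTS, LABELS)
print(model.predict("nausea and headache after tablets"))
```

The dependence on labelled examples is visible here: the classifier can only weigh words it has seen with a label, which is why supervised methods need annotated data.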
The methods of both SGOPE and clinical NLP (looking at the unstructured text in EHRs etc.) have similar issues and purposes, but although automatic methods of processing are developing, the unstructured nature, noise, domain specific content, problems with language usage, understanding semantics and the complexities of informal speech mean that there is still a lot of work to be done in developing methods to maximise its usage [44, 45, 66]. Sarker categorised the methods currently used to identify and monitor such use, concluding that there was still a lack of datacentric pipelines, and proposing a new method based on shared annotation guidelines and labelled datasets. One solution to the issues of ‘noise’ and irrelevant text within SGOPE is to use a combination of methods to refine the content by using a binary classification method to exclude any irrelevant content and then topic modelling to identify themes from the useful content [40, 44, 60]. Health domain language can be quite specific to the domain, even at a lay person level. A variety of approaches are being explored to deal with the inconsistencies of ‘patient language’, such as spelling correction and attempts to map lay language to medical ontologies.
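As a rough illustration of spelling-tolerant mapping from lay language to standard terms, the sketch below uses approximate string matching from the Python standard library. The lay terms and target concept names are invented for the example and do not come from any real ontology. The similarity cutoff of 0.8 is an arbitrary assumption.

```python
import difflib

# Hypothetical lay-term to concept mapping; a real system would draw on a
# curated vocabulary rather than a hand-written dictionary.
LAY_TO_CONCEPT = {
    "headache": "Headache",
    "tummy ache": "Abdominal pain",
    "throwing up": "Vomiting",
    "dizzy": "Dizziness",
}


def normalise(term, cutoff=0.8):
    """Map a lay term to a concept, tolerating minor misspellings."""
    key = term.strip().lower()
    if key in LAY_TO_CONCEPT:
        return LAY_TO_CONCEPT[key]
    # Fall back to the closest known lay term above the similarity cutoff.
    matches = difflib.get_close_matches(key, list(LAY_TO_CONCEPT), n=1, cutoff=cutoff)
    return LAY_TO_CONCEPT[matches[0]] if matches else None


for raw in ["hedache", "Dizzy", "xyz"]:
    print(raw, "->", normalise(raw))
```

String similarity handles typos but not synonyms ("migraine" would not match "headache"), so it addresses only part of the patient language problem described above.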
Shared tasks and datasets have been identified as a means of improving method development. Table 6 lists details of some of the SGOPE shared dataset challenges held to date that were identified in the reviews [39, 44]. Often used in computer science, these approaches are seen as the most comprehensive way to evaluate methods and techniques. All groups taking part in each challenge share the same training dataset, and then develop algorithms or pipeline processes, with the accuracy of each method calculated with the test data. This allows an open assessment of the various approaches. Results from these suggest that the best results are often achieved by combining methods in a pipeline process, but that there is no one single combination that is seen as being the most effective. Tasks such as entity recognition, especially using dictionaries, are much easier than the more complex problems of correctly identifying relationships between the entities. Reports from some of the shared task challenges show that within clinical NLP, named entity recognition (NER) can achieve an accuracy of over 70%, but relation extraction methods are much less successful, with performances below 50%. These figures are likely to be lower in SGOPE data, where the variations in language, grammar and sentence construction are much wider.
The latest developments in NLP move from rule-based systems to deep learning [16, 21, 37, 44, 74]. These methods aim to improve on the semantic level of understanding by using language models such as word embeddings and distributional semantics. Based on artificial neural networks, deep learning uses ‘hidden layers’ to extract more detail from the raw input. Within healthcare it is deep learning techniques that are behind the recent advances in automated image processing. As yet they seem to be rarely used within text based healthcare analysis, with only 1/86 papers on sentiment analysis using deep learning and 4/86 using word embeddings, although one review commented on how researchers were starting to use these methods on existing classification and negation identification problems. Only one review focused on deep learning methods, but these were mostly applied to EHR and biomedical literature data, with only a few examples of SGOPE data usage.
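Scoring a system against a shared test set can be sketched as below. The reviews do not specify which metric each challenge used, so precision, recall and F1 with exact matching are assumptions chosen because they are common in entity recognition evaluation. The annotations are invented, with each entity represented as a hypothetical `(document id, start, end, type)` tuple.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold annotations by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented gold-standard and system annotations for two hypothetical posts.
gold = {(1, 0, 7, "DRUG"), (1, 20, 28, "SYMPTOM"), (2, 5, 12, "DRUG")}
system = {(1, 0, 7, "DRUG"), (2, 5, 12, "SYMPTOM")}

precision, recall, f1 = precision_recall_f1(gold, system)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because every group is scored with the same function on the same held-out data, results from different pipelines become directly comparable.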
RQ4: gaps and future research needed
All the reviews acknowledge that method development is still at an early stage and that much more work is needed before the full potential of SGOPE can be utilised. Particular challenges include algorithm design [25, 59], method refinement [51, 72], integrating diverse data sources [41, 70], pre-processing, coreference and temporal relation extraction [44, 51], spelling correction, normalising poster language, and reducing bias [16, 56, 66, 71]. Studies to date have considerable heterogeneity in methods and outcomes, and further work is also needed to define the optimal standards for these [15, 28, 30, 31, 40, 51, 63].
Sentiment analysis performance on health related text was found to be lower than in other domains, possibly because most of the commonly used sentiment lexicons have been developed from publicly available film or restaurant reviews and do not work as well on health topics. There are calls for the development of annotation guidelines and sentiment analysis tools trained on health care specific corpora. One of the problems with current standard sentiment lexicons is that they are too general for health topics. Attempts to map them to the Unified Medical Language System (UMLS) found that less than 1% of its content is covered by common existing lexicons.
At this early stage of method development there are a variety of tools and algorithms available to analyse unstructured text, but a lack of studies that compare their efficiency or accuracy and therefore a lack of consensus as to which are the most useful [15, 43, 74]. Several reviews suggest greater sharing of datasets [32, 39, 44, 71, 74] and the wider development of shared tasks, where different groups can work towards solving a particular task on the same dataset [27, 32, 39, 44, 60].
The frequent lack of clear explanation of the methods used in studies [28, 47] and the poor reporting of datasets used mean that it is hard to assess the accuracy of many results, and may lead to selective outcome reporting or publication bias.
Further work is needed in terms of evaluating the findings from SGOPE data, both against existing signal detection methods and against psychosocial, behavioural and physical outcomes. Comparisons of SGOPE data to that in clinical text such as EHRs or biomedical literature identified the potential value of SGOPE and highlighted the particular issues of noisy, irrelevant content, language inconsistencies and ambiguity [23, 32, 44, 55], but made no comment on how the accuracy of SGOPE data analysis compares with these methods.
Other areas for future research identified include the need to adapt the methods to languages other than English [20, 21, 23, 24, 28, 44, 58], cost-effectiveness studies , better understanding of how SGOPE can help posters self-manage , maximising the representativeness of the data [29, 35], facilitating evaluation [61, 67], integrating SM text with audio and video sources [24, 45, 48], and a better understanding of how SGOPE could integrate with existing systems . Three reviews commented on the lack of demographic analysis, despite geotags being easily accessible from Twitter data [53, 63, 67, 74]. Each individual tweet has potentially 38 data features including detailed metadata such as geotags, but these seem to be unexplored at present . Methods that included temporal analysis could help identify event sequences and causal inferences . Only one review focused on health outcomes . A lack of linking SGOPE interactions and analysis with health outcomes has also been identified .
There is a lack of both theoretical  and methodological data centric frameworks  for SGOPE usage, hindered by the discipline boundaries where researchers in one area often do not know of relevant literature in another. This is compounded by differences in language, terminology and methods that exist . Giuntini suggests that a multi-disciplinary approach could help develop better algorithms . The need for interdisciplinary collaboration between NLP and health researchers in order to maximise the opportunities available is highlighted [20, 32]. A gap between academic NLP research and the commercial NLP systems as beginning to be used on electronic health records (EHRs) has been identified, in that academic work tends to be more advanced . One review looking at the development of methods identified a number of approaches that were in development for analysing SM text but concluded that many NLP developments are not getting as far as being used in applications – ‘they are often explored, published and then shelved’ .
Concerns around the ethics of using social media data posted in public spaces are ongoing and several reviews mentioned the need to be aware of ethical issues [16, 20, 28, 34, 49, 50, 63, 65]. The absence of any form of discussion around the ethical implications of this form of data use was highlighted in 23/26 surveillance studies and 13/16 studies on suicide ideation. The need for guidelines and harmonisation of regulation around secondary use of SM data and data donation was identified [65, 72] together with a call to analyse data through different socio-cultural lenses.
As issues around privacy and consent begin to be resolved, further questions emerge about how findings should be incorporated into health care practice [35, 51, 52, 65]. Regarding its use as a method of public health surveillance, there is a lack of guidance as to how health organisations should accept or react to data from SM discussions. In the area of mental health, questions remain as to whether and how any posters deemed to be ‘at risk’ should or could be contacted.
This type of data source has traditionally been seen as lacking credibility, although recently several of the major science journals have begun to publish articles supporting its use within health research. One identified limitation is the potential for the content to be influenced by media events or coverage [27, 29]. To increase its acceptability, efforts need to be made to bridge research and practice by demonstrating how the research can translate into practice [16, 40]. Aligning SGOPE data with clinical EHR data could help both to bridge the credibility gap and to give posters and clinicians a better understanding of how health issues impact on individuals’ lives.
In total, 58 review papers were included that answered the research question of how and why SGOPE data is being used in health research in terms of the sub questions. Of these, 13/58 looked primarily at the purposes, 11/58 primarily at the methods, while 34/58 addressed both the purpose and the methods. Despite the heterogeneity of studies included, the early stage of methodology development and the many challenges still to be overcome, there was universal agreement between them on the potential of SGOPE data to improve health and deliver patient centred care. Twitter is the most widely used data source, and the majority of studies to date have used qualitative, quantitative or supervised machine learning methods. The growing significance of this type of data source is reflected in the volume of published literature, especially since 2017.
RQ1: which SGOPE sites / platforms are being used as data sources?
The high prevalence of Twitter as a data source probably reflects the ease with which large volumes of data have been accessible through its API, rather than its suitability for health research. Facebook was another common source, but privacy issues have resulted in far fewer messages being publicly accessible in recent years. Access to some of the potentially more useful online forums and communities is being restricted, due to a combination of privacy concerns and commercial interests, as the economic value of health data is increasingly recognised.
Restricting individual studies to a single data source may be simpler for method development, but it does decrease the overall validity of the studies due to the elements of emotional contagion or other bias [78, 79] that may be present on a single data source, especially if it is a relatively small community. Even a massive source such as Twitter is still quite limited in the demographics of its posters, which has implications for the generalisability of findings based on it.
RQ2: what purposes is SGOPE data used for?
Although SGOPE data adds a new dimension to healthcare research, the topics that have been researched to date reflect both the early stages of methodology and the type of posts that are most available. The most active communities are known to be those with long term conditions or rare diseases. Simple keyword searches are easy to implement and can be very effective when searching through large volumes of text for mentions of selected drugs or conditions, but as methods develop the range of use cases will widen. Much of the use to date has been retrospective or evaluative, but as methods improve, higher degrees of semantic meaning can be accurately extracted, and its role as a predictive or triage tool may become more widespread. One of the potential problems of studies based solely or mainly on one data source such as Twitter is that the content posted there tends to be heavily biased by media coverage of events, so that potential use cases for research are driven by the availability of the data [27, 41].
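The kind of simple keyword search described above can be illustrated with a minimal sketch. It is not taken from any of the included reviews: the keyword list, the example posts and the function name `find_mentions` are all hypothetical, and whole-word, case-insensitive matching is only one of several reasonable choices.

```python
import re

# Hypothetical watch-list; the terms are illustrative, not drawn from the reviews.
KEYWORDS = ["ibuprofen", "metformin", "migraine"]


def find_mentions(posts, keywords):
    """Map each keyword to the indices of the posts that mention it.

    Matching is whole-word and case-insensitive, so "Metformin" matches
    "metformin" but "migraines" would not match "migraine".
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    mentions = {kw: [] for kw in keywords}
    for index, post in enumerate(posts):
        for kw, pattern in patterns.items():
            if pattern.search(post):
                mentions[kw].append(index)
    return mentions


# Invented example posts, for illustration only.
posts = [
    "Started Metformin last week and feeling ok so far",
    "Another migraine today, ibuprofen did nothing",
    "Lovely weather for a walk",
]
result = find_mentions(posts, KEYWORDS)
print(result)
```

A search of this kind finds explicit mentions only; misspellings, brand names and colloquial descriptions would be missed unless they were added to the list, which is one reason the range of use cases is expected to widen as methods develop.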
RQ3: which analysis methods are being used in the studies?
The level of detail reported varied widely between reviews, with some of the reviews that looked at both uses and methods going into far greater detail of the individual methods used than some of the pure method focused reviews.
Most of the machine learning methods used supervised techniques, all of which require large quantities of annotated data, with both the volume and the quality of the annotation directly influencing the resulting accuracy. Annotating a dataset is a time-intensive and often expensive process, as it requires domain-specific knowledge [21, 44]. The general lack of availability of health-specific trained or labelled data has implications for the accuracy that can be achieved. Zunic suggests that increased use of shared datasets could increase the use of deep learning methods, improving performance levels as well as enabling comparison between methods. However, the use of complex deep learning methods requires a trade-off between the computational cost involved and the performance levels that can be achieved.
RQ4: knowledge gaps
The strongest message from these reviews was confirming how much more work there is to do in this emerging area. Knowledge gaps, defined areas for future research, and limitations of existing studies were identified in all the included papers. Despite the increase in interest in this area in the last few years, a recent review that looked at COVID-19 related social media found that there was a lack of studies both into the application of machine learning on the data and into its use for real time surveillance. The lack of systematic testing of methods and results impacts on the credibility of the findings. The importance of reproducibility is a current issue in healthcare research, so future work in this area should make clear what methods are used.
From the literature to date, there seemed to be little evidence of this data source being used to assess or evaluate the patient perspective of the effectiveness of a treatment, intervention, or service, other than for detecting adverse events or reactions. Given that so much is still unknown about the relationships between patient characteristics, environment and disease, patterns of symptoms, behaviours or effects that can be extracted from SGOPE may give new insights that can be used to improve outcomes.
Although many of the reviews acknowledged that sharing knowledge between online users was one of the big factors in online health information use, only one of the reviews focused on how important this was to those with long term conditions. Very few of the studies had explored or compared unsupervised methods to identify the themes being discussed.
Few reviews looked at identifying any form of inferred or perceived causal inference from the social media posts. Dictionary-based systems that can match explicit interventions and symptoms within defined units of text can be a simple and very effective way of identifying potential relationships. Determining that a possible relationship exists is, however, different to determining what the actual effect is, especially when the event is reported retrospectively. Very few of the review studies look beyond the co-occurrences of named entities to indicate a possible relationship between items. There has been less focus on assessing causality to identify true drug-ADE pairs [44, 51]. One suggestion is that the low quality of the information precludes the evaluation of causal links [31, 51], although the quality of the posts in terms of completeness varies widely between sites. Identifying temporal data to sequence events could help distinguish true causal links, as will continued work on the more complex tasks of lexical semantics, coreference resolution and discourse analysis.
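To make the distinction between co-occurrence and causality concrete, the following is a minimal sketch of a dictionary-based matcher of the kind described above. It is an illustration under stated assumptions rather than a method reported in the included reviews: the two dictionaries, the example posts, the choice of the sentence as the unit of text and the naive sentence splitter are all hypothetical.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical dictionaries; real systems would use curated lexicons.
DRUGS = {"sertraline", "ibuprofen"}
SYMPTOMS = {"nausea", "headache", "insomnia"}


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption for brevity)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def cooccurrences(posts):
    """Count (drug, symptom) pairs that appear within the same sentence."""
    counts = Counter()
    for post in posts:
        for sentence in split_sentences(post):
            tokens = set(re.findall(r"[a-z]+", sentence.lower()))
            drugs = sorted(tokens & DRUGS)
            symptoms = sorted(tokens & SYMPTOMS)
            for pair in product(drugs, symptoms):
                counts[pair] += 1
    return counts


# Invented example posts, for illustration only.
posts = [
    "Started sertraline on Monday. Nausea every morning since.",
    "Sertraline gave me nausea and insomnia!",
    "Took ibuprofen for a headache. Feeling fine now.",
]
pairs = cooccurrences(posts)
for (drug, symptom), n in sorted(pairs.items()):
    print(drug, symptom, n)
```

The example shows both limitations noted in the text. The first post is missed because the drug and the symptom fall in different sentences, while the third post yields an (ibuprofen, headache) pair in which the symptom is the reason for taking the drug rather than an effect of it. Co-occurrence flags a possible relationship; it does not establish its direction or nature.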
Strengths and limitations
An umbrella scoping review approach summarises the current state of the art of this fast-moving field. One of the strengths of this method is that although some of the individual studies may have been included in multiple reviews, each review paper has had different research questions, thus generating a range of different perspectives on any such papers. Seven databases were searched, together with grey literature and reference lists. The review is, however, subject to the usual limitations of keyword-based searches, in that it is possible that some relevant literature may have been missed. Searches were also limited to those in English. However, this was mitigated by the deliberately broad inclusion / exclusion criteria, which were intended to ensure that as many relevant reviews as possible were included in the final analysis.
SGOPE data remains an underused resource in healthcare. It has the potential to increase knowledge of many different aspects of healthcare and as such has a multitude of potential uses. Despite the raft of suggestions for future research and the methodological development that is still needed, the consensus from the reviews included in this study is that SGOPE is a data source capable of offering considerable benefit to healthcare researchers and providers, and that NLP will become an important methodological tool within health research.
Mueller J, Jay C, Harper S, Davies A, Vega J, Todd C. Web Use for Symptom Appraisal of Physical Health Conditions: A Systematic Review. J Med Internet Res. 2017;19(6):e202.
Oprescu F, Campo S, Lowe J, Andsager J, Morcuende JA. Online information exchanges for parents of children with a rare health condition: key findings from an online support community. J Med Internet Res. 2013;15(1):e16.
Ru B, Yao L. A literature review of social media-based data mining for health outcomes research. In: Bian J, Guo Y, He Z, Hu X, editors. Social web and Health Research: benefits, limitations, and best practices. Cham: Springer International Publishing; 2019. p. 1–14. https://doi.org/10.1007/978-3-030-14714-3_1.
Moorhead SA, Hazlett DE, Harrison L, Carroll JK, Irwin A, Hoving C. A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. J Med Internet Res. 2013;15(4):e85.
Aromataris E, Fernandez R, Godfrey CM, Holly C, Khalil H, Tungpunkom P. Summarizing systematic reviews: methodological development, conduct and reporting of an umbrella review approach. Int J Evid Based Healthc. 2015;13(3):132–40.
Calvo RA, Milne DN, Sazzad Hussain M, Christensen H. Natural language processing in mental health applications using non-clinical texts. Nat Language Eng. 2017;23(5):649–85. https://doi.org/10.1017/S1351324916000383.
Neveol A, Zweigenbaum P, Section Editors for the IMIA Yearbook Section on Clinical Natural Language Processing. Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook. Yearb Med Inform. 2018;27(1):193–8. https://doi.org/10.1055/s-0038-1667080.
Abd Rahman R, Omar K, Noah SAM, Danuri M, Al-Garadi MA. Application of machine learning methods in mental health detection: a systematic review. IEEE Access. 2020;8:183952–64. https://doi.org/10.1109/ACCESS.2020.3029154.
Allen C, Vassilev I, Kennedy A, Rogers A. Long-Term Condition Self-Management Support in Online Communities: A Meta-Synthesis of Qualitative Papers. J Med Internet Res. 2016;18(3):e61.
Barros JM, Duggan J, Rebholz-Schuhmann D. The application of internet-based sources for public health surveillance (Infoveillance): systematic review. J Med Internet Res. 2020;22(3):e13680. https://doi.org/10.2196/13680.
Castillo-Sánchez G, Marques G, Dorronzoro E, Rivera-Romero O, Franco-Martín M, De la Torre-Díez I. Suicide risk assessment using machine learning and social networks: a scoping review. J Med Syst. 2020;44(12):205.
Charles-Smith LE, Reynolds TL, Cameron MA, Conway M, Lau EHY, Olsen JM, et al. Using social Media for Actionable Disease Surveillance and Outbreak Management: a systematic literature review. PLoS One. 2015;10(10):e0139701.
Cheerkoot-Jalim S, Kumar KK. A systematic review of text mining approaches applied to various application areas in the biomedical domain. J Knowledge Manag. 2020; ahead-of-print(ahead-of-print). https://doi.org/10.1108/JKM-09-2019-0524.
Convertino I, Ferraro S, Blandizzi C, Tuccori M. The usefulness of listening social media for pharmacovigilance purposes: a systematic review. Expert Opin Drug Saf. 2018;17(11):1081–93.
Demner-Fushman D, Elhadad N. Aspiring to unintended consequences of natural language processing: a review of recent developments in clinical and consumer-generated text processing. Yearb Med Inform. 2016;1(1):224–33. https://doi.org/10.15265/IY-2016-017.
Dobrossy B, Girasek E, Susanszky A, Koncz Z, Gyorffy Z, Bognar VK. ‘Clicks, likes, shares and comments’ a systematic review of breast cancer screening discourse in social media. PLoS One. 2020;15(4). https://doi.org/10.1371/journal.pone.0231422.
Dol J, Tutelman PR, Chambers CT, Barwick M, Drake EK, Parker JA, et al. Health researchers’ use of social media: scoping review. J Med Internet Res. 2019;21(11):e13687.
Dreisbach C, Koleck TA, Bourne PE, Bakken S. A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data. Int J Med Inform. 2019;125:37–46.
Fung ICH, Duke CH, Finch KC, Snook KR, Tseng PL, Hernandez AC, et al. Ebola virus disease and social media: A systematic review. Am J Infect Control. 2016;44(12):1660–71. https://doi.org/10.1016/j.ajic.2016.05.011.
Gianfredi V, Bragazzi NL, Nucci D, Martini M, Rosselli R, Minelli L, et al. Harnessing Big Data for Communicable Tropical and Sub-Tropical Disorders: Implications From a Systematic Review of the Literature. Front Public Health. 2018;6. https://doi.org/10.3389/fpubh.2018.00090.
Giuntini FT, Cazzolato MT, dos Reis M, Campbell AT, Traina AJM, Ueyama J. A review on recognizing depression in social networks: challenges and opportunities. J Ambient Intell Humaniz Comput. 2020;11(11):4713–29. https://doi.org/10.1007/s12652-020-01726-4.
Golder S, Norman G, Loke YK. Systematic review on the prevalence, frequency and comparative value of adverse events data in social media. Br J Clin Pharmacol. 2015;80(4):878–88.
Gonzalez-Hernandez G, Sarker A, O’Connor K, Savova G. Capturing the Patient’s Perspective: a Review of Advances in Natural Language Processing of Health-Related Text. Yearb Med Inform. 2017;26(1):214–27.
Hamad EO, Savundranayagam MY, Holmes JD, Kinsella EA, Johnson AM. Toward a mixed-methods research approach to content analysis in the digital age: the combined content-analysis model and its applications to health care twitter feeds. J Med Internet Res. 2016;18(3):e60. https://doi.org/10.2196/jmir.5391.
Karmegam D, Ramamoorthy T, Mappillairajan B. A systematic review of techniques employed for determining mental health using social media in psychological surveillance during disasters. Disaster Med Public Health Prep. 2020;14(2):265–72. https://doi.org/10.1017/dmp.2019.40.
Kim SJ, Marsch LA, Hancock JT, Das AK. Scaling up research on drug abuse and addiction through social media big data. J Med Internet Res. 2017;19(10):e353. https://doi.org/10.2196/jmir.6426.
Lau AYS, Staccini P, Section Editors for the IMIA Yearbook Section on Education and Consumer Health Informatics. Artificial intelligence in health: new opportunities, challenges, and practical implications. Yearb Med Inform. 2019;28(1):174–8.
Qiao J. A systematic review of machine learning approaches for mental disorder prediction on social media, 2020 International Conference on Computing and Data Science (CDS); 2020. p. 433–8. https://doi.org/10.1109/CDS49703.2020.00091.
dos Santos BS, Steiner MTA, Fenerich AT, Lima RHP. Data mining and machine learning techniques applied to public health problems: a bibliometric analysis from 2009 to 2018. Comput Ind Eng. 2019;138:106120. https://doi.org/10.1016/j.cie.2019.106120.
Sharma C, Whittle S, Haghighi PD, Burstein F, Keen H. Sentiment analysis of social media posts on pharmacotherapy: a scoping review. Pharmacol Res Perspect. 2020;8(5):e00640. https://doi.org/10.1002/prp2.640.
Staccini P, Fernandez-Luque L. Secondary use of recorded or self-expressed personal data: consumer health informatics and education in the era of social media and health apps. Yearb Med Inform. 2017;26(1):172–7. https://doi.org/10.15265/IY-2017-037.
Vilar S, Friedman C, Hripcsak G. Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Brief Bioinform. 2018;19(5):863–77. https://doi.org/10.1093/bib/bbx010.
Wong A, Plasek JM, Montecalvo SP, Zhou L. Natural Language Processing and Its Implications for the Future of Medication Safety: A Narrative Review of Recent Advances and Challenges. Pharmacotherapy. 2018;38(8):822–41. https://doi.org/10.1002/phar.2151.
McDermott MBA, Wang S, Marinsek N, Ranganath R, Foschini L, Ghassemi M. Reproducibility in machine learning for health research: Still a ways to go. Sci Transl Med. 2021;13(586):eabb1655.
Schillinger D, Chittamuru D, Ramírez AS. From “Infodemics” to health promotion: a novel framework for the role of social Media in Public Health. Am J Public Health. 2020;110(9):1393–6. https://doi.org/10.2105/AJPH.2020.305746.
JW conceived the study design, conducted the study, and drafted the paper. JW and CD agreed the search terms, inclusion and exclusion criteria, selected and screened the studies for inclusion. FG and JC contributed to study design, advised on study conduct, and contributed to editing the paper. All authors contributed to the article and approved the submitted version.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Walsh, J., Dwumfour, C., Cave, J. et al. Spontaneously generated online patient experience data - how and why is it being used in health research: an umbrella scoping review.
BMC Med Res Methodol 22, 139 (2022). https://doi.org/10.1186/s12874-022-01610-z