Table 5 Aims, outcomes, key findings, methods used and future research suggestions

From: Spontaneously generated online patient experience data - how and why is it being used in health research: an umbrella scoping review

Ref

Paraphrased Aims

Area

Outcomes Assessed

Key Findings Paraphrased

Methods Mentioned

Future Research

Abbe 2016 [23]

Benefits & limitations. Current and potential uses in psych.

Mental health

Objectives of studies, and topic modelling methods /tools used for pre-processing and analysis.

Identified four main areas of application: Psychopathology, patient perspective, medical records, medical literature. A data source that cannot be ignored. Techniques and topics heterogeneous. Basic capabilities at present but will get better and become a core method.

Mostly rule based systems but some classification.

Improved techniques, apply to more languages than English.

Abd Rahman

Adequacy, challenges, and limitations of SGOPE data for detecting MH problems

Mental Health

Data Sources, Condition, location, Feature extraction methods, analysis methods

22 studies: stress 8, depression 7, suicide 3, MH disorders 4. Geographical: China 6, US 4, Japan 1, Greece 1, unspecified 10. Source: Twitter 8, Sina Weibo 5, Facebook 2, others 7. The keywords used to select data were often not specified. SVM (13/22) most popular classifier, then LR & RF (5/22) and NB (4/22).

Text analysis, multi-method including questionnaires, accessing respondents' OSN accounts. Feature extraction: TF-IDF, n-grams, BOW.

Multiple sources, other languages, inclusion of audio, video, photos. Better methods

Al-Garadi 2016 [25]

Adequacy / limitations of SM for pandemic surveillance

Infectious disease

Data source and volume, analysis method, study aims and outcomes. Features and classifier performance of supervised methods.

Can complement existing systems but still problems with representativeness. Need better algorithms and computational linguistic methods.

Mostly supervised, classification. SVM. Most used ngrams as features.

Better algorithms/ computational linguistics

Allen 2016 [26]

Better understanding of how patients with chronic disease share knowledge in online spaces. Possibilities for improving self-management.

Chronic

Network themes and mechanisms

Helpful in encouraging patients to self-manage long-term conditions through sharing collective knowledge, gifting relationships, sociability and disinhibition. Need to understand why people do or do not post.

Qualitative: thematic, grounded theory, content & thematic, IPA, ethnography

Find out why people are reluctant to post and illuminate how these communities help people manage their condition in daily life.

Barros 2020 [27]

To assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). Sources, purposes, methods

Public Health

Paper type, year, disease, health topic, forecasting, surveillance, disease characterisation, first person health mention, diagnosis prediction.

Infectious disease the biggest area. We also identified limitations in representativeness and biased user age groups, as well as high susceptibility to media events by search queries, social media, and web encyclopaedias

Correlation analysis (59/162), regression models (46/162). Machine learning 27/162, statistical models 20/162. Manual analysis 18/162, topic analysis 12/162. Deep learning 10/162, linguistic analysis 10/162. Rule-based techniques (n = 7), epidemiology theory (n = 6), surveys (n = 3), and ranking techniques (n = 1) were each used in fewer than 10 papers.

Updating keywords to reflect changing search behaviours and health trends. Susceptibility of SM content to media events. Creation of standard datasets to improve method development.

Calvo 2017 [20]

What NLP methods used on user generated data in mental health?

Mental health

Objectives of studies, data sources, features extracted

Triaging MH issues seems like a great use but need to find how to react to it in practice. Ethics/ privacy issues. Very interdisciplinary.

LIWC most widely used, both for feature extraction and sentiment analysis. Good methods are often a combination of methods/algorithms. Many different tools/techniques available; could not determine whether any one was superior.

Need research into using NLP in different languages. Also consider how to make contact with people identified during the process as being at risk of mental health problems.

Castillo-Sánchez 2020 [28]

What ML techniques used to predict suicide from SM data?

Mental Health

Methods, Tools, Techniques

Text classification main objective for 75%. 8/16 studies report explicit datamining techniques. 10/16 using SVM. Papers not reporting time spans of data collection, or number of participants.

LIWC, LDA, LSA for feature extraction, Sentiment analysis

Other languages. Use annotated corpus. Develop new tools. Do temporal studies.

Charles-Smith 2015 [29]

Can SM be used for disease surveillance? Or to test interventions to improve health outcomes?

Infectious disease

Correlation between social media data and national health statistics. Prediction times. Topic / theme identification. Influence on health behaviours.

Earlier prediction of outbreaks. Correlation with existing methods. Topic modelling good for broad topics, but not for lower frequency themes. Lots of gaps in knowledge. Need to look for ways to incorporate SM into PH surveillance.

Topic modelling (LDA). Query selection and thematic analysis to detect lower frequency topics.

Work on who uses what types of social media, so as to get representative data. SM platforms/ preferences change.

Cheerkoot-Jalim 2020 [30]

Identify the text mining approaches, tools used in biomedical text. Who benefits? Application areas? What are the challenges?

Any

Data Sources, Techniques, Tools and Potential Beneficiaries of research

Looked at who could benefit from SGOPE research

MetaMap, UMLS used - mainly on EHRs and biomed literature. NLP methods; NER and relationship extraction.

Big data paradigms, methods that can scale with the volume of text. Methods of standardising data across sources. Improving accuracy.

Convertino 2018 [31]

Summarise strategies, assess quality of information, potential for early detection from SM.

ADR

Sources, study population, drug Proto-ADE pairs, clinical features, extraction method.

Lots of potential to complement existing regulatory agencies. But utility, validity and implementation are all under-studied. Need standardised methods. Fast moving field. No causality assessment so far.

Keywords, dictionary most popular 37/38.

More work to improve methods. Use in conjunction with other signal detection methods.

Demner-Fushman 2016 [32]

Improvements in NLP on patient language, and new opportunities.

Any

SM as a source for quality assessment. Methods

Much more to be done both in clinical and SM NLP. Research moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools.

Sentiment analysis. Rule based RegEx or supervised event extraction most used. More work needed on semantic processing. Using sentences better than words.

Need more publicly available clinical datasets. Work on semantics. Work on porting pipelines across domains. Collaboration between NLP research and EHR suppliers.

Dobrossy 2020 [33]

Assess volume, participants and content of SM data about breast screening. Potential for patient education.

Breast Cancer

Platforms, volume of discourse, participant roles, discourse content, themes.

Looked at age, role of user types, and the content of the posts. Good source to understand beliefs, attitudes, and literacy of the target population.

NS

NS

Dol 2019 [34]

How health researchers are using SM data.

Any

Journals, study country, first author discipline, health topic covered, platforms, study purpose.

81/414 analysing content. Biggest use was recruitment. Generally seen as positive but concerns re ethics.

NS

Need methods to optimise usage and demonstrate potential.

Dreisbach 2019 [35]

Using NLP methods to extract symptoms from SM text

Symptoms

Study purpose, data source, symptom categorisation, evaluation, and performance metrics

Pain and fatigue most evaluated symptoms. Variety of sources. NLP primary methodology for 15/21 papers. Current focus on extraction of terms. Need to share lexicons to move forward.

21 papers: 14 NLP, 6 text mining, 1 NLP + TM. No breakdown of type of methods.

Future research should consider the needs of patients expressed through ePAT and its relevance to symptom science. Understanding the role that ePAT plays in health communication and real-time assessment of symptoms, is critical to a patient-centred health system.

Drewniak 2020 [36]

Does SGOPE research have quantifiable risks or benefits for patients, relatives, or HCPs?

Any

Purposes of the narrative: inform, engage, model behaviour, persuade, comfort

Generally positive benefits although potential risks from misinformation

NS

Future research is needed to define the optimal standards for quantitative approaches to narrative-based interventions.

Edo-Osagie 2020 [37]

Current uses of Twitter data in public health

Any

Conditions, data sources, analysis methods, geographical and time trends

Twitter a good data source for 6 aspects of public health: surveillance, event detection, pharmacovigilance, forecasting, disease tracking and geographical identification.

Numerous

Unsupervised methods. Do research into less studied areas

Falisi 2017 [38]

What role does SM play in the health of breast cancer survivors?

Breast cancer

Platforms, ethnicity of study population, analysis method, which aspects analysed, connection between SM content and health outcome.

Focus on psychosocial wellbeing. Mostly online support forums/ message boards. Few non-Caucasian. Content analyses of social media interactions prevalent, but few articles linked content to health outcomes

40/98 did content analysis. Some manual / some ML. Pre 2011 = LIWC, post 2011 = LDA etc. 37 quant, 3 qual.

Should consider connecting SM content to psychosocial, behavioural, and physical health outcomes. None of the content analysis articles attempted to do this.

Filannino 2018 [39]

What tasks and methods included in the shared tasks?

Any

Task description, data type, data source, dataset size, best performance, measure.

NER & classification the most used tasks. Clear trend to data-driven solutions. Need more and varied datasets to explore.

NER and classification most common tasks.

Bigger and more varied datasets to share

Fung 2016 [40]

What research questions and methods used on Ebola related social media?

Infectious disease

Study design, qual or quant, study aim, data collection method, time frame, keywords used, analysis method, main findings, and limitations.

12 papers: 8 from Twitter/Weibo, 1 from Facebook, 3 from YouTube, and 1 from Instagram and Flickr. All studies were cross-sectional. 11/12 articles studied one or more of themes/topics of SM content, post meta-data and characteristics of the SM account. Twitter content analysis methods included text mining (n = 3) and manual coding (n = 1). Two studies involved mathematical modelling. YouTube/Instagram/Flickr studies used manual coding of videos and images. Published Ebola virus disease-related social media research focused on Twitter and YouTube. The utility of social media research to public health practitioners is warranted. No evaluation of the studies' utility performed.

Mix of manual coding and frequency analysis using LIWC.

Need a new checklist to appraise quality of SM papers. Future research in the direction of analysing multiple cross-sectional social media datasets or conducting prospective cohort studies of social media users will provide useful data for analysis of temporal change of social media contents or social media users’ behaviours. Need to bridge research and practice.

Gianfredi 2018 [41]

Can SM be used for disease surveillance / predictions? Can they capture public reactions to epidemic outbreaks?

Infectious disease

Data source, disease, study period, geographical location, study purpose, type of analysis and main findings

Out of the 47 articles included, only 7 were focusing on neglected tropical diseases, while all the other covered communicable tropical/sub-tropical diseases, and the main determinant of this unbalanced coverage seems to be the media impact and resonance.

Qualitative, narrative analysis, content analysis, mathematical modelling, correlational analysis, geospatial.

Lots of gaps, possibly due to the media impact of the specific disease. Need further research into ways of integrating diverse data sources.

Giuntini 2020 [42]

Sentiment and emotion analysis for identifying depressive disorders. What types of SM data? Which networks? Which methods?

Mental Health

Platform, type of SM, emotion or feeling detection, other disorders inferred, methodology

Most used media is text, then emoticons. Twitter most employed platform. Supervised methods with off the shelf classifiers combined with lexicons such as LIWC.

Supervised (NB, DT, SVM etc) plus LIWC, NRC Word Emoticon, word-Net Affect lexicons

More multidisciplinary studies.

Gohil 2018 [15]

What sentiment analysis tools for Twitter / healthcare. Any health specific training, validation or justification

Any

Health area, sentiment towards, type of method, tool used, manual annotation sample size, sample size

Multiple methods mix of open source, commercial and bespoke tools. Very few tested for accuracy.

Sentiment analysis. Mix of tools.

This study suggests that there is a need for an accurate and tested tool for sentiment analysis of tweets trained using a health care setting–specific corpus of manually annotated tweets first.

Golder 2015 [43]

Prevalence, frequency and value of ADR comments from SM

ADR

Data source type, ADR type, search strategy used, post selection, study aim, ADR prevalence, comparison method

51 studies, discussion forums most used source type. ADR prevalence varied from 0.2 to 8%. General agreement that a higher frequency of adverse events was found in social media and that this was particularly true for ‘symptom’ related and ‘mild’ adverse events.

8/12 used Consumer Health Vocab dictionary. Few evaluation methods

A cost-effectiveness analysis of all pharmacovigilance systems, including social media is urgently required.

Gonzalez-Hernandez 2017 [44]

Show how NLP is developing in regard to capturing the patient perspective from unstructured text.

Any

Types of SM sites, analysis type, types of tasks.

Move from rule based to learning based systems. Work needed on noise reduction and normalisation/mapping. Shortage of annotated shared datasets. Shared tasks useful development tool.

Move from rule based to learning methods. Over 50% papers used lexical content analysis. In SM NLP: regex, LDA topic modelling. Supervised classification. Sentiment analysis

Normalisation of data, co-reference and temporal relation extraction. Need to create and release annotated datasets and targeted unlabelled data sets in distinct languages.

Gupta 2020 [45]

What methods, sources, are used for SM based health surveillance. Potential applications, and challenges.

Any

ML Methods, Data Sources, Diseases, Limitations of SM systems

Twitter most used source (64%). SVM most used method (33%) - better at binary classification.

SVM, Decision trees, random forest, NB, Logistic Regression

Noise reduction, Combining SM with other data, theme detection, develop better predictive models for epidemic prediction. Only 3 studies included ethical debate.

Hamad 2016 [46]

How is content analysis used in health-related SM studies?

ADR

Keywords and hashtags, sampling and data collection, analysis methods, validation, and presentation of results

Methods used were not purely quantitative or qualitative, and the mixed-methods design was not explicitly chosen for data collection and analysis. Proposes CCA analysis as straightforward method for Twitter analysis

Content analysis (quantitative and qualitative). Infoveillance. Combined content analysis (mix of mixed methods and content analysis)

NS

Ho 2016 [75] 

Compares omics, social media and EHRs as sources of ADR knowledge

ADR

Study aims, Data & Tool, Method

Data driven approach essential to detect /predict ADRs. Omics data, EHRs and SM all new opportunities.

Datamining. NLP, NER, ontology building. Classification to exclude noise. Aims to reduce false positive rate. Yang = mix of topic + classification. Classification to link effect to drug. UMLS & MetaMap

NS

Injadat 2016 [47]

Techniques, areas, performance, comparison of techniques, strengths, and weaknesses of data mining methods.

Any

Domains, Techniques, Research objectives, Strengths, and weaknesses of techniques.

19 data mining techniques used to address 9 different research objectives in 6 different industrial and services domains. Most used methods: SVM, NB & DT. Most used in business and social network analysis. Medical/health use only 8%

Datamining. SVM, BN, DT

Research into how techniques are implemented. Need more statistical tests of results. But - many of the tests applied required a normal distribution which was not the case. Health researchers not good about writing about the methods used. Could learn a lot from CRM and HRM domains.

Karmegam 2020 [48]

Aims to analyse the possibility, effectiveness, and procedures of using SM data to understand the emotional and psychological impact of unforeseen disaster on the community.

Mental Health

Platform, methods

Twitter most used source. Sentiment analysis used for psychological surveillance. Could not conclude that any one method was superior.

Feature extraction using classification algorithms. Sentiment analysis

Combine text and image processing. Incorporate social network analysis with post content.

Kim 2017 [49]

How SM data can be used to understand communication and behavioural patterns of nonmedical or problematic use of prescription drugs

Substance misuse

User characteristics, communication characteristics, outcomes, methodological domain, ethical domain

See lots of potential, but more work needed.

Mixture: manual, qualitative, supervised / non supervised ML to identify themes, patterns, sentiment.

Lots more - sees their review as a base to build on. Identified a lack of theoretical framework for substance misuse monitoring. Consequences of SM engagement understudied.

Lafferty 2015 [50]

How is SM being used in psychiatry? Tools, benefits, and challenges.

Mental health

SM as data, methodological considerations, ethical considerations, SM for recruitment

Observational, real time patient experiences. Can help with development of practice, policy, and provision. Opportunities for co-creation of research, patient centric care.

Grounded theory, Social network analysis

Ethical issues. Analyse SM data through different socio-cultural lenses to build theoretical frameworks.

Lardon 2015 [51]

Can SM be a new source of knowledge for pharmacovigilance?

ADR

Language, data source, data volume, methods, lexicon.

For the identification theme, all 11 papers used manual methods. Identified heterogeneity of methods, but also gaps. Included studies failed to assess the completeness, quality, or reliability of the data.

RQ1: All manual /mixed, RQ2: Web scraping, pre-processing, various rule-based methods.

Additional studies are required to precisely determine the role of social media in the pharmacovigilance system. Need methods to assess data quality.

Lau 2019 [52]

2018 SOTA of opportunities, challenges, and implication of AI in health informatics

Any

NS

Few 2018 papers reported Artificial Intelligence (AI) research for patients and consumers. No studies that elicited patient and consumer input on AI. Most common use is secondary analysis of social media data (e.g., online discussion forums). The 3 best papers shared a common methodology of using data-driven algorithms (such as text mining, topic modelling, Latent Dirichlet allocation modelling), combined with insight-led approaches (e.g., visualisation, qualitative analysis, and manual review), to uncover patient and consumer experiences of health and illness in online communities. There is a lack of direction and evidence on how AI could actually benefit patients and consumers.

Best papers shared a common methodology of using data-driven algorithms (such as text mining, topic modelling, Latent Dirichlet allocation modelling), combined with insight-led approaches (e.g., visualisation, qualitative analysis, and manual review), to uncover patient and consumer experiences of health and illness in online communities

See what patients want from AI in health. More patient involvement to ensure that research is asking the right questions.

Lopez-Castroman 2019 [53]

Detecting suicide ideation from SM

Mental health

NS

Early days, but SM has important role in suicide prevention. Lots more work needed.

Various: Sentiment analysis, topic modelling, data mining

Add demographic data to text to improve results.

Mavragani 2020 [54]

Current state of SM based infodemiology. Validity of methods and research gaps.

Any

Timeline & journals. Data sources, Health topics, Advantages & Disadvantages of SM data

JMIR most used journal. Increasing interest since 2018. Twitter most used platform. Most researched subjects were conditions/diseases, epidemics, healthcare, drugs, smoking/alcohol.

NS

Combine SM data with traditional sources for more complete assessment.

Neveol 2017 [55]

Best clinical NLP papers of 2016

Any

Applications of NLP, Directions of progress

Developing applications rather than methods. Starting on the more complex tasks e.g., semantics, coreference resolution, and discourse analysis.

Classification of useful sentences, Information extraction, abbreviation disambiguation, coreference resolution, grounding of gradable adjectives

NS

Neveol 2018 [21]

Summarize recent research / best papers for clinical NLP in 2017

Any

NLP of SM data, NLP of HCP text, methods

2017 trends - revisiting old problems such as SM classification and negation with deep learning & neural nets. Production of annotated corpora. Continuing applications rather than methods. Beginning of deep learning. Start of language variants.

Negation detection, corpus annotation, deep learning.

Work in other languages. Increase generalisability.

Patel 2015 [56]

Categorise & summarise existing papers about chronic disease outcomes from SM. Suggest framework for future research.

Chronic

Platform, Taxonomy category, disease, study aim, study design, sample size & description, Method summary, SM effect

85% either Facebook or blogs. 40% for support (social, emotional, or experiential).

Quantitative, Thematic qualitative, Content analysis.

Understand how disease, patient factors and tech can interact to improve outcomes. Reduce potential for bias. Target studies to specific diseases might be the best way to improve clinical care.

Pourebrahim 2020 [57]

Datamining methods for ADR detection from SM

ADR

Analysis and evaluation metrics

SM good for early identification of ADRs. Three main stages; Pre-processing, feature extraction and classification

Supervised, regression, unsupervised

NS

Qiao 2020 [58]

Overview of SM studies relating to mental disorder detection.

Mental Health

Platforms, collection methods, feature extraction, algorithms, evaluation metrics

Facebook, Twitter, Reddit, Tumblr, Instagram. Most used supervised methods, especially SVM

SVM, Decision trees, random forest, NB, Logistic Regression

Develop systems with lower computational cost to increase speed. Multi-language systems.

Ru & Yao 2019 [7]

SGOPE data - methods/analysis opportunities and challenges

Any

Data type, volume, pre-processing method, analysis method, health outcomes

Variety of methods. Outcomes included side effects / effectiveness / adherence / hrqol

NER, mapping, identify concepts, text mining (Ngram, LDA, topic modelling), content analysis, hypothesis testing, supervised, unsupervised

Suggested further research on treatment effectiveness, adverse drug events, perceived value of treatment, and health-related quality of life. The challenge lies in the further improvement and customization of text mining methods. Only 6 discussed ethics.

Santos 2019 [59]

Numbers of papers / journals, countries / databases, methods/tools, which public health issues looked at

Public health

Year, Journal, Study purpose, health area, techniques, software/ programming language, study country

Results showed a slight increase in the number of papers published in 2014 and a significant increase since 2017, focusing mostly on infectious, parasitic, and communicable diseases, chronic diseases, and risk factors for chronic diseases. JMIR and PLoS ONE published the highest number of papers. Support Vector Machines (SVM) were the most common technique, while R and WEKA were the most common programming language and software application, respectively. The U.S. was the most common country where the studies were conducted. In addition, Twitter was the most frequently used source of data by researchers.

SVM, Decision trees, random forest, NB, most used techniques. R, WEKA, and Python most used languages/ apps.

In depth analysis of variations in techniques (deep learning / ensemble etc)

Sarker 2019 [60]

Look at existing methods of SM based medication abuse or misuse, propose new data centric pipeline.

Substance misuse

Data source, dataset size, medication studied, study objectives, methods, and findings.

39 studies, 80% published since 2015. Twitter most used source. Earlier studies manual qualitative, but growing trend towards NLP methods.

Supervised, unsupervised

Develop shared annotation guidelines and annotated datasets. Will help direct the project and enable comparison across methods. Show agreement for manual annotation. Reduce noise in data.

Sharma 2016 [61]

Identify and highlight research issues and methods used in studying Complementary and Alternative Medicine (CAM) information needs, access, and exchange over the Internet.

CAM

NS

Significant interest in developing methodologies for identifying CAM treatments, including the analysis of search query data and social media platform discussions

Qualitative, thematic, content analysis, keyword searches, regex, Consumer health vocabulary

Little work done on using SGOPE to understand CAM users’ perspectives / prevalence of CAM use. Lots more work required.

Sharma 2020 [62]

Can sentiment analysis be conducted on social media platforms to understand public sentiment held towards pharmacotherapy?

Any

Author, Year, Journal, data source, conditions, pharmacotherapy, SA method used, potential clinical use.

Lack of consistent approach. Opinion on particular medication (7/10) and ADRs (3/10). Lexicon based more used than ML for sentiment (Lexicon 6, ML 1). Combining SA with other ADR methods improved results. Lots of untapped potential.

Lexicon, ML. Combining

No gold standard methods yet. Early stage of development. Accuracy rarely assessed.

Sinnenberg 2017 [63]

How and why health researchers using Twitter?

Health research

Ways Twitter data used by researchers, ways that Twitter platform used in health research, Publication date, research topic, ethics, and funding

The primary approaches for using Twitter in health research that constitute a new taxonomy were content analysis (56%; n = 77), surveillance (26%; n = 36), engagement (14%; n = 19), recruitment (7%; n = 9), intervention (7%; n = 9), and network analysis (4%; n = 5).

Content analysis, network analysis

Future work should develop standardized reporting guidelines for health researchers who use Twitter and policies that address privacy and ethical concerns in social media research. New opportunities to characterise users from metadata such as demographics.

Skaik 2020 [64]

Recent trends and tools for using social media posts to predict mental disorders using ML and NLP methods. Identifying research gaps.

Mental Health

Collection methods, applications, best practices, and gaps

25 papers looking at population level mental health classification techniques. 15/25 depressive disorders, 10/25 suicide-ideation. Twitter most used data source, SVM most used model. Heterogeneity of methods and feature selection.

Models: SVM, Ensemble, LR, RF, DT, LSTM. Features: WEKA, LDA, TF-IDF, Sentiment, Lexical, Syntactic, Demographics, Word embedding, Topic modelling

Improve identification of risk factors.

Staccini 2017 [65]

Uses and challenges for secondary use of health data

Any

Data donation, uses of SGOPE data

Secondary use of patient data (apart from personal health care record data) can take many forms. Requirements to allow this secondary use should be harmonized between countries, and social media platforms can be efficiently used to explore and create knowledge on patient experience with health problems or activities. Machine learning algorithms can explore those massive amounts of data to support health care professionals and institutions, and provide more accurate knowledge about use and usage, behaviour, sentiment, or satisfaction about health care delivery.

NS

Very early days, lots to work on. Socio-ethical concerns, increased adoption in health care. Need to check AI /SM is asking the right questions. Need a formal framework for consent and secondary use of data. Far from massive adoption in health practice.

Su 2020 [66]

Deep learning in Mental Health

Mental Health

Methods, Tools, Techniques

A growing number of studies using DL models for studying mental health outcomes. Particularly, multiple studies have developed disease risk prediction models using both clinical and non-clinical data and have achieved promising initial results. Lots of potential but lots of challenges

CNN, RNN, Autoencoders

Reduce bias, improve methods

Tricco 2018 [67]

Using SGOPE for ADR detection. Types / characteristics of platforms? How valid or reliable are the conversations?

ADR

Data sources, document characteristics, health conditions, methods, types of listening system, outcome results

46/70 documents (66%) described an automated or semi-automated information extraction system to detect health product AEs from social media conversations (in the developmental phase). Seven pre-existing information extraction systems to mine social media data were identified in eight documents. 19/70 documents compared SM reported AEs with validated data: consistent AE discovery in 17/19. No evaluation of methods or reliability.

Supervised 15/70, Rule based 6/70, unsupervised 4/70, deep learning 1/70, other ML 5/70, Manual or NA 32/70. Dictionary/ lexicon based most used.

Further research is required to strengthen and standardize the approaches as well as to ensure that the findings are valid, for the purpose of pharmacovigilance. Studies required to look at uses / utility over a longer time period. Need standardised methods. Fast moving field.

Vilar 2018 [68]

To review datamining as a method of detecting drug-drug interactions from pharmacovigilance sources, scientific literature. Challenges and limitations compared.

ADR

Data source, methods

SGOPE offers new possibilities for identifying DDIs. Current emphasis has been on ADRs not DDIs.

Dictionary matching, association mining, supervised LR.

More studies are necessary to really prove and understand the potential of social media resources and their role in pharmacovigilance.

Wilson 2015 [69]

Understanding how blogs could be used for qualitative health research

Any

Geographical location, study aims, how data used in health research.

Used for data collection and recruitment. Good for accessing out of reach populations. Potential for significant improvement of health equity. Sees blogs as ‘central part of global transformation’. Need to develop knowledge and skills to take advantage of this new resource.

Purely qualitative

Look for innovative methods to develop qualitative research.

Wong 2018 [70]

To review methods of identifying adverse events from free text

ADR

Definition of NLP tasks, evaluation metrics, challenges in applying NLP to medication safety, data source, methods

Time saving/ real time. Limited by lack of data sharing inhibiting large-scale monitoring across populations. SM good for groups such as children, pregnant women, often not included in trials. Data is Pt reported outcomes, values / preferences - more patient focused.

Supervised, CRF classifier, unsupervised k-means clustering. Linguistic based, standardising text with UMLS. Statistical based.

Integrate data sources from different domains to improve ADR detection. Ethical issues. Increased volume of open-source data.

Wongkoblap 2017 [71]

Scope & limitations of new predictive method using SM. Ethical concerns.

Mental health

Key characteristics, data collection techniques, data pre-processing, feature extraction, feature selection, model construction, and model verification.

Methods work across languages. Despite an increasing number of studies investigating mental health issues using social network data, some common problems persist. Assembling large, high-quality datasets of social media users with mental disorder is problematic, not only due to biases associated with the collection methods, but also with regard to managing consent and selecting appropriate analytics techniques.

Most common method was text analysis with LIWC. Sentiment analysis. Supervised / predictive models. Only 1/58 used deep learning.

Move towards open science standards - share datasets / workflow /code. Ethical aspects of using SM data not clearly defined. Lack of models for detecting stress or anxiety disorders. Combining SM content with confirmed patients rather than self-reported ones. Network analysis to investigate prevalence.

Yin 2019 [16]

To systematically review the effectiveness of applying machine learning (ML) methodologies to UGC for personal health investigations.

Any

Methods, Objectives, Data Source, Health issue, Language, Dataset size

103 eligible studies, summarized with respect to 5 research categories, 3 data collection strategies, 3 gold standard dataset creation methods, and 4 types of features applied in ML models. Popular off-the-shelf ML models were logistic regression (n = 22), support vector machines (n = 18), naive Bayes (n = 17), ensemble learning (n = 12), and deep learning (n = 11). The most investigated problems were mental health (n = 39) and cancer (n = 15). Common topics were treatment experience, sentiments and emotions, coping strategies, and social support. Clinical credibility an issue. Application in practice - who should monitor UGC. Conflicting advice from peers / HCPs a potentially interesting avenue. SGOPE can learn health information not in EHRs. Processing and ethical challenges unresolved.

Logistic regression (22), SVM (18), Naïve Bayes (17), ensemble learning (12), deep learning (11)

Ethical aspects of analysing personally contributed data, bias induced when building study cohorts and dealing with natural language, interpretation of modelling results, and reliability of the findings.

Zhang 2018 [72]

Consideration of Twitter as a data source for health researchers.

Any

Research design, collection techniques, analytic methods, tools, author’s opinion on Twitter as a health research method.

17 papers: Quantitative (n = 2), qualitative (n = 7), and mixed methods (n = 8). Health topics and research questions included pain, migraines, and cancer, social discourse of conditions like perceptions of portrayal of seizures, and cyberspace compared to real-world phenomena. Twitter currently used to search and mine research data. Utilizing Twitter as a recruitment and data collection tool in health research remains largely unexplored. Data collection predominantly passive and covert. Challenges include verification, ethics - overt or covert collection.

Qualitative, quantitative, mixed methods

Creates new questions about data collection, verification, ethics for researchers.

Zhang 2020 [73]

Role of SM, themes and methods used in SM based public health research.

Any

Publication trends, themes, role of SM, research methods

Growing number of publications and journals including studies.

Still mostly qual or quant, with little use of computational methods.

Need to develop the methodological potential.

Zunic 2020 [74]

Data sources, roles, motivations, and demographics of posters. Topic areas. Practical applications, methods and current performance levels of sentiment analysis.

Any

Data sources, role of post author, demographic features recorded, health area, ML algorithms used for SA, classification performance, lexical resources

86 studies. Majority of data from social networking/ Web-based retailing platforms. Primary purpose of online conversations is information exchange/social support. Communities tend to form around health conditions with high severity / chronicity rates. Topics include medications, vaccination, surgery, orthodontic services, individual physicians, and health care services in general. 5 poster roles identified: sufferer, addict, patient, carer, and suicide victim. Only 4 reported demographic characteristics. Many methods used for SA. Mainly supervised. Only 1 study used deep learning. Performance less than achieved by general sentiment analysis methods. F-score below 60% on average. Few domain-specific corpora and lexica are shared publicly for research purposes. Unclear if performance issues are because of the intrinsic differences between the domains and their respective sublanguages, the size of training datasets, the lack of domain-specific sentiment lexica, or the choice of algorithms.

Sentiment analysis. Mix of tools. A wide range of methods were used to perform SA. Most common choices included support vector machines, naïve Bayesian learning, decision trees, logistic regression, and adaptive boosting. Only 1 study used deep learning.

Improved methods. Performance less than achieved by general sentiment analysis methods. Lack of domain specific datasets / lexicons. Need to create and share large, anonymised domain specific datasets. More inclusion of demographic data.