Today, real world data (RWD) primarily consists of administrative claims databases, electronic health records, patient registries, and patient-reported outcomes. While these data sources are tremendously powerful in their own right, they do not capture the full range of patient experiences.

Specifically, these sources can systematically miss certain types of data elements (e.g., data on physical fitness and socio-behavioral influences on health), do not capture all encounters with a healthcare provider (e.g., uninsured encounters, which are not captured in claims data), and may miss healthcare events that occur outside of an encounter (e.g., flu episodes or over-the-counter drug use).

Novel sources of RWD hold the promise of supplementing standard databases by filling in these gaps. For example, better data on physical fitness and socio-behavioral factors could improve risk stratification and confounding control; earlier and more comprehensive detection of adverse events could help with pharmacovigilance; and better data on social activity and treatment experiences could assist in evaluating quality of life as an outcome.

These novel sources of data could include almost any technology that routinely collects personal data, and they often involve machine learning techniques that extract semantic knowledge and identify complex patterns from raw data. Examples include, but are not limited to:

Wearable devices. Devices such as fitness monitors and smartwatches can provide data on daily exercise activity, heart rate, blood pressure, and sleep quality, and can identify episodes of disease at the time of occurrence, such as atrial fibrillation and epileptic seizures.

Location data. Location data from smartphones, cars, and other sources could potentially be used to assess a broad range of topics, such as estimating a patient’s physical mobility, socioeconomic conditions, or environmental exposures, or determining when a patient is hospitalized or has other healthcare encounters.

Social media, patient forums, and internet search queries. Online discussions and search queries could be used to evaluate patient symptoms, diagnoses, treatment experiences, and adverse events associated with drug exposures, thereby assisting pharmacovigilance. In addition, these approaches could be used to detect infectious disease outbreaks and estimate disease incidence.

Voice assistants. Because voice assistants such as Alexa can help patients manage their healthcare, they can provide data on a diverse range of topics, such as drug prescriptions, medical appointment scheduling, and medical information shared with care providers.

While these new sources of RWD may hold significant promise, they are also accompanied by a number of challenges, including:

Patient privacy. Foremost among these challenges is ensuring patient privacy protection, particularly as personal data collection and linkage across multiple data sources become ever more pervasive. The improvement and use of validated, automated de-identification systems will be key in minimizing privacy risks and ensuring regulatory compliance. Furthermore, patients must be informed about how their data are collected and be empowered to control when their data are used for research.

Data integration. Novel sources of data are most powerful when they can be linked to other health records, such as claims and EHR data. Standardized applications, such as the FDA’s recently launched MyStudies app, can help facilitate this integration while ensuring regulatory compliance.

Development and validation of machine learning techniques. Although significant progress has been made in machine learning technologies, further development is needed in areas such as social media data mining. Machine learning models must also be validated and maintained over time, especially when used in rapidly evolving information environments such as social media. There, changes in communication styles may render models less relevant over time, and the presence of incentivized social media influencers may make it difficult to separate noise from genuine drug signals.

Generalizability. Data sourced from wearable devices and similar technologies may over- or under-represent certain patient groups relative to the general population. For example, patients who use wearable devices and social media may differ from those who do not in socioeconomic status, culture, language, health, and other factors. As a result, models and conclusions developed from this type of data may not generalize to the wider population, and there is a risk that disadvantaged populations will be excluded from the benefits of these research efforts.
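One simple way to gauge how far a device-using sample sits from a reference population on a given characteristic is a standardized mean difference. The sketch below is illustrative only: the function name, the pooled-standard-deviation formula chosen, and the age values are assumptions made for this example, not data or methods from any study.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(sample, reference):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(sample) + variance(reference)) / 2)
    if pooled_sd == 0:
        # Both groups are constant and identical in spread; report no difference.
        return 0.0
    return (mean(sample) - mean(reference)) / pooled_sd


# Hypothetical ages, invented purely for illustration.
wearable_users = [29, 34, 41, 38, 45, 31, 36]
general_population = [52, 47, 63, 39, 58, 71, 44]

smd = standardized_mean_difference(wearable_users, general_population)
print(f"Standardized mean difference in age: {smd:.2f}")
```

A value far from zero suggests the wearable sample differs meaningfully from the reference group on that characteristic, which is a prompt to consider weighting or to limit the conclusions drawn.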

Study design challenges. As with any observational data, the effects of missing data, selection bias, and other epidemiological issues must be understood and managed in order to produce valid results. For example, can and should periods of missing wearable data be imputed? Is a patient who stops using a wearable device systematically different from one who continues to use it? How can we perform suitable risk adjustment in comparative studies when incorporating alternative sources of RWD?
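Before deciding whether missing wearable data can be imputed, it helps to describe where the gaps fall. The sketch below is a minimal illustration under stated assumptions: the function names, the three-day threshold, and the rule that a gap running to the end of follow-up signals possible discontinuation are all hypothetical choices made for this example, not established standards.

```python
from datetime import date, timedelta


def find_gaps(observed_days, start, end):
    """Return (first_missing, last_missing) runs of days without data in [start, end]."""
    observed = set(observed_days)
    gaps = []
    run_start = None
    day = start
    while day <= end:
        if day not in observed:
            if run_start is None:
                run_start = day  # a new run of missing days begins
        elif run_start is not None:
            gaps.append((run_start, day - timedelta(days=1)))
            run_start = None
        day += timedelta(days=1)
    if run_start is not None:
        gaps.append((run_start, end))  # gap extends to the end of follow-up
    return gaps


def classify_gap(gap, end, max_imputable_days=3):
    """Label a gap using illustrative, assumed rules (not a validated method)."""
    first, last = gap
    length = (last - first).days + 1
    if last == end:
        return "possible discontinuation"
    if length <= max_imputable_days:
        return "candidate for imputation"
    return "long gap"


# Hypothetical follow-up window and device-wear days.
start, end = date(2024, 1, 1), date(2024, 1, 10)
worn = [date(2024, 1, d) for d in (1, 2, 4, 5)]

gaps = find_gaps(worn, start, end)
labels = [classify_gap(g, end) for g in gaps]
```

Separating short interior gaps from a trailing gap keeps two different questions apart: whether brief missing periods can reasonably be filled in, and whether patients who stop wearing the device differ systematically from those who continue.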

These are just a few examples of issues that will need to be addressed when working with new sources of real world data, and we do not yet fully understand all of the challenges that they will bring. As we are just at the beginning of working with these new data sources, we must establish and standardize methods for protecting privacy while also realizing the full potential of data to improve patient health.