A vital step in generating real-world evidence (RWE) is understanding where the available data came from and how they were generated. This understanding is necessary for evaluating whether the data can meet a particular research need and for designing a robust study using those data.

Here are a few examples of questions that should be asked of any dataset, along with how an understanding of the data source can help answer them.

1. Who is represented in the data?

The data source plays a large role in determining what populations are available for study. For example, administrative claims sourced from employer-provided health insurance will mainly capture working-age adults. If the target population of a study includes older adults or the uninsured, then such data may not be appropriate. Populations in electronic health records (EHRs) may reflect the particular medical specialties of the providers that generated the data. As a result, certain therapeutic areas may be over- or under-represented, so it is important to be aware of the scope of practice of the included providers.

2. What data elements are missing?

Understanding the data source can also help determine what types of data are reliably collected. For example, because end-of-life events do not necessarily generate health care claims, mortality is often missing from insurance data. If mortality information is needed, then supplementing the claims with data from sources such as vital statistics offices might be necessary. Laboratory tests may generate a claim, but the actual test results tend not to be available in claims databases. As a result, if lab results are important, it may be beneficial to include EHR data. Conversely, because claims data are captured for the purpose of billing, they are more likely than EHRs to contain cost information.
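To make the supplementation idea concrete, here is a minimal sketch of attaching death dates from an external source to a claims-derived patient list. Everything in it is hypothetical: the field names (`patient_id`, `last_claim`, `death_date`), the sample records, and the assumption that both sources already share a reliable patient identifier. In practice, linkage is usually the hard part and is not shown here.

```python
from datetime import date

# Hypothetical claims-derived patients; claims alone carry no death date.
claims_patients = [
    {"patient_id": "A1", "last_claim": date(2021, 3, 4)},
    {"patient_id": "A2", "last_claim": date(2021, 5, 2)},
    {"patient_id": "A3", "last_claim": date(2020, 11, 30)},
]

# Hypothetical vital statistics extract, assumed to be keyed by the same identifier.
death_records = {"A2": date(2021, 5, 20)}


def add_mortality(patients, deaths):
    """Return copies of the patient records with a death_date field added.

    death_date is None when no death record was found. That means
    "no death observed in the linked source", not "confirmed alive".
    """
    return [{**p, "death_date": deaths.get(p["patient_id"])} for p in patients]


linked = add_mortality(claims_patients, death_records)
n_with_death = sum(1 for p in linked if p["death_date"] is not None)
print(f"{n_with_death} of {len(linked)} patients have a linked death date")
```

Note that a missing death date after linkage is still ambiguous, so the coverage of the supplementary source matters as much as the linkage itself.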

3. What therapies are captured?

Particular therapies can be missing from a given dataset in various ways. For example, over-the-counter drugs do not generate insurance claims and thus are not well represented in claims databases. Claims data often include only outpatient prescription data, so drugs dispensed in a hospital may not be observable. Furthermore, some therapies, such as injectables that are new to the market, may not yet have been assigned a standard billing code (e.g., a Healthcare Common Procedure Coding System [HCPCS] code), and thus may not be clearly represented in a claims dataset.

4. Is it possible to know when patients are observable?

Observability is a key concept in RWE studies. If a patient’s observability status is unknown, then we do not know if the absence of a health care event (e.g., a stroke) means the event did not happen—or that it happened but was not recorded. Thus, it is important to know the time intervals over which a patient is observable so that appropriate analytical methods can be applied to deal with periods of unobservability.

Claims datasets often provide excellent information on observability because they explicitly define periods of time when patients are enrolled in health insurance and are thus contributing relatively complete data on their health care encounters. EHR datasets, in contrast, generally do not indicate when patients can be seen at the provider network from which the EHR was derived—thus observability needs to be dealt with more carefully.
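The role of enrollment periods can be illustrated with a small sketch. It assumes a patient's enrollment is available as a list of (start, end) date pairs with inclusive end dates, and that back-to-back spans should be treated as continuous enrollment. The function names and the sample data are hypothetical and not drawn from any particular claims database.

```python
from datetime import date, timedelta


def merge_spans(spans):
    """Merge overlapping or back-to-back (start, end) spans; end dates are inclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + timedelta(days=1):
            # Span overlaps or directly follows the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_observable(spans, day):
    """True if the patient is enrolled, and hence observable, on the given day."""
    return any(start <= day <= end for start, end in spans)


def observable_days(spans, window_start, window_end):
    """Count enrolled days that fall inside a study window (inclusive bounds)."""
    total = 0
    for start, end in merge_spans(spans):
        lo, hi = max(start, window_start), min(end, window_end)
        if lo <= hi:
            total += (hi - lo).days + 1
    return total


# Hypothetical enrollment history with a gap from October through December 2020.
enrollment = [
    (date(2020, 1, 1), date(2020, 6, 30)),
    (date(2020, 7, 1), date(2020, 9, 30)),
    (date(2021, 1, 1), date(2021, 12, 31)),
]

merged = merge_spans(enrollment)
days_2020 = observable_days(enrollment, date(2020, 1, 1), date(2020, 12, 31))
in_gap = is_observable(merged, date(2020, 11, 15))
```

With this sample history, the absence of a stroke claim on a date inside the gap (where `is_observable` returns `False`) says nothing about whether a stroke occurred, which is why person-time is typically counted only over enrolled days. An EHR dataset generally offers no equivalent of the `enrollment` list, so the observable intervals would have to be approximated by other means.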

Answering questions such as these is the critical first step toward designing an effective study to produce RWE in which we can be confident.