By 2025, it's estimated that 463 exabytes of data will be created each day globally. To put that into perspective, most smartphones now come with 64 gigabytes of storage (a gigabyte is one thousand million bytes), and 64 gigabytes is only 0.000000064 of an exabyte. In recent years, there has been an explosion in the adoption of technology, with data infiltrating every corner of our lives. Each day, across multiple industries, channels and platforms, an endless stream of information is gathered, catalogued and stored. But how exactly are we using this data? And, more importantly, are we using it wisely? Here we explore how we can leverage this massive, and sometimes intimidating, resource to best serve patients and drive innovative outcomes.
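
The storage comparison above can be checked with a quick calculation. This is a minimal sketch that assumes decimal (SI) units, where a gigabyte is 10^9 bytes and an exabyte is 10^18 bytes; the variable names are illustrative only.

```python
# Decimal (SI) unit sizes in bytes; binary units (GiB, EiB) would differ slightly.
GIGABYTE = 10**9
EXABYTE = 10**18

phone_storage = 64 * GIGABYTE    # a typical 64 GB smartphone
daily_data = 463 * EXABYTE       # estimated data created per day by 2025

# Fraction of a single exabyte that one phone can hold.
phone_fraction_of_exabyte = phone_storage / EXABYTE

# Number of 64 GB phones needed to hold one day of global data.
phones_per_day = daily_data // phone_storage

print(phone_fraction_of_exabyte)  # 6.4e-08
print(phones_per_day)             # 7234375000
```

In other words, storing a single day of that data would take a little over seven billion such phones.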

Population health datasets are as diverse as the patients that they represent, consisting of any number of health-related metrics, segmented by parameters such as age, disease, geography, genomic profile or socio-economic circumstances. With the rise of technologies such as wearables, trackers, apps, patient portals and EHRs, healthcare data has been increasing on an exponential scale. It is believed that at least 30% of global data is generated by healthcare, and this growth is accelerating faster than in any other industry, including manufacturing, financial services, media and entertainment.

One of our greatest knowledge gaps is what happens to raw data after it is collected. Healthcare data is unusually complicated. Population data is disordered (it comes in a variety of formats: free text, charts, images, videos etc.), highly complex (because of human variability) and also extremely sensitive. On this note, though individuals have the right to privacy of their medical records, we should not underestimate the value that big health data provides. There is still much debate and confusion in the field around how this data should be used and who should be responsible for it. But there is a balance to strike. Last year we explored some of the more notable patient data debates in recent years: the good, the bad and the ugly. As discussed in a recent BMJ opinion piece, the boundaries are blurred, and de-identification of data for research by third parties has become the happy medium in this conundrum for now.

Data Mining Rat Race
In addition to data collection and analysis, the race to mine healthcare data on a population scale has well and truly begun, and with good reason. Health data mining can yield invaluable insights into the effects of lifestyle factors, interventions, treatments and non-health-related factors such as financial parameters, both on individuals and at a population level. Predictive and prescriptive data can be especially useful in forecasting disease and treatment outcomes and in ensuring the right treatment is delivered, at the right time, to the right patient, strengthening interventions and improving the impact of care. Healthcare is shifting from curative measures to prevention tactics, and the analysis of big health data is central to identifying key trends that highlight the causes or triggers for certain conditions, and also to pinpointing the best intervention points to slow or prevent their onset. The value of predicting risk groups, intervention crunch points and treatment outcomes is colossal. In 2019 Nature published a list of best practices for analysing large healthcare-centric datasets, describing an open process (demonstrated in the diagram above) by which datasets and results from studies are made available to allow for a more collaborative approach to analytics.

Successful data use
So who’s already successfully gleaning valuable insight from population health data, and what have they learned? The population health startup Color recently made headlines for securing a $167 million funding deal, indicating that the future is bright for those at the forefront of the population health game. Color provides a range of applications in the population health space, centred around the collection and analysis of genetic data. Their offerings include identifying genetic risk factors across oncology, cardiology and medication reactivity, and providing comprehensive support for large-scale health initiatives, working with a wide variety of stakeholders. Color have made notable strides forward in this space over the past few years, having teamed up with the National Institutes of Health to develop the All of Us program, a national research project in the U.S. dedicated to the prevention and treatment of disease through collecting data from a diverse population. The program has recently begun returning results to participants who have already donated biosamples.

Genomics, Oncology: Population data haven?
Another notable project in the population health space is the FinnGen project. Launched in Finland in 2017 with the goal of understanding the origins of diseases and their treatment, the study is a six-year endeavour aiming to utilise 500,000 unique blood samples from a nationwide network of biobanks. It brings together numerous public and private organisations, including universities, hospitals and several global pharmaceutical companies such as Biogen, Merck, Genentech and GSK. FinnGen have stated that the genomic data produced during the project will be returned to Finnish biobanks to provide the basis for new industrial partnerships, drug trials, monitoring studies, and other private-public projects. Valuable population and individual insights have already been gained from the study, including insights into the risk of developing cancer and into carrying specific gene defects which significantly increase that risk.

The research led to the definition of a “polygenic risk score” which sums up a number of genetic risk factors associated with breast cancer risk and can be used to assess the likelihood of developing the disease. Further, the polygenic risk score can also provide an accurate assessment of risk in relatives of breast cancer patients, massively expanding the potential beneficiaries of this research.
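
To make the idea of “summing up” genetic risk factors concrete, here is a minimal sketch of how a polygenic risk score is commonly computed: as a weighted sum of the number of risk alleles a person carries at each variant. This is an assumption about the general form, not a description of FinnGen's actual method, and the variant names and weights below are invented purely for illustration.

```python
from typing import Dict


def polygenic_risk_score(
    allele_counts: Dict[str, int],
    effect_sizes: Dict[str, float],
) -> float:
    """Weighted sum of risk-allele counts (hypothetical helper).

    allele_counts: copies of the risk allele per variant (0, 1 or 2).
    effect_sizes:  per-variant weight, e.g. a log odds ratio from a study.
    Variants missing from allele_counts are treated as 0 copies.
    """
    score = 0.0
    for variant, weight in effect_sizes.items():
        count = allele_counts.get(variant, 0)
        if count not in (0, 1, 2):
            raise ValueError(f"{variant}: allele count must be 0, 1 or 2")
        score += weight * count
    return score


# Invented weights and genotype, for illustration only.
example_weights = {"variant_a": 0.10, "variant_b": 0.25, "variant_c": -0.05}
example_person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

example_score = polygenic_risk_score(example_person, example_weights)
print(round(example_score, 2))
```

A raw score like this only becomes meaningful when compared against the distribution of scores in a reference population, which is one reason large biobank datasets matter.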

This intersection between population health data and oncology being explored by FinnGen is rapidly evolving. Established tech behemoths such as IBM Watson are developing multi-faceted cancer initiatives with the central goal of bringing data, technology and expertise together to transform health, stating on their website that they are exploring how “AI systems could ingest raw data and support oncologists as they make decisions for their patients”. Pharma is also getting in on the action. In 2019 Pfizer announced a partnership with Syapse, a real-world evidence company that lists Roche, Amgen and Merck amongst its investors. The partnership aims to advance cancer outcomes for patients globally by using real-world evidence, with an initial focus on molecular testing and precision oncology. In July of last year Syapse presented the results of a study investigating the effect of Covid-19 on cancer patients, which found that cancer patients diagnosed with Covid-19 were more likely to also have comorbidities affecting the kidneys, heart, lungs and blood vessels. They also found that older people with cancer were more likely to die from Covid-19, reflecting what is also seen in people without cancer. Further, the study produced interesting socio-economic findings about Covid-19, demonstrating that people with cancer and a low average household income of $0-30k were more likely to have worse outcomes than those with higher household incomes.

Lantern Pharma is another “emerging, oncology-focused, clinical stage pharma at the intersection of Artificial Intelligence, Genomics, and Machine Learning” to keep on the watch list. They have a proprietary technology called RADR (“Response Algorithm for Drug Positioning & Rescue”) that integrates data analytics, experimental biology, biotechnology, and machine-learning-based methods to predict oncology treatment outcomes. It will be of great interest to see how they perform in the coming years and whether they can make good on these claims.

Clearly, population health data is a hot and crowded market with many organisations getting in on the action, particularly in the area of oncology. However, involvement and inclusivity are key in driving ambitious population health studies. The NIH and FinnGen projects are testament to the fact that fostering a sense of collaboration and trust is imperative and will encourage more and more people to participate and donate samples.

Finally, working at the intersection of technology, biology and medicine is no small feat. It requires expertise on all fronts, and partnerships between key players in these fields are vital to the success of data mining endeavours. The possibilities, the datasets, and the analytics are endless. 90% of the world’s data was generated in the last two years alone. We are constrained only by the questions we can think to ask: do we now have more data than we know what to do with?