

Researchers at IBM, Berg Pharma, Memorial Sloan Kettering, UC Berkeley and other institutions are exploring how artificial intelligence and big data can be used to develop better treatments for diseases (as we explored in a separate story on Saturday).

But one of the biggest challenges for making full use of these computational tools in medicine is that vast amounts of data have been locked away — or never digitized in the first place.

The results of earlier research efforts or the experiences of individual patients are often trapped in the archives of pharmaceutical companies or the paper filing cabinets of doctors’ offices.

Patient privacy issues, competitive interests and the sheer lack of electronic records have prevented information sharing that could potentially reveal broader patterns in what appeared to any single doctor to be an isolated incident.

When you can analyze clinical trials, genomic data and electronic medical records for 100,000 patients, “you see patterns that you don’t notice in a couple,” said Michael Keiser, an instructor at the UC San Francisco School of Medicine.
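Keiser's point about scale can be sketched in a few lines of Python: a drug-response pattern tied to one mutation subgroup is noisy or invisible in a handful of records but unmistakable across 100,000. Everything below — the mutation names, the response rates, the record layout — is fabricated purely for illustration.

```python
import random

def response_rate(records, mutation):
    # Fraction of patients carrying `mutation` who responded to treatment.
    subgroup = [r for r in records if r["mutation"] == mutation]
    if not subgroup:
        return None
    return sum(r["responded"] for r in subgroup) / len(subgroup)

def simulate(n):
    # Hypothetical cohort: the drug works well only for one mutation subgroup.
    records = []
    for _ in range(n):
        mutation = random.choice(["KRAS-G12C", "EGFR", "other", "other"])
        true_rate = 0.6 if mutation == "KRAS-G12C" else 0.15
        records.append({"mutation": mutation,
                        "responded": random.random() < true_rate})
    return records

random.seed(0)  # reproducible illustration
small = simulate(10)        # one doctor's handful of patients: pure noise
large = simulate(100_000)   # pooled data: the subgroup effect stands out
```

With ten records the estimated rate for any subgroup swings wildly; with a hundred thousand it converges on the underlying 60 percent, which is the pattern "you don't notice in a couple."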

Given that promise, a number of organizations are beginning to pull together medical data sources.

Late last year, the American Society of Clinical Oncology announced the initial development of CancerLinQ, a “rapid learning system” that allows researchers to enter, access and analyze anonymized medical records of cancer patients.

Similarly, in April the CEO Roundtable on Cancer, a nonprofit representing major pharmaceutical companies, announced the launch of Project Data Sphere. It’s an open platform populated with clinical datasets from earlier Phase III studies conducted by AstraZeneca, Bayer, Celgene, Memorial Sloan Kettering, Pfizer, Sanofi and others.

The data has been harmonized and scrubbed of patient identifying details, enabling independent researchers or those working for life sciences companies to use it freely. They have access to built-in analytical tools, or can plug the data into their own software.
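As a rough illustration of what "scrubbed of patient identifying details" involves, here is a minimal de-identification sketch. The field names, the salt and the sample record are invented for the example; real de-identification (HIPAA's Safe Harbor rule, for instance, covers 18 identifier categories) is considerably more involved.

```python
import hashlib

# Fields that directly identify a patient (illustrative, not exhaustive).
DIRECT_IDENTIFIERS = {"name", "address", "phone", "ssn"}

def deidentify(record, salt):
    """Drop direct identifiers, replace the patient ID with a salted hash,
    and generalize the birth date to year only."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    digest = hashlib.sha256((salt + record["patient_id"]).encode()).hexdigest()
    out["patient_id"] = digest[:12]          # stable pseudonym for joins
    out["birth_date"] = record["birth_date"][:4]  # keep year only
    return out

record = {"patient_id": "P-001", "name": "Jane Doe",
          "birth_date": "1962-07-14", "diagnosis": "NSCLC",
          "phone": "555-0100"}
clean = deidentify(record, salt="demo-project-salt")
```

The salted hash keeps a patient's records linkable across datasets without exposing the original ID — the kind of harmonization that lets independent researchers work with pooled trial data.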


It might uncover little-known drug candidates that showed some effectiveness against certain mutations but were essentially abandoned when they didn’t directly attack the principal target of a particular study, said Dr. Martin Murphy, chief executive of the CEO Roundtable on Cancer.

In some cases, it could also eliminate the need for control groups — those who receive the standard of care plus a placebo instead of the experimental treatment — since earlier studies have already indicated the outcomes for those patients. (That would be an important development because the fear of receiving a placebo is a major reason many patients decide against participating in clinical trials.)

The effort is happening now in part because of improving technology and in part because companies are coming around to the view that they’ll all be better off with the insights gleaned from this pooled data.

“It’s a recognition that it’s costing a lot more money to develop another drug,” Murphy said. “The low-hanging fruit was long ago harvested.”

Other information-sharing efforts include the Global Alliance for Genomics and Health, the molecular databases maintained by EMBL-EBI and the National Institutes of Health’s Biomarker Consortium.

Meanwhile, last month Google Ventures led a $130 million round in Flatiron Health, which has built an “oncology cloud” that aggregates information from billing systems and electronic medical records.

The system makes sense of data stored in inconsistent and unstructured formats from doctors’ offices and hospitals, enabling analysis of what’s happening across broad cancer patient populations. Ideally it can highlight what’s working for which types of cancer patients.

“Flatiron is focused on what we (and the industry) call ‘real world’ patient clinical data, whereby we’re trying to aggregate and organize data on the 96 percent of patients who do not participate in a prospective clinical trial,” co-founder Nat Turner said in an email.

“To really understand what’s working and how others are treating and what outcomes are being achieved, institutions should be open to de-identified data sharing and anonymous benchmarking, which is part of the Flatiron vision,” he said.


To be sure, there is good reason to proceed with some caution here. Medical information is highly sensitive, so any privacy risks demand careful consideration.

Supposedly “de-identified” data has proven to be anything but on several notable occasions in the past. And electronic medical records have been compromised already.
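The classic failure mode is a linkage attack: records stripped of names can still be re-joined to a public dataset on shared quasi-identifiers such as ZIP code, birth date and sex. A toy sketch, with entirely fabricated records:

```python
# "De-identified" medical rows: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "dob": "1971-02-18", "sex": "M", "diagnosis": "asthma"},
]

# A public record (e.g. a voter roll) that does carry names.
voter_roll = [
    {"name": "A. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]

def relink(medical_rows, roster):
    # Join the two datasets on the shared quasi-identifier triple.
    key = lambda r: (r["zip"], r["dob"], r["sex"])
    names_by_key = {key(r): r["name"] for r in roster}
    return [{**m, "name": names_by_key[key(m)]}
            for m in medical_rows if key(m) in names_by_key]

matches = relink(medical, voter_roll)  # diagnosis re-attached to a name
```

When the quasi-identifier combination is unique in the population, one join is enough to undo the "de-identification" — which is why the pooled-data efforts above generalize or suppress those fields rather than merely deleting names.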

But to the degree that there’s a social tradeoff here, many come down firmly on the side of trying to save lives. Old habits and out-of-date regulations still mean the shift isn’t happening nearly fast enough, if you ask David Patterson, a professor of computer science at UC Berkeley who is developing machine learning tools for cancer research.

“Those of us in the computer field are used to Internet time and Moore’s law,” he said. “For me as an outsider, it’s very frustrating that we can’t get bureaucratic agreement so that we can collect lots of data sets together.”

“Patient privacy is important but so is making progress on cancer,” he said. “The upside of collecting lots of information together is we can make progress on this terrible disease.”

No one interviewed for this article could point to a breakthrough treatment produced by these techniques to date. After all, the tools are new, the data sets are just coming together and clinical trials take years.

But nearly all agreed researchers are on the verge of something big.

“The tips of your shoes are just poking over the edge of the peaks,” Murphy said. “No one has been over this before in cancer.”


We in the UK are doing just this with the NHS, with all its data being released in aggregate form. It will be transformative for research. There were a few privacy concerns, but the information is all aggregate. Once people were informed that individual information would not be included, the opposition petered out. Being the NHS, we also have a huge amount of data from a large population within one health system. The data will be incredibly valuable, with some predicting it could be the biggest boon for our tech industry and make us a world leader. Can't wait to see some results.


There is so much crap in the medical literature that I wonder how it affects those machine-learning algorithms. Maybe Mr. Watson should work on curating the literature first and share it.

Dr. Majorek

There is no question that big data has the potential to save lives. At the same time, patient privacy needs to be protected and safeguarded, since there is great potential for abuse. As cofounder of a patient engagement platform, MDCapsule, I see an opportunity in involving consumers in the discussion. Why not engage patients in the discussion of big data? Isn't it the patient's own data that has the potential to save lives, and shouldn't it be the patient who decides whether it can be used or not?


I've often wondered what information could be gleaned from all those supermarket loyalty cards most of us use to buy food. Tens of millions of people have been using them for over a decade, and combining that purchase history with these people's health data could finally yield definitive insights on diet and health.

