Too Complicated For Google Search: Palantir’s Health Team Tackles Challenges with Data Analysis

Mailyn Fidler

This outbreak was different; that much was clear. In the summer of 2010, I was an intern assisting analysts in the Emergency Operations Center (EOC) at the Centers for Disease Control and Prevention (CDC). They were on alert, monitoring the progression of an outbreak of Salmonella Javiana in Indiana. This infection causes particularly nasty food poisoning and affects about a million people annually.1 More people in Indiana were falling ill from this strain than ever before, and the average age of those infected was much higher than usual.

As case counts rose, the analysts grew increasingly worried about finding the source to stop the outbreak.  These analysts, however, had the advantage of an unusual group of sidekicks.  Next to them sat the health team from Palantir, a Silicon Valley company specializing in data analysis and integration.

Lekan Wang is one of those team members. Wang, a Stanford graduate, joined Palantir’s health team during the epidemic. Wang’s team includes a Stanford immunology PhD, a former director of engineering at a startup, and a medical student on leave from Dartmouth.

PayPal executives founded Palantir Technologies in the wake of 9/11 to provide solutions to some of the world’s biggest challenges with better data integration and analysis.  Palantir has 500 employees in offices across the globe. The health team, Wang said, “was created because we saw problems [across our healthcare system], and there was a lot of data involved, and we thought, how can we use Palantir to make it better?”

Wang’s team built a data analysis platform for the CDC’s Outbreak Response and Prevention Branch within the Division of Foodborne, Waterborne, and Environmental Disease. The platform integrates huge amounts of structured and unstructured data used during an outbreak. It allows analysts to search the data with different filters, record connections between data, and track and share how they develop their analysis.

“If you need to visualize your data, that’s assuming you already have your data nicely organized,” Wang said.  In a large organization, hundreds of databases may be used, and this organized data must be integrated with less “formal” data – like that “Excel file emailed back and forth for years,” Wang said.

The health team customized the platform to combine different types of data, which can then be analyzed with multiple searches and filters. The platform makes working with the data intuitive, so that complicated database interactions become accessible to analysts. It also makes searches more specific. The goal was to create a system that could “give me every location of a childcare center that the wind could have blown this organism to,” Wang said. This problem is fundamentally different from everyday searching and “gets a little complicated for Google search.”

The team had to address concerns about data sharing.  “Most health info is highly sensitive. We have to access this data very quickly while preserving security,” Wang said.  The team also wanted to allow analysts to share new data and interpretations with each other easily.

“With spreadsheets, it gets out of sync and there’s a lot of mess.  With the CDC, especially with foodborne epidemiology, you are interacting with states, who probably have their own data formats,” Wang said. The platform allows quick sharing of new data with the added feature of “sharing analysis – you can share a concept and the path of how you got there.”

The platform helped analysts pick apart the details of the epidemic. Time- and location-based searches of epidemiologic data helped them separate the Indiana epidemic from the one in the southeast. Combined with traditional fieldwork, the platform helped cluster cases around a group of restaurants. The tool’s integration of supply chain information with epidemiological data helped determine the exact cause of the outbreak – lettuce from a farm in Salinas Valley, California.

These discoveries are feasible outside of Palantir’s framework.  The platform, however, makes these determinations faster, saving time, resources, and probably lives. Essentially, Wang said, the system is a “platform that lets you chase your thoughts and quickly iterate on an idea.” The platform will continue to integrate data from the CDC over time, providing an easy access point to CDC institutional knowledge.

Palantir deployed the platform for use during the cholera outbreaks following the Haiti earthquake.  The platform helped “integrate Twitter posts and other data sources to try to get the early alert to where cases were happening,” Wang said.  Currently, the team is turning its thoughts towards the health insurance realm and how to process data to reduce healthcare costs.

Overall, the team approaches the intersection of public health and computer science with an inclusive attitude. “Computers are very good at certain things, humans are very good at certain things, so combining both of the best is where we really get our value,” Wang said. In this case, the platform pairs the computer’s vast memory and its capacity to calculate possibilities with the analyst’s intuition.

A standard outbreak survey tool for use across states, Wang said, is probably next on the team’s development list for the CDC.  To conduct surveys during the salmonella outbreak, EOC staffers spent hours at the phone banks, collecting data the old-fashioned way, by hand and on paper.  Palantir’s platform, with its lowered barrier between data and analysts, would have been a welcome addition to those late nights.

Lekan Wang, MS&E ’09 and Computer Science (Biocomputation) ’11, works at Palantir Technologies and enjoys the free T-shirts they give him.

1. Multistate Outbreaks of Salmonella Infections Associated with Raw Tomatoes Eaten in Restaurants.  Available at http://www.cdc.gov/mmwr/preview/mmwrhtml/mm5635a3.htm.  Accessed February 26, 2012.