Google Flu Trends for developing countries?
A few days back Aman wrote a post about Google Flu Trends. Thought I’d add a few thoughts here after reading the draft manuscript that the Google-CDC team posted in advance of its publication in Nature.
By the way, here’s what Nature says: “Because of the immediate public-health implications of this paper, Nature supports the Google and the CDC decision to release this information to the public in advance of a formal publication date for the research. The paper has been subjected to the usual rigor of peer review and is accepted in principle. Nature feels the public-health consideration here makes it appropriate to relax our embargo rule.”
Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Draft manuscript for Nature. Retrieved 14 Nov 2008.
Assuming that few folks will read the manuscript or the article, here are some highlights. I should say I appreciated that the article was clearly written. If you need more context, check out Google Flu Trends. How does this work?…
- Targets health-seeking behavior of Internet users, particularly Google users [not sure those are different anymore], in the United States for ILI (influenza-like illness)
- Compared to previous work attempting to link online activity to disease prevalence, benefits from volume: hundreds of billions of searches over 5 years
- Key result – reporting lag reduced to about one day, compared to 1-2 weeks for CDC’s surveillance system
- Spatial resolution based on IP address goes to the nearest big city [for example, my current IP maps to Oakland, California], but the system currently only reports at the level of states – this is more detailed than CDC’s reporting, which is based on 9 U.S. regions
- CDC data was used for model-building (a linear model fit on the log-odds, i.e. the logit, of the ILI visit percentage and the ILI-related query fraction) as well as comparison [for stats nerds - the comparison was made with held-out data]
- Not all states publish ILI data, but they were still able to achieve a correlation of 0.85 in Utah without training the model on that state’s data
- There have been attempts to look at disease outbreaks of enterics and arboviruses, but without success.
- For those familiar with GPHIN and Healthmap, two other online surveillance systems, the major difference is in the data being examined – Flu Trends looks at search terms while the other systems rely on news sources, websites, official alerts, and the like
- There is a possibility that this will not model a flu pandemic well, since the search behavior used for modeling is based on non-pandemic varieties of flu
- The modeling effort was immense – “450 million different models to test each of the candidate queries”
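For the stats nerds, here is a rough sketch of what "fit on CDC data, compare on held-out data" could look like for a single query signal. It fits a straight line between the log-odds of the query fraction and the log-odds of the ILI visit fraction, then checks the correlation on weeks the model never saw. To be clear, the data is synthetic, and the coefficients, noise level, and 90/30 week split are my own assumptions for illustration, not numbers from the manuscript.

```python
import math
import random


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: maps log-odds back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# SYNTHETIC weekly data (assumption): a seasonal query fraction, and an ILI
# visit fraction generated from it with made-up coefficients plus noise.
rng = random.Random(0)
weeks = range(120)
query_fraction = [0.002 + 0.0015 * (1 + math.sin(2 * math.pi * w / 52)) for w in weeks]
ili_fraction = [inv_logit(2.0 + 0.9 * logit(q) + rng.gauss(0, 0.05)) for q in query_fraction]

# Train on the first 90 weeks, hold out the last 30 for comparison.
train_q, test_q = query_fraction[:90], query_fraction[90:]
train_p, test_p = ili_fraction[:90], ili_fraction[90:]

beta0, beta1 = fit_line([logit(q) for q in train_q], [logit(p) for p in train_p])
predicted_p = [inv_logit(beta0 + beta1 * logit(q)) for q in test_q]
held_out_corr = pearson(predicted_p, test_p)

print(f"beta0={beta0:.2f} beta1={beta1:.2f} held-out correlation={held_out_corr:.2f}")
```

The real system had to choose which queries to aggregate in the first place, which is where the enormous number of candidate models comes from; this sketch skips that step entirely.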
So what does this mean for developing world applications?
Here’s what the authors say: “Though it may be possible for this approach to be applied to any country with a large population of web search users, we cannot currently provide accurate estimates for large parts of the developing world. Even within the developed world, small countries and less common languages may be challenging to accurately survey.”
The key is whether there are detectable changes in search in response to disease outbreaks. This is dependent on Internet volume, health-seeking search behavior, and language. And if there is no baseline data, such as CDC surveillance data, then what is the best strategy for model-building? How valid will models be from one country to another? That probably depends on the countries. Is it perhaps possible to have a less refined output, something like a multi-level warning system that decision makers can follow up on with on-the-ground resources? Or should we be focusing on news+ like GPHIN and Healthmap?
Another thought is that we could mine SMS traffic for detecting disease outbreaks. The problem becomes more complicated, since we’re now looking at data that is much more complex than search queries. And there is often segmentation due to the presence of multiple phone providers in one area. Even if the data were anonymized, this raises huge privacy concerns. Still, it could be a way to tap into areas with low Internet penetration and to provide detection based on very real-time data.
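To make the "less refined output" idea a bit more concrete, here is a minimal sketch of a multi-level warning that needs no surveillance data at all. It compares the current volume of flu-related searches against that region's own recent history. The function name, the three levels, and the z-score thresholds are all hypothetical choices of mine, and whether search volume in a low-connectivity region is stable enough for this to mean anything is exactly the open question.

```python
import statistics


def warning_level(baseline_counts, current_count, watch_z=2.0, alert_z=3.0):
    """Classify the current count against a region's own historical baseline.

    baseline_counts: past weekly counts of flu-related searches (hypothetical input).
    watch_z / alert_z: illustrative thresholds, not validated values.
    """
    mean = statistics.mean(baseline_counts)
    spread = statistics.stdev(baseline_counts)
    if spread == 0:
        # Flat baseline: treat any increase as worth a look, nothing more.
        return "watch" if current_count > mean else "normal"
    z = (current_count - mean) / spread
    if z >= alert_z:
        return "alert"
    if z >= watch_z:
        return "watch"
    return "normal"


history = [10, 11, 9, 10, 10, 12, 9, 11]  # made-up weekly counts
for count in (10, 12.5, 30):
    print(count, warning_level(history, count))
```

The output here is only a prompt for someone to go and check on the ground, not an estimate of how many people are sick.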