Team taps Wikipedia for flu forecasting data

The space occupied by Google ($GOOG) Flu Trends is becoming more congested. One year after the Centers for Disease Control and Prevention (CDC) put out a call for ways to use digital data to forecast the flu season, details of the entrants' models are still trickling in as academic papers are published.

A team from the Los Alamos National Laboratory is the latest to publish details of its model, which uses Wikipedia access logs to predict the spread of influenza. By combining the access logs for Wikipedia articles related to flu with CDC data, the team has created a model which it claims can accurately predict the incidence of influenza. The model goes one step further than Google Flu Trends by trying to predict the future, instead of just giving a real-time estimate of the spread of the virus.

The Wikipedia-based approach could suffer from similar problems to Google Flu Trends, namely the possibility that the online activity of hypochondriacs will skew the model. Test data published to date suggest the Wikipedia model doesn't suffer from overestimation, perhaps because CDC data is built into its calculations. In fact, the model tends to underestimate flu incidence towards the end of the season. People not visiting Wikipedia when infected for a second time could account for the blip.

CDC handed out prizes for its flu forecasting competition in June, awarding first place to a team that combined data from Google with information the public health agency gathers from physicians. Google is now applying a similar combination approach to its own Flu Trends project.

- read the paper (PDF)
- here's MIT Technology Review's take