Tweet went the data

August 31, 2011 | by Kristen Intlekofer

“I feel like I’m walking around with drunk goggles on my face. Meh, allergies.”

“I officially have the flu … spent all last night puking my brains out, and I’ve gotta find the energy to go to work in the am!”

—Twitter messages

Illustration by Brucie Rosch

The art of the overshare in 140 characters or fewer. Thanks to public Twitter accounts, people can hypothetically reach millions of users with each tweet. With an average of 200 million tweets every day, that’s a lot of information, ranging from the vital to the banal. To put things in perspective, it would take more than 31 years for one person to read 200 million tweets, the equivalent of a 10 million–page book. But for Johns Hopkins computer scientists Mark Dredze and Michael Paul, those messages represent a vast collection of data waiting to be put to use.

Last October, Dredze and Paul began pursuing the idea of mining Twitter for public health information. They knew the popular social media site had amassed what amounted to a very large public data set, and that researchers from various disciplines had begun to explore its uses, says Dredze, a research scientist at the university’s Human Language Technology Center of Excellence and an assistant research professor of computer science at the Whiting School of Engineering. Although analysts are increasingly turning to social media to gauge public opinion and study a variety of other phenomena, researchers are just beginning to look at status updates and tweets as a source of public health data.

“Health hasn’t really been explored,” says Paul, “and since we’re at Hopkins, the hope is that if we start going in this direction with public health, then we can start collaborating” with public health researchers and medical professionals at the university.

A few initial studies have analyzed Twitter for influenza information, but the Hopkins researchers wanted to go beyond the current research to see if they could track more than just flu data. They adapted a model that can quickly and inexpensively comb millions of public Twitter messages to identify up-to-the-minute trends. Dredze and Paul tracked flu patterns over time, revealed geographic correlations for things like tobacco use and cancer rates, and identified trends in self-medication for illnesses that don’t typically require a doctor’s visit. For example, Twitter users reported taking Tylenol or Advil for pain relief and Claritin or Zyrtec for allergies.

The key was to look at aggregate information, not individual tweets, to determine trends, says Paul. Starting with more than 2 billion tweets collected during 2009 and 2010, they whittled them down to 1.63 million messages after sorting for health-related content. In total, they could distinguish 15 ailments using clustering, a statistical technique that groups related keywords together. (For example, the words allergies, sneezing, and Claritin might be grouped based on patterns the model detects in the data.) Beyond tracking flu data, they could automatically identify additional ailments such as allergies, depression, and obesity, which made their study the first of its kind. Their paper, “You Are What You Tweet: Analyzing Twitter for Public Health,” was published by the Association for the Advancement of Artificial Intelligence earlier this year, and Paul presented their work at the AAAI’s International Conference on Weblogs and Social Media in Barcelona, Spain, over the summer.
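To make the filter-then-cluster idea concrete, here is a minimal sketch in Python. It is not the researchers' model; the article does not describe their algorithm in detail. The keyword list, the sample tweets, and the function names (`filter_health`, `group_keywords`) are all hypothetical, and the grouping rule used here (link two keywords when they appear together in at least two tweets) is a simple stand-in for a real statistical clustering model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical seed vocabulary; the article does not list the actual keywords.
HEALTH_WORDS = {"allergies", "sneezing", "claritin", "zyrtec",
                "flu", "fever", "tylenol", "advil"}

# Made-up sample tweets, for illustration only.
TWEETS = [
    "Sneezing all day, allergies are the worst. Claritin please",
    "allergies again... sneezing nonstop, took claritin",
    "Flu and a fever, taking Tylenol",
    "fever won't break. flu is awful, more tylenol",
    "Great game last night!",
]

def health_terms(tweet):
    """Return the health keywords found in one tweet."""
    words = {w.strip(".,!?'\"").lower() for w in tweet.split()}
    return words & HEALTH_WORDS

def filter_health(tweets):
    """Keep only tweets that mention at least one health keyword."""
    return [t for t in tweets if health_terms(t)]

def group_keywords(tweets, min_count=2):
    """Group keywords that co-occur in at least min_count tweets."""
    pairs = Counter()
    for t in tweets:
        for a, b in combinations(sorted(health_terms(t)), 2):
            pairs[(a, b)] += 1

    parent = {}  # union-find forest over linked keywords

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for (a, b), count in pairs.items():
        if count >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(g) for g in groups.values())

health_tweets = filter_health(TWEETS)
keyword_groups = group_keywords(health_tweets)
```

On these sample tweets the sketch drops the off-topic message and ends up with two keyword groups, one around allergies and one around flu. The grouping comes from patterns across many messages rather than from any single tweet, which is the point Paul makes about aggregate information.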

Their concept is similar to Google Flu Trends, a website that maps flu activity around the world based on data from Google searches. The idea is that if a large number of users in a geographic area are Googling the word flu, chances are there will be a corresponding spike in influenza cases in that region. Google Flu Trends has proven accurate when compared with data from the Centers for Disease Control and Prevention, and Google’s method of collecting information is cheaper and faster than the CDC’s. If the merits of Google data have already been proven, why turn to Twitter? “The reason is that only Google has Google’s data,” explains Paul. “For good reason, that data is private.” He cites an incident five years ago when AOL released more than 650,000 users’ private search histories without their permission, a stunt that prompted a class-action lawsuit. Twitter’s information is public.
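One simple way to check whether an online signal tracks official figures is to compute the correlation between the two series. The sketch below does this with a hand-written Pearson correlation. The weekly numbers are invented for illustration and do not come from Google, Twitter, or the CDC, and the article does not say which comparison method was actually used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented weekly counts for one region (assumption, not real data).
flu_mentions = [120, 150, 310, 480, 390, 200]   # searches or tweets
reported_cases = [30, 42, 85, 130, 110, 55]     # official case reports

r = pearson(flu_mentions, reported_cases)
```

A value of `r` close to 1 would mean the online signal rises and falls with the reported cases; a value near 0 would mean it carries little information about them.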

Another advantage to looking at Twitter, Dredze adds, is that tweets can provide more information than Google search history. “When you do a search on Google, you’re looking for information. So you might say, ‘flu medicine.’ On Twitter, you’re expressing information, so you might say, ‘Taking Tylenol for the flu.’ And so that gives us a lot more information than what people are putting into Google.”
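Dredze's contrast between a search query and a tweet can be illustrated with a small pattern-matching sketch. It assumes a single hypothetical phrasing, "taking X for (the/my) Y", and pulls out the medication and the ailment. The pattern, the function name `extract_treatments`, and the sample messages are illustrative assumptions; a real system would need to handle far more varied language.

```python
import re
from collections import Counter

# Hypothetical pattern covering one phrasing: "taking <medication> for [the|my] <ailment>".
TREATMENT = re.compile(
    r"\btaking\s+(\w+)\s+for\s+(?:the\s+|my\s+)?(\w+)",
    re.IGNORECASE,
)

def extract_treatments(messages):
    """Return (medication, ailment) pairs found in the messages."""
    pairs = []
    for message in messages:
        for medication, ailment in TREATMENT.findall(message):
            pairs.append((medication.lower(), ailment.lower()))
    return pairs

# Made-up examples: the first is search-style, the rest are tweet-style.
messages = [
    "flu medicine",
    "Taking Tylenol for the flu",
    "taking Claritin for my allergies today",
    "Ugh, taking tylenol for the flu again",
]

treatment_counts = Counter(extract_treatments(messages))
```

The search-style text "flu medicine" yields nothing, while the tweet-style messages each yield a medication and ailment pair that can be counted across many users.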

The next step, Dredze says, will be to work with experts in medicine and public health to determine what questions they should be asking, such as whether people are promoting health information correctly or whether their tweets reflect widespread misperceptions about health (for example, using antibiotics to treat the common cold or the flu). As Twitter expands its reach to other countries, Paul says, the model might also be useful as a first-alert system to detect new epidemics.