Twitter, Big Data, and Jobs Numbers
August 18, 2014 | by Susan Hutton
People turn to Twitter to see what’s trending in the news. But LSA Economics Professor Matthew Shapiro has found a new way to harvest employment information from tweets and hashtags faster and more accurately than the government’s official reports.
LSA Economics Professor Matthew Shapiro is one of the rare people who can claim to be getting work done when looking at social media. He’s not following sports scores or breaking news; he’s tracking the nation’s labor market.
“When we started,” explains Shapiro, “we had no idea if we could track job loss with tweets, but over a two-year period, we’ve seen the social media index perform quite well. In 2011-2012, for example, our index leveled off just like the official numbers. We captured big job-loss fluctuations around Hurricane Sandy, and around the government shutdown in October 2013.”
There were times when Shapiro’s numbers matched the reports, and there were times when they didn’t. When they differed, Shapiro’s numbers were more accurate than the government’s.
When the state of California got new computers, for example, there were delays in processing unemployment claims. Government data reflected the slowdown in processing applications, but social media captured a more accurate picture of what was happening in the labor market.
“Our series was stable,” says Shapiro, the Lawrence R. Klein Collegiate Professor of Economics, “so our numbers were, in some ways, a better indicator.”
By the Numbers
Many important economic indicators, including the Bureau of Labor Statistics unemployment rate and the University of Michigan Consumer Sentiment Index, are collected based on surveys. But it’s hard and costly to get enough people to answer a survey in order to get a representative sample. It’s hard to verify whether the people responding to the survey are speaking for just themselves or for everyone living in the household. And as anyone who’s ever abandoned a phone survey after slogging through the first five-part question can attest, people don’t always have the time or patience to respond.
Plus, Shapiro says, as technology has changed, people’s habits have, too. “It’s become increasingly common for households not to have a landline,” he explains by way of example. “It’s possible to conduct the survey on a cell phone, but it’s more difficult. And people simply often do not respond to mail surveys.”
For all of these reasons, Shapiro thinks Twitter might be a better, faster, more accurate source. “In tweets,” he explains, “people might say ‘2011 was a funny year: I lost my job, started a business.’ They’re not creating data consciously. They’re communicating with friends.
“And if this is how people are going to communicate,” he adds, “it behooves us to figure out how to turn social media communications into meaningful measures.”
Of course, in order to search for employment data, Shapiro and his U-M collaborators, Margaret Levenstein of the Survey Research Center and Michael Cafarella and Dolan Antenucci of the Computer Science Department, need to use the right words and phrases. To find them, they begin with what Shapiro describes as “the expert approach” and test their efforts with “the web corpus approach.”
The expert approach creates search terms based on words and phrases people commonly use when they talk about jobs and employment. Shapiro and his team do a search with those phrases and then check the results to make sure they don’t get misleading terms.
“‘Lost work’ was one of the first phrases we chose,” Shapiro recalls, “but it didn’t take long to figure out that it usually referred to a computer hard disk crashing. When we looked at the results, we saw that the word ‘computer’ showed up often.”
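The kind of check Shapiro describes can be sketched in a few lines of Python. This is only an illustration, not the team's actual code: the function name `cooccurring_words`, the sample tweets, and the stopword list are all made up for the example. It counts which other words show up in tweets that contain a candidate phrase, so a reviewer can spot a misleading context such as "computer."

```python
from collections import Counter
import re

def cooccurring_words(tweets, phrase, stopwords=frozenset()):
    """Count words appearing in tweets that contain `phrase` (hypothetical helper)."""
    counts = Counter()
    phrase = phrase.lower()
    for tweet in tweets:
        text = tweet.lower()
        if phrase in text:
            # Drop the phrase itself so only the surrounding context is counted.
            remainder = text.replace(phrase, " ")
            words = re.findall(r"[a-z']+", remainder)
            counts.update(w for w in words if w not in stopwords)
    return counts

# Invented sample tweets, for illustration only.
sample_tweets = [
    "Computer crashed and I lost work on my essay",
    "Lost work again, this computer hates me",
    "I lost my job today",
]
stop = frozenset({"and", "i", "on", "my", "this", "me", "again"})
counts = cooccurring_words(sample_tweets, "lost work", stop)
print(counts.most_common(3))
```

In this toy sample, "computer" is the most frequent companion of "lost work," which is the sort of signal that would prompt an expert to drop the phrase.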
In general, the expert approach works well, Shapiro says, “though it has some issues: in particular, that it might miss the things you haven’t thought of.”
The web corpus approach takes care of that problem because it includes absolutely everything on the web. The millions of web authors who have designed billions of web pages, all the videos we’ve watched, and the comments we could not resist posting have all unwittingly created a massive data set for computer scientists, statisticians, and linguists to mine.
“Using the web corpus,” says Shapiro, “is one example of how economists learn from collaborating with computer scientists.”
The computer scientists on the team take a snapshot of the web corpus, and they run a search of the expert-selected phrases. This search identifies additional terms that appear within 100 words of the “expert” phrases and generates a dictionary of possible terms to search for in tweets. This new dictionary is too massive to practically use, so the team runs it through a statistical procedure that prunes the list of phrases.
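The windowed search can be pictured with a small Python sketch. It is a simplified stand-in, assuming the text has already been split into lowercase word tokens; the function name `nearby_terms` and the sample sentence are hypothetical, and the real system works on a web-scale snapshot rather than a single list.

```python
from collections import Counter

def nearby_terms(tokens, seed, window=100):
    """Count tokens within `window` words of each occurrence of the seed phrase.

    `tokens` is a list of words; `seed` is the expert phrase as a list of words.
    Hypothetical helper for illustration only.
    """
    seed = list(seed)
    n = len(seed)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == seed:
            # Words before and after the phrase, up to `window` on each side.
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + n:i + n + window])
    return counts

# Invented sentence, for illustration only.
doc = "after the layoffs i lost my job and filed for unemployment benefits".split()
candidates = nearby_terms(doc, ["lost", "my", "job"], window=3)
print(sorted(candidates))
```

With a window of three words the sketch picks up "layoffs" and "filed" but not "unemployment"; with the default window of 100 it picks up every other word in the sentence, which hints at why the raw dictionary grows too large to use without pruning.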
“‘Jack LaLanne Juicer’ might occur by chance around ‘unemployment,’” Shapiro says, but that coincidence of proximity doesn’t tie the terms together.
“The procedure looks for words that occur together systematically and tests such occurrences to verify whether the terms carry information.” The team’s eventual aim is to rely solely on the automated procedures and to take the expert out of it.
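The article does not say which statistical procedure the team uses. One common way to ask whether two terms occur together more often than chance would predict is pointwise mutual information (PMI), shown here purely as a stand-in. The function name `pmi` and every count below are invented for illustration.

```python
import math

def pmi(n_both, n_term, n_seed, n_docs):
    """Pointwise mutual information (base 2) between a candidate term and a seed phrase.

    n_both: documents containing both; n_term, n_seed: documents containing each;
    n_docs: total documents. A score near 0 means the pair co-occurs about as
    often as chance predicts; a large positive score means a systematic link.
    """
    if min(n_both, n_term, n_seed) == 0:
        return float("-inf")
    p_both = n_both / n_docs
    p_term = n_term / n_docs
    p_seed = n_seed / n_docs
    return math.log2(p_both / (p_term * p_seed))

# Made-up counts, not real data.
layoffs = pmi(n_both=40, n_term=50, n_seed=100, n_docs=10_000)
juicer = pmi(n_both=1, n_term=100, n_seed=100, n_docs=10_000)
print(round(layoffs, 2), round(juicer, 2))
```

Under these invented counts, "layoffs" scores well above zero while the juicer scores roughly zero, so a pruning step that keeps only high-scoring terms would retain the first and discard the second.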
If Shapiro’s index continues to hold up, the resulting data will not only be accurate but could also reach the people who steer policy much more quickly. “In turning points or times of crisis,” concludes Shapiro, “having something available instantaneously is critical for policy makers. Our aim is to have policy makers use these data as possible input to make decisions in real time. The Federal Reserve meets once a month and is often making decisions based on data that is an average of four to six weeks out of date.”
Maybe social media doesn’t waste time after all.
Illustrations by Erin Nelson.