November 29, 2012
The data accumulating in databases today offer amazing opportunities for new insight. However, two things worry me: the problem of accepting consumer commentary at face value and the problem of garbage in, garbage out. Both of these issues have to be dealt with, particularly when analyzing social commentary.
Millward Brown’s foray into this arena is to mine the entire Twitter stream to identify serial brand commentators, and then parse their tweets to summarize subject, tone, location and sphere of influence. Implemented by the Emerging Media Lab in the U.S., the resulting Verve Index provides companies with a way to compare and contrast their standing in the world of social media.
I was surprised, given the sheer quantity of data involved, that data handling is not the major issue. By renting computing capacity from Amazon’s Elastic Compute Cloud (Amazon EC2), the team can quickly scale to handle changes in volume around major events like the Super Bowl.
Far more of an issue is the amount of spam found in the Twitter stream. Though social media measurement is widely accepted as an accurate census of organic online conversation, much of what is being captured is actually mass-produced by non-human sources.
As Anne Czernek describes in this Point of View on the topic, a recent Millward Brown audit of Twitter data across about 60 brands found that as much as 60 percent of all brand data captured is spam. While most of it is fairly innocuous – the “click here to get a coupon” variety – it does change the overall tenor of the commentary. All of which means that while natural language processing and Bayesian rules help with data cleaning, at some point a human needs to step in and judge whether the commentary is worth attending to or not.
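To make the idea of Bayesian data cleaning concrete, here is a minimal sketch of a naive Bayes spam filter that hands uncertain tweets to a person. It is an illustration only, not the method behind the Verve Index: the class name, the `triage` helper, the 0.2 and 0.8 thresholds and the six training tweets are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy naive Bayes classifier (hypothetical; must see both labels before scoring)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in vocab]
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            for token in tokens:
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize in a numerically stable way to get P(spam | tweet).
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


def triage(spam_filter, text, low=0.2, high=0.8):
    """Auto-label confident cases; send everything in between to a person."""
    p = spam_filter.spam_probability(text)
    if p >= high:
        return "spam"
    if p <= low:
        return "keep"
    return "human review"


# Invented training tweets, purely for illustration.
spam_filter = NaiveBayesSpamFilter()
for tweet in ("click here to get a coupon",
              "free coupon click this link now",
              "win a free gift card click here"):
    spam_filter.train(tweet, "spam")
for tweet in ("loved the new ad during the game",
              "the coffee at this store is great",
              "my new phone battery died again"):
    spam_filter.train(tweet, "ham")

print(triage(spam_filter, "click here for a free coupon"))
print(triage(spam_filter, "the new ad was great"))
print(triage(spam_filter, "get this new phone"))
```

The middle band between the two thresholds is where the human judgment described above comes in: the wider you make it, the more tweets a person has to read, and the fewer mistakes the filter makes on its own.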
Analyzing social commentary provides a rich source of stimulus to prompt ideas and hypotheses for further analysis. This is particularly true when the commentary is analyzed in parallel with traditional survey data sets, to understand its interaction with other elements of the brand experience and marketing campaigns. But the value of such analysis will only be recognized if we know that the issue of garbage in, garbage out has been properly addressed.
In the early days of the Internet, I remember many people in start-ups stating something along the lines of “electrons are free.” These days it seems that the assumption is “data is free.” But even if the data is really free, there are hidden costs attached to managing and analyzing big data. Ensuring that the data being analyzed is meaningful is just one of these hidden costs. The biggest hidden cost, though, is making the wrong judgment based on the data. Without an emphasis on data quality and cleaning, the chances of incorrect assessment and interpretation become significant.
Verve is just one way to harness big data alongside traditional research techniques. How else do you believe research will adapt to the world of big data? And how big an issue do you expect data quality to be? Please share your thoughts.