Blurrt

Can social data replace the polls?

December 17, 2016

The first known example of an opinion poll dates from 1824: a local straw poll in Pennsylvania on the US presidential election. In 1916, The Literary Digest embarked on a national survey (partly as a circulation-raising exercise) and correctly predicted Woodrow Wilson's election as president. Mailing out millions of postcards and simply counting the returns, The Literary Digest went on to correctly predict the victories of Warren Harding in 1920, Calvin Coolidge in 1924, Herbert Hoover in 1928 and Franklin Roosevelt in 1932. Of course polling has come a long way since those early days, but one inherent feature still remains: human surveys, usually conducted over the telephone.

However, it's now becoming a familiar refrain: the polls got it wrong (again)! The 2015 UK general election, Brexit and, the most high profile of them all, the 2016 US presidential election. Sites that used the most advanced aggregating and analytical modelling techniques available put Clinton's chances at what now look like silly odds: the New York Times had her probability of winning at 84% and the Princeton Election Consortium had her at 95-99%. Even Nate Silver, who has a pretty impressive track record, gave Clinton a 71% chance of winning on the eve of the election.

So what's going wrong? There are a number of trends driving the unreliability of election and other polling. The first is mobiles. Before mobiles, the ubiquity of landline telephones made finding reasonably random and representative samples easy: pollsters could pick random names out of phone books, call potential voters and talk them through interviews, which supplied the kind of rich context and human understanding needed to properly analyse their responses. That method also ensured reasonably high response rates and helped control non-response bias. But the rise of mobiles, and the demographic differences in their adoption, mean that random samples of landlines have become increasingly inadequate. The problem with moving to mobiles, or even attempting a hybrid approach, is that mobile numbers are not usually publicly listed, making it harder to find representative samples. Various online survey methods have been used to supplement or supplant the more expensive phone methods, but they often suffer from bias of their own and are generally considered of lower quality.

The second factor is the decline in people willing to answer surveys. Telephone surveys in the US in the late 1970s achieved an 80 percent response rate. Enter voicemail, mobiles, the decline of landlines and people generally not answering, and by 1997 response rates were down to 36 percent; the decline has since accelerated. By 2014 the response rate had fallen to 8 percent.

You don't need a PhD in maths or statistics to know that sample size matters for accuracy. Put simply: poll more people and your errors go down (the rough calculation sketched below shows how quickly). If you're struggling to reach enough people, errors increase, and reaching more people takes more time and therefore more money. These two factors have made high-quality research much more expensive to do, so there is less of it. To top it off, a perennial election polling problem, how to identify "likely voters", has become even thornier. Consequently election polling is in near crisis.
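
To put rough numbers on that relationship, here is a quick sketch in Python of the textbook margin-of-error calculation for a simple random sample at 95% confidence. It is purely illustrative and ignores the weighting and non-response corrections real pollsters have to make.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample.

    Uses the textbook formula z * sqrt(p * (1 - p) / n); real polls also
    need weighting and non-response corrections, which this ignores.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (500, 1000, 2000, 10000):
    print(f"n = {n:>6}: +/- {margin_of_error(n) * 100:.1f} points")
# n =    500: +/- 4.4 points
# n =   1000: +/- 3.1 points
# n =   2000: +/- 2.2 points
# n =  10000: +/- 1.0 points
```

Note the diminishing returns: quadrupling the sample only halves the error, which is exactly why shrinking response rates make accurate polling so much more expensive.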

Can social media data replace polls? Without a doubt, social media has brought about a revolution in communication. Since 2004 its growth has been near exponential. It's no longer the preserve of teenagers and millennials but firmly fixed in everyone's daily lives. So much so that headlines from the recent US election suggest social media won it, from Trump's extensive use of Twitter to push his campaign messages to the influence of 'fake' news and echo chambers. Like never before, people are freely broadcasting their views and sharing other people's views that they support.

The technology is now available to collect and analyse all these social media posts. It's what Blurrt does. Using hand-built search terms, Blurrt collects all the relevant social posts in real time and analyses the text for expressions of sentiment and emotion. The results can be displayed, in real time, using a number of different visualisations and metrics. One such metric we have developed, the Blurrt score, measures the volume of engagement, the sentiment and the strength of that sentiment.
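
The exact formula behind the Blurrt score isn't published here, but a toy sketch in Python shows the idea of combining the three signals mentioned above: volume of engagement, the balance of sentiment and the strength of that sentiment. The class, weights and scaling below are invented purely for illustration; they are not Blurrt's production code.

```python
from dataclasses import dataclass

@dataclass
class ScoredPost:
    sentiment: float   # -1.0 (very negative) .. +1.0 (very positive)
    engagements: int   # e.g. retweets + likes + replies

def illustrative_score(posts):
    """Toy 'Blurrt-style' score combining volume, sentiment and strength.

    The weighting here is made up for this sketch; the real Blurrt score
    is not described in this post.
    """
    if not posts:
        return 0.0
    volume = sum(p.engagements for p in posts)
    balance = sum(p.sentiment for p in posts) / len(posts)        # -1 .. +1
    strength = sum(abs(p.sentiment) for p in posts) / len(posts)  # 0 .. 1
    return volume * balance * (0.5 + 0.5 * strength)

posts = [ScoredPost(0.8, 120), ScoredPost(-0.3, 40), ScoredPost(0.5, 75)]
print(round(illustrative_score(posts), 1))  # positive: engaged and mostly positive
```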

We first entered the fray of politics and polling in 2014, working with LBC on the two live debates between Nigel Farage and Nick Clegg. Using live sentiment graphs, or 'Twitter worms' as they became known, we called the result of the debates seconds after they finished by analysing the reaction on Twitter; a rough sketch of the idea follows below. Over both debates we collected and analysed over 100,000 tweets. Since then we've moved on to bigger things, and they didn't come much bigger than the EU referendum in June 2016. We worked with Twitter UK and the Press Association to build a referendum data hub analysing live tweets related to the referendum. Using hand-crafted search terms, we analysed over 54 million tweets across the six weeks the data hub was live.
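
For the curious, a 'Twitter worm' is essentially a rolling average of per-tweet sentiment plotted live. The sketch below shows one simple way to produce the data points for such a line; the input format and window size are assumptions made for this illustration, not a description of our pipeline.

```python
from collections import deque

def sentiment_worm(scored_tweets, window=500):
    """Yield (timestamp, rolling mean sentiment) points for a live 'worm' line.

    scored_tweets: iterable of (timestamp, sentiment) pairs, sentiment in -1..+1.
    Each point is the mean sentiment of the most recent `window` tweets.
    """
    recent = deque(maxlen=window)
    for timestamp, sentiment in scored_tweets:
        recent.append(sentiment)
        yield timestamp, sum(recent) / len(recent)

# Tiny example stream: (second offset, sentiment score)
stream = [(0, 0.2), (1, -0.4), (2, 0.6), (3, 0.1)]
for t, value in sentiment_worm(stream, window=2):
    print(t, round(value, 2))
```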

Throughout, the trend of the data suggested 'Leave' was ahead. Incredibly, at 1:00am on 23 June our data showed Leave at 57.7% and Remain at 42.3%. At 2:00am the positions were Leave at 51.3% and Remain at 48.7%.

The advantage that Blurrt has is that it isn't confined to an echo chamber. The software doesn't have a timeline like you and I, determined by who we follow and what we friend and like. It collects everything that's relevant, whether left or right wing, from old or young. This means that Blurrt has a 'helicopter' view of the horizon. Unlike the polls, Blurrt doesn't have a problem with sample sizes and response rates: people are freely and openly sharing their views and all we need to do is harvest them. How many polls during the EU referendum spoke to 54 million people? Critics will complain that social isn't a representative sample of the population. Really? Back in the early days maybe, but not any longer. Do you know anyone who doesn't use social?

In many respects, Blurrt is doing what The Literary Digest did back in the early 20th century. Only this time, we aren't mailing postcards out to people; people are writing and mailing their own 'postcards' in vast numbers, every day, on social networks.

A revised version of this blog was also published online by the Huffington Post.