We analyzed Donald Trump's tweets looking for patterns and insight. The lessons we learned were not just about our newly inaugurated President, but also about pitfalls in data analysis and storytelling.
The era of big data gave us data in high quantity, but quantity is meaningless when the quality is low. Like other forecast models, we used polling data, but we also used qualitative reports such as The Cook Political Report and Sabato’s Crystal Ball. We used these reports to smooth out polling response bias, but there were larger issues at hand.
So what went wrong?
On November 1st, we did a simple “what if” alternative analysis. The premise was simple: “What if all the polls were wrong?” We looked at 471 state polls that were published after August 1st. If a state had even a single poll with a Trump lead, we gave that state to Trump. The resulting election map may very well be the most accurate forecast in the United States. (Note: at the time of posting, the election had been called for Trump, but we were awaiting the final tally from several states.)
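The “what if” rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code we ran: the function name `what_if_map`, the tuple layout, and the sample margins are all hypothetical, and a positive margin is assumed to mean a Trump lead.

```python
from collections import defaultdict


def what_if_map(polls):
    """Assign a state to Trump if any single poll showed him ahead.

    polls: iterable of (state, trump_margin) pairs, where a margin
    greater than zero means Trump led in that poll (an assumed layout).
    """
    best_margin = defaultdict(lambda: float("-inf"))
    for state, margin in polls:
        # Keep only the most Trump-favorable poll for each state.
        best_margin[state] = max(best_margin[state], margin)
    return {
        state: ("Trump" if margin > 0 else "Clinton")
        for state, margin in best_margin.items()
    }


# Hypothetical margins, invented purely to show the rule in action.
sample_polls = [
    ("Ohio", -2.0),
    ("Ohio", 1.5),
    ("Vermont", -20.0),
    ("Vermont", -15.0),
]
result = what_if_map(sample_polls)
```

With these made-up numbers, Ohio goes to Trump because one of its two polls shows a Trump lead, while Vermont does not.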
Simply put, data quality determines the quality of your insights. It is up to the pollsters and pundits to decide whether they were wrong about the polls or the voter turnout. It is up to the pundits to determine if reporting on the polling had any impact on voter turnout.
What we can say from experience is that objectively collected data on actions and behavior is far more accurate than first-person self-reporting.
In the end, good data science on low quality data makes for a low quality output.
Evince's Andy Hoagland created an ensemble model to predict the outcome of the U.S. Presidential election. His algorithm eliminates a lot of the volatility seen in other prominent election prognosticators, such as FiveThirtyEight and The Upshot. Here, he explains how, and why, he has been able to accomplish this feat.
Who is the best pollster? Poll aggregators figured out how to tackle this question years ago. Nate Silver put presidential forecasting on the map in 2008, and his name became synonymous with election forecasting after his 51-out-of-51 Electoral College call in 2012.
This presidential election season we have more aggregators using different approaches. Consider this, along with the advances since 2012 in cloud computing, data science, and machine learning. Naturally, we wanted to get a model out there, but there was debate over our approach.
One thing I learned from participating in several Kaggle data science competitions is that there is no silver bullet algorithm, although XGBoost is pretty close. Data prep, feature engineering, and feature selection are key. A single model put me in 18th place out of 2,257. The winner of that competition took their model one step (actually, several steps) forward.
When you look at data science competitions on Kaggle, the winners all seem to have one thing in common: ensemble models. Take your best model, then mix it with another that uses a slightly different methodology. For the past several years, each winning solution has used some variation of ensemble modeling. We decided to do the same. That is how we came up with the Aggregators Ensemble.
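To make the idea of mixing models concrete, here is a minimal sketch of one common ensembling approach, a weighted average of each model's probability for the same outcome. The function name, the weights, and the input probabilities are assumptions for illustration; this does not describe the actual blend used in the Aggregators Ensemble.

```python
def ensemble_probability(predictions, weights=None):
    """Blend several models' probabilities for the same event.

    predictions: list of probabilities (0 to 1), one per model.
    weights: optional list of non-negative weights; equal weights
    are assumed when none are given.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("weights and predictions must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Weighted mean of the individual model probabilities.
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


# Hypothetical inputs: two models disagree about one state.
equal_blend = ensemble_probability([0.6, 0.8])
weighted_blend = ensemble_probability([0.6, 0.8], weights=[3, 1])
```

Weighting lets a model you trust more pull the blended estimate toward its own number, while the other models still temper it.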
Not only did we create our own model to predict the Electoral College outcome of each state (and D.C.), but we also included poll aggregators, prediction markets, and qualitative experts. Once we had our state probabilities, we ran over 20,000 simulations to see which candidate crossed the 270-electoral-vote threshold to become the next President of the United States.
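The simulation step can be sketched as a simple Monte Carlo loop. This is an illustration under stated assumptions, not our production code: the function name `simulate_win_rate` and the two-state toy map are hypothetical, and each state is drawn independently, which ignores the correlation between states that a real forecast would likely need to handle.

```python
import random


def simulate_win_rate(state_probs, electoral_votes,
                      n_sims=20_000, threshold=270, seed=0):
    """Estimate how often a candidate reaches the electoral vote threshold.

    state_probs: state -> probability that the candidate wins it.
    electoral_votes: state -> number of electoral votes.
    States are drawn independently (a simplifying assumption).
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    wins = 0
    for _ in range(n_sims):
        # Sum the electoral votes of every state won in this simulated election.
        votes = sum(
            ev for state, ev in electoral_votes.items()
            if rng.random() < state_probs[state]
        )
        if votes >= threshold:
            wins += 1
    return wins / n_sims


# Hypothetical two-region map that still totals 538 electoral votes.
toy_votes = {"A": 300, "B": 238}
toy_probs = {"A": 0.55, "B": 0.40}
win_rate = simulate_win_rate(toy_probs, toy_votes)
```

The returned value is the share of simulated elections in which the candidate reached the threshold, which is how a set of state probabilities turns into a single national win probability.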
Check out the interactive for yourself: http://tabsoft.co/2dByh94