With business analytics, the toughest challenge is collecting the data needed to answer the questions that must be answered. My emphasis here is on:
– must-have answers, not
– desired answers!
This is the third post in a series of entries about big data.
New techniques will not do
Often, we focus on predicting or forecasting the future. However, in management it is more important to understand the analytic HOWs and WHYs. These matter more than the promise of prediction. In the past we did not call this predictive analytics, but forecasting. We
– used time series, as economists still often do, and
– tried our luck with multivariate analysis
(both part of what is called parametric statistics). These days, we still use the above methods. However, new ones have come to the fore, such as:
– k-means clusters, and
– random graphs.
K-means clustering partitions a data set into k groups; choosing the number of clusters, k, for different data sets is a problem in its own right (see also Pham, Dimov and Nguyen, 2005). The procedure is a simple way to classify a given data set into a certain number of clusters (assume k clusters) that is fixed a priori. The main idea is to define k centroids, one for each cluster.
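To make the centroid idea concrete, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to seed the centroids with the first k points are assumptions made for this illustration; real implementations usually pick starting centroids at random or with smarter seeding.

```python
def kmeans(points, k, max_iterations=100):
    """Plain k-means: returns (centroids, labels) for a list of equal-length tuples."""
    # Assumption for reproducibility: seed the centroids with the first k points.
    centroids = list(points[:k])
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # nothing moved, so the loop has converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups, listed alternately.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that k is an input here, not an output: the algorithm will happily return k clusters whether or not the data contain k meaningful groups, which is why the choice of k deserves scrutiny.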
Another increasingly used tool is random graphs. A random graph is obtained by randomly sampling from a collection of graphs. This collection may be characterized by certain graph parameters with fixed values (Fairchild and Fries, January 26, 2012).
K-means clusters and random graphs can be helpful in making predictions, but this may require that we play with petabytes of data.
The question is whether we handle these properly.
For instance, predicting the future can sometimes make people decide or behave in certain ways that do not appear helpful (see Ariely, 2009). Nevertheless, such models or using predictive analytics can help change how organizations think about issues (e.g., renewable resources). In turn, this can result in learning from mistakes.
This animation of the random graph method by a researcher from the Swiss Federal Institute of Technology shows the evolution of the G(n,p) (Erdős-Rényi) random graph as its density 'p' is gradually increased. Phase transitions for trees of increasing order, followed by the emergence of the giant component, can be observed. The animation stops when the graph becomes connected on average.
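For readers who want to reproduce the effect without the animation, the following is a small sketch in plain Python. It samples a G(n,p) graph by including each possible edge independently with probability p, then measures the largest connected component. The function names, the graph size n = 200, and the chosen average degrees are assumptions for illustration only.

```python
import random


def gnp_random_graph(n, p, seed=None):
    """Sample G(n, p): each possible edge is included independently with probability p."""
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return adjacency


def largest_component_size(adjacency):
    """Size of the largest connected component, found by depth-first search."""
    seen = set()
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


n = 200  # hypothetical graph size
for average_degree in (0.5, 1.0, 2.0, 8.0):
    p = average_degree / (n - 1)
    graph = gnp_random_graph(n, p, seed=42)
    print(average_degree, largest_component_size(graph))
```

In theory the giant component appears once the average degree passes 1, and the graph tends to become connected once p is around ln(n)/n. A single sample on a small graph will only show this roughly, so treat the printed numbers as a rough illustration and not as a measurement.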
You must learn from mistakes
So now the question is, if I can handle petabytes of data, how useful is my model for predicting the vote on Scottish independence? For instance, last week two opinion polls indicated that the vote on whether to break up the 307-year-old union with England was evenly split. In fact, a YouGov poll released Wednesday, September 10, 2014 gave the No campaign a narrow lead (similar to the numbers obtained two months ago).
Anybody want to guess which campaign will win the Scottish referendum? Heads or tails, anyone? My guess is that the Scots will go for a 'No' – what you tell a pollster is one thing, but it is quite different once you are behind a curtain putting your vote on paper. What do you think?
Of course, it is better to use predictive analytics to tell us who will win this referendum. But Kevin Dugan’s statement seems a bit incomplete when he says:
"…the key is not just measuring…it's measuring success."
Sure, it will be interesting to find out how accurate our predictions are or if a certain key driver (e.g., personalised phone service) helps increase sales. To illustrate, did the banks lead the business-sector drive for a Scottish 'No', support the 'Yes' campaign, or hurt their campaign's chances?
"…let them relocate to England, good riddance… we already paid too much for the Royal Bank of Scotland's failure in risk management."
Nevertheless, it is also important to understand why a key driver may have failed to produce the desired outcomes. To illustrate, Prime Minister David Cameron has been urging business leaders to speak out against the referendum for months. Ed Miliband, the leader of the Labour Party, has also given speeches against the referendum. Whether their involvement helped or hindered the 'No' campaign is an open question. Of course, we want to better understand how such things may affect people's vote, and in turn, determine why pollsters could be off the mark.
The true insight comes from learning why some predictions fail, while others come true. The same applies to key performance indicators (KPIs) or key drivers. Why do some work and others do not? The WHY is what matters here.
Often the reason can be that the assumptions made were incorrect. For instance, just because people search, right after a new product release, for answers about why the iOS operating system is slowing down does not necessarily mean that it is.
As well, flu infections tend to go up during the winter. However, the increased number of search queries cannot be used as an accurate predictor of how the epidemic is spreading. In fact, numbers from the National Institute of Public Health indicate that this is decidedly inaccurate (see Gattiker, August 17, 2014).
Unfortunately, prediction may become a desired destination, instead of the introspective journey (Schrage, September 3, 2014). Hence, it is not necessarily the big data issue that matters.
PS: 55.3% voted no in the Scottish referendum with an 84.6% turnout for this ballot. Similarly, the pollsters also got it wrong for the Swedish election (see reader comments below).
Our focus should be on improving analytical insight and discovery about what we need to know.
Accordingly, it is also not necessarily smart to believe that "…the best way to predict the future is to learn from failed predictive analytics" (Schrage, September 3, 2014). The section below addresses this issue in more detail.
Fine-tuning fails if your model is a dud
If the above (learn from your past mistakes) applied most of the time, the algorithm used for Google Flu Trends (GFT) would eventually work. But it starts off with the wrong assumptions, namely that search queries are the result of the spread of the flu.
However, news coverage does affect search queries – if coverage is extensive, searches go up.
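A tiny simulation can show why fine-tuning does not rescue a model built on the wrong assumption. Everything below is synthetic: the weekly numbers, the coefficients and the variable names are invented for illustration and are not taken from Google Flu Trends. The sketch assumes that searches respond to both illness and news coverage, while the model assumes that searches reflect illness alone.

```python
def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin: y is roughly slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic weekly series (arbitrary units); none of these are real data.
flu_cases = [10, 20, 40, 80, 40, 20, 10, 20, 40, 20]
news_coverage = [0, 0, 0, 0, 0, 0, 0, 30, 60, 30]

# Assumed data-generating process: searches react to illness AND to news.
searches = [2 * f + 3 * n for f, n in zip(flu_cases, news_coverage)]

# Calibrate "searches -> flu cases" on the first seven weeks, when news was quiet.
cases_per_search = fit_slope(searches[:7], flu_cases[:7])

# Apply the calibrated model to the three weeks with heavy news coverage.
predicted = [cases_per_search * s for s in searches[7:]]
actual = flu_cases[7:]

for week, (p, a) in enumerate(zip(predicted, actual), start=8):
    print(f"week {week}: predicted {p:.0f} cases, actual {a}")
```

The model fits the quiet weeks perfectly, yet it overshoots as soon as coverage picks up, because the extra searches come from the news and not from the flu. Re-estimating the slope on more data of the same kind would not fix this; the missing variable would.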
Sometimes, smaller data may tell us more about what is happening than big data sets. Of course, this requires that the types of data we collect and analyse are based on a sound theoretical framework. But sometimes such things are unclear or shrouded in secrecy. Believing is fine, but verifying how a study was conducted is certainly the sounder approach (e.g., see your Klout score – hard to trust without knowing how it works).
For instance, in one study the authors looked at corporate news announcements, the timing of which is in the hands of the CEO (Edmans, Goncalves-Pinto, Wang and Xu, August 29, 2014), excluding those news releases that are non-discretionary, such as earnings and regulatory statements. Looking at 166,000 news releases, the authors adjusted the number for news linked to an annual meeting or board meetings, as well as for trade fairs and events that prompt predictable bursts of news (e.g., possible hostile takeover attempt by a competitor).
Their findings indicate that CEOs hoard good news for when they plan to sell shares, bringing forward press releases about positive news (e.g., product launches, new clients or special dividends) to the months in which they can sell shares from their stock grants. In turn, they benefit from the good news, which triggers a short-term rise in the price and trading volume. This is done by managing the timing of discretionary releases, shortly before they want to sell.
What is important here is not the size of the data or the processing power of the PC used to analyse them. Instead, it is that the findings raise interesting questions about CEOs' ethics and codes of conduct, which seem to be ignored when their own pocket is affected.
The authors found there were two percent more discretionary news releases in months when CEOs' equity vested than in non-vesting months. This was five percent higher than in the months before. Most interesting is that the higher the percentage, the greater the value of the shares the chief executive could sell.
Ready to learn from these data?
If current models in economics were 'perfect', many economists would have predicted the 2008 financial crisis with great accuracy. They did not. Regulation does help, but cannot always make wrongs right. To illustrate, after the scandals of the early 2000s, the US Securities and Exchange Commission (SEC) implemented several new regulations. Under its Regulation Fair Disclosure, analysts, investors, and the public must receive significant news simultaneously, but this does not dictate what CEOs can do when it comes to timing discretionary releases.
If these discretionary releases are used to raise stock prices to reap additional rewards when vested stocks become sellable, one must wonder about the CEO's code of conduct. It is the duty of boards of directors to scrutinize chief executives in years or months when they have a lot of equity coming their way. Regulation may not be able to reduce this risk much, but boards should be able to protect shareholders' interests.
The critical issue is to learn from these things. The fact that business is '90 percent against' the secession of Scotland may be one thing. But going public about it may backfire and cause the 'No' campaign more harm than good. Nevertheless, when to engage in a debate as a business leader is an important issue, as is the concern of reducing the risk of CEOs timing discretionary releases to their personal advantage.
Learning from data regarding the Scottish referendum or discretionary releases makes sense. But if the data are based on models that are incorrect, or findings cannot be repeated, there is little value to be gained.
What is your opinion?
Commentators have stated that immigration will dominate the 2014 Swedish election (Sunday, September 14). Populists snubbed by other politicians could hold the balance of power afterward, but this is an opinion only, and until the final vote is counted… I rest my case.
What kind of data have helped you gain insights in your work? What kind of big data sets does your employer use? What #bigfail involving big data do you know about? Thanks again for sharing your insights – I always appreciate your very helpful feedback.
Ariely, Dan (2009). Predictably irrational. The hidden forces that shape our decisions. New York, NY: Harper-Collins.
Edmans, Alex; Goncalves-Pinto, Luis; Wang, Yanbo; Xu, Moqi (August 29, 2014). Strategic news releases in equity vesting months. London Business School: Working paper. Retrieved September 8, 2014, from http://ssrn.com/abstract=2489152
Gattiker, Urs E. (August 17, 2014). Secrets of analytics 1: UPS or Apple? Retrieved September 8, 2014, from http://blog.drkpi.com/big-data-2
Fairchild, Geoffrey, Fries, Jason (January 26, 2012). Lecture notes. Social networks: Models, algorithms, and applications. Retrieved September 8, 2014, from http://homepage.cs.uiowa.edu/~sriram/196/spring12/lectureNotes/Lecture4.pdf
No author. Class material – random graphs (July 3, 2009). Cornell University, Computer Science. Retrieved September 8, 2014, from http://www.cs.cornell.edu/courses/cs4850/2010sp/Course Notes/Random-graphs-from-jeh-Feb-06-2010.pdf
Pham, D.T., Dimov, S. S., Nguyen, C.C. (2005). Selection of K in K-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219. DOI: 10.1243/095440605X82982014 Retrieved August 31, 2014, from http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
Schrage, Michael (September 3, 2014). Learn from your analytics failures. Harvard Business Review – Blog Network. Retrieved September 4, 2014, from http://blogs.hbr.org/2014/09/learn-from-your-analytics-failures