The Unreasonable Effectiveness of Simplicity
A version of this post first appeared on the Newfound Research blog.
In their 2009 paper “The Unreasonable Effectiveness of Data,” Google researchers use the term “physics envy”: the desire of those of us in fields plagued by human behavior to create neat mathematical models. After all, it isn’t gas particles that rush in a disorderly manner to the exit when someone shouts “fire”; only humans do that.
Specifically, the Google paper discusses the applications of models to natural language processing (“NLP”). Linguistic modelers share the same envy as financial modelers; cultural context and shared social experience can lead to highly ambiguous expressions that are still easily understood – by humans, at least.
The authors say, “Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.”
The paper goes on to discuss the benefit that “data at web scale” has been for NLP, especially in relation to machine learning. In particular, there has been significant growth in statistical speech recognition and statistical machine translation. These areas have advanced not because they are simpler than others, but because they are tasks humans perform every day, so the web provides a fertile training set of data for these models. Document classification, on the other hand, hasn’t seen the same benefits.
With the benefit of large data, simple is better. To quote, “But invariably, simple models and a lot of data trump more elaborate models based on less data.”
I am particularly fond of how the authors conclude the paper: “So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.”
Mining Meaning from Data
Can we extend this beyond NLP and assume that domain knowledge will soon be trumped by simple data mining? After all, global data is becoming more abundant, not less (some studies say that it is doubling every two years). In particular – and relevant to this author – can we extend the elegance of large data to investment management?
I believe the answer is no, and my reasoning is that the success of the application of the “more data” approach requires that the solution to our problem is actually in the data itself. In other words, not only does the data hold a lot of details, but there is a sufficient amount of data that can be used to extract the necessary details. When the data holds the details, the trade-off you can make is model complexity versus data quantity.
The underlying assumption here is that “more data” means “more relevant data.” For the data to be relevant, the system we are modeling has to be fairly stable; the near future has to look like the past. While language is constantly evolving, the core of the corpus changes at a glacial pace. For example, the first e-mail ever sent by Ray Tomlinson in 1971 is going to be just as relevant as any e-mail written today for an NLP algorithm to learn from.
When More Isn’t More
In my opinion, financial markets and weather share a lot in common. When it comes to weather data, it could be argued that there are quite a lot of details: temperature, pressure, humidity, precipitation, and so on. So why are we so bad at predicting weather evolution? This is particularly puzzling given that we know the mathematical laws (fluid dynamics) that govern weather patterns. If we know the model with certainty, and we have plenty of data with plenty of detail, why can’t my weatherman accurately tell me whether it’s going to rain on January 8, 2014 or not?
The problem is that weather demonstrates chaotic behavior, in the mathematical sense. Most relevantly, while entirely deterministic, weather is nonlinear and therefore incredibly sensitive to initial conditions. To quote mathematician and meteorologist Edward Lorenz, “the present determines the future, but the approximate present does not approximately determine the future.” This is more commonly known as the butterfly effect: the initial uncertainty in our system is very, very small compared to the uncertainty in the system after a period of time.
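The sensitivity described above shows up in systems far simpler than the atmosphere. The sketch below is not a weather model. It assumes the logistic map x → r·x·(1 − x) with r = 4 as a stand-in for a deterministic, nonlinear system, and the function names are purely illustrative. It iterates two starting points that differ by one part in a billion and tracks the gap between them.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), returning all visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0, eps, steps, r=4.0):
    """Absolute gap at each step between trajectories started at x0 and x0 + eps."""
    a = logistic_trajectory(x0, steps, r)
    b = logistic_trajectory(x0 + eps, steps, r)
    return [abs(u - v) for u, v in zip(a, b)]


if __name__ == "__main__":
    # Assumed starting point and perturbation, chosen only for illustration.
    gaps = divergence(x0=0.2, eps=1e-9, steps=60)
    for step in (0, 10, 20, 30, 40, 50, 60):
        print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

Both trajectories follow exactly the same rule, yet within a few dozen steps the gap should grow from one part in a billion to the same order of magnitude as the state itself. At that point, knowing the "approximate present" says very little about where the system is.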
This ultimately means that drawing on a large set of historical samples where initial weather conditions are similar may provide little to no insight into how the weather will evolve. More is not necessarily more.
In financial markets, one could argue that the aggregate beliefs of investors are ultimately translated to prices, in which case the data holds a lot of detail. However, I believe that financial markets are both nondeterministic and nonlinear. So, not only does the near future become incredibly uncertain very quickly, but I believe that the unpredictability and irrationality of market participants mean that for identical initial conditions, different decisions may be made. This makes historical samples almost entirely useless. As a simple example, consider how relevant market data from 2006 was to decision making in October 2008.
The Unreasonable Effectiveness of Simplicity
In my opinion, building tactical portfolios from a handful of asset-class-level ETFs shares a lot in common with trying to pick the winner of Wimbledon (or, really, the bracket); managing risk is like trying to identify avalanches. Neither “more data” nor complex models provide sufficient solutions to these problems. Rather, it is simple tallying methods that reign supreme over optimized weightings and complex algorithms.
As another example, in his “The Dog and the Frisbee” presentation at the Federal Reserve’s August 2012 economic policy symposium, Bank of England Executive Director Andrew Haldane demonstrated that a GARCH(1,1) model had a lower mean-squared prediction error than a GARCH(3,3) model until over 100,000 data points were used for fitting, despite the fact that a GARCH(3,3) model was used to generate the data itself! The simpler model was able to reduce the error in its output by limiting the amount of error that could be introduced by its inputs.
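For reference, the conditional variance equation of a GARCH(p, q) model is commonly written as follows. The notation is the usual textbook one, not taken from Haldane's presentation.

```latex
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \, \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \, \sigma_{t-j}^2
```

Under this form the variance equation has 1 + p + q parameters to estimate: three for GARCH(1,1) versus seven for GARCH(3,3). Every additional parameter is another input that has to be estimated from noisy data.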
Our approach to investment management is driven by “simple.” We believe in simple models built on simple theories (e.g., momentum) combined with simple rules (e.g., tallying and bucketing). In uncertain environments, simple heuristics tend to be more robust than complex decision rules. And markets are very, very uncertain.
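To make "tallying and bucketing" concrete, here is a minimal, hypothetical sketch. It is not a description of our actual models. The lookback windows, the one-vote-per-window rule, the bucket thresholds, and all function names are assumptions chosen only for illustration.

```python
def momentum_tally(prices, lookbacks=(21, 63, 126, 252)):
    """Count how many lookback windows show a positive trailing return.

    Each window gets exactly one vote; no window is weighted or optimized.
    Windows longer than the available history are skipped.
    """
    votes = 0
    for n in lookbacks:
        if len(prices) > n and prices[-1] > prices[-1 - n]:
            votes += 1
    return votes


def bucket(votes, n_signals):
    """Map a vote count to a coarse allocation bucket (illustrative thresholds)."""
    share = votes / n_signals
    if share >= 0.75:
        return "full"
    if share >= 0.5:
        return "half"
    return "out"


if __name__ == "__main__":
    # Synthetic prices, for illustration only: a steady riser and a steady faller.
    rising = [100.0 + 0.1 * i for i in range(300)]
    falling = [100.0 - 0.1 * i for i in range(300)]
    for name, series in (("rising", rising), ("falling", falling)):
        v = momentum_tally(series)
        print(name, v, bucket(v, 4))
```

The design choice is that nothing is fitted. Because there are no weights to estimate, there are none to overfit, and the output moves only when a signal actually changes sign.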
As quants, we should always be looking towards engineering, mathematical, and modeling advances in other fields. However, we must always look through the lens of our domain knowledge to determine their applicability. In my opinion, “big data” does not bring much to the table when it comes to tactical – or strategic – portfolio construction.