The Two Cultures paper: a must-read to understand the maths of data science
Introduction
In a previous post (Why is machine learning challenging for some engineers?) we had an interesting discussion on the idea of statistics vs machine learning. Thanks for your comments.
My goal is to understand the maths of machine learning. As a teacher and a perpetual learner of data science concepts, I find that everyone is traditionally taught that statistics is the foundation of machine learning. The answer, as I have been discussing here, is not that simple.
The Two Cultures paper
"The Two Cultures: Statistics and Machine Learning, Data Mining, and Data Science" by Leo Breiman, published in Statistical Science in 2001, explores the philosophical and methodological differences between traditional statistics and the emerging fields of machine learning and data mining.
Some initial comments
Two sections from this paper (emphasis mine):
Data models are rarely used in this (algorithmic modelling / machine learning) community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.
The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data. Unfortunately, our field has a vested interest in data models, come hell or high water.
To summarise the ideas from the Two Cultures paper: Breiman contrasts the data-modelling culture, which assumes a stochastic model for how the data were generated, with the algorithmic-modelling culture, which treats the data mechanism as unknown and judges a model by its predictive accuracy on the problem at hand.
I will conclude with a comment from Andrew Ng, which emphasises the same ideas (emphasis mine):
One reason for machine learning’s success is that our field welcomes a wide range of work. I can’t think of even one example where someone developed what they called a machine learning algorithm and senior members of our community criticized it saying, “that’s not machine learning!” Indeed, linear regression using a least-squares cost function was used by mathematicians Legendre and Gauss in the early 1800s — long before the invention of computers — yet machine learning has embraced these algorithms, and we routinely call them “machine learning” in introductory courses!
In contrast, about 20 years ago, I saw statistics departments at a number of universities look at developments in machine learning and say, “that’s not really statistics.” This is one reason why machine learning grew much more in computer science than statistics departments. (Fortunately, since then, most statistics departments have become much more open to machine learning.)
Comments
Investor and finance professor who is passionate about AI, Machine Learning, and futurism. All posts I made here represent my personal view, not the view of my past, current, or future employer.
I am in the process of writing up a paper on the differences and similarities of econometrics and machine learning, so this is timely. :) I think there is a lot of potential for ML in science in the future. Statistics will continue to develop and be applied in different ways than ML. Is there a chance that ML could take over the work done by statistics? Not likely, due to the data requirements of ML. However, there is a chance that in the future we might not need statistics for most experiments, as ML can create simulations so effectively that experiments are not needed. Parametric models will continue to be important, but they won't be the dominant force in the world of ML. BTW, most ML educators are statisticians. The maths for ML is made unnecessarily complex by some in the ML research field. Statisticians can step in and simplify the maths. :)
Hung (Leo) Chan, welcome thoughts on this post.
CTO
Do you think it would be a problem if we swing too far the other way and employ overcomplicated large deep neural networks for cases where a simple model would do (say, a straight line)? It would be less efficient, but more than that, don't we lose something significant if we don't find an underlying model?
Database Manager at Mount Sinai, New York
Well written. I'm going to share this with my team tomorrow.
Executive | Board | Risk & Data | AI & ML |Cybersecurity | Cryptography
Well said. To Leo.