What is this?
This is my documentation of the statistical techniques that I’ve worked on and plan to work on in the future. There are many other data science skills I enjoy: deep learning, data pipeline engineering, visualization. But I won’t showcase those here; this page is purely about statistical methods.
The textbook/course Statistical Rethinking is a fun way to unlearn STAT 101. McElreath is such a great educator that I’m making his book the backbone of this project: adopting his chapter structure, code, and fiery attitude towards what constitutes good science.
Why Statistics?
Statistics are powerful! They helped the Oakland A’s win just as many games as the Yankees with a third of the budget. Your insurance, bank loan, retirement savings, car, and smartphone all took a lot of human ingenuity, but that was not enough: they also required lots of statistical work to be engineered, planned, and fine-tuned. Statistics are a crucial element of scientific progress. The vast majority of academic papers, from sociology to physics, rely on similar statistical tools to validate their claims.
Some common objections to using statistics:
Surely machine learning creates more accurate predictions than traditional statistical models?
Machine learning techniques are often far more powerful predictors than traditional statistics. But they have their downsides: they require lots of data, they require lots of computation, and worst of all, they’re black boxes. They don’t tell you why they predict what they predict, or what causes what (at least not yet). This is why scientific papers use traditional statistics to validate their claims.
Businesses say they want data processing, but what they really want is system automation (e.g. invoices need to be categorized and sent to accounting).
Yes, businesses need lots of automation, but they also need help making decisions. Who to hire, who to let go, what product to sell, and at what price, to name a few. These decisions sit in a gray zone: they require some human judgment but would benefit from computer aid. Statistics are the best set of tools for many of these problems. They allow both humans and computers to add their judgment to an analysis, getting a better result than either would on its own.
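One way to see humans and computers each contributing judgment is a Bayesian update, the workhorse of Statistical Rethinking: a human encodes a belief as a prior, and the data pull the estimate toward what was observed. The scenario and every number below are made up purely for illustration.

```python
# A minimal Beta-Binomial sketch of blending human judgment with data.
# All numbers are hypothetical.

# Human judgment: an analyst believes a new product converts around 30%,
# encoded as a weak Beta(3, 7) prior (prior mean 0.3).
prior_a, prior_b = 3, 7

# The data's contribution: 42 conversions observed in 100 visits.
conversions, visits = 42, 100

# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
post_a = prior_a + conversions
post_b = prior_b + (visits - conversions)

posterior_mean = post_a / (post_a + post_b)
print(f"Prior mean:     {prior_a / (prior_a + prior_b):.3f}")  # 0.300
print(f"Data rate:      {conversions / visits:.3f}")           # 0.420
print(f"Posterior mean: {posterior_mean:.3f}")                 # 0.409
```

The posterior mean lands between the analyst’s prior and the raw data rate, weighted by how much evidence each side brings.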
Visualizations are more intuitive and persuasive to audiences. Our World in Data has changed more minds than dull economic statistics.
I love visualizations and try to use them as often as I can. But they’re severely constrained in what they can accomplish. You won’t be able to easily show how mountain terrain, soil fertility, precipitation, and historical wealth all help predict countries’ GDP without showing your audience half a dozen maps. Statistical models let us see exactly how each of those factors is associated with GDP.
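To sketch that point: a single multiple regression reports every factor’s association with the outcome at once, where maps would need one panel per variable. The data below are simulated, with variable names borrowed from the terrain/soil/precipitation example above; none of this is real country data.

```python
# A hedged sketch: recovering several factors' associations with a
# (simulated) GDP outcome in one ordinary least squares fit.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical standardized country-level predictors.
terrain = rng.normal(size=n)       # mountain terrain ruggedness
soil = rng.normal(size=n)          # soil fertility
rainfall = rng.normal(size=n)      # precipitation
hist_wealth = rng.normal(size=n)   # historical wealth

# Simulated log-GDP built from known coefficients plus noise.
gdp = (1.0 - 0.5 * terrain + 0.3 * soil + 0.2 * rainfall
       + 0.8 * hist_wealth + rng.normal(scale=0.5, size=n))

# One least-squares fit estimates all associations jointly.
X = np.column_stack([np.ones(n), terrain, soil, rainfall, hist_wealth])
coef, *_ = np.linalg.lstsq(X, gdp, rcond=None)

names = ["intercept", "terrain", "soil", "rainfall", "hist_wealth"]
for name, b in zip(names, coef):
    print(f"{name:>12}: {b:+.2f}")
```

Each printed coefficient approximates the corresponding true value used in the simulation, which is exactly the “see how each factor is associated” claim in miniature.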
Index
(1) Linear Regression 📈
- How can we use one variable (e.g. Education) to predict another (e.g. Salary)? Doctorates love this one weird trick, which is correlated with scientific progress (\(R^2\) = 0.5)
(2) Multi Regression 📊
- You’re worried that the correlation of ice cream and drownings might be because it’s hot outside
(3)️ Causality 🔀
- Oops you can’t just keep adding things to the regression 🫣
(4)️ Model Metrics 🎯
- AIC, LOO, Occam’s razor, Oh my!
(5) Interactions 🎨
- Covariate effects conditional on other covariate effects!
References
Source code is on GitHub.