Statistics for Hackers
https://speakerdeck.com/jakevdp/statistics-for-hackers
In general:
- Computing the Sampling Distribution is Hard.
- Simulating the Sampling Distribution (on a computer) is Easy.
Four Recipes for Hacking Statistics:
- Direct Simulation. This requires that we know the model that generated the data.
- Shuffling.
- Bootstrapping.
- Cross Validation.
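As a minimal sketch of the first recipe, direct simulation, assume a coin that came up heads 22 times in 30 tosses and ask how often a fair coin would do at least that well. The numbers, the function name `simulate_p_value`, and the fair-coin model are illustrative assumptions, not something these notes specify. The exact binomial tail is computed alongside to show that the simulation lands close to the "hard" analytic answer.

```python
import math
import random

def simulate_p_value(n_tosses=30, observed_heads=22, n_trials=50_000, seed=0):
    """Estimate P(heads >= observed_heads) for a fair coin by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        # Data-generating model (assumed): each toss is heads with prob. 0.5.
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if heads >= observed_heads:
            hits += 1
    return hits / n_trials

def exact_p_value(n_tosses=30, observed_heads=22):
    """Exact binomial tail probability, for comparison with the simulation."""
    tail = sum(math.comb(n_tosses, k) for k in range(observed_heads, n_tosses + 1))
    return tail / 2 ** n_tosses

simulated = simulate_p_value()
exact = exact_p_value()
print(f"simulated p-value: {simulated:.4f}")
print(f"exact p-value:     {exact:.4f}")
```

The simulation only needs the ability to generate data under the null model; no knowledge of the binomial distribution is required.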
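The method statistics uses is:
- We first state a null hypothesis.
- Under that assumption, we compute or simulate to assess the significance of the observed result.
- Significance is defined relative to a chosen p-value threshold or confidence interval.
- If the observation is significant, we can reject the null hypothesis. Otherwise we fail to reject it, which is weaker than accepting it as true.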
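The procedure above can be sketched with the shuffling recipe. The two score lists below are made-up illustrative data, and `permutation_p_value` and the 0.05 threshold are hypothetical choices for this sketch. The null hypothesis is that group labels do not matter, so shuffling the pooled scores simulates the distribution of the mean difference under that hypothesis.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """One-sided permutation test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels are exchangeable,
        # so any reshuffling of the pooled data is equally likely.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles

# Hypothetical test scores for two groups (illustrative data only).
group_a = [84, 72, 57, 46, 63, 76, 99, 91]
group_b = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]

observed_diff, p_value = permutation_p_value(group_a, group_b)
alpha = 0.05  # significance threshold, chosen before looking at the data
verdict = "reject" if p_value < alpha else "fail to reject"
print(f"observed difference: {observed_diff:.2f}")
print(f"p-value: {p_value:.3f} -> {verdict} the null hypothesis")
```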
We need to distinguish significance from importance:
Significance vs. Importance
- Suppose we run a drug/placebo experiment on 1 million patients: the drug extends life by 5 years and 3 days, whereas the placebo extends life by exactly 5 years.
- Because of the large sample size, this might give a low p-value (and so be statistically significant).
- But is it important? Do we care? Please ask this question.
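A rough calculation makes the point concrete. The sketch below assumes the 1 million patients are split evenly between drug and placebo and that the outcome has a standard deviation of one year (365 days); both are assumptions added for illustration, as is the helper `two_sample_z`. It uses a plain two-sample z-test with a normal approximation.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided z-test for a difference in means of two equal-sized groups."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # P(|Z| > |z|) for a standard normal Z.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

mean_diff_days = 3.0   # drug adds 3 days over placebo (from the example)
sd_days = 365.0        # assumed spread of the outcome, not given in the notes

z_big, p_big = two_sample_z(mean_diff_days, sd_days, n_per_group=500_000)
z_small, p_small = two_sample_z(mean_diff_days, sd_days, n_per_group=500)
effect_size = mean_diff_days / sd_days  # standardized effect (Cohen's d)

print(f"n = 500,000 per group: z = {z_big:.2f}, p = {p_big:.2g}")
print(f"n = 500 per group:     z = {z_small:.2f}, p = {p_small:.2g}")
print(f"standardized effect size: {effect_size:.4f}")
```

Under these assumptions the same 3-day effect is highly significant with a million patients and undetectable with a thousand, while the effect size stays below one percent of a standard deviation in both cases.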
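The p-value threshold or confidence interval also has to be chosen to suit the situation. Social and human-made variables usually have wider distributions than natural variables. For example, the distribution of personal wealth, because of the Matthew effect, typically follows a Pareto distribution.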
Social vs. Natural
- Confidence intervals tend to vary more for social/cultural phenomena than natural ones: people’s weights vary by a factor of maybe 20, but incomes can vary by a factor of 1000 or more.
- More important: in human affairs, past behavior is a bad predictor of future behavior (e.g. German Mark vs. Dollar in the early 1920s).
- See Nassim Taleb’s book: The Black Swan
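To get a feel for the difference in spread, the sketch below draws one sample from a normal distribution (standing in for body weight) and one from a Pareto distribution (standing in for wealth), then compares how far the largest value sits from the median. The parameters are assumptions chosen for illustration: a mean of 70 and standard deviation of 12 for weight, and a Pareto shape of 1.16, which roughly corresponds to the 80/20 rule.

```python
import random
import statistics

def max_to_median_ratio(values):
    """How many times larger the biggest value is than the typical one."""
    return max(values) / statistics.median(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# "Natural" variable: body weight in kg, modeled as normal (assumed parameters).
weights = [rng.gauss(70.0, 12.0) for _ in range(n)]

# "Social" variable: wealth in arbitrary units, modeled as Pareto (assumed shape).
wealth = [rng.paretovariate(1.16) for _ in range(n)]

natural_ratio = max_to_median_ratio(weights)
social_ratio = max_to_median_ratio(wealth)

print(f"weight: max / median = {natural_ratio:.1f}")
print(f"wealth: max / median = {social_ratio:.1f}")
```

In the normal sample the largest value stays within a small multiple of the median, while in the Pareto sample a single draw can exceed the median by several orders of magnitude, which is why summary statistics from past data can be misleading for heavy-tailed quantities.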