Statistics for Hackers

https://speakerdeck.com/jakevdp/statistics-for-hackers

In general:

Computing the Sampling Distribution is Hard.（计算采样分布很难）
Simulating the Sampling Distribution is Easy.（但是通过计算机模拟采样分布却很容易）

Four Recipes for Hacking Statistics:

Direct Simulation. 前提是我们知道数据生成模型
Shuffling.
Bootstrapping.
Cross Validation.

统计学使用的方法是：

我们首先做空假设(null hypothesis)
在这个前提下我们通过计算/模拟来观察结果的显著性(significance)
显著性是以p-value/置信区间(confidence interval)相关为前提的
如果观察是显著的话，那么我们就可以推翻空假设。反之我们就认同空假设。

我们需要区分显著性(significance)和重要性(importance)的差别：

Significance vs. Importance

Suppose that we try a different drug/placebo experiment on 1 million patients and the drug increases life by 5 years and 3 days whereas the placebo increases life by 5 years alone.

This might, because of the large sample size, give a low p-value (thus statistically significant).

But is it important? Do we care? Please ask this question.

另外选取p-value/confidence-interval需要根据情况选择。社会和人为变量的分布，通常比自然变量的分布更广。比如人的资产分布，因为马太效应，通常呈现的是帕累托分布。

Social vs. Natural

Confidence intervals tend to vary more for social/cultural phenomena than natural ones: people’s weights vary by a factor of maybe 20, but incomes can vary by a factor of 1000 or more.

More important: in human affairs, past behavior is a bad predictor of future behavior (e.g. German Mark vs. Dollar in the early 1920s).

See Nassim Taleb’s book: The Black Swan