Bayesian Methods for Hackers
Table of Contents
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
这篇文章主要讲的是贝叶斯方法,但是并不是通过数学公式,而是通过使用PyMC3概率编程。
1. The Bayesian state of mind :ref
贝叶斯推理保留了事情的不确定性(概率)。频率统计学派通过观察事情的出现频次来推断概率,这对于大样本事件没有问题,但是对于小样本事件就有很大问题。 对于小样本事件,更好的方式是,通过对影响这些小样本事件的变量进行建模,然后来推断小样本事件出现的概率分布。
Bayesian inference differs from more traditional statistical inference by preserving uncertainty. At first, this sounds like a bad statistical technique. Isn't statistics all about deriving certainty from randomness? To reconcile this, we need to start thinking like Bayesians.
The Bayesian world-view interprets probability as measure of believability in an event, that is, how confident we are in an event occurring. In fact, we will see in a moment that this is the natural interpretation of probability.
For this to be clearer, we consider an alternative interpretation of probability: Frequentist, known as the more classical version of statistics, assume that probability is the long-run frequency of events (hence the bestowed title). For example, the probability of plane accidents under a frequentist philosophy is interpreted as the long-term frequency of plane accidents. This makes logical sense for many probabilities of events, but becomes more difficult to understand when events have no long-term frequency of occurrences. Consider: we often assign probabilities to outcomes of presidential elections, but the election itself only happens once! Frequentists get around this by invoking alternative realities and saying across all these realities, the frequency of occurrences defines the probability.
Bayesians, on the other hand, have a more intuitive approach. Bayesians interpret a probability as measure of belief, or confidence, of an event occurring. Simply, a probability is a summary of an opinion. An individual who assigns a belief of 0 to an event has no confidence that the event will occur; conversely, assigning a belief of 1 implies that the individual is absolutely certain of an event occurring. Beliefs between 0 and 1 allow for weightings of other outcomes. This definition agrees with the probability of a plane accident example, for having observed the frequency of plane accidents, an individual's belief should be equal to that frequency, excluding any outside information. Similarly, under this definition of probability being equal to beliefs, it is meaningful to speak about probabilities (beliefs) of presidential election outcomes: how confident are you candidate A will win?
贝叶斯推理涉及到先验概率分布,也就是在某个事件发生之前,大家对这件事情的概率分布的认识是怎么样的。 当事件出现之后,大家都会更新自己的信念程度。这个信念程度是同样是引人而异的,因为每个人的先验信念程度是不同的。
Notice in the paragraph above, I assigned the belief (probability) measure to an individual, not to Nature. This is very interesting, as this definition leaves room for conflicting beliefs between individuals. Again, this is appropriate for what naturally occurs: different individuals have different beliefs of events occurring, because they possess different information about the world. The existence of different beliefs does not imply that anyone is wrong. Consider the following examples demonstrating the relationship between individual beliefs and probabilities:
- I flip a coin, and we both guess the result. We would both agree, assuming the coin is fair, that the probability of Heads is 1/2. Assume, then, that I peek at the coin. Now I know for certain what the result is: I assign probability 1.0 to either Heads or Tails (whichever it is). Now what is your belief that the coin is Heads? My knowledge of the outcome has not changed the coin's results. Thus we assign different probabilities to the result.
- Your code either has a bug in it or not, but we do not know for certain which is true, though we have a belief about the presence or absence of a bug.
- A medical patient is exhibiting symptoms x, y and z. There are a number of diseases that could be causing all of them, but only a single disease is present. A doctor has beliefs about which disease, but a second doctor may have slightly different beliefs.
This philosophy of treating beliefs as probability is natural to humans. We employ it constantly as we interact with the world and only see partial truths, but gather evidence to form beliefs. Alternatively, you have to be trained to think like a frequentist.
To align ourselves with traditional probability notation, we denote our belief about event A as P(A). We call this quantity the prior probability.
By introducing prior uncertainty about events, we are already admitting that any guess we make is potentially very wrong. After observing data, evidence, or other information, we update our beliefs, and our guess becomes less wrong. This is the alternative side of the prediction coin, where typically we try to be more right.
2. Bayesian Inference in Practice :ref
贝叶斯推断得到是某个事件的概率分布,而频率统计推断则通常得到一个数值。以下面这个“假设我的程序通过了X个test cases的话,那么我们的程序是bug-free"为例。 对于频率统计推断来说非常尴尬,因为只是通过观察事件你只能说是还是否(结合这个例子,你只能说”没有bug"). 但是对于贝叶斯推理来说, 可以考虑程序员水平等变量建模,结合"通过Y test cases, 是否bug-free"的历史数据,最终给出概率。
If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, representing an estimate (typically a summary statistic like the sample average etc.), whereas the Bayesian function would return probabilities.
For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all X tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all X tests; is my code bug-free?" would return something very different: probabilities of YES and NO. The function might return:
YES, with probability 0.8; NO, with probability 0.2
This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences.
3. Modeling approaches :ref
建模的过程主要包括: 1. 选择影响因子 2. 确定影响因子的分布 3. 估计这些分布的参数(这些参数是否也需要作为影响因子) 4. 最终通过这些因子的组合得到最终变量 5. 模拟计算
A good starting thought to Bayesian modeling is to think about how your data might have been generated. Position yourself in an omniscient position, and try to imagine how you would recreate the dataset.
In the last chapter we investigated text message data. We begin by asking how our observations may have been generated:
- We started by thinking "what is the best random variable to describe this count data?" A Poisson random variable is a good candidate because it can represent count data. So we model the number of sms's received as sampled from a Poisson distribution.
- Next, we think, "Ok, assuming sms's are Poisson-distributed, what do I need for the Poisson distribution?" Well, the Poisson distribution has a parameter λ.
- Do we know λ? No. In fact, we have a suspicion that there are two λ values, one for the earlier behaviour and one for the later behaviour. We don't know when the behaviour switches though, but call the switchpoint τ.
- What is a good distribution for the two λs? The exponential is good, as it assigns probabilities to positive real numbers. Well the exponential distribution has a parameter too, call it α.
- Do we know what the parameter α might be? No. At this point, we could continue and assign a distribution to α, but it's better to stop once we reach a set level of ignorance: whereas we have a prior belief about λ, ("it probably changes over time", "it's likely between 10 and 30", etc.), we don't really have any strong beliefs about α. So it's best to stop here.
- What is a good value for α then? We think that the λs are between 10-30, so if we set α really low (which corresponds to larger probability on high values) we are not reflecting our prior well. Similar, a too-high alpha misses our prior belief as well. A good idea for α as to reflect our belief is to set the value so that the mean of λ, given α, is equal to our observed mean. This was shown in the last chapter.
- We have no expert opinion of when τ might have occurred. So we will suppose τ is from a discrete uniform distribution over the entire timespan.
4. Privacy Algorithm :ref
这个例子非常有趣,是如果估计有多少比例的同学作弊了。你当然可以假设大家都诚实的回答,但是这样非常不可靠。但是我们可以通过系统的方法是绕过这个问题, 但是依然得到比较比较好的效果:让每个同学flip coin, 如果是head那么他们必须诚实回答,如果是tail的话那么他们继续flip coin. 如果head那么回答"cheat", 否则回答"no cheat"
可以简单地这么计算,假设cheat的概率是p的话,那么实际得到的结果回是 1/2 * p + 1/2 * 1/2 = 1/2 * p + 1/4. 从而回退出有多少同学作弊。
We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let N be the total number of students who took the exam, and assuming each student is interviewed post-exam (answering without consequence), we will receive integer X"Yes I did cheat" answers. We then find the posterior distribution of p, given N, some specified prior on p, and observed data X.
This is a completely absurd model. No student, even with a free-pass against punishment, would admit to cheating. What we need is a better algorithm to ask students if they had cheated. Ideally the algorithm should encourage individuals to be honest while preserving privacy. The following proposed algorithm is a solution I greatly admire for its ingenuity and effectiveness:
In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student (secretly) flips the coin again, and answers "Yes, I did cheat" if the coin flip lands heads, and "No, I did not cheat", if the coin flip lands tails. This way, the interviewer does not know if a "Yes" was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers.
I call this the Privacy Algorithm. One could of course argue that the interviewers are still receiving false data since some Yes's are not confessions but instead randomness, but an alternative perspective is that the researchers are discarding approximately half of their original dataset since half of the responses will be noise. But they have gained a systematic data generation process that can be modeled. Furthermore, they do not have to incorporate (perhaps somewhat naively) the possibility of deceitful answers. We can use PyMC3 to dig through this noisy model, and find a posterior distribution for the true frequency of liars.
5. Algorithms to perform MCMC :ref
MCMC的办法就是不断地做sample/position, 看看这些sample/position是否可以接受。接受的标准需要考虑到数据以及先验分布。 PyMC3里面对这些samples成为traces. 我理解第一个MC(Markov Chaind)的意思,这个算法只考虑从当前点到下一个点,而不会考虑是怎么到达当前点的,也就是一阶马尔可夫链。 而第二个MC(Monte Carlo)的意思,则是通过在data和prior distribution上面进行随机抽取算法判断当前点是否accepted.
There is a large family of algorithms that perform MCMC. Most of these algorithms can be expressed at a high level as follows: (Mathematical details can be found in the appendix.)
- Start at current position.
- Propose moving to a new position (investigate a pebble near you).
- Accept/Reject the new position based on the position's adherence to the data and prior distributions (ask if the pebble likely came from the mountain).
- If you accept: Move to the new position. Return to Step 1.
- Else: Do not move to new position. Return to Step 1.
- After a large number of iterations, return all accepted positions.
This way we move in the general direction towards the regions where the posterior distributions exist, and collect samples sparingly on the journey. Once we reach the posterior distribution, we can easily collect samples as they likely all belong to the posterior distribution.
If the current position of the MCMC algorithm is in an area of extremely low probability, which is often the case when the algorithm begins (typically at a random location in the space), the algorithm will move in positions that are likely not from the posterior but better than everything else nearby. Thus the first moves of the algorithm are not reflective of the posterior.
In the above algorithm's pseudocode, notice that only the current position matters (new positions are investigated only near the current position). We can describe this property as memorylessness, i.e. the algorithm does not care how it arrived at its current position, only that it is there.
6. Why Thousands of Samples? :ref
这里解释为什么做sampling是非常有效的方法:相比数学公式更加可行,而相比给出具体某个点则可以提供分布信息。 此外大量的sampling本质上就是利用大数定理,而这种办法非常适合解决“毫无头绪”的问题
At first, returning thousands of samples to the user might sound like being an inefficient way to describe the posterior distributions. I would argue that this is extremely efficient. Consider the alternative possibilities:
- Returning a mathematical formula for the "mountain ranges" would involve describing a N-dimensional surface with arbitrary peaks and valleys.
- Returning the "peak" of the landscape, while mathematically possible and a sensible thing to do as the highest point corresponds to most probable estimate of the unknowns, ignores the shape of the landscape, which we have previously argued is very important in determining posterior confidence in unknowns.
Besides computational reasons, likely the strongest reason for returning samples is that we can easily use The Law of Large Numbers to solve otherwise intractable problems. I postpone this discussion for the next chapter. With the thousands of samples, we can reconstruct the posterior surface by organizing them in a histogram.
7. Other approximation solutions to the posterior :ref
Besides MCMC, there are other procedures available for determining the posterior distributions. A Laplace approximation is an approximation of the posterior using simple functions. A more advanced method is Variational Bayes. All three methods, Laplace Approximations, Variational Bayes, and classical MCMC have their pros and cons. We will only focus on MCMC in this book. That being said, my friend Imri Sofar likes to classify MCMC algorithms as either "they suck", or "they really suck". He classifies the particular flavour of MCMC used by PyMC3 as just sucks ;)
8. Useful tips for MCMC :ref
MCMC算法和先验选择非常相关,如果先验是错误的话,那么通过MCMC没有办法收敛。收敛的意思是指,结果参数的范围不能太过于宽泛,必须集中于某个中心。
When I say MCMC intelligently searches, I really am saying MCMC will hopefully converge towards the areas of high posterior probability. MCMC does this by exploring nearby positions and moving into areas with higher probability. Again, perhaps "converge" is not an accurate term to describe MCMC's progression. Converging usually implies moving towards a point in space, but MCMC moves towards a broader area in the space and randomly walks in that area, picking up samples from that area.
If the priors are poorly chosen, the MCMC algorithm may not converge, or atleast have difficulty converging. Consider what may happen if the prior chosen does not even contain the true parameter: the prior assigns 0 probability to the unknown, hence the posterior will assign 0 probability as well. This can cause pathological results.
For this reason, it is best to carefully choose the priors. Often, lack of covergence or evidence of samples crowding to boundaries implies something is wrong with the chosen priors (see Folk Theorem of Statistical Computing below).
The Folk Theorem of Statistical Computing: If you are having computational problems, probably your model is wrong.