This tutorial starts from the basic concepts of Bayesian inference, explains in detail how the likelihood function, the prior, the posterior, and the posterior predictive distribution relate to one another, and illustrates the principles of Bayesian machine learning through three concrete experiments.
When we fit a model, we are essentially estimating its parameters. The Bayesian approach combines empirical evidence with prior knowledge: when the data are scarce, or not yet convincing, it leans on the prior; as the evidence accumulates and becomes more convincing, its reliance on the prior shrinks. Many traditional statistical and machine-learning models, by contrast, are purely "empiricist": decision trees, for example, rely entirely on the observed data and ignore the fact that, in complicated situations, human prior knowledge can help a great deal.
Let us return to the simplest coin-tossing game (a Bernoulli trial).
We assume $X \sim Ber(\theta)$, where $X_{i} = 1$ denotes heads and $X_{i} = 0$ denotes tails. Our likelihood is then:
$$p(\mathcal{D}|\theta) = \theta^{N_{1}}(1-\theta)^{N_{0}},\,\,(why?)$$where $N_{1}$ and $N_{0}$ are the numbers of heads and tails, respectively.
Prior:
Suppose, for example, that I believe the coin is probably fair; I can then set the prior hyperparameters to $\gamma_{1}=3$, $\gamma_{2}=3$ (why?). What would $\gamma_{1}=0$, $\gamma_{2}=0$ correspond to?
$$p(\theta) \propto \theta^{\gamma_{1}}(1-\theta)^{\gamma_{2}}$$Likelihood:
$$p(\mathcal{D}|\theta) = Bin(N_{1}|\theta, N_{0}+N_{1}) \propto \theta^{N_{1}}(1-\theta)^{N_{0}}$$Posterior:
\begin{align*} p(\theta|\mathcal{D}) & \propto p(\mathcal{D}|\theta)p(\theta)\\ & \propto Bin(N_{1}|\theta, N_{0}+N_{1})Beta(\theta|a,b)\\ & \propto Beta(\theta|N_{1}+a,N_{0}+b) \end{align*}Point estimates (MAP, MLE, and posterior mean):
\begin{align*} \hat{\theta}_{MAP} &= \frac{a+N_{1}-1}{a+b+N-2}\\ \hat{\theta}_{MLE} &= \frac{N_{1}}{N}\quad\text{(look familiar?)}\\ \bar{\theta} &= \frac{a+N_{1}}{a+b+N}\\ \mathbb{E}[\theta|\mathcal{D}] &= \frac{\alpha_{0}m_{1}+N_{1}}{N+\alpha_{0}} = \lambda m_{1}+(1-\lambda)\hat{\theta}_{MLE} \end{align*}where $\alpha_{0} = a+b$, $N = N_{0}+N_{1}$, $m_1 = \frac{a}{\alpha_0}$, and $\lambda = \frac{\alpha_{0}}{N+\alpha_{0}}$: the posterior mean is a convex combination of the prior mean and the MLE.
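As a quick numerical sanity check, here is a small sketch comparing the three estimators (the counts and prior values below are made up purely for illustration):
# compare the three point estimates on made-up counts: 7 heads, 3 tails, prior Beta(3,3)
a, b = 3, 3
N1, N0 = 7, 3
N = N1 + N0
theta_map  = (a + N1 - 1) / (a + b + N - 2)   # posterior mode
theta_mle  = N1 / N                           # maximum likelihood
theta_mean = (a + N1) / (a + b + N)           # posterior mean
print(theta_map, theta_mle, theta_mean)       # approx. 0.643, 0.7, 0.625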
Posterior variance:
$$var(\theta|\mathcal{D}) = \frac{(a+N_{1})(b+N_{0})}{(a+b+N)^2(a+b+N+1)} \thickapprox \frac{N_{1}N_{0}}{N^{3}} = \frac{\hat{\theta}_{MLE}(1-\hat{\theta}_{MLE})}{N}$$A Bayesian model is a probabilistic model, and one key feature of probabilistic models is that they do not stop at point estimates: they give full distributions, both over parameters and over predictions. This means we get a description of uncertainty, such as the predictive variance, whereas in non-probabilistic models we can only obtain estimates from the empirical distribution.
Posterior predictive distribution:
$$p(\tilde{x}=1|\mathcal{D}) = \int_{0}^{1}p(x=1|\theta)p(\theta|\mathcal{D})\,d\theta = \int_{0}^{1}\theta\, Beta(\theta|a+N_{1},b+N_{0})\,d\theta = \frac{a+N_{1}}{a+b+N}, \quad \text{i.e. } \tilde{x}|\mathcal{D} \sim Ber(\tilde{x}|\mathbb{E}[\theta|\mathcal{D}])$$Here the so-called black swan problem shows up: when an event has never occurred, either because it is extremely rare or because the sample is too small, the MLE assigns it probability 0 simply because it never appeared in the empirical data. A probability of exactly 0 is absurd, since it claims the event is impossible. In the Bayesian framework, by contrast, bringing in prior information (e.g. through the MAP or the posterior mean) fixes this zero-count problem.
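For instance, with made-up numbers: under a uniform $Beta(1,1)$ prior, after observing three tails and no heads ($N_{1}=0$, $N_{0}=3$) the MLE declares heads impossible, while the posterior predictive still gives it probability $1/5$ (Laplace's rule of succession):
$$\hat{\theta}_{MLE} = \frac{0}{3} = 0, \qquad p(\tilde{x}=1|\mathcal{D}) = \frac{a+N_{1}}{a+b+N} = \frac{1+0}{1+1+3} = \frac{1}{5}$$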
The multi-toss experiment (predicting the number of heads in M future tosses)
\begin{align*} p(x|\mathcal{D},M) &= \int_{0}^{1} Bin(x|\theta,M)\, Beta(\theta|a+N_{1},b+N_{0})\,d\theta\\ &= {{M}\choose{x}}\frac{1}{B(a+N_{1},b+N_{0})} \int_{0}^{1} \theta^{x+a+N_{1}-1}(1-\theta)^{M-x+b+N_{0}-1}\,d\theta\\ &= {{M}\choose{x}}\frac{1}{B(a+N_{1},b+N_{0})} B(x+a+N_{1},M-x+b+N_{0}) \end{align*}Recall the Beta function:
$$B(x,y) = \int_{0}^{1}t^{x-1}(1-t)^{y-1}\,dt$$This gives us a very important distribution, the beta-binomial distribution:
\begin{align*} Bb(x|a,b,M) &= {{M}\choose{x}}\frac{B(x+a,M-x+b)}{B(a,b)}\\ \mathbb{E}[x] &= M \frac{a}{a+b}\\ var[x] &= \frac{Mab}{(a+b)^2}\frac{a+b+M}{a+b+1}\\ \end{align*}
import numpy as np
from scipy.special import gamma
## log of the Beta function B(a,b)
def BLog(a,b):
    BLog = np.log(gamma(a))+np.log(gamma(b))-np.log(gamma(a+b))
    return BLog
## log-density of the Beta(a,b) distribution evaluated at x
def betalog(x,a,b):
    betalog = (a-1)*np.log(x)+(b-1)*np.log(1-x)-BLog(a,b)
    return betalog
import matplotlib.pyplot as plt
# simulate a sequence of coin tosses
theta=0.25  # true probability of heads, unknown to the observer
N=30
datalist = [np.random.binomial(1,theta) for i in range(N)]
# plot prior
prior_a = 3
prior_b = 3
x = np.linspace(0.001, 0.999, 200)
Pprior = np.exp(betalog(x, prior_a, prior_b))
plt.plot(x,Pprior,label="prior:Beta(%s,%s)"%(prior_a,prior_b),linewidth=2)
# plot posterior after 10, 20, and 30 tosses
for i in range(10,31,10):
    N1 = sum(datalist[0:i])
    N0 = i-sum(datalist[0:i])
    post_a = prior_a + N1
    post_b = prior_b + N0
    Ppost = np.exp(betalog(x, post_a, post_b))
    plt.plot(x,Ppost,label="post:Beta(%s,%s)"%(post_a,post_b),linewidth=2)
plt.legend()
plt.show()
# likelihood function based on all N observed tosses (N1, N0 come from the final loop iteration)
Plik = x**N1*(1-x)**N0
plt.plot(x, Plik, linewidth=2, label="Likelihood function")
plt.legend()
plt.show()
# MAP estimate and posterior standard deviation under a uniform Beta(1,1) prior
a=1; b=1
theta_map = (a-1+N1)/(a+b-2+N)
theta_var = (a+N1)*(b+N0)/(a+b+N)**2/(a+b+N+1)
print("MAP estimate: %.2f, posterior std: %.2f" % (theta_map, np.sqrt(theta_var)))
print("Simulated number of heads: " + str(np.sum(datalist)) + ", predicted number of heads: " + str(N*(a+N1)/(a+b+N)))
from scipy.special import gamma,comb
## log-pmf of the beta-binomial distribution Bb(X|a,b,M)
def betaBinomLog(X,a,b,M):
    return BLog(X+a, M-X+b) - BLog(a,b) + np.log(comb(M, X))
# dense grid over [0, 30] for plotting (the beta-binomial is really a distribution over the integers 0..M)
X = np.linspace(0.001, 30.999, 200)
priorPred = np.exp(betaBinomLog(X, prior_a, prior_b, 30))
plt.plot(X, priorPred, linewidth=2)
plt.title("prior predictive")
plt.show()
postPred = np.exp(betaBinomLog(X, post_a, post_b, 30))
plt.plot(X, postPred, linewidth=2)
plt.title("posterior predictive")
plt.show()
The core idea of statistical learning is the probabilistic interpretation, which sets it apart from purely discriminative algorithms. Its main difficulty lies in deriving, or approximating, a series of probability distributions. When applying statistical learning algorithms or models, we must pay particular attention to whether the model's assumptions hold (independence, conditional independence, hidden correlations, and so on). In the study of statistical learning theory we run into a range of numerical problems: sometimes an optimization problem (convex optimization), sometimes an algorithmic one (time and space complexity), and sometimes an involved theoretical derivation (valuable because a theorem can sometimes yield a closed-form result or greatly simplify the computation).
Now let us extend the previous setting: what if I roll a die instead of tossing a coin? A die has six faces, so the Bernoulli distribution clearly no longer fits. What distribution should we use to model a die roll?
Back to our core quantities, starting with the likelihood:
$$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{k=1}^{K}\theta_{k}^{N_{k}}$$where $N_{k} = \sum_{i=1}^{N}\mathbb{I}(y_{i} = k)$.
Prior:
In Bayesian theory, if the posterior $p(\theta|x)$ belongs to the same family of distributions as the prior $p(\theta)$, we say they are conjugate distributions. Our parameter vector lives on the K-dimensional probability simplex, and we would also like the prior to be conjugate. Fortunately, there is a standard distribution that meets both requirements: the Dirichlet distribution.
$$Dir(\boldsymbol{\theta}|\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K}\theta_{k}^{\alpha_{k}-1}$$Posterior:
\begin{align*} p(\boldsymbol{\theta}|\mathcal{D}) &\propto p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})\\ &\propto \prod_{k=1}^{K}\theta_{k}^{N_{k}}\theta_{k}^{\alpha_{k}-1} = \prod_{k=1}^{K}\theta_{k}^{\alpha_{k}+N_{k}-1}\\ &= Dir(\boldsymbol{\theta}|\alpha_{1}+N_{1}, ..., \alpha_{K}+N_{K}) \end{align*}
Posterior mode:
Proof omitted (the idea: Lagrange multipliers).
$$\hat{\theta}_{k}^{MAP} = \frac{N_{k}+\alpha_{k}-1}{N+\alpha_{0}-K}, \quad \text{where } \alpha_{0} \triangleq \sum_{k=1}^{K}\alpha_{k}$$$$\hat{\theta}_{k}^{MLE} = \frac{N_{k}}{N}$$Posterior predictive distribution:
\begin{align*} p(X=j|\mathcal{D}) &= \int p(X=j|\boldsymbol{\theta})p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}\\ &= \int p(X = j|\theta_{j})\left[ \int p(\boldsymbol{\theta}_{-j}, \theta_{j}|\mathcal{D})\, d\boldsymbol{\theta}_{-j} \right] d\theta_{j}\,\,(why?)\\ &= \int \theta_{j}\,p(\theta_{j}|\mathcal{D})\,d\theta_{j}\\ &= \mathbb{E}[\theta_{j}|\mathcal{D}] = \frac{\alpha_{j}+N_{j}}{\sum_{k}(\alpha_{k}+N_{k})} = \frac{\alpha_{j}+N_{j}}{\alpha_{0}+N} \end{align*}Applied example:
Let us return to the dice problem: if the die is not fair, how should we model it?
First, suppose we have no extra information about the die, so our prior puts each face at 1/6 on average, $\mathbb{E}[\theta_{k}] = 1/6$, which we encode by setting $\alpha_{k}=1$ (a uniform Dirichlet prior).
Let us now simulate this scenario:
# true face probabilities of the (biased) die, unknown to the observer
dice_face_prob = [0.01,0.3,0.2,0.2,0.1,0.19]
K = 6
N = 100
# cumulative boundaries scaled to [0, 10]
l = [c*10 for c in dice_face_prob]
result = []
for i in range(N):
    num = np.random.uniform(0, 10)  # uniform draw (an integer draw could never produce face 1, whose probability is 0.01)
    for j in range(K):
        if sum(l[0:j]) < num <= sum(l[0:j+1]):
            result.append(j+1)
N_j = [result.count(i) for i in [1,2,3,4,5,6]]
print("每一面骰子的出现次数:" + str(N_j))
#我们假设alpha=1
a=1
probMAP = [(c+a)/(N+K*a) for c in N_j]
print("每面骰子的MAP估计是:" + str([round(p,2) for p in probMAP]) + '\n' + "每面骰子的真实概率为:", str(dice_face_prob))
The smoothed Bayesian estimates look close to the true values; not a bad result. For comparison, let us look at the MLE (note that the rarest face may well end up with an estimated probability of exactly 0):
probMLE = [result.count(i)/N for i in [1,2,3,4,5,6]]
print("MLE estimate for each face: " + str([round(p,2) for p in probMLE]) + '\n' + "True probability of each face:", str(dice_face_prob))
Next we introduce a model everyone has heard of, the naive Bayes classifier:
$$p(\boldsymbol{x}|y=c, \boldsymbol{\theta})=\prod_{j=1}^{D}p(x_{j}|y=c, \boldsymbol{\theta}_{jc})$$(Compare this with the earlier formulas: which one does it resemble?) Let us first explain what "naive" means here: simply put, the features $\boldsymbol{x}$ are assumed to be conditionally independent given the class. The model has $O(CD)$ parameters (why?), which makes it relatively resistant to overfitting.
Next, we examine in turn the cases where $p(x_{j}|y=c, \boldsymbol{\theta}_{jc})$ takes each of the following forms:
$p(\boldsymbol{x}|y=c, \boldsymbol{\theta}) = \prod_{j=1}^{D}\mathcal{N}(x_{j}|\mu_{jc},\sigma_{jc}^{2})$
$p(\boldsymbol{x}|y=c, \boldsymbol{\theta}) = \prod_{j=1}^{D}Ber(x_{j}|\mu_{jc})$
$p(\boldsymbol{x}|y=c, \boldsymbol{\theta}) = \prod_{j=1}^{D}Cat(x_{j}|\boldsymbol{\mu}_{jc})$
Model training:
The joint density of a single observation is $$p(\boldsymbol{x}_{i},y_{i}|\boldsymbol{\theta}) = p(y_{i}|\boldsymbol{\pi})\prod_{j}p(x_{ij}|\boldsymbol{\theta}_{j}) = \prod_{c}\pi_{c}^{\mathbb{I}(y_{i}=c)}\prod_{j}\prod_{c}p(x_{ij}|\boldsymbol{\theta}_{jc})^{\mathbb{I}(y_{i}=c)}$$
Our log-likelihood is therefore:
$$\log p(\mathcal{D}|\boldsymbol{\theta}) = \sum_{c=1}^{C}N_{c}\log\pi_{c} + \sum_{j=1}^{D}\sum_{c=1}^{C}\sum_{i:y_{i}=c}\log p(x_{ij}|\boldsymbol{\theta}_{jc})$$Using the technique we used before (Lagrange multipliers), we obtain: $$\hat{\pi}_{c}^{MLE} = \frac{N_{c}}{N}$$
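(Filling in the Lagrange-multiplier step: maximize the $\pi$-part of the log-likelihood subject to $\sum_{c}\pi_{c}=1$.)
$$\mathcal{L} = \sum_{c=1}^{C}N_{c}\log\pi_{c} + \lambda\Big(1-\sum_{c=1}^{C}\pi_{c}\Big),\qquad \frac{\partial\mathcal{L}}{\partial\pi_{c}} = \frac{N_{c}}{\pi_{c}}-\lambda = 0 \;\Rightarrow\; \pi_{c} = \frac{N_{c}}{\lambda},\qquad \sum_{c}\pi_{c}=1 \;\Rightarrow\; \lambda = N,\;\; \hat{\pi}_{c} = \frac{N_{c}}{N}$$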
To estimate $\boldsymbol{\theta}$ we first have to specify its distribution (the three cases listed above). Here we assume $p(x_{j}|y=c,\theta) \sim Ber(\theta_{jc})$, i.e. all features are binary.
$$\hat{\theta}_{jc}^{MLE} = \frac{N_{jc}}{N_{c}}$$where $N_{jc}$ is the number of class-$c$ examples with feature $x_{j}=1$. Let us simulate and fit this model:
# true Bernoulli parameters: theta[c][j] = P(x_j = 1 | y = c)
theta = [[0.5,0.6,0.2],[0.4,0.3,0.5],[0.2,0.5,0.3]]
# class prior pi
pai = [0.1,0.8,0.1]
C = 3   # number of classes
N = 200 # number of observations
l = [i*10 for i in pai]
result = []
# sample class labels y_i ~ Cat(pai)
for i in range(N):
    num = np.random.randint(1,11)
    for j in range(C):
        if sum(l[0:j]) < num <= sum(l[0:j+1]):
            result.append(j)
N_c = [result.count(i) for i in range(C)]
import pandas as pd
D = 3   # number of binary features
# for each observation, look up the Bernoulli parameters of its class
result_theta = [theta[i] for i in result]
# sample the binary feature matrix: a[i][j] ~ Ber(theta[y_i][j])
a = np.zeros((N,D))
for i in range(N):
    for j in range(D):
        para = result_theta[i][j]
        a[i][j] = np.random.binomial(1,para)
df = pd.DataFrame(a,index=result,columns=['a','b','c']).reset_index()
# N_jc: per class, number of examples with x_j = 1; N_jcount: per class, total number of examples
N_jc = df.groupby('index').sum()
N_jcount = df.groupby('index').count()
# MLE theta_hat_jc = N_jc / N_c versus the true theta
print((N_jc/N_jcount).values, '\n', theta)
# MLE of the class prior: pi_hat_c = N_c / N
pai_mle = [i/N for i in N_c]
print(pai, '\n', pai_mle)
Bayesian naive Bayes
The problem with the MLE above is that it breaks down on black-swan events, stubbornly assigning them probability 0. One way to fix this is to go Bayesian.
$$p(\boldsymbol{\theta}) = p(\boldsymbol{\pi})\prod_{j=1}^{D}\prod_{c=1}^{C}p(\theta_{jc})$$We then update our posterior:
$$p(\boldsymbol{\theta}|\mathcal{D}) = p(\boldsymbol{\pi}|\mathcal{D})\prod_{j=1}^{D}\prod_{c=1}^{C}p(\theta_{jc}|\mathcal{D})$$$$p(\boldsymbol{\pi}|\mathcal{D}) = Dir(N_{1}+\alpha_{1},...,N_{C}+\alpha_{C})$$$$p(\theta_{jc}|\mathcal{D}) = Beta(N_{jc}+a,N_{c}-N_{jc}+b)\,\,\text{(recall the coin example)}$$Model prediction: $$p(y=c|\boldsymbol{x},\mathcal{D}) \propto p(y=c|\mathcal{D})\prod_{j=1}^{D}p(x_{j}|y=c,\mathcal{D})$$
$$p(y=c|\boldsymbol{x},\mathcal{D}) \propto \left[ \int Cat(y=c|\boldsymbol{\pi})p(\boldsymbol{\pi}|\mathcal{D})\,d\boldsymbol{\pi} \right] \prod_{j=1}^{D}\left[ \int Ber(x_{j}|y=c,\theta_{jc})p(\theta_{jc}|\mathcal{D})\, d\theta_{jc}\right]$$$$\propto \bar{\pi}_{c} \prod_{j=1}^{D} (\bar{\theta}_{jc})^{\mathbb{I}(x_{j}=1)}(1-\bar{\theta}_{jc})^{\mathbb{I}(x_{j}=0)}$$$$\bar{\theta}_{jc} = \frac{N_{jc}+a}{N_{c}+a+b}$$$$\bar{\pi}_{c} = \frac{N_{c}+\alpha_{c}}{N+\alpha_{0}}\,\,\text{(recall the Dirichlet distribution)}$$
alpha = np.ones(C)
# posterior mean of pi_c (pi_bar in the formula above)
pai_map = (N_c+alpha)/(N+sum(alpha))
a=1
b=1
# posterior mean of theta_jc (theta_bar in the formula above)
thetaMap = (N_jc+a)/(N_jcount+a+b)
thetaMap
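With $\bar{\pi}_{c}$ and $\bar{\theta}_{jc}$ in hand, classifying a new binary feature vector is just the prediction formula above, evaluated in log space. Here is a minimal sketch (the helper predict_proba and the example vector [1, 0, 1] are added here purely for illustration):
# classify a new binary feature vector with the smoothed estimates above
def predict_proba(x, pi_bar, theta_bar):
    x = np.asarray(x)
    theta_bar = np.asarray(theta_bar)
    # log p(y=c|x,D) up to a constant:
    # log pi_bar_c + sum_j [ x_j log theta_bar_jc + (1-x_j) log(1-theta_bar_jc) ]
    log_post = np.log(pi_bar) + (x*np.log(theta_bar) + (1-x)*np.log(1-theta_bar)).sum(axis=1)
    post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
    return post/post.sum()
print(predict_proba([1,0,1], pai_map, thetaMap.values))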
This model is called the Bernoulli product model. It is a very basic model and helps us understand the problem better; its main weakness is that it ignores how often each word appears in a document.
So, once we take the count of every word into account, we naturally arrive at: $$p(\boldsymbol{x}_{i}|y_{i}=c,\boldsymbol{\theta}) = Mu(\boldsymbol{x}_{i}|N_{i},\boldsymbol{\theta}_{c}) = \frac{N_{i}!}{\prod_{j=1}^{D}x_{ij}!}\prod_{j=1}^{D}\theta_{jc}^{x_{ij}}$$
Yet this model has a flaw too: as discussed earlier, it cannot handle black-swan events well. Of course, we know how to patch it: go Bayesian! Adding the prior turns it into another well-known model, the DCM (Dirichlet Compound Multinomial):
$$p(\boldsymbol{x}_{i}|y_{i}=c,\boldsymbol{\alpha}) = \int Mu(\boldsymbol{x}_{i}|N_{i},\boldsymbol{\theta}_{c})\,Dir(\boldsymbol{\theta}_{c}|\boldsymbol{\alpha}_{c})\,d\boldsymbol{\theta}_{c} = \frac{N_{i}!}{\prod_{j=1}^{D}x_{ij}!}\frac{B(\boldsymbol{x}_{i}+\boldsymbol{\alpha}_{c})}{B(\boldsymbol{\alpha}_{c})}$$
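Here, $B(\boldsymbol{\alpha}) = \prod_{k}\Gamma(\alpha_{k})/\Gamma(\sum_{k}\alpha_{k})$ is the multivariate Beta function. As a minimal sketch (not part of the original code), the DCM log-probability can be evaluated with scipy.special.gammaln, which avoids the overflow that the factorials and Beta functions would otherwise cause; the word-count vector in the example is made up:
from scipy.special import gammaln
def dcm_log_pmf(x, alpha):
    # log p(x|alpha) = log[ N!/prod_j x_j! * B(x+alpha)/B(alpha) ], with B the multivariate Beta function
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n = x.sum()
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()
    log_B = lambda v: gammaln(v).sum() - gammaln(v.sum())
    return log_coef + log_B(x + alpha) - log_B(alpha)
# e.g. a made-up word-count vector and a symmetric alpha
print(np.exp(dcm_log_pmf([2, 0, 1, 3], np.ones(4))))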