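Logistic Regression

Logistic regression, also called logistic regression analysis, is a generalized linear regression model. It is commonly used in data mining, automated disease diagnosis, economic forecasting, and similar fields.

  1. It is often the first algorithm to try for classification problems.
  2. Logistic regression handles binary classification; Softmax regression handles multiclass classification.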

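Sigmoid Function

$$g(z) = \frac{1}{1+e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$$

The derivative of the sigmoid can be written in terms of the function itself:

$$g'(x) = \left(\frac{1}{1+e^{-x}}\right)' = \frac{e^{-x}}{\left(1+e^{-x}\right)^2} = \frac{1}{1+e^{-x}}\left(1-\frac{1}{1+e^{-x}}\right) = g(x)\left(1-g(x)\right)$$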
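The identity for the derivative can be checked numerically. The following is a minimal sketch that uses only the standard library; the helper names `sigmoid`, `sigmoid_grad`, and `numeric_grad` are made up for illustration and do not come from the original post.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Closed-form derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)


def numeric_grad(f, z, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in (-2.0, 0.0, 3.0):
    # The two derivative values should agree to several decimal places.
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The derivative is largest at z = 0, where g(0) = 0.5 and g'(0) = 0.25, and it shrinks toward zero as |z| grows.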

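Parameter Estimation for Logistic Regression

Assume:
$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1-h_\theta(x)$$

Then:
$$P(y \mid x;\theta) = \left(h_\theta(x)\right)^{y}\left(1-h_\theta(x)\right)^{1-y}$$

Likelihood function:
$$L(\theta) = p(y \mid X;\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)};\theta\right) = \prod_{i=1}^{m}\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$

Taking the logarithm gives:
$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$

Finally, take the partial derivative with respect to each parameter $\theta_j$:

$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}\left(y^{(i)} - g(\theta^T x^{(i)})\right)x_j^{(i)}$$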
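The gradient formula can be compared against a finite-difference approximation of the log-likelihood. This is a minimal sketch in plain Python; the toy data, the starting `theta`, and all function names are assumptions made up for illustration, with the first component of each sample fixed at 1 to act as a bias term.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i * log(h_i) + (1 - y_i) * log(1 - h_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(dot(theta, xi))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total


def gradient(theta, X, y):
    """dl/dtheta_j = sum_i (y_i - h_i) * x_ij."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(dot(theta, xi))
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad


def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of the log-likelihood, one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))
    return grad


# Toy data, made up for illustration.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
theta = [0.3, -0.7]

print(gradient(theta, X, y))
print(numeric_gradient(theta, X, y))
```

The two printed vectors should agree closely, which is a quick way to catch sign or index mistakes when implementing the derivative by hand.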

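Parameter Iteration

Learning rule for the logistic regression parameters, where $\alpha$ is the learning rate:

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$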
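The learning rule can be sketched as a per-sample update loop. This is an illustrative example under stated assumptions, not the training procedure used by any library: the toy data set, the learning rate `alpha=0.1`, the epoch count, and the names `sgd_fit` and `predict` are all made up here.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sgd_fit(X, y, alpha=0.1, epochs=200):
    """Apply theta_j := theta_j + alpha * (y - h) * x_j once per sample per epoch."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
            theta = [t + alpha * (yi - h) * v for t, v in zip(theta, xi)]
    return theta


def predict(theta, xi):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(t * v for t, v in zip(theta, xi))) >= 0.5 else 0


# Toy, linearly separable data; the leading 1.0 is the bias feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = sgd_fit(X, y)
print(theta)
print([predict(theta, xi) for xi in X])
```

Because the update adds the gradient of the log-likelihood, this is gradient ascent on l(θ), which is the same as gradient descent on the loss -l(θ).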

Loss Function

$$loss\left(y_i,\hat{y}_i\right) = -l\left(\theta\right)$$

where $y_i \in \left\{0,1\right\}$ and

$$\hat{y}_i = \begin{cases} p_i & y_i = 1 \\ 1-p_i & y_i = 0 \end{cases}$$

Substituting $p_i = \frac{1}{1+e^{-f_i}}$, with $f_i = \theta^T x^{(i)}$, gives the final loss function:

$$\therefore loss\left(y_i,\hat{y}_i\right) = -l\left(\theta\right) = -\sum_{i=1}^{m}\ln\left[p_i^{y_i}\left(1-p_i\right)^{1-y_i}\right] = \sum_{i=1}^{m}\left[y_i\ln\left(1+e^{-f_i}\right) + \left(1-y_i\right)\ln\left(1+e^{f_i}\right)\right]$$

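Loss of Logistic Regression

When the labels are encoded as $y_i \in \left\{-1,1\right\}$:

$$L(\theta) = \prod_{i=1}^{m} p_i^{(y_i+1)/2}\left(1-p_i\right)^{-(y_i-1)/2}$$

$$loss\left(y_i,\hat{y}_i\right) = \sum_{i=1}^{m}\ln\left(1+e^{-y_i f_i}\right)$$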
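The {0, 1} and {-1, 1} label encodings give the same per-sample loss. The sketch below checks this numerically for a few scores; the function names are hypothetical, and `f` stands for the linear score θᵀx, with p = 1 / (1 + e^(-f)).

```python
import math


def loss_01(y, f):
    """Per-sample loss with y in {0, 1}."""
    return y * math.log(1.0 + math.exp(-f)) + (1 - y) * math.log(1.0 + math.exp(f))


def loss_pm1(t, f):
    """Per-sample loss with t in {-1, +1}."""
    return math.log(1.0 + math.exp(-t * f))


def neg_log_prob(y, f):
    """Negative log of p^y * (1 - p)^(1 - y), the likelihood form."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -math.log(p ** y * (1.0 - p) ** (1 - y))


for f in (-1.5, 0.0, 2.0):
    for y in (0, 1):
        t = 2 * y - 1  # map {0, 1} to {-1, +1}
        print(f, y, loss_01(y, f), loss_pm1(t, f), neg_log_prob(y, f))
```

All three columns should match for each row, since -ln p = ln(1 + e^(-f)) and -ln(1 - p) = ln(1 + e^(f)).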

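Generalized Linear Model (GLM)

  1. y is no longer restricted to the normal distribution; it may follow any distribution in the exponential family.
  2. x -> g(x) -> y, where the link function g is monotonic and differentiable. Examples are $g(z)=\frac{1}{1+e^{-z}}$ in logistic regression and the stretched variant $g(z)=\frac{1}{1+e^{-\lambda z}}$.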
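To see what the stretch parameter does, the sketch below evaluates the stretched sigmoid for a few values of λ. The function name `stretched_sigmoid` and the chosen λ values are assumptions for illustration only.

```python
import math


def stretched_sigmoid(z, lam=1.0):
    """g(z) = 1 / (1 + exp(-lam * z)); lam = 1 gives the ordinary sigmoid."""
    return 1.0 / (1.0 + math.exp(-lam * z))


for lam in (0.5, 1.0, 5.0):
    values = [round(stretched_sigmoid(z, lam), 4) for z in (-1.0, 0.0, 1.0)]
    print(lam, values)
```

Every curve passes through 0.5 at z = 0; a larger λ makes the transition around zero steeper, while the function stays monotonic and differentiable.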

[Figure: GLM]

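Softmax Regression

  1. For K-class classification, the parameter vector of class k is $\theta_k$; together they form a two-dimensional matrix $\theta_{K\times n}$.
  2. Probability: $p(c=k \mid x;\theta) = \dfrac{\exp(\theta_k^T x)}{\sum_{l=1}^{K}\exp(\theta_l^T x)}$, where $k=1,2,\dots,K$.
  3. Likelihood function: $L(\theta) = \prod_{i=1}^{m}\prod_{k=1}^{K} p(c=k \mid x^{(i)};\theta)^{y_k^{(i)}} = \prod_{i=1}^{m}\prod_{k=1}^{K}\left(\dfrac{\exp(\theta_k^T x^{(i)})}{\sum_{l=1}^{K}\exp(\theta_l^T x^{(i)})}\right)^{y_k^{(i)}}$
  4. Log-likelihood:
    $J_m(\theta) = \ln L(\theta) = \sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\left(\theta_k^T x^{(i)} - \ln\sum_{l=1}^{K}\exp(\theta_l^T x^{(i)})\right)$

    For a single sample: $J(\theta) = \sum_{k=1}^{K} y_k\left(\theta_k^T x - \ln\sum_{l=1}^{K}\exp(\theta_l^T x)\right)$
  5. Stochastic gradient:
    $\dfrac{\partial J(\theta)}{\partial \theta_k} = \left(y_k - p(y_k \mid x;\theta)\right)x$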
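The softmax probability and the single-sample gradient can be sketched in a few lines of plain Python. The parameter matrix, the sample `x`, the one-hot label, and all function names below are assumptions made up for illustration; the max-subtraction inside `softmax_probs` is a common numerical-stability trick and is not part of the formulas above.

```python
import math


def softmax_probs(theta, x):
    """p(c=k | x; theta) for each class k; theta is a K x n list of rows."""
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in theta]
    top = max(scores)  # subtracting the max leaves the ratios unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(theta, x, y_onehot):
    """Single-sample J(theta) = sum_k y_k * ln p_k."""
    probs = softmax_probs(theta, x)
    return sum(yk * math.log(pk) for yk, pk in zip(y_onehot, probs))


def gradient(theta, x, y_onehot):
    """dJ/dtheta_k = (y_k - p_k) * x, returned as one row per class."""
    probs = softmax_probs(theta, x)
    return [[(yk - pk) * v for v in x] for yk, pk in zip(y_onehot, probs)]


# Toy values: K = 3 classes, n = 2 features.
theta = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]
x = [1.0, 2.0]
y_onehot = [0, 1, 0]

print(softmax_probs(theta, x))
print(gradient(theta, x, y_onehot))
```

The rows of the gradient sum to zero in each column, because the probabilities and the one-hot label both sum to one.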

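Iris Classification

Experimental Data

The Iris data set is one of the best-known test data sets in pattern recognition. Fisher, a pioneer of the field, used it in his 1936 paper "The use of multiple measurements in taxonomic problems". The data set contains 3 iris classes with 50 samples each. One class is linearly separable from the other two, while the other two are not linearly separable from each other.

Data Description

The data set has 150 rows, one sample per row. Each sample has 5 fields: sepal length, sepal width, petal length, petal width, and class. The class is one of Iris Setosa, Iris Versicolour, or Iris Virginica, and the first four fields are measured in cm.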
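As a small illustration of the row layout described above, the sketch below parses one comma-separated row into four numeric features and an integer label. The helper `parse_row` and the `LABELS` mapping are hypothetical names; the label strings follow the ones used in the experiment code, and the sample row is only meant to show the format.

```python
# Class names as they appear in the data file, mapped to integer labels.
LABELS = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}


def parse_row(line):
    """Split one comma-separated row into (features, label)."""
    *values, name = line.strip().split(',')
    return [float(v) for v in values], LABELS[name]


sample = "5.1,3.5,1.4,0.2,Iris-setosa"
features, label = parse_row(sample)
print(features, label)
```

The four floats are sepal length, sepal width, petal length, and petal width, in that order.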

Experiment Code

# -*- coding: utf-8 -*-
import io

import matplotlib as mpl  # global plot settings
import matplotlib.pyplot as plt  # plotting
import numpy as np
import pandas as pd
import requests
from sklearn.linear_model import LogisticRegression

# Font settings so that CJK text and minus signs render correctly in plots.
mpl.rcParams['font.sans-serif'] = ['FangSong']
mpl.rcParams['axes.unicode_minus'] = False


def iris_type(s):
    """Map a class name to an integer label."""
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]


url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw = requests.get(url).content.decode('utf-8')
data = pd.read_csv(io.StringIO(raw), sep=',', header=None,
                   names=['a', 'b', 'c', 'd', 'e'],
                   converters={4: iris_type}).values
# print(data)
# print(type(data))

# Split into the four feature columns and the label column.
x, y = np.split(data, (4,), axis=1)
x = x[:, :2]  # keep only sepal length and sepal width
# print(x)

logreg = LogisticRegression()
logreg.fit(x, y.ravel())

# Build a grid over the feature range to draw the decision regions.
N, M = 500, 500
x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, M)

x1, x2 = np.meshgrid(t1, t2)
x_test = np.stack((x1.ravel(), x2.ravel()), axis=1)

y_hat = logreg.predict(x_test)
y_hat = y_hat.reshape(x1.shape)
plt.pcolormesh(x1, x2, y_hat, cmap=plt.cm.prism)
plt.scatter(x[:, 0], x[:, 1], c=y.ravel(), edgecolors='k', cmap=plt.cm.prism)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid()
plt.show()

[Figure: plot produced by the code above, Sepal Length vs. Sepal Width]


# Predict on the same samples that were used for training.
y_hat = logreg.predict(x)
y = y.reshape(-1)

print(y_hat.shape)
print(y.shape)
result = y_hat == y
print(y_hat)
print(y)
print(result)
c = np.count_nonzero(result)
print(c)
print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))
# Output reported in the original post: Accuracy: 76.67%

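Result Analysis

  1. Using only sepal length and sepal width, 115 of the 150 samples are classified correctly, an accuracy of 76.67%.
  2. Using all four features, 144 samples are classified correctly, an accuracy of 96%.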

Case Follow-up

# -*- coding: utf-8 -*-
import io

import matplotlib as mpl  # global plot settings
import matplotlib.pyplot as plt  # plotting
import numpy as np
import pandas as pd
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Font settings so that CJK text and minus signs render correctly in plots.
mpl.rcParams['font.sans-serif'] = ['FangSong']
mpl.rcParams['axes.unicode_minus'] = False


def iris_type(s):
    """Map a class name to an integer label."""
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]


url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw = requests.get(url).content.decode('utf-8')
data = pd.read_csv(io.StringIO(raw), sep=',', header=None,
                   names=['a', 'b', 'c', 'd', 'e'],
                   converters={4: iris_type}).values
# print(data)
# print(type(data))

# Split into the four feature columns and the label column.
x, y = np.split(data, (4,), axis=1)
X = x[:, :4]  # use all four features

x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
x2_min, x2_max = x[:, 1].min(), x[:, 1].max()

plt.scatter(x[:, 0], x[:, 1], c=y.ravel(), edgecolors='k', cmap=plt.cm.prism)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.grid()
plt.show()

# Hold out part of the data so that accuracy is measured on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
logreg = LogisticRegression()
model = logreg.fit(X_train, y_train.ravel())
y_pred = logreg.predict(X_test)
# print(logreg.coef_)

result = y_pred == y_test.ravel()

c = np.count_nonzero(result)

[Figure: scatter plot produced by the code above, Sepal Length vs. Sepal Width]

print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))
# Output reported in the original post: Accuracy: 84.21%

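Analysis:

  1. The earlier experiment tested on the same data it was trained on, so the accuracy it reports is not a reliable estimate.
  2. This experiment splits the data into a training set and a test set; the accuracy on the test set is 84.21%.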
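The idea of a held-out split can be sketched without any library. The helper `holdout_split` below is hypothetical and does not reproduce scikit-learn's shuffling; it assumes a 25% test fraction, which is the documented default of `train_test_split` when no size is given. The count of 32 correct predictions is an inference from the reported percentage, not a number stated in the post.

```python
import math
import random


def holdout_split(n_samples, test_fraction=0.25, seed=1):
    """Shuffle sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(150)
print(len(train_idx), len(test_idx))

# Assumption: 32 correct out of 38 held-out samples matches the reported figure.
print('Accuracy: %.2f%%' % (100 * 32 / 38))
```

With 150 samples this leaves 112 for training and 38 for testing, so each misclassified test sample moves the accuracy by about 2.6 percentage points.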
Thank you for reading. This article is copyrighted by Gavinhome Blog. If you repost it, please credit the source: Gavinhome Blog (http://gavinhome.github.io/2017/08/26/LogisticRegression/)