1. Standardization

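Standardization (also called mean removal and variance scaling)

After standardization, each feature (column) has zero mean and unit variance.

The scale function offers a convenient way to standardize an array-like dataset: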

In [2]:

from sklearn import preprocessing  # import the data preprocessing module
X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
X_scaled = preprocessing.scale(X)
X_scaled

Out[2]:

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [67]:

X_scaled.mean(axis=0)

Out[67]:

array([0., 0., 0.])

In [68]:

X_scaled.std(axis=0)

Out[68]:

array([1., 1., 1.])
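
To make the computation behind scale concrete, here is a minimal sketch that standardizes the same matrix by hand with NumPy. It assumes the population standard deviation (ddof=0), which is what scale uses; the names X_manual and same_as_sklearn are only illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Column-wise mean and population standard deviation (ddof=0, NumPy's default)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize by hand: subtract the mean, then divide by the standard deviation
X_manual = (X - mean) / std

# Compare with sklearn's scale(); the two should agree up to rounding
same_as_sklearn = np.allclose(X_manual, preprocessing.scale(X))
print(same_as_sklearn)
```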

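The same result can be obtained with the Scaler utility class provided by the preprocessing module (named StandardScaler from version 0.15 onward). It stores the mean and variance learned from the training data, so the identical transformation can later be applied to other data: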

In [6]:

scaler = preprocessing.StandardScaler().fit(X)
print(scaler)
print(scaler.mean_)
print(scaler.var_)

StandardScaler(copy=True, with_mean=True, with_std=True)
[1.         0.         0.33333333]
[0.66666667 0.66666667 1.55555556]

In [70]:

scaler.transform(X)

Out[70]:

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
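
The practical benefit of the fitted scaler is that it can be reused on data it has not seen. The sketch below applies it to one made-up sample (X_new is hypothetical, not part of the original notebook) and checks the result against the stored mean_ and scale_ attributes.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)

# Hypothetical unseen sample, used only for illustration
X_new = np.array([[-1., 1., 0.]])

# transform() reuses the training mean and standard deviation
X_new_scaled = scaler.transform(X_new)

# The same computation written out with the fitted attributes
expected = (X_new - scaler.mean_) / scaler.scale_
print(X_new_scaled)
print(np.allclose(X_new_scaled, expected))
```

Note that the transformed sample is not expected to have zero mean or unit variance itself, because the statistics come from the training set.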

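2. Feature scaling

2.1 MinMaxScaler (min-max scaling)

Scales each feature to lie between a given minimum and maximum, usually 0 and 1.

Formula: X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)); X_scaled = X_std * (max - min) + min, where min and max are the bounds of the target range (the feature_range parameter).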

In [71]:

import numpy as np
# Example: scale the data to [0, 1]. Training step: fit_transform()
X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

Out[71]:

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [72]:

# Apply the scaling parameters learned above to the test data
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)  # out: array([[-1.5, 0., 1.66666667]])
# The scaler's fitted attributes can be inspected as follows
print(min_max_scaler.scale_)  # out: array([0.5, 0.5, 0.33...])
print(min_max_scaler.min_)    # out: array([0., 0.5, 0.33...])

[0.5        0.5        0.33333333]
[0.         0.5        0.33333333]
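
As a rough check of the min-max formula, the sketch below computes it by hand for a target range other than [0, 1] and compares the result with MinMaxScaler. The range (-1, 1) and the names lo and hi are chosen only for illustration.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Illustrative target range; any lo < hi would work
lo, hi = -1.0, 1.0

# Step 1: map each column to [0, 1]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_std = (X_train - col_min) / (col_max - col_min)

# Step 2: stretch and shift to [lo, hi]
X_manual = X_std * (hi - lo) + lo

# The same transformation through the feature_range parameter
scaler = preprocessing.MinMaxScaler(feature_range=(lo, hi))
X_sklearn = scaler.fit_transform(X_train)
print(np.allclose(X_manual, X_sklearn))
```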

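2.2 MaxAbsScaler (scaling by maximum absolute value)

It scales the training set to [-1, 1] by dividing each feature by its maximum absolute value. It is intended for data that is already centered at zero, or for sparse data containing a very large number of zeros: because the data is not shifted, zero entries stay zero.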

In [73]:

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()  
X_train_maxabs = max_abs_scaler.fit_transform(X_train)  
X_train_maxabs

Out[73]:

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [74]:

X_test = np.array([[ -3., -1.,  4.]])  
X_test_maxabs = max_abs_scaler.transform(X_test)  
X_test_maxabs

Out[74]:

array([[-1.5, -1. ,  2. ]])

In [75]:

max_abs_scaler.scale_

Out[75]:

array([2., 1., 2.])
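
The scale_ values above are simply the per-column maximum absolute values. A minimal NumPy sketch of the same computation (the names max_abs and X_manual are illustrative):

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Largest absolute value in each column
max_abs = np.abs(X_train).max(axis=0)

# Dividing by it maps every column into [-1, 1] without shifting zeros
X_manual = X_train / max_abs

X_sklearn = preprocessing.MaxAbsScaler().fit_transform(X_train)
print(max_abs)
print(np.allclose(X_manual, X_sklearn))
```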

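3. Normalization

Normalization rescales each sample (row) independently so that it has unit norm (the L2 norm by default). As a result, every value falls within [-1, 1].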

In [76]:

X = [[ 1., -1., 2.],
     [ 2., 0., 0.],
     [ 0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

Out[76]:

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [77]:

normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
normalizer

Out[77]:

Normalizer(copy=True, norm='l2')

In [78]:

normalizer.transform(X)

Out[78]:

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [79]:

normalizer.transform([[-1., 1., 0.]])

Out[79]:

array([[-0.70710678,  0.70710678,  0.        ]])
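
Because normalization works row by row, it can be reproduced by dividing each row by its own norm. The sketch below does this for the L2 norm and also shows the norm='l1' option of normalize, under which the absolute values of each row sum to 1. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) length of each row, kept as a column for broadcasting
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each row by its own length
X_manual = X / row_norms
print(np.allclose(X_manual, preprocessing.normalize(X, norm='l2')))

# With the L1 norm, each row's absolute values sum to 1 instead
X_l1 = preprocessing.normalize(X, norm='l1')
print(np.abs(X_l1).sum(axis=1))
```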

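4. Binarization

Binarization converts numeric features into 0/1 values against a configurable threshold: values greater than the threshold become 1, and all other values become 0.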

In [80]:

X = [[ 1., -1., 2.],
     [ 2., 0., 0.],
     [ 0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
binarizer

Out[80]:

Binarizer(copy=True, threshold=0.0)

In [81]:

binarizer.transform(X)

Out[81]:

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [82]:

binarizer = preprocessing.Binarizer(threshold=1.1) 
binarizer.transform(X)

Out[82]:

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])
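
The thresholding rule is a strict "greater than" comparison, which the following sketch reproduces with a plain NumPy comparison. It reuses the threshold 1.1 from the cell above; X_manual is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Values strictly greater than the threshold become 1.0, the rest 0.0
X_manual = (X > threshold).astype(float)

X_sklearn = preprocessing.Binarizer(threshold=threshold).transform(X)
print(np.array_equal(X_manual, X_sklearn))
```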

5. Label preprocessing

5.1 Label binarization

LabelBinarizer is typically used to build a label indicator matrix from a list of multiclass labels.

In [83]:

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

Out[83]:

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [84]:

lb.classes_

Out[84]:

array([1, 2, 4, 6])

In [85]:

lb.transform([1, 6])

Out[85]:

array([[1, 0, 0, 0],
       [0, 0, 0, 1]])
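
Each row of the indicator matrix has a 1 in the column of its class, with columns ordered as in classes_. As a small sketch, inverse_transform maps such a matrix back to the original labels:

```python
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

# One row per label, one column per class in lb.classes_
indicator = lb.transform([1, 6])

# Map the indicator rows back to the original labels
recovered = lb.inverse_transform(indicator)
print(recovered)
```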

5.2 Label encoding

LabelEncoder maps labels to integers between 0 and n_classes - 1.

In [86]:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])

Out[86]:

LabelEncoder()

In [87]:

le.classes_

Out[87]:

array([1, 2, 6])

In [88]:

le.transform([1, 1, 2, 6])

Out[88]:

array([0, 0, 1, 2])

In [89]:

le.inverse_transform([0, 0, 1, 2])

Out[89]:

array([1, 1, 2, 6])
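
LabelEncoder is not limited to numeric labels. The sketch below encodes string labels, which are numbered in sorted order; the city names are made-up sample data.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Made-up string labels; classes_ holds the sorted unique values
le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(list(le.classes_))

# Each label is replaced by its index in classes_
codes = le.transform(["tokyo", "tokyo", "paris"])
print(codes)

# inverse_transform recovers the original strings
labels = le.inverse_transform(codes)
print(list(labels))
```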