
python data analysis | Python data preprocessing

  • Dataset transformations
  • Combining estimators
  • Feature extraction
  • Preprocessing data

<p id='1'>1 Dataset transformations</p>


scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

scikit-learn provides a set of data-transformation modules covering data cleaning, dimensionality reduction, feature expansion, and feature extraction.

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

scikit-learn transformers share a small set of common methods: fit(X, y=None), transform(X), fit_transform(X), and inverse_transform(X_new). fit learns the model parameters from the training data; transform applies the learned transformation to data; fit_transform fits the model and returns the transformed X in one step; inverse_transform maps transformed data back to the original representation.
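A minimal sketch of these four methods, using StandardScaler as the transformer (the small matrices below are made up purely for illustration):

```python
# Sketch: fit learns the statistics, transform applies them,
# fit_transform does both, inverse_transform undoes the scaling.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.], [2., 0.], [0., 1.]])
X_new = np.array([[3., 2.]])

scaler = StandardScaler()
scaler.fit(X_train)                                     # learn mean and std from the training set
X_train_scaled = scaler.transform(X_train)              # apply them to the training data
X_new_scaled = scaler.transform(X_new)                  # ... and to unseen data
X_restored = scaler.inverse_transform(X_train_scaled)   # recover the original values
# scaler.fit_transform(X_train) is equivalent to fit(X_train) followed by transform(X_train)
print(X_train_scaled)
print(X_restored)
```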

<p id='1.1'>1.1 Combining estimators</p>

  • <p id='1.1.1'>1.1.1 Pipeline:chaining estimators</p>

The Pipeline class chains a sequence of estimators into a single one. This is convenient when there is a fixed sequence of steps, e.g. feature selection, then standardization, then classification.

  • Usage
    Code:
from sklearn.pipeline import Pipeline  
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC 
from sklearn.decomposition import PCA
# define the estimators
# the argument is a list of (key, value) pairs, where the key is the name you give the step
# and the value is an estimator object
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
# combine the estimators
clf1 = Pipeline(estimators)
clf2 = make_pipeline(PCA(), SVC())  # make_pipeline() builds the same kind of pipeline
print(clf1, '\n', clf2)

Output:

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm',           SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]) 
 Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

The parameters of the individual steps can be set with set_params(); the parameter names have the form <estimator>__<parameter> (note the two underscores).

clf1.set_params(svm__C=10)

This naming scheme is especially important when doing grid searches:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer scikit-learn versions
params = dict(reduce_dim__n_components=[2, 5, 10], svm__C=[0.1, 10, 100])
grid_search = GridSearchCV(clf1, param_grid=params)

In other words, the pipeline built above is treated like any ordinary estimator, and its parameters are addressed in the same <estimator>__<parameter> form.
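As a rough sketch of running this grid search end to end, assuming the digits dataset purely as convenient sample data (it has enough features for the n_components values in the grid):

```python
# Sketch only: fit the grid search defined above on some sample data
from sklearn.datasets import load_digits

digits = load_digits()
grid_search.fit(digits.data, digits.target)   # fits one pipeline per parameter combination (with CV)
print(grid_search.best_params_)               # the best (n_components, C) combination found
```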

  • Note
    1. dir() can be used to list all the attributes and methods of clf1; for example, the steps attribute holds one (name, estimator) pair per step:
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

2. Calling fit on the pipeline is equivalent to calling fit on each contained estimator in turn, transforming the input and passing the result on to the next step. The pipeline exposes the methods of its last estimator: if the last step is a classifier, the pipeline can be used as a classifier; if it is a transformer, the pipeline is a transformer; and so on, as the sketch below shows.
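A small sketch of this behaviour, reusing clf1 from above; the iris dataset is only an illustrative choice:

```python
# Sketch: the last step of clf1 is an SVC, so the pipeline behaves like a classifier
from sklearn.datasets import load_iris

iris = load_iris()
clf1.fit(iris.data, iris.target)            # PCA fit/transform, then SVC fit
print(clf1.predict(iris.data[:3]))          # predict() is delegated to the final SVC step
print(clf1.score(iris.data, iris.target))   # so is score()
print(clf1.named_steps['reduce_dim'])       # individual steps remain accessible by name
```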

  • <p id='1.1.2'> 1.1.2 FeatureUnion: composite feature spaces</p>

Unlike Pipeline, FeatureUnion combines only transformers (their outputs are concatenated); Pipeline and FeatureUnion can also be nested to build more complex models.

  • Usage
    Code:
from sklearn.pipeline import FeatureUnion   
from sklearn.pipeline import make_union
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
# define the transformers
# the argument is a list of (key, value) pairs, where the key is the name you give the step
# and the value is a transformer object
estimators = [('linear_pca', PCA()), ('Kernel_pca', KernelPCA())]
# combine the transformers
clf1 = FeatureUnion(estimators)
clf2 = make_union(PCA(), KernelPCA())
print(clf1, '\n', clf2)
print(dir(clf1))

Output:

FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, n_components=None, whiten=False)), ('Kernel_pca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None,
     remove_zero_eig=False, tol=0))],
       transformer_weights=None) 
 FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None,
     remove_zero_eig=False, tol=0))],
       transformer_weights=None)

As you can see, FeatureUnion is used in exactly the same way as Pipeline.
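Because a FeatureUnion is itself a transformer, it can also be nested inside a Pipeline to build the more complex models mentioned above. A rough sketch, where the iris data and the n_components values are illustrative assumptions:

```python
# Sketch: concatenate two PCA variants, then feed the stacked features to a classifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                         ('kernel_pca', KernelPCA(n_components=2))])
print(combined.fit_transform(iris.data).shape)   # (150, 4): the 2 + 2 features are concatenated
model = Pipeline([('features', combined), ('svm', SVC())])
model.fit(iris.data, iris.target)
```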

  • Note
Here is an example Python source file: [feature_stacker.py](http://scikit-learn.org/stable/_downloads/feature_stacker.py)

<p id='1.2'>1.2 Feature extraction</p>

The sklearn.feature_extraction module extracts features into a format supported by machine learning algorithms, e.g. turning text or image data into numeric datasets.
Note:
Feature extraction is different from feature selection: the former converts non-numeric data into numeric features, while the latter applies learning methods to features that already exist (e.g. dimensionality reduction with PCA).

  • <p id='1.2.1'>1.2.1 Loading features from dicts</p>

Code:

from sklearn.feature_extraction import DictVectorizer
measurements = [{'city': 'Dubai', 'temperature': 33.},
                {'city': 'London', 'temperature': 12.},
                {'city': 'San Fransisco', 'temperature': 18.}]
vec = DictVectorizer()
x = vec.fit_transform(measurements).toarray()
print(x)
print(vec.get_feature_names())
Output:

[[ 1. 0. 0. 33.]
[ 0. 1. 0. 12.]
[ 0. 0. 1. 18.]]
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
[Finished in 0.8s]
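Continuing with the vec fitted above, the same vectorizer can be reused on new dicts; feature values it did not see during fit are silently ignored. The 'Tokyo' entry below is a made-up example:

```python
# Sketch: reuse the fitted DictVectorizer on new measurements
new_measurements = [{'city': 'Dubai', 'temperature': 20.},
                    {'city': 'Tokyo', 'temperature': 15.}]   # 'Tokyo' was not seen during fit
print(vec.transform(new_measurements).toarray())
# [[ 1.  0.  0. 20.]
#  [ 0.  0.  0. 15.]]
```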

  • <p id='1.2.2'>1.2.2 Feature hashing</p>
  • <p id='1.2.3'>1.2.3 Text feature extraction</p>
  • <p id='1.2.4'>1.2.4 Image feature extraction</p>
These three subsections are not covered here, since they involve text and image processing; see the [official documentation](http://scikit-learn.org/stable/data_transforms.html).

<p id='1.3'>1.3 Preprocessing data</p>
>The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

The sklearn.preprocessing module provides several common data transformations, such as standardization and normalization.
  • <p id='1.3.1'>1.3.1 Standardization, or mean removal and variance scaling</p>
>**Standardization** of datasets is a **common requirement for many machine learning estimators** implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with **zero mean and unit variance**.

Many learning algorithms require the data to be standardized beforehand; they may behave badly if the individual features do not look roughly like standard normally distributed data (zero mean and unit variance).

  • Usage

    Code:
from sklearn import preprocessing
import numpy as np
X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
Y = X
Y_scaled = preprocessing.scale(Y)
y_mean = Y_scaled.mean(axis=0)  # axis=0: the mean of each feature; axis=1 would give the mean of each sample
y_std = Y_scaled.std(axis=0)
print(Y_scaled)
scaler = preprocessing.StandardScaler().fit(Y)  # the StandardScaler class does the same job
print(scaler.transform(Y))

Output:

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[Finished in 1.4s]
  • Note
    1. preprocessing.scale() is the function form: it standardizes an array directly.
    2. StandardScaler is the class form: it stores the mean and standard deviation learned from the training set so the same transformation can later be applied to new data.
    3. StandardScaler is a transformer, so it can be used inside a Pipeline.
    MinMaxScaler (min-max scaling to [0, 1]) and MaxAbsScaler (scaling to [-1, 1]) are two other scalers; they are used in the same way as StandardScaler.
    4. For sparse data, MaxAbsScaler is the natural choice, because it scales without centering and therefore preserves sparsity.
    5. A robust standardization method (for data with many outliers):

>the median and the interquartile range often give better results

Use the median instead of the mean for centering, and the interquartile range (Q3 - Q1) instead of the standard deviation for scaling, so that the IQR becomes 1; this is what RobustScaler does. A sketch of these scalers follows below.
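A brief sketch of the scalers mentioned in this note, applied to the same toy matrix used above:

```python
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

min_max = preprocessing.MinMaxScaler()    # scale each feature to [0, 1]
print(min_max.fit_transform(X))

max_abs = preprocessing.MaxAbsScaler()    # scale each feature to [-1, 1] by its maximum absolute value
print(max_abs.fit_transform(X))

robust = preprocessing.RobustScaler()     # center on the median, scale by the interquartile range
print(robust.fit_transform(X))
```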

  • <p id='1.3.2'>1.3.2 Imputation of missing values</p>

  • Usage
    Code:
import scipy.sparse as sp
from sklearn.preprocessing import Imputer
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
imp = Imputer(missing_values=0, strategy='mean', axis=0)  # treat 0 as missing, impute with the column mean
imp.fit(X)
X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
print(X_test)
print(imp.transform(X_test))

Output:

  (1, 0)    6
  (2, 0)    7
  (0, 1)    2
  (2, 1)    6
[[ 4.          2.        ]
 [ 6.          3.66666675]
 [ 7.          6.        ]]
[Finished in 0.6s]
  • Note
    1. scipy.sparse is used to store sparse matrices.
    2. Imputer can also handle scipy.sparse sparse matrices; a sketch of the more common dense (NaN) case follows below.
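For completeness, a small sketch of the dense case, where missing entries are encoded as NaN (the numbers are made up; note that newer scikit-learn versions replace Imputer with sklearn.impute.SimpleImputer):

```python
import numpy as np
from sklearn.preprocessing import Imputer

# fit on data whose missing entries are NaN, impute with the per-column mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
print(imp.transform([[np.nan, 2], [6, np.nan], [7, 6]]))
# the column means learned during fit are 4 and 11/3, so the NaNs become 4.0 and 3.666...
```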

  • <p id='1.3.3'>1.3.3 Generating polynomial features</p>

  • Usage
    Code:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X=np.arange(6).reshape(3,2)
print(X)
poly=PolynomialFeatures(2)
print(poly.fit_transform(X))

Output:

[[0 1]
 [2 3]
 [4 5]]
[[  1.   0.   1.   0.   0.   1.]
 [  1.   2.   3.   4.   6.   9.]
 [  1.   4.   5.  16.  20.  25.]]
[Finished in 0.8s]
  • Note
    Polynomial features are used in polynomial regression and in polynomial kernel methods. With degree 2 and input features (x1, x2), the generated features are (1, x1, x2, x1^2, x1*x2, x2^2), which is exactly the output above; a small polynomial-regression sketch follows below.
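A short sketch of the polynomial-regression use mentioned above: PolynomialFeatures chained with a linear model on a made-up quadratic target:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.arange(10).reshape(-1, 1)
y = x.ravel() ** 2                     # a purely quadratic target
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[12]]))           # close to 144, since degree=2 captures the target exactly
```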

  • <p id='1.3.4'>1.3.4 Custom transformers</p>

FunctionTransformer builds a transformer whose transform method is an arbitrary function that you supply (np.log1p in the example below).

  • Usage:
    Code:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
x=np.array([[0,1],[2,3]])
print(transformer.transform(x))

Output:

[[ 0.          0.69314718]
 [ 1.09861229  1.38629436]]
[Finished in 0.8s]
  • Note
    FunctionTransformer produces an ordinary transformer, so it can also be used as a step in a Pipeline; a sketch with a user-defined function follows below.
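A small sketch with a user-defined function instead of np.log1p; the column-dropping function here is just a made-up example:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def drop_first_column(X):
    # keep every column except the first one
    return X[:, 1:]

transformer = FunctionTransformer(drop_first_column)
x = np.array([[0, 1], [2, 3]])
print(transformer.transform(x))   # keeps only the second column: [[1], [3]]
```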