# sklearn

## 1 Overview

• supervised learning
• classification # Identifying to which set of categories a new observation belong to.
• regression # Predicting a continuous value for a new example.
• unsupervised learning
• clustering # Automatic grouping of similar objects into sets.
• dimensionality reduction # Reducing the number of random variables to consider.
• model selection and evaluation # Comparing, validating and choosing parameters and models.
• dataset transformations # Feature extraction and normalization.
• scale out and performance

## 2 Building Blocks

http://scipy-lectures.github.io/ sklearn底层使用的三驾马车numpy, scipy, matplotlib.

### 2.1 NumPy

numpy (import numpy as np) 用于矩阵运算

numpy provides:

• extension package to Python for multi-dimensional arrays
• closer to hardware (efficiency)
• designed for scientific computation (convenience)
• also known as array oriented computing

numpy array attributes:

• ndim # 维度
• shape # 每个维度大小
• dtype # 存储类型
• T # 转置矩阵
• size # 元素个数
• itemsize # 每个元素占用内存大小
• nbytes # 占用内存大小

numpy index array:

• a[d1, d2, …] # 多维访问
• a[<array>, …] # fancy indexing

### 2.2 Matplotlib

matplotlib (import matplotlib.pyplot as plt) 用于绘图. A pyplot tutorial

### 2.3 SciPy

scipy (import scipy) 复杂的科学计算

The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab's toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.

### 2.4 Pandas

• Series 可以认为是一个带有index key的ndarray.
• DataFrame a dict of Series objects index, columns
• loc/at on column, iloc/iat on index
Operation Syntax Result
Select column df[col] Series
Select row by label df.loc[label] Series
Select row by integer location df.iloc[loc] Series
Slice rows df[5:10] DataFrame
Select rows by boolean vector df[bool_vec] DataFrame

## 3 Supervised Learning

### 3.1 Generalized Linear Models

LinearRegression求解Ordinary Least Squares(|y-Y|^2). 假设X矩阵是(n, p), 时间复杂度是O(n * p^2). 对于输入数据要求是对于Xi, Xj来说两者不能线性相关，否则噪声会让训练结果有high variance.

Lasso增加了L1-norm, Rigde增加了L2-norm, ElasticNet综合两者有两个norm作为regularizer.

LogisticRegression用来做分类，里面可选l1或者是l2 norm. 根据l1还是l2底层使用不同的算法. liblinear不是真正的multiclass分类器，所以对multiclass是通过binary class + one-vs-rest完成的

Case Solver
Small dataset or L1 penalty "liblinear"
Multinomial loss or large dataset "lbfgs", "sag" or "newton-cg"
Very Large dataset "sag"

The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. E.g., with loss="log", SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector machine (SVM).

### 3.2 Kernel Ridge Regression

Ridge Regression with kernel tricks.

### 3.3 Support Vector Machines

svm可以用来做classification, regression以及outliers detection(异常检测). 一些优点和缺点如下：

The advantages of support vector machines are:

• Effective in high dimensional spaces.
• Still effective in cases where number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
• Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples, the method is likely to give poor performances.
• SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

classification有三种分类器分别是SVC, NuSVC, LinearSVC. 其中LinearSVC相同于我SVC使用'linear'核方法，区别在于SVC底层使用libsvm, 而LinearSVC则使用liblinear. 另外LinearSVC得到的结果最后也不会返回support_(支持向量).

kernel函数支持 1.linear 2. polynomial 3. rbf 4. sigmoid(tanh). 对于unbalanced的问题，sklearn实现允许指定 1.class_weight 2.sample_weight. 其中class_weight表示每个class对应的权重，这个在构造classifier时候就需要设置。如果不确定的话就设置成为'auto'。sample_weight则表示每个实例对应的权重，这个可以在调用训练方法fit的时候传入。另外一个比较重要的参数是C(惩罚代价), 通常来说设置成为1.0就够了。但是如果数据中太多噪音的话那么最好减小一些。

#!/usr/bin/env python
#coding:utf-8

from sklearn import datasets

from sklearn import svm
from sklearn import cross_validation
from sklearn.metrics import classification_report

clf = svm.SVC(gamma = 0.001, C = 1.0)
# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)
clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))


Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

• Efficiency.
• Ease of implementation (lots of opportunities for code tuning).

• SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
• SGD is sensitive to feature scaling. # 所以输入之前要做好预处理

The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p) training has a cost of O(k n \bar p), where k is the number of iterations (epochs) and \bar p is the average number of non-zero attributes per sample.

Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase as the training set size increases.

### 3.5 Nearest Neighbors

NN可以同时用来做监督和非监督学习。其中非监督学习的NN是其他一些学习方法的基础。

#!/usr/bin/env python
#coding:utf-8

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation
from sklearn.metrics import classification_report

# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)

clf = KNeighborsClassifier(n_neighbors = 10)
clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))


### 3.6 Naive Bayes

1. 如果Xi是连续量的话，Gaussian Naive Bayes. 取y=k的所有Xi数据点，假设这个分布服从高斯分布。计算出这个高斯分布的mean和std之后，就可以计算P(X|y=k)。这个模型系数有d * k个。
2. 如果Xi是离散量的话，Multinomial Naive Bayes. 那么P(X=u|y=k) = P(X=u, y=k) / P(y=k). 这个模型系数有k * ∑ {Xi}个。模型里面还有一个平滑参数。
3. 进一步如果Xi是(0,1)的话，Bernoulli Naive Bayes. 通常我们需要提供参数binarize，这个方法用来将X转换成为(0,1).
#!/usr/bin/env python
#coding:utf-8

from sklearn import datasets

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import cross_validation
from sklearn.metrics import classification_report

(data, target) = (iris.data, iris.target)
clf = GaussianNB()
# (data, target) = (digits.data, digits.target)
# clf = MultinomialNB()
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)

clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))


### 3.7 Decision Trees

Some advantages of decision trees are:

• Simple to understand and to interpret. Trees can be visualised.
• Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
• Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
• Able to handle multi-output problems.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
• Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

• Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Tips on practical use:

• Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
• Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
• Use min_samples_split or min_samples_leaf to control the number of samples at a leaf node. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters. The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrary small leaves, though min_samples_split is more common in the literature.
• All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.
• If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.

### 3.8 Ensemble methods

emsemble方法通常分为两类：

• averaging methods. 平均方法，使用不同的算法构建出几个不同的假设然后取平均效果。算法得到的假设都比较好但是容易overfitting, 通过取平均效果降低variance. 通常算法只是作用在部分数据上。这类方法有Bagging, Random Forest等。sklearn提供了bagging meta-estimator允许传入base-estimator来自动做averaging. RF还提供了两个不同版本，另外一个版本在生成决策树选择threshold上也做了随机。

#!/usr/bin/env python
#coding:utf-8

from sklearn import datasets

from sklearn import cross_validation
from sklearn.metrics import classification_report

# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)

print '----------RandomForest----------'
clf = RandomForestClassifier(n_estimators = 100, bootstrap = True, oob_score = True)
clf.fit(X_tr, y_tr)
print 'OOB Score = %.4f' % clf.oob_score_
print 'Feature Importance = %s' % clf.feature_importances_
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))

clf = GradientBoostingClassifier(n_estimators = 100, learning_rate = 0.6, random_state = 0)
clf.fit(X_tr, y_tr)
print 'Feature Importance = %s' % clf.feature_importances_
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))


## 4 Model selection and evaluation

### 4.1 Cross-validation: evaluating estimator performance

• 使用train_test_split分开training_set和test_set.
• 使用k-fold等方式从training_set中分出validation_set做cross_validation.
• 使用cross_val_score来进行cross_validation并且计算cross_validation效果.
#!/usr/bin/env python
#coding:utf-8

import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

# iris.data.shape = (150, 4); n_samples = 150, n_features = 4

# 分出40%作为测试数据集合. random_state作为随机种子
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size = 0.4, random_state = 0)

# 假设这里我们已经完成参数空间搜索
clf = svm.SVC(gamma = 0.001, C = 100., kernel = 'linear')
# 使用cross_validation查看参数效果
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv = 3)
print("Accuracy on cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# 如果效果不错的话，就是可以使用这个模型计算测试数据
clf.fit(X_train, y_train)
print(np.mean(clf.predict(X_test) == y_test))


### 4.2 Grid Search: searching for estimator parameters

#!/usr/bin/env python
#coding:utf-8

from __future__ import print_function

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# To apply an classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
(n_samples, h, w) = digits.images.shape
# 这里也可以直接用digits.data和digits.target. digits.data已经是reshape之后结果.
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Set the parameters by cross-validation
# 提供参数的可选列表
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# 链接中给的代码还对cross_validation效果评价方式(scoring)进行了搜索
clf = GridSearchCV(SVC(), tuned_parameters, cv=5) # 使用k-fold划分出validation_set. k = 5
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print(clf.best_estimator_)
print("Grid scores on development set:")
for params, mean_score, scores in clf.grid_scores_:
print("%0.3f (+/-%0.03f) for %r"
% (mean_score, scores.std() / 2, params))
print("Detailed classification report:")
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))


Best parameters set found on development set:
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.001,
kernel=rbf, max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
Grid scores on development set:
0.986 (+/-0.001) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}
0.963 (+/-0.004) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}
0.985 (+/-0.003) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}
0.983 (+/-0.003) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}
0.983 (+/-0.003) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 1}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 10}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 100}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 1000}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
precision    recall  f1-score   support

0       1.00      1.00      1.00        60
1       0.95      1.00      0.97        73
2       1.00      0.97      0.99        71
3       1.00      1.00      1.00        70
4       1.00      1.00      1.00        63
5       0.99      0.97      0.98        89
6       0.99      1.00      0.99        76
7       0.98      1.00      0.99        65
8       1.00      0.96      0.98        78
9       0.97      0.99      0.98        74

avg / total       0.99      0.99      0.99       719


### 4.3 Model evaluation: quantifying the quality of predictions

There are 3 different approaches to evaluate the quality of predictions of a model: # 有3中不同方式来评价模型预测结果

1. Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. # 模型自身内部的评价比如损失函数等
2. Scoring parameter: Model-evaluation tools using cross-validation (such as cross_validation.cross_val_score and grid_search.GridSearchCV) rely on an internal scoring strategy. # cv的评价，通常是数值表示. 比如'f1'.
3. Metric functions: The metrics module implements functions assessing prediction errors for specific purposes. # 作用在测试数据的评价，可以是数值表示，也可以是文本图像等表示. 比如'classification_report'.

sklearn还提供了DummyEstimator. 它只有有限的几种比较dummy的策略，主要是用来给出baseline.

DummyClassifier implements three such simple strategies for classification:

• 'stratified' generates randomly predictions by respecting the training set's class distribution,
• 'most_frequent' always predicts the most frequent label in the training set,
• 'uniform' generates predictions uniformly at random.
• 'constant' always predicts a constant label that is provided by the user.

DummyRegressor also implements three simple rules of thumb for regression:

• 'mean' always predicts the mean of the training targets.
• 'median' always predicts the median of the training targests.
• 'constant' always predicts a constant value that is provided by the user.

### 4.4 Validation curves: plotting scores to evaluate models

Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data. # bias是指模型对不同训练数据的偏差，variance则是指模型对不同训练数据的敏感程度，噪音则是数据自身属性。这三个问题造成预测偏差。

#note: 这个特性应该是从0.15才有的。之前我用apt-get安装的sklearn-0.14.1没有learning_curve这个模块。

validation curve

If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible. All three cases can be found in the plot below where we vary the parameter gamma on the digits dataset.

learning curve