Scikit Learn

Data Preparation

Splitting the Dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_data,             # a DataFrame in gives DataFrames out, so .to_numpy() can be omitted
    df_target,           # a 1-D array or Series is supported
    test_size=0.25, random_state=1
)
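To make the DataFrame-in, DataFrame-out behaviour concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 rows, two features and one target column.
df = pd.DataFrame({
    "x1": range(8),
    "x2": [v * 10 for v in range(8)],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})
df_data, df_target = df[["x1", "x2"]], df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    df_data, df_target, test_size=0.25, random_state=1
)

# pandas inputs come back as pandas objects, with the original row index kept.
print(type(X_train).__name__, X_train.shape, X_test.shape)
```

Because the original index is preserved, rows in the train and test parts can still be traced back to the source table.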

Data Preprocessing

Numerical Encoding

Feature Encoding

One-Hot Encoder

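OneHotEncoder learns the set of category values from the training data and can then encode new data with it (fit, then transform); pd.get_dummies can only encode the data it is given (the equivalent of fit_transform).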

import numpy as np
from sklearn.preprocessing import OneHotEncoder
# sparse_output was named `sparse` before scikit-learn 1.2
enc = OneHotEncoder(categories='auto', drop=None, sparse_output=True,
                    dtype=np.float64, handle_unknown='error')

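categories: specifies the categories of each discrete feature. The default, 'auto', lets the encoder infer each feature's categories from the data during fit(). Alternatively, pass the values for each column explicitly as a list of lists (features may have different numbers of categories, so a matrix is not a suitable representation). ==The values in one column should not mix strings and numbers; numeric values should be sorted==.

drop: by default (None) all encoded columns are kept. 'first' drops the first category of every feature to remove redundancy; 'if_binary' drops the first category only for features with exactly two categories.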

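sparse (renamed sparse_output in scikit-learn 1.2): if True, transform returns a SciPy sparse matrix instead of a dense array; the input data does not need to be sparse.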

enc.categories_: the category values of each feature found during fitting (enc.categories is the value passed at construction).

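enc.get_feature_names_out(col_names) (get_feature_names before scikit-learn 1.0): builds the encoded column names from the original column names (col_names is a sequence); ==fit() must be called first==. fit() returns the encoder itself, while transform() / fit_transform() return the encoded numeric matrix, which can be combined with the generated column names to build an encoded table (pd.DataFrame).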

Label Encoding

`LabelEncoder`

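fit: collects the label classes;
transform: encodes labels as integers according to the known classes. fit_transform: learns the classes and encodes the same data in one step.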
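A small sketch of the fit / transform cycle; the city names are arbitrary example labels.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo", "amsterdam", "tokyo"])   # learn the distinct classes

classes = list(le.classes_)                  # classes are stored in sorted order
codes = le.transform(["tokyo", "paris"])     # map labels to their class index
labels = le.inverse_transform([0, 2])        # map indices back to labels
print(classes, codes, labels)
```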

`LabelBinarizer`

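fit(y) / transform(y) / fit_transform(y): collect the label classes from y (a sequence); the classes are sorted numerically or lexicographically.

fit determines the number of fitted classes, n_classes; transform infers the target type from its input: y_type = 'multiclass', 'binary', ...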

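Ordinary case: the fitted data contains two classes (==in ascending order== they map to neg_label and pos_label), and the data to transform contains one or both of the same classes; the data is then encoded according to that two-value mapping. Because the negative and positive labels are always assigned by the sort order of the fitted data, the result may disagree with the positive/negative convention you intended; in that case invert the encoding: y = 1 - y.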
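A minimal sketch of the two-class case and of the inversion. The labels "yes" and "no" are placeholders; with the default neg_label=0 and pos_label=1, the class that sorts last receives 1.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y = lb.fit_transform(["yes", "no", "no", "yes"])   # classes_ is sorted: "no" < "yes"

# "yes" sorts last, so it is the positive class. If "no" should be positive
# instead, flip the 0/1 encoding.
y_flipped = 1 - y
print(lb.classes_, y.ravel(), y_flipped.ravel())
```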

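Orthogonal label binarization ('multiclass'): if the data y passed to fit or transform contains at least two classes and the total number of classes exceeds two, each class in the fitted y gets its own binary column, and the encodings of different classes are mutually orthogonal.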

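from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit(['a', 'b', 'c'])
lb.transform(['c','b','a','d'])
# input l_a l_b l_c    # fit -> a b c
#   c    0   0   1     # the column of the matching class is 1, the rest are 0
#   b    0   1   0
#   a    1   0   0
#   d    0   0   0     # belongs to none of the fitted classes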

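If the fitted data contains only one class (which normally should not happen) and the data to transform contains values other than the fitted class, the fitted class is treated as the positive class; if the data to transform also contains only the fitted class, it is treated as the negative class (the column is then constant and should be dropped).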

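lb = LabelBinarizer()
lb.fit(['b', 'a'])  # records the fitted labels ['a','b']
lb.classes_         # sorted by value, independent of order of appearance
y = lb.transform(['a', 'b', 'b','a']) # -> [[0], [1], [1], [0]] (a single column)
y = lb.transform(['a', 'b', 'c','a']) # -> [[1,0], [0,1], [0,0], [1,0]]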

How the binary label encoding is computed:

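import numpy as np
import scipy.sparse as sp
from sklearn.utils import column_or_1d

# Example inputs for illustration; in scikit-learn these are arguments of label_binarize.
classes = np.array(['a', 'b', 'c'])
y = column_or_1d(['c', 'b', 'a', 'd'])
pos_label = 1
sorted_class = np.sort(classes)
n_samples, n_classes = len(y), len(classes)

# pick out the known labels from y
y_in_classes = np.isin(y, classes)   # mask: is each element of y a known class? (np.in1d in older NumPy)
y_seen = y[y_in_classes]
indices = np.searchsorted(sorted_class, y_seen)   # insertion position in the sorted classes,
                                                  # which here equals the class index of each label
indptr = np.hstack((0, np.cumsum(y_in_classes)))  # each known label contributes one entry to its row
data = np.empty_like(indices)
data.fill(pos_label)
Y = sp.csr_matrix((data, indices, indptr),      # each row has at most one nonzero entry (one class per
                  shape=(n_samples, n_classes)) # sample), which suits the CSR format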

`MultiLabelBinarizer`

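Unlike LabelBinarizer with more than two classes, ==each sample can carry several labels and the label sets of different samples can overlap, so the binary encodings of different samples are not necessarily orthogonal==. If every sample has exactly one label, the result is equivalent to LabelBinarizer.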

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
# array([[0, 1, 1],
#       [1, 0, 0]])
mlb.classes_
# array(['comedy', 'sci-fi', 'thriller'], dtype=object)

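The labels of each sample should be given as a list, tuple, or set.

The example above has three labels in total, and the recorded classes are sorted lexicographically. The first sample is therefore encoded as 011 and the second as 100.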
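The point about overlapping label sets can be checked with a small sketch; the labels "a", "b", "c" are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Two samples that share the label "b".
Y = mlb.fit_transform([("a", "b"), ("b", "c")])

# A nonzero dot product means the two encoded rows are not orthogonal.
overlap = int(np.dot(Y[0], Y[1]))
# inverse_transform recovers the label tuples from the indicator matrix.
recovered = mlb.inverse_transform(Y)
print(Y, overlap, recovered)
```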

Feature Generation

Polynomial Feature Generation

from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
X_train = pf.fit_transform(X_train)  # all polynomial combinations of the features up to degree 2
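As a worked instance, a single sample with two features (2, 3) expands to the terms 1, a, b, a², ab, b². The sketch below uses made-up numbers and the default settings, which include the constant bias column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])            # one sample, features a=2 and b=3
pf = PolynomialFeatures(degree=2)     # default include_bias=True adds the constant term
X_poly = pf.fit_transform(X)

# Terms in order: 1, a, b, a^2, a*b, b^2
print(X_poly)
```

Passing include_bias=False drops the leading column of ones, which is often preferable when the downstream model fits its own intercept.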

Feature Scaling

StandardScaler

MinMaxScaler

Normalizer(norm='l2', copy=True)
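The three scalers listed above differ in what they rescale: StandardScaler and MinMaxScaler work per column (feature), while Normalizer works per row (sample). A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

X_std = StandardScaler().fit_transform(X)   # each column: zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)      # each column: rescaled to [0, 1]
# Normalizer scales each row to unit L2 norm; (3, 4) has norm 5.
X_unit = Normalizer(norm="l2").fit_transform(np.array([[3.0, 4.0]]))

print(X_std, X_mm, X_unit, sep="\n")
```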

Machine Learning Pipeline

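from sklearn.pipeline import Pipeline
steps = [
  ('proc_name', estimator),   # (name, transformer or estimator); the last step is usually the model
  ...
]
workflow = Pipeline(steps)
workflow.fit(x, y)                  # fit/transform each step in turn, then fit the final estimator
pred = workflow.predict(x)          # transform through all steps, then predict
workflow.score(x, y)
workflow['proc_name'].func_name()   # access a step by name and call its methods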

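Pipeline parameters can be changed with Pipeline.set_params (addressing a step's parameter as step_name__param_name), or by modifying each step's object directly.

To stay compatible with Pipeline, the fit method of each step must accept three parameters (self, x, y=None); for preprocessing steps the third parameter has no effect.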
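The two points above can be sketched together. ClipOutliers is a hypothetical transformer written for this example (it is not part of scikit-learn), and the data is synthetic; the sketch only shows the fit(self, X, y=None) signature and the step__param naming used by set_params.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical preprocessing step: clip every feature to [-limit, limit]."""

    def __init__(self, limit=3.0):
        self.limit = limit

    def fit(self, X, y=None):      # y is accepted for Pipeline compatibility but unused
        return self

    def transform(self, X):
        return np.clip(X, -self.limit, self.limit)


# Synthetic, roughly linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

workflow = Pipeline([
    ("scale", StandardScaler()),
    ("clip", ClipOutliers()),
    ("clf", LogisticRegression()),
])
workflow.set_params(clip__limit=2.0, clf__C=10.0)   # <step name>__<parameter name>
workflow.fit(X, y)
accuracy = workflow.score(X, y)
print(accuracy)
```

Inheriting from BaseEstimator is what provides get_params / set_params, and TransformerMixin supplies fit_transform.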

Machine Learning Models

Model Parameters

estimator.get_params()
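A brief sketch of what get_params returns, for a single estimator and for a pipeline; the choice of LogisticRegression and the step names are arbitrary.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = LogisticRegression(C=2.0)
params = clf.get_params()          # dict of the constructor arguments

pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
nested = pipe.get_params()         # nested parameters appear as '<step>__<param>'
print(params["C"], nested["clf__C"])
```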

from sklearn.compose import ColumnTransformer
column_trans = ColumnTransformer(
  [('proc_name', proc, idx), ...]   # (name, transformer, columns)
)

proc: either a concrete transformer object, or one of the strings 'passthrough' and 'drop'.
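A minimal sketch of a ColumnTransformer that mixes real transformers with 'passthrough' and 'drop'. The table and step names are invented; sparse_threshold=0 is set here only to force a dense result that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical table with a numeric, a categorical, a flag and an id column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "city": ["x", "y", "x"],
    "flag": [0, 1, 0],
    "id": [1, 2, 3],
})

column_trans = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),    # scale the numeric column
        ("cat", OneHotEncoder(), ["city"]),    # one-hot encode the categorical column
        ("keep", "passthrough", ["flag"]),     # copy through unchanged
        ("skip", "drop", ["id"]),              # leave out of the result
    ],
    sparse_threshold=0,                        # always return a dense array
)
out = column_trans.fit_transform(df)
print(out)   # columns: scaled age, city_x, city_y, flag
```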

On the y argument of a transformer's fit, the scikit-learn docstrings say: "Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline."

Performance Evaluation

Cross-Validation

from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
X_train, X_test = [None]*5, [None]*5
y_train, y_test = [None]*5, [None]*5
# each iteration yields a different train/test split (X and y are NumPy arrays)
for i, (train_index, test_index) in enumerate(kf.split(X)):
  X_train[i], X_test[i] = X[train_index], X[test_index]
  y_train[i], y_test[i] = y[train_index], y[test_index]
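When the folds are only needed to score a model, the loop can be delegated to cross_val_score. The sketch below assumes the bundled iris dataset and a logistic regression purely as stand-ins; shuffle=True is used because the iris rows are ordered by class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Shuffle before splitting so that every fold contains all classes.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One accuracy value per fold; report the mean and spread.
print(scores.mean(), scores.std())
```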

Hyperparameter Search

Grid Search with Cross-Validation

from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, cv=None)
cv.fit(data, label)

Defining the parameter grid: one or more grids of parameter combinations can be given.

# a dict represents a single parameter grid
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
# a list of dicts represents several parameter grids
param_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100]},
              {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

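The cross-validation parameter cv:

None: use the default 5-fold cross-validation;
N (an integer): N-fold cross-validation;
CV splitter: a splitter object such as KFold;
an iterable that yields (train, test) splits as index arrays.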

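The fitted search object exposes:

cv_results_: every parameter combination together with its evaluation scores;
best_estimator_: the best-performing model instance found by the search (available when refit=True);
best_score_: the mean cross-validated score of best_estimator_;
best_params_: the parameter combination that gave the best score;
best_index_: the index of the best result in cv_results_.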
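Putting the pieces together, here is a minimal end-to-end sketch using the list-of-grids form shown earlier. The iris dataset and SVC are stand-ins chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two grids: 2 * 3 = 6 rbf candidates plus 4 linear candidates.
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
# best_index_ points at the winning row of cv_results_.
print(search.cv_results_["params"][search.best_index_])
```

After fitting, search.predict delegates to best_estimator_, which has been refit on the full data.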

Randomized Search with Cross-Validation

RandomizedSearchCV
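RandomizedSearchCV samples a fixed number of candidates (n_iter) from parameter distributions instead of trying every grid point. A minimal sketch, assuming the iris dataset, SVC, and log-uniform ranges picked arbitrarily for illustration:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Distributions are sampled; plain lists are sampled uniformly.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["rbf"],
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=8, cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The result attributes (cv_results_, best_estimator_, best_score_, best_params_, best_index_) are the same as for GridSearchCV.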
