How feature selection works in autofeat

Posted by xsmile 5 months ago. Category: Machine Learning

The main code lives in autofeat.autofeat.AutoFeatModel#fit_transform. Once feature generation is done, it calls the autofeat.featsel.select_features function to perform feature selection.

Inside select_features, feature selection is run featsel_runs times:

            selected_columns = []
            for i in range(featsel_runs):
                selected_columns.extend(run_select_features(i))

run_select_features is a local function defined inside select_features.

    def run_select_features(i):
        if verbose > 0:
            print("[featsel] Feature selection run %i/%i" % (i+1, featsel_runs))
        np.random.seed(i) # todo rng
        rand_idx = np.random.permutation(df_scaled.index)[:max(10, int(0.85 * len(df_scaled)))]
        return _select_features_1run(df_scaled.iloc[rand_idx], target_scaled[rand_idx], problem_type, verbose=verbose-1)

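As you can see, each run picks a random 85% of the samples (but at least 10) for feature selection. Both the data and the target are standardized with StandardScaler before selection.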
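The subsampling step can be sketched in isolation. This is a minimal illustration, not autofeat's code: subsample_indices is a hypothetical helper, it uses a per-run RandomState instead of the global np.random.seed, and it assumes rows are addressed by position.

```python
import numpy as np

def subsample_indices(n_rows, run, frac=0.85, min_rows=10):
    """Return positional indices of the random subsample used in one run."""
    # Seed per run so that every run sees a different, reproducible subsample.
    rng = np.random.RandomState(run)
    # Keep 85% of the rows, but ask for at least 10.
    n_keep = max(min_rows, int(frac * n_rows))
    # Slicing past the end is harmless: tiny data sets simply keep all rows.
    return rng.permutation(n_rows)[:n_keep]

idx = subsample_indices(150, run=0)
print(len(idx))  # number of rows used in this run
```

With 150 rows (the size of iris), each run works on 127 of them. Because the seed depends on the run number, the runs disagree on which rows are left out, which is what makes the later frequency count meaningful.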

run_select_features calls _select_features_1run, a module-level function defined in the same source file. Let's step into it.

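First, a Lasso model is fitted to the data, and the top min(df.shape[1]-1, df.shape[0]//5) features by absolute coefficient are kept (that is, the number of features does not exceed 20% of the number of samples):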

model.fit(df.iloc[rand_idx], target[rand_idx])
if problem_type == "regression":
    coefs = np.abs(model.coef_)
else:
    # model.coefs_ is n_classes x n_features, but we need n_features
    coefs = np.max(np.abs(model.coef_), axis=0)
# weight threshold: select at most 0.2*n_train initial features
thr = sorted(coefs, reverse=True)[min(df.shape[1]-1, df.shape[0]//5)]
initial_cols = list(df.columns[coefs > thr])

In my iris test case there are 24 features after generation, and initial_cols retains 16 of them after this first pass.
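To make the thresholding concrete, here is a small sketch on made-up coefficients. initial_feature_mask is a hypothetical helper name, and the coefficient values are invented for illustration; only the threshold rule is taken from the snippet above.

```python
import numpy as np

def initial_feature_mask(coefs, n_samples):
    """Boolean mask of features that survive the coefficient threshold."""
    coefs = np.abs(np.asarray(coefs, dtype=float))
    n_features = coefs.shape[0]
    # Index of the threshold in the descending coefficient list.
    k = min(n_features - 1, n_samples // 5)
    thr = np.sort(coefs)[::-1][k]
    # Strict ">" means at most k features are kept.
    return coefs > thr

coefs = [0.9, 0.0, 0.5, 0.1, 0.3, 0.0]   # invented values
mask = initial_feature_mask(coefs, n_samples=15)
print(mask)
```

Here k = min(5, 3) = 3, the threshold is the fourth-largest coefficient (0.1), and only the three features with larger coefficients survive. Because the comparison is strict, features tied with the threshold (including those that the Lasso shrank to exactly zero when the threshold is zero) are dropped as well.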

autofeat borrows the noise-filtering approach to feature selection from the Boruta algorithm, so _noise_filtering is called next to narrow down initial_cols further.

It starts by calling _add_noise_features to add shadow features.

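The way noise features are added differs somewhat from Boruta. Besides appending shuffled feature values (shuffling destroys their information), it also appends columns of Gaussian noise, i.e. random numbers drawn from N(0, 1). Note that the shuffle is applied to the flattened matrix, so values are mixed across columns rather than within each column.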

def _add_noise_features(X):
    n_feat = X.shape[1]
    if X.shape[0] > 50 and n_feat > 1:
        # shuffled features
        rand_noise = StandardScaler().fit_transform(np.random.permutation(X.flatten()).reshape(X.shape))
        X = np.hstack([X, rand_noise])
    # normally distributed noise
    rand_noise = np.random.randn(X.shape[0], max(3, int(0.5*n_feat)))
    X = np.hstack([X, rand_noise])
    return X

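The remaining steps follow Boruta: every feature whose coefficient is smaller than the largest coefficient among the noise features is dropped.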

After this filtering, only 6 of the 16 columns remain.
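The idea behind the noise filter can be sketched with NumPy alone. This is a simplified stand-in, not autofeat's _noise_filtering: noise_filter_sketch is a hypothetical name, ordinary least squares replaces the regularized linear model, only Gaussian noise columns are added (no shuffled shadow features), and the data is synthetic.

```python
import numpy as np

def noise_filter_sketch(X, y, n_noise=3, seed=0):
    """Keep the features whose |coef| beats every pure-noise column."""
    rng = np.random.RandomState(seed)
    n_feat = X.shape[1]
    # Append columns that carry no information about y.
    noise = rng.randn(X.shape[0], n_noise)
    X_aug = np.hstack([X, noise])
    # Assumption: plain least squares stands in for the Lasso-type model.
    coefs, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    coefs = np.abs(coefs)
    # The largest coefficient a noise column achieves is the bar to clear.
    noise_level = coefs[n_feat:].max()
    return [j for j in range(n_feat) if coefs[j] > noise_level]

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
# Synthetic target: only columns 0 and 2 matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.randn(200)
kept = noise_filter_sketch(X, y)
print(kept)
```

The two informative columns clear the bar easily. An uninformative column can still slip through by chance when its coefficient happens to exceed that of all noise columns, which is one reason to repeat the filter and to aggregate over several runs.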

Think that's the end of it? Not quite. There is another chunk of code after this:

X_w_noise = _add_noise_features(df[initial_cols].to_numpy())

X_w_noise.shape
Out[35]: (95, 15)

Where does X_w_noise come from? df[initial_cols] has shape (95, 6); add the shuffled shadow features, (95, 6), and the randn shadow features, (95, 3), and you get 6 + 6 + 3 = 15 columns.
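The column arithmetic follows directly from the branches in _add_noise_features shown earlier. As a quick check, the hypothetical helper below (not part of autofeat) mirrors those branches and returns only the resulting width:

```python
def noise_augmented_width(n_rows, n_feat):
    """Number of columns after adding shuffled and Gaussian noise features."""
    width = n_feat
    # Shuffled copies are only added for more than 50 rows and 2+ features.
    if n_rows > 50 and n_feat > 1:
        width += n_feat
    # Gaussian columns: half the original feature count, but at least 3.
    width += max(3, int(0.5 * n_feat))
    return width

print(noise_augmented_width(95, 6))   # the case discussed above
print(noise_augmented_width(40, 6))   # too few rows for shuffled copies
```

For 95 rows and 6 features this gives 15, matching the shape printed above. With 40 rows the shuffled block is skipped and only 6 + 3 = 9 columns result.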

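The other variables that were filtered out are then revisited in n_splits iterations. In other words, the discarded other_cols are processed iteratively, one batch at a time.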

Each iteration takes one batch, current_cols, and concatenates those features with X_w_noise:

X = np.hstack([df[current_cols].to_numpy(), X_w_noise])

After this separate filtering pass over the other columns, 9 features are retained.
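The shapes involved in this second pass can be sketched as follows. This only illustrates the batching and stacking; how autofeat actually computes n_splits and slices other_cols is not shown in this post, so the use of np.array_split, the value n_splits=2, and the helper name iterate_chunks are all assumptions, and the data is random.

```python
import numpy as np

def iterate_chunks(other_cols, n_splits):
    """Yield consecutive batches of the leftover column names."""
    # Assumption: roughly equal batches; autofeat may split differently.
    for chunk in np.array_split(np.array(other_cols, dtype=object), n_splits):
        if len(chunk) > 0:
            yield list(chunk)

rng = np.random.RandomState(0)
X_w_noise = rng.randn(95, 15)                        # kept features + noise
other = {f"f{j}": rng.randn(95) for j in range(18)}  # 24 - 6 leftover columns

shapes = []
for current_cols in iterate_chunks(list(other), n_splits=2):
    block = np.column_stack([other[c] for c in current_cols])
    # Same stacking as in the post: candidate columns first, then X_w_noise.
    X = np.hstack([block, X_w_noise])
    shapes.append(X.shape)
print(shapes)
```

Each batch of discarded columns is judged next to the already selected features and the same noise columns, so a feature that lost out in the first pass still has a chance to be recovered.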

Finally, _noise_filtering is called once more, which leaves 7 features.

So _noise_filtering is called at least 3 times in total.

After the 5 selection runs (one per featsel_runs iteration), the code counts how often each feature was retained:

selected_columns = Counter(selected_columns)
# sort by frequency, but down weight longer formulas to break ties
selected_columns = sorted(selected_columns, key=lambda x: selected_columns[x] - 0.000001*len(str(x)), reverse=True)
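A tiny example shows how the tie-breaking term behaves. The column names below are invented for illustration; the sort key is the same as in the snippet above.

```python
from collections import Counter

# Invented result of several selection runs, concatenated into one list.
selected = ["x1", "x1**2", "x1", "log(x2)", "x1**2", "x3", "log(x2)"]

counts = Counter(selected)
# Sort by frequency; the tiny penalty on the name length only matters
# when two features were selected equally often.
ranked = sorted(counts,
                key=lambda c: counts[c] - 0.000001 * len(str(c)),
                reverse=True)
print(ranked)
```

The three features selected twice come first, ordered from the shortest formula to the longest, followed by x3, which was selected only once. The penalty is far too small to ever outweigh a difference in counts.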

Original post: https://blog.csdn.net/TQCAI666/article/details/107950626
