pfnet-research/xfeat


xfeat

Slides | Tutorial | Document | Installation

Flexible Feature Engineering & Exploration Library using GPUs and Optuna.

xfeat provides sklearn-like transformation classes for feature engineering and exploration. Unlike the sklearn API, xfeat provides a dataframe-in, dataframe-out interface. xfeat supports both pandas and cuDF dataframes. By using cuDF and CuPy, xfeat can generate features 10 to 30 times faster than a naive pandas operation.

[Figures: group-by aggregation benchmark (result); target encoding benchmark (result)]

Document

More examples are available in the ./examples directory.

Quick Start

xfeat provides a dataframe-in, dataframe-out interface:

[Figure: arithmetic combination example]

Feature Engineering

Encoder objects can be chained sequentially with xfeat.Pipeline. To avoid repeating the same feature extraction process, it is useful to write the results to a file in the feather format.

  • More encoder classes are available here.
import pandas as pd
from xfeat import Pipeline, SelectNumerical, ArithmeticCombinations

# 2-order Arithmetic combinations.
Pipeline(
    [
        SelectNumerical(),
        ArithmeticCombinations(
            exclude_cols=["target"], drop_origin=True, operator="+", r=2,
        ),
    ]
).fit_transform(pd.read_feather("train_test.ftr")).reset_index(
    drop=True
).to_feather(
    "feature_arithmetic_combi2.ftr"
)
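
To make the idea behind second-order arithmetic combinations concrete, here is a minimal, dependency-free sketch that works on plain Python dicts instead of dataframes. It is not xfeat's implementation: the function name `arithmetic_combinations` and the `"a+b"` column naming are assumptions made for this illustration only.

```python
from itertools import combinations


def arithmetic_combinations(rows, exclude_cols=(), r=2):
    """Sum every r-sized combination of columns (the operator="+" case).

    Only the new columns are returned, mirroring drop_origin=True.
    """
    cols = [c for c in rows[0] if c not in exclude_cols]
    out = []
    for row in rows:
        new_row = {}
        for combo in combinations(cols, r):
            # Hypothetical naming scheme: join the source column names.
            new_row["+".join(combo)] = sum(row[c] for c in combo)
        out.append(new_row)
    return out


rows = [
    {"a": 1, "b": 2, "c": 3, "target": 0},
    {"a": 4, "b": 5, "c": 6, "target": 1},
]
features = arithmetic_combinations(rows, exclude_cols=["target"], r=2)
print(features[0])  # {'a+b': 3, 'a+c': 4, 'b+c': 5}
```

With three input columns and r=2, each row gains three pairwise sums; the number of generated features grows combinatorially with the number of columns, which is one reason to cache the output.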

Target Encoding with cuDF/CuPy

[Figure: target encoding benchmark]

Target encoding can be greatly accelerated with cuDF. Internally, aggregation is computed on the GPU using CuPy.

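import cudf  # requires a GPU environment with cuDF installed
from sklearn.model_selection import KFold
from xfeat import TargetEncoder

# `df` is a pandas dataframe and `cols` lists the columns to encode.
fold = KFold(n_splits=5, shuffle=False)
encoder = TargetEncoder(input_cols=cols, fold=fold)

df = cudf.from_pandas(df)  # if cuDF is available.
df_encoded = encoder.fit_transform(df)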
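
As an illustration of what out-of-fold target encoding computes, the following sketch uses plain Python lists and contiguous folds (the behaviour of KFold with shuffle=False). It is a simplified stand-in, not xfeat's GPU implementation; the helper names `contiguous_folds` and `target_encode` are hypothetical.

```python
from collections import defaultdict


def contiguous_folds(n_rows, n_splits):
    """Split row indices into contiguous validation folds (no shuffling)."""
    sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def target_encode(categories, targets, n_splits):
    """Encode each row with the mean target of its category, computed
    only from rows outside that row's own fold."""
    encoded = [None] * len(categories)
    for valid_idx in contiguous_folds(len(categories), n_splits):
        valid = set(valid_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i, (cat, y) in enumerate(zip(categories, targets)):
            if i not in valid:
                sums[cat] += y
                counts[cat] += 1
        for i in valid_idx:
            cat = categories[i]
            # Categories unseen in the other folds stay None in this sketch.
            encoded[i] = sums[cat] / counts[cat] if counts[cat] else None
    return encoded


categories = ["x", "x", "y", "y", "x", "y"]
targets = [1, 0, 1, 1, 0, 0]
encoded = target_encode(categories, targets, n_splits=3)
print(encoded)  # [0.0, 0.0, 0.0, 0.0, 0.5, 1.0]
```

Because a row's own target never contributes to its encoded value, the encoding does not leak the label into the feature.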

Groupby features with cuDF

[Figure: group-by aggregation benchmark]

Benchmark result: group-by aggregation.

import cudf  # requires a GPU environment with cuDF installed
from xfeat import aggregation

df = cudf.from_pandas(df)  # if cuDF is available.
df_agg = aggregation(df,
                     group_key="user_id",
                     group_values=["price", "purchased_amount"],
                     agg_methods=["sum", "min", "max"]
                     ).to_pandas()
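
The kind of feature this produces can be sketched without any dataframe library. The example below computes per-group statistics and attaches them to every row of that group; the function name `aggregate` and the generated column names are assumptions for this sketch and may differ from what xfeat emits.

```python
from collections import defaultdict

AGG_FUNCS = {"sum": sum, "min": min, "max": max}


def aggregate(rows, group_key, group_values, agg_methods):
    """Attach per-group aggregates of `group_values` to every row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    out = []
    for row in rows:
        new_row = dict(row)
        members = groups[row[group_key]]
        for method in agg_methods:
            for col in group_values:
                # Hypothetical column naming, chosen for this sketch.
                name = f"{method}_{col}_by_{group_key}"
                new_row[name] = AGG_FUNCS[method](m[col] for m in members)
        out.append(new_row)
    return out


rows = [
    {"user_id": 1, "price": 100},
    {"user_id": 1, "price": 300},
    {"user_id": 2, "price": 50},
]
result = aggregate(rows, "user_id", ["price"], ["sum", "max"])
print(result[0])
# {'user_id': 1, 'price': 100, 'sum_price_by_user_id': 400, 'max_price_by_user_id': 300}
```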

Feature Selection with GBDT feature importance

Example code: examples/feature_selection_with_gbdt.py

from xfeat import GBDTFeatureSelector

params = {
    "objective": "regression",
    "seed": 111,
}
fit_kwargs = {
    "num_boost_round": 10,
}

selector = GBDTFeatureSelector(
    input_cols=cols,
    target_col="target",
    threshold=0.5,
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)
df_selected = selector.fit_transform(df)
print("Selected columns:", selector._selected_cols)
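
The selection step can be pictured as ranking features by importance and keeping the top fraction. The sketch below assumes that `threshold=0.5` means "keep the top 50% of the input columns"; the importance values are made up for the illustration, and `select_by_importance` is a hypothetical helper, not part of xfeat.

```python
def select_by_importance(importances, threshold):
    """Keep the top `threshold` fraction of features, ranked by importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    n_select = int(len(ranked) * threshold)
    return ranked[:n_select]


# Made-up importance scores, standing in for those of a trained GBDT model.
importances = {"age": 120.0, "price": 300.0, "noise": 2.0, "clicks": 85.0}
selected = select_by_importance(importances, threshold=0.5)
print(selected)  # ['price', 'age']
```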

Feature Selection with Optuna

GBDTFeatureSelector uses a percentile hyperparameter (threshold) to select the features with the highest importance scores. By using Optuna, we can search for the value of this hyperparameter that gives the best objective score.

Example code: examples/feature_selection_with_gbdt_and_optuna.py

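from functools import partial

import lightgbm as lgb
import optuna
from xfeat import GBDTFeatureExplorer

def objective(df, selector, trial):
    # Bind the current Optuna trial to the selector before fitting.
    selector.set_trial(trial)
    selector.fit(df)
    input_cols = selector.get_selected_cols()

    # Evaluate with selected columns
    train_set = lgb.Dataset(df[input_cols], label=df["target"])
    scores = lgb.cv(LGBM_PARAMS, train_set, num_boost_round=100, stratified=False, seed=1)
    # The metric is RMSE; LightGBM >= 4.0 names this key "valid rmse-mean".
    rmse_score = scores["rmse-mean"][-1]
    return rmse_score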


selector = GBDTFeatureExplorer(
    input_cols=input_cols,
    target_col="target",
    fit_once=True,
    threshold_range=(0.6, 1.0),
    lgbm_params=params,
    lgbm_fit_kwargs=fit_params,
)

study = optuna.create_study(direction="minimize")
study.optimize(partial(objective, df_train, selector), n_trials=20)

selector.from_trial(study.best_trial)
print("Selected columns:", selector.get_selected_cols())
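
The search loop can be sketched in a few lines if a plain grid search stands in for Optuna's sampler. In this illustration the feature ranking is computed once and reused for every candidate threshold, which is assumed to be the idea behind fit_once=True; `toy_score` is a made-up replacement for the cross-validated RMSE, and all names are hypothetical.

```python
def select_top(ranked_cols, threshold):
    """Keep the top `threshold` fraction of an importance-ranked column list."""
    return ranked_cols[: int(len(ranked_cols) * threshold)]


def toy_score(cols):
    """Stand-in for a cross-validated error (lower is better).

    Penalizes dropping a useful column more than keeping a noisy one.
    """
    useful = {"price", "age", "clicks"}
    missing = len(useful - set(cols))
    noisy = len(set(cols) - useful)
    return missing + 0.5 * noisy


# Ranking by importance, computed once and shared by all candidates.
ranked = ["price", "age", "clicks", "noise"]

# Grid search over the same range as threshold_range=(0.6, 1.0).
candidates = [0.6, 0.7, 0.8, 0.9, 1.0]
best = min(candidates, key=lambda t: toy_score(select_top(ranked, t)))
print(best, select_top(ranked, best))  # 0.8 ['price', 'age', 'clicks']
```

Optuna replaces the fixed grid with an adaptive sampler, but the shape of the loop is the same: propose a threshold, select columns, score them, and keep the best.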

Installation

$ python setup.py install

If you want to use GPUs, cuDF and CuPy are required. See the cuDF installation guide.

For Developers

$ python setup.py test