Slides | Tutorial | Document | Installation
Flexible Feature Engineering & Exploration Library using GPUs and Optuna.
xfeat provides sklearn-like transformation classes for feature engineering and exploration. Unlike the sklearn API, xfeat offers a dataframe-in, dataframe-out interface, and it supports both pandas and cuDF dataframes. By using cuDF and CuPy, xfeat can generate features 10 to 30 times faster than naive pandas operations.
Group-by aggregation benchmark (result) | Target encoding benchmark (result)
- Slides
- Tutorial notebook
- Feature Encoding and Pipelining
- Target encoding and benchmark result
- Group-by aggregation and benchmark result
- Feature selection with Optuna
More examples are available in the ./examples directory.
xfeat provides a dataframe-in, dataframe-out interface:
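A minimal sketch of that interface, using `LabelEncoder` (one of the bundled encoder classes); the example DataFrame and the `_le` suffix are illustrative, and the suffix is configurable via `output_suffix`:

```python
import pandas as pd
from xfeat import LabelEncoder

df = pd.DataFrame({"col": ["a", "b", "a"]})

# A DataFrame goes in, and a DataFrame comes out,
# with the encoded column appended.
df_encoded = LabelEncoder(output_suffix="_le").fit_transform(df)
print(df_encoded.columns)  # ["col", "col_le"]
```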
Encoder objects can be chained sequentially with `xfeat.Pipeline`, as in the example below. To avoid repeating the same feature extraction process, it is useful to save the results in the feather file format.
- More encoder classes are available here.
```python
import pandas as pd
from xfeat import Pipeline, SelectNumerical, ArithmeticCombinations

# 2nd-order arithmetic combinations of the numerical columns.
Pipeline(
    [
        SelectNumerical(),
        ArithmeticCombinations(
            exclude_cols=["target"], drop_origin=True, operator="+", r=2,
        ),
    ]
).fit_transform(pd.read_feather("train_test.ftr")).reset_index(
    drop=True
).to_feather("feature_arithmetic_combi2.ftr")
```
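The cached features can then be reloaded in later experiments instead of being recomputed, for example:

```python
import pandas as pd

# Reload the precomputed features from the feather file.
df_features = pd.read_feather("feature_arithmetic_combi2.ftr")
```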
Target encoding can be greatly accelerated with cuDF. Internally, aggregation is computed on the GPU using CuPy.
```python
import cudf  # optional: only required for the GPU code path
from sklearn.model_selection import KFold
from xfeat import TargetEncoder

fold = KFold(n_splits=5, shuffle=False)
encoder = TargetEncoder(input_cols=cols, fold=fold)

df = cudf.from_pandas(df)  # if cuDF is available.
df_encoded = encoder.fit_transform(df)
```
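Conceptually, `TargetEncoder` computes out-of-fold target statistics: per-category means are estimated on each fold's training split and applied to the held-out rows, so no row sees its own target value. A rough pandas-only sketch of that idea (illustrative, not xfeat's actual implementation; the `_te` column name is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def naive_target_encode(df, col, target, n_splits=5):
    """Out-of-fold mean target encoding (illustrative sketch)."""
    encoded = pd.Series(np.nan, index=df.index)
    for trn_idx, val_idx in KFold(n_splits=n_splits, shuffle=False).split(df):
        # Category means from the training split only.
        fold_means = df.iloc[trn_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).values
    return encoded


df["col_te"] = naive_target_encode(df, "col", "target")
```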
Group-by aggregation can likewise run on cuDF for GPU acceleration. Benchmark result: Group-by aggregation and benchmark result.
```python
import cudf  # optional: only required for the GPU code path
from xfeat import aggregation

df = cudf.from_pandas(df)  # if cuDF is available.
df_agg = aggregation(
    df,
    group_key="user_id",
    group_values=["price", "purchased_amount"],
    agg_methods=["sum", "min", "max"],
).to_pandas()
```
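For reference, similar statistics can be produced in plain pandas with `groupby(...).transform(...)`, which broadcasts each group's aggregate back onto its rows (an illustrative equivalent only; the generated column names below are hypothetical):

```python
# Illustrative plain-pandas version of the aggregation above.
for method in ["sum", "min", "max"]:
    for col in ["price", "purchased_amount"]:
        df[f"agg_{method}_{col}_grpby_user_id"] = (
            df.groupby("user_id")[col].transform(method)
        )
```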
`GBDTFeatureSelector` selects features based on GBDT (LightGBM) importance scores. Example code: examples/feature_selection_with_gbdt.py
```python
from xfeat import GBDTFeatureSelector

params = {
    "objective": "regression",
    "seed": 111,
}
fit_kwargs = {
    "num_boost_round": 10,
}

selector = GBDTFeatureSelector(
    input_cols=cols,
    target_col="target",
    threshold=0.5,
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)
df_selected = selector.fit_transform(df)

print("Selected columns:", selector._selected_cols)
```
`GBDTFeatureSelector` uses a percentile hyperparameter to keep the features with the highest importance scores; e.g. `threshold=0.5` above keeps roughly the top half. By using Optuna, we can search for the threshold value that optimizes the objective.
Example code: examples/feature_selection_with_gbdt_and_optuna.py
```python
from functools import partial

import lightgbm as lgb
import optuna

from xfeat import GBDTFeatureExplorer


def objective(df, selector, trial):
    selector.set_trial(trial)
    selector.fit(df)
    input_cols = selector.get_selected_cols()

    # Evaluate the selected columns with cross-validation.
    # LGBM_PARAMS: LightGBM evaluation parameters defined in the example script.
    train_set = lgb.Dataset(df[input_cols], label=df["target"])
    scores = lgb.cv(LGBM_PARAMS, train_set, num_boost_round=100,
                    stratified=False, seed=1)
    rmse_score = scores["rmse-mean"][-1]
    return rmse_score


selector = GBDTFeatureExplorer(
    input_cols=input_cols,
    target_col="target",
    fit_once=True,
    threshold_range=(0.6, 1.0),
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)

study = optuna.create_study(direction="minimize")
study.optimize(partial(objective, df_train, selector), n_trials=20)

selector.from_trial(study.best_trial)
print("Selected columns:", selector.get_selected_cols())
```
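After restoring the best trial, a final model can be trained on the selected features, for example (a minimal sketch reusing `params` and `df_train` from the snippets above):

```python
import lightgbm as lgb

# Retrain a final model using only the Optuna-selected features.
selected_cols = selector.get_selected_cols()
train_set = lgb.Dataset(df_train[selected_cols], label=df_train["target"])
booster = lgb.train(params, train_set, num_boost_round=100)
```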
Install xfeat from source:

```bash
$ python setup.py install
```
If you want to use GPUs, cuDF and CuPy are required. See the cuDF installation guide.
To run the test suite:

```bash
$ python setup.py test
```