tdub.features¶
A module for performing feature selection.
Class Summary¶
FeatureSelector – A class to steer the steps of feature selection.
Function Summary¶
create_parquet_files – Create slimmed and selected parquet files from ROOT files.
prepare_from_parquet – Prepare feature selection data from parquet files.
Reference¶
-
class tdub.features.FeatureSelector(df, labels, weights, importance_type='gain', corr_threshold=0.85, name=None)[source]¶ A class to steer the steps of feature selection.
- Parameters
df (pandas.DataFrame) – The dataframe containing signal and background events; it should contain only the features we wish to test (i.e. it is expected to be “clean” of non-kinematic information such as metadata and weights).
labels (numpy.ndarray) – Array of labels compatible with the dataframe (1 for \(tW\) and 0 for \(t\bar{t}\)).
weights (numpy.ndarray) – The weights array compatible with the dataframe.
importance_type (str) – The importance type (“gain” or “split”).
corr_threshold (float) – The threshold for excluding features based on correlations.
name (str, optional) – Give the selector a name.
-
data¶ The raw dataframe as fed to the class instance.
- Type: pandas.DataFrame
-
weights¶ The raw weights array compatible with the dataframe.
- Type: numpy.ndarray
-
labels¶ The raw labels array compatible with the dataframe (we expect 1 for signal, \(tW\), and 0 for background, \(t\bar{t}\)).
- Type: numpy.ndarray
-
corr_matrix¶ The raw correlation matrix for the features (requires calling the check_collinearity function).
- Type: pandas.DataFrame
-
correlated¶ A dataframe matching features that satisfy the correlation threshold (also populated by check_collinearity).
- Type: pandas.DataFrame
-
importances¶ The importances as determined by a vanilla GBDT (requires calling the check_importances function).
- Type: pandas.DataFrame
-
candidates¶ List of candidate features (sorted by importance) as determined by calling the check_candidates function.
-
iterative_remove_aucs¶ A dictionary of the form {feature: auc} providing the AUC value for a BDT trained _without_ the feature given in the key. The keys are built from the candidates list.
-
iterative_add_aucs¶ An array of AUC values built by iteratively adding the next best feature from the candidates list (the first entry is calculated using only the top feature, the second entry uses the top two features, and so on).
- Type: numpy.ndarray
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
-
check_candidates(n=20)[source]¶ Get the top uncorrelated features.
This will parse the correlations and the most important features and build an ordered list of important features. When a feature should be dropped because of a collinear partner, we ensure that the more important member of the pair is kept and drop the other member (a sketch of this pair-resolution logic follows the example below). This populates the candidates attribute of the class.
- Parameters
n (int) – The total number of features to retrieve.
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
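A minimal sketch of the pair-resolution logic described above, in plain Python with hypothetical feature names and correlation pairs (an illustration of the technique, not tdub's actual implementation):

# features ordered by importance, most important first (hypothetical names)
ranked = ["met", "mass_lep1jet1", "mass_lep1jet2", "pT_jet2"]
# pairs flagged as correlated above the threshold (hypothetical)
correlated_pairs = [("mass_lep1jet1", "mass_lep1jet2")]

drop = set()
for a, b in correlated_pairs:
    # keep the more important member of each collinear pair, drop the other
    drop.add(a if ranked.index(a) > ranked.index(b) else b)

n = 3
candidates = [f for f in ranked if f not in drop][:n]
# candidates == ["met", "mass_lep1jet1", "pT_jet2"]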
-
check_collinearity(threshold=None)[source]¶ Calculate the correlations of the features.
Given a correlation threshold, this constructs a list of features that should be dropped based on their correlation values (see the sketch after the example below). This also populates the correlated dataframe attribute described above.
If the threshold argument is not None, the class instance's corr_threshold property is updated.
- Parameters
threshold (float, optional) – Override the existing correlation threshold.
Examples
Overriding the exclusion threshold:
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.corr_threshold
0.90
>>> fs.check_collinearity(threshold=0.85)
>>> fs.corr_threshold
0.85
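The core of a correlation-threshold check can be sketched with pandas alone; this is an illustration of the technique on synthetic data, not tdub's exact code:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"a": rng.normal(size=1000)})
df["b"] = df["a"] + rng.normal(scale=0.1, size=1000)  # nearly a copy of "a"
df["c"] = rng.normal(size=1000)                       # independent feature

corr_matrix = df.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
# to_drop == ["b"]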
-
check_for_uniques(and_drop=True)[source]¶ Check the dataframe for features that have a single unique value.
- Parameters
and_drop (bool) – If True, drop any single-valued columns (a sketch of the check follows the example below).
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
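The underlying check is simple to sketch with pandas (an illustration, not the package's exact code):

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "flag": [0, 0, 0]})
nunique = df.nunique()
single_valued = nunique[nunique == 1].index.tolist()  # ["flag"]
df = df.drop(columns=single_valued)  # mirrors and_drop=True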
-
check_importances(extra_clf_opts=None, extra_fit_opts=None, n_fits=5, test_size=0.5)[source]¶ Train vanilla GBDTs to calculate feature importance.
Some default options are used for the lightgbm.LGBMClassifier instance and its fit (see the implementation); you can provide extras via these function arguments. A sketch of the averaging procedure follows the example below.
- Parameters
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
n_fits (int) – Number of models to fit to determine the importances.
test_size (float) – Forwarded to sklearn.model_selection.train_test_split().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
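A minimal sketch of averaging importances over several fits, using a synthetic dataset in place of real \(tW\)/\(t\bar{t}\) events (the options shown are assumptions, not tdub's exact defaults):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

n_fits, test_size = 5, 0.5
importances = np.zeros(X.shape[1])
for i in range(n_fits):
    # a fresh split per fit so the importances are averaged over several models
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=i
    )
    clf = LGBMClassifier(importance_type="gain", random_state=i)
    clf.fit(X_train, y_train)
    importances += clf.feature_importances_
importances /= n_fits
ranking = np.argsort(importances)[::-1]  # feature indices, most important first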
-
check_iterative_add_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]¶ Calculate AUCs while iteratively adding the next best feature.
After calling the check_candidates function we have a good set of candidate features; this function trains vanilla BDTs, iteratively including one more feature at a time starting with the most important (see the sketch after the example below).
- Parameters
max_features (int) – The maximum number of features to check; the default is the length of the candidates list.
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
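A sketch of the iterative-addition loop on a synthetic dataset; the candidates list of column indices is hypothetical and this is not tdub's internal code:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
candidates = [3, 0, 5, 1]  # hypothetical column indices, most important first

aucs = []
for k in range(1, len(candidates) + 1):
    cols = candidates[:k]  # top-k candidate features
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, cols], y, test_size=0.5, random_state=0
    )
    clf = LGBMClassifier(random_state=0).fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
# aucs[0] uses only the top feature, aucs[1] the top two, and so on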
-
check_iterative_remove_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]¶ Calculate AUCs while iteratively removing one feature at a time.
After calling the check_candidates function we have a good set of candidate features; this function trains vanilla BDTs, each time removing one of the candidate features. We rank each feature by how impactful its removal is (see the sketch after the example below).
- Parameters
max_features (int) – The maximum number of features to check; the default is the length of the candidates list.
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_remove_aucs(max_features=20)
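The complementary leave-one-out loop can be sketched the same way (synthetic data, hypothetical candidate indices):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
candidates = [3, 0, 5, 1]  # hypothetical column indices

remove_aucs = {}
for feat in candidates:
    cols = [c for c in candidates if c != feat]  # train without this feature
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, cols], y, test_size=0.5, random_state=0
    )
    clf = LGBMClassifier(random_state=0).fit(X_train, y_train)
    remove_aucs[feat] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
# the larger the AUC drop when a feature is removed, the more impactful it is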
-
save_result()[source]¶ Save the results to a directory.
- Parameters
output_dir (str or os.PathLike) – The directory to save relevant results to.
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
>>> fs.name = "2j1b_DR"
>>> fs.save_result()
-
tdub.features.create_parquet_files(qf_dir, out_dir=None, entrysteps=None, use_campaign_weight=False)[source]¶ Create slimmed and selected parquet files from ROOT files.
This function requires pyarrow.
- Parameters
qf_dir (str or os.PathLike) – Directory on which to run tdub.data.quick_files().
out_dir (str or os.PathLike, optional) – Directory to save the output files.
entrysteps (any, optional) – The entrysteps option forwarded to tdub.frames.iterative_selection().
use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product that forms the nominal weight.
Examples
>>> from tdub.features import create_parquet_files
>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", entrysteps="250 MB")
-
tdub.features.prepare_from_parquet(data_dir, region, nlo_method='DR', ttbar_frac=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, test_case_size=None)[source]¶ Prepare feature selection data from parquet files.
This function requires pyarrow.
- Parameters
data_dir (str or os.PathLike) – Directory where the parquet files live.
region (str or tdub.data.Region) – The region in which we're going to select features.
nlo_method (str) – The \(tW\) sample (DR or DS).
ttbar_frac (str or float, optional) – If not None, the fraction of \(t\bar{t}\) events to use; "auto" uses some sensible defaults to fit in memory: 0.70 for 2j2b and 0.60 for 2j1b.
weight_mean (float, optional) – Scale all weights such that the mean weight is this value (see the sketch after the example below); cannot be used with weight_scale.
weight_scale (float, optional) – Value to scale all weights by; cannot be used with weight_mean.
scale_sum_weights (bool) – Scale the sum of signal weights to equal the sum of background weights.
test_case_size (int, optional) – To perform a quick test on a subset of the data: for test_case_size=N we use N events from both signal and background; cannot be used with ttbar_frac.
- Returns
pandas.DataFrame – The dataframe containing the kinematic features.
numpy.ndarray – The labels array for the events.
numpy.ndarray – The weights array for the events.
Examples
>>> from tdub.features import prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
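The arithmetic implied by the weight_mean and weight_scale options can be illustrated directly (a sketch of the intended behavior, not the function's internal code):

import numpy as np

weights = np.array([0.2, 0.5, 1.3, 2.0])

# weight_mean=1.0: rescale so the mean of the weights is exactly 1.0
w_mean = weights * (1.0 / weights.mean())
assert np.isclose(w_mean.mean(), 1.0)

# weight_scale=2.0: simply multiply every weight by the given constant
w_scaled = weights * 2.0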