tdub.features

A module for performing feature selection.

Class Summary

FeatureSelector(df, labels, weights[, …])

A class to steer the steps of feature selection.

Function Summary

create_parquet_files(qf_dir[, out_dir, …])

Create slimmed and selected parquet files from ROOT files.

prepare_from_parquet(data_dir, region[, …])

Prepare feature selection data from parquet files.

Reference

class tdub.features.FeatureSelector(df, labels, weights, importance_type='gain', corr_threshold=0.85, name=None)[source]

A class to steer the steps of feature selection.

Parameters
  • df (pandas.DataFrame) – The dataframe containing signal and background events; it should contain only the features we wish to test (it is expected to be “clean” of non-kinematic information like metadata and weights).

  • labels (numpy.ndarray) – array of labels compatible with the dataframe (1 for \(tW\) and 0 for \(t\bar{t}\)).

  • weights (numpy.ndarray) – the weights array compatible with the dataframe.

  • importance_type (str) – the importance type (“gain” or “split”).

  • corr_threshold (float) – the threshold for excluding features based on correlations.

  • name (str, optional) – give the selector a name.

data

the raw dataframe as fed to the class instance

Type

pandas.DataFrame

weights

the raw weights array compatible with the dataframe

Type

numpy.ndarray

labels

the raw labels array compatible with the dataframe (we expect 1 for signal, \(tW\), and 0 for background, \(t\bar{t}\)).

Type

numpy.ndarray

raw_features

the list of all features determined at initialization

Type

list(str)

name

a name for the selector pipeline (required to save the result)

Type

str, optional

corr_threshold

the threshold for excluding features based on correlations

Type

float

default_clf_opts

the default arguments we initialize classifiers with.

Type

dict

corr_matrix

the raw correlation matrix for the features (requires calling the check_collinearity function)

Type

pandas.DataFrame

correlated

a dataframe of feature pairs whose correlation exceeds the threshold

Type

pandas.DataFrame

importances

the importances as determined by a vanilla GBDT (requires calling the check_importances function)

Type

pandas.DataFrame

candidates

list of candidate features (sorted by importance) as determined by calling the check_candidates function

Type

list(str)

iterative_remove_aucs

a dictionary of the form {feature : auc} providing the AUC value for a BDT trained _without_ the feature given in the key. The keys are built from the candidates list.

Type

dict(str, float)

iterative_add_aucs

an array of AUC values built by iteratively adding the next best feature in the candidates list (the first entry is calculated using only the top feature, the second uses the top two features, and so on).

Type

numpy.ndarray

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
check_candidates(n=20)[source]

Get the top uncorrelated features.

This will parse the correlations and most important features and build an ordered list of important features. When a feature that should be dropped due to collinearity is found, we ensure that the more important member of the pair is kept and drop the other member. This populates the candidates attribute of the class.

Parameters

n (int) – the total number of features to retrieve

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
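
After the call, the ordered candidate list is available through the candidates attribute; a minimal inspection sketch (the slice length is arbitrary):

>>> fs.candidates[:5]  # five most important uncorrelated features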
check_collinearity(threshold=None)[source]

Calculate the correlations of the features.

Given a correlation threshold, this will construct a list of features that should be dropped based on the correlation values. This populates the corr_matrix and correlated attributes of the instance.

If the threshold argument is not None then the class instance’s corr_threshold property is updated.

Parameters

threshold (float, optional) – Override the existing correlation threshold.

Examples

Overriding the exclusion threshold:

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.corr_threshold
0.9
>>> fs.check_collinearity(threshold=0.85)
>>> fs.corr_threshold
0.85
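
The correlation information computed by this call can then be inspected via the attributes described above (a sketch; no return value is assumed):

>>> fs.corr_matrix   # full feature-by-feature correlation matrix (pandas.DataFrame)
>>> fs.correlated    # feature pairs exceeding the threshold (pandas.DataFrame)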
check_for_uniques(and_drop=True)[source]

Check the dataframe for features that have a single unique value.

Parameters

and_drop (bool) – If True, drop any columns found to have a single unique value.

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
check_importances(extra_clf_opts=None, extra_fit_opts=None, n_fits=5, test_size=0.5)[source]

Train vanilla GBDT to calculate feature importance.

Some default options are used for the lightgbm.LGBMClassifier instance and its fit call (see the implementation); extras can be provided via the extra_clf_opts and extra_fit_opts arguments.

Parameters
  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

  • n_fits (int) – the number of training fits to perform.

  • test_size (float) – the fraction of the dataset used for testing in each fit.

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
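
The resulting importances are stored on the instance and can be inspected directly (a sketch using only the documented importances attribute):

>>> fs.importances  # per-feature importances from the vanilla GBDT fits (pandas.DataFrame)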
check_iterative_add_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]

Calculate AUCs, iteratively adding the next best feature.

After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs, iteratively including one more feature at a time, starting with the most important.

Parameters
  • max_features (int) – the maximum number of features to allow to be checked. default will be the length of the candidates list.

  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
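
The resulting AUC curve can be used to choose how many features to keep, for example by locating the peak (a sketch assuming the array ordering described by the iterative_add_aucs attribute):

>>> import numpy as np
>>> best_n = int(np.argmax(fs.iterative_add_aucs)) + 1  # entry i uses the top (i + 1) features
>>> chosen = fs.candidates[:best_n]                     # keep the top best_n candidates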
check_iterative_remove_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]

Calculate AUCs, iteratively removing one feature at a time.

After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs, each time removing one of the candidate features. We rank the features based on how impactful their removal is.

Parameters
  • max_features (int) – the maximum number of features to allow to be checked. default will be the length of the candidates list.

  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_remove_aucs(max_features=20)
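
The per-feature results can then be ranked by how much the AUC suffers when a feature is removed (a sketch using only the documented iterative_remove_aucs dictionary):

>>> ranked = sorted(fs.iterative_remove_aucs.items(), key=lambda kv: kv[1])
>>> ranked[0]  # (feature, auc): the removal that hurts the AUC the most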
save_result()[source]

Save the results to a directory.

Parameters

output_dir (str or os.PathLike) – the directory to save relevant results to

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
>>> fs.name = "2j1b_DR"
>>> fs.save_result()
tdub.features.create_parquet_files(qf_dir, out_dir=None, entrysteps=None, use_campaign_weight=False)[source]

Create slimmed and selected parquet files from ROOT files.

This function requires pyarrow.

Parameters
  • qf_dir (str or os.PathLike) – directory in which to run tdub.data.quick_files()

  • out_dir (str or os.PathLike, optional) – directory to save output files

  • entrysteps (any, optional) – entrysteps option forwarded to tdub.frames.iterative_selection()

  • use_campaign_weight (bool) – multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product that forms the nominal weight.

Examples

>>> from tdub.features import create_parquet_files
>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", entrysteps="250 MB")
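
If the samples were prepared without the campaign weight folded into the nominal weight, the documented use_campaign_weight flag can be enabled in the same call:

>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", use_campaign_weight=True)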
tdub.features.prepare_from_parquet(data_dir, region, nlo_method='DR', ttbar_frac=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, test_case_size=None)[source]

Prepare feature selection data from parquet files.

This function requires pyarrow.

Parameters
  • data_dir (str or os.PathLike) – directory where the parquet files live

  • region (str or tdub.data.Region) – the region where we’re going to select features

  • nlo_method (str) – the \(tW\) sample (DR or DS)

  • ttbar_frac (str or float, optional) – if not None, this is the fraction of \(t\bar{t}\) events to use; “auto” uses sensible defaults to fit in memory: 0.70 for 2j2b and 0.60 for 2j1b.

  • weight_mean (float, optional) – scale all weights such that the mean weight is this value. Cannot be used with weight_scale.

  • weight_scale (float, optional) – value to scale all weights by, cannot be used with weight_mean.

  • scale_sum_weights (bool) – scale sum of weights of signal to be sum of weights of background

  • test_case_size (int, optional) – if we want to perform a quick test, we use a subset of the data; for test_case_size=N we use N events from both signal and background. Cannot be used with ttbar_frac.

Returns

  • pandas.DataFrame – the dataframe which contains kinematic features

  • numpy.ndarray – the labels array for the events

  • numpy.ndarray – the weights array for the events

Examples

>>> from tdub.features import prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
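
For a quick smoke test on a small subset, the test_case_size parameter documented above can be used (a sketch; the event count is arbitrary):

>>> df, labels, weights = prepare_from_parquet(
...     "/path/to/pq/output", "2j1b", "DR", test_case_size=5000
... )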