tdub.features¶
A module for performing feature selection.
Class Summary¶
FeatureSelector – A class to steer the steps of feature selection.
Function Summary¶
create_parquet_files – Create slimmed and selected parquet files from ROOT files.
prepare_from_parquet – Prepare feature selection data from parquet files.
Reference¶
-
class tdub.features.FeatureSelector(df, labels, weights, importance_type='gain', corr_threshold=0.85, name=None)[source]¶ A class to steer the steps of feature selection.
- Parameters
df (pandas.DataFrame) – The dataframe containing signal and background events; it should contain only the features we wish to test (i.e. it is expected to be “clean” of non-kinematic information such as metadata and weights).
labels (numpy.ndarray) – Array of labels compatible with the dataframe (1 for \(tW\) and 0 for \(t\bar{t}\)).
weights (numpy.ndarray) – The weights array compatible with the dataframe.
importance_type (str) – The importance type (“gain” or “split”).
corr_threshold (float) – The threshold for excluding features based on correlations.
name (str, optional) – Give the selector a name.
-
data¶ The raw dataframe as fed to the class instance.
- Type: pandas.DataFrame
-
weights¶ The raw weights array compatible with the dataframe.
- Type: numpy.ndarray
-
labels¶ The raw labels array compatible with the dataframe (we expect 1 for signal, \(tW\), and 0 for background, \(t\bar{t}\)).
- Type: numpy.ndarray
-
corr_matrix¶ The raw correlation matrix for the features (requires calling the check_collinearity function).
- Type: pandas.DataFrame
-
correlated¶ A dataframe matching features that satisfy the correlation threshold (also populated by check_collinearity).
- Type: pandas.DataFrame
-
importances¶ The importances as determined by a vanilla GBDT (requires calling the check_importances function).
- Type: pandas.DataFrame
-
candidates¶ List of candidate features (sorted by importance) as determined by calling the check_candidates function.
-
iterative_remove_aucs¶ A dictionary of the form {feature: auc} providing the AUC value for a BDT trained _without_ the feature given in the key. The keys are built from the candidates list.
-
iterative_add_aucs¶ An array of AUC values built by iteratively adding the next best feature from the candidates list (the first entry is calculated using only the top feature, the second entry uses the top two features, and so on).
- Type: numpy.ndarray
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
-
check_candidates(n=20)[source]¶ Get the top uncorrelated features.
This will parse the correlations and the most important features and build an ordered list of important features. When a feature should be dropped because of a collinear partner, we ensure that the more important member of the pair is kept and drop the other member (a sketch of this pair-resolution logic follows the example below). This populates the candidates attribute of the class.
- Parameters
n (int) – The total number of features to retrieve.
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
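A minimal sketch of the pair-resolution logic described above, in plain Python with hypothetical feature names and correlation pairs (an illustration of the technique, not tdub's actual implementation):

# features ordered by importance, most important first (hypothetical names)
ranked = ["met", "mass_lep1jet1", "mass_lep1jet2", "pT_jet2"]
# pairs flagged as correlated above the threshold (hypothetical)
correlated_pairs = [("mass_lep1jet1", "mass_lep1jet2")]

drop = set()
for a, b in correlated_pairs:
    # keep the more important member of each collinear pair, drop the other
    drop.add(a if ranked.index(a) > ranked.index(b) else b)

n = 3
candidates = [f for f in ranked if f not in drop][:n]
# candidates == ["met", "mass_lep1jet1", "pT_jet2"]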
-
check_collinearity(threshold=None)[source]¶ Calculate the correlations of the features.
Given a correlation threshold, this constructs a list of features that should be dropped based on their correlation values (see the sketch after the example below). This also populates the correlated dataframe attribute described above.
If the threshold argument is not None, the class instance's corr_threshold property is updated.
- Parameters
threshold (float, optional) – Override the existing correlation threshold.
Examples
Overriding the exclusion threshold:
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.corr_threshold
0.90
>>> fs.check_collinearity(threshold=0.85)
>>> fs.corr_threshold
0.85
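The core of a correlation-threshold check can be sketched with pandas alone; this is an illustration of the technique on synthetic data, not tdub's exact code:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"a": rng.normal(size=1000)})
df["b"] = df["a"] + rng.normal(scale=0.1, size=1000)  # nearly a copy of "a"
df["c"] = rng.normal(size=1000)                       # independent feature

corr_matrix = df.corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
# to_drop == ["b"]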
-
check_for_uniques(and_drop=True)[source]¶ Check the dataframe for features that have a single unique value.
- Parameters
and_drop (bool) – If True, drop any single-valued columns (a sketch of the check follows the example below).
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
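The underlying check is simple to sketch with pandas (an illustration, not the package's exact code):

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "flag": [0, 0, 0]})
nunique = df.nunique()
single_valued = nunique[nunique == 1].index.tolist()  # ["flag"]
df = df.drop(columns=single_valued)  # mirrors and_drop=True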
-
check_importances(extra_clf_opts=None, extra_fit_opts=None, n_fits=5, test_size=0.5)[source]¶ Train vanilla GBDTs to calculate feature importance.
Some default options are used for the lightgbm.LGBMClassifier instance and its fit (see the implementation); you can provide extras via these function arguments. A sketch of the averaging procedure follows the example below.
- Parameters
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
n_fits (int) – Number of models to fit to determine the importances.
test_size (float) – Forwarded to sklearn.model_selection.train_test_split().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
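A minimal sketch of averaging importances over several fits, using a synthetic dataset in place of real \(tW\)/\(t\bar{t}\) events (the options shown are assumptions, not tdub's exact defaults):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

n_fits, test_size = 5, 0.5
importances = np.zeros(X.shape[1])
for i in range(n_fits):
    # a fresh split per fit so the importances are averaged over several models
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=i
    )
    clf = LGBMClassifier(importance_type="gain", random_state=i)
    clf.fit(X_train, y_train)
    importances += clf.feature_importances_
importances /= n_fits
ranking = np.argsort(importances)[::-1]  # feature indices, most important first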
-
check_iterative_add_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]¶ Calculate AUCs while iteratively adding the next best feature.
After calling the check_candidates function we have a good set of candidate features; this function trains vanilla BDTs, iteratively including one more feature at a time starting with the most important (see the sketch after the example below).
- Parameters
max_features (int) – The maximum number of features to check; the default is the length of the candidates list.
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
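A sketch of the iterative-addition loop on a synthetic dataset; the candidates list of column indices is hypothetical and this is not tdub's internal code:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
candidates = [3, 0, 5, 1]  # hypothetical column indices, most important first

aucs = []
for k in range(1, len(candidates) + 1):
    cols = candidates[:k]  # top-k candidate features
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, cols], y, test_size=0.5, random_state=0
    )
    clf = LGBMClassifier(random_state=0).fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
# aucs[0] uses only the top feature, aucs[1] the top two, and so on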
-
check_iterative_remove_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]¶ Calculate AUCs while iteratively removing one feature at a time.
After calling the check_candidates function we have a good set of candidate features; this function trains vanilla BDTs, each time removing one of the candidate features. We rank each feature by how impactful its removal is (see the sketch after the example below).
- Parameters
max_features (int) – The maximum number of features to check; the default is the length of the candidates list.
extra_clf_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – Extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_remove_aucs(max_features=20)
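The complementary leave-one-out loop can be sketched the same way (synthetic data, hypothetical candidate indices):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
candidates = [3, 0, 5, 1]  # hypothetical column indices

remove_aucs = {}
for feat in candidates:
    cols = [c for c in candidates if c != feat]  # train without this feature
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, cols], y, test_size=0.5, random_state=0
    )
    clf = LGBMClassifier(random_state=0).fit(X_train, y_train)
    remove_aucs[feat] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
# the larger the AUC drop when a feature is removed, the more impactful it is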
-
save_result()[source]¶ Save the results to a directory.
- Parameters
output_dir (str or os.PathLike) – The directory to save relevant results to.
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
>>> fs.name = "2j1b_DR"
>>> fs.save_result()
-
tdub.features.create_parquet_files(qf_dir, out_dir=None, entrysteps=None, use_campaign_weight=False)[source]¶ Create slimmed and selected parquet files from ROOT files.
This function requires pyarrow.
- Parameters
qf_dir (str or os.PathLike) – Directory on which to run tdub.data.quick_files().
out_dir (str or os.PathLike, optional) – Directory to save the output files.
entrysteps (any, optional) – The entrysteps option forwarded to tdub.frames.iterative_selection().
use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product that forms the nominal weight.
Examples
>>> from tdub.features import create_parquet_files
>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", entrysteps="250 MB")
-
tdub.features.prepare_from_parquet(data_dir, region, nlo_method='DR', ttbar_frac=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, test_case_size=None)[source]¶ Prepare feature selection data from parquet files.
This function requires pyarrow.
- Parameters
data_dir (str or os.PathLike) – Directory where the parquet files live.
region (str or tdub.data.Region) – The region in which we're going to select features.
nlo_method (str) – The \(tW\) sample (DR or DS).
ttbar_frac (str or float, optional) – If not None, the fraction of \(t\bar{t}\) events to use; "auto" uses some sensible defaults to fit in memory: 0.70 for 2j2b and 0.60 for 2j1b.
weight_mean (float, optional) – Scale all weights such that the mean weight is this value (see the sketch after the example below); cannot be used with weight_scale.
weight_scale (float, optional) – Value to scale all weights by; cannot be used with weight_mean.
scale_sum_weights (bool) – Scale the sum of signal weights to equal the sum of background weights.
test_case_size (int, optional) – To perform a quick test on a subset of the data: for test_case_size=N we use N events from both signal and background; cannot be used with ttbar_frac.
- Returns
pandas.DataFrame – The dataframe containing the kinematic features.
numpy.ndarray – The labels array for the events.
numpy.ndarray – The weights array for the events.
Examples
>>> from tdub.features import prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
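The arithmetic implied by the weight_mean and weight_scale options can be illustrated directly (a sketch of the intended behavior, not the function's internal code):

import numpy as np

weights = np.array([0.2, 0.5, 1.3, 2.0])

# weight_mean=1.0: rescale so the mean of the weights is exactly 1.0
w_mean = weights * (1.0 / weights.mean())
assert np.isclose(w_mean.mean(), 1.0)

# weight_scale=2.0: simply multiply every weight by the given constant
w_scaled = weights * 2.0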