tdub.ml_train

A module for handling training.

Class Summary

ResponseHistograms(response_type, model, …)

Create and use histogrammed model response information.

SingleTrainingSummary(*[, auc, ks_test_sig, …])

Describes some properties of a single training.

Function Summary

persist_prepared_data(out_dir, df, labels, …)

Persist prepared data to disk.

prepare_from_root(sig_files, bkg_files, region)

Prepare the data to train in a region with signal and background ROOT files.

folded_training(df, labels, weights, params, …)

Execute a folded training.

lgbm_gen_classifier([train_axes])

Create a classifier using LightGBM.

lgbm_train_classifier(clf, X_train, y_train, …)

Train an LGBMClassifier.

single_training(df, labels, weights, …[, …])

Execute a single training with some parameters.

sklearn_gen_classifier([…])

Create a classifier using scikit-learn.

sklearn_train_classifier(clf, X_train, …)

Train a Scikit-learn classifier.

tdub_train_axes([learning_rate, max_depth, …])

Construct a dictionary of the default tdub training tune.

Reference

class tdub.ml_train.ResponseHistograms(response_type, model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30)[source]

Create and use histogrammed model response information.

Parameters
  • response_type (str) –

    Models provide different types of response, like a raw prediction or a probability of signal. This class supports:

    • "predict" (for LGBM),

    • "decision_function" (for Scikit-learn),

    • "proba" (for either).

  • model (BaseEstimator) – The trained model.

  • X_train (array_like) – Training data feature matrix.

  • X_test (array_like) – Testing data feature matrix.

  • y_train (array_like) – Training data labels.

  • y_test (array_like) – Testing data labels.

  • w_train (array_like) – Training data event weights.

  • w_test (array_like) – Testing data event weights.

  • nbins (int) – Number of bins to use.

draw(ax=None, xlabel=None)[source]

Draw the response histograms.

Parameters
  • ax (matplotlib.axes.Axes, optional) – Predefined matplotlib axes to use.

  • xlabel (str, optional) – Override the automated xlabel definition.

Returns

  • matplotlib.figure.Figure – The matplotlib figure object.

  • matplotlib.axes.Axes – The matplotlib axes object.

property ks_bkg_pval

Two-sample binned KS p-value for background.

Type

float

property ks_bkg_test

Two-sample binned KS test statistic for background.

Type

float

property ks_sig_pval

Two-sample binned KS p-value for signal.

Type

float

property ks_sig_test

Two-sample binned KS test statistic for signal.

Type

float
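
Examples

A minimal usage sketch; model is assumed to be an already trained classifier, and the X/y/w arrays are assumed to come from an existing train/test split (e.g. via sklearn.model_selection.train_test_split):

>>> from tdub.ml_train import ResponseHistograms
>>> hists = ResponseHistograms(
...     "proba", model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30
... )
>>> fig, ax = hists.draw()
>>> hists.ks_sig_pval  # two-sample binned KS p-value for signal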

class tdub.ml_train.SingleTrainingSummary(*, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0, ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs)[source]

Describes some properties of a single training.

Parameters
  • auc (float) – The AUC value for the model.

  • ks_test_sig (float) – The binned KS test value for signal.

  • ks_pvalue_sig (float) – The binned KS test p-value for signal.

  • ks_test_bkg (float) – The binned KS test value for background.

  • ks_pvalue_bkg (float) – The binned KS test p-value for background.

  • kwargs (dict) – Currently unused.

auc

The AUC value for the model.

Type

float

ks_test_sig

The binned KS test value for signal.

Type

float

ks_pvalue_sig

The binned KS test p-value for signal.

Type

float

ks_test_bkg

The binned KS test value for background.

Type

float

ks_pvalue_bkg

The binned KS test p-value for background.

Type

float
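
Examples

Instances are typically produced by single_training() rather than constructed by hand. A sketch of reading the stored metrics, assuming df, labels, and weights come from prepare_from_root():

>>> from tdub.ml_train import single_training, tdub_train_axes
>>> summary = single_training(df, labels, weights, tdub_train_axes(), "training_output")
>>> summary.auc, summary.ks_pvalue_sig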

tdub.ml_train.persist_prepared_data(out_dir, df, labels, weights)[source]

Persist prepared data to disk.

The product of tdub.ml_train.prepare_from_root() is easily persisted to disk; this function performs that task. If the same prepared data is going to be used for multiple training executions, one can save CPU cycles by saving the prepared data instead of starting further upstream with the ROOT ntuples.

Parameters
  • out_dir (str or os.PathLike) – Directory to save the prepared data.

  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, persist_prepared_data
>>> qfiles = quick_files("/path/to/data")
>>> df, y, w = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> persist_prepared_data("/path/to/output/data", df, y, w)
tdub.ml_train.prepare_from_root(sig_files, bkg_files, region, branches=None, override_selection=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, use_campaign_weight=False, use_tptrw=False, use_trrw=False, test_case_size=None, bkg_sample_frac=None)[source]

Prepare the data to train in a region with signal and background ROOT files.

Parameters
  • sig_files (list(str)) – List of signal ROOT files.

  • bkg_files (list(str)) – List of background ROOT files.

  • region (Region or str) – Region where we’re going to perform the training.

  • branches (list(str), optional) – Override the list of features (usually defined by the region).

  • override_selection (str, optional) – Manual selection string to apply to the dataset (this will override the region defined selection).

  • weight_mean (float, optional) – Scale all weights such that the mean weight is this value. Cannot be used with weight_scale.

  • weight_scale (float, optional) – Value to scale all weights by, cannot be used with weight_mean.

  • scale_sum_weights (bool) – Scale sum of weights of signal to be sum of weights of background.

  • use_campaign_weight (bool) – See the parameter description for tdub.frames.iterative_selection().

  • use_tptrw (bool) – Apply the top pt reweighting factor.

  • use_trrw (bool) – Apply the top recursive reweighting factor.

  • test_case_size (int, optional) – Prepare a small test case dataset using this many training and testing samples.

  • bkg_sample_frac (float, optional) – Sample a fraction of the background data.

Returns

  • pandas.DataFrame – The prepared feature matrix.

  • numpy.ndarray – Event labels (1 for signal; 0 for background).

  • numpy.ndarray – Event weights.

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
tdub.ml_train.folded_training(df, labels, weights, params, fit_kw, output_dir, region, kfold_kw=None)[source]

Execute a folded training.

Train a lightgbm.LGBMClassifier model using \(k\)-fold cross validation with the given input data and parameters. The models resulting from the training (and other important training information) are saved to output_dir. The entries in the kfold_kw argument are forwarded to the sklearn.model_selection.KFold class, which generates the folds. The default arguments that we use are (random_state is controlled by the tdub.config module):

  • n_splits: 3

  • shuffle: True

Parameters
  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

  • params (dict(str, Any)) – Dictionary of lightgbm.LGBMClassifier parameters.

  • fit_kw (dict(str, Any)) – Extra keyword arguments passed to the fit call.

  • output_dir (str or os.PathLike) – Directory to save results of training.

  • region (Region or str) – Region where the training is performed.

  • kfold_kw (dict, optional) – Arguments forwarded to sklearn.model_selection.KFold.

Returns

Negative mean area under the ROC curve (AUC).

Return type

float

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> from tdub.ml_train import folded_training
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> params = dict(
...     boosting_type="gbdt",
...     num_leaves=42,
...     learning_rate=0.05,
...     reg_alpha=0.2,
...     reg_lambda=0.8,
...     max_depth=5,
... )
>>> folded_training(
...     df,
...     labels,
...     weights,
...     params,
...     {"verbose": 20},
...     "/path/to/train/output",
...     "2j2b",
...     kfold_kw={"n_splits": 5, "shuffle": True},
... )
tdub.ml_train.lgbm_gen_classifier(train_axes=None, **clf_params)[source]

Create a classifier using LightGBM.

Parameters
  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns

The classifier.

Return type

lightgbm.LGBMClassifier
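
Examples

A short sketch of generating a classifier from the default tdub tune; the n_estimators value here is an illustrative assumption, passed through to the lightgbm.LGBMClassifier constructor:

>>> from tdub.ml_train import lgbm_gen_classifier, tdub_train_axes
>>> clf = lgbm_gen_classifier(tdub_train_axes(), n_estimators=500)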

tdub.ml_train.lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2, early_stopping_rounds=10, **fit_params)[source]

Train an LGBMClassifier.

Parameters
  • clf (lightgbm.LGBMClassifier) – The classifier.

  • X_train (array_like) – Training events matrix.

  • y_train (array_like) – Training event labels.

  • w_train (array_like) – Training event weights.

  • validation_fraction (float) – Fraction of training events to use in the validation set.

  • early_stopping_rounds (int) – Number of early stopping rounds to use in training.

  • fit_params (keyword arguments) – Extra keyword arguments passed to the fit call.

Returns

The same classifier object passed to the function.

Return type

lightgbm.LGBMClassifier
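
Examples

A sketch, assuming X_train, y_train, and w_train come from an existing train/test split of the prepared data:

>>> from tdub.ml_train import lgbm_gen_classifier, lgbm_train_classifier, tdub_train_axes
>>> clf = lgbm_gen_classifier(tdub_train_axes())
>>> clf = lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2)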

tdub.ml_train.single_training(df, labels, weights, train_axes, output_dir, test_size=0.4, early_stopping_rounds=None, extra_summary_entries=None, use_sklearn=False, use_xgboost=False, save_lgbm_txt=False)[source]

Execute a single training with some parameters.

The model and some useful information (mostly plots) are saved to output_dir.

Parameters
  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

  • train_axes (dict(str, Any)) – Dictionary of parameters defining the tdub train axes.

  • output_dir (str or os.PathLike) – Directory to save results of training.

  • test_size (float) – Test size for splitting into training and testing sets.

  • early_stopping_rounds (int, optional) – Number of rounds to have no improvement for stopping training.

  • extra_summary_entries (dict, optional) – Extra entries to save in the JSON output summary.

  • use_sklearn (bool) – Use Scikit-learn’s HistGradientBoostingClassifier.

  • use_xgboost (bool) – Use XGBoost’s XGBClassifier.

  • save_lgbm_txt (bool) – Save fitted LGBM model to text file (ignored if either use_sklearn or use_xgboost is True).

Returns

Useful information about the training.

Return type

SingleTrainingSummary

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, single_training, tdub_train_axes
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> train_axes = tdub_train_axes(
...     learning_rate=0.05,
...     max_depth=5,
... )
>>> single_training(
...     df,
...     labels,
...     weights,
...     train_axes,
...     "training_output",
... )
tdub.ml_train.sklearn_gen_classifier(early_stopping_rounds=10, validation_fraction=0.2, train_axes=None, **clf_params)[source]

Create a classifier using scikit-learn.

This uses Scikit-learn’s sklearn.ensemble.HistGradientBoostingClassifier.

The constructor arguments define the early stopping behavior; extra keyword arguments are passed to the classifier initialization.

Parameters
  • early_stopping_rounds (int) – Passed as the n_iter_no_change argument to scikit-learn’s HistGradientBoostingClassifier.

  • validation_fraction (float) – Passed to the validation_fraction argument in scikit-learn’s HistGradientBoostingClassifier.

  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns

The classifier.

Return type

sklearn.ensemble.HistGradientBoostingClassifier
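
Examples

A sketch of generating a Scikit-learn classifier with the default tdub tune; the early stopping settings shown are illustrative:

>>> from tdub.ml_train import sklearn_gen_classifier, tdub_train_axes
>>> clf = sklearn_gen_classifier(
...     early_stopping_rounds=15, validation_fraction=0.2, train_axes=tdub_train_axes()
... )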

tdub.ml_train.sklearn_train_classifier(clf, X_train, y_train, w_train, **fit_params)[source]

Train a Scikit-learn classifier.

Parameters
  • clf (sklearn.ensemble.HistGradientBoostingClassifier) – The classifier.

  • X_train (array_like) – Training events matrix.

  • y_train (array_like) – Training event labels.

  • w_train (array_like) – Training event weights.

  • fit_params (kwargs) – Extra keyword arguments passed to the fit call.

Returns

The same classifier object passed to the function.

Return type

sklearn.ensemble.HistGradientBoostingClassifier
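
Examples

A sketch, assuming X_train, y_train, and w_train come from an existing train/test split:

>>> from tdub.ml_train import sklearn_gen_classifier, sklearn_train_classifier, tdub_train_axes
>>> clf = sklearn_gen_classifier(train_axes=tdub_train_axes())
>>> clf = sklearn_train_classifier(clf, X_train, y_train, w_train)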

tdub.ml_train.tdub_train_axes(learning_rate=0.1, max_depth=5, min_child_samples=50, num_leaves=31, reg_lambda=0.0, **kwargs)[source]

Construct a dictionary of the default tdub training tune.

Extra keyword arguments are swallowed but never used.

Parameters
  • learning_rate (float) – Learning rate for a classifier.

  • max_depth (int) – Max depth for a classifier.

  • min_child_samples (int) – Min child samples for a classifier.

  • num_leaves (int) – Num leaves for a classifier.

  • reg_lambda (float) – Lambda regularization (L2 regularization).

Returns

The argument names and values.

Return type

dict(str, Any)
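
Examples

A sketch: per the signature above, overridden values should appear in the returned dictionary alongside the remaining defaults:

>>> from tdub.ml_train import tdub_train_axes
>>> axes = tdub_train_axes(learning_rate=0.05)
>>> axes["learning_rate"]
0.05
>>> axes["max_depth"]
5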