tdub.ml_train

A module for handling training.

Class Summary

ResponseHistograms(response_type, model, …)

Create and use histogrammed model response information.

SingleTrainingSummary(*[, auc, ks_test_sig, …])

Describes some properties of a single training.

Function Summary

persist_prepared_data(out_dir, df, labels, …)

Persist prepared data to disk.

prepare_from_root(sig_files, bkg_files, region)

Prepare the data to train in a region with signal and background ROOT files.

folded_training(df, labels, weights, params, …)

Execute a folded training.

lgbm_gen_classifier([train_axes])

Create a classifier using LightGBM.

lgbm_train_classifier(clf, X_train, y_train, …)

Train an LGBMClassifier.

single_training(df, labels, weights, …[, …])

Execute a single training with some parameters.

sklearn_gen_classifier([…])

Create a classifier using scikit-learn.

sklearn_train_classifier(clf, X_train, …)

Train a Scikit-learn classifier.

tdub_train_axes([learning_rate, max_depth, …])

Construct a dictionary of the default tdub training tune.

Reference

class tdub.ml_train.ResponseHistograms(response_type, model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30)[source]

Create and use histogrammed model response information.

Parameters
  • response_type (str) –

    Models provide different types of response, like a raw prediction or a probability of signal. This class supports:

    • "predict" (for LGBM),

    • "decision_function" (for Scikit-learn),

    • "proba" (for either).

  • model (BaseEstimator) – The trained model.

  • X_train (array_like) – Training data feature matrix.

  • X_test (array_like) – Testing data feature matrix.

  • y_train (array_like) – Training data labels.

  • y_test (array_like) – Testing data labels.

  • w_train (array_like) – Training data event weights.

  • w_test (array_like) – Testing data event weights.

  • nbins (int) – Number of bins to use.

draw(ax=None, xlabel=None)[source]

Draw the response histograms.

Parameters
  • ax (matplotlib.axes.Axes, optional) – Predefined matplotlib axes to use.

  • xlabel (str, optional) – Override the automated xlabel definition.

Returns

  • matplotlib.figure.Figure – The matplotlib figure object.

  • matplotlib.axes.Axes – The matplotlib axes object.

property ks_bkg_pval

Two-sample binned KS p-value for background.

Type

float

property ks_bkg_test

Two-sample binned KS test statistic for background.

Type

float

property ks_sig_pval

Two-sample binned KS p-value for signal.

Type

float

property ks_sig_test

Two-sample binned KS test statistic for signal.

Type

float
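
Examples

A minimal usage sketch; model is assumed to be an already trained classifier, and the X/y/w arrays are assumed to come from an existing train/test split (e.g. via sklearn.model_selection.train_test_split):

>>> from tdub.ml_train import ResponseHistograms
>>> hists = ResponseHistograms(
...     "proba", model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30
... )
>>> fig, ax = hists.draw()
>>> hists.ks_sig_pval  # two-sample binned KS p-value for signal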

class tdub.ml_train.SingleTrainingSummary(*, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0, ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs)[source]

Describes some properties of a single training.

Parameters
  • auc (float) – The AUC value for the model.

  • ks_test_sig (float) – The binned KS test value for signal.

  • ks_pvalue_sig (float) – The binned KS test p-value for signal.

  • ks_test_bkg (float) – The binned KS test value for background.

  • ks_pvalue_bkg (float) – The binned KS test p-value for background.

  • kwargs (dict) – Currently unused.

auc

The AUC value for the model.

Type

float

ks_test_sig

The binned KS test value for signal.

Type

float

ks_pvalue_sig

The binned KS test p-value for signal.

Type

float

ks_test_bkg

The binned KS test value for background.

Type

float

ks_pvalue_bkg

The binned KS test p-value for background.

Type

float
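
Examples

Instances are typically produced by single_training() rather than constructed by hand. A sketch of reading the stored metrics, assuming df, labels, and weights come from prepare_from_root():

>>> from tdub.ml_train import single_training, tdub_train_axes
>>> summary = single_training(df, labels, weights, tdub_train_axes(), "training_output")
>>> summary.auc, summary.ks_pvalue_sig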

tdub.ml_train.persist_prepared_data(out_dir, df, labels, weights)[source]

Persist prepared data to disk.

The product of tdub.ml_train.prepare_from_root() is easily persisted to disk; this function performs that task. If the same prepared data is going to be used for multiple training executions, one can save CPU cycles by saving the prepared data instead of starting further upstream with the ROOT ntuples.

Parameters
  • out_dir (str or os.PathLike) – Directory to save the prepared data.

  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, persist_prepared_data
>>> qfiles = quick_files("/path/to/data")
>>> df, y, w = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> persist_prepared_data("/path/to/output/data", df, y, w)
tdub.ml_train.prepare_from_root(sig_files, bkg_files, region, branches=None, override_selection=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, use_campaign_weight=False, use_tptrw=False, use_trrw=False, test_case_size=None, bkg_sample_frac=None)[source]

Prepare the data to train in a region with signal and background ROOT files.

Parameters
  • sig_files (list(str)) – List of signal ROOT files.

  • bkg_files (list(str)) – List of background ROOT files.

  • region (Region or str) – Region where we’re going to perform the training.

  • branches (list(str), optional) – Override the list of features (usually defined by the region).

  • override_selection (str, optional) – Manual selection string to apply to the dataset (this will override the region defined selection).

  • weight_mean (float, optional) – Scale all weights such that the mean weight is this value. Cannot be used with weight_scale.

  • weight_scale (float, optional) – Value to scale all weights by, cannot be used with weight_mean.

  • scale_sum_weights (bool) – Scale sum of weights of signal to be sum of weights of background.

  • use_campaign_weight (bool) – See the parameter description for tdub.frames.iterative_selection().

  • use_tptrw (bool) – Apply the top pt reweighting factor.

  • use_trrw (bool) – Apply the top recursive reweighting factor.

  • test_case_size (int, optional) – Prepare a small test case dataset using this many training and testing samples.

  • bkg_sample_frac (float, optional) – Sample a fraction of the background data.

Returns

  • pandas.DataFrame – The prepared feature matrix.

  • numpy.ndarray – Event labels (1 for signal; 0 for background).

  • numpy.ndarray – Event weights.

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
tdub.ml_train.folded_training(df, labels, weights, params, fit_kw, output_dir, region, kfold_kw=None)[source]

Execute a folded training.

Train a lightgbm.LGBMClassifier model using \(k\)-fold cross validation with the given input data and parameters. The models resulting from the training (and other important training information) are saved to output_dir. The entries in the kfold_kw argument are forwarded to the sklearn.model_selection.KFold class, which generates the folds. The default arguments that we use are (random_state is controlled by the tdub.config module):

  • n_splits: 3

  • shuffle: True

Parameters
  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

  • params (dict(str, Any)) – Dictionary of lightgbm.LGBMClassifier parameters.

  • fit_kw (dict(str, Any)) – Extra keyword arguments passed to the fit call.

  • output_dir (str or os.PathLike) – Directory to save results of training.

  • region (Region or str) – Region where the training is performed.

  • kfold_kw (dict, optional) – Arguments forwarded to sklearn.model_selection.KFold.

Returns

Negative mean area under the ROC curve (AUC).

Return type

float

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> from tdub.ml_train import folded_training
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> params = dict(
...     boosting_type="gbdt",
...     num_leaves=42,
...     learning_rate=0.05,
...     reg_alpha=0.2,
...     reg_lambda=0.8,
...     max_depth=5,
... )
>>> folded_training(
...     df,
...     labels,
...     weights,
...     params,
...     {"verbose": 20},
...     "/path/to/train/output",
...     "2j2b",
...     kfold_kw={"n_splits": 5, "shuffle": True},
... )
tdub.ml_train.lgbm_gen_classifier(train_axes=None, **clf_params)[source]

Create a classifier using LightGBM.

Parameters
  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns

The classifier.

Return type

lightgbm.LGBMClassifier
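
Examples

A short sketch of generating a classifier from the default tdub tune; the n_estimators value here is an illustrative assumption, passed through to the lightgbm.LGBMClassifier constructor:

>>> from tdub.ml_train import lgbm_gen_classifier, tdub_train_axes
>>> clf = lgbm_gen_classifier(tdub_train_axes(), n_estimators=500)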

tdub.ml_train.lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2, early_stopping_rounds=10, **fit_params)[source]

Train an LGBMClassifier.

Parameters
  • clf (lightgbm.LGBMClassifier) – The classifier.

  • X_train (array_like) – Training events matrix.

  • y_train (array_like) – Training event labels.

  • w_train (array_like) – Training event weights.

  • validation_fraction (float) – Fraction of training events to use in the validation set.

  • early_stopping_rounds (int) – Number of early stopping rounds to use in training.

  • fit_params (keyword arguments) – Extra keyword arguments passed to the fit call.

Returns

The same classifier object passed to the function.

Return type

lightgbm.LGBMClassifier
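
Examples

A sketch, assuming X_train, y_train, and w_train come from an existing train/test split of the prepared data:

>>> from tdub.ml_train import lgbm_gen_classifier, lgbm_train_classifier, tdub_train_axes
>>> clf = lgbm_gen_classifier(tdub_train_axes())
>>> clf = lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2)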

tdub.ml_train.single_training(df, labels, weights, train_axes, output_dir, test_size=0.4, early_stopping_rounds=None, extra_summary_entries=None, use_sklearn=False, use_xgboost=False, save_lgbm_txt=False)[source]

Execute a single training with some parameters.

The model and some useful information (mostly plots) are saved to output_dir.

Parameters
  • df (pandas.DataFrame) – Feature matrix in dataframe format.

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).

  • weights (numpy.ndarray) – Event weights.

  • train_axes (dict(str, Any)) – Dictionary of parameters defining the tdub train axes.

  • output_dir (str or os.PathLike) – Directory to save results of training.

  • test_size (float) – Test size for splitting into training and testing sets.

  • early_stopping_rounds (int, optional) – Number of rounds to have no improvement for stopping training.

  • extra_summary_entries (dict, optional) – Extra entries to save in the JSON output summary.

  • use_sklearn (bool) – Use Scikit-learn’s HistGradientBoostingClassifier.

  • use_xgboost (bool) – Use XGBoost’s XGBClassifier.

  • save_lgbm_txt (bool) – Save fitted LGBM model to text file (ignored if either use_sklearn or use_xgboost is True).

Returns

Useful information about the training.

Return type

SingleTrainingSummary

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, single_training, tdub_train_axes
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> train_axes = tdub_train_axes(
...     learning_rate=0.05,
...     max_depth=5,
... )
>>> single_training(
...     df,
...     labels,
...     weights,
...     train_axes,
...     "training_output",
... )
tdub.ml_train.sklearn_gen_classifier(early_stopping_rounds=10, validation_fraction=0.2, train_axes=None, **clf_params)[source]

Create a classifier using scikit-learn.

This uses Scikit-learn’s sklearn.ensemble.HistGradientBoostingClassifier.

The constructor arguments define the early stopping behavior; extra keyword arguments are passed to the classifier initialization.

Parameters
  • early_stopping_rounds (int) – Passed as the n_iter_no_change argument to scikit-learn’s HistGradientBoostingClassifier.

  • validation_fraction (float) – Passed to the validation_fraction argument in scikit-learn’s HistGradientBoostingClassifier.

  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns

The classifier.

Return type

sklearn.ensemble.HistGradientBoostingClassifier
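
Examples

A sketch of generating a Scikit-learn classifier with the default tdub tune; the early stopping settings shown are illustrative:

>>> from tdub.ml_train import sklearn_gen_classifier, tdub_train_axes
>>> clf = sklearn_gen_classifier(
...     early_stopping_rounds=15, validation_fraction=0.2, train_axes=tdub_train_axes()
... )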

tdub.ml_train.sklearn_train_classifier(clf, X_train, y_train, w_train, **fit_params)[source]

Train a Scikit-learn classifier.

Parameters
  • clf (sklearn.ensemble.HistGradientBoostingClassifier) – The classifier.

  • X_train (array_like) – Training events matrix.

  • y_train (array_like) – Training event labels.

  • w_train (array_like) – Training event weights.

  • fit_params (kwargs) – Extra keyword arguments passed to the fit call.

Returns

The same classifier object passed to the function.

Return type

sklearn.ensemble.HistGradientBoostingClassifier
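
Examples

A sketch, assuming X_train, y_train, and w_train come from an existing train/test split:

>>> from tdub.ml_train import sklearn_gen_classifier, sklearn_train_classifier, tdub_train_axes
>>> clf = sklearn_gen_classifier(train_axes=tdub_train_axes())
>>> clf = sklearn_train_classifier(clf, X_train, y_train, w_train)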

tdub.ml_train.tdub_train_axes(learning_rate=0.1, max_depth=5, min_child_samples=50, num_leaves=31, reg_lambda=0.0, **kwargs)[source]

Construct a dictionary of the default tdub training tune.

Extra keyword arguments are swallowed but never used.

Parameters
  • learning_rate (float) – Learning rate for a classifier.

  • max_depth (int) – Max depth for a classifier.

  • min_child_samples (int) – Min child samples for a classifier.

  • num_leaves (int) – Num leaves for a classifier.

  • reg_lambda (float) – Lambda regularization (L2 regularization).

Returns

The argument names and values.

Return type

dict(str, Any)
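
Examples

A sketch: per the signature above, overridden values should appear in the returned dictionary alongside the remaining defaults:

>>> from tdub.ml_train import tdub_train_axes
>>> axes = tdub_train_axes(learning_rate=0.05)
>>> axes["learning_rate"]
0.05
>>> axes["max_depth"]
5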