tdub.ml_train
A module for handling training.
Class Summary
ResponseHistograms | Create and use histogrammed model response information.
SingleTrainingSummary | Describes some properties of a single training.
Function Summary
persist_prepared_data | Persist prepared data to disk.
prepare_from_root | Prepare the data to train in a region with signal and background ROOT files.
folded_training | Execute a folded training.
lgbm_gen_classifier | Create a classifier using LightGBM.
lgbm_train_classifier | Train a LGBMClassifier.
single_training | Execute a single training with some parameters.
sklearn_gen_classifier | Create a classifier using scikit-learn.
sklearn_train_classifier | Train a Scikit-learn classifier.
tdub_train_axes | Construct a dictionary of default tdub training tune.
Reference
class tdub.ml_train.ResponseHistograms(response_type, model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30)
Create and use histogrammed model response information.
- Parameters
response_type (str) – Models provide different types of response, like a raw prediction or a probability of signal. This class supports:
"predict" (for LGBM),
"decision_function" (for Scikit-learn),
"proba" (for either).
model (BaseEstimator) – The trained model.
X_train (array_like) – Training data feature matrix.
X_test (array_like) – Testing data feature matrix.
y_train (array_like) – Training data labels.
y_test (array_like) – Testing data labels.
w_train (array_like) – Training data event weights.
w_test (array_like) – Testing data event weights.
nbins (int) – Number of bins to use.
draw(ax=None, xlabel=None)
Draw the response histograms.
- Parameters
ax (matplotlib.axes.Axes, optional) – Predefined matplotlib axes to use.
xlabel (str, optional) – Override the automated xlabel definition.
- Returns
matplotlib.figure.Figure – The matplotlib figure object.
matplotlib.axes.Axes – The matplotlib axes object.
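The histogramming step can be illustrated with plain NumPy. The following is only a sketch under assumptions: `response_histograms` is a hypothetical helper, not part of tdub, and it assumes the training and testing responses are binned on shared edges with event weights and normalized to unit area. The class itself may bin or normalize differently.

```python
import numpy as np


def response_histograms(r_train, r_test, w_train, w_test, nbins=30):
    """Histogram train and test responses on shared bin edges (hypothetical helper)."""
    r_train = np.asarray(r_train, dtype=float)
    r_test = np.asarray(r_test, dtype=float)
    # Use one common range so the two histograms are directly comparable.
    lo = min(r_train.min(), r_test.min())
    hi = max(r_train.max(), r_test.max())
    edges = np.linspace(lo, hi, nbins + 1)
    # density=True normalizes each weighted histogram to unit area.
    h_train, _ = np.histogram(r_train, bins=edges, weights=w_train, density=True)
    h_test, _ = np.histogram(r_test, bins=edges, weights=w_test, density=True)
    return h_train, h_test, edges


# Toy responses standing in for model output (assumed, not real tdub data).
rng = np.random.default_rng(0)
r_train = rng.uniform(0.0, 1.0, size=1000)
r_test = rng.uniform(0.0, 1.0, size=500)
h_train, h_test, edges = response_histograms(
    r_train, r_test, np.ones(1000), np.ones(500), nbins=30
)
```

Sharing the bin edges between the two samples is what makes a bin-by-bin comparison of training and testing responses meaningful.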
class tdub.ml_train.SingleTrainingSummary(*, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0, ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs)
Describes some properties of a single training.
- Parameters
auc (float) – the AUC value for the model
ks_test_sig (float) – the binned KS test value for signal
ks_pvalue_sig (float) – the binned KS test p-value for signal
ks_test_bkg (float) – the binned KS test value for background
ks_pvalue_bkg (float) – the binned KS test p-value for background
kwargs (dict) – currently unused
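The keyword-only signature with -1.0 defaults can be mirrored in a small stand-in class. This is a sketch only: `TrainingSummarySketch` and its `as_dict` method are hypothetical names introduced for illustration, and the real class may store or expose its values differently.

```python
class TrainingSummarySketch:
    """Hypothetical stand-in mirroring the documented keyword-only signature."""

    def __init__(self, *, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0,
                 ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs):
        self.auc = auc
        self.ks_test_sig = ks_test_sig
        self.ks_pvalue_sig = ks_pvalue_sig
        self.ks_test_bkg = ks_test_bkg
        self.ks_pvalue_bkg = ks_pvalue_bkg
        # Extra keyword arguments are accepted but ignored, as documented.

    def as_dict(self):
        """Return the stored values as a plain dictionary (assumed convenience)."""
        return dict(vars(self))


summary = TrainingSummarySketch(auc=0.81, ks_test_sig=0.02, unused_entry="ignored")
summary_dict = summary.as_dict()
```

A value of -1.0 then marks a quantity that was never filled, since AUC and KS values are non-negative.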
tdub.ml_train.persist_prepared_data(out_dir, df, labels, weights)
Persist prepared data to disk.
The product of tdub.ml_train.prepare_from_root() is easily persistable to disk; this function performs that task. If the same prepared data is going to be used for multiple training executions, one can save CPU cycles by saving the prepared data instead of starting further upstream with the ROOT ntuples.
- Parameters
out_dir (str or os.PathLike) – Directory to save output to.
df (pandas.DataFrame) – Prepared DataFrame object.
labels (numpy.ndarray) – Prepared labels.
weights (numpy.ndarray) – Prepared weights.
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, persist_prepared_data
>>> qfiles = quick_files("/path/to/data")
>>> df, y, w = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> persist_prepared_data("/path/to/output/data", df, y, w)
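The idea of saving prepared data once and reloading it for later trainings can be sketched with NumPy alone. This is a sketch under assumptions: the on-disk format used by persist_prepared_data is not described here, so the helpers `persist_arrays` and `load_arrays` and the file names below are hypothetical, and a plain array stands in for the DataFrame.

```python
import os
import tempfile

import numpy as np


def persist_arrays(out_dir, features, labels, weights):
    """Save prepared arrays to out_dir as .npy files (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "features.npy"), features)
    np.save(os.path.join(out_dir, "labels.npy"), labels)
    np.save(os.path.join(out_dir, "weights.npy"), weights)


def load_arrays(out_dir):
    """Load the arrays written by persist_arrays."""
    return tuple(
        np.load(os.path.join(out_dir, f"{name}.npy"))
        for name in ("features", "labels", "weights")
    )


features = np.arange(12.0).reshape(4, 3)
labels = np.array([1, 0, 1, 0])
weights = np.array([0.5, 1.5, 1.0, 2.0])

# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    persist_arrays(tmp, features, labels, weights)
    restored = load_arrays(tmp)
```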
tdub.ml_train.prepare_from_root(sig_files, bkg_files, region, branches=None, override_selection=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, use_campaign_weight=False, use_tptrw=False, use_trrw=False, test_case_size=None, bkg_sample_frac=None)
Prepare the data to train in a region with signal and background ROOT files.
- Parameters
region (Region or str) – Region where we’re going to perform the training.
branches (list(str), optional) – Override the list of features (usually defined by the region).
override_selection (str, optional) – Manual selection string to apply to the dataset (this will override the region defined selection).
weight_mean (float, optional) – Scale all weights such that the mean weight is this value. Cannot be used with weight_scale.
weight_scale (float, optional) – Value to scale all weights by, cannot be used with weight_mean.
scale_sum_weights (bool) – Scale sum of weights of signal to be sum of weights of background.
use_campaign_weight (bool) – See the parameter description for tdub.frames.iterative_selection().
use_tptrw (bool) – Apply the top pt reweighting factor.
use_trrw (bool) – Apply the top recursive reweighting factor.
test_case_size (int, optional) – Prepare a small test case dataset using this many training and testing samples.
bkg_sample_frac (float, optional) – Sample a fraction of the background data.
- Returns
pandas.DataFrame – Event feature matrix.
numpy.ndarray – Event labels (0 for background; 1 for signal).
numpy.ndarray – Event weights.
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
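The weight options described above (weight_mean, weight_scale, scale_sum_weights) can be illustrated with a short NumPy sketch. The helper `scale_weights` is hypothetical, and the order in which the adjustments are applied is an assumption; only the mutual exclusivity of weight_mean and weight_scale and the meaning of each option come from the parameter descriptions.

```python
import numpy as np


def scale_weights(labels, weights, weight_mean=None, weight_scale=None,
                  scale_sum_weights=True):
    """Apply the documented weight adjustments (hypothetical helper, assumed order)."""
    if weight_mean is not None and weight_scale is not None:
        raise ValueError("weight_mean and weight_scale cannot be used together")
    labels = np.asarray(labels)
    w = np.array(weights, dtype=float)  # copy so the input is not modified
    if scale_sum_weights:
        # Scale the signal weights so their sum equals the background sum.
        sig = labels == 1
        w[sig] *= w[~sig].sum() / w[sig].sum()
    if weight_mean is not None:
        # Rescale all weights so that their mean is weight_mean.
        w *= weight_mean / w.mean()
    if weight_scale is not None:
        w *= weight_scale
    return w


toy_labels = np.array([1, 1, 0, 0, 0])
toy_weights = np.array([1.0, 1.0, 2.0, 2.0, 2.0])
balanced = scale_weights(toy_labels, toy_weights)
unit_mean = scale_weights(toy_labels, toy_weights, weight_mean=1.0)
```

In the toy input the signal weights sum to 2 and the background weights to 6, so the signal weights are multiplied by 3 before any global rescaling.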
tdub.ml_train.folded_training(df, labels, weights, params, fit_kw, output_dir, region, kfold_kw=None)
Execute a folded training.
Train a lightgbm.LGBMClassifier model with \(k\)-fold cross validation on the given input data and parameters. The models resulting from the training (and other important training information) are saved to output_dir. The entries in the kfold_kw argument are forwarded to the sklearn.model_selection.KFold class for data preprocessing. The default arguments that we use are (random_state is controlled by the tdub.config module):
n_splits: 3
shuffle: True
- Parameters
df (pandas.DataFrame) – Feature matrix in dataframe format
labels (numpy.ndarray) – Event labels (1 for signal; 0 for background)
weights (numpy.ndarray) – Event weights
params (dict(str, Any)) – Dictionary of lightgbm.LGBMClassifier parameters
fit_kw (dict(str, Any)) – Dictionary of arguments forwarded to lightgbm.LGBMClassifier.fit().
output_dir (str or os.PathLike) – Directory to save results of training
region (str) – String representing the region
kfold_kw (optional dict(str, Any)) – Arguments passed to sklearn.model_selection.KFold
- Returns
Negative mean area under the ROC curve (AUC)
- Return type
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> from tdub.ml_train import folded_training
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> params = dict(
...     boosting_type="gbdt",
...     num_leaves=42,
...     learning_rate=0.05,
...     reg_alpha=0.2,
...     reg_lambda=0.8,
...     max_depth=5,
... )
>>> folded_training(
...     df,
...     labels,
...     weights,
...     params,
...     {"verbose": 20},
...     "/path/to/train/output",
...     "2j2b",
...     kfold_kw={"n_splits": 5, "shuffle": True},
... )
tdub.ml_train.lgbm_gen_classifier(train_axes=None, **clf_params)
Create a classifier using LightGBM.
- Parameters
- Returns
The classifier.
- Return type
tdub.ml_train.lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2, early_stopping_rounds=10, **fit_params)
Train a LGBMClassifier.
- Parameters
clf (lightgbm.LGBMClassifier) – The classifier
X_train (array_like) – Training events matrix
y_train (array_like) – Training event labels
w_train (array_like) – Training event weights
validation_fraction (float) – Fraction of training events to use in validation set.
early_stopping_rounds (int) – Number of early stopping rounds to use in training.
fit_params (keyword arguments) – Extra keyword arguments passed to the classifier.
- Returns
The same classifier object passed to the function
- Return type
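The role of validation_fraction can be sketched as carving a validation set out of the training events. This is only an illustration under assumptions: `split_for_validation` is a hypothetical helper, and how lgbm_train_classifier actually builds and passes its validation set is not stated here. The `eval_set` and `eval_sample_weight` keys match keyword arguments of lightgbm.LGBMClassifier.fit(), but LightGBM itself is not imported or called.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_for_validation(X, y, w, validation_fraction=0.2, random_state=0):
    """Split off a validation set and build fit keyword arguments (hypothetical)."""
    X_t, X_v, y_t, y_v, w_t, w_v = train_test_split(
        X, y, w, test_size=validation_fraction, random_state=random_state, shuffle=True
    )
    # Keyword arguments that a LightGBM fit call could receive (assumed usage).
    fit_kwargs = {"eval_set": [(X_v, y_v)], "eval_sample_weight": [w_v]}
    return (X_t, y_t, w_t), fit_kwargs


rng = np.random.default_rng(2)
X_all = rng.normal(size=(100, 3))
y_all = rng.integers(0, 2, size=100)
w_all = rng.uniform(0.5, 1.5, size=100)
(X_fit, y_fit, w_fit), fit_kwargs = split_for_validation(X_all, y_all, w_all)
```

Early stopping then monitors the metric on the held-out events and halts when it has not improved for the configured number of rounds.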
tdub.ml_train.single_training(df, labels, weights, train_axes, output_dir, test_size=0.4, early_stopping_rounds=None, extra_summary_entries=None, use_sklearn=False, use_xgboost=False, save_lgbm_txt=False)
Execute a single training with some parameters.
The model and some useful information (mostly plots) are saved to output_dir.
- Parameters
df (pandas.DataFrame) – Feature matrix in dataframe format
labels (numpy.ndarray) – Event labels (1 for signal; 0 for background)
weights (numpy.ndarray) – Event weights
train_axes (dict(str, Any)) – Dictionary of parameters defining the tdub train axes.
output_dir (str or os.PathLike) – Directory to save results of training
test_size (float) – Test size for splitting into training and testing sets
early_stopping_rounds (int, optional) – Number of rounds to have no improvement for stopping training.
extra_summary_entries (dict, optional) – Extra entries to save in the JSON output summary.
use_sklearn (bool) – Use Scikit-learn’s HistGradientBoostingClassifier.
use_xgboost (bool) – Use XGBoost’s XGBClassifier.
save_lgbm_txt (bool) – Save fitted LGBM model to text file (ignored if either use_sklearn or use_xgboost is True).
- Returns
Useful information about the training
- Return type
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, single_training, tdub_train_axes
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> train_axes = tdub_train_axes(
...     learning_rate=0.05,
...     max_depth=5,
... )
>>> single_training(
...     df,
...     labels,
...     weights,
...     train_axes,
...     "training_output",
... )
tdub.ml_train.sklearn_gen_classifier(early_stopping_rounds=10, validation_fraction=0.2, train_axes=None, **clf_params)
Create a classifier using scikit-learn.
This uses Scikit-learn's sklearn.ensemble.HistGradientBoostingClassifier. Early stopping rounds are defined through the constructor. Extra keyword arguments are passed to the classifier initialization.
- Parameters
early_stopping_rounds (int) – Passed as the n_iter_no_change argument to scikit-learn’s HistGradientBoostingClassifier.
validation_fraction (float) – Passed to the validation_fraction argument in scikit-learn’s HistGradientBoostingClassifier.
train_axes (dict[str, Any]) – Values of required tdub training parameters.
clf_params (kwargs) – Extra arguments passed to the constructor.
- Returns
The classifier.
- Return type
tdub.ml_train.sklearn_train_classifier(clf, X_train, y_train, w_train, **fit_params)
Train a Scikit-learn classifier.
- Parameters
clf (sklearn.ensemble.HistGradientBoostingClassifier) – The classifier
X_train (array_like) – Training events matrix
y_train (array_like) – Training event labels
w_train (array_like) – Training event weights
fit_params (kwargs) – Extra keyword arguments passed to the classifier.
- Returns
The same classifier object passed to the function.
- Return type
tdub.ml_train.tdub_train_axes(learning_rate=0.1, max_depth=5, min_child_samples=50, num_leaves=31, reg_lambda=0.0, **kwargs)
Construct a dictionary of default tdub training tune.
Extra keyword arguments are swallowed but never used.
- Parameters
- Returns
The argument names and values
- Return type
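The behavior described for tdub_train_axes (named defaults returned as a dictionary, extra keyword arguments swallowed) can be sketched in a few lines. `train_axes_sketch` is a hypothetical stand-in that copies the documented signature and default values; the exact form of the real return value is an assumption.

```python
def train_axes_sketch(learning_rate=0.1, max_depth=5, min_child_samples=50,
                      num_leaves=31, reg_lambda=0.0, **kwargs):
    """Return the named arguments as a dictionary (hypothetical stand-in)."""
    # kwargs is accepted so callers can pass a larger dictionary, but it is unused.
    return {
        "learning_rate": learning_rate,
        "max_depth": max_depth,
        "min_child_samples": min_child_samples,
        "num_leaves": num_leaves,
        "reg_lambda": reg_lambda,
    }


defaults = train_axes_sketch()
tuned = train_axes_sketch(learning_rate=0.05, max_depth=4, not_an_axis=True)
```

Swallowing unknown keywords lets a caller unpack a larger parameter dictionary into the function without first filtering it.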