tdub.ml_apply¶
A module for applying trained models
Class Summary¶
BaseTrainSummary | Base class for describing a completed training to apply to other data.
FoldedTrainSummary | Provides access to the properties of a folded training.
SingleTrainSummary | Provides access to the properties of a single result.
Function Summary¶
build_array | Get a NumPy array which is the response for all events in df.
Reference¶
-
class
tdub.ml_apply.
BaseTrainSummary
[source]¶ Base class for describing a completed training to apply to other data.
-
apply_to_dataframe
(df, column_name, do_query)[source]¶ Apply trained model(s) to events in a dataframe df.
All BaseTrainSummary classes must implement this function.
-
property
features
¶ Features used by the model.
-
parse_summary_json
(summary_file)[source]¶ Parse a training’s summary json file.
This populates the class properties with values, and the resulting dictionary is saved so that it is accessible via the summary property. Besides summary, the common class properties (which all BaseTrainSummary subclasses have by definition) are features, region, and selection_used. This function defines those, so any inheriting class that needs its own implementation to add further summary properties should call the super implementation of this method.
- Parameters
summary_file (os.PathLike) – The summary json file.
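The super-call pattern described above can be sketched as follows. This is a minimal illustration, not the tdub implementation: SketchBaseSummary and SketchFoldedSummary are hypothetical stand-ins, and the JSON key names (including n_folds) are assumptions chosen to match the documented property names.

```python
import json
import os
import tempfile


class SketchBaseSummary:
    """Hypothetical stand-in mirroring the documented common properties."""

    def parse_summary_json(self, summary_file):
        with open(summary_file) as f:
            self._summary = json.load(f)
        # Key names are assumptions made for this illustration.
        self._features = self._summary["features"]
        self._region = self._summary["region"]
        self._selection_used = self._summary["selection_used"]

    @property
    def summary(self):
        return self._summary

    @property
    def features(self):
        return self._features

    @property
    def region(self):
        return self._region

    @property
    def selection_used(self):
        return self._selection_used


class SketchFoldedSummary(SketchBaseSummary):
    """Hypothetical subclass that adds one extra summary property."""

    def parse_summary_json(self, summary_file):
        # Let the base class define the common properties first.
        super().parse_summary_json(summary_file)
        # Then add the subclass-specific property (hypothetical key).
        self._n_folds = self.summary["n_folds"]

    @property
    def n_folds(self):
        return self._n_folds


# Write a small summary file to a temporary directory and parse it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summary.json")
    with open(path, "w") as f:
        json.dump(
            {
                "features": ["pT_lep1", "mass_lep1jet1"],
                "region": "1j1b",
                "selection_used": "(njets == 1) & (nbjets == 1)",
                "n_folds": 3,
            },
            f,
        )
    ts = SketchFoldedSummary()
    ts.parse_summary_json(path)

print(ts.region, ts.features, ts.n_folds)
```

Calling the super implementation first keeps the common properties defined in one place, so the subclass only handles what is specific to it.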
-
property
region
¶ Region where the training was executed.
-
property
selection_used
¶ Numexpr selection used on the trained datasets.
-
property
summary
¶ Training summary dictionary from the training json.
-
-
class
tdub.ml_apply.
FoldedTrainSummary
(fold_output)[source]¶ Bases:
tdub.ml_apply.BaseTrainSummary
Provides access to the properties of a folded training.
- Parameters
fold_output (str) – Directory with the folded training output.
Examples
>>> from tdub.ml_apply import FoldedTrainSummary
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
-
apply_to_dataframe
(df, column_name='unnamed_response', do_query=False)[source]¶ Apply trained models to an arbitrary dataframe.
This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn’t already exist. If the dataframe is empty, this function does nothing.
- Parameters
df (pandas.DataFrame) – Dataframe to read and augment.
column_name (str) – Name to give the BDT response variable.
do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with training result; necessary if the dataframe hasn’t been pre-filtered.
Examples
>>> from tdub.ml_apply import FoldedTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_1j1b.apply_to_dataframe(df, do_query=True)
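The documented behavior (add the column if missing, do nothing for an empty dataframe, optionally query for the training region) can be sketched without a real training. Everything below is an assumption made for illustration: apply_folded and ConstantModel are hypothetical names, the fold responses are combined by a simple mean, and -9999.0 is an arbitrary placeholder for events outside the selection. The source does not say how tdub combines the folds.

```python
import numpy as np
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for a trained classifier."""

    def __init__(self, signal_probability):
        self.signal_probability = signal_probability

    def predict_proba(self, X):
        # Follow the scikit-learn convention: one row per event,
        # columns are (background, signal) probabilities.
        p = np.full(len(X), self.signal_probability)
        return np.column_stack([1.0 - p, p])


def apply_folded(df, models, features, selection,
                 column_name="unnamed_response", do_query=False):
    """Sketch of applying several fold models to a dataframe in place."""
    if df.shape[0] == 0:
        return  # an empty dataframe is left untouched
    if column_name not in df.columns:
        df[column_name] = -9999.0  # placeholder value (assumption)
    if do_query:
        mask = df.eval(selection).to_numpy()
    else:
        mask = np.ones(df.shape[0], dtype=bool)
    X = df.loc[mask, features].to_numpy()
    # Average the signal probability over the fold models (assumption).
    responses = [m.predict_proba(X)[:, 1] for m in models]
    df.loc[mask, column_name] = np.mean(responses, axis=0)


df = pd.DataFrame({"njets": [1, 2, 1], "nbjets": [1, 1, 1], "pt": [10.0, 20.0, 30.0]})
models = [ConstantModel(0.2), ConstantModel(0.4), ConstantModel(0.6)]
apply_folded(df, models, ["pt"], "(njets == 1) & (nbjets == 1)",
             column_name="bdt", do_query=True)
print(df)
```

With do_query=True only the events passing the selection receive a response; the other rows keep the placeholder value.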
-
property
folder
¶ Folding object used during training.
-
property
model0
¶ Model for the 0th fold.
-
property
model1
¶ Model for the 1st fold.
-
property
model2
¶ Model for the 2nd fold.
-
parse_summary_json
(summary_file)[source]¶ Parse a training’s summary json file.
- Parameters
summary_file (str or os.PathLike) – The summary json file.
-
class
tdub.ml_apply.
SingleTrainSummary
(training_output)[source]¶ Bases:
tdub.ml_apply.BaseTrainSummary
Provides access to the properties of a single result.
- Parameters
training_output (str) – Directory containing the training result.
Examples
>>> from tdub.ml_apply import SingleTrainSummary
>>> res_1j1b = SingleTrainSummary("/path/to/some_1j1b_training_outdir")
-
apply_to_dataframe
(df, column_name='unnamed_response', do_query=False)[source]¶ Apply trained model to an arbitrary dataframe.
This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn’t already exist. If the dataframe is empty, this function does nothing.
- Parameters
df (pandas.DataFrame) – Dataframe to read and augment.
column_name (str) – Name to give the BDT response variable.
do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with training result; necessary if the dataframe hasn’t been pre-filtered.
Examples
>>> from tdub.ml_apply import SingleTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_1j1b.apply_to_dataframe(df, do_query=True)
-
property
model
¶ Trained model.
-
tdub.ml_apply.
build_array
(summaries, df)[source]¶ Get a NumPy array which is the response for all events in df.
This uses the apply_to_dataframe() method of each summary in the list of summaries. The input dataframe is queried so that each summary is applied only to the events it is meant for. If the input dataframe is empty then an empty array is written to disk.
- Parameters
summaries (list(BaseTrainSummary)) – Sequence of training summaries to use.
df (pandas.DataFrame) – Dataframe of events to use to calculate the response.
Examples
Using folded summaries:
>>> from tdub.ml_apply import FoldedTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_2j1b = FoldedTrainSummary("/path/to/folded_training_2j1b")
>>> fr_2j2b = FoldedTrainSummary("/path/to/folded_training_2j2b")
>>> res = build_array([fr_1j1b, fr_2j1b, fr_2j2b], df)
Using single summaries:
>>> from tdub.ml_apply import SingleTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_2j1b = SingleTrainSummary("/path/to/single_training_2j1b")
>>> sr_2j2b = SingleTrainSummary("/path/to/single_training_2j2b")
>>> res = build_array([sr_1j1b, sr_2j1b, sr_2j2b], df)
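How several per-region summaries might combine into a single response array can be sketched as below. This is not the tdub implementation: SketchSummary and build_array_sketch are hypothetical stand-ins, the temporary column name is made up, each stand-in returns a constant response, and events matching no region keep an assumed placeholder of -9999.0.

```python
import numpy as np
import pandas as pd


class SketchSummary:
    """Hypothetical stand-in for a training summary tied to one region."""

    def __init__(self, selection, response):
        self.selection_used = selection
        self.response = response  # constant response, for illustration only

    def apply_to_dataframe(self, df, column_name="unnamed_response", do_query=False):
        if df.shape[0] == 0:
            return
        if column_name not in df.columns:
            df[column_name] = -9999.0  # placeholder value (assumption)
        if do_query:
            mask = df.eval(self.selection_used).to_numpy()
        else:
            mask = np.ones(df.shape[0], dtype=bool)
        df.loc[mask, column_name] = self.response


def build_array_sketch(summaries, df):
    """Sketch: one response per event, filled region by region."""
    if df.shape[0] == 0:
        return np.array([], dtype=np.float64)
    column_name = "_sketch_response"  # hypothetical temporary column
    work = df.copy()  # leave the caller's dataframe unmodified
    for summary in summaries:
        # Query so each summary only fills events from its own region.
        summary.apply_to_dataframe(work, column_name=column_name, do_query=True)
    return work[column_name].to_numpy()


df = pd.DataFrame({"njets": [1, 2, 3], "nbjets": [1, 1, 1]})
summaries = [SketchSummary("njets == 1", 0.1), SketchSummary("njets == 2", 0.9)]
res = build_array_sketch(summaries, df)
print(res)
```

Because every summary writes to the same column with do_query=True, the regions must be mutually exclusive for the result to be unambiguous; overlapping selections would let a later summary overwrite an earlier one.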