tdub

tdub is a Python project for handling some downstream steps in the ATLAS Run 2 \(tW\) inclusive cross section analysis. The project provides a simple command line interface for performing standard analysis tasks, including:

  • BDT feature selection and hyperparameter optimization.

  • Training BDT models on our Monte Carlo.

  • Applying trained BDT models to our data and Monte Carlo.

  • Generating plots from various raw sources (our ROOT files and classifier training output).

  • Generating plots from the output of TRExFitter.

For potentially finer-grained tasks the API is fully documented. The API mainly provides quick and easy access to pythonic representations (i.e. dataframes or NumPy arrays) of our datasets (which of course originate from ROOT files), modularized ML tasks, and a set of utilities tailored for interacting with our specific datasets.
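
For a quick sense of the API, here is a minimal sketch (paths hypothetical) that maps sample processes to files and builds a dataframe, using the functions documented below:

>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe
>>> qf = quick_files("/path/to/files")  # map sample processes to ROOT file lists
>>> df = raw_dataframe(qf["ttbar"])     # pandas DataFrame of the ttbar sample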

Click Based CLI

The command line interface provides a way to execute a handful of common tasks without touching any Python code. The CLI is implemented using click.

tdub

Top Level CLI function.

tdub [OPTIONS] COMMAND [ARGS]...

apply

Tasks to apply machine learning models to data.

tdub apply [OPTIONS] COMMAND [ARGS]...
all

Generate BDT response arrays for all ROOT files in DATADIR.

tdub apply all [OPTIONS] DATADIR ARRNAME OUTDIR WORKSPACE

Options

-f, --fold-results <fold_results>

fold output directories

-s, --single-results <single_results>

single result dirs

--and-submit

submit the condor jobs

Arguments

DATADIR

Required argument

ARRNAME

Required argument

OUTDIR

Required argument

WORKSPACE

Required argument
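
A hypothetical invocation using a folded training result (all paths illustrative):

$ tdub apply all /path/to/data bdt_response /path/to/arrays /path/to/condor_ws -f /path/to/fold_2j2b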

single

Generate BDT response array for INFILE and save to .npy file.

We generate the .npy files using either single training results (-s flag) or folded training results (-f flag).

tdub apply single [OPTIONS] INFILE ARRNAME OUTDIR

Options

-f, --fold-results <fold_results>

fold output directories

-s, --single-results <single_results>

single result dirs

Arguments

INFILE

Required argument

ARRNAME

Required argument

OUTDIR

Required argument
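
A hypothetical invocation using a folded training result (paths illustrative):

$ tdub apply single /path/to/file.root bdt_response /path/to/arrays -f /path/to/fold_2j2b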

misc

Tasks under a miscellaneous umbrella.

tdub misc [OPTIONS] COMMAND [ARGS]...
drdscomps

Generate plots comparing DR and DS (with BDT cuts shown).

tdub misc drdscomps [OPTIONS] DATADIR

Options

-o, --outdir <outdir>

Output directory.

--thesis

Flag for thesis label.

Arguments

DATADIR

Required argument

soverb

Get signal over background using data in DATADIR and a SELECTIONS file.

The format of the JSON entries should be “region”: “numexpr selection”.

tdub misc soverb [OPTIONS] DATADIR SELECTIONS

Options

-t, --use-tptrw

use top pt reweighting

Arguments

DATADIR

Required argument

SELECTIONS

Required argument
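
For example, a hypothetical selections.json (the region names and selections here follow the defaults documented in tdub.config) and invocation:

{"1j1b": "(reg1j1b == True) & (OS == True)",
 "2j2b": "(reg2j2b == True) & (OS == True)"}

$ tdub misc soverb /path/to/data selections.json -t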

rex

Tasks interacting with TRExFitter results.

tdub rex [OPTIONS] COMMAND [ARGS]...
grimpacts

Print summary of grouped impacts.

tdub rex grimpacts [OPTIONS] REX_DIR

Options

--tablefmt <tablefmt>

Format passed to tabulate.

--include-total

Include FullSyst entry

Arguments

REX_DIR

Required argument

impact

Generate impact plot from TRExFitter result.

tdub rex impact [OPTIONS] REX_DIR

Options

--thesis

Flag to use thesis label.

Arguments

REX_DIR

Required argument

impstabs

Generate impact stability tests based on rexpy output.

tdub rex impstabs [OPTIONS] HERWIG704 HERWIG713

Options

-o, --outdir <outdir>

Output directory.

Arguments

HERWIG704

Required argument

HERWIG713

Required argument

index

Generate index.html file for the workspace.

tdub rex index [OPTIONS] REX_DIR

Arguments

REX_DIR

Required argument

stabs

Generate stability tests based on rexpy output.

tdub rex stabs [OPTIONS] UMBRELLA

Options

-o, --outdir <outdir>

Output directory.

-t, --tests <tests>

Tests to run.

Arguments

UMBRELLA

Required argument

stacks

Generate plots from TRExFitter result.

tdub rex stacks [OPTIONS] REX_DIR

Options

--chisq, --no-chisq

Do or don’t print chi-square information.

--internal, --no-internal

Do or don’t include internal label.

--thesis, --no-thesis

Use thesis label

--png, --no-png

Also save PNG version of plots.

-n, --n-test <n_test>

Test only n plots (for stacks).

Arguments

REX_DIR

Required argument

train

Tasks to perform machine learning steps.

tdub train [OPTIONS] COMMAND [ARGS]...
check

Check the results of a parameter scan WORKSPACE.

tdub train check [OPTIONS] WORKSPACE

Options

-p, --print-top

Print the top results

-n, --n-res <n_res>

Number of top results to print

Default:

10

Arguments

WORKSPACE

Required argument

fold

Perform a folded training based on a hyperparameter scan result.

tdub train fold [OPTIONS] SCANDIR DATADIR

Options

-t, --use-tptrw

use top pt reweighting

-n, --n-splits <n_splits>

number of splits for folding

Default:

3

Arguments

SCANDIR

Required argument

DATADIR

Required argument

itables

Generate importance tables.

tdub train itables [OPTIONS] SUMMARY_FILE

Arguments

SUMMARY_FILE

Required argument

prep

Prepare data for training.

tdub train prep [OPTIONS] DATADIR {1j1b|2j1b|2j2b} OUTDIR

Options

-p, --pre-exec <pre_exec>

Python code to pre-execute

-n, --nlo-method <nlo_method>

tW simulation NLO method

Default:

DR

-x, --override-selection <override_selection>

override selection with contents of file

-t, --use-tptrw

apply top pt reweighting

-r, --use-trrw

apply top recursive reweighting

-i, --ignore-list <ignore_list>

variable ignore list file

-m, --multiple-ttbar-samples

use multiple ttbar MC samples

-a, --use-inc-af2

use inclusive af2 samples

-f, --bkg-sample-frac <bkg_sample_frac>

use a fraction of the background

-d, --use-dilep

train with dilepton samples

Arguments

DATADIR

Required argument

REGION

Required argument

OUTDIR

Required argument
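
A hypothetical invocation preparing the 2j2b region with the DR sample and top pt reweighting (paths illustrative):

$ tdub train prep /path/to/data 2j2b /path/to/train_2j2b -n DR -t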

scan

Perform a parameter scan via condor jobs.

DATADIR points to the input ROOT files, training is performed on the REGION, and all output is saved to WORKSPACE.

$ tdub train scan /data/path 2j2b scan_2j2b

tdub train scan [OPTIONS] DATADIR WORKSPACE

Options

-p, --pre-exec <pre_exec>

Python code to pre-execute

-e, --early-stop <early_stop>

number of early stopping rounds

Default:

10

-s, --test-size <test_size>

training test size

Default:

0.4

--overwrite

overwrite existing workspace

--and-submit

submit the condor jobs

Arguments

DATADIR

Required argument

WORKSPACE

Required argument

shapes

Generate shape comparison plots.

tdub train shapes [OPTIONS] DATADIR

Options

-o, --outdir <outdir>

Directory to save output.

Arguments

DATADIR

Required argument

single

Execute single training round.

tdub train single [OPTIONS] DATADIR OUTDIR

Options

-p, --pre-exec <pre_exec>

Python code to pre-execute

-s, --test-size <test_size>

training test size

Default:

0.4

-e, --early-stop <early_stop>

number of early stopping rounds

Default:

10

-k, --use-sklearn

use sklearn instead of lgbm

-g, --use-xgboost

use xgboost instead of lgbm

-l, --learning-rate <learning_rate>

learning_rate model parameter

Default:

0.1

-n, --num-leaves <num_leaves>

num_leaves model parameter

Default:

16

-m, --min-child-samples <min_child_samples>

min_child_samples model parameter

Default:

500

-d, --max-depth <max_depth>

max_depth model parameter

Default:

5

-r, --reg-lambda <reg_lambda>

lambda (L2) regularization

Default:

0

-a, --auto-region

Use parameters associated with region

Default:

False

Arguments

DATADIR

Required argument

OUTDIR

Required argument
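
A hypothetical invocation overriding a few model parameters (paths illustrative):

$ tdub train single /path/to/train_2j2b /path/to/single_result -l 0.05 -n 32 -d 4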

tdub.art

A module for art (plots)

Function Summary

canvas_from_counts(counts, errors, bin_edges)

Create a plot canvas given a dictionary of counts and bin edges.

draw_atlas_label(ax[, follow, cme_and_lumi, ...])

Draw the ATLAS label text, with extra lines if desired.

draw_impact_barh(ax, df[, hi_color, ...])

Draw the impact plot.

draw_uncertainty_bands(uncertainty, ...[, ...])

Draw uncertainty bands on both axes in stack plot with a ratio.

legend_last_to_first(ax, **kwargs)

Move the last element of the legend to first.

one_sided_comparison_plot(nominal, one_up, edges)

Create plot for one sided systematic comparison.

setup_tdub_style()

Modify matplotlib's rcParams to our preference.

Reference

tdub.art.canvas_from_counts(counts, errors, bin_edges, uncertainty=None, total_mc=None, logy=False, mpl_triplet=None, combine_minor=True, **subplots_kw)[source]

Create a plot canvas given a dictionary of counts and bin edges.

The counts and errors dictionaries are expected to have the following keys:

  • “Data”

  • “tW_DR” or “tW”

  • “ttbar”

  • “Zjets”

  • “Diboson”

  • “MCNP”

Parameters:
  • counts (dict(str, np.ndarray)) – a dictionary pairing samples to bin counts.

  • errors (dict(str, np.ndarray)) – a dictionary pairing samples to bin count errors.

  • bin_edges (array_like) – the histogram bin edges.

  • uncertainty (tdub.root.TGraphAsymmErrors) – Uncertainty (TGraphAsym).

  • total_mc (tdub.root.TH1) – Total MC histogram (TH1D).

  • logy (bool) – Use log scale on y-axis.

  • mpl_triplet ((plt.Figure, plt.Axes, plt.Axes), optional) – Existing matplotlib triplet.

  • combine_minor (bool) – Combine minor backgrounds into a single contribution (Zjets, Diboson, and MCNP will be labeled “Minor Backgrounds”).

  • subplots_kw (dict) – remaining keyword arguments passed to matplotlib.pyplot.subplots().

Returns:

  • matplotlib.figure.Figure – The figure.

  • matplotlib.axes.Axes – The main (stack) axis.

  • matplotlib.axes.Axes – The ratio axis.
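
Examples

A minimal sketch with toy inputs; the sample keys follow the list above, and the returned triplet is assumed to follow the mpl_triplet convention:

>>> import numpy as np
>>> from tdub.art import canvas_from_counts
>>> rng = np.random.default_rng(42)
>>> edges = np.linspace(0, 200, 11)
>>> samples = ("Data", "tW", "ttbar", "Zjets", "Diboson", "MCNP")
>>> counts = {s: rng.uniform(10.0, 100.0, size=10) for s in samples}
>>> errors = {s: np.sqrt(c) for s, c in counts.items()}
>>> fig, ax, axr = canvas_from_counts(counts, errors, edges)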

tdub.art.draw_atlas_label(ax, follow='Internal', cme_and_lumi=True, extra_lines=None, cme=13, lumi=139, x=0.04, y=0.905, follow_shift=0.17, s1=18, s2=14, thesis=False)[source]

Draw the ATLAS label text, with extra lines if desired.

Parameters:
  • ax (matplotlib.axes.Axes) – Axes to draw the text on.

  • follow (str) – Text to follow the ATLAS label (usually ‘Internal’).

  • extra_lines (list(str), optional) – Set of extra lines to draw below ATLAS label.

  • cme (int or float) – Center-of-mass energy.

  • lumi (int or float) – Integrated luminosity of the data.

  • x (float) – x-location of the text.

  • y (float) – y-location of the text.

  • follow_shift (float) – x-shift of the text following the ATLAS label.

  • s1 (int) – Size of the main label.

  • s2 (int) – Size of the extra text

  • thesis (bool) – Flag for thesis label.

tdub.art.draw_impact_barh(ax, df, hi_color='steelblue', lo_color='mediumturquoise', height_fill=0.8, height_line=0.8)[source]

Draw the impact plot.

Parameters:
  • ax (matplotlib.axes.Axes) – Axes for the “delta mu” impact.

  • df (pandas.DataFrame) – Dataframe containing impact information.

  • hi_color (str) – Up variation color.

  • lo_color (str) – Down variation color.

  • height_fill (float) – Height for the filled bars (post-fit).

  • height_line (float) – Height for the line (unfilled) bars (pre-fit).

Returns:

  • matplotlib.axes.Axes – Axes for the impact: “delta mu”.

  • matplotlib.axes.Axes – Axes for the nuisance parameter pull.

tdub.art.draw_uncertainty_bands(uncertainty, total_mc, ax, axr, label='Uncertainty', edgecolor='mediumblue', zero_threshold=0.25)[source]

Draw uncertainty bands on both axes in stack plot with a ratio.

Parameters:
  • uncertainty (tdub.root.TGraphAsymmErrors) – ROOT TGraphAsymmErrors with full systematic uncertainty.

  • total_mc (tdub.root.TH1) – ROOT TH1 providing the full Monte Carlo prediction.

  • ax (matplotlib.axes.Axes) – Main axis (where histogram stack is painted)

  • axr (matplotlib.axes.Axes) – Ratio axis

  • label (str) – Legend label for the uncertainty.

  • edgecolor (str) – Color for the edges of the uncertainty bands.

  • zero_threshold (float) – When total MC events are below threshold, zero contents and error.

tdub.art.legend_last_to_first(ax, **kwargs)[source]

Move the last element of the legend to first.

Parameters:
  • ax (matplotlib.axes.Axes) – Axes containing the legend to reorder.

  • kwargs – Remaining keyword arguments passed to the legend call.
tdub.art.one_sided_comparison_plot(nominal, one_up, edges, thesis=False)[source]

Create plot for one sided systematic comparison.

Parameters:
  • nominal (numpy.ndarray) – Histogram bin counts for the nominal template.

  • one_up (numpy.ndarray) – Histogram bin counts for the “up” variation.

  • edges (numpy.ndarray) – The histogram bin edges.

  • thesis (bool) – Flag for thesis label.

Returns:

tdub.art.setup_tdub_style()[source]

Modify matplotlib’s rcParams to our preference.
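
Examples

>>> from tdub.art import setup_tdub_style
>>> setup_tdub_style()  # adjust rcParams once, before creating figures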

tdub.batch

A module for running batch jobs (currently targets the US ATLAS BNL cluster).

Function Summary

add_condor_arguments(arguments, to_file)

Add an arguments line to a condor submission script.

condor_preamble(workspace, exe[, universe, ...])

Create the preamble of a condor submission script.

create_condor_workspace(name[, overwrite])

Create a condor workspace given a name.

Reference

tdub.batch.add_condor_arguments(arguments, to_file)[source]

Add an arguments line to a condor submission script.

The arguments string is prefixed with “Arguments = ” and written to to_file.

Parameters:
  • arguments (str) – the arguments line

  • to_file (TextIO) – the open file stream

Examples

>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)
tdub.batch.condor_preamble(workspace, exe, universe='vanilla', memory='2GB', email='ddavis@phy.duke.edu', notification='Error', getenv='True', to_file=None, **kwargs)[source]

Create the preamble of a condor submission script.

Extra kwargs create additional preamble entries. See the HTCondor documentation for more details on all parameters.

Parameters:
  • workspace (str or os.PathLike) – the filesystem directory where the workspace lives

  • exe (str or os.PathLike) – the path of the executable that condor will run

  • universe (str) – the HTCondor universe

  • memory (str) – the requested memory

  • email (str) – the email to send updates to (if any)

  • notification (str) – the condor notification argument

  • to_file (TextIO, optional) – if not None, write the string to file

Returns:

the submission script preamble

Return type:

str

Examples

>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)
tdub.batch.create_condor_workspace(name, overwrite=False)[source]

Create a condor workspace given a name.

This will create a new directory containing log, out, and err directories inside. The workspace argument to the condor_preamble() function assumes creation of a workspace via this function.

Missing parent directories will always be created.

Parameters:
  • name (str or os.PathLike) – the desired filesystem path for the workspace

  • overwrite (bool) – if True, an existing workspace will be overwritten

Raises:

OSError – if the filesystem path exists and overwrite is False

Returns:

filesystem path to the workspace

Return type:

pathlib.PosixPath

Examples

>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)

tdub.config

Analysis configuration module.

tdub is a Python library for physics analysis. Naturally some properties of the analysis need to be easily modifiable for various studies. This module houses a handful of variables that can be modified simply by importing the module.

For example, we can call tdub.data.features_for() and expect different results without changing the API usage, just changing the configuration module FEATURESET_foo constants:

>>> from tdub.data import features_for
>>> features_for("2j2b")
['mass_lep1jet1', 'mass_lep2jet1', 'pT_jet2', ...]
>>> import tdub.config
>>> tdub.config.FEATURESET_2j2b = ["pT_jet1", "met"]
>>> features_for("2j2b")
['pT_jet1', 'met']

Similarly, we can modify the selection via this module:

>>> from tdub.data import selection_for
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> import tdub.config
>>> tdub.config.SELECTION_2j2b = "(reg2j2b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True) & (mass_lep1jet1 < 155)'

This module also contains convenience functions that provide sensible defaults for some configuration options without doing the work at import time (e.g. when the default requires importing a module or fetching data from the web).
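
For example, the init functions documented below fill in those defaults on demand (the metadata table requires network access):

>>> import tdub.config
>>> tdub.config.init_meta_table()  # defines PLOTTING_META_TABLE
>>> tdub.config.init_meta_logy()   # gives PLOTTING_LOGY a sensible default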

Constant Summary

AVOID_IN_CLF

List of features to avoid in classifiers.

AVOID_IN_CLF_1j1b

List of features to avoid specifically in 1j1b classifiers.

AVOID_IN_CLF_2j1b

List of features to avoid specifically in 2j1b classifiers.

AVOID_IN_CLF_2j2b

List of features to avoid specifically in 2j2b classifiers.

DEFAULT_SCAN_PARAMETERS

The default grid to perform a parameter scan.

FEATURESET_1j1b

List of features we use for classifiers in the 1j1b region.

FEATURESET_2j1b

List of features we use for classifiers in the 2j1b region.

FEATURESET_2j2b

List of features we use for classifiers in the 2j2b region.

PLOTTING_LOGY

Plots (defined as TRExFitter Regions) to use log scale.

PLOTTING_META_TABLE

Plotting metadata table.

RANDOM_STATE

Seed for various random tasks requiring reproducibility.

SELECTION_1j1b

The numexpr selection string for the 1j1b region.

SELECTION_2j1b

The numexpr selection string for the 2j1b region.

SELECTION_2j2b

The numexpr selection string for the 2j2b region.

Function Summary

init_meta_table()

Load metadata from network to define PLOTTING_META_TABLE.

init_meta_logy()

Set a sensible default PLOTTING_LOGY value.

Constant Reference

tdub.config.AVOID_IN_CLF

List of features to avoid in classifiers.

Type:

list(str)

tdub.config.AVOID_IN_CLF_1j1b

List of features to avoid specifically in 1j1b classifiers.

Type:

list(str)

tdub.config.AVOID_IN_CLF_2j1b

List of features to avoid specifically in 2j1b classifiers.

Type:

list(str)

tdub.config.AVOID_IN_CLF_2j2b

List of features to avoid specifically in 2j2b classifiers.

Type:

list(str)

tdub.config.FEATURESET_1j1b

List of features we use for classifiers in the 1j1b region.

Type:

list(str)

tdub.config.FEATURESET_2j1b

List of features we use for classifiers in the 2j1b region.

Type:

list(str)

tdub.config.FEATURESET_2j2b

List of features we use for classifiers in the 2j2b region.

Type:

list(str)

tdub.config.PLOTTING_LOGY

Plots (defined as TRExFitter Regions) to use log scale.

Type:

list(str)

tdub.config.PLOTTING_META_TABLE

Plotting metadata table.

Type:

dict, optional

tdub.config.RANDOM_STATE

Seed for various random tasks requiring reproducibility.

Type:

int

tdub.config.SELECTION_1j1b

The numexpr selection string for the 1j1b region.

Type:

str

tdub.config.SELECTION_2j1b

The numexpr selection string for the 2j1b region.

Type:

str

tdub.config.SELECTION_2j2b

The numexpr selection string for the 2j2b region.

Type:

str

Function Reference

tdub.config.init_meta_table()[source]

Load metadata from network to define PLOTTING_META_TABLE.

tdub.config.init_meta_logy()[source]

Set a sensible default PLOTTING_LOGY value.

tdub.data

A module for handling our data.

Class Summary

Region(value)

A simple enum class for easily using region information.

SampleInfo(input_file)

Describes a sample's attributes given its name.

Function Summary

as_region(region)

Convert input to Region.

avoids_for(region)

Get the features to avoid for the given region.

branches_from(source[, tree, ignore_weights])

Get a list of branches from a data source.

categorize_branches(source)

Categorize branches into separated lists.

features_for(region)

Get the feature list for a region.

quick_files(datapath[, campaign, tree])

Get a dictionary connecting sample processes to file lists.

selection_as_numexpr(selection)

Get the numexpr selection string from an arbitrary selection.

selection_as_root(selection)

Get the ROOT selection string from an arbitrary selection.

selection_branches(selection)

Construct the minimal set of branches required for a selection.

selection_for(region[, additional])

Get the selection for a given region.

Reference

class tdub.data.Region(value)[source]

A simple enum class for easily using region information.

r1j1b

Label for our 1j1b region.

r2j1b

Label for our 2j1b region.

r2j2b

Label for our 2j2b region.

Examples

Using this enum for grabbing the 2j2b region from a set of files:

>>> from tdub.data import Region, selection_for
>>> from tdub.frames import iterative_selection
>>> df = iterative_selection(files, selection_for(Region.r2j2b))
static from_str(s)[source]

Get enum value for the given string.

This function supports three ways to define a region: prefixed with “r”, prefixed with “reg”, or no prefix at all. For example, Region.r2j2b can be retrieved like so:

  • Region.from_str("r2j2b")

  • Region.from_str("reg2j2b")

  • Region.from_str("2j2b")

Parameters:

s (str) – String representation of the desired region

Returns:

Enum version

Return type:

Region

Examples

>>> from tdub.data import Region
>>> Region.from_str("1j1b")
<Region.r1j1b: 0>
class tdub.data.SampleInfo(input_file)[source]

Describes a sample’s attributes given its name.

Parameters:

input_file (str) – File stem containing the necessary groups to parse.

phy_process

Physics process (e.g. ttbar or tW_DR or Zjets)

Type:

str

dsid

Dataset ID

Type:

int

sim_type

Simulation type, “FS” or “AFII”

Type:

str

campaign

Campaign, MC16{a,d,e}

Type:

str

tree

Original tree (e.g. “nominal” or “EG_SCALE_ALL__1up”)

Type:

str

Examples

>>> from tdub.data import SampleInfo
>>> sampinfo = SampleInfo("ttbar_410472_AFII_MC16d_nominal.root")
>>> sampinfo.phy_process
ttbar
>>> sampinfo.dsid
410472
>>> sampinfo.sim_type
AFII
>>> sampinfo.campaign
MC16d
>>> sampinfo.tree
nominal
tdub.data.as_region(region)[source]

Convert input to Region.

Meant to be similar to numpy.asarray() function.

Parameters:

region (str or Region) – Region already as a Region or as a str

Returns:

Region representation.

Return type:

Region

Examples

>>> from tdub.data import as_region, Region
>>> as_region("r2j1b")
<Region.r2j1b: 1>
>>> as_region(Region.r2j2b)
<Region.r2j2b: 2>
tdub.data.avoids_for(region)[source]

Get the features to avoid for the given region.

See the tdub.config module for definition of the variables to avoid (and how to modify them).

Parameters:

region (str or tdub.data.Region) – Region to get the associated avoided branches.

Returns:

Features to avoid for the region.

Return type:

list(str)

Examples

>>> from tdub.data import avoids_for, Region
>>> avoids_for(Region.r2j1b)
['HT_jet1jet2', 'deltaR_lep1lep2_jet1jet2met', 'mass_lep2jet1', 'pT_jet2']
>>> avoids_for("2j2b")
['deltaR_jet1_jet2']
tdub.data.branches_from(source, tree='WtLoop_nominal', ignore_weights=False)[source]

Get a list of branches from a data source.

If the source is a list of files, the first file is the only file that is parsed.

Parameters:
  • source (str, list(str), os.PathLike, list(os.PathLike), or uproot File/Tree) – What to parse to get the branch information.

  • tree (str) – Name of the tree to get branches from

  • ignore_weights (bool) – Flag to ignore all branches starting with weight_.

Returns:

Branches from the source.

Return type:

list(str)

Raises:

TypeError – If source can’t be used to find a list of branches.

Examples

>>> from tdub.data import branches_from
>>> branches_from("/path/to/file.root", ignore_weights=True)
["pT_lep1", "pT_lep2"]
>>> branches_from("/path/to/file.root")
["pT_lep1", "pT_lep2", "weight_nominal", "weight_tptrw"]
tdub.data.categorize_branches(source)[source]

Categorize branches into separated lists.

The categories:

  • kinematics: for kinematic features (used for classifiers)

  • weights: for any branch that starts or ends with weight

  • meta: for meta information (final state information)

Parameters:

source (list(str)) – Complete list of branches to be categorized.

Returns:

Dictionary connecting categories to their associated list of branchess.

Return type:

dict(str, list(str))

Examples

>>> from tdub.data import categorize_branches, branches_from
>>> branches = ["pT_lep1", "pT_lep2", "weight_nominal", "weight_sys_jvt", "reg2j2b"]
>>> cated = categorize_branches(branches)
>>> cated["weights"]
['weight_sys_jvt', 'weight_nominal']
>>> cated["meta"]
['reg2j2b']
>>> cated["kinematics"]
['pT_lep1', 'pT_lep2']

Using a ROOT file:

>>> root_file = PosixPath("/path/to/file.root")
>>> cated = categorize_branches(branches_from(root_file))
tdub.data.features_for(region)[source]

Get the feature list for a region.

See the tdub.config module for the definitions of the feature lists (and how to modify them).

Parameters:

region (str or tdub.data.Region) – Region as a string or enum entry. Using "ALL" returns a list of unique features from all regions.

Returns:

Features for that region (or all regions).

Return type:

list(str)

Examples

>>> from pprint import pprint
>>> from tdub.data import features_for
>>> pprint(features_for("reg2j1b"))
['mass_lep1jet1',
 'mass_lep1jet2',
 'mass_lep2jet1',
 'mass_lep2jet2',
 'pT_jet2',
 'pTsys_lep1lep2jet1jet2met',
 'psuedoContTagBin_jet1',
 'psuedoContTagBin_jet2']
tdub.data.quick_files(datapath, campaign=None, tree='nominal')[source]

Get a dictionary connecting sample processes to file lists.

The lists of files are sorted alphabetically. These types of samples are currently tested:

  • tW_DR (410648, 410649 full sim)

  • tW_DR_AFII (410648, 410649 fast sim)

  • tW_DR_PS (411038, 411039 fast sim)

  • tW_DR_inc (410646, 410647 full sim)

  • tW_DR_inc_AFII (410646, 410647 fast sim)

  • tW_DS (410656, 410657 full sim)

  • tW_DS_inc (410654, 410655 full sim)

  • ttbar (410472 full sim)

  • ttbar_AFII (410472 fast sim)

  • ttbar_PS (410558 fast sim)

  • ttbar_PS713 (411234 fast sim)

  • ttbar_hdamp (410482 fast sim)

  • ttbar_inc (410470 full sim)

  • ttbar_inc_AFII (410470 fast sim)

  • Diboson

  • Zjets

  • MCNP

  • Data

Parameters:
  • datapath (str or os.PathLike) – Path where all of the ROOT files live.

  • campaign (str, optional) – Enforce a single campaign (“MC16a”, “MC16d”, or “MC16e”).

  • tree (str) – Upstream AnalysisTop ntuple tree.

Returns:

The dictionary of processes and their associated files.

Return type:

dict(str, list(str))

Examples

>>> from pprint import pprint
>>> from tdub.data import quick_files
>>> qf = quick_files("/path/to/some/files")  # has 410472 ttbar samples
>>> pprint(qf["ttbar"])
['/path/to/some/files/ttbar_410472_FS_MC16a_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16d_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16e_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16d")
>>> pprint(qf["tW_DR"])
['/path/to/some/files/tW_DR_410648_FS_MC16d_nominal.root',
 '/path/to/some/files/tW_DR_410649_FS_MC16d_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16a")
>>> pprint(qf["Data"])
['/path/to/some/files/Data15_data15_Data_Data_nominal.root',
 '/path/to/some/files/Data16_data16_Data_Data_nominal.root']
tdub.data.selection_as_numexpr(selection)[source]

Get the numexpr selection string from an arbitrary selection.

Parameters:

selection (str) – Selection string in ROOT or numexpr

Returns:

Selection in numexpr format.

Return type:

str

Examples

>>> selection = "reg1j1b == true && OS == true && mass_lep1jet1 < 155"
>>> from tdub.data import selection_as_numexpr
>>> selection_as_numexpr(selection)
'(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)'
tdub.data.selection_as_root(selection)[source]

Get the ROOT selection string from an arbitrary selection.

Parameters:

selection (str) – The selection string in ROOT or numexpr

Returns:

The same selection in ROOT format.

Return type:

str

Examples

>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> from tdub.data import selection_as_root
>>> selection_as_root(selection)
'(reg1j1b == true) && (OS == true) && (mass_lep1jet1 < 155)'
tdub.data.selection_branches(selection)[source]

Construct the minimal set of branches required for a selection.

Parameters:

selection (str) – Selection string in ROOT or numexpr

Returns:

Necessary branches/variables

Return type:

set(str)

Examples

>>> from tdub.data import selection_branches
>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1lep2 > 100)"
>>> selection_branches(selection)
{'OS', 'mass_lep1lep2', 'reg1j1b'}
>>> selection = "reg2j1b == true && OS == true && (mass_lep1jet1 < 155)"
>>> selection_branches(selection)
{'OS', 'mass_lep1jet1', 'reg2j1b'}
tdub.data.selection_for(region, additional=None)[source]

Get the selection for a given region.

We have three regions with a default selection (1j1b, 2j1b, and 2j2b); these are the possible argument options (in str or Enum form). See the tdub.config module for the definitions of the selections (and how to modify them).

Parameters:
  • region (str or Region) – Region to get the selection for

  • additional (str, optional) – Additional selection (in ROOT or numexpr form). This is combined with the region-specific selection via a logical and.

Returns:

Selection string in numexpr format.

Return type:

str

Examples

>>> from tdub.data import Region, selection_for
>>> selection_for(Region.r2j1b)
'(reg2j1b == True) & (OS == True)'
>>> selection_for("reg1j1b")
'(reg1j1b == True) & (OS == True)'
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> selection_for("2j2b", additional="minimaxmbl < 155")
'((reg2j2b == True) & (OS == True)) & (minimaxmbl < 155)'
>>> selection_for("2j1b", additional="mass_lep1jetb < 155 && mass_lep2jetb < 155")
'((reg2j1b == True) & (OS == True)) & ((mass_lep1jetb < 155) & (mass_lep2jetb < 155))'

tdub.frames

A module for handling dataframes.

Factory Function Summary

iterative_selection(files, selection[, ...])

Build a selected dataframe via uproot's iterate.

raw_dataframe(files[, tree, weight_name, ...])

Construct a raw pandas flavored Dataframe with help from uproot.

Helper Function Summary

apply_weight(df, weight_name[, exclude])

Apply (multiply) a weight to all other weights in the DataFrame.

apply_weight_campaign(df[, exclude])

Multiply nominal and systematic weights by the campaign weight.

apply_weight_tptrw(df[, exclude])

Multiply nominal and systematic weights by the top pt reweight term.

apply_weight_trrw(df[, exclude])

Multiply nominal and systematic weights by the top recursive reweight term.

drop_avoid(df[, region])

Drop columns that we avoid in classifiers.

drop_cols(df, *cols)

Drop some columns from a dataframe.

drop_jet2(df)

Drop all columns with jet2 properties.

satisfying_selection(*dfs, selection)

Get subsets of dataframes that satisfy a selection.

Reference

tdub.frames.iterative_selection(files, selection, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, keep_category=None, exclude_avoids=False, use_campaign_weight=False, use_tptrw=False, use_trrw=False, sample_frac=None, **kwargs)[source]

Build a selected dataframe via uproot’s iterate.

If we want to build a memory-hungry dataframe and apply a selection, this helps us avoid crashing due to using all of our RAM. Constructing a dataframe with this function is useful when we want to grab many branches in a large dataset that won’t fit in memory before the selection.

The selection can be in either numexpr or ROOT form; we ensure that a ROOT style selection is converted to numexpr for use with pandas.eval().

Parameters:
  • files (list(str) or str) – A single ROOT file or list of ROOT files.

  • selection (str) – Selection string (numexpr or ROOT form accepted).

  • tree (str) – Tree name to turn into a dataframe.

  • weight_name (str) – Weight branch to preserve.

  • branches (list(str), optional) – List of branches to include as columns in the dataframe, default is None (all branches).

  • keep_category (str, optional) – If not None, the selected dataframe(s) will only include columns which are part of the given category (see tdub.data.categorize_branches()). The weight branch is always kept.

  • exclude_avoids (bool) – Exclude branches defined by tdub.config.AVOID_IN_CLF.

  • use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight.

  • use_tptrw (bool) – Apply the top pt reweighting factor.

  • use_trrw (bool) – Apply the top recursive reweighting factor.

  • sample_frac (float, optional) – Sample a fraction of the available data.

Returns:

The final selected dataframe(s) from the files.

Return type:

pandas.DataFrame

Examples

Creating a ttbar dataframe and a single tW dataframe:

>>> from tdub.frames import iterative_selection
>>> from tdub.data import quick_files
>>> from tdub.data import selection_for
>>> qf = quick_files("/path/to/files")
>>> ttbar_dfs = iterative_selection(qf["ttbar"], selection_for("2j2b"),
...                                 entrysteps="1 GB")
>>> tW_df = iterative_selection(qf["tW_DR"], selection_for("2j2b"))

Keep only kinematic branches after selection and ignore avoided columns:

>>> tW_df = iterative_selection(qf["tW_DR"],
...                             selection_for("2j2b"),
...                             exclude_avoids=True,
...                             keep_category="kinematics")
tdub.frames.raw_dataframe(files, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, drop_weight_sys=False, **kwargs)[source]

Construct a raw pandas flavored Dataframe with help from uproot.

We call this dataframe “raw” because it hasn’t been parsed by any other tdub.frames functionality (no selection performed, kinematic and weight branches won’t be separated, etc.) – just a pure raw dataframe from some ROOT files.

Extra kwargs are fed to uproot’s arrays() interface.

Parameters:
  • files (list(str) or str) – Single ROOT file or list of ROOT files.

  • tree (str) – The tree name to turn into a dataframe.

  • weight_name (str) – Weight branch (we make sure to grab it if you give something other than None to branches).

  • branches (list(str), optional) – List of branches to include as columns in the dataframe, default is None, includes all branches.

  • drop_weight_sys (bool) – Drop all weight systematic branches from being grabbed.

Returns:

The pandas flavored DataFrame with all requested branches

Return type:

pandas.DataFrame

Examples

>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe
>>> files = quick_files("/path/to/files")["ttbar"]
>>> df = raw_dataframe(files)
tdub.frames.apply_weight(df, weight_name, exclude=None)[source]

Apply (multiply) a weight to all other weights in the DataFrame.

This will multiply the nominal weight and all systematic weights in the DataFrame by the weight_name column. We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe to operate on.

  • weight_name (str) – Column name to multiply all other weight columns by.

  • exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.apply_weight("weight_campaign")
tdub.frames.apply_weight_campaign(df, exclude=None)[source]

Multiply nominal and systematic weights by the campaign weight.

This is useful for samples that were produced without the campaign weight term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe to operate on.

  • exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.003
>>> df.weight_campaign[5]
0.4
>>> df.apply_weight_campaign()
>>> df.weight_nominal[5]
0.0012
tdub.frames.apply_weight_tptrw(df, exclude=None)[source]

Multiply nominal and systematic weights by the top pt reweight term.

This is useful for samples that were produced without the top pt reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe to operate on.

  • exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_tptrw_tool[5]
0.98
>>> df.apply_weight_tptrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.apply_weight_trrw(df, exclude=None)[source]

Multiply nominal and systematic weights by the top recursive reweight term.

This is useful for samples that were produced without the top recursive reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe to operate on.

  • exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_trrw_tool[5]
0.98
>>> df.apply_weight_trrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.drop_avoid(df, region=None)[source]

Drop columns that we avoid in classifiers.

Uses tdub.frames.drop_cols() with a predefined set of columns (tdub.config.AVOID_IN_CLF). We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe that you want to slim.

  • region (optional, str or tdub.data.Region) – Region to augment the list of dropped columns (see the region specific AVOID constants in the config module).

Examples

>>> from tdub.frames import drop_avoid
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jetL1" in df.columns:
True
>>> drop_avoid(df)
>>> "E_jetL1" in df.columns:
False
tdub.frames.drop_cols(df, *cols)[source]

Drop some columns from a dataframe.

This is a convenience function: any names in cols that do not exist in the dataframe are silently ignored. We augment pandas.DataFrame with this function.

Parameters:
  • df (pandas.DataFrame) – Dataframe which we want to slim.

  • *cols (sequence of strings) – Columns to remove

Examples

>>> import pandas as pd
>>> from tdub.frames import drop_cols
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jet1" in df.columns
True
>>> "mass_jet1" in df.columns
True
>>> "mass_jet2" in df.columns
True
>>> drop_cols(df, "E_jet1", "mass_jet1")
>>> "E_jet1" in df.columns
False
>>> "mass_jet1" in df.columns
False
>>> df.drop_cols("mass_jet2")  # use augmented df class
>>> "mass_jet2" in df.columns
False
tdub.frames.drop_jet2(df)[source]

Drop all columns with jet2 properties.

In the 1j1b region we obviously don’t have a second jet; so this lets us get rid of all columns dependent on jet2 kinematic properties. We augment pandas.DataFrame with this function.

Parameters:

df (pandas.DataFrame) – Dataframe that we want to slim.

Examples

>>> from tdub.frames import drop_jet2
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "pTsys_lep1lep2jet1jet2met" in df.columns:
True
>>> drop_jet2(df)
>>> "pTsys_lep1lep2jet1jet2met" in df.columns:
False
tdub.frames.satisfying_selection(*dfs, selection)[source]

Get subsets of dataframes that satisfy a selection.

The selection string can be in either ROOT or numexpr form (we ensure to convert ROOT to numexpr).

Parameters:
  • *dfs (sequence of pandas.DataFrame) – Dataframes to apply the selection to.

  • selection (str) – Selection string (in numexpr or ROOT form).

Returns:

Dataframes satisfying the selection string.

Return type:

list(pandas.DataFrame)

Examples

>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe, satisfying_selection
>>> qf = quick_files("/path/to/files")
>>> df_tW_DR = raw_dataframe(qf["tW_DR"])
>>> df_ttbar = raw_dataframe(qf["ttbar"])
>>> low_bdt = "(bdt_response < 0.4)"
>>> tW_DR_selected, ttbar_selected = satisfying_selection(
...     df_tW_DR, df_ttbar, selection=low_bdt
... )

tdub.features

A module for performing feature selection

Class Summary

FeatureSelector(df, labels, weights[, ...])

A class to steer the steps of feature selection.

Function Summary

create_parquet_files(qf_dir[, out_dir, ...])

Create slimmed and selected parquet files from ROOT files.

prepare_from_parquet(data_dir, region[, ...])

Prepare feature selection data from parquet files.

Reference

class tdub.features.FeatureSelector(df, labels, weights, importance_type='gain', corr_threshold=0.85, name=None)[source]

A class to steer the steps of feature selection.

Parameters:
  • df (pandas.DataFrame) – The dataframe which contains signal and background events; it should also only contain features we wish to test for (it is expected to be “clean” from non-kinematic information, like metadata and weights).

  • weights (numpy.ndarray) – the weights array compatible with the dataframe

  • importance_type (str) – the importance type (“gain” or “split”)

  • labels (numpy.ndarray) – array of labels compatible with the dataframe (1 for \(tW\) and 0 for \(t\bar{t}\)).

  • corr_threshold (float) – the threshold for excluding features based on correlations

  • name (str, optional) – give the selector a name

data

the raw dataframe as fed to the class instance

Type:

pandas.DataFrame

weights

the raw weights array compatible with the dataframe

Type:

numpy.ndarray

labels

the raw labels array compatible with the dataframe (we expect 1 for signal, \(tW\), and 0 for background, \(t\bar{t}\)).

Type:

numpy.ndarray

raw_features

the list of all features determined at initialization

Type:

list(str)

name

a name for the selector pipeline (required to save the result)

Type:

str, optional

corr_threshold

the threshold for excluding features based on correlations

Type:

float

default_clf_opts

the default arguments we initialize classifiers with.

Type:

dict

corr_matrix

the raw correlation matrix for the features (requires calling the check_collinearity function)

Type:

pandas.DataFrame

correlated

a dataframe matching features that satisfy the correlation threshold

Type:

pandas.DataFrame

importances

the importances as determined by a vanilla GBDT (requires calling the check_importances function)

Type:

pandas.DataFrame

candidates

list of candidate features (sorted by importance) as determined by calling the check_candidates function

Type:

list(str)

iterative_remove_aucs

a dictionary of the form {feature : auc} providing the AUC value for a BDT trained _without_ the feature given in the key. The keys are built from the candidates list.

Type:

dict(str, float)

iterative_add_aucs

an array of AUC values built by iteratively adding the next best feature in the candidates list. (the first entry is calculated using only the top feature, the second entry uses the top 2 features, and so on).

Type:

numpy.ndarray

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
check_candidates(n=20)[source]

Get the top uncorrelated features.

This will parse the correlations and most important features and build a list of ordered important features. When a feature that should be dropped due to a collinear feature is found, we ensure that the more important member of the pair is included in the resulting list and drop the other member of the pair. This will populate the candidates attribute for the class.

Parameters:

n (int) – the total number of features to retrieve

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
check_collinearity(threshold=None)[source]

Calculate the correlations of the features.

Given a correlation threshold this will construct a list of features that should be dropped based on the correlation values. This also populates the corr_matrix and correlated attributes of the instance.

If the threshold argument is not None then the class instance’s corr_threshold property is updated.

Parameters:

threshold (float, optional) – Override the existing correlations threshold.

Examples

Overriding the exclusion threshold:

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.corr_threshold
0.90
>>> fs.check_collinearity(threshold=0.85)
>>> fs.corr_threshold
0.85
check_for_uniques(and_drop=True)[source]

Check the dataframe for features that have a single unique value.

Parameters:

and_drop (bool) – If True, drop any columns with a single unique value.

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
check_importances(extra_clf_opts=None, extra_fit_opts=None, n_fits=5, test_size=0.5)[source]

Train vanilla GBDT to calculate feature importance.

Some default options are used for the lightgbm.LGBMClassifier instance and its fit (see the implementation); extras can be provided via the function arguments.

Parameters:
  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

  • n_fits (int) – number of fits to perform for the importance calculation.

  • test_size (float) – test size for the train/test splits.

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
check_iterative_add_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]

Calculate aucs iteratively adding the next best feature.

After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs iteratively, including one more feature at a time, starting with the most important.

Parameters:
  • max_features (int) – the maximum number of features to allow to be checked. default will be the length of the candidates list.

  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
check_iterative_remove_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)[source]

Calculate the aucs iteratively removing one feature at a time.

After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs, each time removing one of the candidate features. We rank each feature based on how impactful its removal is.

Parameters:
  • max_features (int) – the maximum number of features to allow to be checked. default will be the length of the candidates list.

  • extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.

  • extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_remove_aucs(max_features=20)
save_result()[source]

Save the results to a directory.

Parameters:

output_dir (str or os.PathLike) – the directory to save relevant results to

Examples

>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
>>> fs.name = "2j1b_DR"
>>> fs.save_result()
tdub.features.create_parquet_files(qf_dir, out_dir=None, entrysteps=None, use_campaign_weight=False)[source]

Create slimmed and selected parquet files from ROOT files.

This function requires pyarrow.

Parameters:
  • qf_dir (str or os.PathLike) – directory to run tdub.data.quick_files()

  • out_dir (str or os.PathLike, optional) – directory to save output files

  • entrysteps (any, optional) – entrysteps option forwarded to tdub.frames.iterative_selection()

  • use_campaign_weight (bool) – multiply the nominal weight by the campaign weight; this is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight

Examples

>>> from tdub.features import create_parquet_files
>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", entrysteps="250 MB")
tdub.features.prepare_from_parquet(data_dir, region, nlo_method='DR', ttbar_frac=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, test_case_size=None)[source]

Prepare feature selection data from parquet files.

This function requires pyarrow.

Parameters:
  • data_dir (str or os.PathLike) – directory where the parquet files live

  • region (str or tdub.data.Region) – the region where we’re going to select features

  • nlo_method (str) – the \(tW\) sample (DR or DS)

  • ttbar_frac (str or float, optional) – if not None, the fraction of \(t\bar{t}\) events to use; “auto” uses some sensible defaults to fit in memory: 0.70 for 2j2b and 0.60 for 2j1b.

  • weight_mean (float, optional) – scale all weights such that the mean weight is this value. Cannot be used with weight_scale.

  • weight_scale (float, optional) – value to scale all weights by, cannot be used with weight_mean.

  • scale_sum_weights (bool) – scale sum of weights of signal to be sum of weights of background

  • test_case_size (int, optional) – if we want to perform a quick test, we use a subset of the data, for test_case_size=N we use N events from both signal and background. Cannot be used with ttbar_frac.

Returns:

  • pandas.DataFrame – the dataframe which contains kinematic features

  • numpy.ndarray – the labels array for the events

  • numpy.ndarray – the weights array for the events

Examples

>>> from tdub.features import prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")

tdub.hist

A module for histogramming

Class Summary

SystematicComparison(nominal, up, down)

Systematic template histogram comparison.

Function Summary

bin_centers(bin_edges)

Get bin centers given bin edges.

edges_and_centers(bins[, range])

Create arrays for edges and bin centers.

to_uniform_bins(bin_edges)

Convert a set of variable width bins to arbitrary uniform bins.

Reference

class tdub.hist.SystematicComparison(nominal, up, down)[source]

Systematic template histogram comparison.

nominal

Nominal histogram bin counts.

Type:

numpy.ndarray

up

Up variation histogram bin counts.

Type:

numpy.ndarray

down

Down variation histogram bin counts.

Type:

numpy.ndarray

percent_diff_up

Percent difference between nominal and up variation.

Type:

numpy.ndarray

percent_diff_down

Percent difference between nominal and down variation.

Type:

numpy.ndarray

static one_sided(nominal, up)[source]

Generate components of a systematic comparison plot.

Parameters:
  • nominal (numpy.ndarray) – Histogram bin counts for the nominal template.

  • up (numpy.ndarray) – Histogram bin counts for the “up” variation.

Returns:

The complete description of the comparison

Return type:

SystematicComparison
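
Examples

A sketch with toy bin counts (values illustrative only):

>>> import numpy as np
>>> from tdub.hist import SystematicComparison
>>> nominal = np.array([100.0, 80.0, 60.0])
>>> up = np.array([104.0, 78.0, 63.0])
>>> comp = SystematicComparison.one_sided(nominal, up)
>>> pdu, pdd = comp.percent_diff_up, comp.percent_diff_down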

property percent_diff_max

maximum for percent difference.

Type:

float

property percent_diff_min

minimum for percent difference.

Type:

float

property template_max

maximum height of a variation.

Type:

float

tdub.hist.bin_centers(bin_edges)[source]

Get bin centers given bin edges.

Parameters:

bin_edges (numpy.ndarray) – edges defining binning

Returns:

the centers associated with the edges

Return type:

numpy.ndarray

Examples

>>> import numpy as np
>>> from tdub.hist import bin_centers
>>> bin_edges = np.linspace(25, 225, 11)
>>> centers = bin_centers(bin_edges)
>>> bin_edges
array([ 25.,  45.,  65.,  85., 105., 125., 145., 165., 185., 205., 225.])
>>> centers
array([ 35.,  55.,  75.,  95., 115., 135., 155., 175., 195., 215.])
tdub.hist.edges_and_centers(bins, range=None)[source]

Create arrays for edges and bin centers.

Parameters:
  • bins (int or sequence of scalars) – the number of bins or sequence representing bin edges

  • range (tuple(float, float), optional) – the minimum and maximum defining the bin range (used if bins is integral)

Returns:

  • numpy.ndarray – the bin edges

  • numpy.ndarray – the bin centers

Examples

From bin multiplicity and a range:

>>> from tdub.hist import edges_and_centers
>>> edges, centers = edges_and_centers(bins=20, range=(25, 225))

From pre-existing edges:

>>> import numpy as np
>>> edges, centers = edges_and_centers(np.linspace(0, 10, 21))
tdub.hist.to_uniform_bins(bin_edges)[source]

Convert a set of variable width bins to arbitrary uniform bins.

This will create a set of bin edges such that the bin centers are at whole numbers, i.e. 5 variable width bins will return an array from 0.5 to 5.5: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5].

Parameters:

bin_edges (numpy.ndarray) – Array of bin edges.

Returns:

The new set of uniform bins

Return type:

numpy.ndarray

Examples

>>> import numpy as np
>>> from tdub.hist import to_uniform_bins
>>> var_width = [0, 1, 3, 7, 15]
>>> to_uniform_bins(var_width)
array([0.5, 1.5, 2.5, 3.5, 4.5])

tdub.math

A module with math utilities

Function Summary

chisquared_cdf_c(chi2, ndf)

Calculate \(\chi^2\) probability from the value and NDF.

chisquared_test(h1, err1, h2, err2)

Perform \(\chi^2\) test on two histograms.

kolmogorov_prob(z)

Calculate the Kolmogorov distribution function.

ks_twosample_binned(hist1, hist2, err1, err2)

Calculate KS statistic and p-value for two binned distributions.

Reference

tdub.math.chisquared_cdf_c(chi2, ndf)[source]

Calculate \(\chi^2\) probability from the value and NDF.

See ROOT’s TMath::Prob & ROOT::Math::chisquared_cdf_c. Quoting the ROOT documentation:

Computation of the probability for a certain \(\chi^2\) and number of degrees of freedom (ndf). Calculations are based on the incomplete gamma function \(P(a,x)\), where \(a=\mathrm{ndf}/2\) and \(x=\chi^2/2\).

\(P(a,x)\) represents the probability that the observed \(\chi^2\) for a correct model should be less than the value \(\chi^2\). The returned probability corresponds to \(1-P(a,x)\), which denotes the probability that an observed \(\chi^2\) exceeds the value \(\chi^2\) by chance, even for a correct model.

Parameters:
  • chi2 (float) – the \(\chi^2\) value

  • ndf (float) – the degrees of freedom

Returns:

the \(\chi^2\) probability

Return type:

float
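
Examples

>>> from tdub.math import chisquared_cdf_c
>>> prob = chisquared_cdf_c(12.3, 10)  # probability a chi2 of 12.3 is exceeded by chance with ndf = 10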

tdub.math.chisquared_test(h1, err1, h2, err2)[source]

Perform \(\chi^2\) test on two histograms.

Parameters:
Returns:

the \(\chi^2\) test value, the degrees of freedom, and the probability

Return type:

(float, int, float)
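
Examples

A sketch on two toy histograms, assuming Poisson-like errors:

>>> import numpy as np
>>> from tdub.math import chisquared_test
>>> h1 = np.array([12.0, 35.0, 41.0, 22.0])
>>> h2 = np.array([10.0, 38.0, 39.0, 25.0])
>>> chi2, ndf, prob = chisquared_test(h1, np.sqrt(h1), h2, np.sqrt(h2))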

tdub.math.kolmogorov_prob(z)[source]

Calculate the Kolmogorov distribution function.

See ROOT’s implementation in TMath (TMath::KolmogorovProb).

Parameters:

z (float) – the value to test

Returns:

the probability that the test statistic exceeds \(z\) (assuming the null hypothesis).

Return type:

float

Examples

>>> from tdub.math import kolmogorov_prob
>>> kolmogorov_prob(1.13)
0.15549781841748692
tdub.math.ks_twosample_binned(hist1, hist2, err1, err2)[source]

Calculate KS statistic and p-value for two binned distributions.

See ROOT’s implementation in TH1 (TH1::KolmogorovTest).

Parameters:
  • hist1 (numpy.ndarray) – the histogram counts for the first distribution

  • hist2 (numpy.ndarray) – the histogram counts for the second distribution

  • err1 (numpy.ndarray) – the error on the histogram counts for the first distribution

  • err2 (numpy.ndarray) – the error on the histogram counts for the second distribution

Returns:

first: the test-statistic; second: the probability of the test (much less than 1 means distributions are incompatible)

Return type:

(float, float)

Examples

>>> import pygram11
>>> from tdub.math import ks_twosample_binned
>>> data1, data2, w1, w2 = some_function_to_get_data()
>>> h1, err1 = pygram11.histogram(data1, weights=w1, bins=40, range=(-3, 3))
>>> h2, err2 = pygram11.histogram(data2, weights=w2, bins=40, range=(-3, 3))
>>> kst, ksp = ks_twosample_binned(h1, h2, err1, err2)

tdub.ml_apply

A module for applying trained models

Class Summary

BaseTrainSummary()

Base class for describing a completed training to apply to other data.

FoldedTrainSummary(fold_output)

Provides access to the properties of a folded training.

SingleTrainSummary(training_output)

Provides access to the properties of a single result.

Function Summary

build_array(summaries, df)

Get a NumPy array which is the response for all events in df.

Reference

class tdub.ml_apply.BaseTrainSummary[source]

Base class for describing a completed training to apply to other data.

apply_to_dataframe(df, column_name, do_query)[source]

Apply trained model(s) to events in a dataframe df.

All BaseTrainSummary classes must implement this function.

property features

Features used by the model.

parse_summary_json(summary_file)[source]

Parse a training’s summary json file.

This populates the class properties with values, and the resulting dictionary is saved to be accessible via the summary property. The common class properties (which all BaseTrainSummarys have by definition), besides summary, are features, region, and selection_used. This function defines those, so classes inheriting from BaseTrainSummary should call the super implementation of this method if a daughter implementation needs to add additional summary properties.

Parameters:

summary_file (os.PathLike) – The summary json file.

property region

Region where the training was executed.

property selection_used

Numexpr selection used on the trained datasets.

property summary

Training summary dictionary from the training json.

class tdub.ml_apply.FoldedTrainSummary(fold_output)[source]

Bases: BaseTrainSummary

Provides access to the properties of a folded training.

Parameters:

fold_output (str) – Directory with the folded training output.

Examples

>>> from tdub.ml_apply import FoldedTrainSummary
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
apply_to_dataframe(df, column_name='unnamed_response', do_query=False)[source]

Apply trained models to an arbitrary dataframe.

This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn’t already exist. If the dataframe is empty this function does nothing.

Parameters:
  • df (pandas.DataFrame) – Dataframe to read and augment.

  • column_name (str) – Name to give the BDT response variable.

  • do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with the training result; necessary if the dataframe hasn’t been pre-filtered.

Examples

>>> from tdub.ml_apply import FoldedTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_1j1b.apply_to_dataframe(df, do_query=True)
property folder

Folding object used during training.

property model0

Model for the 0th fold.

property model1

Model for the 1st fold.

property model2

Model for the 2nd fold.

parse_summary_json(summary_file)[source]

Parse a training’s summary json file.

Parameters:

summary_file (str or os.PathLike) – the summary json file

class tdub.ml_apply.SingleTrainSummary(training_output)[source]

Bases: BaseTrainSummary

Provides access to the properties of a single result.

Parameters:

training_output (str) – Directory containing the training result.

Examples

>>> from tdub.ml_apply import SingleTrainSummary
>>> res_1j1b = SingleTrainSummary("/path/to/some_1j1b_training_outdir")
apply_to_dataframe(df, column_name='unnamed_response', do_query=False)[source]

Apply trained model to an arbitrary dataframe.

This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn’t already exist. If the dataframe is empty this function does nothing.

Parameters:
  • df (pandas.DataFrame) – Dataframe to read and augment.

  • column_name (str) – Name to give the BDT response variable.

  • do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with the training result; necessary if the dataframe hasn’t been pre-filtered.

Examples

>>> from tdub.ml_apply import SingleTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_1j1b.apply_to_dataframe(df, do_query=True)
property model

Trained model.

tdub.ml_apply.build_array(summaries, df)[source]

Get a NumPy array which is the response for all events in df.

This will use the apply_to_dataframe() function from the list of summaries. We query the input dataframe to ensure that we apply to the correct events. If the input dataframe is empty, an empty array is returned.

Parameters:
  • summaries (list(BaseTrainSummary)) – Train summaries to apply to the events in df.

  • df (pandas.DataFrame) – Dataframe of events to calculate the response for.

Examples

Using folded summaries:

>>> from tdub.ml_apply import FoldedTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_2j1b = FoldedTrainSummary("/path/to/folded_training_2j1b")
>>> fr_2j2b = FoldedTrainSummary("/path/to/folded_training_2j2b")
>>> res = build_array([fr_1j1b, fr_2j1b, fr_2j2b], df)

Using single summaries:

>>> from tdub.ml_apply import SingleTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_2j1b = SingleTrainSummary("/path/to/single_training_2j1b")
>>> sr_2j2b = SingleTrainSummary("/path/to/single_training_2j2b")
>>> res = build_array([sr_1j1b, sr_2j1b, sr_2j2b], df)

tdub.ml_train

A module for handling training

Class Summary

ResponseHistograms(response_type, model, ...)

Create and use histogrammed model response information.

SingleTrainingSummary(*[, auc, ks_test_sig, ...])

Describes some properties of a single training.

Function Summary

persist_prepared_data(out_dir, df, labels, ...)

Persist prepared data to disk.

prepare_from_root(sig_files, bkg_files, region)

Prepare the data to train in a region with signal and background ROOT files.

folded_training(df, labels, weights, params, ...)

Execute a folded training.

lgbm_gen_classifier([train_axes])

Create a classifier using LightGBM.

lgbm_train_classifier(clf, X_train, y_train, ...)

Train a LGBMClassifier.

single_training(df, labels, weights, ...[, ...])

Execute a single training with some parameters.

sklearn_gen_classifier([...])

Create a classifier using scikit-learn.

sklearn_train_classifier(clf, X_train, ...)

Train a Scikit-learn classifier.

tdub_train_axes([learning_rate, max_depth, ...])

Construct a dictionary of default tdub training tune.

Reference

class tdub.ml_train.ResponseHistograms(response_type, model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30)[source]

Create and use histogrammed model response information.

Parameters:
  • response_type (str) –

    Models provide different types of response, like a raw prediction or a probability of signal. This class supports:

    • “predict” (for LGBM),

    • “decision_function” (for Scikit-learn),

    • “proba” (for either).

  • model (BaseEstimator) – The trained model.

  • X_train (array_like) – Training data feature matrix.

  • X_test (array_like) – Testing data feature matrix.

  • y_train (array_like) – Training data labels.

  • y_test (array_like) – Testing data labels.

  • w_train (array_like) – Training data event weights

  • w_test (array_like) – Testing data event weights

  • nbins (int) – Number of bins to use.

as_dict()[source]

dict: The histogram information as a dictionary.

draw(ax=None, xlabel=None)[source]

Draw the response histograms.

Parameters:
  • ax (matplotlib.axes.Axes, optional) – Predefined matplotlib axes to use.

  • xlabel (str, optional) – Override the automated xlabel definition.

Returns:

  • matplotlib.figure.Figure – The matplotlib figure object.

  • matplotlib.axes.Axes – The matplotlib axes object.

property ks_bkg_pval

Two sample binned KS p-value for background.

Type:

float

property ks_bkg_test

Two sample binned KS test for background.

Type:

float

property ks_sig_pval

Two sample binned KS p-value for signal.

Type:

float

property ks_sig_test

Two sample binned KS test for signal.

Type:

float

class tdub.ml_train.SingleTrainingSummary(*, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0, ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs)[source]

Describes some properties of a single training.

Parameters:
  • auc (float) – the AUC value for the model

  • ks_test_sig (float) – the binned KS test value for signal

  • ks_pvalue_sig (float) – the binned KS test p-value for signal

  • ks_test_bkg (float) – the binned KS test value for background

  • ks_pvalue_bkg (float) – the binned KS test p-value for background

  • kwargs (dict) – currently unused

auc

the AUC value for the model

Type:

float

ks_test_sig

the binned KS test value for signal

Type:

float

ks_pvalue_sig

the binned KS test p-value for signal

Type:

float

ks_test_bkg

the binned KS test value for background

Type:

float

ks_pvalue_bkg

the binned KS test p-value for background

Type:

float

tdub.ml_train.persist_prepared_data(out_dir, df, labels, weights)[source]

Persist prepared data to disk.

The product of tdub.ml_train.prepare_from_root() is easily persisted to disk; this function performs that task. If the same prepared data is going to be used for multiple training executions, one can save CPU cycles by saving the prepared data instead of starting further upstream with our ROOT ntuples.

Parameters:
  • out_dir (str or os.PathLike) – Directory to save the prepared data to.

  • df (pandas.DataFrame) – Prepared feature matrix in dataframe format.

  • labels (numpy.ndarray) – Prepared event labels.

  • weights (numpy.ndarray) – Prepared event weights.

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, persist_prepared_data
>>> qfiles = quick_files("/path/to/data")
>>> df, y, w = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> persist_prepared_data("/path/to/output/data", df, y, w)
tdub.ml_train.prepare_from_root(sig_files, bkg_files, region, branches=None, override_selection=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, use_campaign_weight=False, use_tptrw=False, use_trrw=False, test_case_size=None, bkg_sample_frac=None)[source]

Prepare the data to train in a region with signal and background ROOT files.

Parameters:
  • sig_files (list(str)) – List of signal ROOT files.

  • bkg_files (list(str)) – List of background ROOT files.

  • region (Region or str) – Region where we’re going to perform the training.

  • branches (list(str), optional) – Override the list of features (usually defined by the region).

  • override_selection (str, optional) – Manual selection string to apply to the dataset (this will override the region defined selection).

  • weight_mean (float, optional) – Scale all weights such that the mean weight is this value. Cannot be used with weight_scale.

  • weight_scale (float, optional) – Value to scale all weights by, cannot be used with weight_mean.

  • scale_sum_weights (bool) – Scale sum of weights of signal to be sum of weights of background.

  • use_campaign_weight (bool) – See the parameter description for tdub.frames.iterative_selection().

  • use_tptrw (bool) – Apply the top pt reweighting factor.

  • use_trrw (bool) – Apply the top recursive reweighting factor.

  • test_case_size (int, optional) – Prepare a small test case dataset using this many training and testing samples.

  • bkg_sample_frac (float, optional) – Sample a fraction of the background data.

Returns:

the prepared dataframe, labels, and weights

Return type:

(pandas.DataFrame, numpy.ndarray, numpy.ndarray)

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
tdub.ml_train.folded_training(df, labels, weights, params, fit_kw, output_dir, region, kfold_kw=None)[source]

Execute a folded training.

Train a lightgbm.LGBMClassifier model with \(k\)-fold cross validation using the given input data and parameters. The models resulting from the training (and other important training information) are saved to output_dir. The entries in the kfold_kw argument are forwarded to the sklearn.model_selection.KFold class for data preprocessing. The default arguments that we use are (random_state is controlled by the tdub.config module):

  • n_splits: 3

  • shuffle: True

Parameters:
  • df (pandas.DataFrame) – Feature matrix in dataframe format

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background)

  • weights (numpy.ndarray) – Event weights

  • params (dict) – Hyperparameters for the LGBMClassifier

  • fit_kw (dict) – Extra keyword arguments passed to the classifier’s fit function

  • output_dir (str or os.PathLike) – Directory to save results of training

  • region (Region or str) – Region where the training is executed

  • kfold_kw (dict, optional) – Arguments forwarded to sklearn.model_selection.KFold

Returns:

Negative mean area under the ROC curve (AUC)

Return type:

float

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> from tdub.ml_train import folded_training
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> params = dict(
...     boosting_type="gbdt",
...     num_leaves=42,
...     learning_rate=0.05,
...     reg_alpha=0.2,
...     reg_lambda=0.8,
...     max_depth=5,
... )
>>> folded_training(
...     df,
...     labels,
...     weights,
...     params,
...     {"verbose": 20},
...     "/path/to/train/output",
...     "2j2b",
...     kfold_kw={"n_splits": 5, "shuffle": True},
... )
tdub.ml_train.lgbm_gen_classifier(train_axes=None, **clf_params)[source]

Create a classifier using LightGBM.

Parameters:
  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns:

The classifier.

Return type:

lightgbm.LGBMClassifier
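
Examples

A minimal sketch; the hyperparameter values are illustrative, and n_estimators is forwarded to the LGBMClassifier constructor via clf_params:

>>> from tdub.ml_train import tdub_train_axes, lgbm_gen_classifier
>>> axes = tdub_train_axes(learning_rate=0.05, max_depth=5)
>>> clf = lgbm_gen_classifier(train_axes=axes, n_estimators=500)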

tdub.ml_train.lgbm_train_classifier(clf, X_train, y_train, w_train, validation_fraction=0.2, early_stopping_rounds=10, **fit_params)[source]

Train a LGBMClassifier.

Parameters:
  • clf (lightgbm.LGBMClassifier) – The classifier

  • X_train (array_like) – Training events matrix

  • y_train (array_like) – Training event labels

  • w_train (array_like) – Training event weights

  • validation_fraction (float) – Fraction of training events to use in validation set.

  • early_stopping_rounds (int) – Number of early stopping rounds to use in training.

  • fit_params (keyword arguments) – Extra keyword arguments passed to the classifier.

Returns:

The same classifier object passed to the function

Return type:

lightgbm.LGBMClassifier
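
Examples

A sketch assuming df, labels, and weights come from prepare_from_root(); the split fraction and random state are illustrative:

>>> from sklearn.model_selection import train_test_split
>>> from tdub.ml_train import lgbm_gen_classifier, lgbm_train_classifier
>>> X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
...     df, labels, weights, test_size=0.33, random_state=414
... )
>>> clf = lgbm_gen_classifier()
>>> clf = lgbm_train_classifier(clf, X_train, y_train, w_train)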

tdub.ml_train.single_training(df, labels, weights, train_axes, output_dir, test_size=0.4, early_stopping_rounds=None, extra_summary_entries=None, use_sklearn=False, use_xgboost=False, save_lgbm_txt=False)[source]

Execute a single training with some parameters.

The model and some useful information (mostly plots) are saved to output_dir.

Parameters:
  • df (pandas.DataFrame) – Feature matrix in dataframe format

  • labels (numpy.ndarray) – Event labels (1 for signal; 0 for background)

  • weights (numpy.ndarray) – Event weights

  • train_axes (dict(str, Any)) – Dictionary of parameters defining the tdub train axes.

  • output_dir (str or os.PathLike) – Directory to save results of training

  • test_size (float) – Test size for splitting into training and testing sets

  • early_stopping_rounds (int, optional) – Number of rounds to have no improvement for stopping training.

  • extra_summary_entries (dict, optional) – Extra entries to save in the JSON output summary.

  • use_sklearn (bool) – Use Scikit-learn’s HistGradientBoostingClassifier.

  • use_xgboost (bool) – Use XGBoost’s XGBClassifier.

  • save_lgbm_txt (bool) – Save fitted LGBM model to text file (ignored if either use_sklearn or use_xgboost is True).

Returns:

Useful information about the training

Return type:

SingleTrainingSummary

Examples

>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, single_training, tdub_train_axes
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> train_axes = tdub_train_axes(
...     learning_rate=0.05,
...     max_depth=5,
... )
>>> single_training(
...     df,
...     labels,
...     weights,
...     train_axes,
...     "training_output",
... )
tdub.ml_train.sklearn_gen_classifier(early_stopping_rounds=10, validation_fraction=0.2, train_axes=None, **clf_params)[source]

Create a classifier using scikit-learn.

This uses Scikit-learn’s sklearn.ensemble.HistGradientBoostingClassifier.

Early stopping is configured via the constructor arguments; extra keyword arguments are passed to the classifier initialization.

Parameters:
  • early_stopping_rounds (int) – Passed as the n_iter_no_change argument to scikit-learn’s HistGradientBoostingClassifier.

  • validation_fraction (float) – Passed to the validation_fraction argument in scikit-learn’s HistGradientBoostingClassifier.

  • train_axes (dict[str, Any]) – Values of required tdub training parameters.

  • clf_params (kwargs) – Extra arguments passed to the constructor.

Returns:

The classifier.

Return type:

sklearn.ensemble.HistGradientBoostingClassifier

tdub.ml_train.sklearn_train_classifier(clf, X_train, y_train, w_train, **fit_params)[source]

Train a Scikit-learn classifier.

Parameters:
  • clf (sklearn.ensemble.HistGradientBoostingClassifier) – The classifier

  • X_train (array_like) – Training events matrix

  • y_train (array_like) – Training event labels

  • w_train (array_like) – Training event weights

  • fit_params (kwargs) – Extra keyword arguments passed to the classifier.

Returns:

The same classifier object passed to the function.

Return type:

sklearn.ensemble.HistGradientBoostingClassifier
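
Examples

A sketch mirroring the LightGBM flow, with placeholder training arrays:

>>> from tdub.ml_train import sklearn_gen_classifier, sklearn_train_classifier
>>> clf = sklearn_gen_classifier(early_stopping_rounds=15)
>>> clf = sklearn_train_classifier(clf, X_train, y_train, w_train)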

tdub.ml_train.tdub_train_axes(learning_rate=0.1, max_depth=5, min_child_samples=50, num_leaves=31, reg_lambda=0.0, **kwargs)[source]

Construct a dictionary of default tdub training tune.

Extra keyword arguments are swallowed but never used.

Parameters:
  • learning_rate (float) – Learning rate for a classifier.

  • max_depth (int) – Max depth for a classifier.

  • min_child_samples (int) – Min child samples for a classifier.

  • num_leaves (int) – Num leaves for a classifier.

  • reg_lambda (float) – Lambda regularization (L2 regularization).

Returns:

The argument names and values

Return type:

dict(str, Any)
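
Examples

Since the return value is a dictionary of the argument names and values, overriding a single axis looks like:

>>> from tdub.ml_train import tdub_train_axes
>>> axes = tdub_train_axes(learning_rate=0.05)
>>> axes["learning_rate"]
0.05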

tdub.rex

A module for parsing TRExFitter results and producing additional plots/tables.

Class Summary

FitParam([name, label, pre_down, pre_up, ...])

Fit parameter description as a dataclass.

GroupedImpact([name, avg, sig_lo, sig_hi])

Fit grouped impact summary.

Function Summary

available_regions(rex_dir)

Get a list of available regions from a TRExFitter result directory.

chisq(rex_dir, region[, stage])

Get pre- or post-fit \(\chi^2\) information from a TRExFitter region.

chisq_text(rex_dir, region[, stage])

Generate nicely formatted text for \(\chi^2\) information.

compare_nuispar(name, rex_dir1, rex_dir2[, ...])

Compare nuisance parameter info between two fits.

compare_uncertainty(rex_dir1, rex_dir2[, ...])

Compare uncertainty between two fits.

comparison_summary(rex_dir1, rex_dir2[, ...])

Summarize a comparison of two fits.

data_histogram(rex_dir, region[, fit_name])

Get the histogram for the Data in a region from a TRExFitter result.

delta_param(param1, param2)

Calculate difference between two fit parameters.

delta_poi(rex_dir1, rex_dir2[, fit_name1, ...])

Calculate difference of a POI between two results directories.

fit_parameter(fit_file, name[, prettify])

Retrieve a parameter from fit result text file.

grouped_impacts(rex_dir[, include_total])

Grab grouped impacts from a fit workspace.

grouped_impacts_table(rex_dir[, tablefmt, ...])

Construct a table of grouped impacts.

meta_axis_label(region, bin_width[, meta_table])

Construct an axis label from metadata table.

meta_text(region, stage)

Construct a piece of text based on the region and fit stage.

nuispar_impact(rex_dir, name[, label])

Extract a specific nuisance parameter from a fit.

nuispar_impacts(rex_dir[, sort])

Extract a list of nuisance parameter impacts from a fit.

nuispar_impact_plot_df(nuispars)

Construct a DataFrame to organize impact plot ingredients.

nuispar_impact_plot_top20(rex_dir[, thesis])

Plot the top 20 nuisance parameters based on impact.

plot_all_regions(rex_dir, outdir[, stage, ...])

Plot all regions discovered in a TRExFitter result directory.

plot_region_stage_ff(args)

Free (multiprocessing compatible) function to plot a region + stage.

prefit_total_and_uncertainty(rex_dir, region)

Get the prefit total MC prediction and uncertainty band for a region.

prefit_histogram(root_file, sample, region)

Get a prefit histogram from a file.

prefit_histograms(rex_dir, samples, region)

Retrieve sample prefit histograms for a region.

prettify_label(label)

Fix parameter label to look nice for plots.

postfit_available(rex_dir)

Check if TRExFitter result directory contains postFit information.

postfit_total_and_uncertainty(rex_dir, region)

Get the postfit total MC prediction and uncertainty band for a region.

postfit_histogram(root_file, sample)

Get a postfit histogram from a file.

postfit_histograms(rex_dir, samples, region)

Retrieve sample postfit histograms for a region.

stability_test_standard(umbrella[, outdir, ...])

Perform a battery of standard stability tests.

stability_test_parton_shower_impacts(...[, ...])

Perform a battery of parton shower impact stability tests.

stack_canvas(rex_dir, region[, stage, ...])

Create a pre- or post-fit plot canvas for a TRExFitter region.

Reference

class tdub.rex.FitParam(name='', label='', pre_down=0.0, pre_up=0.0, post_down=0.0, post_up=0.0, central=0.0, sig_lo=0.0, sig_hi=0.0, post_max=0.0)[source]

Fit parameter description as a dataclass.

name

Technical name of the nuisance parameter.

Type:

str

label

Pretty name for plotting.

Type:

str

pre_down

Prefit down variation impact on mu.

Type:

float

pre_up

Prefit up variation impact on mu.

Type:

float

post_down

Postfit down variation impact on mu.

Type:

float

post_up

Postfit up variation impact on mu.

Type:

float

central

Central value of the NP.

Type:

float

sig_lo

Lo error on the NP.

Type:

float

sig_hi

Hi error on the NP.

Type:

float

class tdub.rex.GroupedImpact(name='', avg=0.0, sig_lo=0.0, sig_hi=0.0)[source]

Fit grouped impact summary.

name

Technical name for the group.

Type:

str

avg

Average impact estimate.

Type:

float

sig_lo

Down fluctuation estimate.

Type:

float

sig_hi

Up fluctuation estimate.

Type:

float

property org_entry

Org table entry (rounded).

Type:

str

property org_entry_raw

Org table entry (raw numbers).

Type:

str

tdub.rex.available_regions(rex_dir)[source]

Get a list of available regions from a TRExFitter result directory.

Parameters:

rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

Returns:

Regions discovered in the TRExFitter result directory.

Return type:

list(str)
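
Examples

A sketch with a placeholder path (the region names shown in the comment are illustrative):

>>> from tdub.rex import available_regions
>>> regions = available_regions("/path/to/rex_dir")  # e.g. ["reg1j1b", "reg2j1b", "reg2j2b"]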

tdub.rex.chisq(rex_dir, region, stage='pre')[source]

Get pre- or post-fit \(\chi^2\) information from a TRExFitter region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • region (str) – TRExFitter region name.

  • stage (str) – Drawing fit stage, (‘pre’ or ‘post’).

Returns:

  • float\(\chi^2\) value for the region.

  • int – Number of degrees of freedom.

  • float\(\chi^2\) probability for the region.

tdub.rex.chisq_text(rex_dir, region, stage='pre')[source]

Generate nicely formatted text for \(\chi^2\) information.

Uses tdub.rex.chisq() to grab the info.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • region (str) – TRExFitter region name.

  • stage (str) – Drawing fit stage, (‘pre’ or ‘post’).

Returns:

Formatted string showing the \(\chi^2\) information.

Return type:

str
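
Examples

A sketch with placeholder inputs showing chisq() and chisq_text() together:

>>> from tdub.rex import chisq, chisq_text
>>> chi2_val, ndf, prob = chisq("/path/to/rex_dir", "reg2j2b", stage="post")
>>> text = chisq_text("/path/to/rex_dir", "reg2j2b", stage="post")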

tdub.rex.compare_nuispar(name, rex_dir1, rex_dir2, label1=None, label2=None, np_label=None, print_to=None)[source]

Compare nuisance parameter info between two fits.

Parameters:
  • name (str) – Name of the nuisance parameter.

  • rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.

  • rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.

  • label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).

  • label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).

  • np_label (str, optional) – Give the nuisance parameter a label other than its name.

  • print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).

tdub.rex.compare_uncertainty(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', label1=None, label2=None, poi='SigXsecOverSM', print_to=None)[source]

Compare uncertainty between two fits.

Parameters:
  • rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.

  • rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.

  • fit_name1 (str) – Name of the first fit.

  • fit_name2 (str) – Name of the second fit.

  • label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).

  • label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).

  • poi (str) – Name of the parameter of interest.

  • print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).

tdub.rex.comparison_summary(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', label1=None, label2=None, fit_poi='SigXsecOverSM', nuispars=None, nuispar_labels=None, print_to=None)[source]

Summarize a comparison of two fits.

Parameters:
  • rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.

  • rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.

  • fit_name1 (str) – Name of the first fit.

  • fit_name2 (str) – Name of the second fit.

  • label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).

  • label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).

  • fit_poi (str) – Name of the parameter of interest.

  • nuispars (list(str), optional) – Nuisance parameters to compare.

  • nuispar_labels (list(str), optional) – Labels to give each nuisance parameter other than the default name.

  • print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).
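
Examples

A sketch comparing two hypothetical fit result directories:

>>> from tdub.rex import comparison_summary
>>> comparison_summary("/path/to/fitA", "/path/to/fitB", label1="Fit A", label2="Fit B")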

tdub.rex.data_histogram(rex_dir, region, fit_name='tW')[source]

Get the histogram for the Data in a region from a TRExFitter result.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • region (str) – TRExFitter region name.

  • fit_name (str) – Name of the Fit

Returns:

Histogram for the Data sample.

Return type:

tdub.root.TH1
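
Examples

A sketch with a placeholder result directory and region:

>>> from tdub.rex import data_histogram
>>> data = data_histogram("/path/to/rex_dir", "reg1j1b")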

tdub.rex.delta_param(param1, param2)[source]

Calculate difference between two fit parameters.

Parameters:
  • param1 (FitParam) – The first fit parameter.

  • param2 (FitParam) – The second fit parameter.

Returns:

  • float – Difference between the central values

  • float – Up uncertainty

  • float – Down uncertainty

tdub.rex.delta_poi(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', poi='SigXsecOverSM')[source]

Calculate difference of a POI between two results directories.

The default arguments will perform a calculation of \(\Delta\mu\) between two different fits. Standard error propagation is performed on both the up and down uncertainties.

Parameters:
  • rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.

  • rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.

  • fit_name1 (str) – Name of the first fit.

  • fit_name2 (str) – Name of the second fit.

  • poi (str) – Name of the parameter of interest.

Returns:

  • float – Central value of delta mu.

  • float – Up uncertainty on delta mu.

  • float – Down uncertainty on delta mu.

tdub.rex.fit_parameter(fit_file, name, prettify=False)[source]

Retrieve a parameter from fit result text file.

Parameters:
  • fit_file (str or pathlib.Path) – Path of the fit result text file.

  • name (str) – Name of the desired fit parameter.

  • prettify (bool) – Prettify the parameter label for plotting.

Raises:

ValueError – If the parameter name isn’t discovered.

Returns:

Fit parameter description.

Return type:

tdub.rex.FitParam

tdub.rex.grouped_impacts(rex_dir, include_total=False)[source]

Grab grouped impacts from a fit workspace.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • include_total (bool) – Include the FullSyst entry.

Yields:

GroupedImpact – Iterator of grouped impacts in the fit.

tdub.rex.grouped_impacts_table(rex_dir, tablefmt='orgtbl', descending=False, **kwargs)[source]

Construct a table of grouped impacts.

Uses the tabulate project (https://pypi.org/project/tabulate).

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • tablefmt (str) – Format passed to tabulate.

  • descending (bool) – Sort by descending order

  • **kwargs (dict) – Passed to grouped_impacts()

Returns:

Table representation.

Return type:

str
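
Examples

A sketch with a placeholder directory; include_total is forwarded to grouped_impacts():

>>> from tdub.rex import grouped_impacts_table
>>> print(grouped_impacts_table("/path/to/rex_dir", tablefmt="github", include_total=True))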

tdub.rex.plot_all_regions(rex_dir, outdir, stage='pre', fit_name='tW', show_chisq=True, n_test=-1, internal=True, thesis=False, save_png=False)[source]

Plot all regions discovered in a TRExFitter result directory.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • outdir (str or pathlib.Path) – Path to save resulting files to

  • stage (str) – Fitting stage (“pre” or “post”).

  • fit_name (str) – Name of the Fit

  • show_chisq (bool) – Print \(\chi^2\) information on ratio canvas.

  • n_test (int) – Maximum number of regions to plot (for quick tests).

  • internal (bool) – Flag for internal label.

  • thesis (bool) – Flag for thesis label.

  • save_png (bool) – Save png versions along with the pdf versions of plots.

tdub.rex.plot_region_stage_ff(args)[source]

Free (multiprocessing compatible) function to plot a region + stage.

This function is designed to be used internally by plot_all_regions(), where it is sent to a multiprocessing pool. Not meant for generic usage.

Parameters:

args (list(Any)) – Arguments passed to stack_canvas().

tdub.rex.meta_axis_label(region, bin_width, meta_table=None)[source]

Construct an axis label from metadata table.

Parameters:
  • region (str) – TRExFitter region to use.

  • bin_width (float) – Bin width for y-axis label.

  • meta_table (dict, optional) – Table of metadata for labeling plotting axes. If None (default), the definition stored in the variable tdub.config.PLOTTING_META_TABLE is used.

Returns:

  • str – x-axis label for the region.

  • str – y-axis label for the region.

tdub.rex.meta_text(region, stage)[source]

Construct a piece of text based on the region and fit stage.

Parameters:
  • region (str) – TRExFitter Region to use.

  • stage (str) – Fitting stage (“pre” or “post”).

Returns:

Resulting metadata text

Return type:

str

tdub.rex.nuispar_impact(rex_dir, name, label=None)[source]

Extract a specific nuisance parameter from a fit.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • name (str) – Name of the nuisance parameter.

  • label (str, optional) – Give the nuisance parameter a label other than its name.

Returns:

Desired nuisance parameter summary.

Return type:

tdub.rex.FitParam

tdub.rex.nuispar_impacts(rex_dir, sort=True)[source]

Extract a list of nuisance parameter impacts from a fit.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • sort (bool) – Sort the parameters by impact.

Returns:

The nuisance parameters.

Return type:

list(tdub.rex.FitParam)

tdub.rex.nuispar_impact_plot_df(nuispars)[source]

Construct a DataFrame to organize impact plot ingredients.

Parameters:

nuispars (list(FitParam)) – The nuisance parameters.

Returns:

DataFrame describing the plot ingredients.

Return type:

pandas.DataFrame

tdub.rex.nuispar_impact_plot_top20(rex_dir, thesis=False)[source]

Plot the top 20 nuisance parameters based on impact.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • thesis (bool) – Flag for thesis label.

tdub.rex.prefit_total_and_uncertainty(rex_dir, region)[source]

Get the prefit total MC prediction and uncertainty band for a region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • region (str) – Region to get error band for.

Returns:

  • tdub.root.TH1 – The total MC prediction for the region.

  • tdub.root.TGraphAsymmErrors – The uncertainty band for the region.

tdub.rex.prefit_histogram(root_file, sample, region)[source]

Get a prefit histogram from a file.

Parameters:
  • root_file (uproot.reading.ReadOnlyDirectory) – ROOT file opened with uproot.

  • sample (str) – Physics sample name.

  • region (str) – Region whose histogram is desired.

Returns:

Desired histogram.

Return type:

tdub.root.TH1

tdub.rex.prefit_histograms(rex_dir, samples, region, fit_name='tW')[source]

Retrieve sample prefit histograms for a region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • samples (Iterable(str)) – Physics samples of the desired histograms

  • region (str) – Region to get histograms for

  • fit_name (str) – Name of the Fit

Returns:

Prefit histograms.

Return type:

dict(str, tdub.root.TH1)
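
Examples

A sketch with a placeholder directory; the sample names are illustrative:

>>> from tdub.rex import prefit_histograms
>>> samples = ("tW", "ttbar", "Zjets", "Diboson", "MCNP")
>>> hists = prefit_histograms("/path/to/rex_dir", samples, "reg2j2b")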

tdub.rex.prettify_label(label)[source]

Fix parameter label to look nice for plots.

Replace underscores with whitespace, TeXify some stuff, remove unnecessary things, etc.

Parameters:

label (str) – Original label.

Returns:

Prettified label.

Return type:

str

tdub.rex.postfit_available(rex_dir)[source]

Check if TRExFitter result directory contains postFit information.

Parameters:

rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

Returns:

True if postfit information is discovered

Return type:

bool

tdub.rex.postfit_total_and_uncertainty(rex_dir, region)[source]

Get the postfit total MC prediction and uncertainty band for a region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • region (str) – Region to get error band for.

Returns:

  • tdub.root.TH1 – The total MC prediction for the region.

  • tdub.root.TGraphAsymmErrors – The uncertainty band for the region.

tdub.rex.postfit_histogram(root_file, sample)[source]

Get a postfit histogram from a file.

Parameters:
  • root_file (uproot.reading.ReadOnlyDirectory) – ROOT file opened with uproot.

  • sample (str) – Physics sample name.

Returns:

Desired histogram.

Return type:

tdub.root.TH1

tdub.rex.postfit_histograms(rex_dir, samples, region)[source]

Retrieve sample postfit histograms for a region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory

  • region (str) – Region to get histograms for

  • samples (Iterable(str)) – Physics samples of the desired histograms

Returns:

Postfit histograms detected in the TRExFitter result directory.

Return type:

dict(str, tdub.root.TH1)
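
Examples

A sketch guarding on postfit_available(); samples is a placeholder iterable as in the prefit example:

>>> from tdub.rex import postfit_available, postfit_histograms
>>> if postfit_available("/path/to/rex_dir"):
...     hists = postfit_histograms("/path/to/rex_dir", samples, "reg2j2b")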

tdub.rex.stability_test_standard(umbrella, outdir=None, tests='all')[source]

Perform a battery of standard stability tests.

This function expects a rigid umbrella directory structure, based on the fit results generated by rexpy.

Parameters:
  • umbrella (pathlib.Path) – Umbrella directory containing all fits run via rexpy’s standard fits.

  • outdir (pathlib.Path, optional) – Directory to save results (defaults to current working directory).

  • tests (str or list(str)) –

    Which tests to execute (default is “all”). The possible tests include:

    • "sys-drops", which shows the stability test for dropping some systematics.

    • "indiv-camps", which shows the stability test for limiting the fit to individual campaigns.

    • "regions", which shows the stability test for limiting the fit to subsets of the analysis regions.

    • "b0-check", which shows the stability test for limiting the fit to individual analysis regions and checking the B0 eigenvector uncertainty.

tdub.rex.stability_test_parton_shower_impacts(herwig704, herwig713, outdir=None)[source]

Perform a battery of parton shower impact stability tests.

This function expects a rigid pair of Herwig 7.0.4 and 7.1.3 directories, based on the fit results generated by rexpy.

Parameters:
  • herwig704 (pathlib.Path) – Path of the Herwig 7.0.4 fit results

  • herwig713 (pathlib.Path) – Path of the Herwig 7.1.3 fit results

  • outdir (pathlib.Path, optional) – Directory to save results (defaults to current working directory).

tdub.rex.stack_canvas(rex_dir, region, stage='pre', fit_name='tW', show_chisq=True, meta_table=None, log_patterns=None, internal=True, thesis=False, combine_minor=True)[source]

Create a pre- or post-fit plot canvas for a TRExFitter region.

Parameters:
  • rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.

  • region (str) – Region to get error band for.

  • stage (str) – Drawing fit stage, (“pre” or “post”).

  • fit_name (str) – Name of the Fit

  • show_chisq (bool) – Print \(\chi^2\) information on ratio canvas.

  • meta_table (dict, optional) – Table of metadata for labeling plotting axes.

  • log_patterns (list, optional) – List of region patterns to use a log scale on y-axis.

  • internal (bool) – Flag for internal label.

  • thesis (bool) – Flag for thesis label.

  • combine_minor (bool) – Combine minor backgrounds into a single contribution (Zjets, Diboson, and MCNP will be labeled “Minor Backgrounds”).

Returns:

  • matplotlib.figure.Figure – The matplotlib figure.

  • matplotlib.axes.Axes – The main (stack) axes.

  • matplotlib.axes.Axes – The ratio axes.

tdub.root

A module for working with ROOT-like objects (without ROOT itself).

Class Summary

TH1(root_object)

Wrapper around uproot's interpretation of ROOT's TH1.

TGraphAsymmErrors(root_object)

Wrapper around uproot's interpretation of ROOT's TGraphAsymmErrors.

Reference

class tdub.root.TH1(root_object)[source]

Wrapper around uproot’s interpretation of ROOT’s TH1.

This class interprets the histogram in a way that ignores under and overflow bins. We expect the treatment of those values to already be accounted for.

Parameters:

root_object (uproot.behaviors.TH1.Histogram) – Object from reading ROOT file with uproot.

property bin_width

Width of a bin.

Type:

float

property centers

Histogram bin centers.

Type:

numpy.ndarray

property counts

Histogram bin counts.

Type:

numpy.ndarray

property edges

Histogram bin edges.

Type:

numpy.ndarray

property errors

Histogram bin errors.

Type:

numpy.ndarray
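
Examples

A sketch of wrapping an uproot-read histogram; the file and histogram names are placeholders:

>>> import uproot
>>> from tdub.root import TH1
>>> h = TH1(uproot.open("histograms.root")["some_region_hist"])
>>> counts, errors = h.counts, h.errors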

class tdub.root.TGraphAsymmErrors(root_object)[source]

Wrapper around uproot’s interpretation of ROOT’s TGraphAsymmErrors.

Parameters:

root_object (uproot.model.Model) – Object from reading ROOT file with uproot.

property xhi

X-axis high errors.

Type:

numpy.ndarray

property xlo

X-axis low errors.

Type:

numpy.ndarray

property yhi

Y-axis high errors.

Type:

numpy.ndarray

property ylo

Y-axis low errors.

Type:

numpy.ndarray