tdub
tdub is a Python project for handling some downstream steps in the ATLAS Run 2 \(tW\) inclusive cross section analysis. The project provides a simple command line interface for performing standard analysis tasks including:
- BDT feature selection and hyperparameter optimization.
- Training BDT models on our Monte Carlo.
- Applying trained BDT models to our data and Monte Carlo.
- Generating plots from various raw sources (our ROOT files and classifier training output).
- Generating plots from the output of TRExFitter.
For potentially finer-grained tasks the API is fully documented. The API mainly provides quick and easy access to pythonic representations (i.e. dataframes or NumPy arrays) of our datasets (which of course originate from ROOT files), modularized ML tasks, and a set of utilities tailored for interacting with our specific datasets.
Click Based CLI
The command line interface provides a way to execute a handful of common tasks without touching any Python code. The CLI is implemented using click.
tdub
Top level CLI function.
tdub [OPTIONS] COMMAND [ARGS]...
apply
Tasks to apply machine learning models to data.
tdub apply [OPTIONS] COMMAND [ARGS]...
all
Generate BDT response arrays for all ROOT files in DATADIR.
tdub apply all [OPTIONS] DATADIR ARRNAME OUTDIR WORKSPACE
Options
- -f, --fold-results <fold_results>: fold output directories
- -s, --single-results <single_results>: single result directories
- --and-submit: submit the condor jobs
Arguments
- DATADIR: Required argument
- ARRNAME: Required argument
- OUTDIR: Required argument
- WORKSPACE: Required argument
single
Generate BDT response array for INFILE and save to .npy file.
We generate the .npy files using either single training results (-s flag) or folded training results (-f flag).
tdub apply single [OPTIONS] INFILE ARRNAME OUTDIR
Options
- -f, --fold-results <fold_results>: fold output directories
- -s, --single-results <single_results>: single result directories
Arguments
- INFILE: Required argument
- ARRNAME: Required argument
- OUTDIR: Required argument
misc
Tasks under a miscellaneous umbrella.
tdub misc [OPTIONS] COMMAND [ARGS]...
drdscomps
Generate plots comparing DR and DS (with BDT cuts shown).
tdub misc drdscomps [OPTIONS] DATADIR
Options
- -o, --outdir <outdir>: Output directory.
- --thesis: Flag for thesis label.
Arguments
- DATADIR: Required argument
soverb
Get signal over background using data in DATADIR and a SELECTIONS file.
The format of the JSON entries should be "region": "numexpr selection".
tdub misc soverb [OPTIONS] DATADIR SELECTIONS
Options
- -t, --use-tptrw: use top pt reweighting
Arguments
- DATADIR: Required argument
- SELECTIONS: Required argument
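As a sketch, a minimal SELECTIONS file in that format (the region names and cuts here mirror the default selections shown in the tdub.config and tdub.data sections below) could be:
{
    "1j1b": "(reg1j1b == True) & (OS == True)",
    "2j2b": "(reg2j2b == True) & (OS == True)"
}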
rex
Tasks interacting with TRExFitter results.
tdub rex [OPTIONS] COMMAND [ARGS]...
grimpacts
Print summary of grouped impacts.
tdub rex grimpacts [OPTIONS] REX_DIR
Options
- --tablefmt <tablefmt>: Format passed to tabulate.
- --include-total: Include FullSyst entry.
Arguments
- REX_DIR: Required argument
impact
Generate impact plot from TRExFitter result.
tdub rex impact [OPTIONS] REX_DIR
Options
- --thesis: Flag to use thesis label.
Arguments
- REX_DIR: Required argument
impstabs
Generate impact stability tests based on rexpy output.
tdub rex impstabs [OPTIONS] HERWIG704 HERWIG713
Options
- -o, --outdir <outdir>: Output directory.
Arguments
- HERWIG704: Required argument
- HERWIG713: Required argument
index
Generate index.html file for the workspace.
tdub rex index [OPTIONS] REX_DIR
Arguments
- REX_DIR: Required argument
stabs
Generate stability tests based on rexpy output.
tdub rex stabs [OPTIONS] UMBRELLA
Options
- -o, --outdir <outdir>: Output directory.
- -t, --tests <tests>: Tests to run.
Arguments
- UMBRELLA: Required argument
stacks
Generate plots from TRExFitter result.
tdub rex stacks [OPTIONS] REX_DIR
Options
- --chisq, --no-chisq: Do or don't print chi-square information.
- --internal, --no-internal: Do or don't include internal label.
- --thesis, --no-thesis: Use thesis label.
- --png, --no-png: Also save PNG version of plots.
- -n, --n-test <n_test>: Test only n plots (for stacks).
Arguments
- REX_DIR: Required argument
train
Tasks to perform machine learning steps.
tdub train [OPTIONS] COMMAND [ARGS]...
check
Check the results of a parameter scan WORKSPACE.
tdub train check [OPTIONS] WORKSPACE
Options
- -p, --print-top: Print the top results
- -n, --n-res <n_res>: Number of top results to print (default: 10)
Arguments
- WORKSPACE: Required argument
fold
Perform a folded training based on a hyperparameter scan result.
tdub train fold [OPTIONS] SCANDIR DATADIR
Options
- -t, --use-tptrw: use top pt reweighting
- -n, --n-splits <n_splits>: number of splits for folding (default: 3)
Arguments
- SCANDIR: Required argument
- DATADIR: Required argument
itables
Generate importance tables.
tdub train itables [OPTIONS] SUMMARY_FILE
Arguments
- SUMMARY_FILE: Required argument
prep
Prepare data for training.
tdub train prep [OPTIONS] DATADIR [1j1b|2j1b|2j2b] OUTDIR
Options
- -p, --pre-exec <pre_exec>: Python code to pre-execute
- -n, --nlo-method <nlo_method>: tW simulation NLO method (default: DR)
- -x, --override-selection <override_selection>: override selection with contents of file
- -t, --use-tptrw: apply top pt reweighting
- -r, --use-trrw: apply top recursive reweighting
- -i, --ignore-list <ignore_list>: variable ignore list file
- -m, --multiple-ttbar-samples: use multiple ttbar MC samples
- -a, --use-inc-af2: use inclusive af2 samples
- -f, --bkg-sample-frac <bkg_sample_frac>: use a fraction of the background
- -d, --use-dilep: train with dilepton samples
Arguments
- DATADIR: Required argument
- REGION: Required argument
- OUTDIR: Required argument
scan
Perform a parameter scan via condor jobs.
DATADIR points to the input ROOT files, training is performed on the REGION, and all output is saved to WORKSPACE.
$ tdub train scan /data/path 2j2b scan_2j2b
tdub train scan [OPTIONS] DATADIR WORKSPACE
Options
- -p, --pre-exec <pre_exec>: Python code to pre-execute
- -e, --early-stop <early_stop>: number of early stopping rounds (default: 10)
- -s, --test-size <test_size>: training test size (default: 0.4)
- --overwrite: overwrite existing workspace
- --and-submit: submit the condor jobs
Arguments
- DATADIR: Required argument
- WORKSPACE: Required argument
shapes
Generate shape comparison plots.
tdub train shapes [OPTIONS] DATADIR
Options
- -o, --outdir <outdir>: Directory to save output.
Arguments
- DATADIR: Required argument
single
Execute single training round.
tdub train single [OPTIONS] DATADIR OUTDIR
Options
- -p, --pre-exec <pre_exec>: Python code to pre-execute
- -s, --test-size <test_size>: training test size (default: 0.4)
- -e, --early-stop <early_stop>: number of early stopping rounds (default: 10)
- -k, --use-sklearn: use sklearn instead of lgbm
- -g, --use-xgboost: use xgboost instead of lgbm
- -l, --learning-rate <learning_rate>: learning_rate model parameter (default: 0.1)
- -n, --num-leaves <num_leaves>: num_leaves model parameter (default: 16)
- -m, --min-child-samples <min_child_samples>: min_child_samples model parameter (default: 500)
- -d, --max-depth <max_depth>: max_depth model parameter (default: 5)
- -r, --reg-lambda <reg_lambda>: lambda (L2) regularization (default: 0)
Arguments
- DATADIR: Required argument
- OUTDIR: Required argument
tdub.art
A module for art (plots).
Function Summary
- Create a plot canvas given a dictionary of counts and bin edges.
- Draw the ATLAS label text, with extra lines if desired.
- Draw the impact plot.
- Draw uncertainty bands on both axes in stack plot with a ratio.
- Move the last element of the legend to first.
- Create plot for one sided systematic comparison.
- Modify matplotlib's rcParams to our preference.
Reference
tdub.art.canvas_from_counts(counts, errors, bin_edges, uncertainty=None, total_mc=None, logy=False, **subplots_kw)
Create a plot canvas given a dictionary of counts and bin edges.
The counts and errors dictionaries are expected to have the following keys: "Data", "tW_DR" or "tW", "ttbar", "Zjets", "Diboson", and "MCNP".
- Parameters
counts (dict(str, np.ndarray)) – a dictionary pairing samples to bin counts.
errors (dict(str, np.ndarray)) – a dictionary pairing samples to bin count errors.
bin_edges (array_like) – the histogram bin edges.
uncertainty (tdub.root.TGraphAsymmErrors) – Uncertainty (TGraphAsym).
total_mc (tdub.root.TH1) – Total MC histogram (TH1D).
subplots_kw (dict) – remaining keyword arguments passed to matplotlib.pyplot.subplots().
- Returns
matplotlib.figure.Figure – Matplotlib figure.
matplotlib.axes.Axes – Matplotlib axes for the histogram stack.
matplotlib.axes.Axes – Matplotlib axes for the ratio comparison.
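A minimal sketch of building a canvas (the sample names follow the expected keys above; the bin contents are fabricated placeholders purely for illustration):
>>> import numpy as np
>>> from tdub.art import canvas_from_counts
>>> rng = np.random.default_rng(123)
>>> edges = np.linspace(0, 1, 11)
>>> samples = ["Data", "tW_DR", "ttbar", "Zjets", "Diboson", "MCNP"]
>>> counts = {s: rng.poisson(100, size=10).astype(float) for s in samples}
>>> errors = {s: np.sqrt(counts[s]) for s in samples}
>>> fig, ax, axr = canvas_from_counts(counts, errors, edges)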
tdub.art.draw_atlas_label(ax, follow='Internal', cme_and_lumi=True, extra_lines=None, cme=13, lumi=139, x=0.04, y=0.905, follow_shift=0.17, s1=18, s2=14, thesis=False)
Draw the ATLAS label text, with extra lines if desired.
- Parameters
ax (matplotlib.axes.Axes) – Axes to draw the text on.
follow (str) – Text to follow the ATLAS label (usually 'Internal').
extra_lines (list(str), optional) – Set of extra lines to draw below ATLAS label.
x (float) – x-location of the text.
y (float) – y-location of the text.
follow_shift (float) – x-shift of the text following the ATLAS label.
s1 (int) – Size of the main label.
s2 (int) – Size of the extra text.
thesis (bool) – Flag to use the thesis label.
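A minimal usage sketch built from the documented signature:
>>> import matplotlib.pyplot as plt
>>> from tdub.art import draw_atlas_label
>>> fig, ax = plt.subplots()
>>> draw_atlas_label(ax, follow="Internal", extra_lines=["tW analysis"])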
tdub.art.draw_impact_barh(ax, df, hi_color='steelblue', lo_color='mediumturquoise', height_fill=0.8, height_line=0.8)
Draw the impact plot.
- Parameters
ax (matplotlib.axes.Axes) – Axes for the "delta mu" impact.
df (pandas.DataFrame) – Dataframe containing impact information.
hi_color (str) – Up variation color.
lo_color (str) – Down variation color.
height_fill (float) – Height for the filled bars (post-fit).
height_line (float) – Height for the line (unfilled) bars (pre-fit).
- Returns
matplotlib.axes.Axes – Axes for the impact: "delta mu".
matplotlib.axes.Axes – Axes for the nuisance parameter pull.
tdub.art.draw_uncertainty_bands(uncertainty, total_mc, ax, axr, label='Uncertainty', edgecolor='mediumblue', zero_threshold=0.25)
Draw uncertainty bands on both axes in stack plot with a ratio.
- Parameters
uncertainty (tdub.root.TGraphAsymmErrors) – ROOT TGraphAsymmErrors with full systematic uncertainty.
total_mc (tdub.root.TH1) – ROOT TH1 providing the full Monte Carlo prediction.
ax (matplotlib.axes.Axes) – Main axis (where the histogram stack is painted).
axr (matplotlib.axes.Axes) – Ratio axis.
label (str) – Legend label for the uncertainty.
zero_threshold (float) – When total MC events are below threshold, zero contents and error.
tdub.art.legend_last_to_first(ax, **kwargs)
Move the last element of the legend to first.
- Parameters
ax (matplotlib.axes.Axes) – Matplotlib axes to create a legend on.
kwargs (dict) – Arguments passed to matplotlib.axes.Axes.legend.
tdub.art.one_sided_comparison_plot(nominal, one_up, edges, thesis=False)
Create plot for one sided systematic comparison.
- Parameters
nominal (numpy.ndarray) – Nominal histogram bin counts.
one_up (numpy.ndarray) – One \(\sigma\) up variation.
edges (numpy.ndarray) – Array defining bin edges.
thesis (bool) – Label for thesis figure.
- Returns
matplotlib.figure.Figure – Matplotlib figure.
matplotlib.axes.Axes – Matplotlib axes for the histograms.
matplotlib.axes.Axes – Matplotlib axes for the percent difference comparison.
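A minimal usage sketch (the bin contents are illustrative placeholders; the up variation is taken as a flat 8% shift):
>>> import numpy as np
>>> from tdub.art import one_sided_comparison_plot
>>> edges = np.linspace(0, 1, 6)
>>> nominal = np.array([20.0, 35.0, 40.0, 30.0, 10.0])
>>> one_up = nominal * 1.08
>>> fig, ax, axr = one_sided_comparison_plot(nominal, one_up, edges)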
tdub.batch
A module for running batch jobs (currently targets the US ATLAS BNL cluster).
Function Summary
- Add an arguments line to a condor submission script.
- Create the preamble of a condor submission script.
- Create a condor workspace given a name.
Reference
tdub.batch.add_condor_arguments(arguments, to_file)
Add an arguments line to a condor submission script.
The arguments argument is prefixed with "Arguments = " and written to to_file.
- Parameters
arguments (str) – the arguments line
to_file (typing.TextIO) – the open file stream
Examples
>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)
tdub.batch.condor_preamble(workspace, exe, universe='vanilla', memory='2GB', email='ddavis@phy.duke.edu', notification='Error', getenv='True', to_file=None, **kwargs)
Create the preamble of a condor submission script.
Extra kwargs create additional preamble entries. See the HTCondor documentation for more details on all parameters.
- Parameters
workspace (str or os.PathLike) – the filesystem directory where the workspace is
exe (str or os.PathLike) – the path of the executable that condor will run
universe (str) – the HTCondor universe
memory (str) – the requested memory
email (str) – the email to send updates to (if any)
notification (str) – the condor notification argument
to_file (typing.TextIO, optional) – if not None, write the string to file
- Returns
the submission script preamble
- Return type
str
Examples
>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)
tdub.batch.create_condor_workspace(name, overwrite=False)
Create a condor workspace given a name.
This will create a new directory containing log, out, and err directories inside. The workspace argument to the condor_preamble() function assumes creation of a workspace via this function. Missing parent directories will always be created.
- Parameters
name (str or os.PathLike) – the desired filesystem path for the workspace
overwrite (bool) – if True, an existing workspace will be overwritten
- Raises
OSError – if the filesystem path exists and overwrite is False
- Returns
filesystem path to the workspace
- Return type
pathlib.Path
Examples
>>> import tdub.batch as tb
>>> import shutil
>>> ws = tb.create_condor_workspace("./some/ws")
>>> with open(ws / "condor.sub", "w") as f:
...     preamble = tb.condor_preamble(ws, shutil.which("tdub"), to_file=f)
...     tb.add_condor_arguments("train-single ......", f)
tdub.config
Analysis configuration module.
tdub is a Python library for physics analysis. Naturally some properties of the analysis need to be easily modifiable for various studies. This module houses a handful of variables that can be modified simply by importing the module.
For example, we can call tdub.data.features_for() and expect different results without changing the API usage, just by changing the configuration module's FEATURESET_foo constants:
>>> from tdub.data import features_for
>>> features_for("2j2b")
['mass_lep1jet1', 'mass_lep2jet1', 'pT_jet2', ...]
>>> import tdub.config
>>> tdub.config.FEATURESET_2j2b = ["pT_jet1", "met"]
>>> features_for("2j2b")
['pT_jet1', 'met']
Similarly, we can modify the selection via this module:
>>> from tdub.data import selection_for
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> import tdub.config
>>> tdub.config.SELECTION_2j2b = "(reg2j2b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True) & (mass_lep1jet1 < 155)'
This module also contains some convenience functions for helping to automate the process of providing some sensible defaults for some configuration options, but not at import time (i.e. if the default requires importing a module or parsing some data from the web).
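For example, pinning the documented RANDOM_STATE seed before any random-dependent steps follows the same import-and-modify pattern (the value here is arbitrary):
>>> import tdub.config
>>> tdub.config.RANDOM_STATE = 414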
Constant Summary
- List of features to avoid in classifiers.
- List of features to avoid specifically in 1j1b classifiers.
- List of features to avoid specifically in 2j1b classifiers.
- List of features to avoid specifically in 2j2b classifiers.
- The default grid to perform a parameter scan.
- List of features we use for classifiers in the 1j1b region.
- List of features we use for classifiers in the 2j1b region.
- List of features we use for classifiers in the 2j2b region.
- Plots (defined as TRExFitter Regions) to use log scale.
- Plotting metadata table.
- Seed for various random tasks requiring reproducibility.
- The numexpr selection string for the 1j1b region.
- The numexpr selection string for the 2j1b region.
- The numexpr selection string for the 2j2b region.
Function Summary
- Load metadata from network to define PLOTTING_META_TABLE.
- Set a sensible default PLOTTING_LOGY value.
Constant Reference
tdub.config.AVOID_IN_CLF_1j1b
List of features to avoid specifically in 1j1b classifiers.
tdub.config.AVOID_IN_CLF_2j1b
List of features to avoid specifically in 2j1b classifiers.
tdub.config.AVOID_IN_CLF_2j2b
List of features to avoid specifically in 2j2b classifiers.
tdub.config.FEATURESET_1j1b
List of features we use for classifiers in the 1j1b region.
tdub.config.FEATURESET_2j1b
List of features we use for classifiers in the 2j1b region.
tdub.config.FEATURESET_2j2b
List of features we use for classifiers in the 2j2b region.
tdub.data
A module for handling our data.
Class Summary
- A simple enum class for easily using region information.
- Describes a sample's attributes given its name.
Function Summary
- Convert input to Region.
- Get the features to avoid for the given region.
- Get a list of branches from a data source.
- Categorize branches into separated lists.
- Get the feature list for a region.
- Get a dictionary connecting sample processes to file lists.
- Get the numexpr selection string from an arbitrary selection.
- Get the ROOT selection string from an arbitrary selection.
- Construct the minimal set of branches required for a selection.
- Get the selection for a given region.
Reference
class tdub.data.Region(value)
A simple enum class for easily using region information.
- r1j1b: Label for our 1j1b region.
- r2j1b: Label for our 2j1b region.
- r2j2b: Label for our 2j2b region.
Examples
Using this enum for grabbing the 2j2b region from a set of files:
>>> from tdub.data import Region, selection_for
>>> from tdub.frames import iterative_selection
>>> df = iterative_selection(files, selection_for(Region.r2j2b))
static from_str(s)
Get enum value for the given string.
This function supports three ways to define a region: prefixed with "r", prefixed with "reg", or no prefix at all. For example, Region.r2j2b can be retrieved like so:
Region.from_str("r2j2b")
Region.from_str("reg2j2b")
Region.from_str("2j2b")
- Parameters
s (str) – String representation of the desired region
- Returns
Enum version
- Return type
Region
Examples
>>> from tdub.data import Region
>>> Region.from_str("1j1b")
<Region.r1j1b: 0>
class tdub.data.SampleInfo(input_file)
Describes a sample's attributes given its name.
- Parameters
input_file (str) – File stem containing the necessary groups to parse.
Examples
>>> from tdub.data import SampleInfo
>>> sampinfo = SampleInfo("ttbar_410472_AFII_MC16d_nominal.root")
>>> sampinfo.phy_process
ttbar
>>> sampinfo.dsid
410472
>>> sampinfo.sim_type
AFII
>>> sampinfo.campaign
MC16d
>>> sampinfo.tree
nominal
tdub.data.as_region(region)
Convert input to Region. Meant to be similar to the numpy.asarray() function.
- Parameters
region (str or Region) – Region already as a Region or as a str.
- Returns
Region representation.
- Return type
Region
Examples
>>> from tdub.data import as_region, Region
>>> as_region("r2j1b")
<Region.r2j1b: 1>
>>> as_region(Region.r2j2b)
<Region.r2j2b: 2>
tdub.data.avoids_for(region)
Get the features to avoid for the given region.
See the tdub.config module for the definition of the variables to avoid (and how to modify them).
- Parameters
region (str or tdub.data.Region) – Region to get the associated avoided branches.
- Returns
Features to avoid for the region.
- Return type
list(str)
Examples
>>> from tdub.data import avoids_for, Region
>>> avoids_for(Region.r2j1b)
['HT_jet1jet2', 'deltaR_lep1lep2_jet1jet2met', 'mass_lep2jet1', 'pT_jet2']
>>> avoids_for("2j2b")
['deltaR_jet1_jet2']
tdub.data.branches_from(source, tree='WtLoop_nominal', ignore_weights=False)
Get a list of branches from a data source.
If the source is a list of files, the first file is the only file that is parsed.
- Parameters
source (str, list(str), os.PathLike, list(os.PathLike), or uproot File/Tree) – What to parse to get the branch information.
tree (str) – Name of the tree to get branches from.
ignore_weights (bool) – Flag to ignore all branches starting with weight_.
- Returns
Branches from the source.
- Return type
list(str)
- Raises
TypeError – If source can't be used to find a list of branches.
Examples
>>> from tdub.data import branches_from
>>> branches_from("/path/to/file.root", ignore_weights=True)
["pT_lep1", "pT_lep2"]
>>> branches_from("/path/to/file.root")
["pT_lep1", "pT_lep2", "weight_nominal", "weight_tptrw"]
tdub.data.categorize_branches(source)
Categorize branches into separated lists.
The categories:
- kinematics: for kinematic features (used for classifiers)
- weights: for any branch that starts or ends with weight
- meta: for meta information (final state information)
- Parameters
source (list(str)) – Complete list of branches to be categorized.
- Returns
Dictionary connecting categories to their associated list of branches.
- Return type
dict(str, list(str))
Examples
>>> from tdub.data import categorize_branches, branches_from
>>> branches = ["pT_lep1", "pT_lep2", "weight_nominal", "weight_sys_jvt", "reg2j2b"]
>>> cated = categorize_branches(branches)
>>> cated["weights"]
['weight_sys_jvt', 'weight_nominal']
>>> cated["meta"]
['reg2j2b']
>>> cated["kinematics"]
['pT_lep1', 'pT_lep2']
Using a ROOT file:
>>> root_file = PosixPath("/path/to/file.root")
>>> cated = categorize_branches(branches_from(root_file))
tdub.data.features_for(region)
Get the feature list for a region.
See the tdub.config module for the definitions of the feature lists (and how to modify them).
- Parameters
region (str or tdub.data.Region) – Region as a string or enum entry. Using "ALL" returns a list of unique features from all regions.
- Returns
Features for that region (or all regions).
- Return type
list(str)
Examples
>>> from pprint import pprint
>>> from tdub.data import features_for
>>> pprint(features_for("reg2j1b"))
['mass_lep1jet1',
 'mass_lep1jet2',
 'mass_lep2jet1',
 'mass_lep2jet2',
 'pT_jet2',
 'pTsys_lep1lep2jet1jet2met',
 'psuedoContTagBin_jet1',
 'psuedoContTagBin_jet2']
tdub.data.quick_files(datapath, campaign=None, tree='nominal')
Get a dictionary connecting sample processes to file lists.
The lists of files are sorted alphabetically. These types of samples are currently tested:
- tW_DR (410648, 410649 full sim)
- tW_DR_AFII (410648, 410649 fast sim)
- tW_DR_PS (411038, 411039 fast sim)
- tW_DR_inc (410646, 410647 full sim)
- tW_DR_inc_AFII (410646, 410647 fast sim)
- tW_DS (410656, 410657 full sim)
- tW_DS_inc (410654, 410655 full sim)
- ttbar (410472 full sim)
- ttbar_AFII (410472 fast sim)
- ttbar_PS (410558 fast sim)
- ttbar_PS713 (411234 fast sim)
- ttbar_hdamp (410482 fast sim)
- ttbar_inc (410470 full sim)
- ttbar_inc_AFII (410470 fast sim)
- Diboson
- Zjets
- MCNP
- Data
- Parameters
datapath (str or os.PathLike) – Path where all of the ROOT files live.
campaign (str, optional) – Enforce a single campaign ("MC16a", "MC16d", or "MC16e").
tree (str) – Upstream AnalysisTop ntuple tree.
- Returns
The dictionary of processes and their associated files.
- Return type
dict(str, list(str))
Examples
>>> from pprint import pprint
>>> from tdub.data import quick_files
>>> qf = quick_files("/path/to/some_files")  ## has 410472 ttbar samples
>>> pprint(qf["ttbar"])
['/path/to/some/files/ttbar_410472_FS_MC16a_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16d_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16e_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16d")
>>> pprint(qf["tW_DR"])
['/path/to/some/files/tW_DR_410648_FS_MC16d_nominal.root',
 '/path/to/some/files/tW_DR_410649_FS_MC16d_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16a")
>>> pprint(qf["Data"])
['/path/to/some/files/Data15_data15_Data_Data_nominal.root',
 '/path/to/some/files/Data16_data16_Data_Data_nominal.root']
tdub.data.selection_as_numexpr(selection)
Get the numexpr selection string from an arbitrary selection.
- Parameters
selection (str) – Selection string in ROOT or numexpr form.
- Returns
Selection in numexpr format.
- Return type
str
Examples
>>> selection = "reg1j1b == true && OS == true && mass_lep1jet1 < 155"
>>> from tdub.data import selection_as_numexpr
>>> selection_as_numexpr(selection)
'(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)'
tdub.data.selection_as_root(selection)
Get the ROOT selection string from an arbitrary selection.
- Parameters
selection (str) – The selection string in ROOT or numexpr form.
- Returns
The same selection in ROOT format.
- Return type
str
Examples
>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> from tdub.data import selection_as_root
>>> selection_as_root(selection)
'(reg1j1b == true) && (OS == true) && (mass_lep1jet1 < 155)'
tdub.data.selection_branches(selection)
Construct the minimal set of branches required for a selection.
- Parameters
selection (str) – Selection string in ROOT or numexpr form.
- Returns
Necessary branches/variables.
- Return type
set(str)
Examples
>>> from tdub.data import selection_branches
>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1lep2 > 100)"
>>> selection_branches(selection)
{'OS', 'mass_lep1lep2', 'reg1j1b'}
>>> selection = "reg2j1b == true && OS == true && (mass_lep1jet1 < 155)"
>>> selection_branches(selection)
{'OS', 'mass_lep1jet1', 'reg2j1b'}
tdub.data.selection_for(region, additional=None)
Get the selection for a given region.
We have three regions with a default selection (1j1b, 2j1b, and 2j2b); these are the possible argument options (in str or Enum form). See the tdub.config module for the definitions of the selections (and how to modify them).
- Parameters
region (str or tdub.data.Region) – Region to get the selection for.
additional (str, optional) – Additional selection to concatenate (via &) with the region's default selection.
- Returns
Selection string in numexpr format.
- Return type
str
Examples
>>> from tdub.data import Region, selection_for
>>> selection_for(Region.r2j1b)
'(reg2j1b == True) & (OS == True)'
>>> selection_for("reg1j1b")
'(reg1j1b == True) & (OS == True)'
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> selection_for("2j2b", additional="minimaxmbl < 155")
'((reg2j2b == True) & (OS == True)) & (minimaxmbl < 155)'
>>> selection_for("2j1b", additional="mass_lep1jetb < 155 && mass_lep2jetb < 155")
'((reg2j1b == True) & (OS == True)) & ((mass_lep1jetb < 155) & (mass_lep2jetb < 155))'
tdub.frames
A module for handling dataframes.
Factory Function Summary
- Build a selected dataframe via uproot's iterate.
- Construct a raw pandas flavored Dataframe with help from uproot.
Helper Function Summary
- Apply (multiply) a weight to all other weights in the DataFrame.
- Multiply nominal and systematic weights by the campaign weight.
- Multiply nominal and systematic weights by the top pt reweight term.
- Multiply nominal and systematic weights by the top recursive reweight term.
- Drop columns that we avoid in classifiers.
- Drop some columns from a dataframe.
- Drop all columns with jet2 properties.
- Get subsets of dataframes that satisfy a selection.
Reference
tdub.frames.iterative_selection(files, selection, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, keep_category=None, exclude_avoids=False, use_campaign_weight=False, use_tptrw=False, use_trrw=False, sample_frac=None, **kwargs)
Build a selected dataframe via uproot's iterate.
If we want to build a memory-hungry dataframe and apply a selection, this helps us avoid crashing due to using all of our RAM. Constructing a dataframe with this function is useful when we want to grab many branches in a large dataset that won't fit in memory before the selection.
The selection can be in either numexpr or ROOT form; we ensure that a ROOT style selection is converted to numexpr for use with pandas.eval().
- Parameters
files (list(str) or str) – A single ROOT file or list of ROOT files.
selection (str) – Selection string (numexpr or ROOT form accepted).
tree (str) – Tree name to turn into a dataframe.
weight_name (str) – Weight branch to preserve.
branches (list(str), optional) – List of branches to include as columns in the dataframe; default is None (all branches).
keep_category (str, optional) – If not None, the selected dataframe(s) will only include columns which are part of the given category (see tdub.data.categorize_branches()). The weight branch is always kept.
exclude_avoids (bool) – Exclude branches defined by tdub.config.AVOID_IN_CLF.
use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight.
use_tptrw (bool) – Apply the top pt reweighting factor.
use_trrw (bool) – Apply the top recursive reweighting factor.
sample_frac (float, optional) – Sample a fraction of the available data.
- Returns
The final selected dataframe(s) from the files.
- Return type
pandas.DataFrame
Examples
Creating a ttbar_df dataframe and a single tW_df dataframe:
>>> from tdub.frames import iterative_selection
>>> from tdub.data import quick_files
>>> from tdub.data import selection_for
>>> qf = quick_files("/path/to/files")
>>> ttbar_dfs = iterative_selection(qf["ttbar"], selection_for("2j2b"),
...                                 entrysteps="1 GB")
>>> tW_df = iterative_selection(qf["tW_DR"], selection_for("2j2b"))
Keep only kinematic branches after selection and ignore avoided columns:
>>> tW_df = iterative_selection(qf["tW_DR"],
...                             selection_for("2j2b"),
...                             exclude_avoids=True,
...                             keep_category="kinematics")
tdub.frames.raw_dataframe(files, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, drop_weight_sys=False, **kwargs)
Construct a raw pandas flavored Dataframe with help from uproot.
We call this dataframe "raw" because it hasn't been parsed by any other tdub.frames functionality (no selection performed, kinematic and weight branches won't be separated, etc.) – just a pure raw dataframe from some ROOT files.
Extra kwargs are fed to uproot's arrays() interface.
- Parameters
files (list(str) or str) – Single ROOT file or list of ROOT files.
tree (str) – The tree name to turn into a dataframe.
weight_name (str) – Weight branch (we make sure to grab it if you give something other than None to branches).
branches (list(str), optional) – List of branches to include as columns in the dataframe; default is None (all branches).
drop_weight_sys (bool) – Drop all weight systematics from being grabbed.
- Returns
The pandas flavored DataFrame with all requested branches.
- Return type
pandas.DataFrame
Examples
>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe
>>> files = quick_files("/path/to/files")["ttbar"]
>>> df = raw_dataframe(files)
tdub.frames.apply_weight(df, weight_name, exclude=None)
Apply (multiply) a weight to all other weights in the DataFrame.
This will multiply the nominal weight and all systematic weights in the DataFrame by the weight_name column. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
weight_name (str) – Column name to multiply all other weight columns by.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.apply_weight("weight_campaign")
tdub.frames.apply_weight_campaign(df, exclude=None)
Multiply nominal and systematic weights by the campaign weight.
This is useful for samples that were produced without the campaign weight term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.003
>>> df.weight_campaign[5]
0.4
>>> df.apply_weight_campaign()
>>> df.weight_nominal[5]
0.0012
tdub.frames.apply_weight_tptrw(df, exclude=None)
Multiply nominal and systematic weights by the top pt reweight term.
This is useful for samples that were produced without the top pt reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_tptrw_tool[5]
0.98
>>> df.apply_weight_tptrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.apply_weight_trrw(df, exclude=None)
Multiply nominal and systematic weights by the top recursive reweight term.
This is useful for samples that were produced without the top recursive reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_trrw_tool[5]
0.98
>>> df.apply_weight_trrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.drop_avoid(df, region=None)
Drop columns that we avoid in classifiers.
Uses tdub.frames.drop_cols() with a predefined set of columns (tdub.config.AVOID_IN_CLF). We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe that you want to slim.
region (optional, str or tdub.data.Region) – Region to augment the list of dropped columns (see the region specific AVOID constants in the config module).
Examples
>>> from tdub.frames import drop_avoid
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jetL1" in df.columns
True
>>> drop_avoid(df)
>>> "E_jetL1" in df.columns
False
tdub.frames.drop_cols(df, *cols)
Drop some columns from a dataframe.
This is a convenient function because it just ignores entries in cols that don't exist as columns in the dataframe. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe which we want to slim.
*cols (sequence of strings) – Columns to remove.
Examples
>>> import pandas as pd
>>> from tdub.frames import drop_cols
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jet1" in df.columns
True
>>> "mass_jet1" in df.columns
True
>>> "mass_jet2" in df.columns
True
>>> drop_cols(df, "E_jet1", "mass_jet1")
>>> "E_jet1" in df.columns
False
>>> "mass_jet1" in df.columns
False
>>> df.drop_cols("mass_jet2")  # use augmented df class
>>> "mass_jet2" in df.columns
False
tdub.frames.drop_jet2(df)
Drop all columns with jet2 properties.
In the 1j1b region we obviously don't have a second jet; so this lets us get rid of all columns dependent on jet2 kinematic properties. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe that we want to slim.
Examples
>>> from tdub.frames import drop_jet2
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
True
>>> drop_jet2(df)
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
False
tdub.frames.satisfying_selection(*dfs, selection)
Get subsets of dataframes that satisfy a selection.
The selection string can be in either ROOT or numexpr form (we ensure to convert ROOT to numexpr).
- Parameters
*dfs (sequence of pandas.DataFrame) – Dataframes to apply the selection to.
selection (str) – Selection string (in numexpr or ROOT form).
- Returns
Dataframes satisfying the selection string.
- Return type
list(pandas.DataFrame)
Examples
>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe, satisfying_selection
>>> qf = quick_files("/path/to/files")
>>> df_tW_DR = raw_dataframe(qf["tW_DR"])
>>> df_ttbar = raw_dataframe(qf["ttbar"])
>>> low_bdt = "(bdt_response < 0.4)"
>>> tW_DR_selected, ttbar_selected = satisfying_selection(
...     df_tW_DR, df_ttbar, selection=low_bdt
... )
tdub.features
A module for performing feature selection.
Class Summary
- A class to steer the steps of feature selection.
Function Summary
- Create slimmed and selected parquet files from ROOT files.
- Prepare feature selection data from parquet files.
Reference
class tdub.features.FeatureSelector(df, labels, weights, importance_type='gain', corr_threshold=0.85, name=None)
A class to steer the steps of feature selection.
- Parameters
df (pandas.DataFrame) – The dataframe which contains signal and background events; it should also only contain features we wish to test for (it is expected to be "clean" of non-kinematic information, like metadata and weights).
labels (numpy.ndarray) – array of labels compatible with the dataframe (1 for \(tW\) and 0 for \(t\bar{t}\)).
weights (numpy.ndarray) – the weights array compatible with the dataframe
importance_type (str) – the importance type ("gain" or "split")
corr_threshold (float) – the threshold for excluding features based on correlations
name (str, optional) – give the selector a name
Attributes
- data: the raw dataframe as fed to the class instance
- weights: the raw weights array compatible with the dataframe
- labels: the raw labels array compatible with the dataframe (we expect 1 for signal, \(tW\), and 0 for background, \(t\bar{t}\))
- corr_matrix: the raw correlation matrix for the features (requires calling the check_collinearity function)
- a dataframe matching features that satisfy the correlation threshold
- importances: the importances as determined by a vanilla GBDT (requires calling the check_importances function)
- candidates: list of candidate features (sorted by importance) as determined by calling the check_candidates function
- iterative_remove_aucs: a dictionary of the form {feature: auc} providing the AUC value for a BDT trained _without_ the feature given in the key. The keys are built from the candidates list.
- iterative_add_aucs: an array of AUC values built by iteratively adding the next best feature in the candidates list (the first entry is calculated using only the top feature, the second entry uses the top 2 features, and so on).
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
check_candidates(n=20)
Get the top uncorrelated features.
This will parse the correlations and most important features and build a list of ordered important features. When a feature that should be dropped due to a collinear feature is found, we ensure that the more important member of the pair is included in the resulting list and drop the other member of the pair. This will populate the candidates attribute for the class.
- Parameters
n (int) – the total number of features to retrieve
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
check_collinearity(threshold=None)
Calculate the correlations of the features.
Given a correlation threshold this will construct a list of features that should be dropped based on the correlation values. This also adds a new property to the instance.
If the threshold argument is not None then the class instance's corr_threshold property is updated.
- Parameters
threshold (float, optional) – Override the existing correlations threshold.
Examples
Overriding the exclusion threshold:
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.corr_threshold
0.90
>>> fs.check_collinearity(threshold=0.85)
>>> fs.corr_threshold
0.85
check_for_uniques(and_drop=True)
Check the dataframe for features that have a single unique value.
- Parameters
and_drop (bool) – If True, drop any unique columns.
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
check_importances(extra_clf_opts=None, extra_fit_opts=None, n_fits=5, test_size=0.5)
Train vanilla GBDT to calculate feature importance.
Some default options are used for the lightgbm.LGBMClassifier instance and fit (see implementation); you can provide extras via some function arguments.
- Parameters
extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().
n_fits (int) – number of models to fit to determine importances
test_size (float) – forwarded to sklearn.model_selection.train_test_split()
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
check_iterative_add_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)
Calculate AUCs iteratively adding the next best feature.
After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs, iteratively including one more feature at a time starting with the most important.
- Parameters
max_features (int) – the maximum number of features to allow to be checked; default will be the length of the candidates list.
extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
check_iterative_remove_aucs(max_features=None, extra_clf_opts=None, extra_fit_opts=None)
Calculate the AUCs iteratively removing one feature at a time.
After calling the check_candidates function we have a good set of candidate features; this function will train vanilla BDTs, each time removing one of the candidate features. We rank the features based on how impactful their removal is.
- Parameters
max_features (int) – the maximum number of features to allow to be checked; default will be the length of the candidates list.
extra_clf_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.
extra_fit_opts (dict) – extra arguments forwarded to lightgbm.LGBMClassifier.fit().
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_remove_aucs(max_features=20)
save_result()
Save the results to a directory.
- Parameters
output_dir (str or os.PathLike) – the directory to save relevant results to
Examples
>>> from tdub.features import FeatureSelector, prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
>>> fs = FeatureSelector(df=df, labels=labels, weights=weights, corr_threshold=0.90)
>>> fs.check_for_uniques(and_drop=True)
>>> fs.check_collinearity()
>>> fs.check_importances(extra_fit_opts=dict(verbose=40, early_stopping_round=15))
>>> fs.check_candidates(n=25)
>>> fs.check_iterative_add_aucs(max_features=20)
>>> fs.name = "2j1b_DR"
>>> fs.save_result()
tdub.features.create_parquet_files(qf_dir, out_dir=None, entrysteps=None, use_campaign_weight=False)
Create slimmed and selected parquet files from ROOT files.
This function requires pyarrow.
- Parameters
qf_dir (str or os.PathLike) – directory to run tdub.data.quick_files() on
out_dir (str or os.PathLike, optional) – directory to save output files
entrysteps (any, optional) – entrysteps option forwarded to tdub.frames.iterative_selection()
use_campaign_weight (bool) – multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight.
Examples
>>> from tdub.features import create_parquet_files
>>> create_parquet_files("/path/to/root/files", "/path/to/pq/output", entrysteps="250 MB")
tdub.features.prepare_from_parquet(data_dir, region, nlo_method='DR', ttbar_frac=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, test_case_size=None)
Prepare feature selection data from parquet files.
This function requires pyarrow.
- Parameters
data_dir (str or os.PathLike) – directory where the parquet files live
region (str or tdub.data.Region) – the region where we're going to select features
nlo_method (str) – the \(tW\) sample (DR or DS)
ttbar_frac (str or float, optional) – if not None, this is the fraction of \(t\bar{t}\) events to use; "auto" uses some sensible defaults to fit in memory: 0.70 for 2j2b and 0.60 for 2j1b.
weight_mean (float, optional) – scale all weights such that the mean weight is this value. Cannot be used with weight_scale.
weight_scale (float, optional) – value to scale all weights by; cannot be used with weight_mean.
scale_sum_weights (bool) – scale sum of weights of signal to be sum of weights of background
test_case_size (int, optional) – if we want to perform a quick test, we use a subset of the data; for test_case_size=N we use N events from both signal and background. Cannot be used with ttbar_frac.
- Returns
pandas.DataFrame – the dataframe which contains kinematic features
numpy.ndarray – the labels array for the events
numpy.ndarray – the weights array for the events
Examples
>>> from tdub.features import prepare_from_parquet
>>> df, labels, weights = prepare_from_parquet("/path/to/pq/output", "2j1b", "DR")
tdub.hist
A module for histogramming.
Class Summary
- Systematic template histogram comparison.
Function Summary
- Get bin centers given bin edges.
- Create arrays for edges and bin centers.
- Convert a set of variable width bins to arbitrary uniform bins.
Reference
class tdub.hist.SystematicComparison(nominal, up, down)
Systematic template histogram comparison.
Attributes
- nominal (numpy.ndarray): Nominal histogram bin counts.
- up (numpy.ndarray): Up variation histogram bin counts.
- down (numpy.ndarray): Down variation histogram bin counts.
- percent_diff_up (numpy.ndarray): Percent difference between nominal and up variation.
- percent_diff_down (numpy.ndarray): Percent difference between nominal and down variation.
static one_sided(nominal, up)
Generate components of a systematic comparison plot.
- Parameters
nominal (numpy.ndarray) – Histogram bin counts for the nominal template.
up (numpy.ndarray) – Histogram bin counts for the "up" variation.
- Returns
The complete description of the comparison.
- Return type
SystematicComparison
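A minimal usage sketch (the bin counts are illustrative placeholders):
>>> import numpy as np
>>> from tdub.hist import SystematicComparison
>>> nominal = np.array([10.0, 20.0, 30.0])
>>> up = np.array([11.0, 19.0, 33.0])
>>> comp = SystematicComparison.one_sided(nominal, up)
>>> pdu = comp.percent_diff_up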
tdub.hist.bin_centers(bin_edges)
Get bin centers given bin edges.
- Parameters
bin_edges (numpy.ndarray) – edges defining binning
- Returns
the centers associated with the edges
- Return type
numpy.ndarray
Examples
>>> import numpy as np
>>> from tdub.hist import bin_centers
>>> bin_edges = np.linspace(25, 225, 11)
>>> centers = bin_centers(bin_edges)
>>> bin_edges
array([ 25.,  45.,  65.,  85., 105., 125., 145., 165., 185., 205., 225.])
>>> centers
array([ 35.,  55.,  75.,  95., 115., 135., 155., 175., 195., 215.])
tdub.hist.edges_and_centers(bins, range=None)
Create arrays for edges and bin centers.
- Parameters
bins (int or array_like) – number of bins or pre-existing bin edges
range (tuple(float, float), optional) – range of the binning (used when bins is an integer)
- Returns
numpy.ndarray – the bin edges
numpy.ndarray – the bin centers
Examples
From bin multiplicity and a range:
>>> from tdub.hist import edges_and_centers
>>> edges, centers = edges_and_centers(bins=20, range=(25, 225))
From pre-existing edges:
>>> edges, centers = edges_and_centers(np.linspace(0, 10, 21))
tdub.hist.to_uniform_bins(bin_edges)
Convert a set of variable width bins to arbitrary uniform bins.
This will create a set of bin edges such that the bin centers are at whole numbers, i.e. 5 variable width bins will return an array from 0.5 to 5.5: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5].
- Parameters
bin_edges (numpy.ndarray) – Array of bin edges.
- Returns
The new set of uniform bins.
- Return type
numpy.ndarray
Examples
>>> import numpy as np
>>> from tdub.hist import to_uniform_bins
>>> var_width = [0, 1, 3, 7, 15]
>>> to_uniform_bins(var_width)
array([0.5, 1.5, 2.5, 3.5, 4.5])
tdub.math
A module with math utilities.
Function Summary
- Calculate \(\chi^2\) probability from the value and NDF.
- Perform \(\chi^2\) test on two histograms.
- Calculate the Kolmogorov distribution function.
- Calculate KS statistic and p-value for two binned distributions.
Reference
tdub.math.chisquared_cdf_c(chi2, ndf)
Calculate \(\chi^2\) probability from the value and NDF.
See ROOT's TMath::Prob and ROOT::Math::chisquared_cdf_c. Quoting the ROOT documentation:
Computation of the probability for a certain \(\chi^2\) and number of degrees of freedom (ndf). Calculations are based on the incomplete gamma function \(P(a,x)\), where \(a=\mathrm{ndf}/2\) and \(x=\chi^2/2\).
\(P(a,x)\) represents the probability that the observed \(\chi^2\) for a correct model should be less than the value \(\chi^2\). The returned probability corresponds to \(1-P(a,x)\), which denotes the probability that an observed \(\chi^2\) exceeds the value \(\chi^2\) by chance, even for a correct model.
tdub.math.chisquared_test(h1, err1, h2, err2)
Perform \(\chi^2\) test on two histograms.
- Parameters
h1 (numpy.ndarray) – the first histogram bin contents
err1 (numpy.ndarray) – the first histogram bin errors
h2 (numpy.ndarray) – the second histogram bin contents
err2 (numpy.ndarray) – the second histogram bin errors
- Returns
the \(\chi^2\) test value, the degrees of freedom, and the probability
- Return type
tuple(float, int, float)
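A usage sketch (the histogram contents and errors are illustrative placeholders; the three returned values follow the description above):
>>> import numpy as np
>>> from tdub.math import chisquared_test
>>> h1, err1 = np.array([12.0, 30.0, 9.0]), np.array([3.4, 5.5, 3.0])
>>> h2, err2 = np.array([14.0, 28.0, 11.0]), np.array([3.7, 5.3, 3.3])
>>> chi2, ndf, prob = chisquared_test(h1, err1, h2, err2)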
tdub.math.kolmogorov_prob(z)
Calculate the Kolmogorov distribution function.
See ROOT's implementation in TMath (TMath::KolmogorovProb).
- Parameters
z (float) – the value to test
- Returns
the probability that the test statistic exceeds \(z\) (assuming the null hypothesis)
- Return type
float
Examples
>>> from tdub.math import kolmogorov_prob
>>> kolmogorov_prob(1.13)
0.15549781841748692
tdub.math.ks_twosample_binned(hist1, hist2, err1, err2)
Calculate KS statistic and p-value for two binned distributions.
See ROOT's implementation in TH1 (TH1::KolmogorovTest).
- Parameters
hist1 (numpy.ndarray) – the histogram counts for the first distribution
hist2 (numpy.ndarray) – the histogram counts for the second distribution
err1 (numpy.ndarray) – the error on the histogram counts for the first distribution
err2 (numpy.ndarray) – the error on the histogram counts for the second distribution
- Returns
first: the test statistic; second: the probability of the test (much less than 1 means the distributions are incompatible)
- Return type
tuple(float, float)
Examples
>>> import pygram11
>>> from tdub.math import ks_twosample_binned
>>> data1, data2, w1, w2 = some_function_to_get_data()
>>> h1, err1 = pygram11.histogram(data1, weights=w1, bins=40, range=(-3, 3))
>>> h2, err2 = pygram11.histogram(data2, weights=w2, bins=40, range=(-3, 3))
>>> kst, ksp = ks_twosample_binned(h1, h2, err1, err2)
tdub.ml_apply
A module for applying trained models.
Class Summary
- Base class for describing a completed training to apply to other data.
- Provides access to the properties of a folded training.
- Provides access to the properties of a single result.
Function Summary
- Get a NumPy array which is the response for all events in df.
Reference
class tdub.ml_apply.BaseTrainSummary
Base class for describing a completed training to apply to other data.
apply_to_dataframe(df, column_name, do_query)
Apply trained model(s) to events in a dataframe df.
All BaseTrainSummary classes must implement this function.
property features
Features used by the model.
parse_summary_json(summary_file)
Parse a training's summary json file.
This populates the class properties with values and the resulting dictionary is saved to be accessible via the summary property. The common class properties (which all BaseTrainSummarys have by definition) besides summary are features, region, and selection_used. This function will define those, so all classes inheriting from BaseTrainSummary should call the super implementation of this method if a daughter implementation is necessary to add additional summary properties.
- Parameters
summary_file (os.PathLike) – The summary json file.
property region
Region where the training was executed.
property selection_used
Numexpr selection used on the trained datasets.
property summary
Training summary dictionary from the training json.
class tdub.ml_apply.FoldedTrainSummary(fold_output)
Bases: tdub.ml_apply.BaseTrainSummary
Provides access to the properties of a folded training.
- Parameters
fold_output (str) – Directory with the folded training output.
Examples
>>> from tdub.ml_apply import FoldedTrainSummary
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
apply_to_dataframe(df, column_name='unnamed_response', do_query=False)
Apply trained models to an arbitrary dataframe.
This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn't already exist. If the dataframe is empty this function does nothing.
- Parameters
df (pandas.DataFrame) – Dataframe to read and augment.
column_name (str) – Name to give the BDT response variable.
do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with the training result; necessary if the dataframe hasn't been pre-filtered.
Examples
>>> from tdub.ml_apply import FoldedTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_1j1b.apply_to_dataframe(df, do_query=True)
property folder
Folding object used during training.
property model0
Model for the 0th fold.
property model1
Model for the 1st fold.
property model2
Model for the 2nd fold.
parse_summary_json(summary_file)
Parse a training's summary json file.
- Parameters
summary_file (str or os.PathLike) – the summary json file
-
class
tdub.ml_apply.
SingleTrainSummary
(training_output)[source]¶ Bases:
tdub.ml_apply.BaseTrainSummary
Provides access to the properties of a single result.
- Parameters
training_output (str) – Directory containing the training result.
Examples
>>> from tdub.apply import SingleTrainSummary >>> res_1j1b = SingleTrainSummary("/path/to/some_1j1b_training_outdir")
-
apply_to_dataframe
(df, column_name='unnamed_response', do_query=False)[source]¶ Apply trained model to an arbitrary dataframe.
This function will augment the dataframe with a new column (with a name given by the column_name argument) if it doesn’t already exist. If the dataframe is empty this function does nothing.
- Parameters
df (pandas.DataFrame) – Dataframe to read and augment.
column_name (str) – Name to give the BDT response variable.
do_query (bool) – Perform a query on the dataframe to select events belonging to the region associated with training result; necessary if the dataframe hasn’t been pre-filtered.
Examples
>>> from tdub.ml_apply import SingleTrainSummary
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_1j1b.apply_to_dataframe(df, do_query=True)
-
property
model
¶ Trained model.
-
tdub.ml_apply.
build_array
(summaries, df)[source]¶ Get a NumPy array which is the response for all events in df.
This will use the apply_to_dataframe() function from the list of summaries. We query the input dataframe to ensure that we apply to the correct events. If the input dataframe is empty then an empty array is written to disk.
- Parameters
summaries (list(BaseTrainSummary)) – Sequence of training summaries to use.
df (pandas.DataFrame) – Dataframe of events to use to calculate the response.
Examples
Using folded summaries:
>>> from tdub.ml_apply import FoldedTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> fr_1j1b = FoldedTrainSummary("/path/to/folded_training_1j1b")
>>> fr_2j1b = FoldedTrainSummary("/path/to/folded_training_2j1b")
>>> fr_2j2b = FoldedTrainSummary("/path/to/folded_training_2j2b")
>>> res = build_array([fr_1j1b, fr_2j1b, fr_2j2b], df)
Using single summaries:
>>> from tdub.ml_apply import SingleTrainSummary, build_array
>>> from tdub.frames import raw_dataframe
>>> df = raw_dataframe("/path/to/file.root")
>>> sr_1j1b = SingleTrainSummary("/path/to/single_training_1j1b")
>>> sr_2j1b = SingleTrainSummary("/path/to/single_training_2j1b")
>>> sr_2j2b = SingleTrainSummary("/path/to/single_training_2j2b")
>>> res = build_array([sr_1j1b, sr_2j1b, sr_2j2b], df)
tdub.ml_train¶
A module for handling training.
Class Summary¶
ResponseHistograms – Create and use histogrammed model response information.
SingleTrainingSummary – Describes some properties of a single training.
Function Summary¶
persist_prepared_data – Persist prepared data to disk.
prepare_from_root – Prepare the data to train in a region with signal and background ROOT files.
folded_training – Execute a folded training.
lgbm_gen_classifier – Create a classifier using LightGBM.
lgbm_train_classifier – Train a LGBMClassifier.
single_training – Execute a single training with some parameters.
sklearn_gen_classifier – Create a classifier using scikit-learn.
sklearn_train_classifier – Train a Scikit-learn classifier.
tdub_train_axes – Construct a dictionary of the default tdub training tune.
Reference¶
-
class
tdub.ml_train.
ResponseHistograms
(response_type, model, X_train, X_test, y_train, y_test, w_train, w_test, nbins=30)[source]¶ Create and use histogrammed model response information.
- Parameters
response_type (str) – Models provide different types of response, like a raw prediction or a probability of signal. This class supports:
"predict" (for LGBM),
"decision_function" (for Scikit-learn),
"proba" (for either).
model (BaseEstimator) – The trained model.
X_train (array_like) – Training data feature matrix.
X_test (array_like) – Testing data feature matrix.
y_train (array_like) – Training data labels.
y_test (array_like) – Testing data labels.
w_train (array_like) – Training data event weights.
w_test (array_like) – Testing data event weights.
nbins (int) – Number of bins to use.
-
draw
(ax=None, xlabel=None)[source]¶ Draw the response histograms.
- Parameters
ax (matplotlib.axes.Axes, optional) – Predefined matplotlib axes to use.
xlabel (str, optional) – Override the automated xlabel definition.
- Returns
matplotlib.figure.Figure – The matplotlib figure object.
matplotlib.axes.Axes – The matplotlib axes object.
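Examples
A usage sketch, assuming a trained model and prepared train/test arrays (model, X_train, X_test, y_train, y_test, w_train, w_test) already exist, e.g. from the helpers in this module:
>>> from tdub.ml_train import ResponseHistograms
>>> rh = ResponseHistograms(
...     "predict", model, X_train, X_test, y_train, y_test, w_train, w_test
... )
>>> fig, ax = rh.draw(xlabel="BDT response")
>>> fig.savefig("response.pdf")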
-
class
tdub.ml_train.
SingleTrainingSummary
(*, auc=-1.0, ks_test_sig=-1.0, ks_pvalue_sig=-1.0, ks_test_bkg=-1.0, ks_pvalue_bkg=-1.0, **kwargs)[source]¶ Describes some properties of a single training.
- Parameters
auc (float) – The AUC value for the model.
ks_test_sig (float) – The binned KS test value for signal.
ks_pvalue_sig (float) – The binned KS test p-value for signal.
ks_test_bkg (float) – The binned KS test value for background.
ks_pvalue_bkg (float) – The binned KS test p-value for background.
kwargs (dict) – Currently unused.
-
tdub.ml_train.
persist_prepared_data
(out_dir, df, labels, weights)[source]¶ Persist prepared data to disk.
The product of tdub.ml_train.prepare_from_root() is easily persistable to disk; this function performs that task. If the same prepared data is going to be used for multiple training executions, one can save CPU cycles by saving the prepared data instead of starting higher upstream with our ROOT ntuples.
- Parameters
out_dir (str or os.PathLike) – Directory to save output to.
df (pandas.DataFrame) – Prepared DataFrame object.
labels (numpy.ndarray) – Prepared labels.
weights (numpy.ndarray) – Prepared weights.
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, persist_prepared_data
>>> qfiles = quick_files("/path/to/data")
>>> df, y, w = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> persist_prepared_data("/path/to/output/data", df, y, w)
-
tdub.ml_train.
prepare_from_root
(sig_files, bkg_files, region, branches=None, override_selection=None, weight_mean=None, weight_scale=None, scale_sum_weights=True, use_campaign_weight=False, use_tptrw=False, use_trrw=False, test_case_size=None, bkg_sample_frac=None)[source]¶ Prepare the data to train in a region with signal and background ROOT files.
- Parameters
sig_files (list(str)) – List of signal ROOT files.
bkg_files (list(str)) – List of background ROOT files.
region (Region or str) – Region where we’re going to perform the training.
branches (list(str), optional) – Override the list of features (usually defined by the region).
override_selection (str, optional) – Manual selection string to apply to the dataset (this will override the region defined selection).
weight_mean (float, optional) – Scale all weights such that the mean weight is this value. Cannot be used with weight_scale.
weight_scale (float, optional) – Value to scale all weights by, cannot be used with weight_mean.
scale_sum_weights (bool) – Scale sum of weights of signal to be sum of weights of background.
use_campaign_weight (bool) – See the parameter description for tdub.frames.iterative_selection().
use_tptrw (bool) – Apply the top pt reweighting factor.
use_trrw (bool) – Apply the top recursive reweighting factor.
test_case_size (int, optional) – Prepare a small test case dataset using this many training and testing samples.
bkg_sample_frac (float, optional) – Sample a fraction of the background data.
- Returns
pandas.DataFrame – Event feature matrix.
numpy.ndarray – Event labels (0 for background; 1 for signal).
numpy.ndarray – Event weights.
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
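A further sketch exercising a couple of the optional arguments; the values shown here are purely illustrative:
>>> df, labels, weights = prepare_from_root(
...     qfiles["tW_DR"],
...     qfiles["ttbar"],
...     "2j2b",
...     weight_mean=1.0,
...     bkg_sample_frac=0.5,
... )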
-
tdub.ml_train.
folded_training
(df, labels, weights, params, fit_kw, output_dir, region, kfold_kw=None)[source]¶ Execute a folded training.
Train a lightgbm.LGBMClassifier model with \(k\)-fold cross validation using the given input data and parameters. The models resulting from the training (and other important training information) are saved to output_dir. The entries in the kfold_kw argument are forwarded to the sklearn.model_selection.KFold class for data preprocessing. The default arguments that we use are (random_state is controlled by the tdub.config module):
n_splits: 3
shuffle: True
- Parameters
df (pandas.DataFrame) – Feature matrix in dataframe format.
labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).
weights (numpy.ndarray) – Event weights.
params (dict(str, Any)) – Dictionary of lightgbm.LGBMClassifier parameters.
fit_kw (dict(str, Any)) – Dictionary of arguments forwarded to lightgbm.LGBMClassifier.fit().
output_dir (str or os.PathLike) – Directory to save results of training.
region (str) – String representing the region.
kfold_kw (dict(str, Any), optional) – Arguments passed to sklearn.model_selection.KFold.
- Returns
Negative mean area under the ROC curve (AUC).
- Return type
float
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root
>>> from tdub.ml_train import folded_training
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> params = dict(
...     boosting_type="gbdt",
...     num_leaves=42,
...     learning_rate=0.05,
...     reg_alpha=0.2,
...     reg_lambda=0.8,
...     max_depth=5,
... )
>>> folded_training(
...     df,
...     labels,
...     weights,
...     params,
...     {"verbose": 20},
...     "/path/to/train/output",
...     "2j2b",
...     kfold_kw={"n_splits": 5, "shuffle": True},
... )
-
tdub.ml_train.
lgbm_gen_classifier
(train_axes=None, **clf_params)[source]¶ Create a classifier using LightGBM.
- Parameters
train_axes (dict(str, Any)) – Values of required tdub training parameters.
clf_params (kwargs) – Extra arguments passed to the classifier constructor.
- Returns
The classifier.
- Return type
lightgbm.LGBMClassifier
-
tdub.ml_train.
lgbm_train_classifier
(clf, X_train, y_train, w_train, validation_fraction=0.2, early_stopping_rounds=10, **fit_params)[source]¶ Train a LGBMClassifier.
- Parameters
clf (lightgbm.LGBMClassifier) – The classifier.
X_train (array_like) – Training events matrix.
y_train (array_like) – Training event labels.
w_train (array_like) – Training event weights.
validation_fraction (float) – Fraction of training events to use in validation set.
early_stopping_rounds (int) – Number of early stopping rounds to use in training.
fit_params (keyword arguments) – Extra keyword arguments passed to the classifier.
- Returns
The same classifier object passed to the function.
- Return type
lightgbm.LGBMClassifier
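Examples
A sketch combining the two LightGBM helpers; the prepared df, labels, and weights are assumed to come from prepare_from_root(), and the split shown uses scikit-learn for illustration:
>>> from sklearn.model_selection import train_test_split
>>> from tdub.ml_train import lgbm_gen_classifier, lgbm_train_classifier, tdub_train_axes
>>> X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
...     df, labels, weights, test_size=0.4, random_state=414
... )
>>> clf = lgbm_gen_classifier(train_axes=tdub_train_axes())
>>> clf = lgbm_train_classifier(clf, X_train, y_train, w_train)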
-
tdub.ml_train.
single_training
(df, labels, weights, train_axes, output_dir, test_size=0.4, early_stopping_rounds=None, extra_summary_entries=None, use_sklearn=False, use_xgboost=False, save_lgbm_txt=False)[source]¶ Execute a single training with some parameters.
The model and some useful information (mostly plots) are saved to output_dir.
- Parameters
df (pandas.DataFrame) – Feature matrix in dataframe format.
labels (numpy.ndarray) – Event labels (1 for signal; 0 for background).
weights (numpy.ndarray) – Event weights.
train_axes (dict(str, Any)) – Dictionary of parameters defining the tdub train axes.
output_dir (str or os.PathLike) – Directory to save results of training.
test_size (float) – Test size for splitting into training and testing sets.
early_stopping_rounds (int, optional) – Number of rounds to have no improvement for stopping training.
extra_summary_entries (dict, optional) – Extra entries to save in the JSON output summary.
use_sklearn (bool) – Use Scikit-learn’s HistGradientBoostingClassifier.
use_xgboost (bool) – Use XGBoost’s XGBClassifier.
save_lgbm_txt (bool) – Save fitted LGBM model to text file (ignored if either use_sklearn or use_xgboost is True).
- Returns
Useful information about the training.
- Return type
tdub.ml_train.SingleTrainingSummary
Examples
>>> from tdub.data import quick_files
>>> from tdub.ml_train import prepare_from_root, single_training, tdub_train_axes
>>> qfiles = quick_files("/path/to/data")
>>> df, labels, weights = prepare_from_root(qfiles["tW_DR"], qfiles["ttbar"], "2j2b")
>>> train_axes = tdub_train_axes(
...     learning_rate=0.05,
...     max_depth=5,
... )
>>> single_training(
...     df,
...     labels,
...     weights,
...     train_axes,
...     "training_output",
... )
-
tdub.ml_train.
sklearn_gen_classifier
(early_stopping_rounds=10, validation_fraction=0.2, train_axes=None, **clf_params)[source]¶ Create a classifier using scikit-learn.
This uses Scikit-learn’s sklearn.ensemble.HistGradientBoostingClassifier. Early stopping rounds are defined through the constructor, and extra keyword arguments are passed to the classifier initialization.
- Parameters
early_stopping_rounds (int) – Passed as the n_iter_no_change argument to scikit-learn’s HistGradientBoostingClassifier.
validation_fraction (float) – Passed to the validation_fraction argument in scikit-learn’s HistGradientBoostingClassifier.
train_axes (dict[str, Any]) – Values of required tdub training parameters.
clf_params (kwargs) – Extra arguments passed to the constructor.
- Returns
The classifier.
- Return type
sklearn.ensemble.HistGradientBoostingClassifier
-
tdub.ml_train.
sklearn_train_classifier
(clf, X_train, y_train, w_train, **fit_params)[source]¶ Train a Scikit-learn classifier.
- Parameters
clf (sklearn.ensemble.HistGradientBoostingClassifier) – The classifier.
X_train (array_like) – Training events matrix.
y_train (array_like) – Training event labels.
w_train (array_like) – Training event weights.
fit_params (kwargs) – Extra keyword arguments passed to the classifier.
- Returns
The same classifier object passed to the function.
- Return type
sklearn.ensemble.HistGradientBoostingClassifier
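Examples
An analogous sketch for the scikit-learn helpers; the prepared and split arrays are assumed, as in the LightGBM example above:
>>> from tdub.ml_train import sklearn_gen_classifier, sklearn_train_classifier, tdub_train_axes
>>> clf = sklearn_gen_classifier(early_stopping_rounds=15, train_axes=tdub_train_axes())
>>> clf = sklearn_train_classifier(clf, X_train, y_train, w_train)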
-
tdub.ml_train.
tdub_train_axes
(learning_rate=0.1, max_depth=5, min_child_samples=50, num_leaves=31, reg_lambda=0.0, **kwargs)[source]¶ Construct a dictionary of default tdub training tune.
Extra keyword arguments are swallowed but never used.
- Parameters
learning_rate (float) – The LightGBM learning_rate value.
max_depth (int) – The LightGBM max_depth value.
min_child_samples (int) – The LightGBM min_child_samples value.
num_leaves (int) – The LightGBM num_leaves value.
reg_lambda (float) – The LightGBM reg_lambda value.
- Returns
The argument names and values.
- Return type
dict(str, Any)
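Examples
For illustration, overriding a single entry of the default tune (assuming the returned dictionary holds exactly the five parameters in the signature):
>>> from tdub.ml_train import tdub_train_axes
>>> axes = tdub_train_axes(learning_rate=0.05)
>>> axes["learning_rate"]
0.05
>>> sorted(axes.keys())
['learning_rate', 'max_depth', 'min_child_samples', 'num_leaves', 'reg_lambda']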
tdub.rex¶
A module for parsing TRExFitter results and producing additional plots/tables.
Class Summary¶
FitParam – Fit parameter description as a dataclass.
GroupedImpact – Fit grouped impact summary.
Function Summary¶
available_regions – Get a list of available regions from a TRExFitter result directory.
chisq – Get \(\chi^2\) information from a TRExFitter region.
chisq_text – Generate nicely formatted text for \(\chi^2\) information.
compare_nuispar – Compare nuisance parameter info between two fits.
compare_uncertainty – Compare uncertainty between two fits.
comparison_summary – Summarize a comparison of two fits.
data_histogram – Get the histogram for the Data in a region from a TRExFitter result.
delta_param – Calculate difference between two fit parameters.
delta_poi – Calculate difference of a POI between two results directories.
fit_parameter – Retrieve a parameter from fit result text file.
grouped_impacts – Grab grouped impacts from a fit workspace.
grouped_impacts_table – Construct a table of grouped impacts.
meta_axis_label – Construct an axis label from metadata table.
meta_text – Construct a piece of text based on the region and fit stage.
nuispar_impact – Extract a specific nuisance parameter from a fit.
nuispar_impacts – Extract a list of nuisance parameter impacts from a fit.
nuispar_impact_plot_df – Construct a DataFrame to organize impact plot ingredients.
nuispar_impact_plot_top20 – Plot the top 20 nuisance parameters based on impact.
plot_all_regions – Plot all regions discovered in a TRExFitter result directory.
plot_region_stage_ff – Free (multiprocessing compatible) function to plot a region + stage.
prefit_total_and_uncertainty – Get the prefit total MC prediction and uncertainty band for a region.
prefit_histogram – Get a prefit histogram from a file.
prefit_histograms – Retrieve sample prefit histograms for a region.
prettify_label – Fix parameter label to look nice for plots.
postfit_available – Check if TRExFitter result directory contains postFit information.
postfit_total_and_uncertainty – Get the postfit total MC prediction and uncertainty band for a region.
postfit_histogram – Get a postfit histogram from a file.
postfit_histograms – Retrieve sample postfit histograms for a region.
stability_test_standard – Perform a battery of standard stability tests.
stability_test_parton_shower_impacts – Perform a battery of parton shower impact stability tests.
stack_canvas – Create a pre- or post-fit plot canvas for a TRExFitter region.
Reference¶
-
class
tdub.rex.
FitParam
(name='', label='', pre_down=0.0, pre_up=0.0, post_down=0.0, post_up=0.0, central=0.0, sig_lo=0.0, sig_hi=0.0, post_max=0.0)[source]¶ Fit parameter description as a dataclass.
-
class
tdub.rex.
GroupedImpact
(name='', avg=0.0, sig_lo=0.0, sig_hi=0.0)[source]¶ Fit grouped impact summary.
-
tdub.rex.
available_regions
(rex_dir)[source]¶ Get a list of available regions from a TRExFitter result directory.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
- Returns
Regions discovered in the TRExFitter result directory.
- Return type
list(str)
-
tdub.rex.
chisq
(rex_dir, region, stage='pre')[source]¶ Get \(\chi^2\) information from a TRExFitter region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
region (str) – TRExFitter region name.
stage (str) – Drawing fit stage, (‘pre’ or ‘post’).
- Returns
-
tdub.rex.
chisq_text
(rex_dir, region, stage='pre')[source]¶ Generate nicely formatted text for \(\chi^2\) information.
Deploys tdub.rex.chisq() to grab the info.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
region (str) – TRExFitter region name.
stage (str) – Drawing fit stage, (‘pre’ or ‘post’).
- Returns
Formatted string showing the \(\chi^2\) information.
- Return type
str
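Examples
A usage sketch; the result directory and region name are hypothetical:
>>> from tdub.rex import chisq_text
>>> print(chisq_text("/path/to/rex_dir", "reg1j1b", stage="post"))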
-
tdub.rex.
compare_nuispar
(name, rex_dir1, rex_dir2, label1=None, label2=None, np_label=None, print_to=None)[source]¶ Compare nuisance parameter info between two fits.
- Parameters
name (str) – Name of the nuisance parameter.
rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.
rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.
label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).
label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).
np_label (str, optional) – Give the nuisance parameter a label other than its name.
print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).
-
tdub.rex.
compare_uncertainty
(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', label1=None, label2=None, poi='SigXsecOverSM', print_to=None)[source]¶ Compare uncertainty between two fits.
- Parameters
rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.
rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.
fit_name1 (str) – Name of the first fit.
fit_name2 (str) – Name of the second fit.
label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).
label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).
poi (str) – Name of the parameter of interest.
print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).
-
tdub.rex.
comparison_summary
(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', label1=None, label2=None, fit_poi='SigXsecOverSM', nuispars=None, nuispar_labels=None, print_to=None)[source]¶ Summarize a comparison of two fits.
- Parameters
rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.
rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.
fit_name1 (str) – Name of the first fit.
fit_name2 (str) – Name of the second fit.
label1 (str, optional) – Define label for the first fit (defaults to rex_dir1).
label2 (str, optional) – Define label for the second fit (defaults to rex_dir2).
fit_poi (str) – Name of the parameter of interest.
nuispars (list(str), optional) – Nuisance parameters to compare.
nuispar_labels (list(str), optional) – Labels to give each nuisance parameter other than the default name.
print_to (io.TextIOBase, optional) – Where to print results (defaults to sys.stdout).
-
tdub.rex.
data_histogram
(rex_dir, region, fit_name='tW')[source]¶ Get the histogram for the Data in a region from a TRExFitter result.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
region (str) – TRExFitter region name.
fit_name (str) – Name of the Fit
- Returns
Histogram for the Data sample.
- Return type
tdub.root.TH1
-
tdub.rex.
delta_param
(param1, param2)[source]¶ Calculate difference between two fit parameters.
- Parameters
param1 (tdub.rex.FitParam) – First fit parameter.
param2 (tdub.rex.FitParam) – Second fit parameter.
- Returns
-
tdub.rex.
delta_poi
(rex_dir1, rex_dir2, fit_name1='tW', fit_name2='tW', poi='SigXsecOverSM')[source]¶ Calculate difference of a POI between two results directories.
The default arguments will perform a calculation of \(\Delta\mu\) between two different fits. Standard error propagation is performed on both the up and down uncertainties.
- Parameters
rex_dir1 (str or pathlib.Path) – Path of the first TRExFitter result directory.
rex_dir2 (str or pathlib.Path) – Path of the second TRExFitter result directory.
fit_name1 (str) – Name of the first fit.
fit_name2 (str) – Name of the second fit.
poi (str) – Name of the parameter of interest.
- Returns
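Examples
A usage sketch comparing two hypothetical result directories:
>>> from tdub.rex import delta_poi
>>> diff = delta_poi("/path/to/fit1", "/path/to/fit2")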
-
tdub.rex.
fit_parameter
(fit_file, name, prettify=False)[source]¶ Retrieve a parameter from fit result text file.
- Parameters
fit_file (pathlib.Path) – Path of the TRExFitter fit result text file.
name (str) – Name of desired parameter.
prettify (bool) – Prettify the parameter label using tdub.rex.prettify_label().
- Raises
ValueError – If the parameter name isn’t discovered.
- Returns
Fit parameter description.
- Return type
tdub.rex.FitParam
-
tdub.rex.
grouped_impacts
(rex_dir, include_total=False)[source]¶ Grab grouped impacts from a fit workspace.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
include_total (bool) – Include the FullSyst entry.
- Yields
GroupedImpact – Iterator of grouped impacts in the fit.
-
tdub.rex.
grouped_impacts_table
(rex_dir, tablefmt='orgtbl', descending=False, **kwargs)[source]¶ Construct a table of grouped impacts.
Uses the tabulate project (https://pypi.org/project/tabulate).
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
tablefmt (str) – Format passed to tabulate.
descending (bool) – Sort by descending order
**kwargs (dict) – Passed to grouped_impacts().
- Returns
Table representation.
- Return type
str
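Examples
A usage sketch; the result directory is hypothetical:
>>> from tdub.rex import grouped_impacts_table
>>> print(grouped_impacts_table("/path/to/rex_dir", tablefmt="github", descending=True))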
-
tdub.rex.
plot_all_regions
(rex_dir, outdir, stage='pre', fit_name='tW', show_chisq=True, n_test=-1, internal=True, thesis=False, save_png=False)[source]¶ Plot all regions discovered in a TRExFitter result directory.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
outdir (str or pathlib.Path) – Path to save resulting files to
stage (str) – Fitting stage (“pre” or “post”).
fit_name (str) – Name of the Fit
show_chisq (bool) – Print \(\chi^2\) information on ratio canvas.
n_test (int) – Maximum number of regions to plot (for quick tests).
internal (bool) – Flag for internal label.
thesis (bool) – Flag for thesis label.
save_png (bool) – Save png versions along with the pdf versions of plots.
-
tdub.rex.
plot_region_stage_ff
(args)[source]¶ Free (multiprocessing compatible) function to plot a region + stage.
This function is designed to be used internally by plot_all_regions(), where it is sent to a multiprocessing pool. Not meant for generic usage.
- Parameters
args (list(Any)) – Arguments passed to stack_canvas().
-
tdub.rex.
meta_axis_label
(region, bin_width, meta_table=None)[source]¶ Construct an axis label from metadata table.
- Parameters
region (str) – TRExFitter region name.
bin_width (float) – Bin width used in the y-axis label.
meta_table (dict, optional) – Table of metadata for labeling plot axes.
- Returns
str – x-axis label for the region.
str – y-axis label for the region.
-
tdub.rex.
meta_text
(region, stage)[source]¶ Construct a piece of text based on the region and fit stage.
-
tdub.rex.
nuispar_impact
(rex_dir, name, label=None)[source]¶ Extract a specific nuisance parameter from a fit.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
name (str) – Name of the nuisance parameter.
label (str, optional) – Give the nuisance parameter a label other than its name.
- Returns
Desired nuisance parameter summary.
- Return type
-
tdub.rex.
nuispar_impacts
(rex_dir, sort=True)[source]¶ Extract a list of nuisance parameter impacts from a fit.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
sort (bool) – Sort the parameters by impact.
- Returns
The nuisance parameters.
- Return type
-
tdub.rex.
nuispar_impact_plot_df
(nuispars)[source]¶ Construct a DataFrame to organize impact plot ingredients.
- Parameters
nuispars (list) – The nuisance parameter summaries.
- Returns
DataFrame describing the plot ingredients.
- Return type
pandas.DataFrame
-
tdub.rex.
nuispar_impact_plot_top20
(rex_dir, thesis=False)[source]¶ Plot the top 20 nuisance parameters based on impact.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
thesis (bool) – Flag for thesis label.
-
tdub.rex.
prefit_total_and_uncertainty
(rex_dir, region)[source]¶ Get the prefit total MC prediction and uncertainty band for a region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
region (str) – Region to get error band for.
- Returns
tdub.root.TH1 – The total MC expectation histogram.
tdub.root.TGraphAsymmErrors – The error TGraph.
-
tdub.rex.
prefit_histogram
(root_file, sample, region)[source]¶ Get a prefit histogram from a file.
- Parameters
root_file (uproot.reading.ReadOnlyDirectory) – File containing the desired prefit histogram.
sample (str) – Physics sample name.
region (str) – TRExFitter region name.
- Returns
Desired histogram.
- Return type
tdub.root.TH1
-
tdub.rex.
prefit_histograms
(rex_dir, samples, region, fit_name='tW')[source]¶ Retrieve sample prefit histograms for a region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
samples (Iterable(str)) – Physics samples of the desired histograms
region (str) – Region to get histograms for
fit_name (str) – Name of the Fit
- Returns
Prefit histograms.
- Return type
dict(str, tdub.root.TH1)
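Examples
A usage sketch; the sample names and result directory are hypothetical:
>>> from tdub.rex import prefit_histograms
>>> hists = prefit_histograms("/path/to/rex_dir", ["tW", "ttbar"], "reg1j1b")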
-
tdub.rex.
prettify_label
(label)[source]¶ Fix parameter label to look nice for plots.
Replace underscores with whitespace, TeXify some stuff, remove unnecessary things, etc.
-
tdub.rex.
postfit_available
(rex_dir)[source]¶ Check if TRExFitter result directory contains postFit information.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
- Returns
True if postFit information is discovered.
- Return type
bool
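Examples
A usage sketch guarding postfit-only plotting (the result directory is hypothetical):
>>> from tdub.rex import postfit_available
>>> if postfit_available("/path/to/rex_dir"):
...     print("can make postfit plots")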
-
tdub.rex.
postfit_total_and_uncertainty
(rex_dir, region)[source]¶ Get the postfit total MC prediction and uncertainty band for a region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
region (str) – Region to get error band for.
- Returns
tdub.root.TH1
– The total MC expectation histogram.tdub.root.TGraphAsymmErrors
– The error TGraph.
-
tdub.rex.
postfit_histogram
(root_file, sample)[source]¶ Get a postfit histogram from a file.
- Parameters
root_file (uproot.reading.ReadOnlyDirectory) – File containing the desired postfit histogram.
sample (str) – Physics sample name.
- Returns
Desired histogram.
- Return type
tdub.root.TH1
-
tdub.rex.
postfit_histograms
(rex_dir, samples, region)[source]¶ Retrieve sample postfit histograms for a region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory
region (str) – Region to get histograms for
samples (Iterable(str)) – Physics samples of the desired histograms
- Returns
Postfit histograms detected in the TRExFitter result directory.
- Return type
dict(str, tdub.root.TH1)
-
tdub.rex.
stability_test_standard
(umbrella, outdir=None, tests='all')[source]¶ Perform a battery of standard stability tests.
This function expects a rigid umbrella directory structure, based on the output of results that are generated by rexpy.
- Parameters
umbrella (pathlib.Path) – Umbrella directory containing all fits run via rexpy’s standard fits.
outdir (pathlib.Path, optional) – Directory to save results (defaults to current working directory).
tests (str) – Which tests to execute (default is "all"). The possible tests include:
"sys-drops", which shows the stability test for dropping some systematics.
"indiv-camps", which shows the stability test for limiting the fit to individual campaigns.
"regions", which shows the stability test for limiting the fit to subsets of the analysis regions.
"b0-check", which shows the stability test for limiting the fit to individual analysis regions and checking the B0 eigenvector uncertainty.
-
tdub.rex.
stability_test_parton_shower_impacts
(herwig704, herwig713, outdir=None)[source]¶ Perform a battery of parton shower impact stability tests.
This function expects a rigid pair of Herwig 7.0.4 and 7.1.3 directories based on the output of results that are generated by rexpy.
- Parameters
herwig704 (pathlib.Path) – Path of the Herwig 7.0.4 fit results.
herwig713 (pathlib.Path) – Path of the Herwig 7.1.3 fit results.
outdir (pathlib.Path, optional) – Directory to save results (defaults to current working directory).
-
tdub.rex.
stack_canvas
(rex_dir, region, stage='pre', fit_name='tW', show_chisq=True, meta_table=None, log_patterns=None, internal=True, thesis=False)[source]¶ Create a pre- or post-fit plot canvas for a TRExFitter region.
- Parameters
rex_dir (str or pathlib.Path) – Path of the TRExFitter result directory.
region (str) – Region to get error band for.
stage (str) – Drawing fit stage, (“pre” or “post”).
fit_name (str) – Name of the Fit
show_chisq (bool) – Print \(\chi^2\) information on ratio canvas.
meta_table (dict, optional) – Table of metadata for labeling plotting axes.
log_patterns (list, optional) – List of region patterns to use a log scale on y-axis.
internal (bool) – Flag for internal label.
thesis (bool) – Flag for thesis label.
- Returns
matplotlib.figure.Figure – Figure for housing the plot.
matplotlib.axes.Axes – Main axes for the histogram stack.
matplotlib.axes.Axes – Ratio axes to show Data/MC.
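Examples
A usage sketch matching the three return values above; the result directory and region are hypothetical:
>>> from tdub.rex import stack_canvas
>>> fig, ax, axr = stack_canvas("/path/to/rex_dir", "reg2j2b", stage="post")
>>> fig.savefig("reg2j2b_postfit.pdf")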
tdub.root¶
A module for working with ROOT-like objects (without ROOT itself).
Class Summary¶
TH1 – Wrapper around uproot’s interpretation of ROOT’s TH1.
TGraphAsymmErrors – Wrapper around uproot’s interpretation of ROOT’s TGraphAsymmErrors.
Reference¶
-
class
tdub.root.
TH1
(root_object)[source]¶ Wrapper around uproot’s interpretation of ROOT’s TH1.
This class interprets the histogram in a way that ignores underflow and overflow bins. We expect the treatment of those values to already be accounted for.
- Parameters
root_object (uproot.behaviors.TH1.Histogram) – Object from reading ROOT file with uproot.
-
property
centers
¶ Histogram bin centers.
- Type
numpy.ndarray
-
property
counts
¶ Histogram bin counts.
- Type
numpy.ndarray
-
property
edges
¶ Histogram bin edges.
- Type
numpy.ndarray
-
property
errors
¶ Histogram bin errors.
- Type
numpy.ndarray
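Examples
A usage sketch; the file path and the histogram key inside it are hypothetical:
>>> import uproot
>>> from tdub.root import TH1
>>> root_file = uproot.open("/path/to/file.root")
>>> h = TH1(root_file["reg1j1b_prefit"])  # hypothetical key of a TH1 in the file
>>> h.counts.sum(), h.edges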
-
class
tdub.root.
TGraphAsymmErrors
(root_object)[source]¶ Wrapper around uproot’s interpretation of ROOT’s TGraphAsymmErrors.
- Parameters
root_object (uproot.model.Model) – Object from reading ROOT file with uproot.
-
property
xhi
¶ X-axis high errors.
- Type
numpy.ndarray
-
property
xlo
¶ X-axis low errors.
- Type
numpy.ndarray
-
property
yhi
¶ Y-axis high errors.
- Type
numpy.ndarray
-
property
ylo
¶ Y-axis low errors.
- Type
numpy.ndarray