tdub.frames

A module for handling dataframes.

Factory Function Summary

`iterative_selection`(files, selection[, …])	Build a selected dataframe via uproot’s iterate.
`raw_dataframe`(files[, tree, weight_name, …])	Construct a raw pandas flavored Dataframe with help from uproot.

Helper Function Summary

`apply_weight`(df, weight_name[, exclude])	Apply (multiply) a weight to all other weights in the DataFrame.
`apply_weight_campaign`(df[, exclude])	Multiply nominal and systematic weights by the campaign weight.
`apply_weight_tptrw`(df[, exclude])	Multiply nominal and systematic weights by the top pt reweight term.
`apply_weight_trrw`(df[, exclude])	Multiply nominal and systematic weights by the top recursive reweight term.
`drop_avoid`(df[, region])	Drop columns that we avoid in classifiers.
`drop_cols`(df, *cols)	Drop some columns from a dataframe.
`drop_jet2`(df)	Drop all columns with jet2 properties.
`satisfying_selection`(*dfs, selection)	Get subsets of dataframes that satisfy a selection.

Reference

tdub.frames.iterative_selection(files, selection, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, keep_category=None, exclude_avoids=False, use_campaign_weight=False, use_tptrw=False, use_trrw=False, sample_frac=None, **kwargs)

Build a selected dataframe via uproot’s iterate.

When we build a memory-hungry dataframe and apply a selection, this function helps us avoid crashing by exhausting our RAM. It is useful when we want to grab many branches from a large dataset that would not fit in memory before the selection is applied.

The selection can be in either numexpr or ROOT form; a ROOT-style selection is converted to numexpr form for use with pandas.eval().
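
To illustrate what converting a ROOT-style selection to numexpr form can involve, here is a minimal sketch. The helper name `root_to_numexpr` and the branch names in the example string are hypothetical, and this is not necessarily how tdub performs the conversion; it only swaps the logical operators `&&`, `||` and `!` for `&`, `|` and `~`.

```python
import re


def root_to_numexpr(selection: str) -> str:
    """Translate ROOT-style logical operators to numexpr-style ones.

    Hypothetical helper for illustration only.
    """
    out = selection.replace("&&", "&").replace("||", "|")
    # Turn a logical NOT ("!") into "~", but leave "!=" untouched.
    out = re.sub(r"!(?!=)", "~", out)
    return out


# Illustrative selection string; the branch names are assumptions.
root_style = "(reg2j2b == 1) && (OS == 1) && !(pT_jet1 < 30)"
numexpr_style = root_to_numexpr(root_style)
print(numexpr_style)
```

A selection that is already in numexpr form passes through this sketch unchanged, so it could be applied unconditionally.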

Parameters

files (list(str) or str) – A single ROOT file or list of ROOT files.
selection (str) – Selection string (numexpr or ROOT form accepted).
tree (str) – Tree name to turn into a dataframe.
weight_name (str) – Weight branch to preserve.
branches (list(str), optional) – List of branches to include as columns in the dataframe, default is None (all branches).
keep_category (str, optional) – If not None, the selected dataframe(s) will only include columns which are part of the given category (see tdub.data.categorize_branches()). The weight branch is always kept.
exclude_avoids (bool) – Exclude branches defined by tdub.config.AVOID_IN_CLF.
use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight.
use_tptrw (bool) – Apply the top pt reweighting factor.
use_trrw (bool) – Apply the top recursive reweighting factor.
sample_frac (float, optional) – Sample a fraction of the available data.

Returns

The final selected dataframe(s) from the files.

Return type

pandas.DataFrame

Examples

Creating a ttbar_df dataframe and a single tW_df dataframe:

>>> from tdub.frames import iterative_selection
>>> from tdub.data import quick_files
>>> from tdub.data import selection_for
>>> qf = quick_files("/path/to/files")
>>> ttbar_df = iterative_selection(qf["ttbar"], selection_for("2j2b"),
...                                entrysteps="1 GB")
>>> tW_df = iterative_selection(qf["tW_DR"], selection_for("2j2b"))

Keep only kinematic branches after selection and ignore avoided columns:

>>> tW_df = iterative_selection(qf["tW_DR"],
...                             selection_for("2j2b"),
...                             exclude_avoids=True,
...                             keep_category="kinematics")

tdub.frames.raw_dataframe(files, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, drop_weight_sys=False, **kwargs)

Construct a raw pandas flavored Dataframe with help from uproot.

We call this dataframe “raw” because it hasn’t been parsed by any other tdub.frames functionality (no selection performed, kinematic and weight branches won’t be separated, etc.) – just a pure raw dataframe from some ROOT files.

Extra kwargs are fed to uproot’s arrays() interface.

Parameters

files (list(str) or str) – Single ROOT file or list of ROOT files.
tree (str) – The tree name to turn into a dataframe.
weight_name (str) – Weight branch (we make sure to grab it if you give something other than None to branches).
branches (list(str), optional) – List of branches to include as columns in the dataframe, default is None, includes all branches.
drop_weight_sys (bool) – Drop all weight systematics from the branches being grabbed.

Returns

The pandas flavored DataFrame with all requested branches.

Return type

pandas.DataFrame

Examples

>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe
>>> files = quick_files("/path/to/files")["ttbar"]
>>> df = raw_dataframe(files)
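
The interplay of the branches, weight_name and drop_weight_sys parameters can be sketched as a small branch-list resolution step. The helper `resolve_branches` is hypothetical, and the assumption that systematic weight branches start with `weight_sys` is made for illustration only; the real function may decide this differently.

```python
def resolve_branches(available, branches=None, weight_name="weight_nominal",
                     drop_weight_sys=False):
    """Decide which branches to read (hypothetical helper).

    Assumes systematic weight branches start with "weight_sys".
    """
    # None means "all branches in the tree".
    wanted = list(available) if branches is None else list(branches)
    # Always grab the weight branch, even if the caller did not list it.
    if weight_name not in wanted:
        wanted.append(weight_name)
    if drop_weight_sys:
        wanted = [b for b in wanted if not b.startswith("weight_sys")]
    return wanted


# Illustrative branch names.
available = ["pT_lep1", "pT_jet1", "weight_nominal", "weight_sys_jvt_UP"]
subset = resolve_branches(available, branches=["pT_lep1"])
no_sys = resolve_branches(available, drop_weight_sys=True)
print(subset)
print(no_sys)
```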

tdub.frames.apply_weight(df, weight_name, exclude=None)

Apply (multiply) a weight to all other weights in the DataFrame.

This will multiply the nominal weight and all systematic weights in the DataFrame by the weight_name column. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe to operate on.
weight_name (str) – Column name to multiply all other weight columns by.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.apply_weight("weight_campaign")
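
As a rough illustration of the in-place multiplication, the sketch below scales every other weight column by the chosen one. The function `apply_weight_sketch` is hypothetical, and it assumes that weight columns are exactly those whose names start with `weight_`; the actual rule tdub uses to find the nominal and systematic weights may differ.

```python
import pandas as pd


def apply_weight_sketch(df, weight_name, exclude=None):
    """Multiply all other weight columns by `weight_name`, in place.

    Hypothetical helper; assumes weight columns start with "weight_".
    """
    excluded = set(exclude or [])
    targets = [
        c for c in df.columns
        if c.startswith("weight_") and c != weight_name and c not in excluded
    ]
    # Row-wise multiplication of each target column by the chosen weight.
    df[targets] = df[targets].multiply(df[weight_name], axis="index")


# Illustrative data; column names are assumptions.
df = pd.DataFrame({
    "pT_lep1": [50.0, 60.0],
    "weight_nominal": [0.5, 2.0],
    "weight_sys_up": [1.0, 4.0],
    "weight_campaign": [0.5, 0.25],
})
apply_weight_sketch(df, "weight_campaign")
print(df["weight_nominal"].tolist())  # [0.25, 0.5]
```

Non-weight columns such as `pT_lep1` and the applied weight column itself are left untouched.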

tdub.frames.apply_weight_campaign(df, exclude=None)

Multiply nominal and systematic weights by the campaign weight.

This is useful for samples that were produced without the campaign weight term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.003
>>> df.weight_campaign[5]
0.4
>>> df.apply_weight_campaign()
>>> df.weight_nominal[5]
0.0012

tdub.frames.apply_weight_tptrw(df, exclude=None)

Multiply nominal and systematic weights by the top pt reweight term.

This is useful for samples that were produced without the top pt reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_tptrw_tool[5]
0.98
>>> df.apply_weight_tptrw()
>>> df.weight_nominal[5]
0.00196

tdub.frames.apply_weight_trrw(df, exclude=None)

Multiply nominal and systematic weights by the top recursive reweight term.

This is useful for samples that were produced without the top recursive reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.

Examples

>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_trrw_tool[5]
0.98
>>> df.apply_weight_trrw()
>>> df.weight_nominal[5]
0.00196

tdub.frames.drop_avoid(df, region=None)

Drop columns that we avoid in classifiers.

Uses tdub.frames.drop_cols() with a predefined set of columns (tdub.config.AVOID_IN_CLF). We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe that you want to slim.
region (str or tdub.data.Region, optional) – Region to augment the list of dropped columns (see the region specific AVOID constants in the config module).

Examples

>>> from tdub.frames import drop_avoid
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jetL1" in df.columns
True
>>> drop_avoid(df)
>>> "E_jetL1" in df.columns
False

tdub.frames.drop_cols(df, *cols)

Drop some columns from a dataframe.

This is a convenience function: any name in cols that does not exist as a column in the dataframe is simply ignored. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe which we want to slim.
*cols (sequence of strings) – Columns to remove.

Examples

>>> import pandas as pd
>>> from tdub.frames import drop_cols
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jet1" in df.columns
True
>>> "mass_jet1" in df.columns
True
>>> "mass_jet2" in df.columns
True
>>> drop_cols(df, "E_jet1", "mass_jet1")
>>> "E_jet1" in df.columns
False
>>> "mass_jet1" in df.columns
False
>>> df.drop_cols("mass_jet2") # use augmented df class
>>> "mass_jet2" in df.columns
False
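
The two behaviours described here, ignoring missing columns and being callable as a DataFrame method, can be sketched as follows. The name `drop_cols_sketch` is hypothetical, and attaching the function as a class attribute is only one possible way to augment pandas.DataFrame; it is not confirmed to be the mechanism tdub uses.

```python
import pandas as pd


def drop_cols_sketch(df, *cols):
    """Drop the given columns in place, ignoring names that are absent.

    Hypothetical helper for illustration only.
    """
    present = [c for c in cols if c in df.columns]
    df.drop(columns=present, inplace=True)


# One way to "augment" DataFrame: attach the function as a method,
# so the dataframe itself is passed as the first argument.
pd.DataFrame.drop_cols_sketch = drop_cols_sketch

df = pd.DataFrame({"E_jet1": [1.0], "mass_jet1": [2.0], "mass_jet2": [3.0]})
drop_cols_sketch(df, "E_jet1", "not_a_column")  # the missing name is ignored
df.drop_cols_sketch("mass_jet2")                # method-style call
print(list(df.columns))  # ['mass_jet1']
```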

tdub.frames.drop_jet2(df)

Drop all columns with jet2 properties.

In the 1j1b region there is no second jet, so this lets us get rid of all columns that depend on jet2 kinematic properties. We augment pandas.DataFrame with this function.

Parameters

df (pandas.DataFrame) – Dataframe that we want to slim.

Examples

>>> from tdub.frames import drop_jet2
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
True
>>> drop_jet2(df)
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
False
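
A minimal sketch of the idea, assuming that a column depends on the second jet exactly when its name contains the substring `jet2`. Both the helper name `drop_jet2_sketch` and that matching rule are assumptions made for illustration; the real function may select columns differently.

```python
import pandas as pd


def drop_jet2_sketch(df):
    """Drop, in place, every column whose name contains "jet2".

    Hypothetical helper; the substring rule is an assumption.
    """
    jet2_cols = [c for c in df.columns if "jet2" in c]
    df.drop(columns=jet2_cols, inplace=True)


# Illustrative single-row dataframe.
df = pd.DataFrame({
    "pT_jet1": [40.0],
    "pT_jet2": [25.0],
    "pTsys_lep1lep2jet1jet2met": [12.0],
    "weight_nominal": [0.5],
})
drop_jet2_sketch(df)
print(list(df.columns))  # ['pT_jet1', 'weight_nominal']
```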

tdub.frames.satisfying_selection(*dfs, selection)

Get subsets of dataframes that satisfy a selection.

The selection string can be in either ROOT or numexpr form (a ROOT-style selection is converted to numexpr form).

Parameters

*dfs (sequence of pandas.DataFrame) – Dataframes to apply the selection to.
selection (str) – Selection string (in numexpr or ROOT form).

Returns

Dataframes satisfying the selection string.

Return type

list(pandas.DataFrame)

Examples

>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe, satisfying_selection
>>> qf = quick_files("/path/to/files")
>>> df_tW_DR = raw_dataframe(qf["tW_DR"])
>>> df_ttbar = raw_dataframe(qf["ttbar"])
>>> low_bdt = "(bdt_response < 0.4)"
>>> tW_DR_selected, ttbar_selected = satisfying_selection(
...     df_tW_DR, df_ttbar, selection=low_bdt
... )
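
The call pattern above (any number of dataframes, then a keyword-only selection, returning one subset per input) can be sketched with a boolean mask from pandas. The name `satisfying_selection_sketch` is hypothetical, the selection is assumed to be in numexpr form already, and the data are made up for illustration.

```python
import pandas as pd


def satisfying_selection_sketch(*dfs, selection):
    """Return, for each input dataframe, the rows passing `selection`.

    Hypothetical helper; `selection` must be passed by keyword.
    """
    # df.eval gives a boolean Series that is used as a row mask.
    return [df[df.eval(selection)] for df in dfs]


# Illustrative data.
df_a = pd.DataFrame({"bdt_response": [0.1, 0.5, 0.3]})
df_b = pd.DataFrame({"bdt_response": [0.9, 0.2]})
a_sel, b_sel = satisfying_selection_sketch(
    df_a, df_b, selection="(bdt_response < 0.4)"
)
print(len(a_sel), len(b_sel))  # 2 1
```

Because the result is a list with one entry per input, it can be unpacked directly into as many names as dataframes were passed in.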