tdub.frames
A module for handling dataframes.
Factory Function Summary
| iterative_selection | Build a selected dataframe via uproot’s iterate. |
| raw_dataframe | Construct a raw pandas flavored Dataframe with help from uproot. |
Helper Function Summary
| apply_weight | Apply (multiply) a weight to all other weights in the DataFrame. |
| apply_weight_campaign | Multiply nominal and systematic weights by the campaign weight. |
| apply_weight_tptrw | Multiply nominal and systematic weights by the top pt reweight term. |
| apply_weight_trrw | Multiply nominal and systematic weights by the top recursive reweight term. |
| drop_avoid | Drop columns that we avoid in classifiers. |
| drop_cols | Drop some columns from a dataframe. |
| drop_jet2 | Drop all columns with jet2 properties. |
| satisfying_selection | Get subsets of dataframes that satisfy a selection. |
Reference
tdub.frames.iterative_selection(files, selection, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, keep_category=None, exclude_avoids=False, use_campaign_weight=False, use_tptrw=False, use_trrw=False, sample_frac=None, **kwargs)
Build a selected dataframe via uproot’s iterate.
If we want to build a memory-hungry dataframe and apply a selection, this function helps us avoid crashing by exhausting our RAM. It is useful when we want to grab many branches from a large dataset that won’t fit in memory before the selection is applied.
The selection can be in either numexpr or ROOT form; a ROOT style selection is converted to numexpr for use with pandas.eval().
- Parameters
files (list(str) or str) – A single ROOT file or list of ROOT files.
selection (str) – Selection string (numexpr or ROOT form accepted).
tree (str) – Tree name to turn into a dataframe.
weight_name (str) – Weight branch to preserve.
branches (list(str), optional) – List of branches to include as columns in the dataframe; the default is None (all branches).
keep_category (str, optional) – If not None, the selected dataframe(s) will only include columns which are part of the given category (see tdub.data.categorize_branches()). The weight branch is always kept.
exclude_avoids (bool) – Exclude branches defined by tdub.config.AVOID_IN_CLF.
use_campaign_weight (bool) – Multiply the nominal weight by the campaign weight. This is potentially necessary if the samples were prepared without the campaign weight included in the product which forms the nominal weight.
use_tptrw (bool) – Apply the top pt reweighting factor.
use_trrw (bool) – Apply the top recursive reweighting factor.
sample_frac (float, optional) – Sample a fraction of the available data.
- Returns
The final selected dataframe(s) from the files.
- Return type
Examples
Creating a ttbar_df dataframe and a single tW_df dataframe:
>>> from tdub.frames import iterative_selection
>>> from tdub.data import quick_files
>>> from tdub.data import selection_for
>>> qf = quick_files("/path/to/files")
>>> ttbar_df = iterative_selection(qf["ttbar"], selection_for("2j2b"),
...                                entrysteps="1 GB")
>>> tW_df = iterative_selection(qf["tW_DR"], selection_for("2j2b"))
Keep only kinematic branches after selection and ignore avoided columns:
>>> tW_df = iterative_selection(qf["tW_DR"],
...                             selection_for("2j2b"),
...                             exclude_avoids=True,
...                             keep_category="kinematics")
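The memory-saving idea described for iterative_selection can be illustrated without ROOT files. The following is a minimal sketch, not tdub's implementation: it assumes the data arrives as a sequence of pandas chunks (standing in for what uproot's iterate would yield), and the helper name select_in_chunks and the column values are hypothetical.

```python
import pandas as pd


def select_in_chunks(chunks, selection):
    """Apply `selection` to each chunk and concatenate the surviving rows."""
    kept = []
    for chunk in chunks:
        # Filter each chunk before storing it, so only selected rows accumulate
        # in memory instead of the full dataset.
        kept.append(chunk.query(selection))
    return pd.concat(kept, ignore_index=True)


# Hypothetical stand-in for chunks read from ROOT files.
chunks = [
    pd.DataFrame({"njets": [1, 2, 2], "weight_nominal": [0.1, 0.2, 0.3]}),
    pd.DataFrame({"njets": [2, 3], "weight_nominal": [0.4, 0.5]}),
]
df = select_in_chunks(chunks, "(njets == 2)")
```

Because each chunk is reduced before it is kept, peak memory scales with the chunk size plus the selected rows rather than with the whole input.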
tdub.frames.raw_dataframe(files, tree='WtLoop_nominal', weight_name='weight_nominal', branches=None, drop_weight_sys=False, **kwargs)
Construct a raw pandas flavored Dataframe with help from uproot.
We call this dataframe “raw” because it hasn’t been parsed by any other tdub.frames functionality (no selection performed, kinematic and weight branches won’t be separated, etc.) – just a pure raw dataframe from some ROOT files.
Extra kwargs are fed to uproot’s arrays() interface.
- Parameters
files (list(str) or str) – Single ROOT file or list of ROOT files.
tree (str) – The tree name to turn into a dataframe.
weight_name (str) – Weight branch (we make sure to grab it if you give something other than None to branches).
branches (list(str), optional) – List of branches to include as columns in the dataframe; the default is None, which includes all branches.
drop_weight_sys (bool) – Drop all weight systematics from the branches being grabbed.
- Returns
The pandas flavored DataFrame with all requested branches
- Return type
Examples
>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe
>>> files = quick_files("/path/to/files")["ttbar"]
>>> df = raw_dataframe(files)
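The interplay between branches and weight_name described above could look roughly like the following. This is only a sketch of the documented behavior, not tdub's code; the helper name branches_with_weight and the branch name pT_lep1 are hypothetical.

```python
def branches_with_weight(branches, weight_name):
    """Return the branch list to request, always including the weight branch."""
    if branches is None:
        # None means "all branches", so the weight is already included.
        return None
    requested = list(branches)
    if weight_name not in requested:
        requested.append(weight_name)
    return requested


requested = branches_with_weight(["pT_lep1"], "weight_nominal")
```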
tdub.frames.apply_weight(df, weight_name, exclude=None)
Apply (multiply) a weight to all other weights in the DataFrame.
This will multiply the nominal weight and all systematic weights in the DataFrame by the weight_name column. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
weight_name (str) – Column name to multiply all other weight columns by.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.apply_weight("weight_campaign")
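To make the multiplication concrete, here is a minimal sketch of the same idea in plain pandas. It is not tdub's implementation: it assumes weight columns can be recognized by a "weight_" prefix, and the function name apply_weight_sketch and all column values are hypothetical.

```python
import pandas as pd


def apply_weight_sketch(df, weight_name, exclude=None):
    """Multiply every other weight column by `weight_name`, in place."""
    skip = set(exclude or []) | {weight_name}
    # Assumption: weight columns share the "weight_" prefix.
    targets = [c for c in df.columns if c.startswith("weight_") and c not in skip]
    df[targets] = df[targets].multiply(df[weight_name], axis=0)


df = pd.DataFrame({
    "pT_lep1": [50.0, 60.0],
    "weight_nominal": [0.5, 2.0],
    "weight_sys_up": [1.0, 4.0],
    "weight_campaign": [0.5, 0.25],
})
apply_weight_sketch(df, "weight_campaign")
```

The applied weight itself and any non-weight columns are left untouched; only the nominal and systematic weights are scaled row by row.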
tdub.frames.apply_weight_campaign(df, exclude=None)
Multiply nominal and systematic weights by the campaign weight.
This is useful for samples that were produced without the campaign weight term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.003
>>> df.weight_campaign[5]
0.4
>>> df.apply_weight_campaign()
>>> df.weight_nominal[5]
0.0012
tdub.frames.apply_weight_tptrw(df, exclude=None)
Multiply nominal and systematic weights by the top pt reweight term.
This is useful for samples that were produced without the top pt reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_tptrw_tool[5]
0.98
>>> df.apply_weight_tptrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.apply_weight_trrw(df, exclude=None)
Multiply nominal and systematic weights by the top recursive reweight term.
This is useful for samples that were produced without the top recursive reweighting term already applied to all other weights. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe to operate on.
exclude (list(str), optional) – List of columns to exclude when determining the other weight columns to operate on.
Examples
>>> import tdub.frames
>>> df = tdub.frames.raw_dataframe("/path/to/file.root")
>>> df.weight_nominal[5]
0.002
>>> df.weight_trrw_tool[5]
0.98
>>> df.apply_weight_trrw()
>>> df.weight_nominal[5]
0.00196
tdub.frames.drop_avoid(df, region=None)
Drop columns that we avoid in classifiers.
Uses tdub.frames.drop_cols() with a predefined set of columns (tdub.config.AVOID_IN_CLF). We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe that you want to slim.
region (optional, str or tdub.data.Region) – Region to augment the list of dropped columns (see the region specific AVOID constants in the config module).
Examples
>>> from tdub.frames import drop_avoid
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jetL1" in df.columns
True
>>> drop_avoid(df)
>>> "E_jetL1" in df.columns
False
tdub.frames.drop_cols(df, *cols)
Drop some columns from a dataframe.
This is a convenient function because it simply ignores any names in cols that don’t exist as columns in the dataframe. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe which we want to slim.
*cols (sequence of strings) – Columns to remove.
Examples
>>> import pandas as pd
>>> from tdub.frames import drop_cols
>>> df = pd.read_parquet("some_file.parquet")
>>> "E_jet1" in df.columns
True
>>> "mass_jet1" in df.columns
True
>>> "mass_jet2" in df.columns
True
>>> drop_cols(df, "E_jet1", "mass_jet1")
>>> "E_jet1" in df.columns
False
>>> "mass_jet1" in df.columns
False
>>> df.drop_cols("mass_jet2")  # use augmented df class
>>> "mass_jet2" in df.columns
False
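The "ignore missing columns" behavior and the DataFrame augmentation can be approximated in plain pandas. This is a sketch under assumptions, not tdub's code: it relies on pandas' own errors="ignore" option, and attaching the function as a class attribute is just one possible way to augment pandas.DataFrame. The name drop_cols_sketch is hypothetical.

```python
import pandas as pd


def drop_cols_sketch(df, *cols):
    """Drop `cols` in place, silently skipping names that are not columns."""
    # errors="ignore" makes pandas skip labels that do not exist.
    df.drop(columns=list(cols), errors="ignore", inplace=True)


# One possible way to expose the helper as a DataFrame method (assumption).
pd.DataFrame.drop_cols_sketch = drop_cols_sketch

df = pd.DataFrame({"E_jet1": [1.0], "mass_jet1": [2.0], "mass_jet2": [3.0]})
drop_cols_sketch(df, "E_jet1", "not_a_column")  # missing name is ignored
df.drop_cols_sketch("mass_jet2")                # method-style call
```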
tdub.frames.drop_jet2(df)
Drop all columns with jet2 properties.
In the 1j1b region we don’t have a second jet, so this lets us get rid of all columns dependent on jet2 kinematic properties. We augment pandas.DataFrame with this function.
- Parameters
df (pandas.DataFrame) – Dataframe that we want to slim.
Examples
>>> from tdub.frames import drop_jet2
>>> import pandas as pd
>>> df = pd.read_parquet("some_file.parquet")
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
True
>>> drop_jet2(df)
>>> "pTsys_lep1lep2jet1jet2met" in df.columns
False
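A rough picture of what dropping jet2-dependent columns means, assuming such columns can be identified by the substring "jet2" in their name. That matching rule is an assumption made for illustration; the function name drop_jet2_sketch and the pT_jet1 and pT_jet2 columns are hypothetical.

```python
import pandas as pd


def drop_jet2_sketch(df):
    """Drop, in place, every column whose name mentions jet2."""
    # Assumption: jet2-dependent columns contain "jet2" in their name.
    jet2_cols = [c for c in df.columns if "jet2" in c]
    df.drop(columns=jet2_cols, inplace=True)


df = pd.DataFrame({
    "pT_jet1": [40.0],
    "pT_jet2": [30.0],
    "pTsys_lep1lep2jet1jet2met": [12.0],
})
drop_jet2_sketch(df)
```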
tdub.frames.satisfying_selection(*dfs, selection)
Get subsets of dataframes that satisfy a selection.
The selection string can be in either ROOT or numexpr form (a ROOT style selection is converted to numexpr).
- Parameters
*dfs (sequence of pandas.DataFrame) – Dataframes to apply the selection to.
selection (str) – Selection string (in numexpr or ROOT form).
- Returns
Dataframes satisfying the selection string.
- Return type
Examples
>>> from tdub.data import quick_files
>>> from tdub.frames import raw_dataframe, satisfying_selection
>>> qf = quick_files("/path/to/files")
>>> df_tW_DR = raw_dataframe(qf["tW_DR"])
>>> df_ttbar = raw_dataframe(qf["ttbar"])
>>> low_bdt = "(bdt_response < 0.4)"
>>> tW_DR_selected, ttbar_selected = satisfying_selection(
...     df_tW_DR, df_ttbar, selection=low_bdt
... )
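The difference between ROOT and numexpr selection strings comes down mostly to the logical operators. The sketch below shows one simple way such a conversion might be done; it is not tdub's converter, it only handles the `&&`, `||`, and `!` operators, and the name root_to_numexpr and the example branch names are hypothetical.

```python
import re


def root_to_numexpr(selection):
    """Rewrite ROOT style logical operators into numexpr style ones."""
    converted = selection.replace("&&", "&").replace("||", "|")
    # Turn a logical "!" into "~", leaving the "!=" comparison untouched.
    converted = re.sub(r"!(?!=)", "~", converted)
    return converted


numexpr_sel = root_to_numexpr("(reg2j2b == 1) && (OS == 1)")
```

A string that is already in numexpr form, such as "(bdt_response < 0.4)", passes through unchanged, which is why a function can accept either form.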