tdub.data

A module for handling our data.

Class Summary

`Region`(value)	A simple enum class for easily using region information.
`SampleInfo`(input_file)	Describes a sample’s attributes given its name.

Function Summary

`as_region`(region)	Convert input to `Region`.
`avoids_for`(region)	Get the features to avoid for the given region.
`branches_from`(source[, tree, ignore_weights])	Get a list of branches from a data source.
`categorize_branches`(source)	Categorize branches into separate lists.
`features_for`(region)	Get the feature list for a region.
`quick_files`(datapath[, campaign, tree])	Get a dictionary connecting sample processes to file lists.
`selection_as_numexpr`(selection)	Get the numexpr selection string from an arbitrary selection.
`selection_as_root`(selection)	Get the ROOT selection string from an arbitrary selection.
`selection_branches`(selection)	Construct the minimal set of branches required for a selection.
`selection_for`(region[, additional])	Get the selection for a given region.

Reference

class tdub.data.Region(value)

A simple enum class for easily using region information.

r1j1b: Label for our 1j1b region.

r2j1b: Label for our 2j1b region.

r2j2b: Label for our 2j2b region.

Examples

Using this enum to grab the 2j2b region from a set of files:

>>> from tdub.data import Region, selection_for
>>> from tdub.frames import iterative_selection
>>> df = iterative_selection(files, selection_for(Region.r2j2b))

static from_str(s)

Get enum value for the given string.

This function supports three ways to define a region; prefixed with “r”, prefixed with “reg”, or no prefix at all. For example, Region.r2j2b can be retrieved like so:

Region.from_str("r2j2b")
Region.from_str("reg2j2b")
Region.from_str("2j2b")

Parameters: s (str) – String representation of the desired region
Returns: Enum version
Return type: Region

Examples

>>> from tdub.data import Region
>>> Region.from_str("1j1b")
<Region.r1j1b: 0>
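
The prefix handling described above can be sketched with a small stand-in enum. This is not the tdub source: the class below is a hypothetical re-implementation that only assumes the member values shown in the reprs in this reference (0, 1, 2).

```python
from enum import Enum


class Region(Enum):
    """Stand-in for tdub.data.Region (illustrative, not the library code)."""

    r1j1b = 0
    r2j1b = 1
    r2j2b = 2

    @staticmethod
    def from_str(s):
        # Try the longer prefix first so "reg2j2b" is not read as "r" + "eg2j2b".
        for prefix in ("reg", "r"):
            if s.startswith(prefix):
                s = s[len(prefix):]
                break
        # Member names always carry the "r" prefix.
        return Region["r" + s]
```

With this stand-in, all three spellings resolve to the same member, and an unknown region name raises KeyError.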

class tdub.data.SampleInfo(input_file)

Describes a sample’s attributes given its name.

Parameters: input_file (str) – File stem containing the necessary groups to parse.

phy_process

Physics process (e.g. ttbar or tW_DR or Zjets)

Type: str

dsid

Dataset ID

Type: int

sim_type

Simulation type, “FS” or “AFII”

Type: str

campaign

Campaign, MC16{a,d,e}

Type: str

tree

Original tree (e.g. “nominal” or “EG_SCALE_ALL__1up”)

Type: str

Examples

>>> from tdub.data import SampleInfo
>>> sampinfo = SampleInfo("ttbar_410472_AFII_MC16d_nominal.root")
>>> sampinfo.phy_process
'ttbar'
>>> sampinfo.dsid
410472
>>> sampinfo.sim_type
'AFII'
>>> sampinfo.campaign
'MC16d'
>>> sampinfo.tree
'nominal'
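
One way to picture the parsing is a single regular expression over the file stem. The sketch below is an assumption, not the tdub implementation: the helper name `parse_sample_name`, the six-digit DSID, and the layout `<phy_process>_<dsid>_<sim_type>_<campaign>_<tree>` are inferred from the example file names in this reference.

```python
import re
from pathlib import PurePath

# Assumed layout: <phy_process>_<dsid>_<sim_type>_<campaign>_<tree>
_STEM_RE = re.compile(
    r"(?P<phy_process>\w+)_(?P<dsid>\d{6})_(?P<sim_type>FS|AFII)"
    r"_(?P<campaign>MC16[ade])_(?P<tree>\w+)"
)


def parse_sample_name(input_file):
    """Split a file name into the five attributes (hypothetical helper)."""
    stem = PurePath(input_file).stem  # drop directory and extension
    match = _STEM_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"cannot parse sample name: {input_file!r}")
    info = match.groupdict()
    info["dsid"] = int(info["dsid"])  # dsid is documented as an int
    return info
```

Names that do not follow the assumed layout (for example the Data files listed under quick_files) raise ValueError in this sketch; the real class may treat them differently.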

tdub.data.as_region(region)

Convert input to Region.

Meant to be similar to the numpy.asarray() function: input that is already a Region is returned as is, and a str is converted.

Parameters: region (str or Region) – Region already as a Region or as a str
Returns: Region representation.
Return type: Region

Examples

>>> from tdub.data import as_region, Region
>>> as_region("r2j1b")
<Region.r2j1b: 1>
>>> as_region(Region.r2j2b)
<Region.r2j2b: 2>

tdub.data.avoids_for(region)

Get the features to avoid for the given region.

See the tdub.config module for definition of the variables to avoid (and how to modify them).

Parameters: region (str or tdub.data.Region) – Region whose avoided features are requested.
Returns: Features to avoid for the region.
Return type: list(str)

Examples

>>> from tdub.data import avoids_for, Region
>>> avoids_for(Region.r2j1b)
['HT_jet1jet2', 'deltaR_lep1lep2_jet1jet2met', 'mass_lep2jet1', 'pT_jet2']
>>> avoids_for("2j2b")
['deltaR_jet1_jet2']

tdub.data.branches_from(source, tree='WtLoop_nominal', ignore_weights=False)

Get a list of branches from a data source.

If the source is a list of files, the first file is the only file that is parsed.

Parameters

source (str, list(str), os.PathLike, list(os.PathLike), or uproot File/Tree) – What to parse to get the branch information.
tree (str) – Name of the tree to get branches from
ignore_weights (bool) – Flag to ignore all branches starting with weight_.

Returns

Branches from the source.

Return type

list(str)

Raises

TypeError – If source can’t be used to find a list of branches.

Examples

>>> from tdub.data import branches_from
>>> branches_from("/path/to/file.root", ignore_weights=True)
["pT_lep1", "pT_lep2"]
>>> branches_from("/path/to/file.root")
["pT_lep1", "pT_lep2", "weight_nominal", "weight_tptrw"]

tdub.data.categorize_branches(source)

Categorize branches into separate lists.

The categories:

kinematics: for kinematic features (used for classifiers)
weights: for any branch that starts or ends with weight
meta: for meta information (final state information)

Parameters: source (list(str)) – Complete list of branches to be categorized.
Returns: Dictionary connecting categories to their associated lists of branches.
Return type: dict(str, list(str))

Examples

>>> from tdub.data import categorize_branches, branches_from
>>> branches = ["pT_lep1", "pT_lep2", "weight_nominal", "weight_sys_jvt", "reg2j2b"]
>>> cated = categorize_branches(branches)
>>> cated["weights"]
['weight_sys_jvt', 'weight_nominal']
>>> cated["meta"]
['reg2j2b']
>>> cated["kinematics"]
['pT_lep1', 'pT_lep2']

Using a ROOT file:

>>> from pathlib import PosixPath
>>> root_file = PosixPath("/path/to/file.root")
>>> cated = categorize_branches(branches_from(root_file))
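
The three categories can be illustrated with a short stand-alone function. This is a sketch under assumptions: the weight rule (starts or ends with "weight") comes from the description above, but the way meta branches are recognized is not documented here, so the `meta_names` and `meta_prefixes` defaults below are hypothetical.

```python
def categorize(branches, meta_names=("OS",), meta_prefixes=("reg",)):
    """Sort branch names into weights, meta, and kinematics (hypothetical helper)."""
    cats = {"kinematics": [], "weights": [], "meta": []}
    for name in branches:
        if name.startswith("weight") or name.endswith("weight"):
            cats["weights"].append(name)
        elif name in meta_names or name.startswith(meta_prefixes):
            # Assumed rule: region flags ("reg...") and listed names are meta.
            cats["meta"].append(name)
        else:
            # Everything else is treated as a kinematic feature.
            cats["kinematics"].append(name)
    return cats
```

This sketch keeps the input order within each category, so the ordering of its lists may differ from what categorize_branches returns.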

tdub.data.features_for(region)

Get the feature list for a region.

See the tdub.config module for the definitions of the feature lists (and how to modify them).

Parameters: region (str or tdub.data.Region) – Region as a string or enum entry. Using "ALL" returns a list of unique features from all regions.
Returns: Features for that region (or all regions).
Return type: list(str)

Examples

>>> from pprint import pprint
>>> from tdub.data import features_for
>>> pprint(features_for("reg2j1b"))
['mass_lep1jet1',
 'mass_lep1jet2',
 'mass_lep2jet1',
 'mass_lep2jet2',
 'pT_jet2',
 'pTsys_lep1lep2jet1jet2met',
 'psuedoContTagBin_jet1',
 'psuedoContTagBin_jet2']

tdub.data.quick_files(datapath, campaign=None, tree='nominal')

Get a dictionary connecting sample processes to file lists.

The lists of files are sorted alphabetically. These types of samples are currently tested:

tW_DR (410648, 410649 full sim)
tW_DR_AFII (410648, 410649 fast sim)
tW_DR_PS (411038, 411039 fast sim)
tW_DR_inc (410646, 410647 full sim)
tW_DR_inc_AFII (410646, 410647 fast sim)
tW_DS (410656, 410657 full sim)
tW_DS_inc (410654, 410655 full sim)
ttbar (410472 full sim)
ttbar_AFII (410472 fast sim)
ttbar_PS (410558 fast sim)
ttbar_PS713 (411234 fast sim)
ttbar_hdamp (410482 fast sim)
ttbar_inc (410470 full sim)
ttbar_inc_AFII (410470 fast sim)
Diboson
Zjets
MCNP
Data

Parameters

datapath (str or os.PathLike) – Path where all of the ROOT files live.
campaign (str, optional) – Enforce a single campaign (“MC16a”, “MC16d”, or “MC16e”).
tree (str) – Upstream AnalysisTop ntuple tree.

Returns

The dictionary of processes and their associated files.

Return type

dict(str, list(str))

Examples

>>> from pprint import pprint
>>> from tdub.data import quick_files
>>> qf = quick_files("/path/to/some/files")  # has 410472 ttbar samples
>>> pprint(qf["ttbar"])
['/path/to/some/files/ttbar_410472_FS_MC16a_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16d_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16e_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16d")
>>> pprint(qf["tW_DR"])
['/path/to/some/files/tW_DR_410648_FS_MC16d_nominal.root',
 '/path/to/some/files/tW_DR_410649_FS_MC16d_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16a")
>>> pprint(qf["Data"])
['/path/to/some/files/Data15_data15_Data_Data_nominal.root',
 '/path/to/some/files/Data16_data16_Data_Data_nominal.root']
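
The grouping idea can be sketched with pathlib and a regular expression. This is not the tdub implementation: `group_files` is a hypothetical stand-in that assumes the MC naming `<process>_<dsid>_<sim_type>_<campaign>_<tree>.root` and keys each list by the text before the DSID, so it does not reproduce the full-sim/fast-sim keys (such as ttbar_AFII) or the Data handling listed above.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed naming: <process>_<dsid>_<sim_type>_<campaign>_<tree>
_NAME_RE = re.compile(
    r"(?P<process>\w+)_\d{6}_(?:FS|AFII)_(?P<campaign>MC16[ade])_(?P<tree>\w+)"
)


def group_files(datapath, campaign=None, tree="nominal"):
    """Group ROOT files by process prefix (hypothetical stand-in for quick_files)."""
    groups = defaultdict(list)
    for path in Path(datapath).glob(f"*_{tree}.root"):
        match = _NAME_RE.fullmatch(path.stem)
        if match is None or match["tree"] != tree:
            continue  # skip names that do not follow the assumed layout
        if campaign is not None and match["campaign"] != campaign:
            continue  # enforce a single campaign when requested
        groups[match["process"]].append(str(path))
    # Each list is sorted alphabetically, as the description states.
    return {process: sorted(files) for process, files in groups.items()}


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "ttbar_410472_FS_MC16d_nominal.root",
        "ttbar_410472_FS_MC16a_nominal.root",
        "tW_DR_410648_FS_MC16d_nominal.root",
    ):
        (Path(tmp) / name).touch()
    everything = group_files(tmp)
    only_d = group_files(tmp, campaign="MC16d")
    ttbar_names = [Path(f).name for f in everything["ttbar"]]
```

Passing `campaign="MC16d"` drops the MC16a file from the ttbar list, mirroring the optional campaign argument described above.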

tdub.data.selection_as_numexpr(selection)

Get the numexpr selection string from an arbitrary selection.

Parameters: selection (str) – Selection string in ROOT or numexpr
Returns: Selection in numexpr format.
Return type: str

Examples

>>> selection = "reg1j1b == true && OS == true && mass_lep1jet1 < 155"
>>> from tdub.data import selection_as_numexpr
>>> selection_as_numexpr(selection)
'(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)'
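
For intuition, the ROOT-to-numexpr direction can be sketched for the simplest case. The helper `root_to_numexpr` below is hypothetical and deliberately limited: it assumes a flat chain of clauses joined by `&&` (no `||`, no nested groups), which is narrower than what selection_as_numexpr accepts.

```python
import re


def root_to_numexpr(selection):
    """Convert a flat '&&'-joined ROOT selection to numexpr style (sketch only)."""
    clauses = []
    for clause in selection.split("&&"):
        clause = clause.strip()
        # ROOT booleans are lowercase; numexpr uses Python-style capitalization.
        clause = re.sub(r"\btrue\b", "True", clause)
        clause = re.sub(r"\bfalse\b", "False", clause)
        # Wrap each clause so '&' binds the way '&&' did.
        if not (clause.startswith("(") and clause.endswith(")")):
            clause = f"({clause})"
        clauses.append(clause)
    return " & ".join(clauses)
```

The parentheses matter because `&` has higher precedence than the comparison operators in Python-style expressions, unlike `&&` in ROOT.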

tdub.data.selection_as_root(selection)

Get the ROOT selection string from an arbitrary selection.

Parameters: selection (str) – The selection string in ROOT or numexpr
Returns: The same selection in ROOT format.
Return type: str

Examples

>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> from tdub.data import selection_as_root
>>> selection_as_root(selection)
'(reg1j1b == true) && (OS == true) && (mass_lep1jet1 < 155)'

tdub.data.selection_branches(selection)

Construct the minimal set of branches required for a selection.

Parameters: selection (str) – Selection string in ROOT or numexpr
Returns: Necessary branches/variables
Return type: set(str)

Examples

>>> from tdub.data import selection_branches
>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1lep2 > 100)"
>>> selection_branches(selection)
{'OS', 'mass_lep1lep2', 'reg1j1b'}
>>> selection = "reg2j1b == true && OS == true && (mass_lep1jet1 < 155)"
>>> selection_branches(selection)
{'OS', 'mass_lep1jet1', 'reg2j1b'}
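
A rough way to extract the branch names is to collect every identifier in the string and discard boolean and logical keywords. The helper `referenced_branches` below is a hypothetical sketch, not the tdub implementation, and it assumes selections contain only branch names, numbers, operators, and those keywords.

```python
import re

# Words that look like identifiers but are not branches.
_KEYWORDS = {"True", "False", "true", "false", "and", "or", "not"}


def referenced_branches(selection):
    """Return the set of identifiers used in a selection string (sketch only)."""
    # \b keeps the pattern from matching inside numeric literals such as 1e5.
    names = re.findall(r"\b[A-Za-z_]\w*\b", selection)
    return set(names) - _KEYWORDS
```

Because only identifiers are collected, the same function works on ROOT-style and numexpr-style strings.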

tdub.data.selection_for(region, additional=None)

Get the selection for a given region.

Three regions have a default selection: 1j1b, 2j1b, and 2j2b. These are the possible argument options (in str or Enum form). See the tdub.config module for the definitions of the selections (and how to modify them).

Parameters

region (str or Region) – Region to get the selection for
additional (str, optional) – Additional selection (in ROOT or numexpr form). It is joined to the region-specific selection with a logical and.

Returns

Selection string in numexpr format.

Return type

str

Examples

>>> from tdub.data import Region, selection_for
>>> selection_for(Region.r2j1b)
'(reg2j1b == True) & (OS == True)'
>>> selection_for("reg1j1b")
'(reg1j1b == True) & (OS == True)'
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> selection_for("2j2b", additional="minimaxmbl < 155")
'((reg2j2b == True) & (OS == True)) & (minimaxmbl < 155)'
>>> selection_for("2j1b", additional="mass_lep1jetb < 155 && mass_lep2jetb < 155")
'((reg2j1b == True) & (OS == True)) & ((mass_lep1jetb < 155) & (mass_lep2jetb < 155))'
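
The way an additional selection is attached can be sketched as plain string composition. This is an illustration, not the library code: `compose_selection` is a hypothetical helper, its default selections are copied from the example outputs above (the real ones live in tdub.config), and it assumes `additional` is already in numexpr form.

```python
# Defaults copied from the documented example outputs (assumption, see tdub.config).
_DEFAULTS = {
    "1j1b": "(reg1j1b == True) & (OS == True)",
    "2j1b": "(reg2j1b == True) & (OS == True)",
    "2j2b": "(reg2j2b == True) & (OS == True)",
}


def compose_selection(region, additional=None):
    """Build a numexpr selection for a region name (sketch only)."""
    key = region
    # Accept "reg2j2b", "r2j2b", or "2j2b".
    for prefix in ("reg", "r"):
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    base = _DEFAULTS[key]
    if additional is None:
        return base
    # Parenthesize both sides so the logical and applies to each as a whole.
    return f"({base}) & ({additional})"
```

Wrapping the base selection in its own parentheses keeps the region requirement intact even when the additional selection contains an or.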