tdub.data

A module for handling our data.

Class Summary

Region(value)

A simple enum class for easily using region information.

SampleInfo(input_file)

Describes a sample’s attributes given its name.

Function Summary

as_region(region)

Convert input to Region.

avoids_for(region)

Get the features to avoid for the given region.

branches_from(source[, tree, ignore_weights])

Get a list of branches from a data source.

categorize_branches(source)

Categorize branches into separate lists.

features_for(region)

Get the feature list for a region.

quick_files(datapath[, campaign, tree])

Get a dictionary connecting sample processes to file lists.

selection_as_numexpr(selection)

Get the numexpr selection string from an arbitrary selection.

selection_as_root(selection)

Get the ROOT selection string from an arbitrary selection.

selection_branches(selection)

Construct the minimal set of branches required for a selection.

selection_for(region[, additional])

Get the selection for a given region.

Reference

class tdub.data.Region(value)

A simple enum class for easily using region information.

r1j1b

Label for our 1j1b region.

r2j1b

Label for our 2j1b region.

r2j2b

Label for our 2j2b region.

Examples

Using this enum to grab the 2j2b region from a set of files:

>>> from tdub.data import Region, selection_for
>>> from tdub.frames import iterative_selection
>>> df = iterative_selection(files, selection_for(Region.r2j2b))
static from_str(s)

Get enum value for the given string.

This function supports three ways to define a region; prefixed with “r”, prefixed with “reg”, or no prefix at all. For example, Region.r2j2b can be retrieved like so:

  • Region.from_str("r2j2b")

  • Region.from_str("reg2j2b")

  • Region.from_str("2j2b")

Parameters

s (str) – String representation of the desired region

Returns

Enum version

Return type

Region

Examples

>>> from tdub.data import Region
>>> Region.from_str("1j1b")
<Region.r1j1b: 0>
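
The three accepted spellings can be reproduced with a small stand-in enum. This is a minimal sketch of the prefix handling described above, not the tdub implementation; the stand-in class only reuses the member names and values shown in this reference.

```python
from enum import Enum


class Region(Enum):
    """Stand-in enum (not tdub's class) with the documented members."""

    r1j1b = 0
    r2j1b = 1
    r2j2b = 2

    @staticmethod
    def from_str(s: str) -> "Region":
        # Try the longer prefix first so "reg2j2b" is not cut to "eg2j2b".
        for prefix in ("reg", "r"):
            if s.startswith(prefix):
                s = s[len(prefix):]
                break
        # Members are looked up by name; an unknown name raises KeyError.
        return Region["r" + s]


print(Region.from_str("reg2j2b"))
```

All three spellings resolve to the same member, so callers can pass whichever form their configuration files happen to use.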
class tdub.data.SampleInfo(input_file)

Describes a sample’s attributes given its name.

Parameters

input_file (str) – File stem containing the necessary groups to parse.

phy_process

Physics process (e.g. ttbar or tW_DR or Zjets)

Type

str

dsid

Dataset ID

Type

int

sim_type

Simulation type, “FS” or “AFII”

Type

str

campaign

Campaign, MC16{a,d,e}

Type

str

tree

Original tree (e.g. “nominal” or “EG_SCALE_ALL__1up”)

Type

str

Examples

>>> from tdub.data import SampleInfo
>>> sampinfo = SampleInfo("ttbar_410472_AFII_MC16d_nominal.root")
>>> sampinfo.phy_process
'ttbar'
>>> sampinfo.dsid
410472
>>> sampinfo.sim_type
'AFII'
>>> sampinfo.campaign
'MC16d'
>>> sampinfo.tree
'nominal'
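
The attribute extraction can be pictured as a single regular expression with named groups. The sketch below is an assumption about the naming scheme, inferred only from the example file name above (process, six-digit DSID, simulation type, campaign, tree); `parse_sample` and `ParsedSample` are hypothetical names and do not exist in tdub.

```python
import re
from typing import NamedTuple

# Assumed layout: <process>_<6-digit DSID>_<FS|AFII>_<MC16a|d|e>_<tree>[.root]
_STEM = re.compile(
    r"^(?P<phy_process>\w+)_(?P<dsid>\d{6})_(?P<sim_type>FS|AFII)"
    r"_(?P<campaign>MC16[ade])_(?P<tree>\w+)(?:\.root)?$"
)


class ParsedSample(NamedTuple):
    phy_process: str
    dsid: int
    sim_type: str
    campaign: str
    tree: str


def parse_sample(input_file: str) -> ParsedSample:
    """Split a file stem into the five documented attributes."""
    match = _STEM.match(input_file)
    if match is None:
        raise ValueError(f"cannot parse sample name: {input_file!r}")
    groups = match.groupdict()
    return ParsedSample(
        phy_process=groups["phy_process"],
        dsid=int(groups["dsid"]),  # the DSID is exposed as an int
        sim_type=groups["sim_type"],
        campaign=groups["campaign"],
        tree=groups["tree"],
    )


print(parse_sample("ttbar_410472_AFII_MC16d_nominal.root"))
```

Because the process and tree names may themselves contain underscores, the pattern anchors on the six-digit DSID rather than splitting on every underscore.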
tdub.data.as_region(region)

Convert input to Region.

Meant to be similar to the numpy.asarray() function.

Parameters

region (str or Region) – Region, given either as a Region or as a str

Returns

Region representation.

Return type

Region

Examples

>>> from tdub.data import as_region, Region
>>> as_region("r2j1b")
<Region.r2j1b: 1>
>>> as_region(Region.r2j2b)
<Region.r2j2b: 2>
tdub.data.avoids_for(region)

Get the features to avoid for the given region.

See the tdub.config module for definition of the variables to avoid (and how to modify them).

Parameters

region (str or tdub.data.Region) – Region for which to get the avoided branches.

Returns

Features to avoid for the region.

Return type

list(str)

Examples

>>> from tdub.data import avoids_for, Region
>>> avoids_for(Region.r2j1b)
['HT_jet1jet2', 'deltaR_lep1lep2_jet1jet2met', 'mass_lep2jet1', 'pT_jet2']
>>> avoids_for("2j2b")
['deltaR_jet1_jet2']
tdub.data.branches_from(source, tree='WtLoop_nominal', ignore_weights=False)

Get a list of branches from a data source.

If the source is a list of files, only the first file is parsed.

Parameters
  • source (str, list(str), os.PathLike, list(os.PathLike), or uproot File/Tree) – What to parse to get the branch information.

  • tree (str) – Name of the tree to get branches from

  • ignore_weights (bool) – Flag to ignore all branches starting with weight_.

Returns

Branches from the source.

Return type

list(str)

Raises

TypeError – If source can’t be used to find a list of branches.

Examples

>>> from tdub.data import branches_from
>>> branches_from("/path/to/file.root", ignore_weights=True)
['pT_lep1', 'pT_lep2']
>>> branches_from("/path/to/file.root")
['pT_lep1', 'pT_lep2', 'weight_nominal', 'weight_tptrw']
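
Two of the behaviours described above, using only the first file of a list and dropping `weight_` branches, can be sketched without opening a ROOT file. `first_source` and `drop_weights` are hypothetical helpers written for illustration; they are not tdub functions, and the sketch does not cover uproot File/Tree inputs.

```python
import os


def first_source(source):
    """Pick the single file to inspect from a path or a list of paths."""
    if isinstance(source, (str, os.PathLike)):
        return source
    if isinstance(source, (list, tuple)) and source:
        # Only the first entry is inspected, as described above.
        return source[0]
    raise TypeError("source must be a path or a non-empty list of paths")


def drop_weights(branches):
    """Remove every branch whose name starts with ``weight_``."""
    return [name for name in branches if not name.startswith("weight_")]


branches = ["pT_lep1", "pT_lep2", "weight_nominal", "weight_tptrw"]
print(first_source(["a.root", "b.root"]))
print(drop_weights(branches))
```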
tdub.data.categorize_branches(source)

Categorize branches into separate lists.

The categories:

  • kinematics: for kinematic features (used for classifiers)

  • weights: for any branch that starts or ends with weight

  • meta: for meta information (final state information)

Parameters

source (list(str)) – Complete list of branches to be categorized.

Returns

Dictionary connecting categories to their associated lists of branches.

Return type

dict(str, list(str))

Examples

>>> from tdub.data import categorize_branches, branches_from
>>> branches = ["pT_lep1", "pT_lep2", "weight_nominal", "weight_sys_jvt", "reg2j2b"]
>>> cated = categorize_branches(branches)
>>> cated["weights"]
['weight_sys_jvt', 'weight_nominal']
>>> cated["meta"]
['reg2j2b']
>>> cated["kinematics"]
['pT_lep1', 'pT_lep2']

Using a ROOT file:

>>> from pathlib import PosixPath
>>> root_file = PosixPath("/path/to/file.root")
>>> cated = categorize_branches(branches_from(root_file))
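
The three categories can be illustrated with a plain loop. This is a sketch under assumptions: the set of meta branch names (`ASSUMED_META`) is invented for the example from names that appear elsewhere in this reference, and `categorize` is a hypothetical stand-in rather than the tdub function. Unlike the example output above, it keeps branches in input order.

```python
# Assumption: illustrative meta names only; tdub defines its own list.
ASSUMED_META = {"reg1j1b", "reg2j1b", "reg2j2b", "OS"}


def categorize(branches):
    """Sort branch names into kinematics, weights, and meta lists."""
    cats = {"kinematics": [], "weights": [], "meta": []}
    for name in branches:
        if name.startswith("weight") or name.endswith("weight"):
            cats["weights"].append(name)
        elif name in ASSUMED_META:
            cats["meta"].append(name)
        else:
            # Anything that is neither a weight nor meta is treated as kinematic.
            cats["kinematics"].append(name)
    return cats


print(categorize(["pT_lep1", "weight_nominal", "reg2j2b"]))
```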
tdub.data.features_for(region)

Get the feature list for a region.

See the tdub.config module for the definitions of the feature lists (and how to modify them).

Parameters

region (str or tdub.data.Region) – Region as a string or enum entry. Using "ALL" returns a list of unique features from all regions.

Returns

Features for that region (or all regions).

Return type

list(str)

Examples

>>> from pprint import pprint
>>> from tdub.data import features_for
>>> pprint(features_for("reg2j1b"))
['mass_lep1jet1',
 'mass_lep1jet2',
 'mass_lep2jet1',
 'mass_lep2jet2',
 'pT_jet2',
 'pTsys_lep1lep2jet1jet2met',
 'psuedoContTagBin_jet1',
 'psuedoContTagBin_jet2']
tdub.data.quick_files(datapath, campaign=None, tree='nominal')

Get a dictionary connecting sample processes to file lists.

The lists of files are sorted alphabetically. These types of samples are currently tested:

  • tW_DR (410648, 410649 full sim)

  • tW_DR_AFII (410648, 410649 fast sim)

  • tW_DR_PS (411038, 411039 fast sim)

  • tW_DR_inc (410646, 410647 full sim)

  • tW_DR_inc_AFII (410646, 410647 fast sim)

  • tW_DS (410656, 410657 full sim)

  • tW_DS_inc (410654, 410655 full sim)

  • ttbar (410472 full sim)

  • ttbar_AFII (410472 fast sim)

  • ttbar_PS (410558 fast sim)

  • ttbar_PS713 (411234 fast sim)

  • ttbar_hdamp (410482 fast sim)

  • ttbar_inc (410470 full sim)

  • ttbar_inc_AFII (410470 fast sim)

  • Diboson

  • Zjets

  • MCNP

  • Data

Parameters
  • datapath (str or os.PathLike) – Path where all of the ROOT files live.

  • campaign (str, optional) – Enforce a single campaign (“MC16a”, “MC16d”, or “MC16e”).

  • tree (str) – Upstream AnalysisTop ntuple tree.

Returns

The dictionary of processes and their associated files.

Return type

dict(str, list(str))

Examples

>>> from pprint import pprint
>>> from tdub.data import quick_files
>>> qf = quick_files("/path/to/some/files")  # has 410472 ttbar samples
>>> pprint(qf["ttbar"])
['/path/to/some/files/ttbar_410472_FS_MC16a_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16d_nominal.root',
 '/path/to/some/files/ttbar_410472_FS_MC16e_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16d")
>>> pprint(qf["tW_DR"])
['/path/to/some/files/tW_DR_410648_FS_MC16d_nominal.root',
 '/path/to/some/files/tW_DR_410649_FS_MC16d_nominal.root']
>>> qf = quick_files("/path/to/some/files", campaign="MC16a")
>>> pprint(qf["Data"])
['/path/to/some/files/Data15_data15_Data_Data_nominal.root',
 '/path/to/some/files/Data16_data16_Data_Data_nominal.root']
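
The grouping step can be sketched as a lookup from (DSID, simulation type) to a process key, followed by an alphabetical sort. The table below copies only a few entries from the sample list above, and `group_files` is a hypothetical helper that works on a list of path strings instead of scanning a directory; the file naming layout is assumed from the example outputs.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# (DSID, simulation type) -> process key; a small subset of the list above.
PROCESS_KEYS = {
    ("410472", "FS"): "ttbar",
    ("410472", "AFII"): "ttbar_AFII",
    ("410648", "FS"): "tW_DR",
    ("410649", "FS"): "tW_DR",
    ("410648", "AFII"): "tW_DR_AFII",
    ("410649", "AFII"): "tW_DR_AFII",
}


def group_files(paths, campaign=None):
    """Group file paths by process key, optionally keeping one campaign."""
    grouped = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).stem.split("_")
        # Locate the six-digit DSID; the process name before it may contain "_".
        idx = next((i for i, p in enumerate(parts) if len(p) == 6 and p.isdigit()), None)
        if idx is None or idx + 2 >= len(parts):
            continue
        dsid, sim_type, file_campaign = parts[idx:idx + 3]
        key = PROCESS_KEYS.get((dsid, sim_type))
        if key is None:
            continue
        if campaign is not None and file_campaign != campaign:
            continue
        grouped[key].append(str(path))
    # Each list is sorted alphabetically, as stated above.
    return {key: sorted(files) for key, files in grouped.items()}


files = [
    "/path/to/some/files/ttbar_410472_FS_MC16e_nominal.root",
    "/path/to/some/files/ttbar_410472_FS_MC16a_nominal.root",
]
print(group_files(files)["ttbar"])
```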
tdub.data.selection_as_numexpr(selection)

Get the numexpr selection string from an arbitrary selection.

Parameters

selection (str) – Selection string in ROOT or numexpr

Returns

Selection in numexpr format.

Return type

str

Examples

>>> selection = "reg1j1b == true && OS == true && mass_lep1jet1 < 155"
>>> from tdub.data import selection_as_numexpr
>>> selection_as_numexpr(selection)
'(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)'
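
For a flat chain of `&&` terms, the conversion amounts to three rewrites: split on `&&`, capitalise the boolean literals, and parenthesise each term. The sketch below covers only that simple case (no `||`, no nested groups); `root_to_numexpr` is a hypothetical helper and may differ from how tdub performs the translation.

```python
import re


def root_to_numexpr(selection: str) -> str:
    """Translate a flat ROOT '&&' selection into numexpr syntax."""
    converted = []
    for term in selection.split("&&"):
        term = term.strip()
        # ROOT spells booleans in lowercase; numexpr uses Python literals.
        term = re.sub(r"\btrue\b", "True", term)
        term = re.sub(r"\bfalse\b", "False", term)
        # Parenthesise each comparison because '&' binds tighter than '=='.
        if not (term.startswith("(") and term.endswith(")")):
            term = f"({term})"
        converted.append(term)
    return " & ".join(converted)


print(root_to_numexpr("reg1j1b == true && OS == true && mass_lep1jet1 < 155"))
```

The parentheses matter: without them, numexpr would evaluate `True & OS` before the comparisons.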
tdub.data.selection_as_root(selection)

Get the ROOT selection string from an arbitrary selection.

Parameters

selection (str) – The selection string in ROOT or numexpr

Returns

The same selection in ROOT format.

Return type

str

Examples

>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1jet1 < 155)"
>>> from tdub.data import selection_as_root
>>> selection_as_root(selection)
'(reg1j1b == true) && (OS == true) && (mass_lep1jet1 < 155)'
tdub.data.selection_branches(selection)

Construct the minimal set of branches required for a selection.

Parameters

selection (str) – Selection string in ROOT or numexpr

Returns

Necessary branches/variables

Return type

set(str)

Examples

>>> from tdub.data import selection_branches
>>> selection = "(reg1j1b == True) & (OS == True) & (mass_lep1lep2 > 100)"
>>> selection_branches(selection)
{'OS', 'mass_lep1lep2', 'reg1j1b'}
>>> selection = "reg2j1b == true && OS == true && (mass_lep1jet1 < 155)"
>>> selection_branches(selection)
{'OS', 'mass_lep1jet1', 'reg2j1b'}
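
One way to picture the branch extraction is to collect every identifier in the selection and discard the boolean literals. This is a rough sketch, assuming selections contain only branch names, literals, and operators; `branches_in` is a hypothetical helper, and a selection that calls functions (for example `abs(x)`) would need extra filtering.

```python
import re

# Boolean literals in both numexpr and ROOT spelling are not branches.
_LITERALS = {"True", "False", "true", "false"}


def branches_in(selection: str) -> set:
    """Return the identifiers used in a ROOT or numexpr selection."""
    # The word boundary keeps the exponent of a number like 1e5 from matching.
    names = re.findall(r"\b[A-Za-z_]\w*", selection)
    return {name for name in names if name not in _LITERALS}


print(sorted(branches_in("reg2j1b == true && OS == true && (mass_lep1jet1 < 155)")))
```

Because the result is a set, it works the same way for both selection syntaxes.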
tdub.data.selection_for(region, additional=None)

Get the selection for a given region.

Three regions have a default selection (1j1b, 2j1b, and 2j2b); these are the possible argument options (in str or Enum form). See the tdub.config module for the definitions of the selections (and how to modify them).

Parameters
  • region (str or Region) – Region to get the selection for

  • additional (str, optional) – Additional selection (in ROOT or numexpr form). It is joined to the region-specific selection with a logical and.

Returns

Selection string in numexpr format.

Return type

str

Examples

>>> from tdub.data import Region, selection_for
>>> selection_for(Region.r2j1b)
'(reg2j1b == True) & (OS == True)'
>>> selection_for("reg1j1b")
'(reg1j1b == True) & (OS == True)'
>>> selection_for("2j2b")
'(reg2j2b == True) & (OS == True)'
>>> selection_for("2j2b", additional="minimaxmbl < 155")
'((reg2j2b == True) & (OS == True)) & (minimaxmbl < 155)'
>>> selection_for("2j1b", additional="mass_lep1jetb < 155 && mass_lep2jetb < 155")
'((reg2j1b == True) & (OS == True)) & ((mass_lep1jetb < 155) & (mass_lep2jetb < 155))'
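
The way an additional requirement is attached can be sketched as wrapping both sides in parentheses and joining them with `&`. `and_selections` is a hypothetical helper, and it assumes the additional selection is already in numexpr form; the conversion from ROOT syntax shown in the last example above is not reproduced here.

```python
def and_selections(base, additional=None):
    """Join a base selection and an optional extra cut with a logical and."""
    if additional is None:
        return base
    # Wrapping both sides keeps operator precedence intact after joining.
    return f"({base}) & ({additional})"


base = "(reg2j2b == True) & (OS == True)"
print(and_selections(base, "minimaxmbl < 155"))
```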