HggAnalysisDev is a repository of tools for fast and flexible development of H->gg analyses. The main components are:

  1. Preselection: loop over nanoAODs and write events to a pandas dataframe, then make plots and tables from these dataframes.
  2. MVA Training & Evaluation: starting from output of 1 (pandas dataframe), train and evaluate BDTs (using xgboost) and deep neural networks (using tensorflow and keras), plot performance metrics, and zip MVA scores back into dataframes.
  3. Signal Region Optimization: scan a variety of cuts on one or more variables (typically MVAs) to define signal regions. Within each signal region, fit the m_gg distribution to create signal and background (non-resonant and resonant) models, and evaluate a figure of merit (significance, upper limit, etc.).

Note: step 3 currently relies on old code from the ttH (HIG-19-013/HIG-19-015) and FCNC (TOP-20-007) analyses, which has CMSSW and combine dependencies. It is planned to be updated to a pure-python implementation using zfit and hepstats.
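
The cut scan in step 3 can be illustrated with a simplified counting experiment. The sketch below is not the repository's fit-based procedure. It assumes toy lists of (MVA score, event weight) pairs for signal and background, scans a lower threshold on the score, and uses the Asimov approximation Z = sqrt(2((s+b) ln(1+s/b) - s)) as the figure of merit. All function and variable names are illustrative.

```python
import math


def asimov_significance(s, b):
    """Approximate discovery significance for s signal and b background events."""
    if s <= 0 or b <= 0:
        return 0.0
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


def scan_cut(signal, background, thresholds):
    """Return (best_threshold, best_significance) for cuts of the form score >= threshold.

    signal, background: lists of (mva_score, event_weight) tuples (toy inputs).
    """
    best = (None, 0.0)
    for cut in thresholds:
        s = sum(w for score, w in signal if score >= cut)
        b = sum(w for score, w in background if score >= cut)
        z = asimov_significance(s, b)
        if z > best[1]:
            best = (cut, z)
    return best


# Toy inputs: signal sits at high score, background mostly at low score.
toy_signal = [(0.9, 1.0), (0.8, 1.0), (0.6, 1.0), (0.3, 1.0)]
toy_background = [(0.1, 5.0), (0.2, 5.0), (0.4, 5.0), (0.7, 2.0), (0.85, 1.0)]

best_cut, best_z = scan_cut(toy_signal, toy_background, [0.0, 0.25, 0.5, 0.75])
print(f"best cut: score >= {best_cut}, Z = {best_z:.2f}")
```

In the full workflow the yields s and b would come from fits to the m_gg distribution in each candidate signal region rather than from simple weighted counts.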

Tutorial: ttH Leptonic analysis

This provides a walkthrough for developing a new H->gg analysis end to end. It assumes you already have custom nanoAOD files with the diphoton preselection applied (see Hgg common tools AN-2019/149) and the relevant photon branches for performing the diphoton preselection (these are not present in default nanoAOD at the time of writing).

As an example, we will develop an analysis for measuring ttH (H->gg) in the leptonic (semi-leptonic and di-leptonic decays of ttbar) channel.


1) Identify relevant samples: start by constructing a json file with all of the relevant samples for your analysis. For ttH, we can start with signal samples (ttH), data, and a couple of relevant backgrounds: gamma + jets, diphoton + jets, and ttbar + 0-2 photons. The json file will have an entry for each sample you want to run on. We can construct it like this:

    "ttH" : {
        "resonant" : true,
        "fpo" : 10,
        "process_id" : -1,
        "2016" : {
            "paths" : ["/hadoop/cms/store/user/legianni/skimNano-Hggselection/ttHJetToGG_M125_13TeV_amcatnloFXFX_madspin_pythia8_private_mc17/"],
            "metadata" : { "xs" : 0.001151117 }
        },
        "2017" : {
            "paths" : ["/hadoop/cms/store/user/legianni/skimNano-Hggselection/ttHJetToGG_M125_13TeV_amcatnloFXFX_madspin_pythia8_private_mc17/"],
            "metadata" : { "xs" : 0.001151117 }
        },
        "2018" : {
            "paths" : ["/hadoop/cms/store/user/legianni/skimNano-Hggselection/ttHJetToGG_M125_13TeV_amcatnloFXFX_madspin_pythia8_private_mc18/"],
            "metadata" : { "xs" : 0.001151117 }
        }
    },
    "Data" : {
        ...

The full `json` file is available [here](https://github.com/cmstas/HggAnalysisDev/blob/3d00f19482a93fa6bf824c32d54bb3e9cfe0bad7/Preselection/data/samples_ttH.json).
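
Since every later step reads this file, it can help to check its structure before running anything. The following is a minimal sketch and not part of the repository: `validate_samples` is a hypothetical helper, the checks are inferred only from the ttH entry shown above, and the path in the toy input is a placeholder.

```python
import json

YEARS = ("2016", "2017", "2018")


def validate_samples(samples):
    """Return a list of problems found in a samples dictionary (empty if none)."""
    problems = []
    for name, info in samples.items():
        years_present = [year for year in YEARS if year in info]
        if not years_present:
            problems.append(f"{name}: no entry for any of {YEARS}")
        for year in years_present:
            entry = info[year]
            if not entry.get("paths"):
                problems.append(f"{name}/{year}: 'paths' is missing or empty")
            # The cross section is only checked when present (assumed absent for data).
            xs = entry.get("metadata", {}).get("xs")
            if xs is not None and xs <= 0:
                problems.append(f"{name}/{year}: cross section must be positive")
    return problems


# Toy input with a placeholder path; real files list skimmed nanoAOD directories.
example = json.loads("""
{
    "ttH" : {
        "resonant" : true,
        "fpo" : 10,
        "process_id" : -1,
        "2017" : {
            "paths" : ["/placeholder/path/to/ttH_2017/"],
            "metadata" : { "xs" : 0.001151117 }
        }
    }
}
""")

for problem in validate_samples(example):
    print(problem)
```

A check like this also catches unbalanced braces early, because `json.loads` raises an error on malformed input.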

2) Calculate scale1fb and other relevant metadata for the samples. This can be done using the script Preselection/scripts/scale1fb.py:

python scale1fb.py --input <path_to_above_json> --output "data/samples_and_scale1fb.json" --debug 1
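
For orientation, scale1fb is the per-event weight that normalizes a simulated sample to an integrated luminosity of 1 fb^-1. The sketch below shows the usual convention, assuming the cross section `xs` is given in pb and that the sum of generator weights over the full sample is known. It is not the code in scale1fb.py, and the sum of weights and luminosity used here are made-up illustrative numbers.

```python
def scale1fb(xs_pb, sum_of_weights):
    """Weight per unit generator weight that normalizes a sample to 1 fb^-1.

    1 pb = 1000 fb, so xs_pb * 1000 is the expected yield per fb^-1.
    """
    return xs_pb * 1000.0 / sum_of_weights


def event_weight(gen_weight, scale1fb_value, lumi_fb):
    """Weight of a single simulated event for a target luminosity in fb^-1."""
    return gen_weight * scale1fb_value * lumi_fb


xs = 0.001151117            # pb, from the ttH entry in the samples json
sum_of_weights = 1.0e6      # illustrative value only
lumi = 41.5                 # fb^-1, illustrative value only

s1fb = scale1fb(xs, sum_of_weights)
# Summing the weight over all generated events recovers xs * 1000 * lumi.
expected_yield = event_weight(sum_of_weights, s1fb, lumi)
print(f"scale1fb = {s1fb:.3e}, expected yield before selection = {expected_yield:.2f}")
```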

3) Implement a preselection. This can be done by adding a function to Preselection/selections/analysis_selections.py:

def tth_leptonic_preselection(events, photons, electrons, muons, jets, options, debug):
    """
    Performs tth leptonic preselection, requiring >= 1 lepton and >= 1 jet.
    Assumes diphoton preselection has already been applied.
    Also calculates relevant event-level variables.
    """

    # CutDiagnostics is an optional (but recommended) way to track the efficiency of each cut you implement.
    cut_diagnostics = utils.CutDiagnostics(events = events, debug = debug, cut_set = "[analysis_selections.py : tth_leptonic_preselection]")

    # Many generic functions for selecting leptons, jets, etc are contained in various other files within `Preselection/selections`, as shown below.
    # Get number of electrons, muons
    selected_electrons = electrons[lepton_selections.select_electrons(events, photons, electrons, options, debug)]
    selected_muons = muons[lepton_selections.select_muons(events, photons, muons, options, debug)]

    n_electrons = awkward.num(selected_electrons)
    n_muons = awkward.num(selected_muons)
    n_leptons = n_electrons + n_muons

    # Get number of jets
    selected_jets = jets[jet_selections.select_jets(events, photons, selected_electrons, selected_muons, None, jets, options, debug)]
    n_jets = awkward.num(selected_jets)

    lep_cut = n_leptons >= 1
    jet_cut = n_jets >= 1

    all_cuts = lep_cut & jet_cut
    cut_diagnostics.add_cuts([lep_cut, jet_cut, all_cuts], ["N_leptons >= 1", "N_jets >= 1", "all"])

    # Keep only selected events
    selected_events = events[all_cuts]
    selected_photons = photons[all_cuts]
    selected_electrons = selected_electrons[all_cuts]
    selected_muons = selected_muons[all_cuts]
    selected_jets = selected_jets[all_cuts]

    # Calculate event-level variables
    selected_events = lepton_selections.set_electrons(selected_events, selected_electrons, debug)
    selected_events = lepton_selections.set_muons(selected_events, selected_muons, debug)
    selected_events = jet_selections.set_jets(selected_events, selected_jets, options, debug)

    return selected_events
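
To make the cut bookkeeping concrete, here is a pure-Python sketch of the same selection logic on toy per-event lepton and jet counts. The `cut_flow` helper is a hypothetical stand-in written for this illustration. It is not the implementation of utils.CutDiagnostics, and the real code operates on awkward arrays rather than Python lists.

```python
def cut_flow(cuts, names):
    """Return {name: fraction of events passing} for per-event boolean masks."""
    efficiencies = {}
    for mask, name in zip(cuts, names):
        n_total = len(mask)
        efficiencies[name] = sum(mask) / n_total if n_total else 0.0
    return efficiencies


# Toy event-level counts, one entry per event.
n_leptons = [0, 1, 2, 1, 0, 1]
n_jets = [2, 0, 3, 1, 0, 4]

lep_cut = [n >= 1 for n in n_leptons]
jet_cut = [n >= 1 for n in n_jets]
all_cuts = [lep and jet for lep, jet in zip(lep_cut, jet_cut)]

efficiencies = cut_flow(
    [lep_cut, jet_cut, all_cuts],
    ["N_leptons >= 1", "N_jets >= 1", "all"],
)
for name, eff in efficiencies.items():
    print(f"{name}: {eff:.2f}")

# Keep only the indices of events passing every cut.
selected_indices = [i for i, keep in enumerate(all_cuts) if keep]
```

Tracking each cut separately as well as their combination shows which requirement removes the most events.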

Next, associate this function with a string so you can specify the ttH leptonic preselection from the command line. This is done in the select_events function in Preselection/helpers/loop_helper.py, like so:

elif self.selections == "ttH_LeptonicPresel":
    selected_events = analysis_selections.tth_leptonic_preselection(diphoton_events, selected_photons, diphoton_events.Electron, diphoton_events.Muon, diphoton_events.Jet, options, self.debug)
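
As more analyses are added, the elif chain grows with them. One alternative, shown here only as a design sketch and not as how loop_helper.py is written, is a dictionary that maps selection names to functions. The function bodies, the "Diphoton" entry, and the `SELECTIONS` and `run_selection` names are placeholders invented for this example.

```python
def tth_leptonic_preselection(events):
    """Placeholder: keep events with >= 1 lepton and >= 1 jet."""
    return [e for e in events if e["n_leptons"] >= 1 and e["n_jets"] >= 1]


def diphoton_preselection(events):
    """Placeholder: keep every event."""
    return list(events)


# Map command-line strings to selection functions.
SELECTIONS = {
    "ttH_LeptonicPresel": tth_leptonic_preselection,
    "Diphoton": diphoton_preselection,  # hypothetical name
}


def run_selection(name, events):
    """Look up a selection by name and apply it, failing loudly on typos."""
    try:
        selection = SELECTIONS[name]
    except KeyError:
        known = ", ".join(sorted(SELECTIONS))
        raise ValueError(f"Unknown selection '{name}'. Known selections: {known}") from None
    return selection(events)


toy_events = [
    {"n_leptons": 1, "n_jets": 2},
    {"n_leptons": 0, "n_jets": 3},
]
print(run_selection("ttH_LeptonicPresel", toy_events))
```

With this layout, adding a selection means adding one dictionary entry, and an unrecognized name produces an error that lists the valid options.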

4) Loop over samples and perform the preselection, writing events to a pandas dataframe. First, construct an options json for the ttH leptonic preselection; an example is the data/ttH_Leptonic.json file passed to --options in the command below.

The important fields in this json are:

/bin/nice -n 19 python loop.py --nCores 16 --selections "ttH_LeptonicPresel" --debug 1 --options "data/ttH_Leptonic.json" --samples "data/samples_and_scale1fb_ttH.json" --output_tag "test"

Prepending the loop.py call with /bin/nice -n 19 allows us to run on a large number of cores locally without hogging resources and negatively affecting other users. The other relevant arguments are:

