motifSeqletCutoffs

Reads in a modisco h5 and prepares to scan for seqlets.

To see where seqlets occur on the genome, we need to scan with the CWMs derived from modiscolite. The first step is to look at the seqlets that MoDISco called for each pattern it identified and establish cutoff values, which is what this program does. It reads in a modiscolite hdf5 file and calculates the seqlet-similarity cutoffs that define a hit. It requires one JSON-format input and generates two outputs. The first is a tsv file containing all of the seqlets in the modisco h5 along with some helpful metadata, such as how well each seqlet matches the pattern it was assigned to. The second is a JSON-format file that will be needed by motifScan.

BNF

<motif-seqlet-cutoffs-configuration> ::=
    {<seqlet-scanning-settings>,
    <verbosity-section>}
<seqlet-scanning-settings> ::=
    <seqlet-tsv-section>
    "modisco-h5" : <file-name>,
    <seqlet-contrib-section>
    <pattern-spec-section>,
    <quantile-cutoff-section>,
    "trim-threshold" : <number>,
    "trim-padding" : <integer>,
    "background-probs" : <vector-or-genome>,
    <quantile-json-section>
<quantile-cutoff-section> ::=
    "seq-match-quantile" : <number-or-null>,
    "contrib-match-quantile" : <number-or-null>,
    "contrib-magnitude-quantile" : <number-or-null>
<pattern-spec-section> ::=
    "patterns" : "all"
  | "patterns" : [<list-of-pattern-specs>]
<list-of-pattern-specs> ::=
    <pattern-spec>
  | <pattern-spec>, <list-of-pattern-specs>
<pattern-spec> ::=
    {<optional-quantile-cutoff-section>
     "metacluster-name" : <string>,
     <pattern-option>}
<pattern-option> ::=
     "pattern-name" : <string>
  |  "pattern-name" : <string>,
     "short-name" : <string>
  |  "pattern-names" : [<list-of-string>]
  |  "pattern-names" : [<list-of-string>],
     "short-names" : [<list-of-string>]
<optional-quantile-cutoff-section> ::=
    <quantile-cutoff-section>,
  | <empty>
<vector-or-genome> ::=
    "danRer11" | "hg38" | "dm6" | "mm10" | "sacCer3"
  | [<number>, <number>, <number>, <number>]
<quantile-json-section> ::=
    <empty>
  | "quantile-json" : <file-name>,
<seqlet-contrib-section> ::=
    <empty>
  | "modisco-contrib-h5" : <file-name>,
    "modisco-window": <integer>,
<seqlet-tsv-section> ::=
    <empty>
  | "seqlets-tsv" : <file-name>,
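
As an illustration of the grammar above, here is a minimal sketch of a configuration built as a Python dict and serialized to JSON. The file names and the quantile values are hypothetical placeholders, and the "INFO" verbosity string is an assumption, since the verbosity section is defined in a base schema that is not reproduced on this page.

```python
import json

# All file names below are hypothetical; substitute your own paths.
# The quantile values are arbitrary illustrations, not recommendations.
config = {
    "seqlets-tsv": "seqlets.tsv",
    "modisco-h5": "modisco_results.h5",
    "patterns": "all",
    "seq-match-quantile": 0.2,
    "contrib-match-quantile": 0.2,
    "contrib-magnitude-quantile": 0.2,
    "trim-threshold": 0.3,
    "trim-padding": 1,
    "background-probs": "hg38",
    "quantile-json": "quantile_cutoffs.json",
    # Assumed value; see the base schema for the allowed verbosity levels.
    "verbosity": "INFO",
}

# Serialize to the JSON text that would be saved as the configuration file.
config_text = json.dumps(config, indent=4)
print(config_text)
```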

Parameter Notes

seqlets-tsv

(Optional) The name of the file that should be written containing the scanned seqlets. See motifAddQuantiles for the structure of this file.

modisco-h5

This is the hdf5 file generated by modisco.

modisco-contrib-h5

(Optional) The contribution score file generated by interpretFlat, which is needed to recover the genomic coordinates of the seqlets, since the Modisco hdf5 doesn’t contain that information. Only coordinates are read from this file, not contribution scores. WARNING: THIS DOES NOT CURRENTLY WORK, SINCE SEQLET INDEXES ARE RESET BY MODISCO.

modisco-window

(Optional, will become mandatory in 6.0.0) The window size used when running modiscolite. This is needed because the coordinates reported by modisco are relative to the scanned window. If not provided, coordinate data will not be loaded.

There are two ways of specifying patterns, either by giving each pattern and metacluster pair individually, or by listing multiple patterns under a single metacluster. The short-names, if provided, will be used to populate the name field in the generated tsv. You could use this to give a particular pattern the name of its binding protein.
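
The relationship between the two forms can be sketched as follows. The helper expand_pattern_spec is hypothetical (it is not part of BPReveal) and simply rewrites a grouped spec as the equivalent list of individual specs. The requirement that short-names and pattern-names have the same length is an assumption of this sketch.

```python
def expand_pattern_spec(spec):
    """Expand one pattern spec into a list of single-pattern specs.

    Hypothetical helper for illustration; not a BPReveal function.
    """
    name_keys = ("pattern-name", "short-name", "pattern-names", "short-names")
    # Keys shared by every expanded entry: the metacluster name and any
    # per-pattern quantile overrides.
    shared = {k: v for k, v in spec.items() if k not in name_keys}
    if "pattern-names" in spec:
        names = spec["pattern-names"]
        shorts = spec.get("short-names", [None] * len(names))
        # Assumption: one short name per pattern name, in the same order.
        if len(shorts) != len(names):
            raise ValueError("short-names must match pattern-names in length")
    else:
        names = [spec["pattern-name"]]
        shorts = [spec.get("short-name")]
    expanded = []
    for name, short in zip(names, shorts):
        entry = dict(shared)
        entry["pattern-name"] = name
        if short is not None:
            entry["short-name"] = short
        expanded.append(entry)
    return expanded


# Two patterns listed under one metacluster; the short names are made up.
grouped = {
    "metacluster-name": "pos_patterns",
    "pattern-names": ["pattern_0", "pattern_6"],
    "short-names": ["motifA", "motifB"],
}
individual = expand_pattern_spec(grouped)
```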

seq-match-quantile

Given the PSSM score of each mapped hit to the original TF-MoDISco PSSM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.

contrib-match-quantile

Given the CWM score (i.e., Jaccard similarity) of each mapped hit’s contribution to the original TF-MoDISco CWM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.

contrib-magnitude-quantile

Given the total L1 magnitude of contribution across a mapped hit, calculate the quantile value of this magnitude, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
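
The idea common to the three quantile settings can be sketched in plain Python: take the scores of the seqlets that belong to a pattern, find the score at the requested quantile, and use it as the cutoff. The function names here are hypothetical, the linear interpolation is one common convention that may differ from what BPReveal uses, and treating a null quantile as "no cutoff" is an assumption.

```python
def quantile_cutoff(seqlet_scores, quantile):
    """Return the score at `quantile` (between 0 and 1) of the seqlet scores.

    Uses linear interpolation between neighbouring sorted values.
    A quantile of None yields None (assumed here to mean "no cutoff").
    """
    if quantile is None:
        return None
    if not seqlet_scores:
        raise ValueError("need at least one seqlet score")
    ordered = sorted(seqlet_scores)
    position = quantile * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def passes_cutoff(hit_score, cutoff):
    """A hit passes when no cutoff is set or its score reaches the cutoff."""
    return cutoff is None or hit_score >= cutoff


# Made-up scores for the seqlets of one pattern.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
cutoff = quantile_cutoff(scores, 0.25)
```

With these made-up scores, a quantile of 0.25 places the cutoff at the second-lowest seqlet score, so a candidate hit must score at least as well as that seqlet to be kept.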

trim-threshold

For each pattern, sets the threshold for trimming non-contributing bases from the pattern length specified by TF-MoDISco. This trimming is a reimplementation of tfmodiscolite’s core.trim_to_support feature, which trims off the edges of each pattern where the contribution is less than trim-threshold*max(contribution). Default: 0.3

trim-padding

After the pattern trimming step (see the trim-threshold parameter), pad each pattern with trim-padding bases on each side to prevent removal of important flanking sequence features. This padding is a reimplementation of the padding in tfmodiscolite. Default: 1
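
Trimming followed by padding can be sketched as below. This is a simplified illustration, not the BPReveal or tfmodiscolite code: it assumes each position of the pattern has already been reduced to a single non-negative contribution magnitude, and the function name trim_bounds is hypothetical.

```python
def trim_bounds(position_scores, trim_threshold=0.3, trim_padding=1):
    """Return half-open (start, end) bounds of the trimmed, padded pattern.

    position_scores holds one non-negative contribution magnitude per
    position (an assumption of this sketch).
    """
    cutoff = trim_threshold * max(position_scores)
    # Positions whose contribution reaches the cutoff.
    keep = [i for i, score in enumerate(position_scores) if score >= cutoff]
    # Trim to the outermost kept positions, then pad without leaving the pattern.
    start = max(keep[0] - trim_padding, 0)
    end = min(keep[-1] + 1 + trim_padding, len(position_scores))
    return start, end


# Made-up per-position magnitudes for an eight-base pattern.
magnitudes = [0.01, 0.02, 0.5, 1.0, 0.8, 0.05, 0.01, 0.0]
start, end = trim_bounds(magnitudes, trim_threshold=0.3, trim_padding=1)
trimmed = magnitudes[start:end]
```

In this example only positions 2 through 4 reach 0.3 times the maximum, and one base of padding on each side widens the kept region to positions 1 through 5.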

background-probs

Gives the background base frequencies of your genome. For example, if you had a genome with 60 percent GC content, this would be [0.2, 0.3, 0.3, 0.2]. The order of the bases is A, C, G, and T. This may also be a string naming a genome, such as sacCer3. BPReveal knows about danRer11, hg38, mm10, dm6, and sacCer3.
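
The arithmetic behind the 60 percent example is shown below. The helper background_from_gc is hypothetical, and it assumes that A and T are equally frequent and that C and G are equally frequent.

```python
def background_from_gc(gc_fraction):
    """Return [A, C, G, T] frequencies for a given GC fraction (0 to 1).

    Assumes A and T share the non-GC fraction equally, and C and G share
    the GC fraction equally.
    """
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    at_each = (1.0 - gc_fraction) / 2.0
    gc_each = gc_fraction / 2.0
    return [at_each, gc_each, gc_each, at_each]


background = background_from_gc(0.6)
```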

patterns

May be either a list of pattern specs (see below) or the string “all”, in which case every pattern will be used to scan.

metacluster-name

Will be something like pos_patterns or neg_patterns.

pattern-name

Will be something like pattern_0 or pattern_6. Note that this is the name in the modisco output file, not the actual name of the motif.

short-name

(Optional) If provided, this gives a more human-readable name to a particular pattern. This is only used to annotate the output; it has no effect on the actual scanning or quantiling operation.

pattern-names, short-names

Instead of providing a pattern-spec for each pattern, you may include a list of patterns within one metacluster. These are lists of strings.

Output Specification

See motifAddQuantiles for a description of the tsv file, and motifScan for a description of the generated JSON.

API

bpreveal.motifSeqletCutoffs.main(config)

Determine the cutoffs based on modisco outputs.

Parameters:

config (dict) – A JSON object based on the motifSeqletCutoffs specification.
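
A minimal sketch of calling main from Python rather than from the command line is given below. The helper names load_config and run_cutoffs are hypothetical. The only BPReveal call used is main(config) as documented above, and it is imported inside the function so that the loader can be used without BPReveal installed.

```python
import json


def load_config(path):
    """Read a JSON configuration file and return it as a dict."""
    with open(path, "r") as handle:
        return json.load(handle)


def run_cutoffs(path):
    """Load the configuration at `path` and pass it to motifSeqletCutoffs."""
    # Imported here so that load_config works without BPReveal installed.
    import bpreveal.motifSeqletCutoffs as motifSeqletCutoffs
    motifSeqletCutoffs.main(load_config(path))
```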

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "motifSeqletCutoffs",
    "description": "Schema for motifSeqletCutoffs.py",
    "definitions": {
        "seqlet-scanning-settings": {
            "type": "object",
            "properties": {
                "seqlets-tsv": {"type": "string"},
                "modisco-h5": {"type": "string"},
                "modisco-contrib-h5": {"type": "string"},
                "modisco-window": {"type": "integer"},
                "seq-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-magnitude-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "trim-threshold": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "trim-padding": {"type": "integer"},
                "background-probs": {
                    "oneOf": [
                        {"type": "array",
                         "minItems": 4,
                         "maxItems": 4,
                         "items": {"$ref": "/schema/base#/definitions/fraction"}},
                        {"type": "string",
                         "enum": ["danRer11", "hg38", "mm10", "dm6", "sacCer3"]}
                    ]
                },
                "quantile-json": {"type": "string"},
                "patterns": {
                    "oneOf":[
                        {
                            "type": "array",
                            "items": {"$ref": "#/definitions/pattern-spec-section"}
                        },
                        {"type": "string", "enum": ["all"]}
                    ]
                }
            },
            "required": ["modisco-h5", "seq-match-quantile", "contrib-match-quantile",
                "contrib-magnitude-quantile", "trim-threshold", "trim-padding",
                "background-probs", "patterns"]
        },
        "pattern-spec-section": {
            "type": "object",
            "properties": {
                "metacluster-name": {"type": "string"},
                "pattern-name": {"type": "string"},
                "pattern-names": {"type": "array", "items": {"type": "string"}},
                "short-name": {"type": "string"},
                "short-names": {"type": "array", "items": {"type": "string"}},
                "seq-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-magnitude-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"}
            },
            "required": ["metacluster-name"],
            "oneOf":[
                {"allOf": [
                        {"required": ["pattern-name"]},
                        {"not": {"anyOf": [{"required": ["pattern-names"]},
                                            {"required": ["short-names"]}]}}]},
                {"allOf": [
                        {"required": ["pattern-names"]},
                        {"not": {"anyOf": [{"required": ["pattern-name"]},
                                            {"required": ["short-name"]}]}}]}]
        }
    },

    "type": "object",
    "properties": {
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "allOf": [
            {"$ref": "#/definitions/seqlet-scanning-settings"},
            {"required": ["verbosity"]}
    ]
}
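
The oneOf rule in pattern-spec-section is easier to read as ordinary code. The sketch below is a hand-written check of that structural rule only (it does not check types or quantile values and is not a substitute for validating against the schema). The function name check_pattern_spec is hypothetical.

```python
def check_pattern_spec(spec):
    """Return a list of structural problems in one pattern spec.

    An empty list means the spec satisfies the name-related rules:
    metacluster-name is present, exactly one of pattern-name and
    pattern-names is given, and the short name form matches it.
    """
    problems = []
    if "metacluster-name" not in spec:
        problems.append("missing metacluster-name")
    has_single = "pattern-name" in spec
    has_multi = "pattern-names" in spec
    if has_single and has_multi:
        problems.append("give pattern-name or pattern-names, not both")
    elif not (has_single or has_multi):
        problems.append("one of pattern-name or pattern-names is required")
    if has_single and "short-names" in spec:
        problems.append("short-names cannot be used with pattern-name")
    if has_multi and "short-name" in spec:
        problems.append("short-name cannot be used with pattern-names")
    return problems


good = {"metacluster-name": "pos_patterns", "pattern-name": "pattern_0"}
bad = {"metacluster-name": "pos_patterns",
       "pattern-names": ["pattern_0"], "short-name": "motifA"}
good_problems = check_pattern_spec(good)
bad_problems = check_pattern_spec(bad)
```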