motifSeqletCutoffs
Reads in a modisco h5 and prepares to scan for seqlets.
To see where seqlets are found on the genome, we need to scan the CWMs
derived from modiscolite.
The first step of this process is to look at the seqlets that MoDISco called for
each pattern it identified, and establish cutoff values.
This program does that.
It reads in a modiscolite hdf5 file and calculates cutoff values of seqlet
similarity for what constitutes a hit.
It requires one JSON-format input and generates two outputs. First, it
generates a tsv file containing all of the seqlets in the modisco h5 with some
helpful metadata, such as how well each seqlet matches the pattern it was
assigned to.
Second, it produces a JSON-format file that will be needed
by motifScan.
BNF
<motif-seqlet-cutoffs-configuration> ::= {<seqlet-scanning-settings>, <verbosity-section>}
<seqlet-scanning-settings> ::= <seqlet-tsv-section> "modisco-h5" : <file-name>, <seqlet-contrib-section> <pattern-spec-section>, <quantile-cutoff-section>, "trim-threshold" : <number>, "trim-padding" : <integer>, "background-probs" : <vector-or-genome>, <quantile-json-section>
<quantile-cutoff-section> ::= "seq-match-quantile" : <number-or-null>, "contrib-match-quantile" : <number-or-null>, "contrib-magnitude-quantile" : <number-or-null>
<pattern-spec-section> ::= "patterns" : "all" | "patterns" : [<list-of-pattern-specs>]
<list-of-pattern-specs> ::= <pattern-spec> | <pattern-spec>, <list-of-pattern-specs>
<pattern-spec> ::= {<optional-quantile-cutoff-section> "metacluster-name" : <string>, <pattern-option>}
<pattern-option> ::= "pattern-name" : <string> | "pattern-name" : <string>, "short-name" : <string> | "pattern-names" : [<list-of-string>] | "pattern-names" : [<list-of-string>], "short-names" : [<list-of-string>]
<optional-quantile-cutoff-section> ::= <quantile-cutoff-section>, | <empty>
<vector-or-genome> ::= "danRer11" | "hg38" | "dm6" | "mm10" | "sacCer3" | [<number>, <number>, <number>, <number>]
<quantile-json-section> ::= <empty> | "quantile-json" : <file-name>,
<seqlet-contrib-section> ::= <empty> | "modisco-contrib-h5" : <file-name>, "modisco-window": <integer>,
<seqlet-tsv-section> ::= <empty> | "seqlets-tsv" : <file-name>,
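As an illustration of the grammar above, the following sketch builds one possible configuration as a Python dict and serializes it to JSON. All file names and quantile values are hypothetical placeholders, and the verbosity value "INFO" is an assumption, since the permitted values are defined in the base schema rather than here.

```python
import json

# Hypothetical configuration: every file name and numeric value below is a
# placeholder chosen for illustration, not a recommended setting.
config = {
    "seqlets-tsv": "seqlets.tsv",              # optional output
    "modisco-h5": "modisco_results.h5",        # required input
    "patterns": "all",                         # or a list of pattern specs
    "seq-match-quantile": 0.2,
    "contrib-match-quantile": 0.2,
    "contrib-magnitude-quantile": 0.2,
    "trim-threshold": 0.3,
    "trim-padding": 1,
    "background-probs": "hg38",                # or [A, C, G, T] fractions
    "quantile-json": "quantile_cutoffs.json",  # optional output
    "verbosity": "INFO",                       # assumed value, see base schema
}

# Serialize to the JSON text that would be saved as the configuration file.
configText = json.dumps(config, indent=4)
print(configText)
```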
Parameter Notes
- seqlets-tsv
(Optional) The name of the file that should be written containing the scanned seqlets. See motifAddQuantiles for the structure of this file.
- modisco-h5
This is the hdf5 file generated by modisco.
- modisco-contrib-h5
(Optional) The contribution score file generated by interpretFlat, which is necessary to recover the genomic coordinates of the seqlets, since the Modisco hdf5 doesn’t contain that info. The contribution scores are not extracted from this file, just coordinates. THIS DOES NOT CURRENTLY WORK, SINCE SEQLET INDEXES ARE RESET BY MODISCO.
- modisco-window
(Optional, will become mandatory in 6.0.0) The window size used when running modiscolite. This is needed because the coordinates reported by modisco are relative to the scanned window. If not provided, coordinate data will not be loaded.
There are two ways of specifying patterns, either by giving each pattern and metacluster pair individually, or by listing multiple patterns under a single metacluster. The short-names, if provided, will be used to populate the name field in the generated tsv. You could use this to give a particular pattern the name of its binding protein.
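The two ways of specifying patterns carry the same information, which the sketch below makes explicit by expanding the list form into one entry per pattern. The function name expandPatternSpecs and the example short names are hypothetical, and this is not necessarily how the program processes its input internally.

```python
def expandPatternSpecs(patternSpecs):
    """Flatten pattern specs so each entry names exactly one pattern.

    Illustrative helper (hypothetical name), not part of BPReveal.
    """
    nameKeys = ("pattern-name", "pattern-names", "short-name", "short-names")
    expanded = []
    for spec in patternSpecs:
        # Keep metacluster-name and any per-spec quantile overrides.
        shared = {k: v for k, v in spec.items() if k not in nameKeys}
        if "pattern-names" in spec:
            names = spec["pattern-names"]
            shortNames = spec.get("short-names", [None] * len(names))
            if len(shortNames) != len(names):
                raise ValueError("short-names must match pattern-names in length")
        else:
            names = [spec["pattern-name"]]
            shortNames = [spec.get("short-name")]
        for name, shortName in zip(names, shortNames):
            entry = dict(shared)
            entry["pattern-name"] = name
            if shortName is not None:
                entry["short-name"] = shortName
            expanded.append(entry)
    return expanded


patterns = [
    # One pattern and metacluster pair given individually.
    {"metacluster-name": "pos_patterns", "pattern-name": "pattern_0",
     "short-name": "motifA"},
    # Several patterns listed under a single metacluster.
    {"metacluster-name": "neg_patterns",
     "pattern-names": ["pattern_1", "pattern_6"],
     "short-names": ["motifB", "motifC"]},
]
flat = expandPatternSpecs(patterns)
```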
- seq-match-quantile
Given the PSSM score of each mapped hit to the original TF-MoDISco PSSM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
- contrib-match-quantile
Given the CWM score (i.e. Jaccard similarity) of each mapped hit’s contribution to the original TF-MoDISco CWM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
- contrib-magnitude-quantile
Given the total L1 magnitude of contribution across a mapped hit, calculate the quantile value of this magnitude, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
- trim-threshold
For each pattern, acts as a threshold for trimming non-contributing bases from TF-MoDISco’s specified pattern length. This trimming is a reimplementation of tfmodiscolite’s core.trim_to_support feature, which trims off the edges of each pattern where the contribution is less than trim-threshold*max(contribution). Default: 0.3
- trim-padding
After the pattern trimming step (see trim-threshold parameter), pad each pattern with trim-padding bases to prevent removal of important flanking sequence features. This padding reimplements the corresponding step in tfmodiscolite. Default: 1
- background-probs
Gives the base composition of your genome. For example, if you had a genome with 60 percent GC content, this would be [0.2, 0.3, 0.3, 0.2]. The order of the bases is A, C, G, and T. This may also be a string naming a genome, such as sacCer3. BPReveal knows about danRer11, hg38, mm10, dm6, and sacCer3.
- patterns
May be either a list of pattern specs (see below) or the string “all”, in which case every pattern will be used to scan.
- metacluster-name
Will be something like pos_patterns or neg_patterns.
- pattern-name
Will be something like pattern_0 or pattern_6. Note that this is the name in the modisco output file, not the actual name of the motif.
- short-name
(Optional) If provided, this gives a more human-readable name to a particular pattern. This is only used to annotate the output; it has no effect on the actual scanning or quantiling operation.
- pattern-names, short-names
Instead of providing a pattern-spec for each pattern, you may include a list of patterns within one metacluster. These are lists of strings.
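The numeric parameters described in these notes can be illustrated with a small pure-Python sketch. The function names are hypothetical, and three details are assumptions rather than documented behavior: the per-position contribution is taken as the sum of absolute values across the four bases, the padding is clipped to the pattern boundaries, and the quantile uses linear interpolation between sorted values.

```python
def gcToBackground(gcFraction):
    """Background probabilities in A, C, G, T order for a given GC content."""
    at = (1.0 - gcFraction) / 2.0
    gc = gcFraction / 2.0
    return [at, gc, gc, at]


def trimPattern(cwm, trimThreshold=0.3, trimPadding=1):
    """Return (start, end) of the trimmed, padded pattern; end is exclusive.

    cwm is a list of positions, each a list of four contribution values.
    """
    # Assumed measure of contribution: sum of absolute values per position.
    perPosition = [sum(abs(v) for v in row) for row in cwm]
    cutoff = trimThreshold * max(perPosition)
    passing = [i for i, m in enumerate(perPosition) if m >= cutoff]
    # Keep the span between the outermost passing positions, then pad it,
    # without running past either end of the pattern.
    start = max(passing[0] - trimPadding, 0)
    end = min(passing[-1] + 1 + trimPadding, len(cwm))
    return start, end


def quantileCutoff(seqletScores, quantile):
    """Score at the given quantile of the seqlet score distribution."""
    ordered = sorted(seqletScores)
    if not ordered:
        raise ValueError("need at least one seqlet score")
    position = quantile * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Toy pattern: only positions 2 through 4 carry substantial contribution.
magnitudes = [0.01, 0.02, 0.5, -1.0, 0.8, 0.05, 0.01, 0.0]
toyCwm = [[m, 0.0, 0.0, 0.0] for m in magnitudes]
trimmedSpan = trimPattern(toyCwm, trimThreshold=0.3, trimPadding=1)

# Toy seqlet scores: a candidate hit would be compared against this cutoff.
cutoff = quantileCutoff([1.0, 2.0, 3.0, 4.0], 0.5)
```

Under these assumptions, a lower quantile gives a more permissive cutoff, because more of the seqlets that MoDISco assigned to the pattern score above it.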
Output Specification
See motifAddQuantiles for a description of the
tsv file, and motifScan for a description of the
generated JSON.
API
- bpreveal.motifSeqletCutoffs.main(config)
Determine the cutoffs based on modisco outputs.
- Parameters:
config (dict) – A JSON object based on the motifSeqletCutoffs specification.
- Return type:
None
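Since main takes the configuration as a dict, calling it from Python amounts to parsing the JSON file first. The sketch below shows one way to do that; the helper names loadConfig and runSeqletCutoffs are hypothetical, and the import of bpreveal.motifSeqletCutoffs assumes BPReveal is installed in the current environment.

```python
import json


def loadConfig(configPath):
    """Read a motifSeqletCutoffs configuration from a JSON file."""
    with open(configPath, "r") as fp:
        return json.load(fp)


def runSeqletCutoffs(configPath):
    """Load the configuration and pass it to main() as a dict."""
    # Imported here so that loadConfig stays usable without BPReveal installed.
    from bpreveal import motifSeqletCutoffs
    motifSeqletCutoffs.main(loadConfig(configPath))
```

Keeping the parsing separate from the call makes it possible to inspect or adjust the dict, for example to change an output file name, before running the program.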
Schema
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "motifSeqletCutoffs",
    "description": "Schema for motifSeqletCutoffs.py",
    "definitions": {
        "seqlet-scanning-settings": {
            "type": "object",
            "properties": {
                "seqlets-tsv": {"type": "string"},
                "modisco-h5": {"type": "string"},
                "modisco-contrib-h5": {"type": "string"},
                "modisco-window": {"type": "integer"},
                "seq-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-magnitude-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "trim-threshold": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "trim-padding": {"type": "integer"},
                "background-probs": {
                    "oneOf": [
                        {"type": "array",
                         "minItems": 4,
                         "maxItems": 4,
                         "items": {"$ref": "/schema/base#/definitions/fraction"}},
                        {"type": "ndarray"},
                        {"type": "string",
                         "enum": ["danRer11", "hg38", "mm10", "dm6", "sacCer3"]}
                    ]
                },
                "quantile-json": {"type": "string"},
                "patterns": {
                    "oneOf":[
                        {
                            "type": "array",
                            "items": {"$ref": "#/definitions/pattern-spec-section"}
                        },
                        {"type": "string", "enum": ["all"]}
                    ]
                }
            },
            "required": ["modisco-h5", "seq-match-quantile", "contrib-match-quantile",
                         "contrib-magnitude-quantile", "trim-threshold", "trim-padding",
                         "background-probs", "patterns"]
        },
        "pattern-spec-section": {
            "type": "object",
            "properties": {
                "metacluster-name": {"type": "string"},
                "pattern-name": {"type": "string"},
                "pattern-names": {"type": "array", "items": {"type": "string"}},
                "short-name": {"type": "string"},
                "short-names": {"type": "array", "items": {"type": "string"}},
                "seq-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-match-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"},
                "contrib-magnitude-quantile": {"$ref": "/schema/base#/definitions/fraction-or-null"}
            },
            "required": ["metacluster-name"],
            "oneOf":[
                {"allOf": [
                    {"required": ["pattern-name"]},
                    {"not": {"anyOf": [{"required": ["pattern-names"]},
                                       {"required": ["short-names"]}]}}]},
                {"allOf": [
                    {"required": ["pattern-names"]},
                    {"not": {"anyOf": [{"required": ["pattern-name"]},
                                       {"required": ["short-name"]}]}}]}]
        }
    },
    "type": "object",
    "properties": {
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "allOf": [
        {"$ref": "#/definitions/seqlet-scanning-settings"},
        {"required": ["verbosity"]}
    ]
}
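The structural rules in this schema (the required keys and the mutually exclusive naming styles in a pattern spec) can be checked by hand before running the program. The sketch below is a simplified stand-in written for illustration, with a hypothetical function name checkConfig; it does not resolve the references to the base schema and does not validate value types or ranges.

```python
REQUIRED_KEYS = [
    "modisco-h5", "seq-match-quantile", "contrib-match-quantile",
    "contrib-magnitude-quantile", "trim-threshold", "trim-padding",
    "background-probs", "patterns", "verbosity",
]


def checkConfig(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in config]
    patterns = config.get("patterns")
    if isinstance(patterns, list):
        for i, spec in enumerate(patterns):
            if "metacluster-name" not in spec:
                problems.append(f"patterns[{i}]: missing metacluster-name")
            single = "pattern-name" in spec
            multi = "pattern-names" in spec
            # Exactly one of the two naming styles must be used.
            if single == multi:
                problems.append(
                    f"patterns[{i}]: give exactly one of pattern-name or pattern-names")
            # The short-name key must match the naming style in use.
            if single and "short-names" in spec:
                problems.append(f"patterns[{i}]: short-names used with pattern-name")
            if multi and "short-name" in spec:
                problems.append(f"patterns[{i}]: short-name used with pattern-names")
    elif patterns is not None and patterns != "all":
        problems.append('patterns must be a list of pattern specs or "all"')
    return problems
```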