tools.bestMotifsOnly

Help Info

Read in a tsv file from the motif scanner and limit each position to only have one motif.

usage: bestMotifsOnly [-h] [--metric METRIC]
                      [--filter FILTER] [--in-tsv INTSV]
                      [--in-bed INBED] [--out-bed OUTBED]
                      [--out-tsv OUTTSV]
                      [--no-match-names]
                      [--max-offset MAXOFFSET] [--verbose]

Named Arguments

--metric

A python expression giving the value that should be used to compare motif instances to see which one is better. (Beds should use the ‘score’ column to compare.) See man page for examples.

Default: 'score'

--filter

Only consider motifs that satisfy this filter. Format: Any valid Python expression where the identifiers are column names in the tsv. (Don’t forget to quote comparison operators on the shell!) See the man page for examples.

Default: 'True'

--in-tsv

The input file, in TSV format. One of --in-tsv or --in-bed is required.

--in-bed

The input file, in bed format. One of --in-tsv or --in-bed is required.

--out-bed

(Optional) The name of the output file, in bed format.

--out-tsv

(Optional) The name of the output file, in tsv format. Cannot be used with --in-bed.

--no-match-names

If provided, then don’t require motif names to match. Default: motifs will only be removed if there is a better instance of a motif with the same name at that locus.

Default: True

--max-offset

Instead of removing motifs that overlap at all, only compare motif instances that are offset by this amount or less.

Default: 99999

--verbose

Show progress.

Default: False

Usage

Read in a tsv file from motif scanning and remove overlapping motifs.

For each position, if two motifs claim it, remove the motif with the lower value of the specified metric.

The metric and filter arguments are evaluated by the interpreter. These expressions are evaluated in an environment where each property of a record (either bed or tsv) is bound to a variable given by the column name. For example, if a bed file is provided, then chrom, name, start, and the other columns are variables in the environment for the expression. You can see the interpreter documentation for a list of all of the syntax available, but here are a few examples:

--metric score: would rank motifs based on their score column, assuming bed-format input.
--metric 'seq_match_quantile + 2 * contrib_match_quantile': would score motifs based mostly on their contribution match, but also include some weight for sequence match.
--filter 'contrib_match_quantile > 0.8': would only keep the top 20 percent of motifs no matter the motif name.
--filter '(pattern_name != "polyA") or (contrib_magnitude_quantile > 0.9)': would select all motifs not named polyA and only accept the top 10 percent of polyA motifs.

filter should return True or False, whereas metric should return a scalar.

Note that there is no support for comparing motifs as part of your metric, so there is no way to say --metric 'motif1.contrib_magnitude if motif1.end > motif2.start else motif1.contrib_match'. (The names motif1 and motif2 would be name errors, since they are not in scope.)
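To illustrate how column names become variables, here is a minimal sketch that uses Python's built-in eval as a stand-in for the tool's interpreter. The helper name evaluate and the sample record are hypothetical; the real interpreter may accept a different set of syntax.

```python
# Illustrative stand-in only: this uses Python's eval, not the tool's interpreter.
def evaluate(expression, record):
    """Evaluate an expression with each column of `record` bound as a variable."""
    # An empty __builtins__ means only the column names are in scope.
    return eval(expression, {"__builtins__": {}}, dict(record))


# A made-up record; the column names are the ones used in the examples above.
record = {
    "pattern_name": "polyA",
    "seq_match_quantile": 0.5,
    "contrib_match_quantile": 0.9,
    "contrib_magnitude_quantile": 0.7,
}

# A metric expression yields a scalar.
metric = evaluate("seq_match_quantile + 2 * contrib_match_quantile", record)

# A filter expression yields True or False.
keep = evaluate(
    '(pattern_name != "polyA") or (contrib_magnitude_quantile > 0.9)', record
)

# Only one record is in scope, so names like motif1 are undefined.
try:
    evaluate("motif1.contrib_magnitude", record)
    cross_motif_ok = True
except NameError:
    cross_motif_ok = False
```

In this sketch the polyA record fails the filter because its contrib_magnitude_quantile is below 0.9, and the motif1 expression raises a name error as described above.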

bpreveal.tools.bestMotifsOnly.getParser()

Generate the parser, but do not call parse_args().

Return type: ArgumentParser

bpreveal.tools.bestMotifsOnly.removeOverlaps(entries, nameCol, maxOffset)

Scan over the (sorted) motif hits and keep the best ones.

For every pair of motifs (m1, m2):

If they don’t overlap at all, continue.
If the center of m1 is more than maxOffset bp away from the center of m2, continue.
If nameCol is given (i.e., is not None) and the name of m1 is different from the name of m2, continue.
We have established that m1 and m2 are competing. We must mark one of them as bad.
Select the motif that has the lower metric. Mark it as bad. If they have the same metric value, mark one at random.

Once all of the marking is complete, return a list of all motifs that are NOT marked as bad.

Parameters:

entries (list[dict]) – A list of motif tsv (or bed) entries. Each entry is a dict keyed by column name.
nameCol (str | None) – The name of the column used to check to see if motifs have the same name, for example short_name. If None, then don’t compare motifs by name, and only return the single best hit at each locus.
maxOffset (int) – If two entries have midpoints that are separated by more than this distance, then they are considered non-overlapping. The larger this value, the more aggressive the culling will be.

Returns:

A list of entries, of the same type as the input entries.

Return type:

list[dict]

There are some quirks with this algorithm. It works based on the order of the bed file, and so the following situation gives a result you might not expect:

|---a:0.8---|     |---c:0.2---|
         |---b:0.7---|   |---d:0.1---|

Here we have four motif calls: a, b, c, and d. By the above algorithm, the motif pairs (a, b), (b, c), and (c, d) have overlap, and so we’ll mark the lower-scoring motif from each pair. Between a and b, b has the lower score, so b will be marked. For b and c, c will be marked. For c and d, d will be marked. This means that ONLY motif a will be returned because every other motif instance had overlap with a better instance.
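The pairwise marking and the chained result above can be reproduced with a short sketch. This is not the BPReveal source: the function name remove_overlaps_sketch, the coordinates chosen for a through d, and the use of half-open intervals are assumptions made for illustration.

```python
import random


def remove_overlaps_sketch(entries, name_col, max_offset):
    """Pairwise culling following the steps listed above (illustrative only)."""
    bad = set()
    for i, m1 in enumerate(entries):
        for j in range(i + 1, len(entries)):
            m2 = entries[j]
            # Assumed half-open intervals: no shared base means no competition.
            if m1["end"] <= m2["start"] or m2["end"] <= m1["start"]:
                continue
            center1 = (m1["start"] + m1["end"]) / 2
            center2 = (m2["start"] + m2["end"]) / 2
            if abs(center1 - center2) > max_offset:
                continue
            if name_col is not None and m1[name_col] != m2[name_col]:
                continue
            # The pair competes; mark the lower metric, or one at random on a tie.
            if m1["metric"] < m2["metric"]:
                bad.add(i)
            elif m2["metric"] < m1["metric"]:
                bad.add(j)
            else:
                bad.add(random.choice((i, j)))
    # A motif that is already marked still competes, which produces the chain.
    return [m for k, m in enumerate(entries) if k not in bad]


# Hypothetical coordinates that reproduce the diagram: only neighbors overlap.
motifs = [
    {"name": "a", "start": 0, "end": 12, "metric": 0.8},
    {"name": "b", "start": 9, "end": 21, "metric": 0.7},
    {"name": "c", "start": 18, "end": 30, "metric": 0.2},
    {"name": "d", "start": 27, "end": 39, "metric": 0.1},
]

survivors = remove_overlaps_sketch(motifs, name_col=None, max_offset=99999)
print([m["name"] for m in survivors])  # ['a']

# With a small max_offset the centers (9 bp apart) are too far to compete.
spaced = remove_overlaps_sketch(motifs, name_col=None, max_offset=5)
print(len(spaced))  # 4
```

The second call shows how a small maxOffset makes the culling less aggressive: no pair is compared, so all four motifs are kept.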

bpreveal.tools.bestMotifsOnly.loadEntries(inTsv, inBed, matchNames)

Read in the bed or TSV file containing motif calls.

Parameters:

inTsv (str | None) – The name of the input tsv file, or None if one wasn’t provided.
inBed (str | None) – The name of the input bed file, or None if one wasn’t provided.
matchNames (bool) – Should the culling only consider motifs that have the same name?

Returns:

A tuple containing three items. The first is a list of dicts, containing all of the entries from the input data file. The second is a list of the field names in each entry. The third is a string giving the name of the field that holds the motif name, or None if motifs should not be compared by name.

Return type:

tuple[list[dict], list[str], str | None]
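As a rough picture of the three-item return value, the sketch below reads a TSV with the standard csv module. The function load_tsv_sketch, its name_col default of short_name, and the sample rows are hypothetical; the real loader may choose the name column differently and may convert numeric fields.

```python
import csv
import io


def load_tsv_sketch(handle, match_names, name_col="short_name"):
    """Return (entries, field names, name column or None) from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    entries = list(reader)  # one dict per row, keyed by column name
    field_names = list(reader.fieldnames or [])
    # No name column is reported when names should not be compared.
    return entries, field_names, (name_col if match_names else None)


# Made-up input standing in for a motif scanner TSV.
demo = io.StringIO(
    "chrom\tstart\tend\tshort_name\tcontrib_match_quantile\n"
    "chr1\t10\t22\tpolyA\t0.91\n"
    "chr1\t15\t27\tpolyA\t0.42\n"
)
entries, field_names, name_col = load_tsv_sketch(demo, match_names=True)
```

Note that csv.DictReader returns every value as a string, so a real loader would need to convert numeric columns before they can be used in a metric.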

bpreveal.tools.bestMotifsOnly.preprocessMotifs(entries, filterDef, metricDef)

Sort, filter, and score the given motif hits.

Parameters:

entries (list[dict]) – A list of dicts, each one representing one motif hit and the fields corresponding to the columns in the data file.
filterDef (str) – A string giving the filter to apply to each of the motifs.
metricDef (str) – A string giving the metric to be calculated for scoring.

Returns:

A list of entries that pass the filter. Each one will have a new field called metric that contains the calculated metric for that motif instance.

Return type:

list[dict]
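The filter, score, and sort steps might look like the following sketch. It takes plain Python callables in place of the filterDef and metricDef strings, and it assumes the sort key is (chrom, start); both choices, along with the name preprocess_sketch, are illustrative assumptions.

```python
def preprocess_sketch(entries, keep, score):
    """Drop entries failing `keep`, attach a 'metric' field, sort by position."""
    kept = []
    for entry in entries:
        if not keep(entry):
            continue
        scored = dict(entry)  # copy so the caller's dicts are not modified
        scored["metric"] = score(entry)
        kept.append(scored)
    # Assumed ordering: chromosome first, then start coordinate.
    kept.sort(key=lambda e: (e["chrom"], e["start"]))
    return kept


# Made-up hits using column names from the examples in this document.
hits = [
    {"chrom": "chr1", "start": 40, "end": 52, "contrib_match_quantile": 0.95},
    {"chrom": "chr1", "start": 10, "end": 22, "contrib_match_quantile": 0.85},
    {"chrom": "chr1", "start": 20, "end": 32, "contrib_match_quantile": 0.30},
]

processed = preprocess_sketch(
    hits,
    keep=lambda e: e["contrib_match_quantile"] > 0.8,
    score=lambda e: e["contrib_match_quantile"],
)
```

Here the hit at position 20 is filtered out, and the remaining two come back in positional order, each carrying a metric field.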

bpreveal.tools.bestMotifsOnly.main()

Run the culling algorithm.

Return type: None