motifAddQuantiles

Help Info

Given tsv files generated by motifSeqletCutoffsand motifScan, add quantile information about each motif hit for downstream analysis.

usage: motifAddQuantiles [-h] --seqlet-tsv SEQLETTSVFNAME
                         --scan-tsv SCANTSVFNAME
                         [--seqlet-out SEQLETOUTFNAME]
                         [--scan-out SCANOUTFNAME]
                         [--verbose]

Named Arguments

--seqlet-tsv

The name of the seqlet tsv file generated by motifSeqletCutoffs.py (or motifScan.py if run with pattern-cutoff-settings).

--scan-tsv

The name of the tsv file generated by motifScan.py

--seqlet-out

Instead of overwriting seqlet-tsv, write the results to this file. If omitted, edit seqlet-tsv in place.

--scan-out

Instead of overwriting scan-tsv write them to this file. If omitted, edit scan-tsv in place.

--verbose

Include more debugging information.

Default: False

Usage

Calculates quantile values for seqlets and called motif instances.

This little helper program calculates quantile values for seqlets and called motif instances. For each pattern (patterns in different metaclusters are distinct), it looks at the seqlets and determines where that seqlet’s importance magnitude, contribution match, and sequence match scores fall among other seqlets in that pattern. Then, for motif hits, it sees where each hit falls, in terms of quantile, among the seqlets in that pattern.

The quantile is based on a very simple definition. A particular seqlet’s quantile is calculated by sorting all of the seqlets for one pattern. The lowest-scoring seqlet gets a quantile of 0.0, the highest-scoring gets 1.0, and the seqlets in between get quantile values based on their order in the sorted metric.

For scanned hits, we take the sorted array of seqlet statistics (for the same pattern as the matched hit fell into) and ask, ‘where would the score of this hit rank among the sorted array of seqlet scores?’ The hit’s rank is then its quantile score. If a hit falls between the scores of two seqlets, then a linear interpolation is performed to assign a quantile value.

The input and output to this program are tsv files, and the only difference in format is that the outputs have three additional columns: contrib_magnitude_quantile, seq_match_quantile, and contrib_match_quantile. The remaining columns are described below:

chrom: The chromosome where the seqlet or hit was found.
start: The start position (inclusive, zero-based) of the seqlet or motif hit.
end: The end position (exclusive, zero-based) of the seqlet or motif hit.
short_name: The user-provided name for this motif. If you didn’t provide one in the configuration for motifSeqletCutoffs, then it will be something like pos_0 for the positive metacluster, pattern zero.
contrib_magnitude: The total contribution across this motif instance. A higher value means more motif contribution.
strand: A single character indicating if the motif was on the positive or negative strand.
metacluster_name: Straight from the modisco hdf5 file. It will be something like pos_patterns.
pattern_name: Also from the modisco hdf5. It will be something like pattern_5.
sequence: The DNA sequence of that motif instance.
index: Either the region index in the contribution hdf5 (from motifScan), or the seqlet index in the modisco hdf5 (from motifSeqletCutoffs).
seq_match: The information content of the sequence match to the motif’s pwm.
contrib_match: The continuous Jaccard similarity between the motif’s cwm and the contribution scores of this seqlet.
seq-match-quantile: Given the PSSM score of each mapped hit to the original TF-MoDISco PSSM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
contrib-match-quantile: Given the CWM score (i.e. Jaccardian-similarity) of each mapped hit’s contribution to the original TF-MoDISco CWM, calculate the quantile value of this score, given the distribution of seqlets corresponding to the TF-MoDISco pattern.
contrib-magnitude-quantile: Given the total L1 magnitude of contribution across a mapped hit, calculate the quantile value of this magnitude, given the distribution of seqlets corresponding to theTF-MoDISco pattern.

If a contribution hdf5 file was not provided to motifSeqletCutoffs, the chrom, start, and end columns are meaningless.

Additional Information

Converting to bed

The first six columns define a bed file, and a simple cut command can generate a viewable bed file from these tsvs:

cat scan.tsv | cut -f 1-6 | tail -n +2 > scan.bed

Removing duplicates

The hits from scanning can contain duplicates. This can happen if the same bases appear in multiple regions (i.e., there is overlap in the region set). In this case, it makes sense to only keep the best instance (highest importance magnitude) of that motif hit. This can be done with a little Unix-fu:

cat scan.tsv | \
    cut -f 1-6 | \
    tail -n +2 | \
    sort -k1,1 -k2,2n -k3,3n -k4,4 -k5,5nr | \
    awk '!_[$1,$2,$3,$4,$6]++' > scan.bed

API

bpreveal.motifAddQuantiles.recordToPatternID(record)

Come up with an identifier that uniquely identifies a particular pattern.

Since a single csv should only represent one modisco run, it’s safe to just mash the metacluster and pattern names together. We can’t use short-name because somebody could give the same name to different patterns.

bpreveal.motifAddQuantiles.readTsv(fname)

Reads in a tsv file generated by the motif seqlet cutoff and scanning tools.

Returns a three-tuple. The first contains the field names from the tsv file, in order. The second is a list of dicts, each dict corresponding to one field in the tsv. The dicts map field names (strings) to the contents of the corresponding column for that record. The third value returned is a list of the unique pattern identifiers among the records; these names are generated by recordToPatternID Each record in the returned list of records contains a field that was not present in the initial tsv, this field is called _TMPNAME. This field contains the combined pattern identifier.

Parameters:: fname (str)

bpreveal.motifAddQuantiles.addFieldNameQuantileMetadata(standardRecords, sampleRecords, patternID, readName, writeName)

For one pattern name, add its quantile data for one quantile type.

For each mapped hit, appends quantile values calculated from a seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match or contribution-magnitude scores.

bpreveal.motifAddQuantiles.addFieldQuantileData(standardRecords, sampleRecords, recordNames, readName, writeName)

For one given field, populate the quantile data.

For each pattern, appends quantile values calculated from a seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match or contribution-magnitude scores.

bpreveal.motifAddQuantiles.addAllMetadata(standardRecords, sampleRecords, recordNames, readNames, writeNames)

Add all of the quantile metadata.

For each mapped hit, appends ALL quantile values calculated from the seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match and contribution-magnitude scores.

bpreveal.motifAddQuantiles.writeTsv(records, fieldNames, fname): Write a new set of mapped hits with the newly-appended quantile information.

bpreveal.motifAddQuantiles.getParser()

Return the parser for the CLI.

Return type:: ArgumentParser

bpreveal.motifAddQuantiles.main(): Add quantile information.