motifAddQuantiles
Help Info
Given tsv files generated by motifSeqletCutoffs and motifScan, add quantile information about each motif hit for downstream analysis.
usage: motifAddQuantiles [-h] --seqlet-tsv SEQLETTSVFNAME
--scan-tsv SCANTSVFNAME
[--seqlet-out SEQLETOUTFNAME]
[--scan-out SCANOUTFNAME]
[--verbose]
Named Arguments
- --seqlet-tsv
The name of the seqlet tsv file generated by motifSeqletCutoffs.py (or motifScan.py if run with pattern-cutoff-settings).
- --scan-tsv
The name of the tsv file generated by motifScan.py
- --seqlet-out
Instead of overwriting seqlet-tsv, write the results to this file. If omitted, edit seqlet-tsv in place.
- --scan-out
Instead of overwriting scan-tsv, write the results to this file. If omitted, edit scan-tsv in place.
- --verbose
Include more debugging information.
Default: False
Usage
Calculates quantile values for seqlets and called motif instances.
For each pattern (patterns in different metaclusters are distinct), this helper program looks at the seqlets and determines where each seqlet’s importance magnitude, contribution match, and sequence match scores fall among the other seqlets in that pattern. Then, for each motif hit, it determines where the hit falls, in terms of quantile, among the seqlets in that pattern.
The quantile is based on a very simple definition. A particular seqlet’s quantile is calculated by sorting all of the seqlets for one pattern. The lowest-scoring seqlet gets a quantile of 0.0, the highest-scoring gets 1.0, and the seqlets in between get quantile values based on their order in the sorted metric.
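The rank-based definition above can be sketched in a few lines of Python. This is only an illustration, not the code used by motifAddQuantiles: the function name rank_quantiles is hypothetical, and the even spacing of the in-between quantiles, the handling of ties (by input order), and the value returned for a single seqlet are all assumptions.

```python
def rank_quantiles(scores):
    """Assign each score a quantile in [0, 1] from its position in sorted order."""
    n = len(scores)
    if n == 1:
        return [0.0]  # Assumption: a lone seqlet gets quantile 0.0.
    # Indices of the scores, ordered from lowest score to highest.
    order = sorted(range(n), key=lambda i: scores[i])
    quantiles = [0.0] * n
    for rank, idx in enumerate(order):
        # Assumption: quantiles are evenly spaced between 0.0 and 1.0.
        quantiles[idx] = rank / (n - 1)
    return quantiles


# Five seqlet scores for one pattern (made-up values).
print(rank_quantiles([0.7, 0.1, 0.4, 0.9, 0.2]))
# → [0.75, 0.0, 0.5, 1.0, 0.25]
```

The lowest score (0.1) maps to 0.0 and the highest (0.9) maps to 1.0, matching the description above.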
For scanned hits, we take the sorted array of seqlet statistics (for the same pattern as the matched hit fell into) and ask, ‘where would the score of this hit rank among the sorted array of seqlet scores?’ The hit’s rank is then its quantile score. If a hit falls between the scores of two seqlets, then a linear interpolation is performed to assign a quantile value.
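As a rough sketch of that interpolation, assuming the seqlet scores for a pattern are already sorted in ascending order and that their quantiles are evenly spaced from 0.0 to 1.0: the function name hit_quantile is hypothetical, and clamping hits that score below the lowest seqlet or above the highest to 0.0 and 1.0 is an assumption, since the text above does not say how such hits are treated.

```python
import bisect


def hit_quantile(hit_score, sorted_seqlet_scores):
    """Interpolate the quantile of hit_score among ascending seqlet scores."""
    n = len(sorted_seqlet_scores)
    if n < 2:
        raise ValueError("need at least two seqlet scores to interpolate")
    # Assumption: hits outside the seqlet range are clamped to 0.0 or 1.0.
    if hit_score <= sorted_seqlet_scores[0]:
        return 0.0
    if hit_score >= sorted_seqlet_scores[-1]:
        return 1.0
    # Index of the first seqlet score strictly greater than the hit.
    hi = bisect.bisect_right(sorted_seqlet_scores, hit_score)
    lo = hi - 1
    lo_score = sorted_seqlet_scores[lo]
    hi_score = sorted_seqlet_scores[hi]
    # Fraction of the way from the lower seqlet to the upper one.
    frac = (hit_score - lo_score) / (hi_score - lo_score)
    return (lo + frac) / (n - 1)


seqlet_scores = [1.0, 2.0, 4.0, 8.0]  # made-up, already sorted
print(hit_quantile(3.0, seqlet_scores))  # → 0.5
```

Here the hit score 3.0 lies halfway between the seqlets at ranks 1 and 2 (of ranks 0 to 3), so its quantile is 1.5 / 3 = 0.5.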
The input and output to this program are tsv files, and the only difference in
format is that the outputs have three additional columns:
contrib_magnitude_quantile, seq_match_quantile, and
contrib_match_quantile. The remaining columns are described below:
- chrom
The chromosome where the seqlet or hit was found.
- start
The start position (inclusive, zero-based) of the seqlet or motif hit.
- end
The end position (exclusive, zero-based) of the seqlet or motif hit.
- short_name
The user-provided name for this motif. If you didn’t provide one in the configuration for
motifSeqletCutoffs, then it will be something like pos_0 for the positive metacluster, pattern zero.
- contrib_magnitude
The total contribution across this motif instance. A higher value means more motif contribution.
- strand
A single character indicating if the motif was on the positive or negative strand.
- metacluster_name
Straight from the modisco hdf5 file. It will be something like
pos_patterns.
- pattern_name
Also from the modisco hdf5. It will be something like
pattern_5.
- sequence
The DNA sequence of that motif instance.
- index
Either the region index in the contribution hdf5 (from
motifScan), or the seqlet index in the modisco hdf5 (from motifSeqletCutoffs).
- seq_match
The information content of the sequence match to the motif’s pwm.
- contrib_match
The continuous Jaccard similarity between the motif’s cwm and the contribution scores of this seqlet.
- seq_match_quantile
The quantile of this hit’s PSSM match score against the original TF-MoDISco PSSM, relative to the distribution of seqlets in the corresponding TF-MoDISco pattern.
- contrib_match_quantile
The quantile of this hit’s CWM match score (i.e., the continuous Jaccard similarity between the hit’s contribution scores and the original TF-MoDISco CWM), relative to the distribution of seqlets in the corresponding TF-MoDISco pattern.
- contrib_magnitude_quantile
The quantile of the total L1 magnitude of contribution across this hit, relative to the distribution of seqlets in the corresponding TF-MoDISco pattern.
If a contribution hdf5 file was not provided to
motifSeqletCutoffs, the chrom, start,
and end columns are meaningless.
Additional Information
Converting to bed
The first six columns define a bed
file, and a simple cut command can generate a viewable bed file from these
tsvs:
cat scan.tsv | cut -f 1-6 | tail -n +2 > scan.bed
Removing duplicates
The hits from scanning can contain duplicates. This can happen if the same bases appear in multiple regions (i.e., there is overlap in the region set). In this case, it makes sense to only keep the best instance (highest importance magnitude) of that motif hit. This can be done with a little Unix-fu:
cat scan.tsv | \
cut -f 1-6 | \
tail -n +2 | \
sort -k1,1 -k2,2n -k3,3n -k4,4 -k5,5nr | \
awk '!_[$1,$2,$3,$4,$6]++' > scan.bed
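If the quantile columns need to survive deduplication (the pipeline above keeps only the first six columns), the same idea can be expressed in Python. This is a minimal sketch, not part of BPReveal: the function name best_hits is hypothetical, and it assumes each record is a dict keyed by the column names listed above, with values still stored as strings.

```python
def best_hits(records):
    """Keep the highest-magnitude record per (chrom, start, end, short_name, strand)."""
    best = {}
    for rec in records:
        # Same key as the awk filter: every bed column except the score.
        key = (rec["chrom"], rec["start"], rec["end"],
               rec["short_name"], rec["strand"])
        magnitude = float(rec["contrib_magnitude"])
        if key not in best or magnitude > float(best[key]["contrib_magnitude"]):
            best[key] = rec
    return list(best.values())


hits = [
    {"chrom": "chr1", "start": "10", "end": "20", "short_name": "pos_0",
     "contrib_magnitude": "1.5", "strand": "+"},
    {"chrom": "chr1", "start": "10", "end": "20", "short_name": "pos_0",
     "contrib_magnitude": "2.5", "strand": "+"},
]
print(len(best_hits(hits)))  # → 1
```

Unlike the shell version, this does not sort its output; records come back in the order in which each distinct hit was first seen.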
API
- bpreveal.motifAddQuantiles.recordToPatternID(record)
Come up with an identifier that uniquely identifies a particular pattern.
Since a single tsv should only represent one modisco run, it’s safe to just mash the metacluster and pattern names together. We can’t use short_name because somebody could give the same name to different patterns.
- bpreveal.motifAddQuantiles.readTsv(fname)
Reads in a tsv file generated by the motif seqlet cutoff and scanning tools.
Returns a three-tuple. The first value contains the field names from the tsv file, in order. The second is a list of dicts, each dict corresponding to one record (row) in the tsv. The dicts map field names (strings) to the contents of the corresponding column for that record. The third value is a list of the unique pattern identifiers among the records; these names are generated by recordToPatternID. Each record in the returned list also contains a field that was not present in the initial tsv, called _TMPNAME, which holds the combined pattern identifier.
- Parameters:
fname (str) –
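To make the shape of that return value concrete, here is a minimal sketch built on the standard csv module. It is not the BPReveal implementation: the function name read_tsv_sketch is hypothetical, and joining metacluster_name and pattern_name with an underscore is an assumption about how the identifier is formed.

```python
import csv


def read_tsv_sketch(fname):
    """Return (field names, list of record dicts, unique pattern identifiers)."""
    with open(fname, newline="") as fp:
        reader = csv.DictReader(fp, delimiter="\t")
        field_names = list(reader.fieldnames or [])
        records = list(reader)
    pattern_ids = []
    for rec in records:
        # Assumption: the identifier is the two names joined by an underscore.
        pid = rec["metacluster_name"] + "_" + rec["pattern_name"]
        rec["_TMPNAME"] = pid
        if pid not in pattern_ids:
            pattern_ids.append(pid)
    return field_names, records, pattern_ids
```

Note that the returned field names do not include _TMPNAME, so a writer that emits only those fields would leave the temporary identifier out of the output file.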
- bpreveal.motifAddQuantiles.addFieldNameQuantileMetadata(standardRecords, sampleRecords, patternID, readName, writeName)
For one pattern name, add its quantile data for one quantile type.
For each mapped hit, appends quantile values calculated from a seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match or contribution-magnitude scores.
- bpreveal.motifAddQuantiles.addFieldQuantileData(standardRecords, sampleRecords, recordNames, readName, writeName)
For one given field, populate the quantile data.
For each pattern, appends quantile values calculated from a seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match or contribution-magnitude scores.
- bpreveal.motifAddQuantiles.addAllMetadata(standardRecords, sampleRecords, recordNames, readNames, writeNames)
Add all of the quantile metadata.
For each mapped hit, appends ALL quantile values calculated from the seqlet distribution of the considered score. These scores usually will consist of seq-match, contribution-match and contribution-magnitude scores.
- bpreveal.motifAddQuantiles.writeTsv(records, fieldNames, fname)
Write a new set of mapped hits with the newly-appended quantile information.
- bpreveal.motifAddQuantiles.getParser()
Return the parser for the CLI.
- Return type:
ArgumentParser
- bpreveal.motifAddQuantiles.main()
Add quantile information.