motifUtils

Provides a set of classes and functions to perform CWM-scanning.

bpreveal.motifUtils.arrayQuantileMap(standard, samples, standardSorted=False)

Get each sample’s quantile in standard array.

Parameters:

standard (ndarray[Any, dtype[_ScalarType_co]]) – The reference array.
samples (ndarray[Any, dtype[_ScalarType_co]]) – The values that will be placed among the standard.
standardSorted (bool) – Is the standard array already sorted?

Returns:

For each sample in samples, what quantile of standard would it fall in?

Return type:

ndarray[Any, dtype[float32]]

For each sample in samples (samples is an array of shape (N,)), in which quantile of the array of standards (of shape (M,)) would that sample fall?

For example, if standard were [1,2,3,5, 100], then a sample of 1 would be in the 0 quantile (since it’d fall at the beginning of the array of standards) while a sample of 10 would be somewhere the 0.75 quantile (corresponding to 5) and the 1.0 quantile (corresponding to the 100 standard). Its precise position is determined by interpolating the quantiles in the standards. So in the above example, 10 would be closer to 0.75 than 1.0, because 10 is closer to 5 than it is to 100. Returns a single array of the quantile positions for the samples (of shape (N,)). For a sample array of [2, 4, 6, 99], this would be something like [0.25, 0.625, 0.75001, 0.9999]. Values in samples that fall outside the bounds of the standard are clipped to have a quantile of exactly 0 or exactly 1.0.

standardSorted, if True, means that the standard array is already sorted. It doesn’t change the behavior of this function, but it can speed it up since this function will otherwise have to sort the array of standards.

bpreveal.motifUtils.slidingDotproduct(seqletValues, pssm)

Run the sliding dotproduct algorithm used in the original BPNet.

Simply put, it’s just a convolution.

Parameters:

seqletValues (ndarray[Any, dtype[_ScalarType_co]]) – is a ndarray of shape (numSequences NUM_BASES),
pssm (ndarray[Any, dtype[_ScalarType_co]]) – the pssm, an array of shape (motifLength, NUM_BASES)

Returns:

Returns the scores of comparing the seqlets to their PSSM, an array of shape (motifLength, NUM_BASES)

Return type:

ndarray[Any, dtype[float32]]

bpreveal.motifUtils.ppmToPwm(ppm, backgroundProbs)

Turn a position probability matrix into a position weight matrix.

Given a position probability matrix, which gives the probability of each base at each position, convert that into a position weight matrix, which gives the information (in bits) contained at each position.

ppm is an ndarray of shape (motifLength, NUM_BASES), representing the motif. backgroundProbs is an array of shape (NUM_BASES,), giving the background distribution of bases in the genome. For a genome with 60% AT and 40% GC, this would be [0.3, 0.2, 0.2, 0.3]

returns the pwm, which is an array of shape (motifLength, NUM_BASES)

Parameters:

ppm (ndarray[Any, dtype[_ScalarType_co]]) – is a ndarray of shape (pwmlength, NUM_BASES)
backgroundProbs (ndarray[Any, dtype[_ScalarType_co]]) – is an ndarray of shape (NUM_BASES) representing probabilities to normalize information context over occurring frequencies

Returns:

Returns the pwm, an array of shape (motifLength, NUM_BASES)

Return type:

ndarray[Any, dtype[float32]]

bpreveal.motifUtils.ppmToPssm(ppm, backgroundProbs)

Turn a position probability matrix into an information content matrix.

Given a position probability matrix, convert that to a pssm array, which is a measure of the information contained by each base. This method adds a small (1%) pseudocount at each position.

ppm is an ndarray of shape (motifLength, NUM_BASES), representing the motif. backgroundProbs is an array of shape (NUM_BASES,), with the same meaning as in ppmToPwm.

Parameters:

ppm (ndarray[Any, dtype[_ScalarType_co]]) – is a ndarray of shape (pwmlength, NUM_BASES)
backgroundProbs (ndarray[Any, dtype[_ScalarType_co]]) – is an ndarray of shape (NUM_BASES) representing probabilities to normalize information context over occurring frequencies

Returns:

Returns the pssm, an array of shape (motifLength, NUM_BASES)

Return type:

ndarray[Any, dtype[float32]]

bpreveal.motifUtils.cwmTrimPoints(cwm, trimThreshold, padding)

Find where the motif actually is inside a CWM.

Parameters:

cwm (ndarray[Any, dtype[_ScalarType_co]]) – is a ndarray of shape (cwmlength, NUM_BASES)
trimThreshold (float) – is a floating point number. The lower this is, the more flanking bases will be kept.
padding (int) – The number of bases that should be added back on each side of the trimmed motif.

Returns:

the start and stop indices of the trimmed motif

Return type:

tuple[int, int]

Given a cwm and a threshold, give the slice coordinates that should be used to remove bases with low contribution. For example:

       C
       C A  A
       C A CA
      AC ATCA
nnnnTnACGATCAGnnnnn
0123456789012345678

where the number of letters indicates the importance, should be trimmed to get rid of the noisy (n) bases, and also maybe the flanking T and G. We calculate the maximum contribution (in this case, the stack of 5 Cs) and trim off all bases on the outside that are below trimThreshold*maxContrib If trimThreshold were 0.25, then any bases with less than 1.25 contribution would be trimmed, like so:

 C
 C A  A
 C A CA
AC ATCA
ACGATCA
6789012

If padding is set to zero, then this function will return (6,13), the indices of the passing bases. (Note that it goes to 13, not 12, because Python slices go up to the second index. We add padding to each side. If padding were 3, then the returned indices would be:

    C
    C A  A
    C A CA
    AC ATCA
nTnACGATCAgnn
3456789012345

and the returned indexes would be (3,16)

The returned start and stop coordinates are a tuple, so:

start, stop = cwmTrimPoints(cwm, threshold)
newCwm = cwm[start:stop,:]

(the : in the second axis is because the cwm will have shape (length, NUM_BASES)

class bpreveal.motifUtils.Seqlet(start, end, index, revcomp, oneHot, contribs)

Represents a single seqlet inside a pattern.

Parameters:

start (int) – Relative to the modisco window, where does this seqlet start?
end (int) – Relative to the modisco window, where does this seqlet end?
index (int) – What is the example_idx for this seqlet? This is the row number in the importance score hdf5.
revcomp (bool) – Is this seqlet reverse-complemented?
oneHot (ndarray[Any, dtype[uint8]]) – The one-hot encoded sequence of this seqlet.
contribs (ndarray[Any, dtype[float16]]) – The contribution scores from this seqlet.

seqMatch: float32: The information content match for this seqlet to its pattern’s pssm.

contribMatch: float32: The continuous Jaccard similarity between this seqlet and its pattern’s cwm

contribMagnitude: float32: The sum of the absolute value of all contribution scores for this seqlet

chrom: str = 'UNDEFINED'

What chromosome is this seqlet on?

Populated when you load in coordinates from the contribution hdf5.

genomicStart: int = -10000: After trimming, where does the motif start? Populated when you load in coordinates from the contribution hdf5.

genomicEnd: int = -10000: After trimming, where does the motif end? Populated when you load in coordinates from the contribution hdf5.

start: int: Relative to the modisco window, where does this seqlet start?

end: int: Relative to the modisco window, where does this seqlet end?

index: int

This is example_idx in the modisco output file.

It is used to index into the contribution hdf5 to get coordinate information.

revcomp: bool: Is this seqlet reverse-complemented?

oneHot: ndarray[Any, dtype[uint8]]

The one-hot encoded sequence.

Shape (seqletLength, NUM_BASES)

contribs: ndarray[Any, dtype[float16]]

The contribution scores for each base.

Shape (seqletLength, NUM_BASES)

sequence: str: The sequence of this seqlet, as a string

calcMatches(trimLeft, trimRight, cwm, pssm)

See how this seqlet measures up against its pattern.

Parameters:

trimLeft (int) – How many bases should be trimmed from the left of the seqlet before measuring similarity?
trimRight (int) – How many bases should be trimmed from the right before measuring similarity?
cwm (ndarray[Any, dtype[float32]]) – The pattern’s cwm, used to calculate contribution similarity.
pssm (ndarray[Any, dtype[float32]]) – The pattern’s pssm, used to calculate sequence similarity.

Return type:

None

loadCoordinates(contribFp, modiscoWindow)

Use this seqlet’s index to figure out its original genomic coordinates.

Parameters:

contribFp (File) – The hdf5 file generated by interpretFlat.
modiscoWindow (int) – The size of the scanning window used by modisco.

Return type:

None

class bpreveal.motifUtils.Pattern(metaclusterName, patternName, shortName=None)

A pattern is a simple data storage class.

This class represents a TF-MoDISco pattern that contains metadata stored in the modisco.h5 file, re-represented in a helpful way to perform CWM-scanning and compute seqlet-derived quantile distribution information.

Parameters:

metaclusterName (str)
patternName (str)
shortName (str)

cwm: ndarray[Any, dtype[float32]]: A (length, NUM_BASES) array of the contribution weight matrix.

ppm: ndarray[Any, dtype[float32]]: A (length, NUM_BASES) array of the probability of each base at each position.

pwm: ndarray[Any, dtype[float32]]

The position weight matrix for this motif, the usual motif representation in logos.

(if you want the trimmed pwm, it’s pwm[cwmTrimLeftPoint:cwmTrimRightPoint].)

pssm: ndarray[Any, dtype[float32]]: The information content at each base in the motif.

cwmTrimLeftPoint: int: When trimming the motif using cwmTrimPoints, where should you start and stop?

cwmTrimRightPoint: int: When trimming the motif using cwmTrimPoints, where should you start and stop?

cwmTrim: ndarray[Any, dtype[float32]]: For quick reference, I store the trimmed cwm and pssm.

pssmTrim: ndarray[Any, dtype[float32]]

For quick reference, I store the trimmed cwm and pssm.

(if you want the trimmed pwm, it’s pwm[cwmTrimLeftPoint:cwmTrimRightPoint].)

numSeqlets: int: How many seqlets are in this pattern?

seqlets: list[Seqlet]: All of the seqlets that comprise this pattern.

quantileSeqMatch: float | None: The quantile cutoff for sequence similarity. None means the cutoff is not used.

quantileContribMatch: float | None: The quantile cutoff for contribution similarity. None means the cutoff is not used.

quantileContribMagnitude: float | None: The quantile cutoff for contribution magnitude. None means the cutoff is not used.

cutoffSeqMatch: float | None

What is the minimal information content (i.e., pssm) match score for a hit?

Stored when you give quantile bounds to getCutoffs

cutoffContribMatch: float | None

What is the minimal Jaccard similarity between a seqlet and the cwm for a hit?

Stored when you give quantile bounds to getCutoffs

cutoffContribMagnitude: float | None

What is the minimum total contribution a seqlet must have to be a hit?

Stored when you give quantile bounds to getCutoffs

metaclusterName: str: The name of the metacluster, like “neg_patterns” or “pos_patterns”

patternName: str: The name of the pattern (i.e., motif), like “pattern_0”

shortName: str: The human-readable name of this pattern.

setQuantiles(quantileSeqMatch, quantileContribMatch, quantileContribMagnitude)

Set up the quantile values that will be used to calculate cutoffs.

Parameters:

quantileSeqMatch (float | None) – is a float designating the minimum PSSM match quantile threshold that a mapped hit must meet based on the distribution of seqlet PSSM match scores. If value is None instead of float, skip this threshold.
quantileContribMatch (float | None) – is a float designating the minimum CWM match quantile threshold that a mapped hit must meet based on the distribution of seqlet CWM match scores. If value is None instead of float, skip this threshold.
quantileContribMagnitude (float | None) – is a float designating the minimum contribution quantile threshold that a mapped hit must meet based on the distribution of seqlet contribution scores. If value is None instead of float, skip this threshold.

Return type:

None

loadCwm(modiscoFp, trimThreshold, padding, backgroundProbs)

Given an opened hdf5 file object, load up the contribution scores for this pattern.

Parameters:

modiscoFp (File) – is a filepath referencing the modisco.h5 file generated by tfmodiscolite
trimThreshold (float) – is a floating point number. The lower this is, the more flanking bases will be kept.
padding (int) – is an integer. After trimming, pad each end of motif by this many flanking bases to reconstruct flank specificity.
backgroundProbs (ndarray[Any, dtype[float32]]) – is an ndarray of shape (NUM_BASES) representing probabilities to normalize information context over occurring frequencies

Return type:

None

trimThreshold and padding are used to trim the motifs, see cwmTrimPoints for documentation on those parameters.

backgroundProbs gives the average frequency of each base across the genome as an array of shape (NUM_BASES,). See ppmToPwm and ppmToPssm for details on this parameter.

After running this function, this Pattern object will contain a few arrays:

pwm, which contains the position weight matrix for the underlying seqlets pssm, the information content of the motif ppm, the frequency of each base at each position. cwm, the contribution scores at each position.

loadSeqlets(modiscoFp)

Load seqlets from the modisco hdf5.

Parameters:: modiscoFp (File) – is a filepath referencing the modisco.h5 file generated by tfmodiscolite
Return type:: None

This function loads up all the seqlet data from the modisco file and calculates quantile values for information content match, contribution jaccard match, and contribution L1 match.

getCutoffs()

Calculate cutoff values given target quantiles.

Given the quantile values you want to use as cutoffs, actually look at the seqlet data and pick the right parameters. If you want to choose your own quantile cutoffs, set the cutoffSeqMatch, cutoffContribMatch, and cutoffContribMagnitude members of this object.

Return type:: None

seqletInfoIterator()

Make an iterator to go over the seqlets.

An iterator (meaning you can use it as the range in a for loop) that goes over every seqlet and returns a dictionary of information about it. This is useful when you want to write those seqlets to a file. Returns a dictionary of:

{   "chrom" : <string>,
    "start" : <integer>,
    "end" : <integer>,
    "short-name" : <string>,
    "contrib-magnitude" : <float>,
    "strand" : <character>,

    "metacluster-name" : <string>,
    "pattern-name" : <string>
    "sequence" : <string>,
    "index" : <integer>,
    "seq-match" : <float>,
    "contrib-match" : <float>
}

Return type:: Iterable[dict]

loadSeqletCoordinates(contribH5fp, modiscoWindow)

Read in seqlet coordinates saved in the contribution hdf5.

If modiscoWindow is 0, store invalid data. In BPReveal 6.0.0, a modiscoWindow value of 0 will trigger an assertion failure. For now, it issues an error log.

There is a reported bug in tfmodisco-lite that resets the indexes of the seqlets. CHECK your outputs and make sure that the reported coordinates make sense!

This method creates three new fields in this Pattern object:

seqletChroms: A list of strings containing the chromosome name for each seqlet. Each chromosome will be UNDEFINED if modiscoWindow is 0.
seqletGenomicStarts: The coordinates in the genome where the seqlet starts, or -10000 if modiscoWindow is 0.
seqletGenomicEnds: The coordinates in the genome where the seqlet ends, or -10000 if modiscoWindow is 0.

Parameters:

contribH5fp (File)
modiscoWindow (int)

Return type:

None

getScanningInfo()

Get what you need to know to scan CWMs.

metacluster-name: Will be assigned as pos_patterns or neg_patterns to represent whether the sequence feature increases or decreases SHAP value
pattern-name: Identifier corresponding to the pattern returned by tf-modiscolite
short-name: Combined identifier of metacluster-name and pattern-name
cwm: The CWM, an array of shape (motifLength, NUM_BASES)
pssm: The PSSM, an array of shape (motifLength, NUM_BASES)
seq-match-cutoff: A float designating the minimum PSSM match quantile threshold that a mapped hit must meet based on the distribution of seqlet PSSM match scores. If value is None instead of float, skip this threshold.
contrib-match-cutoff: A float designating the minimum CWM match quantile threshold that a mapped hit must meet based on the distribution of seqlet CWM match scores. If value is None instead of float, skip this threshold.
contrib-magnitude-cutoff: A float designating the minimum contribution quantile threshold that a mapped hit must meet based on the distribution of seqlet contribution scores. If value is None instead of float, skip this threshold.

Get a dictionary (that can be converted to json) containing all the information needed to map motifs by cwm scanning.

Return type:: dict

property seqletSeqMatches: list[float32]

Returns the contribution match for each seqlet.

This is only present for backwards compatibility and returns a warning.

property seqletContribMatches: list[float32]

Returns the contribution match for each seqlet.

This is only present for backwards compatibility and returns a warning.

property seqletContribMagnitudes: list[float32]

Returns the contribution magnitude for each seqlet.

This is only present for backwards compatibility and returns a warning.

bpreveal.motifUtils.seqletCutoffs(modiscoH5Fname, contribH5Fname, patternSpec, quantileSeqMatch, quantileContribMatch, quantileContribMagnitude, trimThreshold, trimPadding, backgroundProbs, modiscoWindow, outputSeqletsFname=None)

Given a modisco hdf5 file, go over the seqlets and establish the quantile boundaries.

If you give hard cutoffs for information content and L1 norm match, this function need not be called.

Parameters:

modiscoH5Fname (str) – Gives the name of the modisco output hdf5.
contribH5Fname (str) – The name of the hdf5 file that was generated by interpretFlat.py. This file contains genomic coordinates of the seqlets, and is used to determine the location of each seqlet identified by modisco. If outputSeqletsFname is None, this parameter is ignored.
patternSpec (list[dict] | Literal['all']) – Either a list of dicts, or the string all. See makePatternObjects for how this parameter is interpreted.
quantileSeqMatch (float) – The information content shared between the PSSM (which is based on nucleotide frequency, NOT contribution scores. A lower value means allow sequences which are a worse match to the PSSM.
quantileContribMatch (float) – The similarity required between the contribution scores of each seqlet and the cwm of the pattern (i.e., motif). Lower means let through worse matches to the contribution scores.
quantileContribMagnitude (float) – gives the cutoff in terms of total importance for a seqlet for it to be considered. A low value means that seqlets that have low total contribution (sum(abs(contrib scores))) can still be considered hits.
trimThreshold (float) – Gives how aggressive the flank-trimming will be.
trimPadding (int) – Gives the padding to be added to the flanks. See bpreveal.motifUtils.cwmTrimPoints() for details of these parameters.
backgroundProbs (ndarray[Any, dtype[float32]]) – An array of shape (NUM_BASES,) of floats that gives the background distribution for each base in the genome. See ppmToPwm() and ppmToPssm() for details on this argument.
modiscoWindow (int) – The size of the window that was used by Modisco during scanning. (That’s the -w argument to modisco motifs.) Until BPReveal 6.0.0, passing in a modiscoWindow value of 0 will disable seqlet coordinate extraction and raise a warning. After 6.0.0, this will cause an assertion error. if outputSeqletsFname is None, this parameter is ignored.
outputSeqletsFname (str | None) – (Optional) Gives a name for a file where the all of the seqlets in the Modisco output should be saved as a tsv file.

Returns:

A list of dicts that will be needed by the cwm scanning utility.

Return type:

list[dict]

The returned list is structured as follows::: [{“metacluster-name”: <string>, “pattern-name”: <string>, “cwm”: <array of floats of shape (length, NUM_BASES)>, “pssm”: <array of floats of shape (length, NUM_BASES)>, “seq-match-cutoff”: <float-or-null>, “contrib-match-cutoff”: <float-or-null>, “contrib-magnitude-cutoff”: <float-or-null> }, … ]

This dictionary can be saved as a json for use with the motif scanning tool, or passed directly to the motif scanning Python functions.

bpreveal.motifUtils.makePatternObjects(patternSpec, modiscoH5Fname, quantileSeqMatch, quantileContribMatch, quantileContribMagnitude)

Get a list of patterns to scan.

Parameters:

patternSpec (list[dict] | "all") – The pattern specs from the config json.
modiscoH5Fname (str) – The name of the hdf5-format file generated by MoDISco.
quantileSeqMatch (float | None) – The default sequence match quantile if a pattern spec doesn’t specify its own.
quantileContribMatch (float | None) – The default contribution match quantile cutoff if a pattern spec doesn’t specify its own.
quantileContribMagnitude (float | None) – The default contribution magnitude cutoff to use if a pattern spec doesn’t specify its own.

Return type:

list[Pattern]

If patternSpec is a list, it may have one of two forms:

[ {"metacluster-name" : <string>,
"pattern-name" : <string>,
«"short-name" : <string> »
«"seq-match-quantile": <number-or-null>»
«"contrib-match-quantile": <number-or-null>»
«"contrib-magnitude-quantile": <number-or-null>»
},
...
]

where each entry gives one metacluster with ONE pattern, or it may give multiple patterns as a list:

[ {"metacluster-name" : <string>,
"pattern-names" : [<string>, ...],
«"short-names" : [<string>,...]»
},
...
]

(Note the ‘s’ in pattern-names, as opposed to the singular pattern-name above.) short-name, or short-names, is an optional parameter. If given, then instead of the patterns being named “pos_2”, they will have the name you give, which could be something much more human-readable, like “Sox2”.

If patternSpec is a list with one entry naming only a single pattern (the first form above), then the optional quantile cutoffs values, if provided, override the global cutoff value for scanning that pattern only.

Alternatively, patternSpec may be the string “all” in which case all patterns in the modisco hdf5 file will be scanned. (And the short names will be pos_0, pos_1, …)

class bpreveal.motifUtils.MiniPattern(config)

A smaller pattern object for the scanners to use.

metacluster-name: Will be assigned as pos_patterns or neg_patterns to represent whether the sequence feature increases or decreases SHAP value
pattern-name: Identifier corresponding to the pattern returned by tf-modiscolite
short-name: Combined identifier of metacluster-name and pattern-name
cwm: The CWM, an array of shape (motifLength, NUM_BASES)
rcwm: The reverse complement CWM, an array of shape (motifLength, NUM_BASES)
pssm: The PSSM, an array of shape (motifLength, NUM_BASES)
rpssm: The reverse complement PSSM, an array of shape (motifLength, NUM_BASES)
seqMatchCutoff: A float designating the minimum PSSM match quantile threshold that a mapped hit must meet based on the distribution of seqlet PSSM match scores. If value is None instead of float, skip this threshold.
contribMatchCutoff: A float designating the minimum CWM match quantile threshold that a mapped hit must meet based on the distribution of seqlet CWM match scores. If value is None instead of float, skip this threshold.
contribMagnitudeCutoff: A float designating the minimum contribution quantile threshold that a mapped hit must meet based on the distribution of seqlet contribution scores. If value is None instead of float, skip this threshold.

During the CWM scanning step, the full Pattern class is unnecessary, since it loads up a lot of seqlet data. This lightweight class is designed to be used inside the scanning threads. It represents one pattern, and can quickly scan sequences and contribution score tracks against its pattern.

Parameters:: config (dict)

scanWithoutCutoffs(sequence, scores)

See how this pattern reacts to an importance profile.

Instead of getting hits and putting them in the output queue, just return the arrays generated during scanning. This is useful to see how the scanner viewed your importance profile.

Parameters:

sequence (ndarray[Any, dtype[uint8]]) – The one-hot encoded sequence.
scores (ndarray[Any, dtype[float16]]) – The importance scores.

Returns:

A list of vectors from scanning this pattern against the sequence.

Return type:

list[ndarray[Any, dtype[float32]]]

The returned vectors will be, in order,

Positive strand contribution match scores
Positive strand contribution magnitudes
Positive strand sequence match scores
Negative strand contribution match scores
Negative strand contribution magnitudes
Negative strand sequence match scores

scan(sequence, scores)

Given a sequence and a contribution track, identify places where this pattern matches.

sequence is a (length, NUM_BASES) one-hot encoded DNA fragment, and scores is the actual (not hypothetical) contribution score data from that region. Scores is (length, NUM_BASES) in shape, but all of the bases that are not present should have zero contribution score. Returns a list of hits. A hit is a tuple with five elements. First, an integer giving the offset in the sequence where the hit starts. Second, the strand on which the hit was found. Third, the contribution magnitude of the pattern at that position. Fourth, the contribution match score of the importance scores at that position. Finally, the sequence match score against the pattern at that position.

Parameters:

sequence (ndarray[Any, dtype[uint8]])
scores (ndarray[Any, dtype[float16]])

Return type:

list[tuple[int, Literal[‘+’, ‘-’], float, float, float]]

class bpreveal.motifUtils.Hit(chrom, start, end, shortName, metaclusterName, patternName, strand, sequence, index, contribMagnitude, contribMatchScore, seqMatchScore)

A small struct used to bundle up found hits for insertion in the pipe to the writer thread.

chrom: Chromosome belonging to hit coordinate
start: 0-based start site of hit coordinate
end: 0-based end site of hit coordinate
shortName: Combined identifier of metacluster-name and pattern-name
patternName: Identifier corresponding to the pattern returned by tf-modiscolite
metaclusterName: Will be assigned as pos_patterns or neg_patterns to represent whether the sequence feature increases or decreases SHAP value
strand: Strand alignment of the hit coordinate
sequence: Sequence of the hit coordinate
index: Region index that hit coordinate belongs to. Should following indexing of given contribution .h5 by which the hit was mapped across.
contribMagnitude: L1 magnitude of contribution (sum(abs(x))) across hit
contribMatchScore: Match score to the hit’s motif CWM based on Jaccard similarity
seqMatchScore: Match score to the hit’s motif PSSM given the log likelihood of a match

Parameters:

chrom (str)
start (int)
end (int)
shortName (str)
metaclusterName (str)
patternName (str)
strand (Literal['+', '-'])
sequence (str)
index (int)
contribMagnitude (float)
contribMatchScore (float)
seqMatchScore (float)

toDict()

Converts the class to a dictionary that’s easier to use with tsv writers.

chrom: Chromosome belonging to hit coordinate
start: 0-based start site of hit coordinate
end: 0-based end site of hit coordinate
short_name: Combined identifier of metacluster-name and pattern-name
contrib_magnitude: L1 magnitude of contribution (sum(abs(x))) across hit
strand: Strand alignment of the hit coordinate
sequence: Sequence of the hit coordinate
region_index: Region index that hit coordinate belongs to. Should following indexing of given contribution .h5 by which the hit was mapped across.
metacluster_name: Will be assigned as pos_patterns or neg_patterns to represent whether the sequence feature increases or decreases SHAP value
pattern_name: Identifier corresponding to the pattern returned by tf-modiscolite
seq_match: Match score to the hit’s motif PSSM given the log likelihood of a match
contrib_match: Match score to the hit’s motif CWM based on Jaccard similarity

Return type:: dict

class bpreveal.motifUtils.RegionScanner(contribFname, patternConfig)

Used to scan one particular region at a time, without cutoffs.

Parameters:

contribFname (str)
patternConfig (dict)

scanIndex(idx)

Scan a single locus.

Given an index into the hdf5 contribution file, extract the sequence and contribution scores there and run a scan. idx is an integer ranging from 0 to the number of entries in the hdf5 file (minus one, of course, since zero-based indexing). Returns a dict, where the keys are (patternName, patternShortName) tuples, and the values are (in order) contribMatch, contribMagnitude, seqMatch, reverseContribMatch, reverseContribMagnitude, reverseSeqMatch

Parameters:: idx (int)
Return type:: dict[tuple[str, str], list[ndarray[Any, dtype[float32]]]]

findRegions(chrom, start, end)

I don’t know the index, but I know the chromosome and position I want scanned.

Given a region, find where in the importance hdf5 that region would be found, and then scan it.

Parameters:

chrom (str)
start (int)
end (int)

Return type:

list[tuple[int, int, int, int]]

class bpreveal.motifUtils.PatternScanner(hitQueue, contribFname, patternConfig)

Used inside a scanner thread to do the actual scanning.

Provides functionality to scan a CWM/PSSM/other motif representation across a given set of contribution score windows to assign scores.

Parameters:

hitQueue (CrashQueue)
contribFname (str)
patternConfig (list[dict])

scanIndex(idx)

Scan a single locus.

Given an index into the hdf5 contribution file, extract the sequence and contribution scores there and run a scan. idx is an integer ranging from 0 to the number of entries in the hdf5 file (minus one, of course, since zero-based indexing). This function will look for hits in this region and then stuff any hits it finds into hitQueue for saving by the writer thread.

Parameters:: idx (int)
Return type:: None

done()

When scanning is finished, close the contribution .h5 file.

Return type:: None

bpreveal.motifUtils.scannerThread(queryQueue, hitQueue, contribFname, patternConfig)

The thread for one scanner.

Each scanner is looking for every pattern, and gets regions to scan from queryQueue. Every time it finds a hit in one of the queries it pulled, it stuffs those Hit objects into hitQueue.

Parameters:

queryQueue (CrashQueue) – The queue that this thread will read from to get its queries.
hitQueue (CrashQueue) – The queue this thread will put its results into.
contribFname (str) – is a string naming the hdf5-format file generated by interpretFlat.py.
patternConfig (list[dict]) – a dictionary/json generated by the quantile script that contains the necessary information to scan for the patterns.

Return type:

None

bpreveal.motifUtils.writerThread(hitQueue, scannerThreads, tsvFname)

A thread that runs concurrently with however many scanners you’re using.

It reads from hitQueue and saves out any hits to a csv file, a bed file, or both.

Parameters:

hitQueue (CrashQueue) – The queue that the scanners will write to.
scannerThreads (int) – gives the number of scanners running concurrently. Why does the writer need to know this? Because each scanner sends a special flag down the queue when it’s done, and the writer needs to know how many of these special values to expect before it knows that all the scanners have finished.
tsvFname (str) – The name of the tsv file that should be written.

Return type:

None

bpreveal.motifUtils.scanPatterns(contribH5Fname, patternConfig, tsvFname, numThreads)

ContribH5Fname is the name of a contribution score file generated by interpretFlat.py.

Parameters:

contribH5Fname (str) – a string naming the hdf5-format file generated by interpretFlat.py.
patternConfig (list[dict]) – a dictionary/json generated by the quantile script that contains the necessary information to scan for the patterns.
tsvFname (str) – the name of the output file containing the hits. Columns 1-6 of this file can be extracted with cut to get a bed file.
numThreads (int) – the total number of threads to use for the scanning. This must be at least three, since this function builds a pipeline with one thread generating queries, one saving results, and all the rest furiously scanning the query sequences for matches. I suggest running rampant with as many threads as they’ll let you use, the scanning is very computationally expensive!

Return type:

None