utils

Lots of helpful utilities for working with models.

bpreveal.utils.loadModel(modelFname)

Load up a BPReveal model.

Note

Sets bpreveal.internal.constants.GLOBAL_TENSORFLOW_LOADED.

Parameters:: modelFname (str) – The name of the model that Keras saved earlier, typically a directory.
Returns:: A Keras Model object.

The returned model does NOT support additional training, since it uses a dummy loss.

Example:

from bpreveal.utils import loadModel
m = loadModel("path/to/model")
preds = m.predict(myOneHotSequences)

bpreveal.utils.setMemoryGrowth()

Turn on the tensorflow option to grow memory usage as needed.

Note

Sets bpreveal.internal.constants.GLOBAL_TENSORFLOW_LOADED.

All of the main programs in BPReveal do this, so that you can use your GPU for other stuff as you work with models.

Return type:: None

bpreveal.utils.limitMemoryUsage(fraction, offset)

Limit tensorflow to use only the given fraction of memory.

Note

Sets bpreveal.internal.constants.GLOBAL_TENSORFLOW_LOADED.

This will allocate total-memory * fraction - offset Why use this? Well, for running multiple processes on the same GPU, you don’t want to have them both allocating all the memory. So if you had two processes, you’d do something like:

def child1():
    utils.limitMemoryUsage(0.5, 1024)
    # Load model, do stuff.

def child2():
    utils.limitMemoryUsage(0.5, 1024)
    # Load model, do stuff.

p1 = multiprocessing.Process(target=child1); p1.start()
p2 = multiprocessing.Process(target=child2); p2.start()

And now each process will use (1024 MB less than) half the total GPU memory.

Parameters:

fraction (float) – How much of the memory on the GPU can I have?
offset (float) – How much memory (in MB) should be reserved when I carve out my fraction?

Returns:

The memory (in MB) reserved.

Return type:

float

bpreveal.utils.loadChromSizes(chromSizesFname=None, genomeFname=None, bwHeader=None, bw=None, fasta=None)

Read in a chrom sizes file and return a dictionary mapping chromosome name → size.

Exactly one of the supplied parameters may be None.

Parameters:

chromSizesFname (str | None) – The name of a chrom.sizes file on disk.
genomeFname (str | None) – The name of a genome fasta file on disk.
bwHeader (dict[str, int] | None) – A dictionary loaded from a bigwig. (Using this makes this function a no-op.)
bw (bigWigFile | None) – An opened bigwig file.
fasta (FastaFile | None) – An opened genome fasta.

Returns:

{“chr1”: 1234567, “chr2”: 43212567, …}

Return type:

dict[str, int]

Example:

from bpreveal.utils import loadChromSizes, blankChromosomeArrays, writeBigwig
import pysam
genome = pysam.FastaFile("path/to/genome.fa")
chromSizeDict = loadChromSizes(fasta=genome)
chromArs = blankChromosomeArrays(chromSizes=chromSizeDict, numTracks=1)
myRegionDats = ...  # Some function that returns tuples of (chrom, start, end, data)
for rChrom, rStart, rEnd, rValues in myRegionDats:
    chromArs[rChrom][rStart:rEnd] = rValues
writeBigwig(bwFname="path/to/output.bw", chromArs)

bpreveal.utils.blankChromosomeArrays(genomeFname=None, chromSizesFname=None, bwHeader=None, chromSizes=None, bw=None, fasta=None, dtype=<class 'numpy.float32'>, numTracks=1)

Get a set of blank numpy arrays that you can use to save genome-wide data.

Exactly one of chromSizesFname, genomeFname, bwHeader, chromSizes, bw, or fasta may be None.

Parameters:

chromSizesFname (str | None) – The name of a chrom.sizes file on disk.
genomeFname (str | None) – The name of a genome fasta file on disk.
bwHeader (dict[str, int] | None) – A dictionary loaded from a bigwig.
chromSizes (dict[str, int] | None) – A dictionary mapping chromosome name to length.
bw (bigWigFile | None) – An opened bigwig file.
fasta (FastaFile | None) – An opened genome fasta.
dtype (type) – The type of the arrays that will be returned.
numTracks (int) – How many tracks of data do you have?

Returns:

A dict mapping chromosome name to a numpy array.

Return type:

dict[str, ndarray]

The returned dict will have an element for every chromosome in the input. The shape of each element of the dictionary will be (chromosome-length, numTracks).

See loadChromSizes for an example.

bpreveal.utils.writeBigwig(bwFname, chromDict=None, regionList=None, regionData=None, chromSizes=None)

Write a bigwig file given some region data.

You must specify either:

chromDict, in which case regionList, chromSizes
and regionData must be None, or
regionList, chromSizes, and regionData, in which
case chromDict must be None.

Parameters:

bwFname (str) – The name of the bigwig file to write.
chromDict (dict[str, ndarray] | None) – A dict mapping chromosome names to the data for that chromosome. The data should have shape (chromosome-length,).
regionList (list[tuple[str, int, int]] | None) – A list of (chrom, start, end) tuples giving the locations where the data should be saved.
regionData (Any) – An iterable with the same length as regionList. The ith element of regionData will be written to the ith location in regionList.
chromSizes (dict[str, int] | None) – A dict mapping chromosome name → chromosome size.

See loadChromSizes for an example.

bpreveal.utils.oneHotEncode(sequence, allowN=False)

Convert the string sequence into a one-hot encoded numpy array.

Parameters:

sequence (str) – A DNA sequence to encode. May contain uppercase and lowercase letters.
allowN (bool) – If False (the default), raise an AssertionError if the sequence contains letters other than ACGTacgt. If True, any other characters will be encoded as [0, 0, 0, 0].

Returns:

An array with shape (len(sequence), 4).

Return type:

ONEHOT_AR_T

The columns are, in order, A, C, G, and T. The mapping is as follows:

A or a → [1, 0, 0, 0]
C or c → [0, 1, 0, 0]
G or g → [0, 0, 1, 0]
T or t → [0, 0, 0, 1]
Other  → [0, 0, 0, 0]

Example:

from bpreveal.utils import oneHotEncode, oneHotDecode
seq = "ACGTTT"
x = oneHotEncode(seq)
print(x)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]
#  [0 0 0 1]
#  [0 0 0 1]]
y = oneHotDecode(x)
print(y)
# ACGTTT

bpreveal.utils.oneHotDecode(oneHotSequence)

Take a one-hot encoded sequence and turn it back into a string.

Given an array representing a one-hot encoded sequence, convert it back to a string. The input shall have shape (sequenceLength, 4), and the output will be a Python string. The decoding is performed based on the following mapping:

[1, 0, 0, 0] → A
[0, 1, 0, 0] → C
[0, 0, 1, 0] → G
[0, 0, 0, 1] → T
[0, 0, 0, 0] → N

See oneHotEncode for an example.

Parameters:: oneHotSequence (ndarray) –
Return type:: str

bpreveal.utils.logitsToProfile(logitsAcrossSingleRegion, logCountsAcrossSingleRegion)

Take logits and logcounts and turn it into a profile.

Parameters:

logitsAcrossSingleRegion (LOGIT_AR_T) – An array of shape (output-length * num-tasks)
logCountsAcrossSingleRegion (LOGCOUNT_T) – A single floating-point number

Returns:

An array of shape (output-length * num-tasks), giving the profile predictions.

Return type:

PRED_AR_T

Example:

from bpreveal.utils import loadModel, oneHotEncode, logitsToProfile
import pysam
import numpy as np
genome = pysam.FastaFile("/scratch/genomes/sacCer3.fa")
seq = genome.fetch("chrII", 429454, 432546)
oneHotSeq = oneHotEncode(seq)
print(oneHotSeq.shape)
model = loadModel("/scratch/mnase.model")
preds = model.predict(np.array([oneHotSeq]))
print(preds[0].shape)
# (1, 1000, 2)
# because there was one input sequence, the output-length is 1000 and
# there are two tasks in this head.
print(preds[1].shape)
# (1, 1)
# because there is one input sequence and there's just one logcounts value
# for each region.
# Note that if the model had two heads, preds[1] would be the logits from the
# second head and preds[2] and preds[3] would be the logcounts from head 1 and
# head 2, respectively.
profiles = logitsToProfile(preds[0][0], preds[1][0])
print(profiles.shape)
# (1000, 2)
# Because we have an output length of 1000 and two tasks.
# These are now the predicted coverage, in read-space.

bpreveal.utils.easyPredict(sequences, modelFname)

Make predictions with your model.

Parameters:

sequences (Iterable[str] | str) – The DNA sequence(s) that you want to predict on.
modelFname (str) – The name of the Keras model to use.

Returns:

An array of profiles or a single profile, depending on sequences

Return type:

list[list[ PRED_AR_T ]] or list[ PRED_AR_T ]

Spawns a separate process to make a single batch of predictions, then shuts it down. Why make it complicated? Because it frees the GPU after it’s done so other programs and stuff can use it. If sequences is a single string containing a sequence to predict on, that’s okay, it will be treated as a length-one list of sequences to predict. sequences should be as long as the input length of your model.

If you passed in an iterable of strings (like a list of strings), the shape of the returned profiles will be (numSequences x numHeads x outputLength x numTasks). Since different heads can have different numbers of tasks, the returned object will be a list (one entry per sequence) of lists (one entry per head) of arrays of shape (outputLength x numTasks). If, instead, you passed in a single string as sequences, it will be (numHeads x outputLength x numTasks). As before, this will be a list (one entry per head) of arrays of shape (outputLength x numTasks)

Example:

from bpreveal.utils import easyPredict
import pysam
genome = pysam.FastaFile("/scratch/genomes/sacCer3.fa")
seq = genome.fetch("chrII", 429454, 432546)
profile = easyPredict([seq], "/scratch/mnase.model")
print(len(profile))
# 1
# because we ran one sequence.
print(len(profile[0]))
# 1
# because there is one head in this model.
print(profile[0][0].shape)
# (1000, 2)
# Because we have an output length of 1000 and two tasks.
# These are now the predicted coverage, in read-space.
singleProfile = easyPredict(seq, "/scratch/mnase.model")
print(singleProfile[0].shape)
# (1000, 2)
# Note how I only had to index singleProfile once, (to get the first head)
# since I passed in a single string as the sequence.

bpreveal.utils.easyInterpretFlat(sequences, modelFname, heads, headID, taskIDs, numShuffles=20, kmerSize=1, keepHypotheticals=False)

Spin up an entire interpret pipeline just to interpret your sequences.

You should only use this for quick one-off things since it is EXTREMELY slow.

Parameters:

sequences (Iterable[str] | str) – is a list (or technically any Iterable) of strings, and the returned importance scores will be in an order that corresponds to your sequences. You can also provide just one string, in which case the return type will change: The first (length-one) dimension will be stripped.
modelFname (str) – The name of the BPReveal model on disk.
heads (int) – The TOTAL number of heads that the model has.
headID (int) – The index of the head of the model that you want interpreted.
taskIDs (list[int]) – The list of tasks that should be included in the profile score calculation. For most cases, you’d want a list of all the tasks, like [0,1].
numShuffles (int) – The number of shuffled sequences that are used to calculate shap values.
kmerSize (int) – The length of kmers for which the distribution should be preserved during the shuffle. If 1, shuffle each base independently. If 2, preserve the distribution of dimers, etc.
keepHypotheticals (bool) – Controls whether the output contains hypothetical contribution scores or just the actual ones.

Returns:

A dict containing the importance scores.

Return type:

dict[str, IMPORTANCE_AR_T | ONEHOT_AR_T ]

If you passed in an iterable of strings (like a list), then the output’s first dimension will be the number of sequences and it will depend on keepHypotheticals:

If keepHypotheticals is True, then it will be structured so:

{"profile": array of shape (numSequences x inputLength x 4),
 "counts": array of shape (numSequences x inputLength x 4),
 "sequence": array of shape (numSequences x inputLength x 4)}

This dict has the same meaning as shap scores stored in an interpretFlat hdf5.

If keepHypotheticals is False (the default), then the shap scores will be condensed down to the normal scores that we plot in a genome browser:
```
{"profile": array of shape (numSequences x inputLength),
 "counts": array of shape (numSequences x inputLength)}
```

However, if sequences was a string instead of an iterable, then the numSequences dimension will be suppressed:

For keepHypotheticals == True, you get:

{"profile": array of shape (inputLength x 4),
 "counts": array of shape (inputLength x 4),
 "sequence": array of shape (inputLength x 4)}

and if keepHypotheticals is False, you get:

{"profile": array of shape (inputLength,),
 "counts": array of shape (inputLength,)}

class bpreveal.utils.BatchPredictor(modelFname, batchSize, start=True)

A utility class for when you need to make lots of predictions.

Note

Sets bpreveal.internal.constants.GLOBAL_TENSORFLOW_LOADED.

It’s doubly-useful if you are generating sequences dynamically. Here’s how it works. You first create a predictor by calling BatchPredictor(modelName, batchSize). If you’re not sure, a batch size of 64 is probably good.

Now, you submit any sequences you want predicted, using the submit methods.

Once you’ve submitted all of your sequences, you can get your results with the getOutput() method.

Note that the getOutput() method returns one result at a time, and you have to call getOutput() once for every time you called one of the submit methods.

Parameters:

modelFname (str) – The name of the BPReveal model that you want to make predictions from. It’s the same name you give for the model in any of the other BPReveal tools.
batchSize (int) – is the number of samples that should be run simultaneously through the model.
start (bool) – Ignored, but present here to give BatchPredictor the same API as ThreadedBatchPredictor.

clear()

Reset the predictor.

If you’ve left your predictor in some weird state, you can reset it by calling clear(). This empties all the queues.

Return type:: None

submitOHE(sequence, label)

Submit a one-hot-encoded sequence.

Parameters:

sequence (ndarray[Any, dtype[uint8]]) – An (input-length x 4) ndarray containing the one-hot encoded sequence to predict.
label (Any) – Any object; it will be returned with the prediction.

Return type:

None

submitString(sequence, label)

Submit a given sequence for prediction.

Parameters:

sequence (str) – A string of length input-length
label (Any) – Any object. Label will be returned to you with the prediction.

Return type:

None

runBatch(maxSamples=None)

Actually run the batch.

Normally, this will be called by the submit functions, and it will also be called if you ask for output and the output queue is empty (assuming there are sequences waiting in the input queue.)

Parameters:: maxSamples (int | None) – (Optional) The maximum number of samples to run in this batch. It should probably be a multiple of the batch size.
Return type:: None

outputReady()

Is there any output ready for you?

Returns:: True if the batcher is sitting on results, and False otherwise.
Return type:: bool

empty()

Is it possible to getOutput()?

Returns:: True if predictions haven’t been made yet, but they would be made if you asked for output.
Return type:: bool

getOutput()

Return one of the predictions made by the model.

This implementation guarantees that predictions will be returned in the same order as they were submitted, but you should not rely on that in the future. Instead, you should use a label when you submit your sequences and use that to determine order.

Returns:: A two-tuple.
Return type:: tuple[list[ LOGIT_AR_T , LOGIT_T ], typing.Any]

The first element will be a list of length numHeads*2, representing the output from the model. Since the output of the model will always have a dimension representing the batch size, and this function only returns the result of running a single sequence, the dimension representing the batch size is removed. In other words, running the model on a single example would give a logits output of shape (1 x output-length x num-tasks). But this function will remove that, so you will get an array of shape (output-length x numTasks) As with calling the model directly, the first numHeads elements are the logits arrays, and then come the logcounts for each head. You can pass the logits and logcounts values to utils.logitsToProfile to get your profile.
The second element will be the label you passed in with the original sequence.

Graphically:

( [<head-1-logits>, <head-2-logits>, ...
   <head-1-logcounts>, <head-2-logcounts>, ...
  ],
  label)

class bpreveal.utils.ThreadedBatchPredictor(modelFname, batchSize, start=False, numThreads=1)

Mirrors the API of the BachPredictor class, but predicts in a separate thread.

This can give you a performance boost, and also lets you shut down the predictor thread when you don’t need it. Supports the with statement to only turn on the batcher when you’re using it, or you can leave it running in the background.

Usage examples:

predictor = utils.ThreadedBatchPredictor(modelFname, 64, start=True)
# Use as you would a normal batchPredictor
# When not needed any more:
predictor.stop()

Alternatively, you can use this as a context manager:

predictor = utils.ThreadedBatchPredictor(modelFname, 64, start=False)

with predictor:
    # use as a normal BatchPredictor.
# On leaving the context, the predictor is shut down.

The batcher guarantees that the order in which you get results is the same as the order you submitted them in, but this could change in the future!

Parameters:

modelFname (str) – The name of the model to use to make predictions.
batchSize (int) – The number of samples to calculate at once.
start (bool) – Should the predictor start right away?
numThreads (int) – How many predictors should be spawned?

start()

Spin up the batcher thread.

If you submit sequences without starting the batcher, this method will be called automatically (with a warning).

stop(): Shut down the processor thread.

clear(): Reset the batcher, emptying any queues and reloading the model.

submitOHE(sequence, label)

Submit a one-hot-encoded sequence.

Parameters:

sequence (ndarray[Any, dtype[uint8]]) – An (input-length x 4) ndarray containing the one-hot encoded sequence to predict.
label (Any) – Any object; it will be returned with the prediction.

Return type:

None

submitString(sequence, label)

Submit a given sequence for prediction.

Parameters:

sequence (str) – A string of length input-length
label (Any) – Any object. Label will be returned to you with the prediction.

Return type:

None

outputReady()

Is there any output ready for you?

Returns:: True if the batcher is sitting on results, and False otherwise.
Return type:: bool

empty()

Is it possible to getOutput()?

Returns:: True if predictions haven’t been made yet, but they would be made if you asked for output.
Return type:: bool

getOutput()

Get a single output.

Returns:: The model’s predictions.
Return type:: tuple[list[ LOGIT_AR_T , LOGCOUNT_T ], typing.Any]

Same semantics as BatchPredictor.getOutput.