interpretUtils

A big ol’ module that contains high-efficiency tools for calculating shap scores.

In user code, it’s actually not that bad to use this module. Here’s what you do.

Create a Generator.
Create a Saver.
Pass those to a Runner.
Call .run() on the Runner.
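
As a rough sketch of those four steps, here is how a flat interpretation might be wired together using only the constructor signatures documented further down this page. The helper name runFlatInterpretation and every concrete value (headID=0, numHeads=1, taskIDs=[0], batchSize=4, numShuffles=20, kmerSize=1) are placeholders picked for illustration, not recommendations, and the import is done inside the function so the sketch can be defined without BPReveal installed.

```python
def runFlatInterpretation(modelFname, sequences, inputLength):
    """Hypothetical helper: wire a ListGenerator and two FlatListSavers into a FlatRunner."""
    # Lazy import, so defining this function does not require BPReveal.
    from bpreveal import interpretUtils

    sequences = list(sequences)
    # Step 1: create a Generator.
    generator = interpretUtils.ListGenerator(sequences)
    # Step 2: create the Savers (FlatRunner takes one for profile, one for counts).
    profileSaver = interpretUtils.FlatListSaver(len(sequences), inputLength)
    countsSaver = interpretUtils.FlatListSaver(len(sequences), inputLength)
    # Step 3: pass those to a Runner. All numeric values here are placeholders.
    runner = interpretUtils.FlatRunner(
        modelFname=modelFname, headID=0, numHeads=1, taskIDs=[0],
        batchSize=4, generator=generator,
        profileSaver=profileSaver, countsSaver=countsSaver,
        numShuffles=20, kmerSize=1)
    # Step 4: call .run() on the Runner.
    runner.run()
    return profileSaver, countsSaver
```

Calling it would look something like runFlatInterpretation("model.keras", mySequences, inputLength), where the model file name is made up and each sequence is assumed to be exactly inputLength bases long.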

class bpreveal.interpretUtils.Query(oneHotSequence, passData, index)

A Query is what is passed to the batcher.

It has three things.

Parameters:

oneHotSequence (ndarray[Any, dtype[uint8]]) – The (input-length, NUM_BASES) one-hot encoded sequence to be explained.
passData (Any) – As with Result objects, this is either a tuple of (chromName, position) (for when you have a bed file) or a string with a fasta description line (for when you’re starting with a fasta). If you’re using the ListGenerator, then it can be anything.
index (int) – Indicates which output slot this data should be put in. Since there’s no guarantee that the results will arrive in order, we have to track which query was which.

Lifetime:

Created by a Generator’s iterator
_generatorThread iterates, and puts the Query in each inQueue.
inQueue is read by _batcherThread.
_batcherThread calls .add on the Batcher.
Batcher processes query.
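
To make the (input-length, NUM_BASES) shape of a Query's sequence concrete, here is a toy one-hot encoder. It is a sketch under several assumptions: NUM_BASES is taken to be 4, the column order is taken to be A, C, G, T, unknown bases are left as all-zero rows, and a nested list stands in for the uint8 numpy array. None of these details are confirmed by this page.

```python
NUM_BASES = 4        # assumed value of the module's NUM_BASES constant
BASE_ORDER = "ACGT"  # assumed column order, for illustration only


def toyOneHot(sequence):
    """Return a (len(sequence), NUM_BASES) nested list of 0/1 values."""
    rows = []
    for base in sequence.upper():
        row = [0] * NUM_BASES
        if base in BASE_ORDER:
            # Mark the column for this base; anything else (e.g. N) stays all-zero.
            row[BASE_ORDER.index(base)] = 1
        rows.append(row)
    return rows


# A hypothetical stand-in for a Query: the sequence, the pass-through data, and the slot.
toyQuery = {"oneHotSequence": toyOneHot("ACGTN"),
            "passData": ("chr1", 1234),
            "index": 0}
```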

class bpreveal.interpretUtils.Result

The base class for results.

Subclassed by PisaResult and FlatResult.

Lifetime:

Generated by a Batcher.
Batcher places in outQueue.
outQueue is read by saverThread
saverThread calls .add() on the Saver.
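
The Query and Result lifetimes above describe a generator, a batcher and a saver connected by two queues. The toy below mimics that flow with standard-library threads so you can see why index matters. Everything in it is a stand-in: the real module runs these stages in separate processes, ToyQuery and ToyResult are hypothetical, and the "score" is just a count of G bases.

```python
import queue
import threading
from collections import namedtuple

ToyQuery = namedtuple("ToyQuery", ["sequence", "passData", "index"])
ToyResult = namedtuple("ToyResult", ["score", "passData", "index"])


def generatorThread(sequences, inQueue):
    # One query per sequence, tagged with its position in the input.
    for i, seq in enumerate(sequences):
        inQueue.put(ToyQuery(seq, ("chr1", i), i))
    inQueue.put(None)  # None marks the end of the stream.


def batcherThread(inQueue, outQueue, batchSize):
    batch = []
    while True:
        query = inQueue.get()
        if query is not None:
            batch.append(query)
        if batch and (query is None or len(batch) == batchSize):
            # Emit in reverse to mimic results arriving out of order.
            for q in reversed(batch):
                outQueue.put(ToyResult(q.sequence.count("G"), q.passData, q.index))
            batch = []
        if query is None:
            outQueue.put(None)
            return


def saverThread(outQueue, slots):
    while True:
        result = outQueue.get()
        if result is None:
            return
        # The index, not the arrival order, decides where the result goes.
        slots[result.index] = result.score


def runToyPipeline(sequences, batchSize=2):
    inQueue, outQueue = queue.Queue(), queue.Queue()
    slots = [None] * len(sequences)
    threads = [
        threading.Thread(target=generatorThread, args=(sequences, inQueue)),
        threading.Thread(target=batcherThread, args=(inQueue, outQueue, batchSize)),
        threading.Thread(target=saverThread, args=(outQueue, slots)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return slots
```

Even though the toy batcher hands results back in reverse, runToyPipeline(["GGA", "ACT", "G"]) fills the slots in input order.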

class bpreveal.interpretUtils.PisaResult(inputPrediction, shufflePredictions, sequence, shap, passData, index)

The output from shapping a single base.

It contains a few things.

Parameters:

inputPrediction (ndarray[Any, dtype[float32]]) – A scalar floating point value, of the predicted logit from the input sequence at the base that was being shapped.
shufflePredictions (ndarray[Any, dtype[float32]]) – A (numShuffles,) numpy array of the logits returned by running predictions on the reference (shuffled) sequences, again evaluated at the position of the base that was being shapped.
sequence (ndarray[Any, dtype[uint8]]) – is a (receptive-field, NUM_BASES) numpy array of the one-hot encoded input sequence.
shap (ndarray[Any, dtype[float16]]) – is a (receptive-field, NUM_BASES) numpy array of shap scores.
passData (Any) – is data that is not touched by the batcher, but added by the generator and necessary for creating the output file. If the generator is reading from a bed file, then it is a tuple of (chromName, position) and that data should be used to populate the coords_chrom and coords_base fields. If the generator was using a fasta file, it is the title line from the original fasta, with its leading > removed.
index (int) – Indicates which address the data should be stored at in the output hdf5. Since there’s no order guarantee when you’re receiving data, we have to keep track of the order in the original input.

class bpreveal.interpretUtils.FlatResult(sequence, shap, passData, index)

A Result object that is given to savers for flat interpretation analysis.

Parameters:

sequence (ndarray[Any, dtype[uint8]]) – A one-hot encoded array of the sequence that was explained, of shape (input-length, NUM_BASES)
shap (ndarray[Any, dtype[float16]]) – An array of shape (input-length, NUM_BASES), containing the shap scores.
passData (Any) – is a (picklable) object that is passed through from the generator. For bed-based interpretations, it will be a three-tuple of (chrom, start, end). The start and end positions correspond to the INPUT to the model, so they are inflated with respect to the bed file. For fasta-based interpretations it will be a string.
index (int) – Gives the position in the output hdf5 where the scores should be saved.

class bpreveal.interpretUtils.Generator

The base class for generating Queries, for either PISA or flat interpretation.

Lifetime:

Initially created in user code. The generator at this point only contains serializable data.
Passed to a Runner.
The Runner passes it to a generatorThread, which runs in a separate Process.
generatorThread calls .construct(), and the Generator opens any files it needs.
generatorThread iterates over the Generator until it is exhausted.
generatorThread calls .done().

construct()

Set up the generator (in the child thread).

When the generator is about to start, this method is called once before requesting the iterator.

Return type:: None

done()

Called by generatorThread once the Generator is exhausted, to free any allocated resources.

Return type:: None

class bpreveal.interpretUtils.Saver

Class to receive Results and deal with them.

Descendants of this class shall receive Results from the batcher and save them to an appropriate structure, usually on disk. The user creates a Saver object in the main thread, and then the saver gets construct()ed inside a separate thread created by the runners. Therefore, in __init__ you should only create members that are serializable. Anything that is not (an open output file, say) should be created in construct().

Lifetime:

Created in user code with only serializable data.
Passed to a Runner, which passes it to a saverThread in a new process.
saverThread calls .construct().
saverThread reads from a queue and repeatedly calls .add().
saverThread calls .done().
The Runner, once saverThread is joined, calls .parentFinish() in the parent thread.

construct()

Do any setup necessary in the child thread.

This function should actually open the output files, since it will be called from inside the actual saver thread.

Return type:: None

add(result)

Add the given Result to wherever you’re saving them out.

Note that this will not be called with None at the end of the run; that case is special-cased in the code that runs your saver. (That code will call done() when all of the results have been added.)

Parameters:: result (Result)
Return type:: None

parentFinish()

Called in the parent thread when the saver is done.

Usually, there’s nothing to do, but the parent thread might need to close shared memory, so this function is guaranteed to be called by the Runners.

Return type:: None

done()

Called when the batcher is complete (indicated by putting a None in its output queue).

Now is the time to close all of your files. This function is called in the child thread, not the parent thread.

Return type:: None
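
Here is a minimal sketch of the Saver lifecycle described above. ToyTextSaver is a hypothetical stand-in that does not inherit from the real Saver (so the example runs without BPReveal), and its "results" are plain (index, passData) pairs rather than Result objects. The point is the division of labour: __init__ keeps only a file name, construct() opens the file, and done() closes it.

```python
import os
import tempfile


class ToyTextSaver:
    """Stand-in that follows the Saver lifecycle; a real one would subclass Saver."""

    def __init__(self, outputFname):
        # Only serializable state lives here: a file name, not an open handle.
        self.outputFname = outputFname
        self._fp = None

    def construct(self):
        # In the real pipeline this runs in the child, so files are opened here.
        self._fp = open(self.outputFname, "w")

    def add(self, result):
        # result is a hypothetical (index, passData) pair, in arrival order.
        index, passData = result
        self._fp.write(f"{index}\t{passData}\n")

    def done(self):
        # Called once all results have been added; close everything.
        self._fp.close()
        self._fp = None

    def parentFinish(self):
        # Nothing to hand back to the parent in this toy.
        pass


def demoSaverLifecycle():
    """Drive the saver by hand, in the same order the Runner would."""
    with tempfile.TemporaryDirectory() as tmpDir:
        saver = ToyTextSaver(os.path.join(tmpDir, "out.tsv"))
        saver.construct()
        for result in [(1, "seqB"), (0, "seqA")]:
            saver.add(result)
        saver.done()
        saver.parentFinish()
        with open(saver.outputFname) as fp:
            return fp.read()
```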

class bpreveal.interpretUtils.FlatRunner(modelFname, headID, numHeads, taskIDs, batchSize, generator, profileSaver, countsSaver, numShuffles, kmerSize)

Runs flat shap interpretation and hands the profile and counts scores to their Savers.

I try to avoid class-based wrappers around simple things, but this is not simple. This class creates threads to read in data, and then creates two threads that run interpretation samples in batches. Finally, it takes the results from the interpretation thread and saves those out to an hdf5-format file.

Parameters:

modelFname (str) – is the name of the model on disk
headID (int) – The head to shap.
taskIDs (list[int]) – The tasks to shap, obviously. Typically, you’d want to interpret all of the tasks in a head, so for a two-task head, taskIDs would be [0,1].
batchSize (int) – is the shap batch size, which should be your usual batch size divided by the number of shuffles. (since the original sequence and shuffles are run together)
generator (Generator) – is a Generator object that will be passed to _generatorThread
profileSaver (Saver) – is a Saver object that will be used to save out the profile shap data.
countsSaver (Saver) – is a Saver object that will be used to save out the counts shap data.
numShuffles (int) – is the number of reference sequences that should be generated for each shap evaluation. (I recommend 20 or so)
kmerSize (int) – Should the shuffle preserve k-mer distribution? If kmerSize == 1, then no. If kmerSize > 1, preserve the distribution of kmers of the given size.
numHeads (int)
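
The batchSize advice is simple arithmetic, sketched below. The helper shapBatchSize is hypothetical, and both the rounding down and the floor of 1 are my assumptions about how you'd want to handle numbers that don't divide evenly.

```python
def shapBatchSize(usualBatchSize, numShuffles):
    """Hypothetical helper: usual batch size divided by numShuffles, never below 1."""
    # Each query is run alongside its shuffles, so one shap "sample" costs
    # roughly numShuffles ordinary samples.
    return max(1, usualBatchSize // numShuffles)
```

So with a usual batch size of 128 and 20 shuffles, this gives a shap batch size of 6.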

run()

Start up the threads and wait for them to finish.

Return type:: None

class bpreveal.interpretUtils.PisaRunner(modelFname, headID, taskID, batchSize, generator, saver, numShuffles, receptiveField, kmerSize, numBatchers)

Tool to run PISA batches.

I try to avoid class-based wrappers around simple things, but this is not simple. This class creates threads to read in data, and then creates a thread that runs PISA samples in batches. Finally, it takes the results from the PISA thread and saves those out to an hdf5-format file.

Parameters:

modelFname (str) – is the name of the model on disk
headID (int) – The head to shap.
taskID (int) – The task to shap.
batchSize (int) – is the shap batch size, which should be your usual batch size divided by the number of shuffles. (since the original sequence and shuffles are run together)
generator (Generator) – is a Generator object that will be passed to _generatorThread
saver (Saver) – is a Saver object that will be used to save out the data.
numShuffles (int) – is the number of reference sequences that should be generated for each shap evaluation. (I recommend 20 or so)
receptiveField (int) – is the receptive field of the model. To save on writing a lot of zeroes, the result objects only contain bases that are in the receptive field of the base being shapped.
kmerSize (int) – Should the shuffle preserve k-mer distribution? If kmerSize == 1, then no. If kmerSize > 1, preserve the distribution of kmers of the given size.
numBatchers (int) – How many parallel batchers should be run? I find that my GPU usage is very efficient with three, but you can use just one if you’re running into memory issues.

run()

Start up the threads and wait for them to finish.

Return type:: None

class bpreveal.interpretUtils.FlatListSaver(numSamples, inputLength)

A simple Saver that holds the results in memory so you can use them immediately.

Since the Saver is created in its own thread, just storing the results in this object doesn’t work - they get removed when the writer process completes. So we need to create some shared memory. This Saver takes care of that for sequences and shap scores, but discards passData, since we don’t know a priori how large passData objects will be. In a typical use case, you’d use this in situations where you already know which sequence is which, so saving passData doesn’t really make sense anyway.

Parameters:

numSamples (int) – The total number of samples that this saver will get. It has to know this during construction so it can allocate enough memory.
inputLength (int) – The input length of the model.

construct()

Set up the data sets.

This is run in the child process.

Return type:: None

parentFinish()

Extract the data from the child process.

This must be called to load the shap data from the child, since it’s currently packed away inside a linear array. I could just expose _outShapArray, but this reorganizes it in a much more intuitive way.

Return type:: None

done()

Copy over the data from the child thread to the parent.

This method is called from the child process.

Return type:: None

add(result)

Add the result to the internal list.

This is called from the child process.

Parameters:: result (FlatResult)
Return type:: None

class bpreveal.interpretUtils.FlatH5Saver(outputFname, numSamples, inputLength, genome=None, useTqdm=False, config=None)

Saves the shap scores to the output file.

Parameters:

outputFname (str) – is the name of the hdf5-format file that the shap scores will be deposited in.
numSamples (int) – is the number of regions that will be interpreted.
inputLength (int) – The input length of the model.
genome (str | None) – (Optional) Gives the name of a fasta-format file that contains the genome of the organism. If provided, then chromosome name and size information will be included in the output, and, additionally, two other datasets will be created: coords_chrom, and coords_base.
useTqdm (bool) – Should a progress bar be displayed?
config (str | None)

construct()

Set up the data sets for writing.

This is called inside the child thread.

Return type:: None

done()

Close up shop.

Called in the child process.

Return type:: None

add(result)

Add the given result to the output file.

Parameters:: result (FlatResult) – The output from the batcher.
Return type:: None

class bpreveal.interpretUtils.PisaH5Saver(outputFname, numSamples, numShuffles, receptiveField, genome=None, useTqdm=False, config=None)

Saves the shap scores to the output file.

Parameters:

outputFname (str) – is the name of the hdf5-format file that the shap scores will be deposited in.
numSamples (int) – is the number of regions (i.e., bases) that PISA will be run on.
numShuffles (int) – is the number of shuffles that are used to generate the reference. This is needed because we store reference predictions.
receptiveField (int) – How wide is the model’s receptive field?
genome (str | None) – an optional parameter, gives the name of a fasta-format file that contains the genome of the organism. If provided, then chromosome name and size information will be included in the output, and, additionally, two other datasets will be created: coords_chrom, and coords_base.
useTqdm (bool) – Should a progress bar be displayed?
config (str | None)

construct()

Run in the child thread.

Return type:: None

done()

Called from the child process.

Return type:: None

add(result)

Add the given result to the output file.

Parameters:: result (PisaResult) – The output from the batcher.
Return type:: None

class bpreveal.interpretUtils.ListGenerator(sequences, passDataList=None)

A very simple Generator that is initialized with an iterable of strings.

(A list of strings is an iterable, but this works with generator functions and other things too!)

Parameters:

sequences (Iterable[str]) – Any iterable that yields strings, like a list of strings. Note that the constructor immediately converts whatever you pass in into a list, so very large iterables will consume a lot of memory.
passDataList (list | None) – (optional) Will be passed through the batcher to the saver.

construct()

Set up stuff in the child thread.

Note that this doesn’t load data - because the child thread is forked from the parent, it already contains the lists of data.

Return type:: None

done()

Called in the child thread, does nothing.

Return type:: None

class bpreveal.interpretUtils.FastaGenerator(fastaFname)

Reads a fasta file from disk and generates Queries from it.

Parameters:: fastaFname (str) – The name of the fasta-format file containing query sequences.
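
For fasta input, passData is the title line with its leading > removed. The sketch below shows one way such (title, sequence) pairs could be pulled out of a fasta file. iterFasta is a hypothetical helper written for this example, not the reader that FastaGenerator actually uses, and joining wrapped sequence lines is my assumption.

```python
import io


def iterFasta(handle):
    """Yield (title, sequence) pairs; the title has its leading '>' removed."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts, so emit the previous one if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


exampleFasta = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(iterFasta(exampleFasta))
```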

construct()

Open the file and start reading.

Return type:: None

done()

Close the fasta file.

Return type:: None

class bpreveal.interpretUtils.FlatBedGenerator(bedFname, genomeFname, inputLength, outputLength)

Reads in lines from a bed file and fetches the genomic sequence around them.

Note that the regions should have length outputLength, and they will be automatically padded to the appropriate input length.

Parameters:

bedFname (str) – The bed file to read.
genomeFname (str) – The genome fasta that sequences will be drawn from.
inputLength (int) – The input length of your model.
outputLength (int) – The output length of your model.
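
The padding from outputLength up to inputLength can be sketched as below. This assumes the padding is split evenly on both sides of the region, which is the usual convention but is not spelled out on this page, and inflateRegion is a hypothetical helper rather than part of the module.

```python
def inflateRegion(start, end, inputLength, outputLength):
    """Hypothetical helper: grow an outputLength-wide region to inputLength."""
    if end - start != outputLength:
        raise ValueError("Region must be exactly outputLength wide.")
    extra = inputLength - outputLength
    if extra < 0 or extra % 2 != 0:
        raise ValueError("This sketch assumes an even, non-negative padding.")
    # Assumption: the padding is split evenly between the two sides.
    padding = extra // 2
    return start - padding, end + padding
```

For a region from 5000 to 6000 with inputLength 3056 and outputLength 1000, this gives (3972, 7028). That inflated pair is the sort of (start, end) that a FlatResult's passData is described as carrying.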

construct()

Open the bed file and fasta genome.

Return type:: None

done()

Close the fasta file.

Return type:: None

class bpreveal.interpretUtils.PisaBedGenerator(bedFname, genomeFname, inputLength, outputLength)

Reads in lines from a bed file and fetches the genomic sequence at every base.

This is very different from the FlatBedGenerator, which generates one sequence query per bed file entry. This class generates a query for every base that the bed file contains.

Parameters:

bedFname (str) – The bed file to read.
genomeFname (str) – The genome fasta that sequences will be drawn from.
inputLength (int) – The input length of your model.
outputLength (int) – The output length of your model.
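
To show what "a query for every base" means, here is a toy expansion of bed lines into (chromName, position) pairs, the same shape that passData takes for bed-based PISA. iterBedBases is a hypothetical helper. It relies on the standard bed convention of zero-based, half-open intervals and leaves out the sequence fetching entirely.

```python
def iterBedBases(bedLines):
    """Yield (chromName, position) for every base covered by each bed line."""
    for line in bedLines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Bed intervals are zero-based and half-open, so end is excluded.
        for position in range(start, end):
            yield chrom, position


toyBed = ["chr1\t10\t13\n", "chr2\t5\t6\n"]
toyBases = list(iterBedBases(toyBed))
```

A bed file covering a few kilobases therefore turns into a few thousand queries, which is why numSamples for a PISA saver counts bases rather than regions.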

construct()

Run in the child thread, opens up the files and reads the bed.

Return type:: None

done()

Close the fasta.

Return type:: None

bpreveal.interpretUtils.combineMultAndDiffref(mult, orig_inp, bg_data)

Combine the shap multipliers and difference from reference to generate hypothetical scores.

This is injected deep into shap and generates the hypothetical importance scores.
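
For orientation, here is a plain-Python sketch of the hypothetical-score recipe commonly used with DeepShap on one-hot sequences. Whether combineMultAndDiffref matches it exactly is an assumption on my part. The sketch handles a single input, takes nested lists of shape (numShuffles, length, numBases), and leaves out the orig_inp argument because it never needs it.

```python
def toyHypotheticalScores(mult, bgData):
    """Average, over shuffles, of (hypothetical base - reference) * multiplier."""
    numShuffles = len(bgData)
    length = len(bgData[0])
    numBases = len(bgData[0][0])
    scores = [[0.0] * numBases for _ in range(length)]
    for s in range(numShuffles):
        for pos in range(length):
            for hypBase in range(numBases):
                total = 0.0
                for channel in range(numBases):
                    # Pretend the base at this position were hypBase.
                    hypInput = 1.0 if channel == hypBase else 0.0
                    diffFromRef = hypInput - bgData[s][pos][channel]
                    total += diffFromRef * mult[s][pos][channel]
                scores[pos][hypBase] += total / numShuffles
    return scores


# One shuffle, one position: the reference has base 1, multipliers are 1..4.
toyScores = toyHypotheticalScores([[[1, 2, 3, 4]]], [[[0, 1, 0, 0]]])
```

In this toy case the score for the reference's own base comes out as zero, since a hypothetical base that matches the reference has no difference to attribute.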

bpreveal.interpretUtils.isShappable(model)

Checks to see if the model can be shapped.

Early versions of BPReveal created combined and transformation models that were incompatible with DeepShap.

Parameters:: model (keras.Model) – The (loaded) Keras model that you want to check.
Returns:: True if it’s safe to shap the model.
Return type:: bool