internal.interpretUtils
A big ol’ module that contains high-efficiency tools for calculating shap scores.
In user code, it’s actually not that bad to use this module. Here’s what you do.
Create a Generator.
Create a Saver.
Pass those to an InterpRunner.
Call .run() on the Runner.
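The four steps above follow a producer/consumer shape. The sketch below is a toy stand-in, not BPReveal code: ToyGenerator, ToySaver and ToyRunner are hypothetical names, threads replace the separate processes, and the "batcher" counts G and C bases instead of running a model. It is only meant to show how the pieces hand data to each other.

```python
import queue
import threading


class ToyGenerator:
    """Hypothetical stand-in for a Generator: yields (sequence, passData, index)."""

    def __init__(self, sequences):
        self.sequences = list(sequences)

    def construct(self):
        pass  # A real generator would open its files here.

    def __iter__(self):
        for index, seq in enumerate(self.sequences):
            yield seq, f"seq{index}", index

    def done(self):
        pass  # A real generator would close its files here.


class ToySaver:
    """Hypothetical stand-in for a Saver: stores one score per index."""

    def __init__(self, numSamples):
        self.results = [None] * numSamples
        self.finished = False

    def construct(self):
        pass

    def add(self, result):
        score, index = result
        self.results[index] = score

    def done(self):
        pass

    def parentFinish(self):
        self.finished = True


class ToyRunner:
    """Hypothetical stand-in for InterpRunner, using threads, not processes."""

    def __init__(self, generator, saver):
        self.generator = generator
        self.saver = saver
        self.inQueue = queue.Queue()
        self.outQueue = queue.Queue()

    def _generatorThread(self):
        self.generator.construct()
        for query in self.generator:
            self.inQueue.put(query)
        self.generator.done()
        self.inQueue.put(None)  # Sentinel: no more queries.

    def _batcherThread(self):
        while (query := self.inQueue.get()) is not None:
            seq, _passData, index = query
            # Placeholder "interpretation": count the G and C bases.
            self.outQueue.put((seq.count("G") + seq.count("C"), index))
        self.outQueue.put(None)  # Sentinel: no more results.

    def _saverThread(self):
        self.saver.construct()
        while (result := self.outQueue.get()) is not None:
            self.saver.add(result)
        self.saver.done()

    def run(self):
        targets = (self._generatorThread, self._batcherThread, self._saverThread)
        threads = [threading.Thread(target=t) for t in targets]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        self.saver.parentFinish()  # Runs in the parent once everything is joined.


generator = ToyGenerator(["ACGT", "GGCC", "ATAT"])  # 1. Create a Generator.
saver = ToySaver(numSamples=3)                      # 2. Create a Saver.
runner = ToyRunner(generator, saver)                # 3. Pass those to a runner.
runner.run()                                        # 4. Call .run().
print(saver.results)
```

With the real classes, the same four lines at the bottom would use one of the Generators and Savers documented below together with an InterpRunner.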
- class bpreveal.internal.interpretUtils.Query(oneHotSequence, passData, index)
A Query is what is passed to the batcher.
It has three things.
- Parameters:
oneHotSequence (ndarray[tuple[Any, ...], dtype[uint8]]) – The (input-length, NUM_BASES) one-hot encoded sequence of the current base.
passData (Any) – As with Result objects, is either a tuple of (chromName, position) (for when you have a bed file) or a string with a fasta description line (for when you’re starting with a fasta). If you’re using the ListGenerator, then it can be anything.
index (int) – Indicates which output slot this data should be put in. Since there’s no guarantee that the results will arrive in order, we have to track which query was which.
Lifetime:
Created by a Generator’s iterator
_generatorThread iterates, and puts the Query in each inQueue.
inQueue is read by _batcherThread.
_batcherThread calls .add on the Batcher.
Batcher processes query.
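To make the three fields of a Query concrete, here is a minimal sketch that builds one from a string. ToyQuery and oneHotEncode are hypothetical names, nested lists stand in for numpy arrays, and the A, C, G, T channel order (with N encoded as an all-zero row) is an assumption rather than something this page states.

```python
from typing import Any, NamedTuple

NUM_BASES = 4
BASE_ORDER = "ACGT"  # Assumed channel order; check your own encoder.


class ToyQuery(NamedTuple):
    """Hypothetical stand-in mirroring the three fields of a Query."""
    oneHotSequence: list
    passData: Any
    index: int


def oneHotEncode(sequence: str) -> list:
    """Return an (input-length, NUM_BASES) nested list of 0/1 values.

    Bases outside BASE_ORDER (such as N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        row = [0] * NUM_BASES
        pos = BASE_ORDER.find(base)
        if pos >= 0:
            row[pos] = 1
        rows.append(row)
    return rows


# passData here imitates the (chromName, position) tuple used with bed files.
query = ToyQuery(oneHotEncode("ACGTN"), ("chr1", 1000), 0)
print(len(query.oneHotSequence), len(query.oneHotSequence[0]))
```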
- class bpreveal.internal.interpretUtils.Result(inputPrediction, shufflePredictions, sequence, shap, passData, index)
A result from doing shap.
Lifetime:
Generated by a Batcher.
Batcher places in outQueue.
outQueue is read by saverThread
saverThread calls .add() on the Saver.
- Parameters:
inputPrediction (ndarray[tuple[Any, ...], dtype[float32]]) – A scalar floating point value, the value of the metric on the given input window.
shufflePredictions (ndarray[tuple[Any, ...], dtype[float32]]) – A (numShuffles,) numpy array of the values of the metric on each of the shuffled sequences.
sequence (ndarray[tuple[Any, ...], dtype[uint8]]) – is a (input-length, NUM_BASES) numpy array of the one-hot encoded input sequence.
shap (ndarray[tuple[Any, ...], dtype[float16]]) – is a (input-length, NUM_BASES) numpy array of shap scores.
passData (Any) – is data that is not touched by the batcher, but added by the generator and necessary for creating the output file. If the generator is reading from a bed file, then it is a tuple of (chromName, position) and that data should be used to populate the coords_chrom and coords_base fields. If the generator was using a fasta file, it is the title line from the original fasta, with its leading > removed.
index (int) – Indicates which address the data should be stored at in the output hdf5. Since there’s no order guarantee when you’re receiving data, we have to keep track of the order in the original input. It is up to the Saver to re-organize the Results that it gets if the Saver wants to make an order guarantee.
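Because Results can arrive in any order, a Saver that wants ordered output has to place each one by its index. The following is a minimal sketch of that idea only; ToyResult and OrderedToySaver are hypothetical names and carry just the fields needed for the illustration.

```python
import random
from typing import Any, NamedTuple


class ToyResult(NamedTuple):
    """Hypothetical stand-in carrying only the fields needed here."""
    inputPrediction: float
    passData: Any
    index: int


class OrderedToySaver:
    """Keeps results in input order no matter when they arrive."""

    def __init__(self, numSamples: int):
        self.slots = [None] * numSamples

    def add(self, result: ToyResult) -> None:
        # The index, not the arrival order, decides where the data goes.
        self.slots[result.index] = result


results = [ToyResult(0.1 * i, ("chr1", 100 + i), i) for i in range(5)]
arrivals = results[:]
random.Random(0).shuffle(arrivals)  # Simulate out-of-order delivery.

saver = OrderedToySaver(numSamples=5)
for result in arrivals:
    saver.add(result)

print([r.passData[1] for r in saver.slots])
```

Preallocating one slot per sample is also why the concrete Savers below ask for numSamples up front.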
- class bpreveal.internal.interpretUtils.Generator
The base class for generating PISA samples.
Lifetime:
Initially created in user code. The generator at this point only contains serializable data.
Passed to a Runner.
The Runner passes it to a generatorThread, which runs in a separate Process.
generatorThread calls .construct(), and the Generator opens any files it needs.
generatorThread iterates over the Generator until it is exhausted.
generatorThread calls .done().
- construct()
Set up the generator (in the child thread).
When the generator is about to start, this method is called once before requesting the iterator.
- Return type:
None
- done()
Called once the generator has been exhausted, to free any allocated resources.
- Return type:
None
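The split between creating a Generator and calling construct() exists so that the object can be handed to another process before it holds anything unserializable. Here is a small sketch of that pattern using pickle as a proxy for "can be sent to a child process". ToyFileGenerator is a hypothetical class, not part of BPReveal.

```python
import os
import pickle
import tempfile


class ToyFileGenerator:
    """Hypothetical generator: stores a file name, opens it in construct()."""

    def __init__(self, fname):
        self.fname = fname  # Plain string: safe to pickle.
        self._fp = None     # No open handle yet.

    def construct(self):
        self._fp = open(self.fname)

    def __iter__(self):
        for index, line in enumerate(self._fp):
            yield line.strip(), index

    def done(self):
        self._fp.close()
        self._fp = None


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACGT\nTTGA\n")

gen = ToyFileGenerator(tmp.name)
# Before construct(), the object survives a pickle round trip.
clone = pickle.loads(pickle.dumps(gen))

clone.construct()
lines = list(clone)
clone.done()
print(lines)

# After construct(), the open file handle makes the object unpicklable.
gen.construct()
try:
    pickle.dumps(gen)
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False
gen.done()
os.remove(tmp.name)
print(picklable)
```

The same reasoning applies to Savers, which is why their __init__ should also store only serializable members.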
- class bpreveal.internal.interpretUtils.Saver
Class to receive Results and deal with them.
Descendants of this class shall receive Results from the batcher and save them to an appropriate structure, usually on disk. The user creates a Saver object in the main thread, and then the saver gets construct()ed inside a separate thread created by the runners. Therefore, you should only create members that are serializable in __init__.
Lifetime:
Created in user code with only serializable data.
Passed to a Runner, which passes it to a saverThread in a new process.
saverThread calls .construct().
saverThread reads from a queue and repeatedly calls .add().
saverThread calls .done().
The Runner, once saverThread is joined, calls .parentFinish() in the parent thread.
- construct()
Do any setup necessary in the child thread.
This function should actually open the output files, since it will be called from inside the actual saver thread.
- Return type:
None
- add(result)
Add the given Result to wherever you’re saving them out.
- Parameters:
result (Result) – The thing to save out.
- Raises:
NotImplementedError – because this is abstract.
- Return type:
None
Note that this will not be called with None at the end of the run; that case is special-cased in the code that runs your saver. (That code will call done() when all of the results have been added.)
- parentFinish()
Called in the parent thread when the saver is done.
Usually, there’s nothing to do, but the parent thread might need to close shared memory, so this function is guaranteed to be called by the Runners.
- Return type:
None
- done()
Called when the batcher is complete (indicated by putting a None in its output queue).
Now is the time to close all of your files. This function is called in the child thread, not the parent thread.
- Return type:
None
- class bpreveal.internal.interpretUtils.InterpRunner(modelFname, metrics, batchSize, generator, savers, numShuffles, kmerSize, numThreads, backend, useHypotheticalContribs=None, shuffler=None)
Runs shap scores.
I try to avoid class-based wrappers around simple things, but this is not simple. This class creates threads to read in data, and then creates two threads that run interpretation samples in batches. Finally, it takes the results from the interpretation thread and saves those out to an hdf5-format file.
- Parameters:
modelFname (str) – is the name of the model on disk
metrics (list[Callable]) – a list of functions that accept a model and return a scalar output. These are the values that will be explained.
batchSize (int) – is the shap batch size, which should be your usual batch size divided by the number of shuffles (since the original sequence and shuffles are run together).
generator (Generator) – is a Generator object that will be passed to _generatorThread
savers (list[Saver]) – is a list of Saver objects that will be used to save out the data. There should be one saver per metric.
numShuffles (int) – is the number of reference sequences that should be generated for each shap evaluation. (I recommend 20 or so.) (Only applicable when backend is "shap".)
kmerSize (int) – Should the shuffle preserve k-mer distribution? If kmerSize == 1, then no. If kmerSize > 1, preserve the distribution of kmers of the given size. If using the ism backend, then this parameter controls the width of the shuffled region (in which case it must be an odd number).
backend (str) – If “shap”, use the DeepShap backend for interpretation scores. If “ism”, use scanning mutagenesis instead.
useHypotheticalContribs (bool | None) – (Only valid with shap backend.) If True, calculate hypothetical contribution scores as was done in the original paper. If False, then use the normal shap algorithm. Normally, you set this to True for scores you intend to feed to MoDISco and False for PISA.
shuffler (Callable | None) – (Only valid with the ism backend.) The function to use to generate shuffles of the sequences. The width of the shuffled sequence will be taken from kmerSize, which must be an odd number when using the ism backend. This function will take two arguments. First, a one-hot encoded sequence to generate shuffles from, and second, an integer giving the position in the overall input where this shuffle is being performed. It should return a list of one-hot encoded sequences.
numThreads (int)
- run()
Start up the threads and wait for them to finish.
- Return type:
None
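For the ism backend, the shuffler argument is described above only by its call shape: it takes a one-hot sequence and a position, and returns a list of one-hot sequences. The sketch below shows one function with that shape. It is an illustration under assumptions: toyShuffler and NUM_SHUFFLES are hypothetical, nested lists stand in for numpy arrays, and it assumes the function is handed the kmerSize-wide window rather than the full input.

```python
import random

NUM_SHUFFLES = 3  # Hypothetical: how many replacement windows to return.


def toyShuffler(oneHotWindow, position):
    """Hypothetical shuffler: permute the rows (bases) of the window.

    oneHotWindow is assumed to be a (kmerSize, NUM_BASES) nested list, and
    position is the location of the window in the full input.
    """
    if len(oneHotWindow) % 2 != 1:
        raise ValueError("window width (kmerSize) must be odd for ism")
    # Seed from the position so the same window always gets the same shuffles.
    rng = random.Random(position)
    shuffles = []
    for _ in range(NUM_SHUFFLES):
        rows = [row[:] for row in oneHotWindow]
        rng.shuffle(rows)
        shuffles.append(rows)
    return shuffles


window = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]  # "ACT", width 3
out = toyShuffler(window, position=10)
print(len(out), sorted(out[0]) == sorted(window))
```

Each returned window keeps the base composition of the original, which is one reasonable choice of reference; other choices (for example, every possible substitution) would fit the same call shape.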
- class bpreveal.internal.interpretUtils.FlatListSaver(numSamples, inputLength)
A simple Saver that holds the results in memory so you can use them immediately.
Since the Saver is created in its own thread, just storing the results in this object doesn’t work - they get removed when the writer process completes. So we need to create some shared memory. This Saver takes care of that for sequences, shap scores, and input predictions, but discards passData, since we don’t know a priori how large passData objects will be. In a typical use case, you’d use this in situations where you already know which sequence is which, so saving passData doesn’t really make sense anyway.
- Parameters:
numSamples (int) – The total number of samples that this saver will get. It has to know this during construction so it can allocate enough memory.
inputLength (int) – The input length of the model.
- construct()
Set up the data sets.
This is run in the child process.
- Return type:
None
- parentFinish()
Extract the data from the child process.
This must be called to load the shap data from the child, since it’s currently packed away inside a linear array. I could just expose _outShapArray, but this reorganizes it in a much more intuitive way.
- Return type:
None
- done()
Copy over the data from the child thread to the parent.
This method is called from the child process.
- Return type:
None
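The "linear array" mentioned under parentFinish() can be pictured as follows. This is a sketch of the general idea only: a plain array.array stands in for the shared-memory buffer, flatIndex and unflatten are hypothetical helpers, and the row-major (sample, position, base) layout is an assumption.

```python
from array import array

NUM_BASES = 4


def flatIndex(sample, position, base, inputLength):
    """Offset of one shap value inside the linear buffer (row-major)."""
    return (sample * inputLength + position) * NUM_BASES + base


def unflatten(flat, numSamples, inputLength):
    """Rebuild a [numSamples][inputLength][NUM_BASES] nested list."""
    return [[[flat[flatIndex(s, p, b, inputLength)] for b in range(NUM_BASES)]
             for p in range(inputLength)]
            for s in range(numSamples)]


numSamples, inputLength = 2, 3
# A plain array stands in for the shared-memory buffer here. Its size must be
# known up front, which is why the saver needs numSamples and inputLength.
flat = array("f", [0.0] * (numSamples * inputLength * NUM_BASES))

# Child side: write sample 1, position 2, base 3.
flat[flatIndex(1, 2, 3, inputLength)] = 0.5

# Parent side: reorganize into something indexable by sample.
shap = unflatten(flat, numSamples, inputLength)
print(shap[1][2][3])
```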
- class bpreveal.internal.interpretUtils.FlatH5Saver(outputFname, numSamples, inputLength, genome=None, useTqdm=False, config=None)
Saves the shap scores to the output file.
- Parameters:
outputFname (str) – is the name of the hdf5-format file that the shap scores will be deposited in.
numSamples (int) – is the number of regions that shap scores will be calculated on.
inputLength (int) – The input length of the model.
genome (str | None) – (Optional) Gives the name of a fasta-format file that contains the genome of the organism. If provided, then chromosome name and size information will be included in the output, and, additionally, two other datasets will be created: coords_chrom, and coords_base.
useTqdm (bool) – Should a progress bar be displayed?
config (str | None)
- construct()
Set up the data sets for writing.
This is called inside the child thread.
- Return type:
None
- done()
Close up shop.
Called in the child process.
- Return type:
None
- class bpreveal.internal.interpretUtils.PisaH5Saver(outputFname, numSamples, numShuffles, receptiveField, genome=None, useTqdm=False, config=None)
Saves the shap scores to the output file.
- Parameters:
outputFname (str) – is the name of the hdf5-format file that the shap scores will be deposited in.
numSamples (int) – is the number of regions (i.e., bases) that PISA will be run on.
numShuffles (int) – is the number of shuffles that are used to generate the reference. This is needed because we store reference predictions.
receptiveField (int) – How wide is the model’s receptive field?
genome (str | None) – an optional parameter, gives the name of a fasta-format file that contains the genome of the organism. If provided, then chromosome name and size information will be included in the output, and, additionally, two other datasets will be created: coords_chrom, and coords_base.
useTqdm (bool) – Should a progress bar be displayed?
config (str | None)
- construct()
Run in the child thread.
- Return type:
None
- done()
Close out any open files before the child process exits.
Called from the child process.
- Return type:
None
- class bpreveal.internal.interpretUtils.ListGenerator(sequences, passDataList=None)
A very simple Generator that is initialized with an iterable of strings.
(A list of strings is an iterable, but this works with generator functions and other things too!)
- Parameters:
sequences (Iterable[str]) – Any iterable that yields strings, like a list of strings. Note that the constructor immediately converts whatever you pass in into a list, so very large iterables will consume a lot of memory.
passDataList (list | None) – (optional) Will be passed through the batcher to the saver.
- construct()
Set up stuff in the child thread.
Note that this doesn’t load data - because the child thread is forked from the parent, it already contains the lists of data.
- Return type:
None
- done()
Called in the child thread, does nothing.
- Return type:
None
- class bpreveal.internal.interpretUtils.FastaGenerator(fastaFname)
Reads a fasta file from disk and generates Queries from it.
- Parameters:
fastaFname (str) – The name of the fasta-format file containing query sequences.
- construct()
Open the file and start reading.
- Return type:
None
- done()
Close the Fasta file.
- Return type:
None
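As described for Result.passData, queries that come from a fasta file carry the title line with its leading > removed. The sketch below shows that convention with a tiny reader. readFasta is a hypothetical helper written for this example and is not the parser BPReveal uses.

```python
import io


def readFasta(handle):
    """Yield (title, sequence) pairs; title has its leading '>' removed."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


fasta = io.StringIO(">seq1 first example\nACGT\nTTAA\n>seq2\nGGCC\n")
records = list(readFasta(fasta))
for index, (title, seq) in enumerate(records):
    # title plays the role of passData, index the role of Query.index.
    print(index, title, seq)
```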
- class bpreveal.internal.interpretUtils.FlatBedGenerator(bedFname, genomeFname, inputLength, outputLength)
Reads in lines from a bed file and fetches the genomic sequence around them.
Note that the regions should have length outputLength, and they will be automatically padded to the appropriate input length.
- Parameters:
bedFname (str) – The bed file to read.
genomeFname (str) – The genome fasta that sequences will be drawn from.
inputLength (int) – The input length of your model.
outputLength (int) – The output length of your model.
- construct()
Open the bed file and fasta genome.
- Return type:
None
- done()
Close the fasta file.
- Return type:
None
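The padding from outputLength to inputLength can be sketched as simple interval arithmetic. The helper below is hypothetical and assumes the padding is split evenly on both sides of the region; the lengths in the example are made up for illustration.

```python
def padRegion(start, end, inputLength, outputLength):
    """Expand an outputLength-wide region to inputLength, centered.

    Assumes symmetric padding, so inputLength - outputLength must be even.
    """
    if end - start != outputLength:
        raise ValueError("region must be exactly outputLength wide")
    extra = inputLength - outputLength
    if extra < 0 or extra % 2 != 0:
        raise ValueError("inputLength - outputLength must be even and >= 0")
    padding = extra // 2
    return start - padding, end + padding


# Hypothetical model with a 3090 bp input and a 1000 bp output.
print(padRegion(10000, 11000, inputLength=3090, outputLength=1000))
```

A region close to the start of a chromosome would produce a negative start coordinate here, so real input files need regions that leave room for the padding.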
- class bpreveal.internal.interpretUtils.PisaBedGenerator(bedFname, genomeFname, inputLength, outputLength)
Reads in lines from a bed file and fetches the genomic sequence at every base.
This is very different than the FlatBedGenerator, which generates one sequence query per bed file entry. This class generates a query for every base that the bed file contains.
- Parameters:
bedFname (str) – The bed file to read.
genomeFname (str) – The genome fasta that sequences will be drawn from.
inputLength (int) – The input length of your model.
outputLength (int) – The output length of your model.
- construct()
Run in the child thread, opens up the files and reads the bed.
- Return type:
None
- done()
Close the fasta.
- Return type:
None
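The difference from FlatBedGenerator is easiest to see by counting queries. This sketch expands bed lines into one (chromName, position) entry per base; perBaseQueries is a hypothetical helper, and it assumes standard half-open bed coordinates. It does not fetch any sequence.

```python
def perBaseQueries(bedLines):
    """Yield ((chromName, position), index) for every base in every region."""
    index = 0
    for line in bedLines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # Skip blank lines, comments, and header lines.
        chrom, start, end = line.split()[:3]
        # bed intervals are half-open: start is included, end is not.
        for pos in range(int(start), int(end)):
            yield (chrom, pos), index
            index += 1


bed = ["chr1\t100\t103", "chr2\t5\t7"]
queries = list(perBaseQueries(bed))
print(len(bed), "regions ->", len(queries), "queries")
print(queries[0], queries[-1])
```

A bed file with two short regions therefore gives five queries here, where a one-query-per-entry generator would give two.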
- bpreveal.internal.interpretUtils.combineMultAndDiffref(mult, originalInput, backgroundData)
Combine the shap multipliers and difference from reference to generate hypothetical scores.
- Parameters:
mult (ndarray[tuple[Any, ...], dtype[float16]]) – The shap multipliers.
originalInput (ndarray[tuple[Any, ...], dtype[uint8]]) – The one-hot encoded sequence being shapped.
backgroundData (ndarray[tuple[Any, ...], dtype[uint8]]) – The shuffled references.
- Returns:
A list of hypothetical contributions.
- Return type:
list
This is injected deep into shap and generates the hypothetical importance scores.
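For orientation, here is a sketch of how hypothetical contributions are commonly computed from multipliers and shuffled references in DeepSHAP workflows that feed MoDISco. Whether this function matches it term for term is an assumption; toyHypotheticalContribs is a hypothetical name, nested lists stand in for numpy arrays, and it handles a single sequence. For each position and each base that could have been there, it sums (hypothetical one-hot minus reference) times multiplier over the channels, then averages over references. In this formulation originalInput is not needed for the arithmetic.

```python
def toyHypotheticalContribs(mult, backgroundData):
    """Hypothetical contributions for one sequence.

    mult and backgroundData are [numShuffles][length][NUM_BASES] nested lists.
    Returns a [length][NUM_BASES] nested list.
    """
    numRefs = len(backgroundData)
    length = len(backgroundData[0])
    numBases = len(backgroundData[0][0])
    out = [[0.0] * numBases for _ in range(length)]
    for pos in range(length):
        for base in range(numBases):
            total = 0.0
            for ref in range(numRefs):
                for channel in range(numBases):
                    # One-hot value if `base` were present at this position.
                    hyp = 1.0 if channel == base else 0.0
                    diffFromRef = hyp - backgroundData[ref][pos][channel]
                    total += diffFromRef * mult[ref][pos][channel]
            out[pos][base] = total / numRefs  # Average over the references.
    return out


# One reference, one position: the reference has an A, multipliers favor C.
background = [[[1, 0, 0, 0]]]
mult = [[[0.5, 2.0, 0.0, 0.0]]]
print(toyHypotheticalContribs(mult, background))
```

The score for A is zero because the hypothetical base equals the reference base, so there is no difference from reference to attribute.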
- bpreveal.internal.interpretUtils.isShappable(model)
Check to see if the model can be shapped.
Early versions of BPReveal created combined and transformation models that were incompatible with DeepShap.
- Parameters:
model (keras.Model) – The (loaded) Keras model that you want to check.
- Returns:
True if it’s safe to shap the model.
- Return type:
bool