internal.predictUtils
Utility functions and classes for streaming predictions.
- class bpreveal.internal.predictUtils.FastaReader(fastaFname)
Streams a fasta file from disk lazily.
- Parameters:
fastaFname – The name of the fasta file to load.
- pop()
Pop the current sequence off the queue. Updates curSequence and curLabel.
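The lazy-streaming idea can be sketched with a plain generator. This is a minimal illustration under stated assumptions, not the FastaReader implementation: `iterFasta` is a hypothetical name, and yielding (label, sequence) pairs is a choice made here to mirror curLabel and curSequence.

```python
import io
from typing import Iterator, TextIO


def iterFasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (label, sequence) pairs one record at a time.

    Hypothetical helper, not part of bpreveal. Only the record
    currently being assembled is held in memory.
    """
    label = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        if line.startswith(">"):
            # A new header finishes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record once the file is exhausted.
    if label is not None:
        yield label, "".join(chunks)


# An in-memory file stands in for a fasta file on disk.
demo = io.StringIO(">seq1\nACGT\nTTAA\n>seq2\nGGCC\n")
records = list(iterFasta(demo))
```

Each step of the generator plays the role of one pop() call: the caller sees one label and one sequence at a time, and nothing further is read until the next record is requested.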
- class bpreveal.internal.predictUtils.BedReader(bedFname, genomeFname, padding)
Streams a bed file from disk and loads sequence information lazily.
- Parameters:
bedFname (str) – The name of the bed file to load.
genomeFname (str) – The name of the fasta-format genome file.
padding (int) – The amount by which each region should be expanded before fetching the sequence. This will be (inputLength - outputLength) // 2 for most cases.
- curSequence: str
The genomic sequence under the current region.
This will update when you call pop().
- curLabel = ''
Just for compatibility with the FastaReader, this will always be an empty string.
- numPredictions: int = 0
The total number of regions in the bed file.
- pop()
Pop the current sequence off the queue.
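The padding arithmetic can be checked with a short sketch. The helper name `paddedInterval` and the lengths used are made up for illustration, and the sketch assumes that each bed region is exactly outputLength wide and that inputLength - outputLength is even.

```python
def paddedInterval(start: int, stop: int,
                   inputLength: int, outputLength: int) -> tuple[int, int]:
    """Expand a region on both sides so it spans inputLength bases.

    Hypothetical helper, not part of bpreveal.
    """
    padding = (inputLength - outputLength) // 2
    return start - padding, stop + padding


# Illustrative values only: a 1000 bp output window and a 3056 bp input.
paddedStart, paddedStop = paddedInterval(5000, 6000,
                                         inputLength=3056, outputLength=1000)
paddedWidth = paddedStop - paddedStart
```

With these numbers the padding is 1028 bases on each side, so the fetched sequence is as wide as the model input.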
- class bpreveal.internal.predictUtils.H5Writer(fname, numHeads, numPredictions, bedFname=None, genomeFname=None)
Batches up predictions and saves them in chunks.
- Parameters:
fname – The name of the hdf5 file to save.
numHeads – The total number of heads for this model.
numPredictions – How many total predictions will be made?
bedFname (str | None)
genomeFname (str | None)
- buildDatasets(sampleOutputs)
Actually construct the output hdf5 file.
You must give this function the first prediction from the model so that it can size its datasets appropriately.
- Parameters:
sampleOutputs (list) – An output from the Batcher. This is not written to the file, it’s just used to get the right size for the datasets.
- addEntry(batcherOut)
Add a single output from the Batcher.
- Parameters:
batcherOut (tuple)
- commit()
Actually write the data out to the backing hdf5 file.
- close()
Close the output hdf5.
You MUST call close on this object, as otherwise the last bit of data won’t get written to disk.
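The reason close() matters can be illustrated with a toy batching writer. This is a sketch, not the H5Writer implementation: the class `ChunkedWriter`, its `chunkSize` parameter, and the choice to commit automatically whenever the buffer fills are all assumptions, and a plain list stands in for the hdf5 file.

```python
class ChunkedWriter:
    """Toy stand-in for a batching writer; NOT bpreveal's H5Writer."""

    def __init__(self, chunkSize: int):
        self.chunkSize = chunkSize
        self.buffer: list[int] = []  # Entries waiting to be written.
        self.stored: list[int] = []  # Stands in for the file on disk.

    def addEntry(self, entry: int) -> None:
        """Queue one entry, writing a chunk whenever the buffer is full."""
        self.buffer.append(entry)
        if len(self.buffer) >= self.chunkSize:
            self.commit()

    def commit(self) -> None:
        """Move everything in the buffer to the backing store."""
        self.stored.extend(self.buffer)
        self.buffer = []

    def close(self) -> None:
        """Flush the final, possibly partial, chunk."""
        self.commit()


writer = ChunkedWriter(chunkSize=4)
for value in range(10):
    writer.addEntry(value)
storedBeforeClose = list(writer.stored)
writer.close()
storedAfterClose = list(writer.stored)
```

Ten entries with a chunk size of four leave two entries in the buffer after the loop. They only reach the backing store when close() is called, which is the failure mode the warning above describes.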
- bpreveal.internal.predictUtils.addGenomeInfo(outFile, genome)
Create a chrom name and chrom size dataset so that this h5 can be converted into a bigwig.
- Parameters:
outFile (File) – The (opened) hdf5 file to write.
genome (FastaFile) – The (opened) FastaFile object containing genome information.
- Returns:
The types you need to use to store chromosome index and position information, as well as a dictionary to map chromosome name to index. The first is what you need for coords_chrom, the second for coords_start and coords_end. They will each be one of np.uint8, np.uint16, np.uint32, or np.uint64. The dictionary (the third element) maps string chromosome names (like chrII) to integers. It is the inverse of the created chrom_names dataset.
- Return type:
tuple[type, type, dict[str, int]]
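One way to arrive at such a return value is sketched below. The exact rule addGenomeInfo uses to choose a type is not stated here, so picking the narrowest unsigned width that holds the largest value is an assumption. `smallestUintName` is a hypothetical helper, the chromosome sizes are illustrative, and the sketch returns type names as strings so that it runs without numpy.

```python
def smallestUintName(maxValue: int) -> str:
    """Name the narrowest unsigned integer type that can hold maxValue.

    Hypothetical helper, not part of bpreveal. With numpy installed,
    np.dtype(name) turns the returned name into a dtype.
    """
    for bits in (8, 16, 32, 64):
        if maxValue < 2 ** bits:
            return f"uint{bits}"
    raise ValueError("Value does not fit in 64 bits.")


# Illustrative chromosome sizes, not read from a real genome.
chromSizes = {"chrI": 230_000, "chrII": 813_000, "chrIII": 317_000}

# The list plays the role of chrom_names; the dict is its inverse.
chromNames = list(chromSizes)
chromNameToIdx = {name: idx for idx, name in enumerate(chromNames)}

chromDtype = smallestUintName(len(chromNames) - 1)     # For coords_chrom.
posDtype = smallestUintName(max(chromSizes.values()))  # For start and end.
```

With three chromosomes, the indices fit in eight bits, while positions up to 813,000 need 32 bits because they exceed the 16-bit maximum of 65,535.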
- bpreveal.internal.predictUtils.addCoordsInfo(regions, outFile, genome, stopName='coords_stop')
Initialize an hdf5 with coordinate information.
Creates the chrom_names, chrom_sizes, coords_chrom, coords_start, and coords_stop datasets. The last of these is named by stopName, so it may be coords_end instead.
- Parameters:
regions (BedTool) – A BedTool of regions that will be written.
outFile (File) – The opened hdf5 file.
genome (FastaFile) – An opened pysam.FastaFile with your genome.
stopName (str) – What should the stop point dataset be called? For interpretation scores, it should be called coords_end, while for predictions it should be called coords_stop. I’m sorry that this parameter exists.
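The relationship between the coordinate datasets and the stopName parameter can be sketched without hdf5. In this minimal illustration a dictionary of lists stands in for the output file, regions are plain (chrom, start, stop) tuples rather than a BedTool, and `buildCoords` is a hypothetical name.

```python
def buildCoords(regions: list[tuple[str, int, int]],
                chromNames: list[str],
                stopName: str = "coords_stop") -> dict[str, list[int]]:
    """Build the per-region coordinate columns as plain lists.

    Hypothetical helper, not part of bpreveal.
    """
    # Chromosomes are stored as indices into chromNames.
    chromNameToIdx = {name: idx for idx, name in enumerate(chromNames)}
    return {
        "coords_chrom": [chromNameToIdx[chrom] for chrom, _, _ in regions],
        "coords_start": [start for _, start, _ in regions],
        stopName: [stop for _, _, stop in regions],
    }


chromNames = ["chrI", "chrII"]
regions = [("chrII", 100, 1100), ("chrI", 50, 1050)]

predictionCoords = buildCoords(regions, chromNames)
interpretationCoords = buildCoords(regions, chromNames, stopName="coords_end")
```

Only the name of the stop column differs between the two results; the stored values are identical.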