internal.predictUtils

Functions and classes for streaming predictions.

class bpreveal.internal.predictUtils.FastaReader(fastaFname)

Streams a fasta file from disk lazily.

Parameters:

fastaFname (str) – The name of the fasta file to load.

curSequence = ''

The current sequence in this file. Updated by pop().

curLabel = ''

The current description line in this file. Updated by pop().

pop()

Pop the current sequence off the queue. Updates curSequence and curLabel.

Return type:

None
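The pop()-driven interface above can be illustrated with a small stand-in. This is only a sketch of the lazy-streaming idea, not BPReveal's implementation: the class name LazyFastaSketch, the done flag, and the choice to load the first record inside the constructor are all assumptions made for illustration.

```python
import os
import tempfile


class LazyFastaSketch:
    """Hypothetical stand-in that mimics the curSequence/curLabel/pop() interface."""

    def __init__(self, fastaFname):
        self.curSequence = ""
        self.curLabel = ""
        self.done = False  # Assumption: not part of the documented interface.
        self._fp = open(fastaFname, "r")
        self._nextLabel = None
        # Read only as far as the first header line.
        for line in self._fp:
            line = line.strip()
            if line.startswith(">"):
                self._nextLabel = line[1:]
                break
        self.pop()  # Assumption: the first record is loaded right away.

    def pop(self):
        """Advance to the next record, reading only the lines it needs."""
        if self._nextLabel is None:
            # Nothing left in the file.
            self.curSequence, self.curLabel, self.done = "", "", True
            self._fp.close()
            return
        self.curLabel = self._nextLabel
        self._nextLabel = None
        pieces = []
        for line in self._fp:
            line = line.strip()
            if line.startswith(">"):
                # Remember the next header and stop reading.
                self._nextLabel = line[1:]
                break
            pieces.append(line)
        self.curSequence = "".join(pieces)


def demo():
    """Write a tiny fasta file, stream it, and return the (label, sequence) pairs."""
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fp:
        fp.write(">seq1 first\nACGT\nTTAA\n>seq2\nGGCC\n")
        fname = fp.name
    reader = LazyFastaSketch(fname)
    records = []
    while not reader.done:
        records.append((reader.curLabel, reader.curSequence))
        reader.pop()
    os.remove(fname)
    return records
```

Only one record is held in memory at a time, which is the point of streaming a large fasta file instead of loading it whole.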

class bpreveal.internal.predictUtils.BedReader(bedFname, genomeFname, padding)

Streams a bed file from disk and loads sequence information lazily.

Parameters:
  • bedFname (str) – The name of the bed file to load.

  • genomeFname (str) – The name of the fasta-format genome file.

  • padding (int) – The amount by which each region should be expanded before fetching the sequence. This will be (inputLength - outputLength) // 2 for most cases.
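As a worked example of the padding formula, the sketch below computes the padding and applies it to one region. The helper names computePadding and expandRegion and the lengths used are hypothetical, chosen only to illustrate the arithmetic.

```python
def computePadding(inputLength, outputLength):
    """Bases to add on each side so an output-sized region becomes input-sized."""
    return (inputLength - outputLength) // 2


def expandRegion(start, end, padding):
    """Hypothetical helper: widen a [start, end) interval by padding on both sides."""
    return start - padding, end + padding


# Illustrative numbers only; substitute your own model's lengths.
padding = computePadding(inputLength=3056, outputLength=1000)
paddedStart, paddedEnd = expandRegion(10000, 11000, padding)
print(padding, paddedStart, paddedEnd)
```

With these numbers, a 1000 bp region grows to 3056 bp, matching the assumed input length.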

curSequence: str

The genomic sequence under the current region.

This will update when you call pop().

curLabel = ''

Always an empty string. This attribute exists only for compatibility with FastaReader.

numPredictions: int = 0

The total number of regions in the bed file.

pop()

Pop the current sequence off the queue.

Return type:

None

class bpreveal.internal.predictUtils.H5Writer(fname, numHeads, numPredictions, bedFname=None, genomeFname=None, config=None)

Batches up predictions and saves them in chunks.

Parameters:
  • fname (str) – The name of the hdf5 file to save.

  • numHeads (int) – The total number of heads for this model.

  • numPredictions (int) – How many total predictions will be made?

  • bedFname (str | None)

  • genomeFname (str | None)

  • config (str | None)

buildDatasets(sampleOutputs)

Actually construct the output hdf5 file.

You must give this function the first prediction from the model so that it can size its datasets appropriately.

Parameters:

sampleOutputs (list) – An output from the Batcher. This is not written to the file, it’s just used to get the right size for the datasets.

Return type:

None

addEntry(batcherOut)

Add a single output from the Batcher.

Parameters:

batcherOut (tuple) – The result from one call to the batcher.

Return type:

None

commit()

Actually write the data out to the backing hdf5 file.

Return type:

None

close()

Close the output hdf5.

You MUST call close() on this object; otherwise the final batch of data will not be written to disk.

Return type:

None
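The reason close() is mandatory follows from the buffer-then-flush pattern. The sketch below is a hypothetical stand-in, not H5Writer itself: a plain list plays the role of the hdf5 file, and the chunkSize parameter and automatic commit inside addEntry are assumptions about how such a writer might behave.

```python
class ChunkedWriterSketch:
    """Hypothetical buffer-then-flush writer; a list stands in for the hdf5 file."""

    def __init__(self, chunkSize):
        self.chunkSize = chunkSize
        self.pending = []    # Entries held in memory.
        self.written = []    # Stand-in for data that has reached the disk.
        self.numCommits = 0

    def addEntry(self, entry):
        self.pending.append(entry)
        # Assumption: flush automatically once a full chunk has accumulated.
        if len(self.pending) >= self.chunkSize:
            self.commit()

    def commit(self):
        if self.pending:
            self.written.extend(self.pending)
            self.pending = []
            self.numCommits += 1

    def close(self):
        # Flush the final, partially filled chunk.
        self.commit()


writer = ChunkedWriterSketch(chunkSize=4)
for i in range(10):
    writer.addEntry(i)
beforeClose = list(writer.written)  # Only the two full chunks so far.
writer.close()
afterClose = list(writer.written)   # The last two entries arrive on close().
```

Ten entries with a chunk size of four leave two entries in the buffer, and those are lost unless close() runs.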

bpreveal.internal.predictUtils.addGenomeInfo(outFile, genome)

Create a chrom name and chrom size dataset so that this h5 can be converted into a bigwig.

Parameters:
  • outFile (File) – The (opened) hdf5 file to write.

  • genome (FastaFile) – The (opened) FastaFile object containing genome information.

Returns:

A three-element tuple. The first element is the type to use for coords_chrom (chromosome indexes), and the second is the type to use for coords_start and coords_end (positions). Each is one of np.uint8, np.uint16, np.uint32, or np.uint64. The third element is a dictionary mapping string chromosome names (like chrII) to integer indexes. It is the inverse of the created chrom_names dataset.

Return type:

tuple[type, type, dict[str, int]]
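The idea behind the returned types and dictionary can be sketched without numpy or an hdf5 file. Everything here is illustrative: smallestUnsignedType is a hypothetical helper that returns a type name as a string, the thresholds are an assumption about how the narrowest type might be chosen, and the chromosome table is example data.

```python
def smallestUnsignedType(maxValue):
    """Return the name of the narrowest unsigned integer type that holds maxValue."""
    for bits in (8, 16, 32, 64):
        if maxValue < 2 ** bits:
            return f"uint{bits}"
    raise ValueError("Value does not fit in 64 bits.")


# Illustrative chromosome table; a real one would come from the genome file.
chromSizes = {"chrI": 230218, "chrII": 813184, "chrIII": 316620}

# chrom_names is an index-to-name list, so the dictionary is its inverse.
chromNames = list(chromSizes)
chromNameToIdx = {name: idx for idx, name in enumerate(chromNames)}

# The chromosome type must hold the largest index, the position type the largest size.
chromType = smallestUnsignedType(len(chromNames) - 1)
posType = smallestUnsignedType(max(chromSizes.values()))
print(chromType, posType, chromNameToIdx)
```

Picking the narrowest type that fits keeps the coordinate datasets small for genomes with few chromosomes.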

bpreveal.internal.predictUtils.addCoordsInfo(regions, outFile, genome, stopName='coords_stop')

Initialize an hdf5 with coordinate information.

Creates the chrom_names, chrom_sizes, coords_chrom, and coords_start datasets, plus a stop-coordinate dataset whose name is given by stopName (coords_stop by default).

Parameters:
  • regions (BedTool) – A BedTool of regions that will be written.

  • outFile (File) – The opened hdf5 file.

  • genome (FastaFile) – An opened pysam.FastaFile with your genome.

  • stopName (str) – What should the stop point dataset be called? For interpretation scores, it should be called coords_end, while for predictions it should be called coords_stop. I’m sorry that this parameter exists.

Return type:

None
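The effect of stopName can be shown with plain Python containers. This is a rough sketch only: buildCoordsSketch is a hypothetical function, regions are represented as (chrom, start, stop) tuples rather than a BedTool, and a dictionary of lists stands in for the hdf5 datasets.

```python
def buildCoordsSketch(regions, chromNameToIdx, stopName="coords_stop"):
    """Hypothetical: collect coordinate columns under the dataset names used above."""
    return {
        "coords_chrom": [chromNameToIdx[chrom] for chrom, _, _ in regions],
        "coords_start": [start for _, start, _ in regions],
        stopName: [stop for _, _, stop in regions],
    }


# Example data; names and coordinates are made up for illustration.
nameToIdx = {"chrI": 0, "chrII": 1}
exampleRegions = [("chrII", 100, 1100), ("chrI", 5000, 6000)]

predictionCoords = buildCoordsSketch(exampleRegions, nameToIdx)
interpretationCoords = buildCoordsSketch(exampleRegions, nameToIdx,
                                         stopName="coords_end")
```

The two results hold identical values and differ only in the key used for the stop column.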