bedUtils

Some utilities for dealing with bed files.

bpreveal.bedUtils.makeWhitelistSegments(genome, blacklist=None)

Get a list of windows where it is safe to draw inputs for your model.

Parameters:
  • genome (FastaFile) – A FastaFile (pysam object, not a string filename!).

  • blacklist (BedTool | None) – (Optional) A BedTool that gives additional regions that should be excluded.

Returns:

A BedTool that contains the whitelisted regions.

Return type:

BedTool

Given a genome file, go over each chromosome and see where the Ns are. Create a BedTool that contains all regions that don’t contain N. For example, if your genome were ATATATATnnnnnnnATATATATATATnnn, then this would return a BedTool corresponding to the regions containing As and Ts.

blacklist, if provided, is a bed file of regions that should be treated as though they contained N nucleotides.

bpreveal.bedUtils.tileSegments(inputLength, outputLength, segments, spacing)

Tile the given segments with intervals.

Parameters:
  • inputLength (int) – The input-length of your model.

  • outputLength (int) – The output-length of your model, and also the length of the intervals in the returned BedTool.

  • segments (BedTool) – The regions of the genome that you’d like tiled.

  • spacing (int) – The distance between the windows.

Returns:

A BedTool containing Intervals of length outputLength.

Return type:

BedTool

segments will often come from makeWhitelistSegments().

spacing is the distance between the end of one region and the start of the next. So to tile the whole genome, set spacing=0. spacing may be negative, in which case the tiled regions will overlap.

When this algorithm reaches the end of a segment, it will try to place an additional region if it can. For example, if your window is 30 bp long, with outputLength 6, inputLength 10, and spacing 5 you’d get:

012345678901234567890123456789
--xxxxxx-----xxxxxx---xxxxxx--

The 2 bp padding on each end comes from the fact that (inputLength - outputLength) / 2 == 2 Note how the last segment is not 5 bp away from the second-to-last.

bpreveal.bedUtils.createTilingRegions(inputLength, outputLength, genome, spacing)

Create a list of regions that tile a genome.

Parameters:
  • inputLength (int) – The input-length of your model.

  • outputLength (int) – The output-length of your model.

  • genome (FastaFile) – A FastaFile (the pysam object, not a string!)

  • spacing (int) – The space you’d like between returned intervals.

Returns:

A BedTool containing regions that tile the genome.

Return type:

BedTool

The returned BedTool will contain regions that are outputLength wide, and all regions will be far enough away from any N nucleotides that there will be no Ns in the input to your model. spacing specifies the amount of space between the regions. A spacing of 0 means that the regions should join end-to-end, while a spacing of -500 would indicate regions that overlap by 500 bp. See tileSegments() for details on how the regions are placed.

bpreveal.bedUtils.resize(interval, mode, width, genome)

Resize a given interval to a new size.

Parameters:
  • interval (Interval) – A pyBedTools Interval object.

  • mode (str) – One of "none", "center", or "start".

  • width (int) – How long the returned Interval will be.

  • genome (FastaFile) – A FastaFile (the pysam object, not a string)

Returns:

A newly-allocated Interval of the correct size, or False if resizing would run off the edge of a chromosome.

Return type:

Interval | Literal[False]

Given an interval (a PyBedTools Interval object), return a new Interval that is at the same coordinate.

mode is one of:

  • “none”, meaning that no resizing is done. In that case, this function will check that the interval obeys stop-start == width. If an interval does not have the correct width, an assertion will fail.

  • “center”, in which case the interval is resized around its center.

  • “start”, in which case the start coordinate is preserved.

The returned interval will obey x.end - x.start == width. It will preserve the chromosome, name, score, and strand information, but not other bed fields.

bpreveal.bedUtils.metapeak(intervals, bigwigFname, numThreads=None)

Go over the given intervals and build a metapeak.

Parameters:
  • intervals (BedTool) – A pyBigWig file containing the regions to use. This can also be a list of Interval objects. Each interval must be of the same size.

  • bigwigFname (str) – The name of the bigwig file to read in.

  • numThreads (int | None) – If provided, the number of parallel workers to use. If not specified, use all of the CPUs on the machine.

Returns:

A numpy array of shape (interval-length,) containing the average profile over all of the intervals.

Return type:

ndarray[Any, dtype[float32]]

This produces a stranded metapeak. If an interval is on the negative strand, then the values extracted from the bigwig will be reversed before being added to the metapeak.

The parallel implementation is memory-efficient and can easily scale to metapeaks with millions of underlying regions. However, this means that interrupting the calculation can leave the program in a really ugly state. I recommend checking your inputs before you call this.

NaN entries in the bigwig are treated as zero.

bpreveal.bedUtils.getCounts(interval, bigwigs)

Get the total counts from all bigwigs at a given Interval.

Parameters:
  • interval (Interval) – A pyBedTools Interval.

  • bigwigs (list) – A list of opened pyBigWig objects (not strings!).

Returns:

A single number giving the total reads from all bigwigs at the given interval.

Return type:

float

NaN entries in the bigwigs are treated as zero.

bpreveal.bedUtils.sequenceChecker(interval, genome)

For the given interval, does it only contain A, C, G, and T?

Parameters:
  • interval (Interval) – The interval to check.

  • genome (FastaFile) – A FastaFile (pysam object, not a string!).

Returns:

True if the sequence matches "^[ACGTacgt]*$", False otherwise.

Return type:

bool

bpreveal.bedUtils.lineToInterval(line)

Take a line from a bed file and create a PyBedTools Interval object.

Parameters:

line (str) – The line from the bed file

Returns:

A newly-allocated pyBedTools Interval object, or False if the line is not a valid bed line.

Return type:

Interval | Literal[False]

class bpreveal.bedUtils.ParallelCounter(bigwigNames, numThreads)

A class that queues up getCounts() jobs and runs them in parallel.

This is used by the prepareBed scripts.

Parameters:
  • bigwigNames (list[str]) – The name of the bigwig files to read from

  • numThreads (int) – How many parallel workers should be used?

addQuery(query, idx)

Add a region (chrom, start, end) to the task list.

Parameters:
  • query (tuple[str, int, int]) – A tuple of (chromosome, start, end) giving the region to look at.

  • idx (Any) – An index that will be returned with the results.

Return type:

None

done()

Wrap up the show - close the child threads.

getResult()

Get the next result.

Returns:

A tuple of (counts, idx), where counts is the total counts for the bigwigs and idx is the index of the region you gave to addQuery.

Note that the results will NOT be given in order - you must look at the index values.