bedUtils
Some utilities for dealing with bed files.
- bpreveal.bedUtils.makeWhitelistSegments(genome, blacklist=None)
Get a list of windows where it is safe to draw inputs for your model.
- Parameters:
genome (FastaFile) – A FastaFile (pysam object, not a string filename!).
blacklist (BedTool | None) – (Optional) A BedTool that gives additional regions that should be excluded.
- Returns:
A BedTool that contains the whitelisted regions.
- Return type:
BedTool
Given a genome file, go over each chromosome and see where the Ns are. Create a BedTool that contains all regions that don’t contain N. For example, if your genome were
ATATATATnnnnnnnATATATATATATnnn, then this would return a BedTool corresponding to the regions containing As and Ts.blacklist, if provided, is a bed file of regions that should be treated as though they contained N nucleotides.
- bpreveal.bedUtils.tileSegments(inputLength, outputLength, segments, spacing)
Tile the given segments with intervals.
- Parameters:
inputLength (int) – The input-length of your model.
outputLength (int) – The output-length of your model, and also the length of the intervals in the returned
BedTool.segments (BedTool) – The regions of the genome that you’d like tiled.
spacing (int) – The distance between the output windows.
- Returns:
A BedTool containing Intervals of length
outputLength.- Return type:
BedTool
segmentswill often come frommakeWhitelistSegments().spacingis the distance between the end of one region and the start of the next. So to tile the whole genome, setspacing=0.spacingmay be negative, in which case the tiled regions will overlap.When this algorithm reaches the end of a segment, it will try to place an additional region if it can. For example, if your window is 30 bp long, with outputLength 6, inputLength 10, and spacing 5 you’d get:
012345678901234567890123456789 --xxxxxx-----xxxxxx---xxxxxx--
The 2 bp padding on each end comes from the fact that
(inputLength - outputLength) / 2 == 2Note how the last segment is not 5 bp away from the second-to-last.
- bpreveal.bedUtils.createTilingRegions(inputLength, outputLength, genome, spacing)
Create a list of regions that tile a genome.
- Parameters:
inputLength (int) – The input-length of your model.
outputLength (int) – The output-length of your model.
genome (FastaFile) – A FastaFile (the pysam object, not a string!)
spacing (int) – The space you’d like between returned intervals.
- Returns:
A BedTool containing regions that tile the genome.
- Return type:
BedTool
The returned BedTool will contain regions that are outputLength wide, and all regions will be far enough away from any N nucleotides that there will be no Ns in the input to your model. spacing specifies the amount of space between the regions. A spacing of 0 means that the regions should join end-to-end, while a spacing of -500 would indicate regions that overlap by 500 bp. See
tileSegments()for details on how the regions are placed.
- bpreveal.bedUtils.resize(interval, mode, width, genome)
Resize a given interval to a new size.
- Parameters:
interval (Interval) – A pyBedTools Interval object.
mode (str) – One of
"none","center", or"start".width (int) – How long the returned Interval will be.
genome (FastaFile) – A FastaFile (the pysam object, not a string)
- Raises:
ValueError – if you provide an invalid resize mode.
- Returns:
A newly-allocated Interval of the correct size, or
Falseif resizing would run off the edge of a chromosome.- Return type:
Interval | Literal[False]
Given an interval (a PyBedTools Interval object), return a new Interval that is at the same coordinate.
mode is one of:
"none", meaning that no resizing is done. In that case, this function will check that the interval obeys stop-start == width. If an interval does not have the correct width, an assertion will fail."center", in which case the interval is resized around its center."start", in which case the start coordinate is preserved.
The returned interval will obey
x.end - x.start == width. It will preserve the chromosome, name, score, and strand information, but not other bed fields.
- bpreveal.bedUtils.metapeak(intervals, bigwigFname, numThreads=None)
Go over the given intervals and build a metapeak.
- Parameters:
intervals (BedTool) – A pyBedTool object containing the regions to use. This can also be a list of Interval objects. Each interval must be of the same size.
bigwigFname (str) – The name of the bigwig file to read in.
numThreads (int | None) – If provided, the number of parallel workers to use. If not specified, use all of the CPUs on the machine.
- Returns:
A numpy array of shape (interval-length,) containing the average profile over all of the intervals.
- Return type:
ndarray[tuple[Any, …], dtype[float32]]
This produces a stranded metapeak. If an interval is on the negative strand, then the values extracted from the bigwig will be reversed before being added to the metapeak.
The parallel implementation is memory-efficient and can easily scale to metapeaks with millions of underlying regions. However, this means that interrupting the calculation can leave the program in a really ugly state. I recommend checking your inputs before you call this.
NaN entries in the bigwig are treated as zero.
- bpreveal.bedUtils.getCounts(interval, bigwigs)
Get the total counts from all bigwigs at a given Interval.
- Parameters:
interval (Interval) – A pyBedTools Interval.
bigwigs (list) – A list of opened pyBigWig objects (not strings!).
- Returns:
A single number giving the total reads from all bigwigs at the given interval.
- Return type:
float
NaN entries in the bigwigs are treated as zero.
- bpreveal.bedUtils.sequenceChecker(interval, genome)
For the given interval, does it only contain A, C, G, and T?
- Parameters:
interval (Interval) – The interval to check.
genome (FastaFile) – A FastaFile (pysam object, not a string!).
- Returns:
Trueif the sequence matches"^[ACGTacgt]*$",Falseotherwise.- Return type:
bool
- bpreveal.bedUtils.lineToInterval(line)
Take a line from a bed file and create a PyBedTools Interval object.
- Parameters:
line (str) – The line from the bed file
- Returns:
A newly-allocated pyBedTools Interval object, or
Falseif the line is not a valid bed line.- Return type:
Interval | Literal[False]
- class bpreveal.bedUtils.ParallelCounter(bigwigNames, numThreads)
A class that queues up
getCounts()jobs and runs them in parallel.This is used by the
prepareBedscript.- Parameters:
bigwigNames (list[str]) – The name of the bigwig files to read from
numThreads (int) – How many parallel workers should be used?
- addQuery(query, idx)
Add a region (chrom, start, end) to the task list.
- Parameters:
query (tuple[str, int, int]) – A tuple of (chromosome, start, end) giving the region to look at.
idx (Any) – An index that will be returned with the results.
- Return type:
None
- done()
Wrap up the show - close the child threads.
- Return type:
None
- getResult()
Get the next result.
- Returns:
A tuple of (counts, idx), where counts is the total counts for the bigwigs and idx is the index of the region you gave to addQuery.
- Return type:
tuple[float, int]
Note that the results will NOT be given in order - you must look at the index values.
- bpreveal.bedUtils.sortBed(regions)
Sorts a BedTool by chromosome and then by start position.
This is a drop-in replacement for how bedtools does sorting. I redefine it here so that BPReveal works without installing full-blown bedtools.
- Parameters:
regions (BedTool | list[Interval]) – A BedTool or a list of Intervals.
- Returns:
A new BedTool containing the regions in sorted order.
- Return type:
BedTool