tools.gcContents
Help Info
usage: gcContents [-h] [--peaks PEAKS] [--bias BIAS]
[--genome GENOME] [--output OUTPUT]
[--strictness STRICTNESS] [--verbose]
[--plot]
Named Arguments
- --peaks
A bed file of peaks that will be used for training.
- --bias
A bed file of peaks used for the bias model.
- --genome
A fasta file for extracting sequences.
- --output
The name of the bed file that will be written.
- --strictness
How strict should the algorithm be? 0.5 is lax, 0.9 is the default, and 0.98 is very strict. At higher strictness, the sampled set will be smaller.
Default:
0.9
- --verbose
Default:
False
- --plot
Show some pretty plots.
Default:
False
Usage
Select regions from a bed file to match the GC distribution of a reference bed.
For training bias models, the ChromBPNet method requires that the bias regions match the peaks regions in GC content. This little script arranges for that.
You feed it two bed files. One represents your training regions and one represents possible bias regions. This program first calculates the distribution of GC content in the peaks. Then, it selects a subset of regions in the bias bed file such that the selected regions mirror the GC content of the training bed.
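As a rough illustration of the arguments listed above, here is one way to assemble an invocation from Python. All of the file names are hypothetical placeholders, and the sketch assumes that the program is installed on your PATH under the name shown in the usage line. It only builds and prints the command; it doesn't run anything.

```python
import shlex

# Hypothetical file names; substitute your own.
args = [
    "gcContents",
    "--peaks", "peaks.bed",              # training regions
    "--bias", "candidate_bias.bed",      # pool of possible bias regions
    "--genome", "genome.fa",             # fasta used to extract sequences
    "--output", "bias_gc_matched.bed",   # subset that will be written
    "--strictness", "0.9",               # the documented default
    "--verbose",
]

# Join the arguments into a single shell-safe string for display.
command = shlex.join(args)
print(command)

# To actually run it, you could pass the list to subprocess.run(args, check=True).
```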
- bpreveal.tools.gcContents.getGc(region, genome)
How many Gs and Cs are there in the given Interval?
- Parameters:
region (Interval) – A pybedtools Interval object.
genome (FastaFile) – An (opened) pysam genome, used to fetch the sequence.
- Returns:
The number of Gs and Cs in the region, or -1 if the region contains any Ns.
- Return type:
int
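The return convention is easiest to see on a plain string. The sketch below is not the BPReveal code: count_gc is a hypothetical helper that works on a sequence you have already fetched (for example with pysam's FastaFile.fetch), and treating lowercase bases the same as uppercase ones is an assumption.

```python
def count_gc(sequence: str) -> int:
    """Count G and C bases, returning -1 if the sequence contains any N."""
    # Assumption: soft-masked (lowercase) bases are counted like uppercase ones.
    seq = sequence.upper()
    if "N" in seq:
        # Same sentinel that the documentation describes for getGc.
        return -1
    return seq.count("G") + seq.count("C")


print(count_gc("ACGTGGCC"))  # three Gs and three Cs
print(count_gc("ACGTNNAC"))  # contains Ns, so the sentinel is returned
```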
- bpreveal.tools.gcContents.getDistributionFromBed(bedFname, genome)
Get a histogram of GC content in all of the regions in a bed file.
- Parameters:
bedFname (str) – The name of a bed file on disk.
genome (FastaFile) – An (opened) pysam FastaFile, for extracting the sequence.
- Returns:
A dictionary where the keys are the number of Cs and Gs in a region and the values are the number of regions with that GC content.
- Return type:
dict[int, int]
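To make the shape of that dictionary concrete, here is a small sketch that builds the same kind of histogram from sequences held in memory instead of a bed file. gc_histogram is a hypothetical name, and leaving out regions that contain Ns is an assumption; the documentation doesn't say how the real function handles them.

```python
from collections import Counter


def gc_histogram(sequences):
    """Map GC count to the number of sequences that have that GC count."""
    counts = Counter()
    for seq in sequences:
        upper = seq.upper()
        if "N" in upper:
            # Assumption: regions with Ns are skipped rather than counted.
            continue
        counts[upper.count("G") + upper.count("C")] += 1
    return dict(counts)


# Toy stand-ins for the sequences under each bed region.
toy_regions = ["ACGT", "GGCC", "AATT", "GCAT", "ANGT"]
hist = gc_histogram(toy_regions)
print(hist)
```

Here two of the toy regions have a GC count of 2, so the value stored under key 2 is 2.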
- bpreveal.tools.gcContents.plotHist(bedFname, gcCounts)
Plot the GC distribution for a bed file.
- Parameters:
bedFname (str) – Just a label.
gcCounts (dict[int, int]) – A dictionary as returned by getDistributionFromBed.
- Return type:
None
- bpreveal.tools.gcContents.getCorrectionFactors(peakCounts, biasCounts, strictness)
Determine sampling rates that will transform the bias GC content to match peaks.
The returned array gives you the sampling weight you should apply to bias regions to transform their GC distribution to match the peak GC distribution.
The proper use of this array is best explained with some code:
    cf = getCorrectionFactors(peakDist, biasDist, strictness)
    for region in biasBed:
        emitProbability = cf[getGc(region, genome)]
        if random() < emitProbability:
            outputBed.append(region)
As you can see, when you have a bias region with GC content i, the probability that it will be included in the selected regions is correctionFactors[i]. Note that you should make sure that i != -1! getGc returns -1 for regions that contain Ns, and an index of -1 would silently select the last entry of the array.
- Parameters:
peakCounts (dict[int, int]) – A dict as returned by getDistributionFromBed
biasCounts (dict[int, int]) – A dict as returned by getDistributionFromBed
strictness (float) – A float from 0 to 1. Default is 0.9. Higher strictness gives a better match to the input distribution, at the price of fewer regions making it through.
- Returns:
An array of floats, indexed by GC count, giving the sampling weight for bias regions with that GC count.
- Return type:
ndarray
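The documentation doesn't spell out how the factors are computed, so the following is only one plausible scheme, not BPReveal's implementation. It takes the ratio of peak frequency to bias frequency in each GC bin, then rescales by a quantile of those ratios chosen by strictness and clips at 1. The function name, the use of a plain list instead of an ndarray, and the way strictness is interpreted are all assumptions made for illustration.

```python
def correction_factors(peak_counts, bias_counts, strictness=0.9):
    """Illustrative sampling weights, indexed by GC count (assumed scheme)."""
    # Assumption: neither histogram contains the -1 sentinel as a key.
    n_bins = max(max(peak_counts), max(bias_counts)) + 1
    peak_total = sum(peak_counts.values())
    bias_total = sum(bias_counts.values())

    # Ratio of peak frequency to bias frequency in each GC bin.
    ratios = [0.0] * n_bins
    for gc in range(n_bins):
        bias_n = bias_counts.get(gc, 0)
        if bias_n > 0:
            peak_frac = peak_counts.get(gc, 0) / peak_total
            ratios[gc] = peak_frac / (bias_n / bias_total)

    positive = sorted(r for r in ratios if r > 0)
    if not positive:
        return ratios

    # Assumption: strictness picks the quantile of the ratios that maps to 1.
    # At strictness 1.0 the largest ratio is used, so nothing is clipped and
    # the match is exact, but the fewest regions survive.
    idx = min(len(positive) - 1, int(strictness * len(positive)))
    scale = positive[idx]
    return [min(1.0, r / scale) for r in ratios]


peaks = {1: 10, 2: 30, 3: 10}
bias = {1: 100, 2: 100, 3: 100, 4: 100}
strict = correction_factors(peaks, bias, strictness=1.0)
lax = correction_factors(peaks, bias, strictness=0.5)
print(strict)
print(lax)
```

In this toy case the lax setting keeps every region in bins 1 through 3, while the strict setting thins bins 1 and 3 to about a third. That is the trade-off the strictness parameter describes: a closer match, bought with fewer regions.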
- bpreveal.tools.gcContents.applyCorrection(inBed, correctionFactors, genome, rng=None)
Subsample the input bed according to the correction factors.
- Parameters:
inBed (BedTool) – The bias input bed.
correctionFactors (ndarray) – The factors given by getCorrectionFactors.
genome (FastaFile) – The (opened) pysam FastaFile.
rng (Generator | None) – A numpy Generator that will be used to get random samples.
- Returns:
A new BedTool that contains a subset of the regions in inBed.
- Return type:
BedTool
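The subsampling step can be sketched without any bed files at all. In the toy version below, each region is a (name, gcCount) tuple, and the standard library's random.Random stands in for the numpy Generator that the real function accepts. The function name and the decision to drop regions with a GC count of -1 are assumptions, though the latter follows from the warning about i != -1.

```python
import random


def subsample(regions, correction_factors, rng=None):
    """Keep each region with the probability listed for its GC count."""
    if rng is None:
        rng = random.Random()
    kept = []
    for name, gc in regions:
        # Skip the -1 sentinel (regions with Ns) and any GC count beyond the
        # table; a negative index would otherwise wrap around silently.
        if gc < 0 or gc >= len(correction_factors):
            continue
        if rng.random() < correction_factors[gc]:
            kept.append((name, gc))
    return kept


factors = [0.0, 1.0, 0.5]
regions = [("a", 0), ("b", 1), ("c", -1), ("d", 1), ("e", 2)]
kept = subsample(regions, factors, rng=random.Random(0))
names = [name for name, _ in kept]
print(names)
```

Passing a seeded generator, as done here, makes the selection reproducible from run to run.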
- bpreveal.tools.gcContents.getParser()
Build an arg parser (but don’t parse args).
- Return type:
ArgumentParser
- bpreveal.tools.gcContents.runMain()
Run the program.
- Return type:
None
- bpreveal.tools.gcContents.main()
Entry point for script.
- Return type:
None