prepareBed
Generates test, train, and validation splits and optionally performs some filtering.
BNF
<prepare-bed-configuration> ::= {<bigwig-section>, "splits" : {<split-settings>}, "genome" : <file-name>, "output-length" : <integer>, "input-length" : <integer>, "max-jitter" : <integer>, <output-file-name-section>, "resize-mode" : <resize-mode>, <overlap-section> <num-threads-section> <verbosity-section>}
<bigwig-section> ::= "heads" : [<head-preparation-list>] | (DEPRECATED) "bigwigs" : [<bigwig-preparation-list>]
<overlap-section> ::= "remove-overlaps" : true, "overlap-max-distance" : <integer>, | "remove-overlaps" : false,
<head-preparation-list> ::= <individual-preparation-head> | <individual-preparation-head>, <head-preparation-list>
<resize-mode> ::= "none" | "center" | "start"
<output-file-name-section> ::= "output-prefix" : "<string>" | "output-train" : <file-name>, "output-val" : <file-name>, "output-test" : <file-name>
<individual-preparation-head> ::= { "bigwig-names" : [<list-of-bigwig-files>], <max-cutoff-section>, <min-cutoff-section> }
<max-cutoff-section> ::= "max-quantile" : <number> | "max-counts" : <integer>
<min-cutoff-section> ::= "min-quantile" : <number> | "min-counts" : <integer>
<split-settings> ::= <split-by-chromosome-settings> | <split-by-name-settings> | <split-by-bed-settings>
<split-by-chromosome-settings> ::= "train-chroms" : [<list-of-string>], "val-chroms" : [<list-of-string> ], "test-chroms" : [<list-of-string> ], "regions" : [<list-of-bed-files>]
<split-by-bed-settings> ::= "train-regions" : [<list-of-bed-files>], "val-regions" : [<list-of-bed-files>], "test-regions" : [<list-of-bed-files>]
<split-by-name-settings> ::= "regions" : [<list-of-bed-files>], "test-regex" : "<string>", "train-regex" : "<string>", "val-regex" : "<string>"
<list-of-bigwig-files> ::= <file-name>, <list-of-bigwig-files> | <file-name>
<list-of-bed-files> ::= <file-name>, <list-of-bed-files> | <file-name>
<num-threads-section> ::= <empty> | "num-threads" : <integer>,
<bigwig-preparation-list> ::= (DEPRECATED) <individual-preparation-bigwig> (DEPRECATED) | <individual-preparation-bigwig>, <bigwig-preparation-list>
<individual-preparation-bigwig> ::= (DEPRECATED) { "file-name" : <file-name>, (DEPRECATED) <max-cutoff-section>, (DEPRECATED) <min-cutoff-section> (DEPRECATED) }
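As a rough illustration of the grammar above, here is a minimal sketch that builds a split-by-chromosome configuration as a Python dict and serializes it to JSON. Every file name and numeric value is a hypothetical placeholder, the "INFO" verbosity value is an assumption (the accepted values are defined in the base schema, not on this page), and check_config is an illustrative helper that is not part of BPReveal. It only mirrors the two cross-field rules stated in the schema for the output names and the overlap settings.

```python
import json

# All file names and numbers are hypothetical placeholders.
config = {
    "heads": [
        {
            "bigwig-names": ["sample_positive.bw", "sample_negative.bw"],
            "max-quantile": 0.99,
            "min-counts": 10,
        }
    ],
    "splits": {
        "train-chroms": ["chr1", "chr2", "chr3"],
        "val-chroms": ["chr4"],
        "test-chroms": ["chr5"],
        "regions": ["peaks.bed"],
    },
    "genome": "genome.fa",
    "output-length": 1000,
    "input-length": 3000,
    "max-jitter": 100,
    "output-prefix": "prepared",
    "resize-mode": "center",
    "remove-overlaps": True,
    "overlap-max-distance": 100,
    "num-threads": 4,
    "verbosity": "INFO",  # Assumed value; see the base schema for valid options.
}


def check_config(cfg):
    """Illustrative helper: return a list of problems (empty if none found)."""
    problems = []
    has_prefix = "output-prefix" in cfg
    has_all_three = all(
        key in cfg for key in ("output-train", "output-val", "output-test"))
    # Exactly one of the two output naming styles must be satisfied.
    if has_prefix == has_all_three:
        problems.append("give output-prefix OR all three output-* names")
    # overlap-max-distance is required with remove-overlaps true,
    # and forbidden otherwise.
    if cfg.get("remove-overlaps") is True:
        if "overlap-max-distance" not in cfg:
            problems.append("remove-overlaps is true but "
                            "overlap-max-distance is missing")
    elif "overlap-max-distance" in cfg:
        problems.append("overlap-max-distance set but "
                        "remove-overlaps is not true")
    return problems


print(json.dumps(config, indent=4))
print(check_config(config))
```

The JSON printed by this sketch could be saved to a file and adapted; it is meant as a starting point, not a recommended set of values.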
Parameter Notes
- bigwig-names
A list of the data bigwigs that correspond to this head. For example, these might be the positive and negative strands of a ChIP-nexus sample.
- resize-mode
Specifies which point of each region in the bed file the output regions should be centered on. Note that this program assumes your bed files are in bed3 format, that is, (chrom, start, stop). If you have additional columns with information like peak offset, those data will be ignored.
- max-quantile, min-quantile
The max and min quantile values, if provided, will be used to threshold which regions are included in the output. First, all of the counts in the given regions are computed (which takes a while!), and then the given quantile of those counts is computed. Regions with counts above the max-quantile value, or below the min-quantile value, are not included in the output files.
- max-counts, min-counts
Similarly, if max and min counts are given, all regions having more (or fewer) reads than the given number will be excluded.
- output-prefix
Specifies the base name for the output bed files. You can use either output-prefix OR list all three of output-train, output-val, and output-test. If you specify output-prefix, then five bed files will be made, called output-prefix_train.bed, output-prefix_val.bed, output-prefix_test.bed, output-prefix_all.bed, and output-prefix_reject.bed.
- output-train, output-val, output-test
If you give these file names, then the training, validation and test splits will be written to these three files, respectively.
- regions
Needed for splits by chromosome or by regex. This is a list of bed files containing every possible region that you might want to train on.
- train-chroms, val-chroms, test-chroms
Split up your input regions by chromosome.
- train-regex, val-regex, test-regex
If you use a regex, then the name field of each bed line will be matched against each of the three regexes. The line will be added to the split where it matches. If a bed line matches more than one regex, that will raise an error. If a line matches no regexes, it is added to the rejects.
- train-regions, val-regions, test-regions
You may provide a specific bed file for each of the splits. In this case, the regions in each of these files are used to construct each respective split.
- remove-overlaps
This flag can be set to true if you’d like to exclude overlapping regions. This is done by resizing all regions down to overlap-max-distance, and then, if multiple regions have an overlap, one is deleted at random. If remove-overlaps is false, then it is an error to set overlap-max-distance.
- num-threads
How many threads should be used for loading counts information? I recommend setting this to as many threads as your machine has.
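The regex-based split described in the notes above can be sketched in a few lines. This is only an illustration under stated assumptions: split_by_regex is a hypothetical helper, not the BPReveal implementation, the name is taken to be the fourth tab-separated column, and re.search is used for matching (whether prepareBed uses search or a full match is not stated here).

```python
import re


def split_by_regex(bed_lines, train_regex, val_regex, test_regex):
    """Assign each bed line to a split by matching its name column."""
    patterns = {
        "train": re.compile(train_regex),
        "val": re.compile(val_regex),
        "test": re.compile(test_regex),
    }
    splits = {"train": [], "val": [], "test": [], "reject": []}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: a line without a name column is treated as an empty name.
        name = fields[3] if len(fields) > 3 else ""
        matched = [split for split, pattern in patterns.items()
                   if pattern.search(name)]
        if len(matched) > 1:
            # A name matching more than one regex is an error.
            raise ValueError(f"Name {name!r} matches several splits: {matched}")
        # A name matching no regex goes to the rejects.
        target = matched[0] if matched else "reject"
        splits[target].append(line)
    return splits


example_lines = [
    "chr1\t100\t200\tpeak_train_1",
    "chr1\t300\t400\tpeak_val_1",
    "chr2\t100\t200\tpeak_test_1",
    "chr2\t300\t400\tunlabeled",
]
splits = split_by_regex(example_lines, "_train_", "_val_", "_test_")
for split_name, lines in splits.items():
    print(split_name, len(lines))
```

Writing the three regexes so that they cannot match the same name avoids the ambiguity error.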
Additional information
Counts windowing
I should mention that the maximum and minimum counts are not compared across the
same window.
When comparing a region against the maximum counts value, all counts within a
window of size input-length + 2*jitter are added up.
This way, if you have a crazy-huge spike just outside your region, that region
will be rejected if the jittering could include it in the training data.
Conversely, for minimum counts, the counts within a window of length
output-length - 2*jitter will be considered.
This way, no matter what jitter value is selected, there will be at least the
given number of counts in the region.
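To make the two window sizes concrete, here is a small sketch that works on a toy per-base counts list rather than a real bigwig. The function names and the toy data are hypothetical, and the windows are assumed to be centered on the region's midpoint. The window formulas are the ones stated above.

```python
def counts_windows(input_length, output_length, max_jitter):
    """Return (max-counts window size, min-counts window size)."""
    max_window = input_length + 2 * max_jitter
    min_window = output_length - 2 * max_jitter
    return max_window, min_window


def window_total(profile, center, width):
    """Sum the counts in a window of the given width around center."""
    start = max(0, center - width // 2)
    end = min(len(profile), start + width)
    return sum(profile[start:end])


def passes_count_filters(profile, center, input_length, output_length,
                         max_jitter, max_counts, min_counts):
    """Apply the max filter on the big window and the min filter on the small one."""
    max_window, min_window = counts_windows(
        input_length, output_length, max_jitter)
    big_total = window_total(profile, center, max_window)
    small_total = window_total(profile, center, min_window)
    return big_total <= max_counts and small_total >= min_counts


# Toy chromosome: a spike far from the region and a modest signal at its center.
profile = [0] * 400
profile[40] = 500
for position in range(190, 210):
    profile[position] = 1

print(counts_windows(300, 100, 20))
# Without jitter the spike lies outside the big window, so the region passes.
print(passes_count_filters(profile, 200, 300, 100, 0, 100, 10))
# With a jitter of 20 the big window reaches the spike, so the region fails.
print(passes_count_filters(profile, 200, 300, 100, 20, 100, 10))
```

The asymmetry is deliberate: the max filter looks at everything jitter could ever expose, while the min filter looks only at what every jittered window is guaranteed to contain.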
Most columns ignored
prepareBed takes a very lenient approach to validating your bed files. It will not check that the score column in your file is numeric, nor will it check to see if you have flipped some columns in your input file.
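Since prepareBed does not validate these things, it can be worth running a quick check of your own before handing it a bed file. The sketch below is one way to do that. lint_bed_lines is a hypothetical helper, not part of BPReveal, and it covers only a few common mistakes in tab-separated bed lines (it does not handle comment or track lines).

```python
def lint_bed_lines(lines):
    """Return (line number, message) pairs for suspicious bed lines."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            problems.append((line_number, "fewer than three columns"))
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append((line_number, "start or end is not an integer"))
            continue
        if start >= end:
            # Often a sign that the start and end columns were flipped.
            problems.append((line_number, "start is not less than end"))
        if len(fields) >= 5:
            try:
                float(fields[4])
            except ValueError:
                problems.append((line_number, "score column is not numeric"))
    return problems


example_lines = [
    "chr1\t100\t200\tpeak1\t5.0",
    "chr1\t300\t250",
    "chr2\t10\t20\tpeak2\thigh",
    "chr3\t5",
]
for line_number, message in lint_bed_lines(example_lines):
    print(f"line {line_number}: {message}")
```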
History
The old bigwigs format was deprecated in BPReveal 4.0.0 and will be
removed in BPReveal 5.0.0.
The remove-overlaps field became mandatory in BPReveal 3.0.0.
API
- bpreveal.prepareBed.loadRegionsByChrom(trainChroms, valChroms, testChroms, regionFnames)
Load splits based on chromosomes.
Given chromosomes for the training, validation, and test splits, and a list of bed files, generate the regions in each split.
- Parameters:
trainChroms (list[str]) – A list of chromosome names for the training split.
valChroms (list[str]) – A list of chromosomes for the validation split.
testChroms (list[str]) – A list of chromosome names for the test split.
regionFnames (list[str]) – A list of bed file names to read in.
- Returns:
The training, test, validation, and reject regions as lists.
- Return type:
tuple[list[Interval], list[Interval], list[Interval], list[Interval]]
- bpreveal.prepareBed.loadRegionsByBed(trainRegionFnames, valRegionFnames, testRegionFnames)
Given bed file names, load them into lists of intervals.
- Parameters:
trainRegionFnames (list[str]) – A list of bed files containing regions to train on.
valRegionFnames (list[str]) – A list of bed files containing regions to use for validation.
testRegionFnames (list[str]) – A list of bed files for the test set of regions.
- Returns:
Four lists of regions: Training, test, validation, and rejects. (Rejects will always be empty with this function.)
- Return type:
tuple[list[Interval], list[Interval], list[Interval], list[Interval]]
- bpreveal.prepareBed.loadRegionsByRegex(trainString, testString, valString, regionFnames)
Go over the bed files and assign splits based on regexes matched against the name column.
- Parameters:
trainString (str) – The regex that matches samples in the training split
testString (str) – The regex that matches samples in the test split
valString (str) – The regex that matches samples in the validation split.
regionFnames (list[str]) – A list of bed files that will be read in.
- Returns:
Four lists of Intervals, corresponding to the training, test, validation, and rejected regions.
- Return type:
tuple[list[Interval], list[Interval], list[Interval], list[Interval]]
- bpreveal.prepareBed.loadRegions(config)
Given a configuration (see the specification), return four PyBedTools BedTool objects.
- Parameters:
config (dict) – A JSON object satisfying the prepareBed specification.
- Returns:
Four BedTools:
The first will consist of the training regions,
the second will be the validation regions,
then the test regions,
finally any regions that were rejected on loading.
- Return type:
tuple[BedTool, BedTool, BedTool, BedTool]
- bpreveal.prepareBed.removeOverlaps(config, regions, genome)
Remove overlaps among the given regions.
- Parameters:
config (dict) – Straight from the JSON.
regions (BedTool) – A BedTool (or list of Intervals)
genome (FastaFile) – A FastaFile (not string) giving the genome.
- Return type:
tuple[BedTool, BedTool]
Takes in the list of regions, resizes each to the minimum size, and if there are overlaps, randomly chooses one of the overlapping regions.
- bpreveal.prepareBed.filterByMaxCounts(config, bigRegionsList, bigwigLists, validRegions, numThreads)
Filters the regions in bigRegionsList based on the max-quantile or max-counts in the config.
- Parameters:
config (dict) – Straight from the configuration JSON.
bigRegionsList (list[Interval]) – A list of intervals that have already been inflated to account for jitter
bigwigLists (list[list[str]]) – The bigwigs that should be scanned, grouped by head.
validRegions (ndarray) – A vector of booleans, one for each region in bigRegionsList. If region i is rejected, then validRegions[i] will be 0 when this function exits.
numThreads (int) – How many parallel workers should be used?
- Returns:
A BedTool containing only valid regions.
- Return type:
BedTool
- bpreveal.prepareBed.filterByMinCounts(config, smallRegionsList, bigRegionsList, bigwigLists, validRegions, numThreads)
Filters the regions in smallRegionsList based on the min-quantile or min-counts in the config.
- Parameters:
config (dict) – Straight from the configuration JSON.
bigRegionsList (list[Interval]) – A list of intervals that have already been inflated to account for jitter
smallRegionsList (list[Interval]) – A list of intervals that have already been shrunk to account for jitter
bigwigLists (list[list[str]]) – The bigwigs that should be scanned, grouped by head.
validRegions (ndarray) – A vector of booleans, one for each region. For each region k in smallRegionsList, corresponding to region i in bigRegionsList, if region k is rejected, then validRegions[i] will be 0 when this function exits.
numThreads (int) – How many parallel workers should be used?
- Return type:
None
- bpreveal.prepareBed.validateRegions(config, regions, genome, bigwigLists, numThreads)
The workhorse of this program.
- Parameters:
config (dict) – Straight from the JSON.
regions (BedTool) – A BedTool or list.
genome (FastaFile) – A FastaFile (not the name as a str.)
bigwigLists (list[list[str]]) – The names of the data files to use.
numThreads (int) – How many parallel workers should be used?
- Returns:
Two BedTools, one for regions that passed the filters and another for those that failed.
- Return type:
tuple[BedTool, BedTool]
Given a config (see the spec), a BedTool of regions, an open pysam FastaFile, and a list of bigwigs to check, filter down the regions so that they satisfy the configuration. Returns two BedTools: The first contains the regions that passed the filters, and the second contains the rejected regions.
- bpreveal.prepareBed.rewriteOldBigwigsFormat(config)
If the config has a bigwigs section, rewrite it to the new style.
- Parameters:
config (dict)
- Return type:
None
- bpreveal.prepareBed.prepareBeds(config)
The main function of this script.
- Parameters:
config (dict) – A JSON object matching the prepareBed specification.
- Return type:
None
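The overlap removal described for removeOverlaps can be sketched on plain (chrom, start, end) tuples. This is an illustration under assumptions, not the BPReveal code: resizing is done around each region's midpoint, overlapping resized regions are grouped into clusters, one member of each cluster is kept at random, and the original (unresized) regions are returned as kept and rejected lists. The remove_overlaps name and the tuple representation are hypothetical.

```python
import random


def remove_overlaps(regions, max_distance, rng):
    """Return (kept, rejected) lists of the original regions."""
    # Shrink every region to max_distance bases around its midpoint.
    shrunk = []
    for index, (chrom, start, end) in enumerate(regions):
        center = (start + end) // 2
        new_start = center - max_distance // 2
        shrunk.append((chrom, new_start, new_start + max_distance, index))
    shrunk.sort()
    # Group shrunk regions that overlap on the same chromosome.
    clusters = []
    for item in shrunk:
        chrom, start, end, _ = item
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start < last["end"]:
            last["members"].append(item)
            last["end"] = max(last["end"], end)
        else:
            clusters.append({"chrom": chrom, "end": end, "members": [item]})
    # Keep one randomly chosen member per cluster and reject the others.
    kept, rejected = [], []
    for cluster in clusters:
        winner = rng.choice(cluster["members"])
        for member in cluster["members"]:
            original = regions[member[3]]
            if member[3] == winner[3]:
                kept.append(original)
            else:
                rejected.append(original)
    return kept, rejected


regions = [
    ("chr1", 100, 300),
    ("chr1", 150, 350),
    ("chr1", 1000, 1200),
    ("chr2", 100, 300),
]
kept, rejected = remove_overlaps(regions, 100, random.Random(0))
print(len(kept), len(rejected))
```

Passing a seeded random.Random makes the choice reproducible, which helps when comparing runs.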
Schema
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "prepareBed",
    "description": "Schema for prepareBed.py",
    "type": "object",
    "properties": {
        "heads": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "bigwig-names": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "max-counts":{
                        "type": "number"
                    },
                    "min-counts":{
                        "type": "number"
                    },
                    "max-quantile":{
                        "$ref": "/schema/base#/definitions/fraction"
                    },
                    "min-quantile":{
                        "$ref": "/schema/base#/definitions/fraction"
                    }
                },
                "required": ["bigwig-names"],
                "allOf": [
                    {"oneOf": [
                        {"required": ["max-counts"]},
                        {"required": ["max-quantile"]}]},
                    {"oneOf": [
                        {"required": ["min-counts"]},
                        {"required": ["min-quantile"]}]}
                ]
            }
        },
        "splits":{
            "oneOf":[
                {
                    "type": "object",
                    "properties": {
                        "train-chroms": {"type": "array",
                                         "items": {"type": "string"}},
                        "val-chroms": {"type": "array",
                                       "items": {"type": "string"}},
                        "test-chroms": {"type": "array",
                                        "items": {"type": "string"}},
                        "regions": {"type": "array",
                                    "items": {"type": "string"}}
                    },
                    "required": ["train-chroms", "val-chroms",
                                 "test-chroms", "regions"]
                },
                {
                    "type": "object",
                    "properties": {
                        "train-regions": {"type": "array",
                                          "items": {"type": "string"}},
                        "val-regions": {"type": "array",
                                        "items": {"type": "string"}},
                        "test-regions": {"type": "array",
                                         "items": {"type": "string"}}
                    },
                    "required": ["train-regions", "val-regions", "test-regions"]
                },
                {
                    "type": "object",
                    "properties": {
                        "train-regex": {"type": "string"},
                        "val-regex": {"type": "string"},
                        "test-regex": {"type": "string"},
                        "regions": {"type": "array",
                                    "items": {"type": "string"}}
                    },
                    "required": ["train-regex", "val-regex",
                                 "test-regex", "regions"]
                }]},
        "genome": {"type": "string"},
        "output-length": {"type": "integer"},
        "input-length": {"type": "integer"},
        "max-jitter": {"type": "integer"},
        "output-prefix": {"type": "string"},
        "output-train": {"type": "string"},
        "output-val": {"type": "string"},
        "output-test": {"type": "string"},
        "resize-mode": {"type": "string", "enum": ["none", "center", "start"]},
        "remove-overlaps": {"type": "boolean"},
        "overlap-max-distance": {"type": "integer"},
        "num-threads" : {"type" : "integer"},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["heads", "splits", "genome", "output-length", "input-length",
                 "max-jitter", "remove-overlaps", "verbosity"],
    "allOf": [
        {
            "if": {
                "properties": {
                    "remove-overlaps": {"const": true}}},
            "then":{
                "required": ["overlap-max-distance"]
            },
            "else":{
                "not": {"required": ["overlap-max-distance"]}
            }
        },
        {
            "oneOf": [
                {"required": ["output-prefix"]},
                {"required": ["output-train", "output-val", "output-test"]}]
        }
    ]
}