prepareBed

Generates test, train, and validation splits and optionally performs some filtering.

BNF

<prepare-bed-configuration> ::=
    {<bigwig-section>,
     "splits" : {<split-settings>},
     "genome" : <file-name>,
     "output-length" : <integer>,
     "input-length" : <integer>,
     "max-jitter" : <integer>,
     <output-file-name-section>,
     "resize-mode" : <resize-mode>,
     <overlap-section>
     <num-threads-section>
     <verbosity-section>}
<bigwig-section> ::=
    "heads" : [<head-preparation-list>]
  | (DEPRECATED) "bigwigs" : [<bigwig-preparation-list>]
<overlap-section> ::=
    "remove-overlaps" : true,
    "overlap-max-distance" : <integer>,
  | "remove-overlaps" : false,
<head-preparation-list> ::=
    <individual-preparation-head>
  | <individual-preparation-head>, <head-preparation-list>
<resize-mode> ::=
    "none"
  | "center"
  | "start"
<output-file-name-section> ::=
    "output-prefix" : "<string>"
  | "output-train" : <file-name>,
    "output-val" : <file-name>,
    "output-test" : <file-name>
<individual-preparation-head> ::=
 { "bigwig-names" : [<list-of-bigwig-files>],
   <max-cutoff-section>,
   <min-cutoff-section>
   }
<max-cutoff-section> ::=
   "max-quantile" : <number>
 | "max-counts" : <integer>
<min-cutoff-section> ::=
   "min-quantile" : <number>
 | "min-counts" : <integer>
<split-settings> ::=
    <split-by-chromosome-settings>
  | <split-by-name-settings>
  | <split-by-bed-settings>
<split-by-chromosome-settings> ::=
    "train-chroms" : [<list-of-string>],
    "val-chroms" : [<list-of-string>],
    "test-chroms" : [<list-of-string>],
    "regions" : [<list-of-bed-files>]
<split-by-bed-settings> ::=
    "train-regions" : [<list-of-bed-files>],
    "val-regions" : [<list-of-bed-files>],
    "test-regions" : [<list-of-bed-files>]
<split-by-name-settings> ::=
    "regions" : [<list-of-bed-files>],
    "test-regex" : "<string>",
    "train-regex" : "<string>",
    "val-regex" : "<string>"
<list-of-bigwig-files> ::=
    <file-name>, <list-of-bigwig-files>
  | <file-name>
<list-of-bed-files> ::=
    <file-name>, <list-of-bed-files>
  | <file-name>
<num-threads-section> ::=
    <empty>
  | "num-threads" : <integer>,
<bigwig-preparation-list> ::=
(DEPRECATED)    <individual-preparation-bigwig>
(DEPRECATED)  | <individual-preparation-bigwig>, <bigwig-preparation-list>
<individual-preparation-bigwig> ::=
(DEPRECATED)  { "file-name" : <file-name>,
(DEPRECATED)    <max-cutoff-section>,
(DEPRECATED)    <min-cutoff-section>
(DEPRECATED)    }
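
As an illustration of the grammar above, a configuration could be assembled in Python and serialized to JSON as sketched below. Every file name, chromosome name, and number here is a made-up placeholder, and the verbosity value "INFO" is an assumption, since the verbosity section is defined elsewhere in the specification.

```python
import json

# Hypothetical prepareBed configuration; all paths and numbers are placeholders.
config = {
    "heads": [
        {
            # e.g. positive and negative strands of one sample
            "bigwig-names": ["sample_positive.bw", "sample_negative.bw"],
            "max-quantile": 0.99,
            "min-counts": 10,
        }
    ],
    "splits": {
        "train-chroms": ["chr1", "chr2", "chr3"],
        "val-chroms": ["chr4"],
        "test-chroms": ["chr5"],
        "regions": ["peaks.bed"],
    },
    "genome": "genome.fa",
    "output-length": 1000,
    "input-length": 3000,
    "max-jitter": 100,
    "output-prefix": "prepared",
    "resize-mode": "center",
    "remove-overlaps": True,
    "overlap-max-distance": 100,
    "num-threads": 4,
    "verbosity": "INFO",  # Assumed value; see the verbosity section of the spec.
}

# Serialize to the JSON text that would be saved as the configuration file.
config_json = json.dumps(config, indent=4)
print(config_json)
```

Note that this example uses output-prefix, so it omits output-train, output-val, and output-test, and that overlap-max-distance appears only because remove-overlaps is true.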

Parameter Notes

bigwig-names

A list of the data bigwigs that correspond to this head. For example, these might be the positive and negative strands of a ChIP-nexus sample.

resize-mode

Specifies where within each region of the bed file the output regions should be centered. Note that this program assumes your bed files are in bed3 format, that is, (chrom, start, stop). If you have additional columns with information like peak offset, those data will be ignored.

max-quantile, min-quantile

The max and min quantile values, if provided, are used to threshold which regions are included in the output. First, the counts in every given region are computed (which takes a while!), and then the requested quantile of those counts is computed. Regions whose counts exceed the max-quantile value, or fall below the min-quantile value, are excluded from the output files.
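
The quantile thresholding can be sketched with NumPy as follows. The counts are invented, and both the interpolation method and the choice of strict versus inclusive comparison are assumptions rather than a statement of what prepareBed does internally.

```python
import numpy as np

# Invented total counts, one value per candidate region.
region_counts = np.array([5, 12, 40, 7, 300, 22])

max_quantile = 0.9  # hypothetical "max-quantile" setting
min_quantile = 0.1  # hypothetical "min-quantile" setting

# Turn the quantiles into count thresholds over all regions.
max_threshold = np.quantile(region_counts, max_quantile)
min_threshold = np.quantile(region_counts, min_quantile)

# Assumption: a region is kept when its counts lie between the two
# thresholds (inclusive); anything outside is rejected.
keep = (region_counts >= min_threshold) & (region_counts <= max_threshold)

print("thresholds:", min_threshold, max_threshold)
print("kept counts:", region_counts[keep])
```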

max-counts, min-counts

Similarly, if max and min counts are given, all regions having more (or fewer) reads than the given number will be excluded.

output-prefix

Specifies the base name for the output bed files. You can use either output-prefix OR list all three of output-train, output-val, and output-test. If you specify output-prefix, then five bed files will be written: output-prefix_train.bed, output-prefix_val.bed, output-prefix_test.bed, output-prefix_all.bed, and output-prefix_reject.bed.

output-train, output-val, output-test

If you give these file names, then the training, validation and test splits will be written to these three files, respectively.

regions

Needed for splits by chromosome or by regex. This is a list of bed files that together contain every possible region that you might want to train on.

train-chroms, val-chroms, test-chroms

Split up your input regions by chromosome: each region goes to the split whose list names its chromosome.
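
A minimal sketch of a chromosome-based split, using plain (chrom, start, end) tuples instead of bed intervals. The function name split_by_chrom is hypothetical, and sending regions on unlisted chromosomes to a reject list is an assumption based on the reject regions mentioned in the API section.

```python
def split_by_chrom(regions, train_chroms, val_chroms, test_chroms):
    """Assign (chrom, start, end) tuples to splits by chromosome name."""
    splits = {"train": [], "val": [], "test": [], "reject": []}
    for region in regions:
        chrom = region[0]
        if chrom in train_chroms:
            splits["train"].append(region)
        elif chrom in val_chroms:
            splits["val"].append(region)
        elif chrom in test_chroms:
            splits["test"].append(region)
        else:
            # Assumption: chromosomes named in no list are rejected.
            splits["reject"].append(region)
    return splits


example_regions = [
    ("chr1", 100, 300),
    ("chr4", 500, 700),
    ("chr5", 900, 1100),
    ("chrM", 10, 210),
]
chrom_splits = split_by_chrom(example_regions, ["chr1"], ["chr4"], ["chr5"])
print(chrom_splits)
```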

train-regex, val-regex, test-regex

If you use a regex, then the name field of each bed line will be matched against each of the three regexes. The line will be added to the split where it matches. If a bed line matches more than one regex, that will raise an error. If a line matches no regexes, it is added to the rejects.
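
The matching rule can be sketched with Python's re module. The helper assign_split is hypothetical, and the use of re.search (match anywhere in the name) rather than a full-string match is an assumption.

```python
import re


def assign_split(name, train_regex, val_regex, test_regex):
    """Return "train", "val", "test", or "reject" for one bed name field."""
    patterns = {"train": train_regex, "val": val_regex, "test": test_regex}
    # Assumption: a regex "matches" if it is found anywhere in the name.
    matched = [split for split, pattern in patterns.items()
               if re.search(pattern, name)]
    if len(matched) > 1:
        # A name matching several regexes is treated as an error.
        raise ValueError(f"{name!r} matches more than one split: {matched}")
    # A name matching no regex goes to the rejects.
    return matched[0] if matched else "reject"


print(assign_split("peak_train_17", r"_train_", r"_val_", r"_test_"))
print(assign_split("peak_unlabeled", r"_train_", r"_val_", r"_test_"))
```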

train-regions, val-regions, test-regions

You may provide a specific bed file for each of the splits. In this case, the regions in each of these files are used to construct each respective split.

remove-overlaps

Set this flag to true to exclude overlapping regions. This is done by resizing all regions down to overlap-max-distance, and then, if multiple regions have an overlap, one is deleted at random. If remove-overlaps is false, then it is an error to set overlap-max-distance.
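
One way to picture overlap removal is the sketch below, which works on plain (chrom, start, end) tuples. It is not the prepareBed implementation: the function name, the centered resizing, and the policy of keeping exactly one randomly chosen region per group of overlapping regions are all assumptions.

```python
import random


def remove_overlaps(regions, overlap_max_distance, seed=None):
    """Return (kept, rejected) lists of (chrom, start, end) tuples."""
    rng = random.Random(seed)

    # Shrink every region to overlap_max_distance around its midpoint,
    # remembering which original region each shrunken one came from.
    shrunk = []
    for index, (chrom, start, end) in enumerate(regions):
        new_start = (start + end) // 2 - overlap_max_distance // 2
        shrunk.append((chrom, new_start, new_start + overlap_max_distance, index))
    shrunk.sort()

    # Sweep along each chromosome, grouping shrunken regions that overlap.
    clusters = []
    for chrom, start, end, index in shrunk:
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start < last["end"]:
            last["members"].append(index)
            last["end"] = max(last["end"], end)
        else:
            clusters.append({"chrom": chrom, "end": end, "members": [index]})

    # Assumption: keep one random member per group and reject the rest.
    kept, rejected = [], []
    for cluster in clusters:
        winner = rng.choice(cluster["members"])
        for index in cluster["members"]:
            (kept if index == winner else rejected).append(regions[index])
    return kept, rejected


demo_regions = [("chr1", 100, 300), ("chr1", 150, 350),
                ("chr1", 1000, 1200), ("chr2", 100, 300)]
kept, rejected = remove_overlaps(demo_regions, overlap_max_distance=100, seed=0)
print("kept:", kept)
print("rejected:", rejected)
```

In this example only the first two regions still overlap after shrinking to 100 bp, so one of them ends up in the rejected list.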

num-threads

How many threads should be used for loading counts information? I recommend setting this to as many threads as your machine has.

Additional information

Counts windowing

I should mention that the maximum and minimum counts are not compared across the same window. When comparing a region against the maximum counts value, all counts within a window of size output-length + 2*max-jitter are added up. This way, if you have a crazy-huge spike just outside your region, that region will be rejected if the jittering could include it in the training data. Conversely, for minimum counts, only the counts within a window of length output-length - 2*max-jitter are considered. This way, no matter what jitter value is selected, there will be at least the given number of counts in the region.
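
The reasoning behind the two windows can be checked with a little arithmetic. In this sketch, length stands for the un-jittered window length and the function name count_windows is hypothetical; all numbers are invented. The loop confirms that every jittered window lies inside the large window (so a nearby spike is always caught) and contains the small window (so the minimum is always met).

```python
def count_windows(center, length, max_jitter):
    """Return the (start, end) of the large and small counting windows."""
    half = length // 2
    # Large window: everything any jittered window could possibly touch.
    big = (center - half - max_jitter, center + half + max_jitter)
    # Small window: the part that every jittered window is sure to cover.
    small = (center - half + max_jitter, center + half - max_jitter)
    return big, small


center, length, max_jitter = 5000, 1000, 100
big, small = count_windows(center, length, max_jitter)
print("big window:", big, "size", big[1] - big[0])
print("small window:", small, "size", small[1] - small[0])

# Slide the window through every allowed jitter offset and check containment.
for shift in range(-max_jitter, max_jitter + 1):
    jittered = (center - length // 2 + shift, center + length // 2 + shift)
    assert big[0] <= jittered[0] and jittered[1] <= big[1]
    assert jittered[0] <= small[0] and small[1] <= jittered[1]
```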

Most columns ignored

prepareBed takes a very lenient approach to validating your bed files. It will not check that the score column in your file is numeric, nor will it check to see if you have flipped some columns in your input file.

History

The old bigwigs format was deprecated in BPReveal 4.0.0 and will be removed in BPReveal 5.0.0.

The remove-overlaps field became mandatory in BPReveal 3.0.0.

API

bpreveal.prepareBed.loadRegionsByChrom(trainChroms, valChroms, testChroms, regionFnames)

Load splits based on chromosomes.

Given chromosomes for the training, validation, and test splits, and a list of bed files, generate the regions in each split.

Parameters:
  • trainChroms (list[str]) – A list of chromosome names for the training split.

  • valChroms (list[str]) – A list of chromosome names for the validation split.

  • testChroms (list[str]) – A list of chromosome names for the test split.

  • regionFnames (list[str]) – A list of bed file names to read in.

Returns:

The training, test, validation, and reject regions as lists.

Return type:

tuple[list[Interval], list[Interval], list[Interval], list[Interval]]

bpreveal.prepareBed.loadRegionsByBed(trainRegionFnames, valRegionFnames, testRegionFnames)

Given bed file names, load them into lists of intervals.

Parameters:
  • trainRegionFnames (list[str]) – A list of bed files containing regions to train on.

  • valRegionFnames (list[str]) – A list of bed files containing regions to use for validation.

  • testRegionFnames (list[str]) – A list of bed files for the test set of regions.

Returns:

Four lists of regions: Training, test, validation, and rejects. (Rejects will always be empty with this function.)

Return type:

tuple[list[Interval], list[Interval], list[Interval], list[Interval]]

bpreveal.prepareBed.loadRegionsByRegex(trainString, testString, valString, regionFnames)

Go over the bed files and assign splits based on regexes matched against the name column.

Parameters:
  • trainString (str) – The regex that matches samples in the training split.

  • testString (str) – The regex that matches samples in the test split.

  • valString (str) – The regex that matches samples in the validation split.

  • regionFnames (list[str]) – A list of bed files that will be read in.

Returns:

Four lists of Intervals, corresponding to the training, test, validation, and rejected regions.

Return type:

tuple[list[Interval], list[Interval], list[Interval], list[Interval]]

bpreveal.prepareBed.loadRegions(config)

Given a configuration (see the specification), return four PyBedTools BedTool objects.

Parameters:

config (dict) – A JSON object satisfying the prepareBed specification.

Returns:

Four BedTools:

  1. The first will consist of the training regions,

  2. the second will be the validation regions,

  3. then the test regions,

  4. finally any regions that were rejected on loading.

bpreveal.prepareBed.removeOverlaps(config, regions, genome)

Remove overlaps among the given regions.

Parameters:
  • config (dict) – Straight from the JSON.

  • regions (BedTool) – A BedTool (or list of Intervals)

  • genome (FastaFile) – A FastaFile (not string) giving the genome.

Return type:

tuple[BedTool, BedTool]

Takes in the list of regions, resizes each to the minimum size, and if there are overlaps, randomly chooses one of the overlapping regions.

bpreveal.prepareBed.filterByMaxCounts(config, bigRegionsList, bigwigLists, validRegions, numThreads)

Filters the regions in bigRegionsList based on the max-quantile or max-counts in the config.

Parameters:
  • config (dict) – Straight from the configuration JSON.

  • bigRegionsList (list[Interval]) – A list of intervals that have already been inflated to account for jitter.

  • bigwigLists (list[list[str]]) – The bigwigs that should be scanned, grouped by head.

  • validRegions (ndarray) – A vector of booleans, one for each region in bigRegionsList. If region i is rejected, then validRegions[i] will be 0 when this function exits.

  • numThreads (int) – How many parallel workers should be used?

Returns:

A BedTool containing only valid regions.

Return type:

BedTool

bpreveal.prepareBed.filterByMinCounts(config, smallRegionsList, bigRegionsList, bigwigLists, validRegions, numThreads)

Filters the regions in smallRegionsList based on the min-quantile or min-counts in the config.

Parameters:
  • config (dict) – Straight from the configuration JSON.

  • smallRegionsList (list[Interval]) – A list of intervals that have already been shrunk to account for jitter.

  • bigRegionsList (list[Interval]) – A list of intervals that have already been inflated to account for jitter.

  • bigwigLists (list[list[str]]) – The bigwigs that should be scanned, grouped by head.

  • validRegions (ndarray) – A vector of booleans. For each region k in smallRegionsList, corresponding to region i in bigRegionsList, if region k is rejected, then validRegions[i] will be 0 when this function exits.

  • numThreads (int) – How many parallel workers should be used?

bpreveal.prepareBed.validateRegions(config, regions, genome, bigwigLists, numThreads)

The workhorse of this program.

Parameters:
  • config (dict) – Straight from the JSON.

  • regions (BedTool) – A BedTool or list.

  • genome (FastaFile) – A FastaFile (not the name as a str.)

  • bigwigLists (list[list[str]]) – The names of the data files to use.

  • numThreads (int) – How many parallel workers should be used?

Returns:

Two BedTools, one for regions that passed the filters and another for those that failed.

Given a config (see the spec), a BedTool of regions, an open pysam FastaFile, and a list of bigwigs to check, filter down the regions so that they satisfy the configuration.

bpreveal.prepareBed.rewriteOldBigwigsFormat(config)

If the config has a bigwigs section, rewrite it to the new style.
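
Judging from the two grammars in the BNF section, such a rewrite could look roughly like the sketch below. This is a guess at the transformation, not the actual function: the name rewrite_old_bigwigs is hypothetical, and mapping each old entry to a head with exactly one bigwig, as well as returning a copy rather than modifying the config in place, are assumptions.

```python
import copy


def rewrite_old_bigwigs(config):
    """Return a copy of config with a "bigwigs" list converted to "heads"."""
    new_config = copy.deepcopy(config)
    if "bigwigs" in new_config:
        heads = []
        for entry in new_config.pop("bigwigs"):
            # Carry over the cutoff settings unchanged.
            head = {key: value for key, value in entry.items()
                    if key != "file-name"}
            # Assumption: each old entry becomes a single-bigwig head.
            head["bigwig-names"] = [entry["file-name"]]
            heads.append(head)
        new_config["heads"] = heads
    return new_config


old_style = {"bigwigs": [{"file-name": "sample.bw",
                          "max-quantile": 0.99,
                          "min-counts": 10}]}
new_style = rewrite_old_bigwigs(old_style)
print(new_style)
```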

bpreveal.prepareBed.prepareBeds(config)

The main function of this script.

Parameters:

config (dict) – A JSON object matching the prepareBed specification.

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "prepareBed",
    "description": "Schema for prepareBed.py",
    "type": "object",
    "properties": {
        "heads": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "bigwig-names": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "max-counts":{
                        "type": "number"
                    },
                    "min-counts":{
                        "type": "number"
                    },
                    "max-quantile":{
                        "type": "number"
                    },
                    "min-quantile":{
                        "type": "number"
                    }
                },
                "required": ["bigwig-names"],
                "allOf": [
                    {"oneOf": [
                        {"required": ["max-counts"]},
                        {"required": ["max-quantile"]}]},
                    {"oneOf": [
                        {"required": ["min-counts"]},
                        {"required": ["min-quantile"]}]}
                ]
            }
        },
        "splits":{
            "oneOf":[
                {
                    "type": "object",
                    "properties": {
                        "train-chroms": {"type": "array",
                                         "items": {"type": "string"}},
                        "val-chroms":   {"type": "array",
                                         "items": {"type": "string"}},
                        "test-chroms":  {"type": "array",
                                         "items": {"type": "string"}},
                        "regions":  {"type": "array",
                                     "items": {"type": "string"}}
                    },
                    "required": ["train-chroms", "val-chroms",
                                 "test-chroms", "regions"]
                },
                {
                    "type": "object",
                    "properties": {
                        "train-regions": {"type": "array",
                                          "items": {"type": "string"}},
                        "val-regions":   {"type": "array",
                                          "items": {"type": "string"}},
                        "test-regions":  {"type": "array",
                                          "items": {"type": "string"}}
                    },
                    "required": ["train-regions", "val-regions", "test-regions"]
                },
                {
                    "type": "object",
                    "properties": {
                        "train-regex": {"type": "string"},
                        "val-regex":   {"type": "string"},
                        "test-regex":  {"type": "string"},
                        "regions":  {"type": "array",
                                     "items": {"type": "string"}}
                    },
                    "required": ["train-regex", "val-regex",
                                 "test-regex", "regions"]
                }]},
        "genome": {"type": "string"},
        "output-length": {"type": "integer"},
        "input-length": {"type": "integer"},
        "max-jitter": {"type": "integer"},
        "output-prefix": {"type": "string"},
        "output-train": {"type": "string"},
        "output-val": {"type": "string"},
        "output-test": {"type": "string"},
        "resize-mode": {"type": "string", "enum": ["none", "center", "start"]},
        "remove-overlaps": {"type": "boolean"},
        "overlap-max-distance": {"type": "integer"},
        "num-threads" : {"type" : "integer"},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["heads", "splits", "genome", "output-length", "input-length",
                 "max-jitter", "remove-overlaps", "verbosity"],
    "allOf": [
        {
            "if": {
                "properties": {
                    "remove-overlaps": {"const": true}}},
            "then":{
                "required": ["overlap-max-distance"]
            },
            "else":{
                "not": {"required": ["overlap-max-distance"]}
            }
        },
        {
            "oneOf": [
                {"required": ["output-prefix"]},
                {"required": ["output-train", "output-val", "output-test"]}]
        }
    ]
}
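
The two cross-field rules in the top-level allOf block of the schema can be restated in plain Python, as sketched here. This is a hand-written illustration with a hypothetical function name, not a replacement for validating against the schema itself, and it checks only those two rules.

```python
def check_cross_field_rules(config):
    """Return a list of problems with the overlap and output-name fields."""
    problems = []

    # Mirrors the if/then/else block. "remove-overlaps" is required by the
    # schema, so this sketch assumes the key is present.
    has_distance = "overlap-max-distance" in config
    if config["remove-overlaps"] is True and not has_distance:
        problems.append("remove-overlaps is true but overlap-max-distance is missing")
    if config["remove-overlaps"] is not True and has_distance:
        problems.append("overlap-max-distance given but remove-overlaps is false")

    # Mirrors the oneOf block: exactly one of the two naming styles.
    has_prefix = "output-prefix" in config
    has_all_three = all(key in config for key in
                        ("output-train", "output-val", "output-test"))
    if has_prefix == has_all_three:
        problems.append("give either output-prefix or all three of "
                        "output-train, output-val, output-test")
    return problems


good = {"remove-overlaps": True, "overlap-max-distance": 100,
        "output-prefix": "prepared"}
bad = {"remove-overlaps": False, "overlap-max-distance": 100,
       "output-train": "train.bed"}
print(check_cross_field_rules(good))
print(check_cross_field_rules(bad))
```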