prepareTrainingData

Create the data files that will be used to train the model.

This program reads in a genome file, a list of regions in bed format, and a set of bigwig files containing profiles that the model will use to train. It generates an hdf5-format file that is used during training. If you want to train on a custom genome, or you don’t have a meaningful genome for your experiment, you can still provide sequences and profiles by creating an hdf5 file in the same format as this tool generates.

BNF

<prepare-data-configuration> ::=
    {
        "genome" : <file-name>,
        "input-length" : <integer>,
        "output-length" : <integer>,
        "max-jitter" : <integer>,
        "regions" : <file-name>,
        "output-h5" : <file-name>,
        "reverse-complement" : <boolean>,
        "heads" : [<prepare-data-heads-list>],
        <verbosity-section>
    }

<prepare-data-heads-list> ::=
   <prepare-data-individual-head>«, <prepare-data-heads-list>»

<prepare-data-individual-head> ::=
    {
        «<revcomp-head-section>»
        "bigwig-files" : [<list-of-bigwig-file-names>]
    }

<revcomp-head-section> ::=
  | "revcomp-task-order" : "auto",
  | "revcomp-task-order" : [<list-of-integer>],

<list-of-bigwig-file-names> ::=
   <file-name>«, <file-name>»

Parameter Notes

genome: The name of the fasta-format file for your organism.
regions: is the name of the bed file of regions you will train on. These regions must be output-length in length.
reverse-complement: A boolean that sets whether the data files will include reverse-complement augmentation. If this is set to true then you must include revcomp-task-order in every head section.
revcomp-task-order: A list specifying which tasks in the forward sample should map to the tasks in the reverse sample. Alternatively, this may be the string "auto". If reverse-complement is false, it is an error to specify revcomp-task-order.

Output specification

It will generate a file that is organized as follows:

head_0, head_1, head_2, …: There will be a head_n entry for each head in your model. It will have shape (num-regions x (output-length + 2*jitter) x num-tasks).
sequence: The one-hot encoded sequence for each corresponding region. It will have shape (num-regions x (input-length + 2*jitter) x NUM_BASES).
metadata: A group containing the configuration used when the program was run.

Additional information

Revcomp tasks

The revcomp-task-order parameter can be a bit tricky to understand. Generally, ask yourself “If we had sequenced the other strand of this chromosome, which profile would look like which?” If the data from one task, say, the positive task, would appear on the other task in this hypothetical universe, then you should flip the tasks.

For example, if the two tasks represent reads on the plus and minus strand, then when you create a reverse-complemented training example, the minus strand becomes the plus strand, and vice versa. So you’d set this parameter to [1,0] to indicate that the data for the two tasks should be swapped (in addition to reversed 5’ to 3’, of course).

If you only have one task in a head, you should set this to [0], to indicate that there is no swapping. If you have multiple tasks, say, a task for the left end of a read, one for the right end, and one for the middle, then the left and right should be swapped and the middle left alone. In this case, you’d set revcomp-task-order to [1,0,2]. If this parameter is set to "auto", then it will choose [1,0] if there are two strands, [0] if there is only one strand, and it will issue an error if there are more strands than that.

auto is appropriate for data like ChIP-nexus.

History

reverse-complement became mandatory in BPReveal 2.0.0

API

bpreveal.prepareTrainingData.revcompSeq(oneHotSeq)

Reverse-complement the given sequence.

Parameters:: oneHotSeq (ndarray[Any, dtype[uint8]])
Return type:: ndarray[Any, dtype[uint8]]

bpreveal.prepareTrainingData.getSequences(bed, genome, outputLength, inputLength, jitter, revcomp)

Extract sequences from the fasta.

Parameters:

bed (BedTool) – A BedTool containing the regions to get sequence data for. These regions should be outputLength wide.
genome (FastaFile) – The (open) FastaFile containing your genome sequence
outputLength (int) – The output length of your model
inputLength (int) – The input length of your model
jitter (int) – The maximum jitter that will be applied during training
revcomp (bool) – Should the returned sequence include reverse-complement data? If so, then each region will produce two sequences: one forward and one reverse-complemented.

Returns:

An array of one-hot-encoded sequences of shape `(numSequences x inputLength + 2*jitter, NUM_BASES)`.

Return type:

ndarray[Any, dtype[uint8]]

bpreveal.prepareTrainingData.getHead(bed, bigwigFnames, outputLength, jitter, revcomp)

Get all the data for a particular head.

Parameters:

bed (BedTool) – A BedTool containing the regions that will be used to train. Each interval in this BedTool should have length outputLength.
bigwigFnames (list[str]) – The names of the bigwig files that will be loaded.
outputLength (int) – The output length of your model
jitter (int) – The jitter that will be applied during training.
revcomp (Literal[False] | list[int]) – If no reverse-complement is desired, this is just False. Otherwise, see the section on revcomp-task-order for what this list means.

Returns:

An array of data that can be put in the training hdf5. It has shape (numSequences * outputLength + 2 * jitter, numTasks). numSequences will be the length of your bed file if revcomp == False, or twice the length of your bed file if you include revcomp augmentation.

Return type:

PRED_AR_T

bpreveal.prepareTrainingData.writeH5(config)

Main method, load the config and then generate training data hdf5 files.

Parameters:: config (dict) – The configuration json.
Return type:: None

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "prepareTrainingData",
    "description": "Schema for prepareTrainingData.py",
    "type": "object",
    "properties": {
        "genome": {"type": "string"},
        "output-length": {"type": "integer"},
        "input-length": {"type": "integer"},
        "max-jitter": {"type": "integer"},
        "regions": {"type": "string"},
        "output-h5": {"type": "string"},
        "reverse-complement": {"type": "boolean"},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"},
        "heads": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "bigwig-files": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "revcomp-task-order":{
                        "oneOf": [
                            {"enum": ["auto"]},
                            {"type": "array",
                                "items": {"type": "integer"}}]
                    }
                },
                "required": ["bigwig-files"]
            }
        }
    },
    "required": ["heads", "genome", "output-length", "input-length", "max-jitter",
               "regions", "output-h5", "reverse-complement",  "verbosity"]
}