prepareTrainingData
Create the data files that will be used to train the model.
This program reads in a genome file, a list of regions in bed format, and a set of bigwig files containing profiles that the model will use to train. It generates an hdf5-format file that is used during training. If you want to train on a custom genome, or you don’t have a meaningful genome for your experiment, you can still provide sequences and profiles by creating an hdf5 file in the same format as this tool generates.
BNF
<prepare-data-configuration> ::= { "genome" : <file-name>, "input-length" : <integer>, "output-length" : <integer>, "max-jitter" : <integer>, "regions" : <file-name>, "output-h5" : <file-name>, "reverse-complement" : <boolean>, "heads" : [<prepare-data-heads-list>], <verbosity-section> }
<prepare-data-heads-list> ::= <prepare-data-individual-head>«, <prepare-data-heads-list>»
<prepare-data-individual-head> ::= { «<revcomp-head-section>» "bigwig-files" : [<list-of-bigwig-file-names>] }
<revcomp-head-section> ::= | "revcomp-task-order" : "auto", | "revcomp-task-order" : [<list-of-integer>],
<list-of-bigwig-file-names> ::= <file-name>«, <file-name>»
Parameter Notes
- genome
The name of the fasta-format file for your organism.
- regions
is the name of the bed file of regions you will train on. These regions must be
output-lengthin length.- reverse-complement
A boolean that sets whether the data files will include reverse-complement augmentation. If this is set to
truethen you must includerevcomp-task-orderin every head section.- revcomp-task-order
A list specifying which tasks in the forward sample should map to the tasks in the reverse sample. Alternatively, this may be the string
"auto". Ifreverse-complementis false, it is an error to specifyrevcomp-task-order.
Output specification
It will generate a file that is organized as follows:
- head_0, head_1, head_2, …
There will be a
head_nentry for each head in your model. It will have shape(num-regions x (output-length + 2*jitter) x num-tasks).- sequence
The one-hot encoded sequence for each corresponding region. It will have shape
(num-regions x (input-length + 2*jitter) x NUM_BASES).- metadata
A group containing the configuration used when the program was run.
Additional information
Revcomp tasks
The revcomp-task-order parameter can be a bit tricky to understand.
Generally, ask yourself “If we had sequenced the other strand of this
chromosome, which profile would look like which?” If the data from one task,
say, the positive task, would appear on the other task in this hypothetical
universe, then you should flip the tasks.
For example, if the two tasks represent reads on the plus and minus
strand, then when you create a reverse-complemented training example,
the minus strand becomes the plus strand, and vice versa.
So you’d set this parameter to [1,0] to indicate that the data
for the two tasks should be swapped (in addition to reversed 5’ to 3’,
of course).
If you only have one task in a head, you should set this to
[0], to indicate that there is no swapping.
If you have multiple tasks, say, a task for the left end of a read,
one for the right end, and one for the middle, then the left and right
should be swapped and the middle left alone.
In this case, you’d set revcomp-task-order to [1,0,2].
If this parameter is set to "auto", then it will choose
[1,0] if there are two strands, [0] if there is only
one strand, and it will issue an error if there are more strands than
that.
auto is appropriate for data like ChIP-nexus.
History
reverse-complement became mandatory in BPReveal 2.0.0
API
- bpreveal.prepareTrainingData.revcompSeq(oneHotSeq)
Reverse-complement the given sequence.
- Parameters:
oneHotSeq (ndarray[tuple[int, ...], dtype[uint8]])
- Return type:
ndarray[tuple[int, …], dtype[uint8]]
- bpreveal.prepareTrainingData.getSequences(bed, genome, outputLength, inputLength, jitter, revcomp)
Extract sequences from the fasta.
- Parameters:
bed (BedTool) – A BedTool containing the regions to get sequence data for. These regions should be outputLength wide.
genome (FastaFile) – The (open) FastaFile containing your genome sequence
outputLength (int) – The output length of your model
inputLength (int) – The input length of your model
jitter (int) – The maximum jitter that will be applied during training
revcomp (bool) – Should the returned sequence include reverse-complement data? If so, then each region will produce two sequences: one forward and one reverse-complemented.
- Returns:
An array of one-hot-encoded sequences of shape
`(numSequences x inputLength + 2*jitter, NUM_BASES)`.- Return type:
ndarray[tuple[int, …], dtype[uint8]]
- bpreveal.prepareTrainingData.getHead(bed, bigwigFnames, outputLength, jitter, revcomp)
Get all the data for a particular head.
- Parameters:
bed (BedTool) – A BedTool containing the regions that will be used to train. Each interval in this BedTool should have length outputLength.
bigwigFnames (list[str]) – The names of the bigwig files that will be loaded.
outputLength (int) – The output length of your model
jitter (int) – The jitter that will be applied during training.
revcomp (Literal[False] | list[int]) – If no reverse-complement is desired, this is just
False. Otherwise, see the section onrevcomp-task-orderfor what this list means.
- Returns:
An array of data that can be put in the training hdf5. It has shape (numSequences * outputLength + 2 * jitter, numTasks). numSequences will be the length of your bed file if
revcomp == False, or twice the length of your bed file if you include revcomp augmentation.- Return type:
- bpreveal.prepareTrainingData.writeH5(config)
Main method, load the config and then generate training data hdf5 files.
- Parameters:
config (dict) – The configuration json.
- Return type:
None
Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "prepareTrainingData",
"description": "Schema for prepareTrainingData.py",
"type": "object",
"properties": {
"genome": {"type": "string"},
"output-length": {"type": "integer"},
"input-length": {"type": "integer"},
"max-jitter": {"type": "integer"},
"regions": {"type": "string"},
"output-h5": {"type": "string"},
"reverse-complement": {"type": "boolean"},
"verbosity": {"$ref": "/schema/base#/definitions/verbosity"},
"heads": {
"type": "array",
"items": {
"type": "object",
"properties": {
"bigwig-files": {
"type": "array",
"items": {"type": "string"}
},
"revcomp-task-order":{
"oneOf": [
{"enum": ["auto"]},
{"type": "array",
"items": {"type": "integer"}}]
}
},
"required": ["bigwig-files"]
}
}
},
"required": ["heads", "genome", "output-length", "input-length", "max-jitter",
"regions", "output-h5", "reverse-complement", "verbosity"]
}