makePredictions
A script to make predictions using a BPReveal model.
This program streams input from disk and writes output as it calculates, so it can run with very little memory even for extremely large prediction tasks.
BNF
<prediction-input-configuration> ::= { <prediction-settings-section>, <prediction-input-section>, «"num-threads" : <integer>,» <verbosity-section> }
<prediction-input-section> ::= <prediction-fasta-input-section> | <prediction-bed-input-section>
<prediction-settings-section> ::= "settings" : { "output-h5" : <file-name>, "batch-size" : <integer>, "heads" : <integer>, "architecture" : <prediction-model-settings> }
<prediction-model-settings> ::= { "model-file" : <file-name>, "input-length" : <integer>, "output-length" : <integer> }
<prediction-bed-input-section> ::= "genome": <file-name>, "bed-file": <file-name>
<prediction-fasta-input-section> ::= « "coordinates" : { "bed-file" : <file-name>, "genome" : <file-name>}, » "fasta-file" : <file-name>
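As an illustration of the grammar above, here is a minimal sketch that builds one configuration for each input mode and serializes it to JSON. All file names are hypothetical placeholders, the lengths and head count are arbitrary, and the verbosity value "INFO" is an assumption, since the verbosity section is defined elsewhere.

```python
import json


def make_settings(output_h5):
    """Build the "settings" section; every value here is a placeholder."""
    return {
        "output-h5": output_h5,
        "batch-size": 64,
        "heads": 1,
        "architecture": {
            "model-file": "models/example.model",  # hypothetical path
            "input-length": 3090,
            "output-length": 1000,
        },
    }


# Bed input: regions come from a bed file, sequences from a genome fasta.
bed_config = {
    "settings": make_settings("predictions_bed.h5"),
    "genome": "genome.fa",
    "bed-file": "regions.bed",
    "verbosity": "INFO",  # assumed form of the verbosity section
}

# Fasta input: sequences come from a fasta file. The optional "coordinates"
# section only supplies coordinate information for the output file.
fasta_config = {
    "settings": make_settings("predictions_fasta.h5"),
    "coordinates": {"bed-file": "regions.bed", "genome": "genome.fa"},
    "fasta-file": "sequences.fa",
    "num-threads": 1,
    "verbosity": "INFO",  # assumed form of the verbosity section
}

bed_json = json.dumps(bed_config, indent=4)
fasta_json = json.dumps(fasta_config, indent=4)
print(bed_json)
```

Either JSON string can be saved to a file and used as the configuration for this program.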
Parameter Notes
- heads
Gives the number of output heads for your model. You don’t need to tell this program how many tasks there are for each head, since it just blindly sticks whatever the model outputs into the hdf5 file.
- output-h5
The name of the output file that will contain the predictions.
- batch-size
How many samples should be run simultaneously? I recommend 64 or so.
- model-file
The name of the Keras model file on disk.
- input-length, output-length
The input and output lengths of your model.
- fasta-file
A file containing the sequences for which you’d like predictions. Each sequence in this fasta file must be input-length long. If you specify fasta-file, you cannot also specify bed-file and genome (except, optionally, in the coordinates section).
- bed-file, genome
If you do not give fasta-file, you can instead give a bed-file and genome fasta. Each region in the bed file should be output-length long, and the program will automatically inflate the regions to the input-length of your model.
- num-threads
(Optional) How many parallel predictors should be run? Unless you’re really taxed for performance, leave this at 1.
- coordinates
(Optional, only valid with fasta-file.) The bed-file and genome entries may be specified to add coordinate information when you predict from fasta-file. If provided, the output hdf5 will contain chrom_names, chrom_sizes, coords_chrom, coords_start, and coords_stop datasets, in addition to the descriptions dataset. Only the coordinate information is taken from the bed file, and only chromosome size information is loaded from the genome file. The actual sequences to predict are drawn from fasta-file. This way, you can make predictions from a fasta file and still easily convert them to a bigwig.
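The inflation of bed regions described under bed-file can be pictured as padding each output-length window out to input-length. The sketch below assumes the padding is split evenly between the two sides; that is an assumption about the behavior, not something taken from the program’s source, and inflate_region is a hypothetical helper rather than part of BPReveal.

```python
def inflate_region(start, end, input_length, output_length):
    """Widen an output-length region to input-length, padding both sides equally.

    Assumes symmetric padding, which is an illustrative choice.
    """
    if end - start != output_length:
        raise ValueError("Region must be exactly output-length bases long.")
    extra = input_length - output_length
    if extra < 0 or extra % 2 != 0:
        raise ValueError("input-length must exceed output-length by an even amount.")
    pad = extra // 2
    return start - pad, end + pad


# A 1000 bp region inflated for a model with a 3090 bp input window.
inflated = inflate_region(10000, 11000, input_length=3090, output_length=1000)
print(inflated)
```

Under this assumption, regions closer than the padding width to a chromosome edge would run off the end of the chromosome after inflation.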
Output Specification
This program will produce an hdf5-format file containing the predicted values. It is organized as follows:
- descriptions
A list of strings of shape (numRegions,). If you give a fasta file, these will correspond to the description lines (i.e., the lines starting with >). If you give a bed file as input, each one will be an empty string.
- head_0, head_1, head_2, …
You get a subgroup for each output head of the model. The subgroups are named head_N, where N is 0, 1, 2, etc. Each head contains:
- logcounts
A vector of shape (numRegions,) that gives the logcounts value for each region.
- logits
The array of logit values for each track for each region. The shape is (numRegions x outputWidth x numTasks). Don’t forget that you must calculate the softmax on the whole set of logits, not on each task’s logits independently. (Use bpreveal.utils.logitsToProfile() to do this.)
- chrom_names
A list of strings that give you the meaning of each index in the coords_chrom dataset. This is particularly handy when you want to make a bigwig file, since you can extract a header from this data. Only populated if a bed file and genome were provided.
- chrom_sizes
The size of each chromosome in the same order as chrom_names. Mostly used to create bigwig headers. Only populated if a bed file and genome were provided.
- coords_chrom
A list of integers, one for each region predicted, that gives the chromosome index (see chrom_names) for that region. Only populated if a bed file and genome were provided.
- coords_start
The start base of each predicted region. Only populated if a bed file and genome were provided.
- coords_stop
The end point of each predicted region. Only populated if a bed file and genome were provided.
- metadata
A group containing the configuration that was used when the program was run.
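To make the softmax note under logits and the index lookup under coords_chrom concrete, here is a minimal standard-library sketch that works on plain Python lists for a single head. Both function names are hypothetical. Scaling the softmax by exp(logcounts) assumes that logcounts is the natural log of the total counts for the head. In practice you would read the datasets from the hdf5 file and call bpreveal.utils.logitsToProfile() instead.

```python
import math


def logits_to_profile(logits, logcounts):
    """Convert one region's logits, shape (outputWidth, numTasks), to a profile.

    The softmax is taken over ALL positions and tasks together, not per task.
    Assumes logcounts is the natural log of the region's total counts.
    """
    peak = max(value for row in logits for value in row)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [[math.exp(value - peak) for value in row] for row in logits]
    total = sum(sum(row) for row in exps)
    scale = math.exp(logcounts) / total
    return [[e * scale for e in row] for row in exps]


def region_coordinates(chrom_names, coords_chrom, coords_start, coords_stop):
    """Translate chromosome indexes into names, one (chrom, start, stop) per region."""
    return [
        (chrom_names[idx], start, stop)
        for idx, start, stop in zip(coords_chrom, coords_start, coords_stop)
    ]


# Two positions and two tasks with equal logits: the counts are spread evenly.
profile = logits_to_profile([[0.0, 0.0], [0.0, 0.0]], math.log(8.0))
regions = region_coordinates(["chr1", "chr2"], [1, 0], [100, 200], [1100, 1200])
print(profile, regions)
```

Because the normalization runs over the whole (outputWidth x numTasks) array, the values for all tasks in a head sum to exp(logcounts) together, not separately for each task.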
API
- bpreveal.makePredictions.getReader(config)
Loads the reader appropriate for the configuration.
- Parameters:
config (dict)
- Return type:
- bpreveal.makePredictions.getWriter(config, numPredictions)
Creates a writer appropriate for the configuration.
- Parameters:
config (dict)
numPredictions (int)
- Return type:
- bpreveal.makePredictions.main(config)
Run the predictions.
- Parameters:
config (dict) – The configuration, taken straight from the json specification.
- Return type:
None
Schema
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "makePredictions",
    "description": "Schema for makePredictions.py",
    "type": "object",
    "properties": {
        "settings": {
            "type": "object",
            "properties": {
                "output-h5": {"type": "string"},
                "batch-size": {"type": "integer"},
                "genome": {"type": "string"},
                "heads": {"type": "integer"},
                "architecture": {
                    "type": "object",
                    "properties": {
                        "input-length": {"type": "integer"},
                        "output-length": {"type": "integer"},
                        "model-file": {"type": "string"}
                    },
                    "required": ["input-length", "output-length", "model-file"]
                }
            },
            "required": ["output-h5", "batch-size", "heads", "architecture"]
        },
        "fasta-file": {"type": "string"},
        "bed-file": {"type": "string"},
        "num-threads": {"type": "integer", "minimum" : 1},
        "coordinates": {
            "type": "object",
            "properties": {
                "bed-file" : {"type": "string"},
                "genome" : {"type": "string"}
            },
            "required" : ["bed-file", "genome"]
        },
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["settings", "verbosity"],
    "oneOf": [
        {
            "required": ["fasta-file"],
            "not": {"required": ["bed-file"]}
        },
        {
            "required": ["bed-file"],
            "oneOf": [
                {"properties": {"settings": { "required": ["genome"]}}},
                {"required": ["genome"]}],
            "not": {"anyOf": [
                {"required": ["fasta-file"]},
                {"required": ["coordinates"]}]}
        }]
}
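The top-level oneOf in the schema above decides whether a configuration uses fasta or bed input. The sketch below restates only that rule in plain Python so the allowed combinations are easier to see. It is not a full schema validator, and input_mode is a hypothetical helper, not part of BPReveal.

```python
def input_mode(config):
    """Return "fasta" or "bed" for a configuration, or raise ValueError.

    Mirrors only the top-level oneOf rule of the schema; types and the
    other required keys are not checked here.
    """
    has_fasta = "fasta-file" in config
    has_bed = "bed-file" in config
    if has_fasta and has_bed:
        raise ValueError("Give either fasta-file or a top-level bed-file, not both.")
    if has_fasta:
        return "fasta"
    if not has_bed:
        raise ValueError("One of fasta-file or bed-file is required.")
    if "coordinates" in config:
        raise ValueError("coordinates is only valid with fasta-file.")
    # The inner oneOf: genome must appear in exactly one of the two places.
    genome_in_settings = "genome" in config.get("settings", {})
    genome_at_top = "genome" in config
    if genome_in_settings == genome_at_top:
        raise ValueError("Give genome either at the top level or in settings, exactly once.")
    return "bed"


print(input_mode({"settings": {}, "bed-file": "regions.bed", "genome": "genome.fa"}))
```

To validate a configuration completely, run it through a JSON Schema validator with the schema itself.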