makePredictions

A script to make predictions using a BPReveal model.

This program streams input from disk and writes output as it calculates, so it can run with very little memory even for extremely large prediction tasks.

BNF

<prediction-input-configuration> ::=
    {
        <prediction-settings-section>,
        <prediction-input-section>,
        «"num-threads" : <integer>,»
        <verbosity-section>
    }
<prediction-input-section> ::=
    <prediction-fasta-input-section>
  | <prediction-bed-input-section>
<prediction-settings-section> ::=
    "settings" : {
        "output-h5" : <file-name>,
        "batch-size" : <integer>,
        "heads" : <integer>,
        "architecture" : <prediction-model-settings> }
<prediction-model-settings> ::=
    {
        "model-file" : <file-name>,
        "input-length" : <integer>,
        "output-length" : <integer>
    }
<prediction-bed-input-section> ::=
    "genome": <file-name>,
    "bed-file": <file-name>
<prediction-fasta-input-section> ::=
   « "coordinates" : {
        "bed-file" : <file-name>,
        "genome" : <file-name>},
   »
    "fasta-file" : <file-name>

Parameter Notes

heads

Gives the number of output heads for your model. You don’t need to tell this program how many tasks there are for each head, since it just blindly sticks whatever the model outputs into the hdf5 file.

output-h5

The name of the output file that will contain the predictions.

batch-size

How many samples should be run simultaneously? I recommend 64 or so.

model-file

The name of the Keras model file on disk.

input-length, output-length

The input and output lengths of your model, in base pairs.

fasta-file

A file containing the sequences for which you’d like predictions. Each sequence in this fasta file must be input-length long. If you specify fasta-file, you cannot also specify bed-file and genome (except, optionally, inside the coordinates section).
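
Since every sequence has to be exactly input-length long, it can be worth checking a fasta file before starting a long prediction run. The following is a minimal sketch and not part of BPReveal; the function name checkFastaLengths is hypothetical, and it assumes an ordinary fasta layout where description lines start with > and a sequence may span several lines.

```python
def checkFastaLengths(lines, inputLength):
    """Return the descriptions of sequences whose length is not inputLength.

    lines is any iterable of text lines, such as an open file handle.
    This helper is a hypothetical illustration, not a BPReveal function.
    """
    bad = []
    description = None
    length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts, so check the record that just ended.
            if description is not None and length != inputLength:
                bad.append(description)
            description = line[1:]
            length = 0
        elif line:
            # Sequence lines may be wrapped, so accumulate their lengths.
            length += len(line)
    # Check the final record in the file.
    if description is not None and length != inputLength:
        bad.append(description)
    return bad


# Example with an in-memory fasta; with a real file you would pass
# an open file handle instead of a list.
exampleLines = [">first", "ACGT", "AC", ">second", "ACG"]
wrongLength = checkFastaLengths(exampleLines, 6)
```

Here the record named first has six bases split over two lines and passes, while second has only three bases and is reported.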

bed-file, genome

If you do not give fasta-file, you can instead give a bed-file and genome fasta. Each region in the bed file should be output-length long, and the program will automatically inflate the regions to the input-length of your model.
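
To make the inflation step concrete, here is a minimal sketch of how a region that is output-length wide could be widened to input-length. It assumes the padding is split evenly between both sides, which this page does not state explicitly; the function name inflateRegion and the lengths used are hypothetical.

```python
def inflateRegion(start, end, inputLength):
    """Widen the region [start, end) symmetrically to inputLength bases.

    Assumes equal padding on both sides. This is an illustration of the
    idea, not the code that makePredictions runs.
    """
    outputLength = end - start
    difference = inputLength - outputLength
    if difference < 0 or difference % 2 != 0:
        raise ValueError("inputLength must exceed the region width by an even amount")
    padding = difference // 2
    return start - padding, end + padding


# A 1000 bp region and a hypothetical model with a 3056 bp input.
inflated = inflateRegion(5000, 6000, 3056)
```

Under that assumption, a region that sits closer than the padding to either end of a chromosome cannot be inflated without running off the sequence.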

num-threads

(Optional) How many parallel predictors should be run? Unless you’re really taxed for performance, leave this at 1.

coordinates

(Optional, only valid with fasta-file.) The bed-file and genome entries may be specified to add coordinate information when you predict from fasta-file. If provided, then the output hdf5 will contain chrom_names, chrom_sizes, coords_chrom, coords_start, and coords_stop datasets, in addition to the descriptions dataset. Only the coordinate information is taken from the bed file, and only chromosome size information is loaded from the genome file. The actual sequences to predict will be drawn from fasta-file. This way, you can make predictions from a fasta but then easily convert the predictions to a bigwig.
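
The parameters above can be assembled into a configuration like the following sketch, which predicts from a fasta file and attaches coordinates. All file names and both lengths are hypothetical placeholders. The "verbosity" value is also an assumption, since the verbosity section is defined in the base schema rather than on this page.

```python
import json

# Hypothetical paths and model dimensions; substitute your own.
config = {
    "settings": {
        "output-h5": "predictions.h5",
        "batch-size": 64,
        "heads": 1,
        "architecture": {
            "model-file": "model.keras",
            "input-length": 3056,
            "output-length": 1000,
        },
    },
    # Sequences to predict come from the fasta file...
    "fasta-file": "sequences.fa",
    # ...while coordinates only supply positions and chromosome sizes.
    "coordinates": {
        "bed-file": "regions.bed",
        "genome": "genome.fa",
    },
    "num-threads": 1,
    # Assumed value; see the base schema for the allowed settings.
    "verbosity": "INFO",
}

# Serialize to the JSON text that would be saved as the configuration file.
configJson = json.dumps(config, indent=4)
```

Note that bed-file and genome appear only inside coordinates here, never at the top level alongside fasta-file.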

Output Specification

This program will produce an hdf5-format file containing the predicted values. It is organized as follows:

descriptions

An array of strings of shape (numRegions,). If you give a fasta file, these correspond to the description lines (i.e., the lines starting with >). If you give a bed file as input, each one is an empty string.

head_0, head_1, head_2, …

You get a subgroup for each output head of the model. The subgroups are named head_N, where N is 0, 1, 2, etc. Each head contains:

logcounts

A vector of shape (numRegions,) that gives the logcounts value for each region.

logits

The array of logit values for each track for each region. The shape is (numRegions, outputWidth, numTasks). Don’t forget that you must calculate the softmax over the whole set of logits for a region (all positions and all tasks together), not on each task’s logits independently. (Use bpreveal.utils.logitsToProfile() to do this.)
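
To illustrate what taking the softmax over the whole set of logits means, here is a minimal NumPy sketch for a single region. It is not the source of bpreveal.utils.logitsToProfile(), and the name logitsToProfileSketch is hypothetical. It assumes that logcounts is the natural logarithm of the total counts summed over every task in the head.

```python
import numpy as np


def logitsToProfileSketch(logits, logcounts):
    """Convert one region's logits and logcounts into a predicted profile.

    logits has shape (outputWidth, numTasks); logcounts is a scalar.
    Assumes logcounts is a natural log of the total counts over all tasks.
    """
    # Subtract the global maximum for numerical stability.
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    # Normalize over positions AND tasks together, not per task.
    probabilities = weights / np.sum(weights)
    return probabilities * np.exp(logcounts)


# Uniform logits for a two-task head over four positions, 80 total counts.
exampleLogits = np.zeros((4, 2))
exampleProfile = logitsToProfileSketch(exampleLogits, np.log(80.0))
```

Because the normalization runs over both axes, the profile for the whole head sums to the predicted counts, and the split of counts between tasks is itself part of the prediction.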

chrom_names

A list of strings that give you the meaning of each index in the coords_chrom dataset. This is particularly handy when you want to make a bigwig file, since you can extract a header from this data. Only populated if a bed file and genome were provided.

chrom_sizes

The size of each chromosome in the same order as chrom_names. Mostly used to create bigwig headers. Only populated if a bed file and genome were provided.
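
As a sketch of how chrom_names and chrom_sizes could be combined into a bigwig header, the following pairs each name with its size. The function name bigwigHeaderSketch is hypothetical. It decodes byte strings because h5py commonly returns stored strings as bytes, though that depends on how the datasets were written.

```python
def bigwigHeaderSketch(chromNames, chromSizes):
    """Pair chromosome names with sizes as a list of (name, size) tuples.

    Hypothetical helper; accepts names as str or bytes.
    """
    header = []
    for name, size in zip(chromNames, chromSizes):
        if isinstance(name, bytes):
            name = name.decode("utf-8")
        # Convert to a plain int in case size is a NumPy integer.
        header.append((name, int(size)))
    return header


# Stand-ins for the contents of the chrom_names and chrom_sizes datasets.
exampleHeader = bigwigHeaderSketch([b"chr1", "chr2"], [100, 200])
```

A list of (name, size) tuples is the form that bigwig-writing libraries such as pyBigWig accept for a header.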

coords_chrom

A list of integers, one for each region predicted, that gives the chromosome index (see chrom_names) for that region. Only populated if a bed file and genome were provided.

coords_start

The start base of each predicted region. Only populated if a bed file and genome were provided.

coords_stop

The end point of each predicted region. Only populated if a bed file and genome were provided.
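
The coordinate datasets can be combined into one record per region by looking up each coords_chrom index in chrom_names. The sketch below shows that lookup on plain Python lists standing in for the datasets; the function name regionsFromCoords is hypothetical.

```python
def regionsFromCoords(chromNames, coordsChrom, coordsStart, coordsStop):
    """Build (chromosomeName, start, stop) tuples, one per predicted region.

    Hypothetical helper; the arguments mirror the datasets of the same
    names in the output file.
    """
    regions = []
    for chromIndex, start, stop in zip(coordsChrom, coordsStart, coordsStop):
        # coords_chrom stores an index into chrom_names, not the name itself.
        name = chromNames[int(chromIndex)]
        if isinstance(name, bytes):
            name = name.decode("utf-8")
        regions.append((name, int(start), int(stop)))
    return regions


# Two regions on chr2 and one on chr1.
exampleRegions = regionsFromCoords(
    ["chr1", "chr2"], [1, 1, 0], [100, 500, 40], [200, 600, 140]
)
```

The regions come back in the same order as the rows of logits and logcounts, so index i in this list describes row i of each head.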

metadata

A group containing the configuration that was used when the program was run.

API

bpreveal.makePredictions.getReader(config)

Loads the reader appropriate for the configuration.

Parameters:

config (dict)

Return type:

BedReader | FastaReader

bpreveal.makePredictions.getWriter(config, numPredictions)

Creates a writer appropriate for the configuration.

Parameters:
  • config (dict)

  • numPredictions (int)

Return type:

H5Writer

bpreveal.makePredictions.main(config)

Run the predictions.

Parameters:

config (dict) – The configuration, taken directly from the JSON specification.

Return type:

None

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "makePredictions",
    "description": "Schema for makePredictions.py",
    "type": "object",
    "properties": {
        "settings": {
            "type": "object",
            "properties": {
                "output-h5": {"type": "string"},
                "batch-size": {"type": "integer"},
                "genome": {"type": "string"},
                "heads": {"type": "integer"},
                "architecture": {
                    "type": "object",
                    "properties": {
                        "input-length": {"type": "integer"},
                        "output-length": {"type": "integer"},
                        "model-file": {"type": "string"}
                    },
                    "required": ["input-length", "output-length", "model-file"]
                }
            },
            "required": ["output-h5", "batch-size", "heads", "architecture"]
        },
        "fasta-file": {"type": "string"},
        "bed-file": {"type": "string"},
        "num-threads": {"type": "integer", "minimum" : 1},
        "coordinates": {
            "type": "object",
            "properties": {
                "bed-file" : {"type": "string"},
                "genome" : {"type": "string"}
            },
            "required" : ["bed-file", "genome"]
        },
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["settings", "verbosity"],
    "oneOf": [
        {
            "required": ["fasta-file"],
            "not": {"required": ["bed-file"]}
        },
        {
            "required": ["bed-file"],
            "oneOf": [
                {"properties": {"settings": { "required": ["genome"]}}},
                {"required": ["genome"]}],
            "not": {"anyOf": [
                {"required": ["fasta-file"]},
                {"required": ["coordinates"]}]}
        }]

}
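
The oneOf clause at the end of the schema is the least obvious part, so here is a plain-Python sketch of the rule it appears to encode. This is one reading of the schema, not a replacement for real schema validation, and the function name inputModeSketch is hypothetical.

```python
def inputModeSketch(config):
    """Return "fasta" or "bed" for a config, or raise ValueError.

    Mirrors the schema's oneOf: exactly one of fasta-file and bed-file,
    with coordinates allowed only alongside fasta-file.
    """
    hasFasta = "fasta-file" in config
    hasBed = "bed-file" in config
    if hasFasta and hasBed:
        raise ValueError("give fasta-file or bed-file, not both")
    if hasFasta:
        return "fasta"
    if hasBed:
        if "coordinates" in config:
            raise ValueError("coordinates is only valid with fasta-file")
        # The schema accepts genome at the top level or inside settings,
        # but the inner oneOf means it must appear in exactly one place.
        genomeAtTop = "genome" in config
        genomeInSettings = "genome" in config.get("settings", {})
        if genomeAtTop == genomeInSettings:
            raise ValueError("give genome exactly once")
        return "bed"
    raise ValueError("one of fasta-file or bed-file is required")


fastaMode = inputModeSketch({"fasta-file": "sequences.fa"})
bedMode = inputModeSketch({"bed-file": "regions.bed", "genome": "genome.fa"})
```

A check like this reports which rule a configuration broke, which can be easier to read than a generic oneOf validation failure.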