interpretPisa

A script to generate PISA scores.

PISA is described here

BNF

<pisa-fasta-configuration> ::=
    {"model-file" : <file-name>,
     <bed-or-fasta>,
     "num-shuffles" : <integer>,
     "head-id" : <integer>,
     "task-id" : <integer>,
     "output-h5" : <file-name>,
     "output-length" : <integer>,
     "input-length" : <integer>,
     <correct-receptive-field-section>
     <kmer-size-section>
     <num-threads-section>
     <verbosity-section>}

<bed-or-fasta> ::=
    "genome" : <file-name>,
    "bed-file": <file-name>
  | "fasta-file" : <file-name>
  | "sequence-fasta" : <file-name> (DEPRECATED)

<kmer-size-section> ::=
    "kmer-size" : <integer>,
  | <empty>

<correct-receptive-field-section> ::=
    "correct-receptive-field" : <boolean>,
  | <empty>

Parameter Notes

model-file: The name of the saved Keras model to interpret.
head-id: The head number that you’d like to interpret.
task-id: The task within that head that you’d like to interpret. Note that while interpretFlat can combine multiple tasks, PISA can only consider one task at a time.
output-h5: The name of the hdf5 file where you’d like the output saved.
output-length, input-length: The output and input length of your model.
genome, bed-file: If you provide a bed file, this file represents the individual bases that should be shapped. There is no restriction on the number of regions, nor on their length. If you are interested in the effects of a particular motif, then you’d put the region surrounding that motif in the bed file, making it as large as you want to see the interactions you’re interested in.
fasta-file: You can also supply the sequences directly with a fasta file. Since PISA calculates shap scores for a single base, this tool always calculates the PISA scores for the leftmost base in the output window. Suppose the input length is 3090 bp, and the output is 1000 bp. In this case, the receptive field is 2091 bp, and there are 1045 bp of overhang on each end of the input sequence. So, for each input sequence, this program will assign shap scores to the 1046th base (one-indexed) of the input sequence.
num-shuffles: This is the number of background samples that should be used for calculating shap values. I recommend 20.
kmer-size: (Optional) If provided, this changes how the shuffles work. By default (or if you specify kmer-size = 1) all of the bases in the input are jumbled randomly. However, if you specify kmer-size=2, then the distribution of dimers will be preserved in the shuffled sequences. If you specify kmer-size=3, then trimers will be preserved, and so on.
num-threads: (Optional) If provided, use multiple batcher threads in parallel, on the same GPU. Shap is relatively inefficient on the GPU, and by using two or three threads, you can get better throughput. If you run into memory issues, use one thread.
correct-receptive-field: (Optional) If set to true, then the output array will have the correct width, which is input-length - output-length + 1. By default, use the (incorrect) value of input-length - output-length, for compatibility with old scripts. In version 5.0.0, default will switch from false to true.

Output Specification

It produces an hdf5 format which is organized as follows:

input_predictions: A (numSamples,) array of the logit value of the target base when that sequence is run through the network.
shuffle_predictions: A (numSamples, numShuffles) array of the logits of the target base in the shuffled reference sequences.
sequence: A one-hot encoded array representing the sequence under each PISA value. The shape is (num regions * receptive-field * NUM_BASES). Note that this is receptive field, not input length, since each base being shapped will only be affected by bases in its receptive field, and there’s no reason to store the noise.
shap: A table of the shap scores. The shape is the same as the sequence table, and each position in the shap table represents the corresponding base in the sequence table. These values are contribution scores to the difference-from-reference of the logit at this base.
metadata: A group containing the configuration used to perform the calculation.

If you specified a bed file with input regions, it will also have these datasets:

chrom_names: A list of strings giving the name of each chromosome. This is used to figure out which chromosome each number in coords_chrom corresponds to.
chrom_sizes: A list of integers giving the size of each chromosome. This is mostly here as a handy reference when you want to make a bigwig file.
coords_base: The center point for each of the regions in the table of PISA values.
coords_chrom: The chromosome on which each PISA vector is found. This is a list of integers. The width of the integer data type may vary from run to run, and is calculated based on the number of chromosomes in the genome file.

If, however, you gave a fasta file of input sequences, it will instead have the following dataset:

descriptions: A string taken from the fasta file. These are the comment (>) lines, and are stored without the leading >. This will have shape (numSamples,)

HISTORY

Before BPReveal 4.0.0, this was two programs: interpretPisaBed and interpretPisaFasta. They shared almost all of the same code, so they were merged into interpretPisa. The old names are still present in the bin/ directory, where they symlink to the same python file in src. In BPReveal 6.0.0, the old symlinks will be removed.

API

bpreveal.interpretPisa.main(config)

Run the calculation.

Parameters:: config (dict) – A JSON object matching the interpretPisa specification.
Return type:: None

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "interpretPisa",
    "description": "Schema for interpretPisa.py",
    "type": "object",
    "properties": {
        "genome": {"type": "string"},
        "bed-file": {"type": "string"},
        "fasta-file": {"type": "string"},
        "sequence-fasta": {"type": "string"},
        "input-length": {"type": "integer"},
        "output-length": {"type": "integer"},
        "head-id": {"type": "integer"},
        "task-id": {"type": "integer"},
        "output-h5": {"type": "string"},
        "correct-receptive-field": {"type": "boolean"},
        "num-shuffles": {"type": "integer", "minimum" : 1},
        "kmer-size": {"type": "integer", "minimum" : 1},
        "num-threads": {"type": "integer", "minimum" : 1},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["input-length", "output-length", "head-id", "task-id",
        "output-h5", "num-shuffles", "verbosity"],
    "oneOf": [
        {
            "required": ["genome", "bed-file"],
            "not": {"anyOf": [{"required": ["fasta-file"]},
                              {"required": ["sequence-fasta"]}]}
        },
        {
            "oneOf": [{"required": ["fasta-file"]},
                      {"required": ["sequence-fasta"]}],
            "not": {"required": ["genome", "bed-file"]}
        }
    ]
}