interpretFlat

A script to generate importance scores in the style of the original BPNet.

BNF

<flat-interpretation-configuration> ::=
    {<bed-or-fasta>,
     "model-file" : <file-name>,
     "input-length" : <integer>,
     "output-length" : <integer>,
     "heads" : <integer>,
     "head-id" : <integer>,
     "profile-task-ids" : [<list-of-integer>],
     "profile-h5" : <file-name>,
     "counts-h5" : <file-name>,
     "num-shuffles" : <integer>,
     <kmer-size-section>
     <verbosity-section>}

<bed-or-fasta> ::=
    "genome" : <file-name>,
    "bed-file" : <file-name>
 |  "fasta-file" : <file-name>
 |  "fasta-file" : <file-name>,
    "coordinates": {
        "genome" : <file-name>,
        "bed-file" : <file-name>}

<kmer-size-section> ::=
    "kmer-size" : <integer>,
  | <empty>
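
As an illustration of the grammar above, here is a minimal sketch that builds a genome-and-bed configuration as a Python dictionary and serializes it to JSON. All file names and numeric values are hypothetical placeholders. The verbosity value is also an assumption, since the verbosity section is defined elsewhere in the BPReveal documentation.

```python
import json

# Hypothetical file names and sizes; substitute the values for your own model.
config = {
    "genome": "genome.fa",
    "bed-file": "regions.bed",
    "model-file": "model.keras",
    "input-length": 3056,
    "output-length": 1000,
    "heads": 1,
    "head-id": 0,
    "profile-task-ids": [0, 1],
    "profile-h5": "profile_scores.h5",
    "counts-h5": "counts_scores.h5",
    "num-shuffles": 20,
    "kmer-size": 1,
    # Assumed value; check the verbosity section of the base documentation.
    "verbosity": "INFO",
}

# Serialize to the JSON text that would be saved as the configuration file.
config_text = json.dumps(config, indent=4)
print(config_text)
```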

Parameter Notes

genome, bed-file: If you specify these two parameters in the configuration, then this program will read coordinates from the bed file, extract the corresponding sequences from the genome fasta, and run interpretations on those sequences. In this case, the output file will include chrom_names, chrom_sizes, coords_start, coords_end, and coords_chrom. The regions in the bed-file should match the model’s output length. They will be automatically inflated to the model’s input length in order to extract the input sequence from the genome. Somewhat confusingly, this means that the contribution scores will include contributions from bases that are not in your bed file. This is because the contribution scores explain how all of the input bases contribute to the output observed at the region in the bed file.
fasta-file: If you specify a fasta file, then the sequences are taken directly from that file. In this case, the output hdf5 file will not include the chrom_names, chrom_sizes, coords_start, coords_end, or coords_chrom fields. Instead it will contain a descriptions dataset, which holds the description lines from the fasta. If you specify fasta-file, then the sequences in that fasta must be as long as the model’s input length, since the whole sequence to be explained is needed. In this case, the contribution scores in the output will match one-to-one with the input bases.
coordinates, genome, bed-file: If you give a fasta file and also include a coordinates section, then the sequences to interpret will be drawn from the given fasta-file, but coordinate information and chromosome sizes will be taken from the bed file and genome fasta. This means that you can use shapToBigwig even though the sequences don’t come from a real genome. In this case, the output hdf5 will contain all of the usual coordinate datasets in addition to the descriptions dataset that you usually get when interpreting from a fasta file.
heads: This parameter is the total number of heads that the model has.
head-id: This parameter gives which head you want importance values calculated for.
profile-task-ids: Lists which of the profile predictions (i.e., tasks) from the specified head you want considered. Almost always, you should include all of the profiles. For a single-task head, this would be [0], and for a two-task head this would be [0,1].
profile-h5, counts-h5: These are the names of the output files that will be saved to disk.
num-shuffles: This is the number of background samples that should be used for calculating shap values. I recommend 20.
kmer-size: (Optional) If provided, this changes how the shuffles work. By default (or if you set kmer-size to 1), all of the bases in the input are jumbled randomly. However, if you set kmer-size to 2, then the distribution of dimers will be preserved in the shuffled sequences. If you set kmer-size to 3, then trimers will be preserved, and so on.
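
The region inflation described under genome, bed-file can be sketched as follows. This assumes the extra bases are split evenly between the two sides of the bed region, which is a natural reading of the notes above but is not spelled out in them. The function name and the example sizes are hypothetical.

```python
def inflate_region(start, end, input_length, output_length):
    """Expand an output-length region to the input window (assumed symmetric)."""
    if end - start != output_length:
        raise ValueError("bed region must match the model's output length")
    # Assumption: the same number of bases is added on each side.
    padding = (input_length - output_length) // 2
    return start - padding, end + padding


# A 1000 bp bed region for a model with a 3056 bp input.
input_start, input_end = inflate_region(5000, 6000, 3056, 1000)
print(input_start, input_end)  # 3972 7028
```

The inflated window is what the contribution scores cover, which is why they extend beyond the region listed in the bed file.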

Output Specification

Genome and bed file

If you gave a genome fasta and a bed file of regions, the output will have this structure:

chrom_names: A list of strings giving the name of each chromosome. coords_chrom entries correspond to the order of chromosomes in this dataset.
chrom_sizes: A list of integers giving the size of each chromosome. This is mostly here as a handy reference when you want to make a bigwig file.
coords_start: The start point for each of the regions that were explained. This will have shape (num-regions,). Note that this starts at the beginning of the input to the model, so it will not match the coordinates in the bed file.
coords_end: The end point for each of the regions that were explained. This will have shape (num-regions,). As with coords_start, this corresponds to the last base in the input to the model.
coords_chrom: The chromosome on which each region is found, stored as an integer index into chrom_names. This dataset has shape (num-regions,).
input_seqs: A one-hot encoded array representing the input sequences. It will have shape (num-regions x input-length x 4).
hyp_scores: A table of the shap scores. It will have shape (num-regions x input-length x 4). If you want the actual contribution scores, not the hypothetical ones, multiply hyp_scores by input_seqs to zero out all purely hypothetical contribution scores.
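
To make the last two points concrete, here is a small sketch of recovering actual contribution scores from hypothetical ones, and of translating coords_chrom back into chromosome names. The arrays are made-up stand-ins for datasets you would read from the output file (for example with h5py); only the dataset names come from the specification above.

```python
import numpy as np

# Toy stand-ins for datasets from the output file: 2 regions, 3 bases each.
chrom_names = ["chr1", "chr2"]
coords_chrom = np.array([1, 0])
# One-hot sequences; which column means which base does not matter here.
input_seqs = np.array([
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]],
    [[0, 0, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0]],
])
hyp_scores = np.arange(24, dtype=float).reshape(2, 3, 4)

# Keep only the score of the base actually present at each position.
actual_scores = hyp_scores * input_seqs
# Collapse the base axis to get one value per position.
per_base = actual_scores.sum(axis=2)

# coords_chrom holds integer indexes into chrom_names.
region_chroms = [chrom_names[i] for i in coords_chrom]

print(per_base)
print(region_chroms)
```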

Fasta file

descriptions: A list of strings that are the description lines from the input fasta file (with the leading > removed). This list will have shape (num-regions,).
input_seqs, hyp_scores: These have the same meaning as in the bed-and-genome based output files.

Additional Information

No fasta coordinate data

While you can use shapToNumpy on either format of interpretFlat output, you cannot convert a fasta-based interpretation h5 to a bigwig, since it doesn’t contain coordinate information. You can get around this limitation by providing a bed file and a genome in a coordinates section.
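
As a sketch of that workaround, the helper below adds a coordinates section to a fasta-based configuration held as a Python dictionary. The helper name and all file names are hypothetical, and the remaining required keys are left out of the example for brevity.

```python
def add_coordinates(fasta_config, genome, bed_file):
    """Return a copy of a fasta-based config with a coordinates section."""
    updated = dict(fasta_config)
    updated["coordinates"] = {"genome": genome, "bed-file": bed_file}
    return updated


# Partial config; the other required keys are omitted here for brevity.
fasta_config = {"fasta-file": "synthetic_sequences.fa", "num-shuffles": 20}
with_coords = add_coordinates(fasta_config, "genome.fa", "regions.bed")
print(with_coords["coordinates"])
```

Note that genome and bed-file go inside the coordinates section here, not at the top level of the configuration.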

History

Before BPReveal 4.0.0, the coords_chrom dataset in the generated hdf5 file contained strings. For consistency with every other tool in the BPReveal suite, it was changed to contain an integer index into the chrom_names dataset.

API

bpreveal.interpretFlat.main(config)

Run the interpretation.

Parameters: config (dict) – A JSON object matching the interpretFlat specification.
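
A minimal sketch of calling this entry point from Python, assuming BPReveal is installed and the configuration has been saved as a JSON file. The wrapper function and the file name in the usage note are hypothetical; only bpreveal.interpretFlat.main comes from the documentation above.

```python
import json


def run_interpret_flat(config_path):
    """Load a JSON configuration and pass it to interpretFlat's main()."""
    # Imported here so the wrapper can be defined without BPReveal installed.
    from bpreveal import interpretFlat

    with open(config_path, "r") as fp:
        config = json.load(fp)
    interpretFlat.main(config)
```

With a configuration on disk, this would be invoked as run_interpret_flat("interpret_config.json").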

Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "interpretFlat",
    "description": "Schema for interpretFlat.py",
    "type": "object",
    "properties": {
        "genome": {"type": "string"},
        "bed-file": {"type": "string"},
        "fasta-file": {"type": "string"},
        "coordinates" : {
            "type" : "object",
            "properties" : {
                "genome": {"type": "string"},
                "bed-file": {"type": "string"}
            },
            "required" : ["genome", "bed-file"]
        },
        "input-length": {"type": "integer"},
        "output-length": {"type": "integer"},
        "heads": {"type": "integer", "minimum": 1},
        "head-id": {"type": "integer"},
        "profile-task-ids": {
            "type": "array",
            "items": {
                "type": "integer"}
        },
        "profile-h5": {"type": "string"},
        "counts-h5": {"type": "string"},
        "num-shuffles": {"type": "integer", "minimum" : 1},
        "kmer-size": {"type": "integer", "minimum" : 1},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["input-length", "output-length", "heads", "head-id",
        "profile-task-ids", "profile-h5", "counts-h5", "num-shuffles", "verbosity"],
    "oneOf": [
        {
            "required": ["genome", "bed-file"],
            "not": {
                "anyOf" : [
                    {"required": ["coordinates"]},
                    {"required": ["fasta-file"]}
                ]
            }
        },
        {
            "not": {
                "anyOf" : [
                    {"required": ["genome"]},
                    {"required": ["bed-file"]}
                ]
            },
            "required": ["fasta-file"]
        }
    ]
}
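
The oneOf clause above can be read as a choice between two input modes. The sketch below mirrors only that clause in plain Python; it does not check types or the required list, for which a JSON Schema validator would be the appropriate tool. The function name and the returned labels are hypothetical.

```python
def input_mode(config):
    """Classify a config by its sequence source, mirroring the oneOf rules."""
    has_genome_and_bed = "genome" in config and "bed-file" in config
    has_genome_or_bed = "genome" in config or "bed-file" in config
    has_fasta = "fasta-file" in config
    has_coordinates = "coordinates" in config

    # First branch: genome and bed-file, with no fasta-file or coordinates.
    if has_genome_and_bed and not has_fasta and not has_coordinates:
        return "genome-and-bed"
    # Second branch: fasta-file, with no top-level genome or bed-file.
    if has_fasta and not has_genome_or_bed:
        return "fasta-with-coordinates" if has_coordinates else "fasta"
    raise ValueError("config must give either genome + bed-file or fasta-file")


print(input_mode({"genome": "genome.fa", "bed-file": "regions.bed"}))
print(input_mode({"fasta-file": "seqs.fa"}))
```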