interpretFlat
A script to generate importance scores in the style of the original BPNet.
BNF
<flat-interpretation-configuration> ::= {<bed-or-fasta>, "model-file" : <file-name>, "input-length" : <integer>, "output-length" : <integer>, "heads" : <integer>, "head-id" : <integer>, "profile-task-ids" : [<list-of-integer>], "profile-h5" : <file-name>, "counts-h5" : <file-name>, "num-shuffles" : <integer>, <kmer-size-section> <verbosity-section>}
<bed-or-fasta> ::= "genome" : <file-name>, "bed-file" : <file-name> | "fasta-file" : <file-name> | "fasta-file" : <file-name>, "coordinates": { "genome" : <file-name>, "bed-file" : <file-name>}
<kmer-size-section> ::= "kmer-size" : <integer>, | <empty>
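As an illustration of the grammar above, here is a minimal sketch of a configuration for the genome-and-bed case, built as a Python dict and serialized to JSON. Every file name and number is a hypothetical placeholder, and the `verbosity` value is an assumption, since the verbosity section is defined elsewhere in the BPReveal documentation.

```python
import json

# Hypothetical example values; substitute your own files and model geometry.
config = {
    "genome": "genome.fa",
    "bed-file": "regions.bed",
    "model-file": "model.keras",
    "input-length": 3090,
    "output-length": 1000,
    "heads": 2,
    "head-id": 0,
    "profile-task-ids": [0, 1],  # both tasks of a two-task head
    "profile-h5": "profile_shap.h5",
    "counts-h5": "counts_shap.h5",
    "num-shuffles": 20,
    "kmer-size": 1,  # optional; 1 matches the default shuffling behavior
    "verbosity": "INFO",  # assumed value, see the verbosity section
}

# Serialize to the JSON text that would be saved as the configuration file.
config_json = json.dumps(config, indent=4)
print(config_json)
```

A dict of this shape is also what `bpreveal.interpretFlat.main` expects as its `config` argument, according to the API section.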
Parameter Notes
- genome, bed-file
If you specify these two parameters in the configuration, then this program will read coordinates from the bed file, extract the sequences from the provided genome fasta, and run interpretations on those sequences. In this case, the output file will include chrom_names, chrom_sizes, coords_start, coords_end, and coords_chrom. The bed-file that you give should have regions matching the model’s output length. The regions will be automatically inflated in order to extract the input sequence from the genome. Somewhat confusingly, this means that the contribution scores will include contributions from bases that are not in your bed file. This is because the contribution scores explain how all of the input bases contribute to the output observed at the region in the bed file.
- fasta-file
If you specify a fasta file, then the sequences are taken directly from that file. In this case, the output hdf5 file will not include the chrom_names, chrom_sizes, coords_start, coords_end, or coords_chrom fields. Instead it will contain a descriptions dataset, which holds the description lines from the fasta. If you specify fasta-file, then the sequences in that fasta must be as long as the model’s input length, since the whole sequence that will be explained is needed. In this case, the contribution scores in the output will match one-to-one with the input bases.
- coordinates, genome, bed-file
If you give a fasta file and also include a coordinates section, then the sequences to interpret will be drawn from the given fasta-file, but coordinate information and chromosome sizes will be taken from the bed file and genome fasta. This means that you can use shapToBigwig even though the sequences don’t come from a real genome. In this case, the output hdf5 will contain all of the usual coordinate datasets in addition to the descriptions dataset that you usually get for interpreting from a fasta file.
- heads
This parameter is the total number of heads that the model has.
- head-id
This parameter gives which head you want importance values calculated for.
- profile-task-ids
Lists which of the profile predictions (i.e., tasks) from the specified head you want considered. Almost always, you should include all of the profiles. For a single-task head, this would be [0], and for a two-task head this would be [0,1].
- profile-h5, counts-h5
These are the names of the output files that will be saved to disk.
- num-shuffles
This is the number of background samples that should be used for calculating shap values. I recommend 20.
- kmer-size
(Optional) If provided, this changes how the shuffles work. By default (or if you specify kmer-size=1) all of the bases in the input are jumbled randomly. However, if you specify kmer-size=2, then the distribution of dimers will be preserved in the shuffled sequences. If you specify kmer-size=3, then trimers will be preserved, and so on.
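The region inflation described under genome, bed-file can be sketched as simple arithmetic. This is only an illustration: the helper name `inflate_region` is hypothetical, and it assumes the extra input bases are split evenly between the two sides of the bed region, which the text above does not state explicitly.

```python
def inflate_region(start, end, input_length, output_length):
    """Expand an output-length bed region to the model's input window.

    Assumes the extra bases are split evenly between the two sides.
    """
    if end - start != output_length:
        raise ValueError("bed regions should match the model's output length")
    padding = (input_length - output_length) // 2
    return start - padding, end + padding


# Hypothetical geometry: 1000 bp output window, 3090 bp input window.
input_start, input_end = inflate_region(10_000, 11_000, 3090, 1000)
print(input_start, input_end)  # 8955 12045
```

Under that assumption, the scores cover 1045 bases on either side of the bed region, which is why the output includes contributions from bases outside your bed file.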
Output Specification
Genome and bed file
If you gave a genome fasta and a bed file of regions, the output will have this structure:
- chrom_names
A list of strings giving the name of each chromosome. coords_chrom entries correspond to the order of chromosomes in this dataset.
- chrom_sizes
A list of integers giving the size of each chromosome. This is mostly here as a handy reference when you want to make a bigwig file.
- coords_start
The start point for each of the regions that were explained. This will have shape (num-regions,). Note that this starts at the beginning of the input to the model, so it will not match the coordinates in the bed file.
- coords_end
The end point for each of the regions that were explained. This will have shape (num-regions,). As with coords_start, this corresponds to the last base in the input to the model.
- coords_chrom
The chromosome number on which each region is found. These are integer indexes into chrom_names, and this dataset has shape (num-regions,).
- input_seqs
A one-hot encoded array representing the input sequences. It will have shape (num-regions, input-length, 4).
- hyp_scores
A table of the shap scores. It will have shape (num-regions, input-length, 4). If you want the actual contribution scores, not the hypothetical ones, multiply hyp_scores by input_seqs to zero out all purely hypothetical contribution scores.
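The multiplication of hyp_scores by input_seqs can be sketched with NumPy. The arrays below are random stand-ins with the documented shapes. In practice you would read the `input_seqs` and `hyp_scores` datasets from the output hdf5 file instead (for example with h5py). Summing over the base axis at the end is an optional convenience and is not something the tool does for you.

```python
import numpy as np

# Stand-in arrays with the documented shape (num-regions, input-length, 4).
rng = np.random.default_rng(0)
num_regions, input_length = 2, 5
base_indexes = rng.integers(0, 4, size=(num_regions, input_length))
input_seqs = np.eye(4, dtype=np.float32)[base_indexes]  # one-hot encoded
hyp_scores = rng.normal(size=(num_regions, input_length, 4)).astype(np.float32)

# Keep only the score of the base actually present at each position;
# the three purely hypothetical scores per position become zero.
actual_scores = hyp_scores * input_seqs

# Optionally collapse the base axis to get one value per input position.
per_base_scores = actual_scores.sum(axis=2)
print(per_base_scores.shape)  # (2, 5)
```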
Fasta file
- descriptions
A list of strings that are the description lines from the input fasta file (with the leading > removed). This list will have shape (num-regions,).
- input_seqs, hyp_scores
These have the same meaning as in the bed-and-genome based output files.
Additional Information
No fasta coordinate data
While you can use shapToNumpy on either format of
interpretFlat output, you cannot convert a fasta-based interpretation
h5 to a bigwig, since it doesn’t contain coordinate information. You can get
around this limitation by providing a bed file and a genome in a coordinates
section.
History
Before BPReveal 4.0.0, the coords_chrom dataset in the generated hdf5 file
contained strings. For consistency with every other tool in the BPReveal suite,
it was changed to contain an integer index into the chrom_names dataset.
API
- bpreveal.interpretFlat.main(config)
Run the interpretation.
- Parameters:
config (dict) – A JSON object matching the interpretFlat specification.
Schema
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "interpretFlat",
    "description": "Schema for interpretFlat.py",
    "type": "object",
    "properties": {
        "genome": {"type": "string"},
        "bed-file": {"type": "string"},
        "fasta-file": {"type": "string"},
        "coordinates": {
            "type": "object",
            "properties": {
                "genome": {"type": "string"},
                "bed-file": {"type": "string"}
            },
            "required": ["genome", "bed-file"]
        },
        "input-length": {"type": "integer"},
        "output-length": {"type": "integer"},
        "heads": {"type": "integer", "minimum": 1},
        "head-id": {"type": "integer"},
        "profile-task-ids": {
            "type": "array",
            "items": {"type": "integer"}
        },
        "profile-h5": {"type": "string"},
        "counts-h5": {"type": "string"},
        "num-shuffles": {"type": "integer", "minimum": 1},
        "kmer-size": {"type": "integer", "minimum": 1},
        "verbosity": {"$ref": "/schema/base#/definitions/verbosity"}
    },
    "required": ["input-length", "output-length", "heads", "head-id",
        "profile-task-ids", "profile-h5", "counts-h5", "num-shuffles", "verbosity"],
    "oneOf": [
        {
            "required": ["genome", "bed-file"],
            "not": {
                "anyOf": [
                    {"required": ["coordinates"]},
                    {"required": ["fasta-file"]}
                ]
            }
        },
        {
            "not": {
                "anyOf": [
                    {"required": ["genome"]},
                    {"required": ["bed-file"]}
                ]
            },
            "required": ["fasta-file"]
        }
    ]
}
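The "oneOf" block at the end of the schema encodes which input keys may appear together. The pure-Python sketch below mirrors that logic so you can check a configuration by eye. The function name `input_mode` and its return labels are hypothetical and not part of BPReveal, and it checks only the presence of the top-level keys, not the rest of the schema.

```python
def input_mode(config):
    """Classify a configuration by the schema's "oneOf" input rules.

    Returns "bed", "fasta", or "fasta-with-coordinates", and raises
    ValueError when the keys match neither branch.
    """
    has_genome = "genome" in config
    has_bed = "bed-file" in config
    has_fasta = "fasta-file" in config
    has_coords = "coordinates" in config

    # Branch 1: genome and bed-file, with neither coordinates nor fasta-file.
    if has_genome and has_bed and not has_fasta and not has_coords:
        return "bed"
    # Branch 2: fasta-file, with no top-level genome or bed-file.
    if has_fasta and not has_genome and not has_bed:
        return "fasta-with-coordinates" if has_coords else "fasta"
    raise ValueError("configuration matches neither input branch")


# Hypothetical file names, for illustration only.
print(input_mode({"genome": "genome.fa", "bed-file": "regions.bed"}))
print(input_mode({
    "fasta-file": "sequences.fa",
    "coordinates": {"genome": "genome.fa", "bed-file": "regions.bed"},
}))
```

Note that when a fasta file is combined with coordinate information, the genome and bed-file keys must sit inside the coordinates object. Placing them at the top level alongside fasta-file fails both branches.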