gaOptimize

Useful tools for creating sequences with a desired property.

bpreveal.gaOptimize.CorruptorLetter

Any letter that is a valid corruptor.

alias of Union[Literal['A'], Literal['C'], Literal['G'], Literal['T'], Literal['d'], Literal['Ǎ'], Literal['Č'], Literal['Ǧ'], Literal['Ť']]

bpreveal.gaOptimize.Corruptor

A corruptor gives a coordinate and a change to make.

The change is represented by a single letter. The letters ACGT indicate a SNP at that locus, the letter d indicates that that base should be deleted, and the letters ǍČǦŤ indicate that the corresponding base should be inserted immediately after the locus.

Examples:

(85, "A"),
(1525, "Ť"),
(1603, "d")

alias of tuple[int, Union[Literal['A'], Literal['C'], Literal['G'], Literal['T'], Literal['d'], Literal['Ǎ'], Literal['Č'], Literal['Ǧ'], Literal['Ť']]]

bpreveal.gaOptimize.Profile

A profile is the result from running a model.

It is a list of (logits, logcounts) tuples.

Type:: list[tuple[LOGIT_AR_T, LOGCOUNT_T]]

alias of list[tuple[ndarray[Any, dtype[float32]], float32]]

bpreveal.gaOptimize.CandidateCorruptor

A candidate corruptor gives all possible mutations at a particular locus.

Examples:

(85, "ACTd"),
(1525, "AGTdǍČǦŤ"),
(1603, "G")

Note that it is possible for a CandidateCorruptor to only allow for one possible mutation at a locus.

alias of tuple[int, str]
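A candidate corruptor can be read as shorthand for several single-letter corruptors at one position. A minimal sketch of that expansion follows; expandCandidate is a hypothetical helper written for illustration, not a function from bpreveal.

```python
def expandCandidate(candidate):
    """Split one (position, letters) candidate into single-letter corruptors."""
    position, letters = candidate
    return [(position, letter) for letter in letters]


# One locus that allows three SNPs and a deletion.
print(expandCandidate((85, "ACTd")))
# A locus with exactly one allowed mutation expands to a single corruptor.
print(expandCandidate((1603, "G")))
```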

bpreveal.gaOptimize.ANNOTATION_T: TypeAlias = tuple[tuple[int, int], str, str] | tuple[tuple[int, int], str, str, float, float]

The shape for an annotation object to pass to plotTraces().

It contains:

a pair of integers, giving the start and stop points of that annotation,
a string giving its label,
a string giving its color,
(Optional) a float giving the bottom of its annotation box, and
(Optional) a float giving the top of its annotation box.

bpreveal.gaOptimize.IN_A: CorruptorLetter = 'Ǎ': The letter Ǎ represents inserting an A

bpreveal.gaOptimize.IN_C: CorruptorLetter = 'Č': The letter Č represents inserting a C

bpreveal.gaOptimize.IN_G: CorruptorLetter = 'Ǧ': The letter Ǧ represents inserting a G

bpreveal.gaOptimize.IN_T: CorruptorLetter = 'Ť': The letter Ť represents inserting a T

bpreveal.gaOptimize.IN_L = 'ǍČǦŤ': The four insertion letters

bpreveal.gaOptimize.IN_D = {'A': 'Ǎ', 'C': 'Č', 'G': 'Ǧ', 'T': 'Ť'}: A dict mapping a regular letter ACGT to an insertion code ǍČǦŤ

bpreveal.gaOptimize.CORRUPTOR_TO_IDX: dict[CorruptorLetter, int] = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'd': 8, 'Č': 5, 'Ť': 7, 'Ǎ': 4, 'Ǧ': 6}: Use these to map corruptors to integers.

bpreveal.gaOptimize.IDX_TO_CORRUPTOR = 'ACGTǍČǦŤd': Given an integer, which corruptor does it represent? This is the inverse of CORRUPTOR_TO_IDX.
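To see how these constants fit together, here is a small self-contained check. The values are copied from the entries above rather than imported, so this is only a sketch of the relationships, not the module's own code.

```python
# Values copied from the documentation above.
IN_D = {"A": "Ǎ", "C": "Č", "G": "Ǧ", "T": "Ť"}
CORRUPTOR_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3,
                    "Ǎ": 4, "Č": 5, "Ǧ": 6, "Ť": 7, "d": 8}
IDX_TO_CORRUPTOR = "ACGTǍČǦŤd"

# Indexing the string with an integer code recovers the letter,
# which is what "inverse of CORRUPTOR_TO_IDX" means here.
isInverse = all(IDX_TO_CORRUPTOR[idx] == letter
                for letter, idx in CORRUPTOR_TO_IDX.items())

# Each insertion code sits four slots after its plain base.
insertionOffsets = {base: CORRUPTOR_TO_IDX[IN_D[base]] - CORRUPTOR_TO_IDX[base]
                    for base in "ACGT"}

print(isInverse)
print(insertionOffsets)
```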

bpreveal.gaOptimize.corruptorColors

BPreveal’s coloring for bases. A is green, C is blue, G is yellow, and T is red.

These colors are drawn from the Wong palette, but with the green lightened a bit.

Type:: dict[CorruptorLetter, tuple[float, float, float]]

bpreveal.gaOptimize.corruptorsToArray(corruptorList)

Convert a list of Corruptors to an array of numbers.

Parameters:: corruptorList (list[Corruptor]) – A list of corruptors to serialize.
Returns:: A list of tuples of numbers representing the same information.
Return type:: list[tuple[int, int]]

Given a list of corruptor tuples, like [(1354, 'C'), (1514, 'Ť'), (1693, 'd')], convert that to an integer list that can be easily saved. This list will have the format [ [1354, 1], [1514, 7], [1693, 8] ] where the second number indicates the type of corruptor, as given by CORRUPTOR_TO_IDX. Returns a list of lists, not a numpy array! If you want an array, use np.array(corruptorsToArray(myCorruptors))

bpreveal.gaOptimize.arrayToCorruptors(corruptorArray)

Turn an array of numbers into a Corruptor list.

Parameters:: corruptorArray (list[tuple[int, int]]) – A list of tuples of ints.
Returns:: Corruptors corresponding to the array.
Return type:: list[Corruptor]

Takes an array of numerical corruptors in the form [ [1354, 1], [1514, 7], [1693, 8] ] and generates the canonical list with letters, as used in the rest of the code. This function will work on a list or a numpy array of shape (N x 2). Returns a list of tuples, one for each position that was corrupted, like: [(1354, 'C'), (1514, 'Ť'), (1693, 'd')]
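The round trip between the letter form and the integer form can be sketched in a few lines. toArraySketch and fromArraySketch are hypothetical stand-ins that follow the behavior described above; they are not the bpreveal implementations, and the two lookup tables are copied from this page.

```python
CORRUPTOR_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3,
                    "Ǎ": 4, "Č": 5, "Ǧ": 6, "Ť": 7, "d": 8}
IDX_TO_CORRUPTOR = "ACGTǍČǦŤd"


def toArraySketch(corruptorList):
    """Encode (position, letter) tuples as [position, code] pairs."""
    return [[pos, CORRUPTOR_TO_IDX[letter]] for pos, letter in corruptorList]


def fromArraySketch(corruptorArray):
    """Decode [position, code] pairs back into (position, letter) tuples."""
    # int() lets this accept plain lists as well as rows of an integer array.
    return [(int(pos), IDX_TO_CORRUPTOR[int(idx)]) for pos, idx in corruptorArray]


corruptors = [(1354, "C"), (1514, "Ť"), (1693, "d")]
encoded = toArraySketch(corruptors)
decoded = fromArraySketch(encoded)
print(encoded)  # [[1354, 1], [1514, 7], [1693, 8]]
print(decoded == corruptors)  # True
```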

bpreveal.gaOptimize.stringToCorruptorList(corruptorStr)

Parse a string and turn it into a Corruptor list.

Parameters:: corruptorStr (str) – The string to parse.
Returns:: A list of Corruptors.
Return type:: list[Corruptor]

Takes a string representing a list of corruptors and generates the actual list as a python object. For example, if the string is "[(1354, 'C'), (1514, 'Ť'), (1693, 'd')]" this function will parse it to the list of tuples [(1354, 'C'), (1514, 'Ť'), (1693, 'd')]

(The inverse of this function is simply str on a corruptor list.)
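One way to do this kind of parsing safely is with ast.literal_eval from the standard library. The sketch below assumes that approach; the documentation does not say how stringToCorruptorList is implemented, and parseCorruptorString is a hypothetical name.

```python
import ast


def parseCorruptorString(corruptorStr):
    """Parse the str() form of a corruptor list back into tuples."""
    # literal_eval only accepts Python literals, so it will not run arbitrary code.
    parsed = ast.literal_eval(corruptorStr)
    return [(int(pos), str(letter)) for pos, letter in parsed]


corruptors = [(1354, "C"), (1514, "Ť"), (1693, "d")]
asText = str(corruptors)  # the inverse direction is just str()
print(asText)
print(parseCorruptorString(asText) == corruptors)  # True
```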

class bpreveal.gaOptimize.Organism(corruptors)

Represents the set of corruptors that are to be applied to the input sequence.

Parameters:: corruptors (list[Corruptor]) – A list of corruptors that this organism represents.

getSequence(initialSequence, inputLength)

Apply this organism’s corruptors to initialSequence, a string.

Parameters:

initialSequence (str) – A string representing the wild-type sequence that this organism will apply its corruptors to.
inputLength (int) – The length of the returned sequence.

Returns:

A string of length inputLength.

Return type:

str

Note that initialSequence will need to be longer than the inputLength to your model, since deletion corruptors can shorten the sequence. The length of initialSequence must be at least inputLength + maxDeletions, and unless you’re filtering somehow, the maximum number of deletions is the number of corruptors for this organism.
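To make the length requirement concrete, here is a minimal sketch of applying corruptors to a string. applyCorruptors is a hypothetical function, and trimming the result from the right to reach inputLength is an assumption of this sketch, not something the documentation states.

```python
# Map each insertion code to the base it inserts.
INSERTED_BASE = dict(zip("ǍČǦŤ", "ACGT"))


def applyCorruptors(initialSequence, corruptors, inputLength):
    """Apply SNPs, deletions and insertions, then trim to inputLength."""
    byPosition = {}
    for pos, letter in corruptors:
        byPosition.setdefault(pos, []).append(letter)
    pieces = []
    for pos, base in enumerate(initialSequence):
        emitted, inserted = base, ""
        for letter in byPosition.get(pos, []):
            if letter == "d":
                emitted = ""                       # delete this base
            elif letter in INSERTED_BASE:
                inserted += INSERTED_BASE[letter]  # insert after this base
            else:
                emitted = letter                   # SNP
        pieces.append(emitted + inserted)
    mutated = "".join(pieces)
    if len(mutated) < inputLength:
        raise ValueError("initialSequence is too short for these deletions")
    # Assumption: surplus bases are dropped from the right-hand end.
    return mutated[:inputLength]


# SNP at 1, deletion at 2, insertion of A after 3.
print(applyCorruptors("AGGCATTT", [(1, "T"), (2, "d"), (3, "Ǎ")], 6))  # ATCAAT
```

The two spare bases at the end of the eight-base initial sequence are what allow a six-base result even when a base is deleted.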

setScore(scoreFn)

Apply the score function to this organism’s profile.

Parameters:: scoreFn (Callable[[Profile, list[Corruptor]], float]) – A function that takes this organism’s profile and its corruptor list and returns a float.
Return type:: None

This function can only be called after the organism’s profile has been set!
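A score function only needs to accept a profile and a corruptor list and return a float. The two functions below are hypothetical examples of that shape. The plain Python lists stand in for the float32 logit arrays, and the deletion penalty is an invented illustration, not a recommendation.

```python
def totalLogcounts(profile, corruptors):
    """Score by the summed logcounts over all (logits, logcounts) outputs."""
    return float(sum(logcounts for _logits, logcounts in profile))


def penalizedLogcounts(profile, corruptors, penaltyPerDeletion=0.1):
    """Same as above, minus a small hypothetical cost per deletion."""
    numDeletions = sum(1 for _pos, letter in corruptors if letter == "d")
    return totalLogcounts(profile, corruptors) - penaltyPerDeletion * numDeletions


# Stand-in profile with two outputs; real logits would be numpy arrays.
exampleProfile = [([0.1, 0.2], 3.0), ([0.0, 0.5], 2.0)]
exampleCorruptors = [(85, "A"), (1603, "d")]
print(totalLogcounts(exampleProfile, exampleCorruptors))
print(penalizedLogcounts(exampleProfile, exampleCorruptors))
```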

cmp(other)

A general comparator between two organisms based on their corruptors.

Parameters:: other (Organism) – The organism to compare against.
Returns:: 1 if this organism is after (alphabetically) the other one, -1 if this organism comes first, or 0 if the two organisms have the same corruptors.
Return type:: int

Note that this does not compare profile or score!

mutated(allowedCorruptors, checkCorruptors)

Mutate this organism’s corruptors.

Parameters:

allowedCorruptors (list[CandidateCorruptor]) – All corruptors that this organism has access to
checkCorruptors (Callable[[list[Corruptor]], bool]) – A function that accepts a list of corruptors and returns True if they are a valid combination and False otherwise.

Returns:

A newly-allocated Organism with a new set of corruptors.

Return type:

Organism

This is something you may want to override in a subclass. This returns a NEW organism that has one changed corruptor. It chooses one of its current corruptors at random, and then changes it to a random selection from allowedCorruptors. (allowedCorruptors has the same structure as in a Population). It makes sure that the new corruptor doesn’t occur on the same base as an existing one (unless it’s an insertion), and also calls checkCorruptors on the resulting corruptor candidates. It will make 100 attempts to generate a new organism, and then it will error out.

mixed(other, checkCorruptors)

Make a new organism by combining this one with another.

Parameters:

other (Organism) – Another organism that this should be mixed with.
checkCorruptors (Callable[[list[Corruptor]], bool]) – A function that returns True if a corruptor list is valid, and False otherwise.

Returns:

A newly-allocated Organism that is a blend of both parents.

Return type:

Organism

This is also something you may wish to override in a subclass. Currently, it pools the corruptors from self and other, and then randomly selects numCorruptors of them. If that passes checkCorruptors, then it returns a NEW organism.
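If you do override mixed, the pooling strategy described above could be sketched as follows. This is not the bpreveal code: mixCorruptors is a hypothetical function, and de-duplicating the pool, sorting the child, and the retry limit are all assumptions of this sketch.

```python
import random


def mixCorruptors(mine, theirs, numCorruptors, checkCorruptors, rng, maxAttempts=100):
    """Pool two parents' corruptors and draw a valid child list."""
    # Assumption: a corruptor shared by both parents appears once in the pool.
    pool = sorted(set(mine) | set(theirs))
    if len(pool) < numCorruptors:
        raise ValueError("not enough distinct corruptors to draw from")
    for _ in range(maxAttempts):
        # Sort so the child is ordered by position, as checkCorruptors expects.
        child = sorted(rng.sample(pool, numCorruptors))
        if checkCorruptors(child):
            return child
    raise RuntimeError("could not build a valid child")


parentA = [(10, "A"), (50, "d"), (90, "Ť")]
parentB = [(20, "C"), (50, "d"), (70, "G")]
child = mixCorruptors(parentA, parentB, 3, lambda x: True, random.Random(0))
print(child)
```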

class bpreveal.gaOptimize.Population(initialSequence, inputLength, populationSize, numCorruptors, allowedCorruptors, checkCorruptors, fitnessFn, numSurvivingParents, predictor)

The main class for running the sequence optimization GA.

This is a heck of a constructor, but you need to make a lot of choices to use the GA.

Parameters:

initialSequence (str) – A string of length (input-length + numMutations)
inputLength (int) – is the length of sequence that is given to your model.
populationSize (int) – is the number of organisms at the end of each generation.
numCorruptors (int) – determines how many corruptors will be applied in each organism.
allowedCorruptors (list[CandidateCorruptor]) – is a list of tuples. (see below)
checkCorruptors (Callable[[list[Corruptor]], bool]) – is a function that determines if a list of corruptors is valid. (see below)
fitnessFn (Callable[[Profile, list[Corruptor]], float]) – The fitness function. It takes two arguments: The first argument is a list of tuples containing the model outputs. These have the same organization as the outputs from batchPredictor. The second argument is a list of corruptors, as presented to checkCorruptors.
numSurvivingParents (int) – The number of parents that will be kept for the next generation, usually referred to as elitism in GA terminology.
predictor (utils.BatchPredictor) – A BatchPredictor that has been set up with the model you want to use.

The initial sequence must be longer than needed for your model. The extra length is needed because this GA can have deletions and you need sequence to fill in the gaps when that happens. If you limit the number of deletions allowed (say, by not having them in allowedCorruptors or restricting them through checkCorruptors), then you need to provide (input-length + numPossibleDeletions), and if you’re only allowing SNPs, then numPossibleDeletions = 0.

allowedCorruptors is a list of tuples, with each tuple being a CandidateCorruptor. Each tuple contains two elements: the first is a number, representing the position in the input sequence (starting at zero), and the second is a string containing the allowed corruptions of the base at that position. For example, if your sequence is AGGCA, and you want any base to be corruptible to any other except that the C cannot be corrupted to a T and the second G cannot be corrupted at all, allowedCorruptors would be [(0, "CGT"), (1, "ACT"), (3, "AG"), (4, "CGT")]. In the string of possible corruptors, the capital letters A, C, G, and T refer to a SNP of the base at the given position to the new letter, a lowercase 'd' means that the base there should be deleted, and a letter with a caron, like Ǎ, Č, Ǧ, or Ť, means that the given letter should be inserted AFTER the base number.

checkCorruptors should take a list of Corruptor. These are tuples: the first element of each tuple is a number, representing the base to be corrupted, and the second is a single character, representing the corruption to be applied. The corruptor locations will always be sorted in increasing order. You could use this, for example, to make sure that no two corruptors are next to each other. If you don’t wish to apply any logic to the corruptors, pass in lambda x: True.
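As a worked version of the AGGCA example, the sketch below builds the same allowedCorruptors list programmatically and defines one possible checkCorruptors. The names forbidden, frozenPositions and noAdjacentCorruptors are hypothetical, chosen for this illustration.

```python
sequence = "AGGCA"
forbidden = {(3, "T")}   # the C at position 3 may not become a T
frozenPositions = {2}    # the second G may not be corrupted at all

allowedCorruptors = []
for pos, base in enumerate(sequence):
    if pos in frozenPositions:
        continue
    # Every base except the current one, minus any forbidden SNPs.
    letters = "".join(b for b in "ACGT"
                      if b != base and (pos, b) not in forbidden)
    allowedCorruptors.append((pos, letters))


def noAdjacentCorruptors(corruptors):
    """Reject lists with corruptors on the same or neighboring bases."""
    positions = [pos for pos, _letter in corruptors]
    # Relies on the positions arriving sorted in increasing order.
    return all(b - a > 1 for a, b in zip(positions, positions[1:]))


print(allowedCorruptors)
# [(0, 'CGT'), (1, 'ACT'), (3, 'AG'), (4, 'CGT')]
```

Either noAdjacentCorruptors or lambda x: True could then be passed as checkCorruptors.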

initialSequence: str: The un-mutated sequence that all organisms will work with.

organisms: list[Organism]

The Organisms in this population.

When the population is created, the organisms will have no profile or score information.

After you execute runCalculation(), each organism will contain profile and score data, and the organisms list will be sorted in ascending order of score. So organisms[-1] is the best organism in the population.

After you call nextGeneration(), the organisms in the population are reset and don’t contain score or profile information any more.

runCalculation()

Run the GA.

Runs the current population through the model, assigns scores to each organism, and sorts the organisms by score. If you want to save a list of the best parents, remember that the organisms are sorted in ascending order of fitness, so the best organism is pop.organisms[-1].

Return type:: None

nextGeneration()

Perform mutation and mixing to generate new children.

Once you’ve called runCalculation(), call this to create new organisms for the next generation. This replaces the .organisms list with the new children, and those organisms will not have any profile or score data.

Return type:: None
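Putting runCalculation() and nextGeneration() together, a driver loop might look like the sketch below. runGenerations and the onBest callback are hypothetical names. pop is assumed to be an already-constructed Population, and the best organism is handed to the callback before nextGeneration() resets the scores.

```python
def runGenerations(pop, numGenerations, onBest):
    """Alternate scoring and breeding, reporting the best organism each time."""
    for generation in range(numGenerations):
        pop.runCalculation()
        # Organisms are sorted in ascending order of score, so the last is the best.
        onBest(generation, pop.organisms[-1])
        # Skip breeding after the final round so the scores stay available.
        if generation < numGenerations - 1:
            pop.nextGeneration()
```

Because the last call is runCalculation(), pop.organisms still carries profile and score data when the function returns.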

bpreveal.gaOptimize.getCandidateCorruptorList(sequence, regions=None, allowDeletion=True, allowInsertion=True)

Give the corruptors that are possible in a given sequence.

Parameters:

sequence (str) – The DNA sequence that will be used.
regions (list[tuple[int, int]] | None) – An optional list of tuples giving allowed regions for mutations.
allowDeletion (bool) – Is deletion a valid corruptor type?
allowInsertion (bool) – Is insertion a valid corruptor type?

Returns:

A list of candidate corruptors, one (position, letters) tuple for each eligible position.

Return type:

list[Corruptor]

Given a sequence (a string), this generates a list of tuples that contain all of the possible corruptors for each position in the sequence. It will have the format [(0, "ACGdǍČǦŤ"), (1, "AGTdǍČǦŤ"), (2, "CGTdǍČǦŤ"), ...] where each number is a position in the sequence and the letters are the things that can be done at that location. A regular letter means a SNP, a d means deletion and the accented letters mean an insertion after the position.

regions, if provided, must be a list of tuples. Each tuple contains two numbers, giving a start and stop location (left-inclusive, right-exclusive, like python array slicing) of bases that are corruptor candidates. For example, with allowDeletion and allowInsertion both False, regions=[(100,200), (250,300)] would return [(100, "ACG"), (101, "AGT"), ... (199, "CGT"), (250, "ACT"), ... (299,"ACG")]. The start position in each region tuple must be less than the stop point, but this method will check for (and remove) overlapping regions. If allowDeletion is True, then the strings will contain 'd', and if allowInsertion is True, then the strings will contain 'ǍČǦŤ'.
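The output format can be reproduced with a short stand-alone sketch. candidateListSketch is a hypothetical re-implementation for illustration only. It assumes an uppercase ACGT sequence, and it treats overlapping regions by counting each base once, which is one reading of the overlap handling described above.

```python
def candidateListSketch(sequence, regions=None, allowDeletion=True, allowInsertion=True):
    """List the possible corruptors at each eligible position."""
    if regions is None:
        regions = [(0, len(sequence))]
    # A set means bases covered by overlapping regions are only listed once.
    positions = set()
    for start, stop in regions:
        if start >= stop:
            raise ValueError("each region needs start < stop")
        positions.update(range(start, stop))
    extras = ("d" if allowDeletion else "") + ("ǍČǦŤ" if allowInsertion else "")
    return [(pos, "".join(b for b in "ACGT" if b != sequence[pos]) + extras)
            for pos in sorted(positions)]


print(candidateListSketch("TCA"))
# SNPs only, with two overlapping regions.
print(candidateListSketch("AGGCA", regions=[(0, 2), (1, 3)],
                          allowDeletion=False, allowInsertion=False))
# [(0, 'CGT'), (1, 'ACT'), (2, 'ACT')]
```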

bpreveal.gaOptimize.anyCorruptorsCloserThan(corList, distance)

Are any corruptors close to each other?

Parameters:

corList (list[Corruptor]) – The corruptors to consider.
distance (int) – The minimum distance that must be between corruptors.

Returns:

True if there exist two corruptors from corList that are less than distance apart, False otherwise.

Return type:

bool

A utility function that you can integrate into checkCorruptors to see if any corruptors are close to each other. It takes a sorted list of corruptors (each corruptor a tuple of (position, effect)) and the distance to test. For example, to prevent corruptors on adjacent bases, use distance=1. To ensure a gap of two bases between each corruptor, use distance=2.

bpreveal.gaOptimize.removeCorruptors(corruptorList, corsToRemove)

Take corruptors out of a list.

Parameters:

corruptorList (list[CandidateCorruptor]) – A list of candidate corruptors, like [(100, "ACG")].
corsToRemove (list[Corruptor]) – A list of corruptors.

Returns:

A newly-allocated list of candidate corruptors with the given corruptors removed.

Return type:

list[CandidateCorruptor]

Given a candidate corruptor list, like from getCandidateCorruptorList, and a list of tuples giving disallowed corruptors, return a new candidate corruptor list where those disallowed corruptors are removed. corsToRemove is a list of tuples. The first element of each tuple is the position (relative to the start of the input sequence) that needs a corruptor removed, and the second is a string containing the forbidden letters.

For example:

removeCorruptors([(100, "ACG"), (101, "AT"), (102, "CGT"),
                  (103, "CT"), (104, "AGT")],
                 [(101, "T"), (103, "CT"), (104, "G")])
-> [(100, "ACG"), (101, "A"), (102, "CGT"), (104, "AT")]
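The filtering behavior in that example can be written out as a short sketch. removeCorruptorsSketch is a hypothetical re-implementation based only on the description above, not the bpreveal source. Note that a position whose letters are all removed disappears from the result.

```python
def removeCorruptorsSketch(corruptorList, corsToRemove):
    """Return a new candidate list without the forbidden letters."""
    forbidden = {}
    for pos, letters in corsToRemove:
        forbidden.setdefault(pos, set()).update(letters)
    result = []
    for pos, letters in corruptorList:
        kept = "".join(l for l in letters if l not in forbidden.get(pos, ()))
        if kept:  # drop positions that have no candidates left
            result.append((pos, kept))
    return result


candidates = [(100, "ACG"), (101, "AT"), (102, "CGT"), (103, "CT"), (104, "AGT")]
print(removeCorruptorsSketch(candidates, [(101, "T"), (103, "CT"), (104, "G")]))
# [(100, 'ACG'), (101, 'A'), (102, 'CGT'), (104, 'AT')]
```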

bpreveal.gaOptimize.validCorruptorList(corruptorList)

Is a list of corruptors even possible?

Parameters:: corruptorList (list[Corruptor]) – The corruptors to consider.
Returns:: True if the corruptors could be applied by an organism, False otherwise.
Return type:: bool

A valid list satisfies the following property. For each pair \((c_n, c_{n+1})\) in corruptorList:

\(c_n[0] \le c_{n+1}[0]\) (The list is ordered by the position of the corruptor.)
If \(c_n[0] = c_{n+1}[0]\):
- \((c_n[1], c_{n+1}[1])\) must be in sorted order. For stupid reasons, sorted order is ACGTdČŤǍǦ. (Python compares characters by Unicode code point, and that is the order of these code points.)
- Neither \(c_n[1]\) nor \(c_{n+1}[1]\) is "d".
  (You can’t delete a base and do anything else to it.)
- \(c_n[1] \in\) "ACGT" \(\implies c_{n+1}[1] \in\) "ǍČǦŤ".
  (If you have a SNP, the only other thing at that position must be an insertion.)
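These rules translate fairly directly into code. The sketch below is a hypothetical re-implementation for illustration (validSketch is not the bpreveal function), and using a non-strict comparison for the sorted-order rule is an assumption.

```python
SNP_LETTERS = "ACGT"
INSERTION_LETTERS = "ǍČǦŤ"


def validSketch(corruptorList):
    """Check each neighboring pair of corruptors against the rules above."""
    for (posA, letterA), (posB, letterB) in zip(corruptorList, corruptorList[1:]):
        if posA > posB:
            return False  # not ordered by position
        if posA == posB:
            if letterA > letterB:
                return False  # letters out of code-point order
            if "d" in (letterA, letterB):
                return False  # a deleted base cannot be changed any other way
            if letterA in SNP_LETTERS and letterB not in INSERTION_LETTERS:
                return False  # a SNP may only be paired with an insertion
    return True


print("".join(sorted("ACGTdǍČǦŤ")))           # ACGTdČŤǍǦ
print(validSketch([(5, "A"), (5, "Ǎ")]))      # True: SNP followed by an insertion
print(validSketch([(5, "Ǎ"), (5, "Č")]))      # False: Č sorts before Ǎ
```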

bpreveal.gaOptimize.plotTraces(posTraces, negTraces, xvals, annotations, corruptors, ax)

Generate a nice little plot with pips for corruptors and boxes for annotations.

Parameters:

posTraces (list[tuple[ndarray[Any, dtype[float32]], str, str]]) – The profiles to plot above the X axis.
negTraces (list[tuple[ndarray[Any, dtype[float32]], str, str]]) – The profiles to plot below the X axis.
xvals (ndarray[Any, dtype[float32]]) – An array giving the genomic coordinates of your data.
annotations (list[ANNOTATION_T]) – Annotations that you’d like to put on your plot.
corruptors (list[Corruptor]) – A list of Corruptors that you’d like to put on your plot.
ax (Axes) – The matplotlib Axes object to draw on.

Return type:

None

posTraces is a list of tuples. Each tuple has three things:

A one-dimensional array of values. The number of values must be the same as the number of points in xvals.

A string that will be used as a label for the trace.

A string that will be used for the color of the trace.

negTraces has the same structure as posTraces, but will be negated before being plotted. This is handy to make different conditions visually distinct, and of course for ChIP-nexus data.

annotations is a list of tuples. Each tuple contains:

a pair of integers, giving the start and stop points of that annotation,

a string giving its label,

a string giving its color,

(Optional) a float giving the bottom of its annotation box, and
(Optional) a float giving the top of its annotation box.

For example:
`[((431075, 431089), "FKH2", "red", 0.5, 0.7),
((431200, 431206), "PHO4", "blue", 0.3, 0.6)]`

corruptors has the same format as the corruptors in an organism, but be sure you shift the coordinates appropriately so that they line up with xvals. In other words, the coordinates of the corruptors should be the real genomic coordinates of the mutations. (Remember that corruptor coordinates are relative to the start of the INPUT, not the output.)
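The coordinate shift is a one-liner. In this sketch, shiftCorruptors and inputStart are hypothetical names, with inputStart standing for the genomic coordinate of the first base of the model's input sequence, and the numbers are made up for illustration.

```python
def shiftCorruptors(corruptors, inputStart):
    """Move input-relative corruptor positions onto genomic coordinates."""
    return [(pos + inputStart, letter) for pos, letter in corruptors]


organismCorruptors = [(85, "A"), (1525, "Ť"), (1603, "d")]
genomicCorruptors = shiftCorruptors(organismCorruptors, inputStart=430000)
print(genomicCorruptors)
# [(430085, 'A'), (431525, 'Ť'), (431603, 'd')]
```

The shifted list is what would be passed as the corruptors argument, alongside xvals in the same genomic coordinates.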