ushuffle

A wrapper around the ushuffle C implementation.

bpreveal.ushuffle.shuffleString(sequence, kmerSize, numShuffles=1, seed=None)

Given a string sequence, perform a shuffle that maintains the kmer distribution.

This is adapted from ushuffle.

Parameters:
  • sequence (str) – The string to shuffle. It should be ASCII. Multi-byte UTF-8 input should theoretically work so long as kmerSize is at least as long as the longest byte sequence for a character in the input. (Please don’t rely on this!)

  • kmerSize (int) – The size of kmers that should have their distribution preserved.

  • numShuffles (int) – The number of shuffled versions of the input to return.

  • seed (int | None) – Seed for the random number generator.

Returns:

a list of shuffled strings.

Return type:

list[str]
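
The property that shuffleString is described as preserving can be checked by counting overlapping kmers before and after a shuffle. The following is a minimal sketch of such a check. The helper kmerCounts is a hypothetical name that is not part of bpreveal, and the candidate string is a hand-written rearrangement rather than actual library output.

```python
from collections import Counter


def kmerCounts(sequence: str, kmerSize: int) -> Counter:
    """Count every overlapping kmer of length kmerSize in sequence.

    Hypothetical helper for illustration; not part of bpreveal.
    """
    numKmers = len(sequence) - kmerSize + 1
    return Counter(sequence[i:i + kmerSize] for i in range(numKmers))


original = "AGACA"
candidate = "ACAGA"  # hand-written rearrangement that keeps all 2-mers
scrambled = "AACGA"  # same letters, but a different set of 2-mers

# True when the two strings contain the same 2-mers the same number of times.
sameDimers = kmerCounts(original, 2) == kmerCounts(candidate, 2)
brokenDimers = kmerCounts(original, 2) == kmerCounts(scrambled, 2)
print(sameDimers, brokenDimers)
```

With bpreveal installed, the same comparison could be applied to each string returned by shuffleString(original, 2), which, going by the description above, should match the kmer counts of the input.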

bpreveal.ushuffle.shuffleOHE(sequence, kmerSize, numShuffles=1, seed=None)

Given a one-hot sequence, perform a shuffle that maintains the kmer distribution.

Parameters:
  • sequence (ndarray[tuple[Any, ...], dtype[uint8]]) – The sequence to shuffle.

  • kmerSize (int) – The size of kmers that should have their distribution preserved.

  • numShuffles (int) – The number of shuffled versions of the input to return.

  • seed (int | None) – Seed for the random number generator.

Returns:

an array of shape (numShuffles, length, alphabetLength)

Return type:

ndarray[tuple[Any, ...], dtype[uint8]]

The input sequence should have shape (length, alphabetLength). For DNA, alphabetLength == 4. It is an error to have an alphabet length of more than 8. Internally, this function packs the bits at each position into a single character, shuffles the resulting string, and then unpacks it. For this reason, a position may have more than one letter hot, or no letters hot at all. For example, this one-hot encoded sequence is valid input:

Pos A C G T
0   1 0 0 0
1   0 1 0 0
2   1 0 1 0
3   0 1 1 1
4   0 0 0 0
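
The packing step described above can be illustrated with a short pure-Python sketch. It rests on several assumptions: the names packOneHot and unpackOneHot are hypothetical, nested lists stand in for the uint8 array, and the bit order (column 0 in the lowest bit) is a guess, since the order the library actually uses is not stated here.

```python
def packOneHot(rows):
    """Pack each position's bits into one byte (column 0 -> lowest bit).

    Hypothetical helper; the bit order is an assumption for illustration.
    """
    packed = bytearray()
    for row in rows:
        if len(row) > 8:
            # More than 8 letters cannot fit into a single byte.
            raise ValueError("alphabet length must be at most 8")
        code = 0
        for column, bit in enumerate(row):
            code |= (bit & 1) << column
        packed.append(code)
    return bytes(packed)


def unpackOneHot(packed, alphabetLength):
    """Invert packOneHot, recovering one row of bits per byte."""
    return [[(code >> column) & 1 for column in range(alphabetLength)]
            for code in packed]


# The example table above, one row per position (columns A, C, G, T).
rows = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
packed = packOneHot(rows)
restored = unpackOneHot(packed, 4)
print(list(packed), restored == rows)
```

Because every bit pattern maps to a distinct byte, positions with several hot letters or none survive the round trip unchanged, which is why such rows are valid input.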

This is adapted from ushuffle.