Poisson Distribution method for runs analysis
of random bit sequences
By Arthur R. Kopp 9/03/14
See my probability tidbits page TIDBITS
In a n bit sequence of random bits, a run of length two (L = 2 ) can be defined as a 0110 pattern for runs of ONES and as a
1001 pattern for runs of ZEROS. There are sixteen different possible patterns of four bits:
0000 0001 0010 0011 0100 0101 0110 0111
1000 1001 1010 1011 1100 1101 1110 1111
It can be seen that each run pattern, so defined, exists once in sixteen possibilities so that we might expect
each pattern to occur about n / 16 times on average. Similarly, runs of length three can be expected about
n / 32 times on average. Keeping runs of ONES separate from runs of ZEROS, each can be expected to occur
n / (2 ^ (L + 2 )) times on average. If runs of ONES and ZEROS are lumped together, and we're interested only
in RUN counts, we expect to have on average n / (2 ^ (L + 1)) (inclusive mean is twice the exclusive mean). A
discussion of the algorithm I have used to detect and count the runs will be presented later in this writeup.
Mark Nelson Mark discussed a Fibonacci
n-step (Fibo) method he used to answer the question of the probability of having a run
of twenty heads in one million tosses of a fair coin. Would a Poisson Distribution (PD) method provide a far simpler
means of computation?
Indeed it does. In fact, all that's needed in this case is the Poisson result for a count
of zero (x = 0). Then two of the PD formula terms simply equal one. The zero count result means the probabilty of no
runs, so subtracting that result from one gives the probability of at least one run (which is what the Fibo
method calculates). The Fibo method, developed as Mark did it, calculates the inclusive case so we use the inclusive mean.
Using (n)(L) notation we have for the (1,000,000)(20) problem:
lamda = 1000000 / (2 ^ 21) = 0.4768371
P(0) = epsilon ^ (- lamda) = 0.6207436
Probability of at least one run = 1 - P(0) = 0.3792563
The result is the same as Mark's result out to five decimal places. Here's a comparison of probability
calculations of the Fibo method to the Poisson method for n = 10,000 and n = 100,000:
I've wrtten a experimental program (TRAND) that uses PD cumulative sums to obtain a test metric for long runs
(low probabilities). Smaller run lengths are handled differently as discussed below. The program tests
run lengths as small as one (010 and 101 patterns). The test page shown below displays the actual
counts found on a random sample along with calculated probabilities.
The bell shaped curve below illustrates a Poisson Distribution (PD) having good symmetry. The curves are not symmetrical
for long runs, so cumulative sums must be done on one side or the other of the mean (center of the bell curve). If a
target count is less than the mean, summing of the PD probabilities begins at x = 0 and ends when x = the target count value
(sum a from left to right). If a target count is greater than the mean, summing begins at a large x and x is decremented until
x equals the target count value (sum b from right to left). If the sum is greater than the test limit (white portion of a
cumulative sum) the test is passed.
SUM A + SUM B = 1. If a test limit of 0.01 is set, the required pass/fail threshold is set at half that or 0.005. Since the
threshold is in effect for both sums, the result is 0.5% + 0.5% = 1.0% of the total sum of probabilities. Thus 1% of all files
tested are expected to fail at a 0.01 test limit.
For large values of the mean the complementary error function (erfc) can be used instead of cumulative summations.
Note that lamda * qa below is the Binomial Distribution (BD) variance. For the Poisson, lambda is the variance.
For run length L:
pa = 1 / (2 ^ (L+2)) qa = 1 - pa
diff = ABS(lamda - count) where ABS is the absolute value
x = diff / sqrt(2 * lambda * qa)
P = erfc(x)
Program TRAND uses the erfc method for lamda > 32. At this point, BD and PD produce practically identical results,
the bell curves are symmetrical, and qa is nearly 1.
The method of detecting and counting runs of various lengths is very simple. Sequential ONE bits increment ONES
counter C1. A ZERO bit will result in incrementing array element A(C1) providing C1 > 0. C1 is then reset to zero.
Similarly, sequential ZERO bits increment C0. A ONE bit will result in incrementing array element B(C0) providing
C0 > 0. Thus, after the file is read, A(1) will be the count of run length 1 of ONES, A(2) will be the count of run
length 2 of ONES, etc. Similarly, the B array will hold counts of runs of ZEROS.
If the first two bits of a file are 10 a run of length one ONES is counted. If the first three bits are 110 a run
of length two ONES is counted, etc. It's as if a dummy leading ZERO is supplied to form the 010 or 0110 patterns. The
algorithm includes a handler for the end of the file which, in effect, supplies a dummy run terminating bit so a run will
be counted. For example, if the last four bits are 0111 a run of length three will be counted as if a 01110 pattern
Bits in various patterns must be shared and not discarded. Here's two examples of the first 32 bits of a file:
This has 15 runs of length 1 of both ZEROS and ONES
2 runs of ZEROS L=13, 1 run of ZEROS L=1, and 4 runs of ONES L=1
A pseudo random bits generator was used in a study of the method. 1,000 files of 262,144 bits were
generated, and counts of various run lengths each file were averaged. Here's a sample of the first
three run size results.
Run length 1 result = 32,767.874 (32,768 assumed mean)
Run length 2 result = 16,384.92 (16,384 assumed mean)
Run length 3 result = 8,193.396 (8,192 assumed mean)
My DOS program can be downloaded from here. Trand is the experimental
program mentioned that tests random bit files. The programs run in Windows XP without installing any special provisions.
Download trand: TRAND.EXE
NIST Special Publication 800-22 Revision 1a is a excellent testing reference: 800-22
Back to main page: RUNS