Seeds#
Packages#
In NumPy, the two main packages you’ll come across in relation to random seeds are numpy
and random
.
In this script, we will demonstrate using numpy.
import numpy as np
Generating a random number#
In order to illustrate how different practices with random number generators work, we will use the random()
function, which returns a random float between 0 and 1.
Without a seed, you can see that we’ll get a different answer each time.
In Python, these values are produced by a pseudorandom number generator.
print(np.random.random())
print(np.random.random())
0.9160257015890815
0.014614208720352462
What is a pseudorandom number generator?#
A random number generator (RNG) is a system generating random numbers from a true source of randomness (e.g. electrostatic noise converted into random numbers). However, we don’t need true randomness, so we use a pseudorandom number generator (PRNG).
A PRNG is also known as a deterministic random bit generator (DRBG). It generates a sequence of numbers that look close to random - but are not truly random, since it is determined by an intial value called the seed. If you do not set a seed, then it may use the current system time in seconds or milliseconds as the seed. [source 1] [source 2]
Global random number generator#
We have what can be referred to a the “global” RNG for our script. It is common practice to set the seed for the global RNG at the start of the script using random.seed()
or np.random.seed()
.[source]
As you’ll see, setting this seed ensures you then get the same result.
np.random.seed(5)
print(np.random.random())
np.random.seed(5)
print(np.random.random())
0.22199317108973948
0.22199317108973948
However, this is actually considered bad practice. Global variables (like a global RNG) are not recommended as they can have influences throughout your script.
You could change the variable for one part of the script and inadvertently impact elsewhere in the script - such as in the case of random numbers, where you could unknowingly reset the seed elsewhere in the codebase.
The numpy RNG policy instead recommends that you create a generator object with a seed and pass that around…[source]
Create a generator object and pass it around#
You can use np.random.default_rng()
to create a RNG.
You then use that RNG object when calling .random()
.
# Illustration that this enables reproducibility
rng = np.random.default_rng(seed=12345)
print(rng.random())
rng = np.random.default_rng(seed=12345)
print(rng.random())
# Showing that this is a generator object
print(rng)
0.22733602246716966
0.22733602246716966
Generator(PCG64)
It has generally been said that you should create a single PRNG and draw from it sequentially (rather than creating new PRNGs within your script). This is having multiple streams, you can be concerned about what the streams from different seeds might be less independent/more correlated that those creates from a single seed. However, the numpy developers have explained that the size and entropy seeding of their PRNGs means that this concern is not particularly relevant. [source 1] [source 2]
# Using a single PRNG...
rng = np.random.default_rng(seed=12345)
print(rng.random())
print(rng.integers(1,5))
print(rng.standard_normal())
0.22733602246716966
4
-0.8706617379590857
Parallel random number generation#
If we have a stochastic simulation and we want to run it thirty times, we might do so via paralleling processing (such as with the joblib
library). We don’t want the same results each run, but we do want to be able to reproduce our set of thirty runs. We need to be wary that:
Not setting a seed, the forked Python processes might “use the same random seed, generated for instance from system entropy, and thus produce the exact same outputs”
Setting a single seed, the RNG will be deep copied
Setting different seeds for each RNG, we cannot be sure that the RNGs are statistically independent. [source]
NumPy offers a few different methods to create independent PRNGs:
SeedSequence spawning
Sequence of integer seeds
Independent BitGenerator
Jumping the BitGenerator state [source]
We’ll focus on SeedSequence spawning…
SeedSequence spawning#
You provide a seed to SeedSequence. It then uses this seed to create N BitGenerator states that are highly probable to be independent of each other. Conversion to bit generator states “uses hashing techniques to ensure that low-quality seeds are turned into high quality initial states”. You can then use these to instantiate PRNGs.[source]
ss = np.random.SeedSequence(entropy=12345)
print(f'ss: {ss}')
# Spawn 10 child SeedSequences to pass to child processes
child_seeds = ss.spawn(10)
print(f'example child_seeds: {child_seeds[0:2]}')
# Create 10 PRNGs
prngs = [np.random.default_rng(s) for s in child_seeds]
print(f'example PRNGs: {prngs[0:2]}')
ss: SeedSequence(
entropy=12345,
)
example child_seeds: [SeedSequence(
entropy=12345,
spawn_key=(0,),
), SeedSequence(
entropy=12345,
spawn_key=(1,),
)]
example PRNGs: [Generator(PCG64) at 0x70A2B3C559A0, Generator(PCG64) at 0x70A2B3C54D60]
For convenience, you can spawn independent child generators directly from a parent generator (without needing to directly use SeedSequence
). Those children can likewise be used to spawn yet more independent generators.[source]
parent_rng = np.random.default_rng(12345)
print(parent_rng)
streams = parent_rng.spawn(10)
streams
Generator(PCG64)
[Generator(PCG64) at 0x70A2B3C55E00,
Generator(PCG64) at 0x70A2B3C565E0,
Generator(PCG64) at 0x70A2B3C56340,
Generator(PCG64) at 0x70A2B3C56500,
Generator(PCG64) at 0x70A2B3C54BA0,
Generator(PCG64) at 0x70A2B3C567A0,
Generator(PCG64) at 0x70A2B3C56880,
Generator(PCG64) at 0x70A2B3C566C0,
Generator(PCG64) at 0x70A2B3C54C80,
Generator(PCG64) at 0x70A2B3C56DC0]
Creating individual random number streams for sampling#
In stochastic simulations, we sample from a distribution. However, it is important to set individual random number streams for sampling, as illustrated below (based on source).
In our example, we sample acute length of stay (LoS) from an exponential distribution and rehabilitation LoS from a uniform distribution.
# Seed and parameters for our distributions
seed = 42
acute_mean = 32
rehab_min = 15
rehab_max = 80
n_patients = 5
# Instantiate PRNG
rng = np.random.default_rng(seed)
# Sample 5
acute_los = rng.exponential(scale=1/acute_mean, size=n_patients)
rehab_los = rng.uniform(low=rehab_min, high=rehab_max, size=n_patients)
print(f'Acute LoS: {acute_los}')
print(f'Rehab LoS: {rehab_los}')
Acute LoS: [0.07513152 0.07300593 0.07452378 0.00874357 0.00270117]
Rehab LoS: [78.41545286 64.47408063 66.09417984 23.32738612 44.27508596]
Then we reset the stream using the same seed but limit to simulating just two patients.
As such, we might expect to get this when rerunning the code:
we reduce the size two (for example, reflecting less patients arriving per hour). We might expect to get:
Acute LoS: [0.07513152 0.07300593]
Rehab LoS: [78.41545286 64.47408063]
But we do not! Instead we get -
# Change to 2 patients
n_patients = 2
# Instantiate PRNG
rng = np.random.default_rng(seed)
# Sample 5
acute_los = rng.exponential(scale=1/acute_mean, size=n_patients)
rehab_los = rng.uniform(low=rehab_min, high=rehab_max, size=n_patients)
print(f'Acute LoS: {acute_los}')
print(f'Rehab LoS: {rehab_los}')
Acute LoS: [76.93467533 74.75806899]
Rehab LoS: [70.80886479 60.32892189]
This is because both sampling methods as using the same PRNG/same pseudo random number stream.
This problem will introduce noise between experiments, where variation in results are due to this and not actually the change we made to the simulation.
Hence, it is recommended to use individual PRNGs for each sampling distribution (as in this example).
# Source: https://pythonhealthdatascience.github.io/stars-simpy-example-docs/content/02_model_code/04_model.html
class Exponential:
def __init__(self, mean, random_seed=None):
self.rng = np.random.default_rng(seed=random_seed)
self.mean = mean
def sample(self, size=None):
return self.rng.exponential(self.mean, size=size)
class Normal:
def __init__(self, mean, sigma, random_seed=None):
self.rng = np.random.default_rng(seed=random_seed)
self.mean = mean
self.sigma = sigma
def sample(self, size=None):
return self.rng.normal(self.mean, self.sigma, size=size)
# Parameters
seed = 42
triage_mean = 3
exam_mean = 16
exam_var = 3
n_streams = 20
# Spawning seeds to use to create generators
seed_sequence = np.random.SeedSequence(seed)
seeds = seed_sequence.spawn(n_streams)
# Creating the seperate distributions...
triage_dist = Exponential(triage_mean, random_seed = seeds[0])
exam_dist = Normal(exam_mean, np.sqrt(exam_var), random_seed = seeds[1])
# Sampling five from each
print('Sampling five from each...')
print(triage_dist.sample(5))
print(exam_dist.sample(5))
# Creating the seperate distributions
triage_dist = Exponential(triage_mean, random_seed = seeds[0])
exam_dist = Normal(exam_mean, np.sqrt(exam_var), random_seed = seeds[1])
# Sampling two from each
print('Sampling two from each...')
print(triage_dist.sample(2))
print(exam_dist.sample(2))
Sampling five from each...
[2.90633292 5.48489155 1.37244157 0.83179124 9.79691459]
[18.17284798 17.05012411 13.67874431 14.14819982 18.3269625 ]
Sampling two from each...
[2.90633292 5.48489155]
[18.17284798 17.05012411]
Behaviour as expected - yay!
Another way of setting up this same thing (from me, just to help illustrate it in another way, but you’ll see it is the same results)…
From looking at it, you’ll also understand the benefit of the way the Classes are set up above.
# Create parent generator and spawn child sequences
parent_rng = np.random.default_rng(seed)
generators = parent_rng.spawn(10)
# Create generators for each distribution using the children
triage_rng = generators[0]
exam_rng = generators[1]
# Sample five from each
print('Sampling five from each...')
print(triage_rng.exponential(triage_mean, 5))
print(exam_rng.normal(exam_mean, np.sqrt(exam_var), 5))
# Remaking to illustrate reproducibility
# Sample two from each
parent_rng = np.random.default_rng(seed)
generators = parent_rng.spawn(10)
triage_rng = generators[0]
exam_rng = generators[1]
print('Sampling two from each...')
print(triage_rng.exponential(triage_mean, 2))
print(exam_rng.normal(exam_mean, np.sqrt(exam_var), 2))
Sampling five from each...
[2.90633292 5.48489155 1.37244157 0.83179124 9.79691459]
[18.17284798 17.05012411 13.67874431 14.14819982 18.3269625 ]
Sampling two from each...
[2.90633292 5.48489155]
[18.17284798 17.05012411]
Legacy Random Generation#
RandomState
is no longer supported. It is the old way of creating a random number generator, allowing you to set the seed just for the generator (and not messing about with the global random generation).
rng = np.random.RandomState(12345)
rng.randint(-1, 2, size=10)
array([ 1, 0, 0, 0, -1, 0, 1, 1, 0, 1])