Generating Correlated Random Variables in Python: A Guide

In a previous article, I explored the simulation of normally distributed random numbers with defined mean and variance through the Box-Muller technique. This method involves generating a set of independent uniform random numbers and transforming them into normally distributed random numbers using polar coordinates, yielding two uncorrelated variables, X and Y.

While this approach is effective for producing uncorrelated variables, it’s important to understand correlation—how well one variable Y can be predicted by another variable X. The correlation coefficient, denoted as ?, ranges from -1 to +1, with -1 indicating perfect anti-correlation and +1 indicating perfect correlation. Visual representations can clarify how these coefficients relate to different datasets.

When the correlation coefficient is near -1 or +1, it suggests a strong linear relationship between X and Y, allowing for reliable predictions. Conversely, a value close to zero indicates weak or no correlation, making it difficult to establish a meaningful connection between the two variables. Although the correlation coefficient provides valuable insights, it doesn't capture the entire picture. Our goal here is to create random variables that exhibit a predetermined correlation coefficient ?.

We will achieve this by employing two independent standard normal random variables, S1 and S2, each with a mean of 0 and a variance of 1. Since they are generated independently, they are uncorrelated by default. We'll verify this correlation using Python by computing the correlation coefficient. The formula for correlation is as follows:

Here, Cov[X, Y] denotes the covariance between X and Y, ? represents the standard deviation, and E[X] signifies the expected value or mean of variable X. For a finite number of N random samples, the expected value E[X] can be defined as:

Likewise, variance and standard deviation for a variable X can be defined as:

Next, we will run a simple code snippet to compute S1 and S2 by generating uniformly distributed random variables U1 and U2:

import matplotlib.pyplot as plt import numpy as np plt.close('all') np.random.seed(0) # Set seed

# Inputs samples = 1000000

# Random samples (Uniformly distributed) U1 = np.random.rand(samples, 1) U2 = np.random.rand(samples, 1)

# Random samples (normally distributed uncorrelated) S1 = np.sqrt(-2 * np.log(U1)) * np.cos(2 * np.pi * U2) S2 = np.sqrt(-2 * np.log(U1)) * np.sin(2 * np.pi * U2)

E_S1 = np.mean(S1) Var_S1 = np.mean(S1**2) - E_S1**2 sigma_S1 = np.sqrt(Var_S1) E_S2 = np.mean(S2) Var_S2 = np.mean(S2**2) - E_S2**2 sigma_S2 = np.sqrt(Var_S2)

Cov_S1_S2 = np.mean(S1 * S2) - E_S1 * E_S2 Corr_S1_S2 = Cov_S1_S2 / sigma_S1 / sigma_S2

print('corr(S1,S2) = ' + str(Corr_S1_S2))

Executing this code yields:

corr(S1,S2) = 0.0010687581117255355

This value is very close to zero. As we increase the sample size, the correlation tends to decrease and eventually approach zero for large N, although this convergence can be gradual.

Our next objective is to derive correlated variables X and Y from the uncorrelated normal random variables S1 and S2, using a specified correlation coefficient ?. We apply the following transformation:

But how does this yield normally distributed X and Y with correlation ?? To demonstrate this, we will use properties of normal random variables:

This holds true when S1 and S2 are statistically independent (uncorrelated). Utilizing our previous correlation definition and the fact that both S1 and S2 have a mean of zero and a variance of one:

Thus, we conclude:

Now, let's implement this in Python:

import matplotlib.pyplot as plt import numpy as np plt.close('all') np.random.seed(0) # Set seed

# Inputs samples = 1000000

# Random samples (Uniformly distributed) U1 = np.random.rand(samples, 1) U2 = np.random.rand(samples, 1)

# Random samples (normally distributed uncorrelated) S1 = np.sqrt(-2 * np.log(U1)) * np.cos(2 * np.pi * U2) S2 = np.sqrt(-2 * np.log(U1)) * np.sin(2 * np.pi * U2)

E_S1 = np.mean(S1) Var_S1 = np.mean(S1**2) - E_S1**2 sigma_S1 = np.sqrt(Var_S1) E_S2 = np.mean(S2) Var_S2 = np.mean(S2**2) - E_S2**2 sigma_S2 = np.sqrt(Var_S2)

Cov_S1_S2 = np.mean(S1 * S2) - E_S1 * E_S2 Corr_S1_S2 = Cov_S1_S2 / sigma_S1 / sigma_S2

# Correlated random samples mu_x = 0.5 mu_y = 0.66 sigma_x = 0.85 sigma_y = 1.24 rho = 0.5 X = mu_x + sigma_x * S1 Y = mu_y + sigma_y * (rho * S1 + np.sqrt(1 - rho**2) * S2)

E_X = np.mean(X) Var_X = np.mean(X**2) - E_X**2 sigma_X = np.sqrt(Var_X) E_Y = np.mean(Y) Var_Y = np.mean(Y**2) - E_Y**2 sigma_Y = np.sqrt(Var_Y)

Cov_X_Y = np.mean(X * Y) - E_X * E_Y Corr_X_Y = Cov_X_Y / sigma_X / sigma_Y

print('corr(X,Y) = ' + str(Corr_X_Y))

This code outputs:

corr(S1,S2) = 0.0010687581117255355 corr(X,Y) = 0.5005530467691287

As anticipated, the correlation between X and Y is approximately 0.5, aligning with our defined value of ?.

Finally, let's create scatter plots for Y versus X and S2 versus S1 to observe their differences. Based on theory, we expect S1 and S2 to be symmetrically distributed around (0, 0), while X and Y should form a tilted ellipse centered at (mu_x, mu_y). To ensure clarity in our plots, we'll reduce the number of samples:

# Inputs samples = 1000

# Generate plots plt.subplot(1, 2, 1) plt.plot(S1, S2, linestyle="", marker="o", color="blue") plt.xlabel('S1', fontsize=16) plt.ylabel('S2', fontsize=16)

plt.subplot(1, 2, 2) plt.plot(X, Y, linestyle="", marker="o", color="green") plt.xlabel('X', fontsize=16) plt.ylabel('Y', fontsize=16)

This results in the following visual representation:

As observed, the correlated variables X and Y present a more elliptical distribution compared to S1 and S2, and are tilted at a 45-degree angle. Now that you know how to create correlated random variables, your possibilities are virtually limitless!