Suppose we have two datasets with \(n1\) and \(n2\) data points respectively, and we know the mean and variance of each: \(\mu_1\), \(\sigma_1^2\) and \(\mu_2\), \(\sigma_2^2\). If we combine the two datasets into a single one, what is the variance of the combined dataset?

I found a solution on the Internet; here is the formula.

Now I will prove it. Let \(\mu = \frac{n1\mu_1+n2\mu_2}{n1+n2}\) be the mean of the combined dataset, and recall that \[ Var[x] = E[x^2] - (E[x])^2 \]


\[ \begin{aligned}
\sigma^2 &= \frac{n1E[x_1^2]+n2E[x_2^2]}{n1+n2} - (E[x])^2 \\
&= \frac{n1E[x_1^2]+n2E[x_2^2]}{n1+n2} - \mu^2 \\
&= \frac{n1(\sigma_1^2 + \mu_1^2)+n2(\sigma_2^2 + \mu_2^2)}{n1+n2} - \mu^2
\end{aligned} \]
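As a quick numerical sanity check of the last expression, here is a minimal Python sketch. The function name `combined_variance` and the sample lists are my own illustration, and it assumes population variance (dividing by \(n\), not \(n-1\)), which is what \(E[x^2] - (E[x])^2\) describes.

```python
from statistics import fmean, pvariance


def combined_variance(n1, mean1, var1, n2, mean2, var2):
    """Population variance of two merged datasets, from summary statistics only."""
    n = n1 + n2
    # Mean of the combined dataset (weighted average of the two means).
    mu = (n1 * mean1 + n2 * mean2) / n
    # E[x^2] of the combined dataset, using E[x_i^2] = var_i + mean_i^2.
    second_moment = (n1 * (var1 + mean1**2) + n2 * (var2 + mean2**2)) / n
    return second_moment - mu**2


# Illustrative data, not from the original post.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 12.0, 14.0]

from_stats = combined_variance(len(a), fmean(a), pvariance(a),
                               len(b), fmean(b), pvariance(b))
direct = pvariance(a + b)  # variance computed on the merged data itself

print(from_stats, direct)
```

The two printed values should agree up to floating-point rounding. For sample variance (the \(n-1\) version) the formula would need adjusting.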

Then we expand the first formula

The interesting point is
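For reference, here is a sketch of how such an expansion can go. It assumes the formula in question has the commonly quoted form below, which may not be exactly the one I found, and it takes \(\mu\) to be the mean of the combined dataset, \(\mu = \frac{n1\mu_1+n2\mu_2}{n1+n2}\).

\[ \begin{aligned}
\sigma^2 &= \frac{n1\left(\sigma_1^2 + (\mu_1-\mu)^2\right)+n2\left(\sigma_2^2 + (\mu_2-\mu)^2\right)}{n1+n2} \\
&= \frac{n1(\sigma_1^2 + \mu_1^2)+n2(\sigma_2^2 + \mu_2^2) - 2\mu(n1\mu_1+n2\mu_2) + (n1+n2)\mu^2}{n1+n2} \\
&= \frac{n1(\sigma_1^2 + \mu_1^2)+n2(\sigma_2^2 + \mu_2^2)}{n1+n2} - 2\mu^2 + \mu^2 \\
&= \frac{n1(\sigma_1^2 + \mu_1^2)+n2(\sigma_2^2 + \mu_2^2)}{n1+n2} - \mu^2
\end{aligned} \]

The step from the second to the third line uses \(n1\mu_1+n2\mu_2 = (n1+n2)\mu\), so the cross term collapses to \(-2\mu^2\).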

So I think the formula from the Internet is correct.