This article discusses samples and their statistics, focusing on problems related to the sample mean and the sample variance.

Expected value of a statistic

Suppose we have a random variable \( X \) that follows some probability distribution, with population expectation and variance:

$$ E(X) = \mu \\ D(X) = \sigma^2 $$

However, the population expectation and variance are usually unknown, so we resort to sampling and use sample statistics to estimate them, which is in line with our intuition.

For example, suppose we have the distribution of a random variable \( X \) and display it as a graph: a scatter of gray points.

Its population expectation is the red point in the figure. Of course, we don't actually know where the red point is, but it exists objectively. It is calculated as:

$$ \mu = {1 \over N}\sum X_i $$

\( N \) is the total size of the original data. Usually \( N \) is very large (approaching infinity), so it is impossible for us to evaluate the formula above, and hence we do not know where the red point actually is.

Therefore, we sample: each time we take out only a finite number \( n \) of values as a sample, shown as the circles in the figure, and compute the mean of that batch, which is the green point inside each circle:

$$ \overline{X} = {1 \over n} \sum {X_i} $$

If we conduct countless such sampling experiments (circles) and obtain countless green points, then the average of these green points equals the expectation of the original data, i.e., the red point.

In other words, we have the following conclusion: the expected value of the sample mean equals the expected value of the original distribution, namely:

$$ E(\overline{X}) = E(X) =\mu $$

All of the above may sound intuitively obvious; but this is mathematics, and even if something seems obvious, we had better prove it mathematically:

$$ \begin{align} E(\overline{X}) &= E\Big({1 \over n}\sum X_i\Big) \\ &= {1 \over n}E\Big(\sum X_i\Big) \\ &= {1 \over n}[E(X_1)+\dots+E(X_n)] \\ &= {1 \over n}(n\mu) = \mu \end{align} $$
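Before moving on, here is a quick simulation check of this conclusion: a minimal sketch assuming a normal population with \( \mu = 4 \) and \( \sigma = 1 \) (the numbers are only for the demo), which repeats the sampling experiment many times and averages the green points:

```python
import random

random.seed(0)

mu, sigma = 4.0, 1.0   # assumed population parameters for this demo
n = 10                 # sample size per experiment
trials = 100_000       # number of sampling experiments ("circles")

# Each trial draws n values and records the sample mean (a "green point").
sample_means = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# The average of the green points approaches the red point mu = 4.
print(sum(sample_means) / trials)  # ~4.0
```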


The purpose of giving the proof above is to lead into what follows. I want to make one point clear: some conclusions in statistics seem intuitive and obvious, but they cannot be taken for granted; without a rigorous mathematical proof to back them up, you should still think twice.

For example, consider the following expected value:

$$ E(X^2) $$

That is, is the expected value of \( X^2 \) equal to \( \mu^2 \), the square of the original expectation \( \mu \)?

The answer is no. For example, consider a very simple random variable \( X \) that takes only the two values 3 and 5, each with probability 0.5. Its expectation is:

$$ \mu = 3 \cdot 0.5 + 5 \cdot 0.5 = 4 $$

However:

$$ E(X^2) = 3^2 \cdot 0.5 + 5^2 \cdot 0.5 = 17 \neq \mu^2 $$

It is not equal to \( 4^2 = 16 \), the square of the original expectation, but larger.

To put it more plainly, this follows from a simple algebraic principle: the mean of the squares is at least the square of the mean:

$$ {{a^2 + b^2}\over{2}} \geq ({{a + b}\over 2})^2 $$
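A quick way to see this: the difference between the two sides is a perfect square, hence non-negative:

$$ {{a^2 + b^2}\over{2}} - \Big({{a + b}\over 2}\Big)^2 = \Big({{a - b}\over 2}\Big)^2 \geq 0 $$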

So we reach the conclusion:

$$ E(X^2) \geq \mu^2 $$

So what is the expected value of \( X^2 \)? It is in fact equal to the square of the original expectation plus the variance:

$$ E(X^2) = \mu^2 + \sigma^2 $$

This can be derived algebraically; I won't go into all the details here, and you can find it in any probability and statistics textbook.
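That said, the core of the derivation takes only a couple of lines, starting from the definition of the variance:

$$ \begin{align} \sigma^2 &= E[(X-\mu)^2] \\ &= E(X^2) - 2\mu E(X) + \mu^2 \\ &= E(X^2) - \mu^2 \end{align} $$

Rearranging gives \( E(X^2) = \mu^2 + \sigma^2 \). In the two-point example above, the variance is \( (3-4)^2 \cdot 0.5 + (5-4)^2 \cdot 0.5 = 1 \), and indeed \( 17 = 4^2 + 1 \).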

Sample variance

The sample mean and \( X^2 \) were discussed above; below we turn to a more complex quantity: the variance \( \sigma^2 \). Like the expectation \( \mu \), the variance of the original data is usually unknown, and we need to estimate it from the sample.

Above we have calculated the average of \( n \) samples:

$$ \overline{X} = {1 \over n}\sum X_i $$

It has been proved above that its expected value is equal to the expected value of the original variable \( X \), namely:

$$ E(\overline{X}) = E(X) =\mu $$

That is to say, we can use the sample mean to estimate the expectation of the original data; in statistics this is called an unbiased estimate. In the case of the sample mean, this seems obvious.

However, if the variance of \( n \) samples is calculated:

$$ {1 \over n}\sum ({X_i - \overline{X}})^2 $$

Can we likewise use it as an unbiased estimate of the population variance \( \sigma^2 \)? The answer is no, which means:

$$ E\,[{1 \over n}\sum ({X_i - \overline{X}})^2] \neq \sigma^2 $$

If you followed the calculation of the expected value of \( X^2 \) above, you should be able to see the problem: the behavior of more complex (non-linear) statistics of \( X \), such as squares and variances, cannot be taken for granted.

In fact, the expected value of the sample variance computed above is usually a little smaller than the original variance; that is to say, this estimate is too small, it underestimates the true variance. A truly accurate estimate should divide by \( n-1 \), not \( n \):

$$ {1 \over n-1}\sum ({X_i - \overline{X}})^2 $$

This is the strict definition of the sample variance in statistics, and its mathematical expectation is equal to the variance of the original distribution \( X \):

$$ E\,[{1 \over n-1}\sum ({X_i - \overline{X}})^2] = \sigma^2 $$
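Before looking at the intuition, we can verify this empirically. The sketch below assumes a normal population with \( \sigma^2 = 4 \) and a deliberately small sample size \( n = 5 \) (assumed values, chosen to make the bias easy to see), and compares the two estimators:

```python
import random

random.seed(0)

mu, sigma = 4.0, 2.0      # assumed population parameters; true variance is 4
n = 5                     # small sample size makes the bias visible
trials = 200_000

biased_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
    biased_sum += ss / n          # divides by n: underestimates
    unbiased_sum += ss / (n - 1)  # divides by n-1: unbiased

print(biased_sum / trials)    # ~3.2, i.e. (n-1)/n * sigma^2
print(unbiased_sum / trials)  # ~4.0, i.e. sigma^2
```

The \( 1/n \) estimator converges to \( {n-1 \over n}\sigma^2 \), which is exactly the bias that the \( n-1 \) correction removes.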

This is also a rather magical conclusion, one that troubles many beginners: why \( n-1 \)?


As for the mathematical derivation of this conclusion, it can be found in many places; here I will again try to give an intuitive understanding.

The population variance is calculated as:

$$ \sigma^2 = {1 \over N}\sum ({x_i - \mu})^2 $$

\( N \) is the total size of the original data; the result above is simply the average of the squared distances from all the gray points to the red point, which is easy to understand.

Usually \( N \) is very large (approaching infinity), and we do not know the population mean, so it is impossible for us to evaluate the formula above; so again we sample, taking out only a finite number \( n \) of values each time:

Each circle is the range of one sampling experiment, with \( n \) points drawn each time; the green point is the mean of that batch of samples, namely:

$$ \overline{X} = {1 \over n}\sum X_i $$

If the sample variance is calculated by the following formula:

$$ {1 \over n}\sum ({X_i - \overline{X}})^2 $$

then it computes the average of the squared distances from the gray points in each circle to their green point.

But in fact, the exact variance of the raw data should be calculated using the distances from the gray points to the red point, that is:

$$ {1 \over n}\sum ({X_i - \mu})^2 $$

But the problem is that the red point is unknown, so in each calculation we use the sample mean \( \overline{X} \), the green point, in place of the true expectation \( \mu \), the red point. As a result, the mean we use (the green point) deviates from the true expectation (the red point), and the variance calculated from it is of course biased.

So is it too big or too small? Intuitively, the figure shows that it is too small every time: the points in a circle are the sampled data, and the green point is their mean (their center), which is obviously closer to them than the red point is. Of course, this is just an intuitive impression from the graph; algebraically, for any set of data, the sum of squared distances to their own mean is smaller than the sum of squared distances to any other point.
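This last claim has a one-line proof: expand around the sample mean and note that the cross term vanishes because \( \sum (X_i - \overline{X}) = 0 \), so for any point \( c \):

$$ \sum (X_i - c)^2 = \sum (X_i - \overline{X})^2 + n(\overline{X} - c)^2 \geq \sum (X_i - \overline{X})^2 $$

with equality only when \( c = \overline{X} \). In particular, taking \( c = \mu \) shows that measuring from the green point always gives a smaller (or equal) sum of squares than measuring from the red point.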

It is precisely because the variance computed from the sampled data is too small every time that, overall, even if we conduct countless such sampling experiments, the expected value of the computed variance still comes out too small. Note the point being emphasized here: each individual sampling calculation is too small, and therefore the expectation taken over all of them is too small.

This brings us back to the original question of why the following sample statistic is a biased, too-small estimate of the population variance:

$$ {1 \over n}\sum ({X_i - \overline{X}})^2 $$

For a really accurate estimate, you need to replace \( n \) with \( n-1 \):

$$ {1 \over n-1}\sum ({X_i - \overline{X}})^2 $$

As for why the divisor is exactly \( n-1 \), that requires a formal derivation; I will not give a detailed proof here, so please consult a textbook.
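For readers who want at least a sketch, the key computation is short. Using \( E(X_i^2) = \mu^2 + \sigma^2 \) from earlier, together with the fact that for independent samples \( \overline{X} \) has expectation \( \mu \) and variance \( \sigma^2 / n \), so that \( E(\overline{X}^2) = \mu^2 + \sigma^2/n \):

$$ \begin{align} E\Big[\sum (X_i - \overline{X})^2\Big] &= E\Big[\sum X_i^2 - n\overline{X}^2\Big] \\ &= n(\mu^2 + \sigma^2) - n\Big(\mu^2 + {\sigma^2 \over n}\Big) \\ &= (n-1)\,\sigma^2 \end{align} $$

Dividing the sum by \( n-1 \) therefore yields an estimator whose expectation is exactly \( \sigma^2 \).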

