The One Most Important Theorem Every Data Scientist Should Know
Nov 3, 2018
This article serves as a quick guide on one of the most important theorems that every data scientist should know, the Central Limit Theorem.
What is it? When can you not use it? Why is it important? Is it the same thing as the law of large numbers?
Central Limit Theorem vs. Law of Large Numbers
Often, the central limit theorem is confused with the law of large numbers. The law of large numbers states that as the size of a sample increases, the sample mean becomes a more accurate estimate of the population mean.
The difference between the two theorems is that the law of large numbers states something about a single sample mean whereas the central limit theorem states something about the distribution of the sample means.
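The law of large numbers can be seen in a few lines of NumPy (a quick sketch; the seed and sample sizes here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair six-sided die: the population mean is 3.5.
# As the single sample grows, its sample mean drifts toward 3.5.
for n in [10, 1_000, 100_000]:
    rolls = rng.integers(1, 7, size=n)
    print(n, rolls.mean())
```

Note that this tracks one sample mean as the sample grows; the CLT, by contrast, is about the distribution of many sample means.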
Central Limit Theorem (CLT)
The Central Limit Theorem states that the sampling distribution of the mean of independent, identically distributed random variables will be approximately normal if the sample size is large enough.
In other words, if we take enough sufficiently large random samples, the means of those samples will be approximately normally distributed around the true population mean. Note that the underlying population does not have to be normally distributed for the CLT to apply. To break this down even further, imagine collecting a sample and calculating the sample mean. Repeat this over and over again, collecting a new, independent sample from the population each time. If we plot a histogram of the sample means, the distribution will be approximately normal.
What does that look like? A normal distribution has a bell-shaped curve. Most of the data is clustered around the middle, that is, the mean. The standard normal distribution is centered around a mean of 0 and has a standard deviation of 1.
You may wonder: what qualifies as large enough? The general rule of thumb is that a sample size of 30 or greater is large enough for the CLT to hold in most cases.
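We can check that rule of thumb directly. In this sketch (the exponential population and the seed are my own choices, not from the article), the population is heavily skewed, yet the distribution of sample means at n = 30 is already roughly symmetric around the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 independent samples of size 30 from an exponential population
# with population mean 1.0; take the mean of each sample.
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

print(sample_means.mean())  # close to the population mean, 1.0
print(sample_means.std())   # close to 1 / sqrt(30), as the CLT predicts
```

The shrinking spread of the sample means, proportional to 1/sqrt(n), is the other half of what the CLT tells us.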
Here's a fun demonstration of the CLT at work. In the bean machine, or Galton board, beads are dropped from the top and eventually accumulate in bins at the bottom, forming the shape of a bell curve.
When can you not use the CLT?
The sampling is not random.
The underlying distribution does not have a finite mean or variance (the Cauchy distribution, for example, has neither).
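The second case can be illustrated concretely (my own sketch, using NumPy's standard Cauchy sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# The standard Cauchy distribution has no defined mean or variance,
# so the CLT does not apply: averaging more and more draws does not
# make the sample mean settle down toward any fixed value.
means = [rng.standard_cauchy(n).mean() for n in (100, 10_000, 1_000_000)]
print(means)
```

Unlike the dice example below, these sample means keep jumping around no matter how large the sample gets.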
Example With a Dice Roll
One of the classic examples of the CLT is rolling a six-sided die. Each number has a 1 in 6 probability of showing up on any roll. We can use Python to simulate our dice rolling.
Let's set our sample size to be 50 observations. The code randint(1, 7, 50) gives us an array of 50 numbers, in which the numbers 1 through 6 are equally likely to show up. Let's start off by looking at the distribution of the means of 10 samples.
import matplotlib.pyplot as plt
from numpy.random import randint

# Collect the mean of each of 10 samples of 50 dice rolls.
means = [randint(1, 7, 50).mean() for _ in range(10)]

plt.hist(means, bins='auto')
plt.title('Histogram of Dice Roll Sample Means')
plt.xlabel('Average')
plt.ylabel('Count')
plt.show()
There's not much of a shape to this distribution just yet. Let's increase the number of samples to 1,000. Notice that we are getting closer to the bell-shaped curve.
Now, let's look at an extremely large number of samples, 100,000 to be exact. This looks like a very well-defined bell curve. Isn't that amazing? Our sampling distribution of the mean looks just like the Gaussian distribution, just as the CLT tells us.
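The two larger experiments described above can be reproduced with the same randint-based simulation by varying the number of samples (a sketch; the figure styling is minimal):

```python
import matplotlib.pyplot as plt
from numpy.random import randint

# Each sample is 50 dice rolls; we collect one mean per sample and
# plot the histogram for increasingly many samples.
for n_samples in [10, 1_000, 100_000]:
    means = [randint(1, 7, 50).mean() for _ in range(n_samples)]
    plt.hist(means, bins='auto')
    plt.title(f'{n_samples:,} Dice Roll Sample Means')
    plt.xlabel('Average')
    plt.ylabel('Count')
    plt.show()
```

As the number of samples grows, the histogram of sample means fills in around 3.5, the true mean of a fair die.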
Why do we care about the CLT?
It serves as a foundation of statistics. It is usually impossible to go out and collect data on an entire population of interest. However, by collecting a subset of data from that population and using statistics, we can draw conclusions about that population.
The CLT essentially simplifies analysis for us! If we can claim a normal distribution, there are a number of things we can say about the data set. In data science, we often want to compare two different populations through statistical significance tests, that is, hypothesis testing. By the power of the CLT and our knowledge of the Gaussian distribution, we can assess our hypotheses about the two populations.
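As a sketch of that last point (the two groups and their means here are invented purely for illustration), a two-sample z-test leans directly on the CLT: each group's sample mean is approximately normal, so their difference is too, and that gives us a p-value with nothing but a normal tail probability:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Two hypothetical groups; group b's true mean is slightly higher.
a = rng.normal(loc=10.0, scale=2.0, size=500)
b = rng.normal(loc=10.3, scale=2.0, size=500)

# By the CLT, mean(a) - mean(b) is approximately normal, so we can
# form a z statistic and a two-sided p-value from the normal CDF.
se = sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
z = (a.mean() - b.mean()) / se
p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(z, p)
```

A small p-value would let us reject the hypothesis that the two populations share the same mean; the normal approximation behind it is exactly what the CLT licenses.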