Statistics is the science and art of collecting, analyzing, and interpreting data.
Summarizing data (Summary statistics):
1. By a typical value (averages):
- Median: more robust than the mean. Dropping or changing a single extreme value can shift the mean considerably but has little effect on the median.
- The mean is pulled toward outliers (skew), while the median is largely unaffected.
- A single average does not summarize a bimodal distribution well, because the typical value may fall between the two peaks, where few data points lie.
2. By how much the values differ from this typical value (variability):
- Range (max - min): heavily affected by outliers.
- Interquartile range: resistant to outliers.
- Variance and standard deviation.
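As a rough illustration of the summary statistics listed above, the sketch below computes them with Python's standard `statistics` module for a small made-up data set, with and without one outlier. The helper name `summarize` and the data values are assumptions for illustration only, and the quartiles use the module's default ("exclusive") method, which is one of several conventions.

```python
import statistics

def summarize(values):
    """Return common summary statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(values),  # population variance
        "stdev": statistics.pstdev(values),        # population standard deviation
    }

# Made-up data, for illustration only.
clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = clean + [100]  # one extreme value added

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, summarize(data))
```

Adding the single outlier moves the mean and the range a lot, while the median and the interquartile range change only slightly.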
Law of large numbers (Bernoulli’s law):
If you draw n independent samples of a random variable and average them, the average approaches the expectation of that random variable as n tends to infinity.
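The law can be seen in a small simulation. This is a minimal sketch, assuming a fair six-sided die (expectation 3.5); the function name `average_of_die_rolls` and the fixed seed are choices made here for reproducibility, not part of the law itself.

```python
import random

def average_of_die_rolls(num_rolls, seed=0):
    """Average of `num_rolls` rolls of a fair six-sided die."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    total = sum(rng.randint(1, 6) for _ in range(num_rolls))
    return total / num_rolls

# The expectation of a fair die is (1 + 2 + ... + 6) / 6 = 3.5.
for n in (10, 1_000, 100_000):
    print(n, average_of_die_rolls(n))
```

The printed averages should drift toward 3.5 as n grows, although any single small-n run can still be far off.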
Gambler’s fallacy (wrong):
The law of large numbers does not imply that deviations from expected behavior will be evened out by opposite deviations in the future. Independent events have no memory.
Example: "He is due for a hit because he hasn’t had any."
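One way to check this empirically is to simulate fair coin flips and look only at the flips that come right after a streak of tails. The sketch below assumes a fair coin and independent flips; the function name `heads_rate_after_tail_streak` and the streak length of 5 are illustrative choices.

```python
import random

def heads_rate_after_tail_streak(num_flips, streak_len=5, seed=0):
    """Fraction of heads among flips that directly follow `streak_len` or more tails."""
    rng = random.Random(seed)
    follow_ups = []  # outcomes of flips that came right after a tails streak
    tails_run = 0    # length of the current run of tails
    for _ in range(num_flips):
        is_heads = rng.random() < 0.5
        if tails_run >= streak_len:
            follow_ups.append(is_heads)
        tails_run = 0 if is_heads else tails_run + 1
    return sum(follow_ups) / len(follow_ups)

print(heads_rate_after_tail_streak(200_000))
```

The rate should stay close to 0.5: a run of tails does not make heads "due".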
Regression to the mean (correct):
Following an extreme random event, the next random event is likely to be less extreme.
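A small simulation makes this concrete. The sketch assumes independent draws from a standard normal distribution and calls a draw "extreme" when its absolute value exceeds 2; both the threshold and the function name `fraction_less_extreme_after_extreme` are assumptions made for this example.

```python
import random

def fraction_less_extreme_after_extreme(num_draws, threshold=2.0, seed=0):
    """Among draws following an extreme draw, the fraction that are less extreme."""
    rng = random.Random(seed)
    draws = [rng.gauss(0, 1) for _ in range(num_draws)]
    extreme_count = 0
    less_extreme = 0
    for prev, nxt in zip(draws, draws[1:]):
        if abs(prev) > threshold:      # the previous event was extreme
            extreme_count += 1
            if abs(nxt) < abs(prev):   # the next event is closer to the mean
                less_extreme += 1
    return less_extreme / extreme_count

print(fraction_less_extreme_after_extreme(100_000))
```

Unlike the gambler's fallacy, nothing here claims the next draw compensates for the previous one. The next draw is simply an ordinary draw, and ordinary draws are usually less extreme than an extreme one.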
Central Limit Theorem:
Start with a random variable that has any distribution (with finite variance).
Pick n independent samples from that distribution and average them.
Do this many times and plot the resulting averages.
For large enough n, this new distribution is approximately normal, irrespective of the shape of the original distribution.
Its mean equals the mean of the original distribution, and its variance is σ²/n, where σ² is the variance of the original distribution. Increasing the sample size n therefore shrinks the variance by a factor of n.
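The procedure above can be sketched in a few lines. The example assumes an exponential distribution with rate 1 as the original distribution (strongly skewed, with mean 1 and variance 1); that choice and the function name `sample_means` are illustrative, not prescribed by the theorem.

```python
import random
import statistics

def sample_means(n, num_trials, seed=0):
    """Means of `num_trials` independent samples, each of size n, from Exponential(1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_trials)
    ]

# The variance of the sample means should be close to 1 / n.
for n in (1, 4, 16, 64):
    means = sample_means(n, num_trials=5_000)
    print(
        f"n={n:3d}  mean of means={statistics.fmean(means):.3f}  "
        f"variance of means={statistics.variance(means):.4f}  expected={1 / n:.4f}"
    )
```

Plotting a histogram of `means` for each n (for example with a plotting library of your choice) would show the shape moving from skewed toward a bell curve as n grows.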
Monte Carlo simulation:
Bigger sample size -> smaller standard deviation of the estimate -> narrower confidence interval -> narrower error bars.
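When the confidence intervals of two estimates don’t overlap, we can conclude that the difference between them is statistically significant at that confidence level.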
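To see how the confidence interval narrows with sample size, here is a minimal Monte Carlo sketch that estimates pi by throwing random points at the unit square and counting those inside the quarter circle. The choice of pi as the target, the function name `estimate_pi`, and the normal-approximation 95% interval (estimate ± 1.96 standard errors) are all assumptions made for this example.

```python
import math
import random

def estimate_pi(num_points, seed=0):
    """Monte Carlo estimate of pi with an approximate 95% confidence interval."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(num_points)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    p = inside / num_points                             # fraction inside the quarter circle
    std_err = 4 * math.sqrt(p * (1 - p) / num_points)   # standard error of 4 * p
    estimate = 4 * p
    return estimate, (estimate - 1.96 * std_err, estimate + 1.96 * std_err)

for n in (1_000, 10_000, 100_000):
    estimate, (low, high) = estimate_pi(n)
    print(f"n={n:7d}  estimate={estimate:.4f}  95% CI=({low:.4f}, {high:.4f})")
```

Each tenfold increase in the number of points should shrink the interval width by roughly a factor of the square root of 10.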
When you conduct an experiment, the data may suggest a hypothesis. How do you know the result did not arise just by chance?
To find out:
Formulate a null hypothesis, under which the factor being studied has no effect. Then simulate the experiment under the null hypothesis many times and plot the distribution of outcomes.
Ask how likely the observed result is under this null distribution. If the observed result falls outside the interval that contains 95% of the null outcomes, the result is statistically significant at the 5% level, and we reject the null hypothesis in favor of the original hypothesis that the factor does have an effect.
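One common way to simulate an experiment under the null hypothesis is a permutation test: if the treatment has no effect, the group labels are interchangeable, so shuffling them shows what differences arise by chance alone. The sketch below is one such approach under that assumption, not the only valid test; the function name `permutation_p_value` and the measurements are made up for illustration.

```python
import random
import statistics

def permutation_p_value(treatment, control, num_shuffles=10_000, seed=0):
    """Observed difference in means and its two-sided permutation p-value."""
    rng = random.Random(seed)
    observed = statistics.fmean(treatment) - statistics.fmean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)  # under the null, group labels are interchangeable
        diff = statistics.fmean(pooled[:n_treat]) - statistics.fmean(pooled[n_treat:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / num_shuffles

# Made-up measurements, for illustration only.
treatment = [12.1, 13.4, 11.8, 14.0, 13.7, 12.9, 13.3, 14.2]
control = [11.0, 11.9, 10.7, 12.2, 11.5, 11.1, 12.0, 10.9]

observed, p_value = permutation_p_value(treatment, control)
print(f"observed difference = {observed:.4f}, p-value = {p_value:.4f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant at the 5% level")
```

The p-value is the fraction of shuffled experiments whose difference is at least as large as the observed one. A value below 0.05 corresponds to the observed result lying outside the central 95% of the null distribution.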
The standard deviation of the sampling distribution of the sample mean (also called the sampling distribution of the mean) is called the standard error of the mean.
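In practice the standard error of the mean is usually estimated from a single sample as the sample standard deviation divided by the square root of the sample size. The sketch below shows that estimate; the function name `standard_error_of_mean` and the data are illustrative assumptions.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Made-up sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error_of_mean(sample))
```

This estimate is only meaningful when the observations are independent.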
Always take independent random samples.
Statistical fallacies and morals:
Statistics about the data are not the same as the data.
Use visualization tools to look at the data.
Look at the axis labels and scales.
Are things being compared really comparable?
Covariance: Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive.
The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
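The difference between covariance and correlation shows up when the units change. The sketch below uses the sample covariance (dividing by n - 1) and the Pearson correlation coefficient; the function names and the height and weight values are made up for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up data, for illustration only.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 90]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

print("cov (cm):", covariance(heights_cm, weights_kg))
print("cov (m): ", covariance(heights_m, weights_kg))
print("corr (cm):", correlation(heights_cm, weights_kg))
print("corr (m): ", correlation(heights_m, weights_kg))
```

Switching from centimeters to meters scales the covariance by a factor of 100, while the correlation coefficient stays the same and always lies between -1 and 1.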