Statistical Fallacies
A summary of the lecture "Introduction to Computational Thinking and Data Science", via MITx:6.00.2x (edX)
Lies, Damned Lies and Statistics
- Humans and Statistics
Note: "If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference." - Darrell Huff
- Moral: Statistics about the data are not the same as the data
- Moral: Use visualization tools to look at the data itself
- Moral: Look carefully at the axes labels and scales
- Moral: Ask whether the things being compared are actually comparable
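The first moral above can be illustrated with a minimal sketch. The two data sets below are hypothetical, chosen only so that their summary statistics coincide; `summarize` is an illustrative helper, not something from the lecture.

```python
import statistics

# Two hypothetical data sets, invented purely for illustration.
a = [2, 4, 4, 4, 5, 5, 7, 9]
b = [3, 3, 3, 3, 7, 7, 7, 7]

def summarize(data):
    """Return (mean, population standard deviation) of data."""
    return statistics.mean(data), statistics.pstdev(data)

# Both summaries agree, yet the raw values look nothing alike:
# a is spread over six distinct values, b takes only two.
print("a:", summarize(a), "distinct values:", sorted(set(a)))
print("b:", summarize(b), "distinct values:", sorted(set(b)))
```

Plotting or even just printing the raw values reveals a difference that the mean and standard deviation hide.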
Texas Sharpshooter
- Sampling
- All statistical techniques are based upon the assumption that by sampling a subset of a population we can infer things about the population as a whole
- As we have seen, if random sampling is used, one can make meaningful mathematical statements about the expected relation of the sample to the entire population
- Easy to get random samples in simulations
- Not so easy in the field, where some examples are more convenient to acquire than others
- Non-representative Sampling
- "Convenience sampling" is usually not random, e.g.,
- Survivor bias, e.g. course evaluations at end of course or grading final exam in 6.00.2x on a curve
- Non-response bias, e.g., opinion polls conducted by mail or online
- When samples are not random and independent, we can still compute things like means and standard deviations, but we should not draw conclusions from them using tools such as the empirical rule and the central limit theorem.
- Moral: Understand how data was collected, and whether assumptions used in the analysis are satisfied. If not, be wary.
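The difference between random and convenience sampling can be sketched in a small simulation. Everything below is an assumption made for illustration: the population is synthetic (Gaussian with mean 50 and standard deviation 10), and "convenience" is modeled, somewhat arbitrarily, as taking the items that happen to sit at the end of a sorted list.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 Gaussian values, stored in sorted order.
population = sorted(random.gauss(50, 10) for _ in range(10_000))

def random_sample(pop, n):
    """Draw n items uniformly at random, without replacement."""
    return random.sample(pop, n)

def convenience_sample(pop, n):
    """Take the n items that are easiest to reach: here, the last n."""
    return pop[-n:]

pop_mean = statistics.mean(population)
rand_mean = statistics.mean(random_sample(population, 200))
conv_mean = statistics.mean(convenience_sample(population, 200))

print(f"population mean:         {pop_mean:.2f}")
print(f"random sample mean:      {rand_mean:.2f}")
print(f"convenience sample mean: {conv_mean:.2f}")
```

The mean of the convenience sample can still be computed, but because the sample is not random it says little about the population mean.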
- A Comforting Statistic?
- Moral: Context matters. A number means little without context.
- Relative to What?
- Moral: Beware of percentages when you don't know the baseline
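As a small worked example of why the baseline matters, the sketch below compares two changes that are identical in percentage terms. The counts are hypothetical and `describe_change` is an illustrative helper, not part of the lecture.

```python
def describe_change(before, after):
    """Return (absolute change, percent change) between two counts."""
    absolute = after - before
    percent = 100 * absolute / before
    return absolute, percent

# Hypothetical counts: both cases are a "100% increase".
small = describe_change(1, 2)        # baseline of 1
large = describe_change(1000, 2000)  # baseline of 1000

print("small baseline:", small)
print("large baseline:", large)
```

The percentage is the same in both cases, while the absolute change differs by a factor of 1000, so the percentage alone cannot tell a reader how large the effect is.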
Correlation and Causation
- Cum Hoc Ergo Propter Hoc
- With this, therefore because of this
- Humans are "wired" to find patterns in information and like to think causally
- Establishing Causation
- Attempt to control for all variables other than the variables of interest
- Rarely possible
- Randomized controlled studies are the gold standard
- Start with a population
- Randomly assign members to either
- Control group
- Treatment group
- Treat the two groups identically except with respect to the one thing being evaluated
- Very hard to do
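The random assignment step of a randomized controlled study can be sketched as follows. This is a minimal illustration under stated assumptions: `assign_groups` and the subject labels are hypothetical, and the split is a simple shuffle into two halves rather than any particular trial design.

```python
import random

def assign_groups(members, seed=None):
    """Randomly split members into (control, treatment) groups.

    A copy of the input is shuffled and cut in half, so every member
    lands in exactly one group and the input list is left unchanged.
    """
    rng = random.Random(seed)  # local generator; seed makes the split repeatable
    shuffled = list(members)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical population of ten subjects.
population = [f"subject_{i}" for i in range(10)]
control, treatment = assign_groups(population, seed=42)

print("control:  ", control)
print("treatment:", treatment)
```

Because membership is decided by the shuffle rather than by the experimenter or the subjects, differences between the groups other than the treatment should average out as the population grows.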