Statistical Fallacies
A Summary of lecture "Introduction to Computational Thinking and Data Science", via MITx:6.00.2x (edX)
Lies, Damned Lies and Statistics
 Humans and Statistics
Note: "If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference." (Darrell Huff)
 Moral: Statistics about the data is not the same as the data
 Moral: Use visualization tools to look at the data itself
 Moral: Look carefully at the axes labels and scales
 Moral: Ask whether the things being compared are actually comparable
Texas Sharpshooter
 Sampling
 All statistical techniques are based upon the assumption that by sampling a subset of a population we can infer things about the population as a whole
 As we have seen, if random sampling is used, one can make meaningful mathematical statements about the expected relation of the sample to the entire population
 Easy to get random samples in simulations
 Not so easy in the field, where some examples are more convenient to acquire than others
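The contrast between random sampling and taking whatever is easiest to reach can be sketched in a small simulation. Everything here is hypothetical: the Gaussian population, the "effort to reach" score, and the assumption that effort rises with the measured value are invented for illustration and are not from the lecture.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: each member has a measured value and an
# "effort to reach" score that (by assumption) rises with the value.
population = []
for _ in range(10_000):
    value = rng.gauss(50, 10)
    effort = value + rng.gauss(0, 5)
    population.append((effort, value))

values = [value for _, value in population]
sample_size = 200

# Simple random sample: every member is equally likely to be chosen.
random_sample = rng.sample(values, sample_size)

# Convenience sample: only the members that take the least effort to reach.
convenience_sample = [value for _, value in sorted(population)[:sample_size]]

population_mean = mean(values)
random_error = abs(mean(random_sample) - population_mean)
convenience_error = abs(mean(convenience_sample) - population_mean)

print("population mean:", round(population_mean, 2))
print("random sample error:", round(random_error, 2))
print("convenience sample error:", round(convenience_error, 2))
```

Under these assumptions the random sample's mean stays close to the population mean, while the convenience sample's mean is far off, because ease of access is tied to the quantity being measured.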
 Nonrepresentative Sampling
 "Convenience sampling" is not usually random, e.g.,
 Survivor bias, e.g., course evaluations at end of course or grading final exam in 6.00.2x on a curve
 Nonresponse bias, e.g., opinion polls conducted by mail or online
 When samples are not random and independent, we can still compute means and standard deviations, but we should not draw conclusions from them using tools such as the empirical rule and the central limit theorem.
 Moral: Understand how data was collected, and whether assumptions used in the analysis are satisfied. If not, be wary.
 A Comforting Statistic?
 Moral: Context matters. A number means little without context.
 Relative to What?
 Moral: Beware of percentages when you don't know the baseline
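A short sketch of why the baseline matters: the same percentage change can describe very different absolute changes. The function name `percent_change` and the case counts below are made up for illustration only.

```python
def percent_change(baseline, new):
    """Relative change from baseline to new, expressed as a percentage."""
    if baseline == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (new - baseline) / baseline * 100


# Hypothetical case counts per million people, before and after.
rare_before, rare_after = 1, 2
common_before, common_after = 100_000, 200_000

rare_pct = percent_change(rare_before, rare_after)
common_pct = percent_change(common_before, common_after)

# Both are reported as the same percentage increase...
print("rare:", rare_pct, "% increase")
print("common:", common_pct, "% increase")

# ...but the absolute changes differ by five orders of magnitude.
print("rare absolute change:", rare_after - rare_before)
print("common absolute change:", common_after - common_before)
```

Reporting only the percentage hides whether one extra case or a hundred thousand extra cases per million are at stake.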
Correlation and Causation
 Cum Hoc Ergo Propter Hoc
 With this, therefore because of this
 Humans "wired" to find patterns in information and like to think causally
 Establishing Causation
 Attempt to control for all variables other than the variables of interest
 Rarely possible
 Randomized control studies the gold standard
 Start with a population
 Randomly assign members to either
 Control group
 Treatment group
 Deal with two groups identically except with respect to the one thing being evaluated
 Very hard to do
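The random-assignment step can be sketched as follows. This is a minimal illustration, not a full study design: the helper `randomly_assign`, the population of 1,000 participants, and the age attribute are all hypothetical.

```python
import random


def randomly_assign(members, rng):
    """Shuffle a copy of members and split it into (control, treatment)."""
    shuffled = list(members)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 participant ids, each with an age.
ages = {person: rng.randint(18, 80) for person in range(1_000)}

control, treatment = randomly_assign(ages, rng)

# Random assignment tends to balance variables we did not control for,
# such as age, between the two groups.
age_gap = abs(mean([ages[p] for p in control]) -
              mean([ages[p] for p in treatment]))

print("control size:", len(control))
print("treatment size:", len(treatment))
print("difference in mean age:", round(age_gap, 2))
```

Because membership is decided by the shuffle rather than by the experimenter or the participants, differences between the groups other than the treatment are expected to be small on average, which is what makes the later comparison meaningful.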