“Naked Statistics” by Charles Wheelan is not a book that explains statistical methods in detail. Instead, the author takes a holistic view of statistics and shows, through numerous examples, how it can solve real-world problems. Here I briefly summarize my takeaways using the CSI example from the book.
Why do we use statistics? Statistics is a powerful tool for identifying important relationships between things in order to explain certain phenomena. Such relationships include association and causal effect. The process by which a statistical analyst figures out these relationships is similar to that of a social detective answering the following questions:
The following paragraphs cover what I think are the core parts of the analysis / detective process. Let’s pretend to be social detectives.
This step is not explicitly called out in the book, but I think it is worth mentioning here, as defining a problem sets the purpose of an analysis. For example, terrorism exists, and that is a problem. If there were no terrorism in the world, we would not need to investigate it. Next we want to decompose the problem in order to address it. The ultimate solution is to eliminate terrorism. But to come up with a solution we need to figure out the “why”, the causes of terrorism. This “why” question is too complex to approach directly, as the scope of possible causes is too broad. To narrow the scope, one starting question is “who is likely to be a terrorist?”.
To answer the last question above, a common strategy is to compare the characteristics of past terrorists with those of the general population to see what stands out. We want to sample both populations carefully to avoid biases (e.g., focusing on a certain region); otherwise, with problematic data, any statistical method will be useless: “garbage in, garbage out”. Then we use descriptive statistics, such as the mean, median, and standard deviation, to summarize the two samples across different factors, such as family income. (Mia plugging in data visualization here: it complements the statistical summary and is a powerful tool for uncovering both the trends and the issues in data.)
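To make the descriptive-statistics step concrete, here is a minimal sketch of my own (not from the book). The two income lists are made-up numbers for illustration only, and `summarize` is a hypothetical helper name. It shows how a single outlier pulls the mean well above the median, which is why looking at more than one summary number matters.

```python
import statistics

def summarize(values):
    """Return the mean, median and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

# Made-up family incomes (in thousands), purely illustrative.
sample_a = [32, 45, 51, 38, 60, 47, 55]
sample_b = [28, 41, 39, 35, 120, 44, 37]  # contains one very large value

summary_a = summarize(sample_a)
summary_b = summarize(sample_b)

for name, summary in (("sample_a", summary_a), ("sample_b", summary_b)):
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

In `sample_b` the mean ends up higher than in `sample_a` even though most of its values are lower, while the median is not fooled by the single large income.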
But how do we know whether these samples reveal information about the overall population groups (e.g., terrorists)? The Central Limit Theorem (CLT) comes to help. It tells us the probability that a sample mean (e.g., average family income) will lie within a certain distance of the population mean. We can then estimate how close the sample mean is to the population mean and attach a confidence level to that distance. Note that we can never be 100% certain about our conclusion. We can only make an inference, indicating how likely it is that the sample mean lies close to the true value.
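The CLT idea can be checked with a small simulation. This is a sketch under my own assumptions, not something from the book: the "population" is simulated from a skewed (exponential) distribution, the sample size of 100 and the multiplier 1.96 (the usual normal value for roughly 95% confidence) are my choices, and `mean_confidence_interval` is a hypothetical helper. It repeatedly draws samples and counts how often the interval around the sample mean contains the population mean.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean, using the normal approximation."""
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return sample_mean - z * standard_error, sample_mean + z * standard_error

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated, right-skewed "population" of incomes (illustrative only).
population = [rng.expovariate(1 / 50) for _ in range(20_000)]
population_mean = statistics.mean(population)

trials = 1000
hits = 0
for _ in range(trials):
    sample = rng.sample(population, 100)
    low, high = mean_confidence_interval(sample)
    if low <= population_mean <= high:
        hits += 1

coverage = hits / trials
print(f"population mean: {population_mean:.1f}")
print(f"share of intervals containing it: {coverage:.3f}")
```

Even though the individual incomes are far from normally distributed, the share of intervals that contain the population mean should come out close to the nominal 95%. That is the sense in which we can attach a confidence level to a sample mean without ever being 100% certain.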
With the collected data, we are now able to hypothesize about and test the factors (e.g., family income) associated with terrorists, using regression. Regression analysis tells us, for each factor, the size of its association with whether a person is a terrorist (the coefficient), together with a level of confidence. We then know which factors matter and which are irrelevant.
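Here is a rough sketch of what "a coefficient together with a level of confidence" looks like in practice. To keep it short I am swapping in the simplest case, a one-factor linear regression with a continuous outcome, rather than the yes/no outcome in the example above, which would normally call for something like logistic regression. The data are simulated, and `simple_ols`, `relevant` and `unrelated` are hypothetical names of my own.

```python
import math
import random

def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, standard error of the slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, slope_se

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 200

# Simulated data: the outcome depends on `relevant` but not on `unrelated`.
relevant = [rng.uniform(0, 10) for _ in range(n)]
unrelated = [rng.uniform(0, 10) for _ in range(n)]
outcome = [2.0 + 0.5 * r + rng.gauss(0, 1) for r in relevant]

slope_rel, _, se_rel = simple_ols(relevant, outcome)
slope_unrel, _, se_unrel = simple_ols(unrelated, outcome)

# The t-statistic (coefficient divided by its standard error) is one common
# way to express how confident we are that a coefficient differs from zero.
t_rel = slope_rel / se_rel
t_unrel = slope_unrel / se_unrel
print(f"relevant factor:  slope={slope_rel:.3f}, t={t_rel:.1f}")
print(f"unrelated factor: slope={slope_unrel:.3f}, t={t_unrel:.1f}")
```

The relevant factor should show a slope near the 0.5 used to generate the data, with a large t-statistic, while the unrelated factor's slope should sit near zero. A real analysis would fit all factors in one multiple regression rather than one at a time, so that each coefficient is estimated while holding the others constant.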
With the answer above, we are now able to approach the second question with a narrower scope. What causes terrorism? One hypothesis is about political goals. There are several approaches to testing hypotheses about causal effects. The controlled experiment is the most rigorous one: people are assigned to different conditions and the outcomes are compared. However, sometimes this approach is not applicable because people cannot be “assigned to” certain conditions. Alternative approaches thus include natural experiments, nonequivalent control groups, difference in differences, etc. These methods often require very careful and clever thinking to mimic the randomization setup of controlled experiments.
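Of the alternatives listed, difference in differences is the easiest to show as arithmetic. The sketch below is my own illustration with made-up numbers, and `diff_in_diff` is a hypothetical helper. The idea is to subtract the change seen in a comparison group from the change seen in the group exposed to the condition, so that whatever trend affected both groups cancels out.

```python
import statistics

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = statistics.mean(treated_after) - statistics.mean(treated_before)
    control_change = statistics.mean(control_after) - statistics.mean(control_before)
    return treated_change - control_change

# Made-up outcome measurements, purely illustrative.
treated_before = [10.0, 12.0, 11.0]
treated_after = [15.0, 17.0, 16.0]
control_before = [9.0, 10.0, 11.0]
control_after = [11.0, 12.0, 13.0]

estimate = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(f"difference-in-differences estimate: {estimate:.1f}")
```

Here the treated group rose by 5 and the control group by 2, so the estimate is 3. This only reads as a causal effect under the assumption that the two groups would have followed the same trend without the treatment, which is exactly the kind of careful thinking the method demands.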
In the above process, the key statistical concepts are highlighted. Again, statistics is a powerful tool for identifying important relationships between things in order to explain certain phenomena. I appreciate that the author steps outside the textbooks and connects statistics to the real world.