Wednesday, July 20, 2011

A Quick and Dirty Introduction to Parametric Statistics, Point Estimation, and Hypothesis Testing

This post is highly technical in nature.  Obviously, this is not ideal.  However, I think that this is by far the best way to lead up to the next entry (which I should hopefully be able to complete by Friday).  I also hope that this post will give at least a small taste of the work of statisticians, since as I mentioned in my first post, the public has only a vague idea of what we do.  So no matter what, I think this post is worth your attention.  Also, I want to note at the outset that feedback is welcome.  If anything in this post is vague or hard to follow, don't be afraid to let me know in the comments section!  I'll do my best to revise the post accordingly.


Generally when laypeople use the word “statistics,” what they really mean are percentages. “67% of all quoted statistics are made up on the spot,” and so forth.  The field of statistics is actually much broader than that. Simply put, statistics is the science (or art, depending on how you see it) of drawing defensible conclusions from data that have some element of randomness built into them.  Despite some recent challenges to its supremacy, the reigning methodology for drawing such conclusions remains what practitioners have come to call “parametric statistics.”  When practicing parametric statistics, we assume that the data follow a known probability distribution that can be defined solely in terms of a small set of parameters.  In practice, this means that even a very large data set can be summarized efficiently by only a few values, and that we can make predictions with a relatively small amount of computing power (among other benefits).  There are actually two competing ways of deciding upon reasonable values for parameters, but for the purposes of this post, we'll confine ourselves to the methodological assumptions of what has come to be called "frequentist" statistics (this set of assumptions is also sometimes referred to as “classical statistics,” but I happen to think that this designation is a bit of a historical distortion).  The competing “Bayesian” methodology for estimating parameter values will have to wait for another time, as it's not relevant to the post I want to introduce.



The best example of how parameters work is probably what non-statisticians often refer to as the “bell-curve” (statisticians generally call it either the "normal distribution" or the “Gaussian distribution”).  When we say that a random variable, X, is such that X~N(μ,σ²) (in which case we say “X follows a normal distribution with mean mu and variance sigma squared”), we mean that the values that X can take lie on a bell-curve with mean mu and standard deviation sigma (the variance of a distribution is just the square of its standard deviation, but statisticians tend to talk about variance more often for reasons that I won't go into here).  Once we know the values of these two parameters, we know everything of interest about X.  To see what I mean, look first at figure 1, which is a graph of the “standard” normal distribution: N(0,1).  The double arrow indicates one standard deviation about the mean.
Figure 1
What happens if we change the value of μ, while leaving σ fixed?  We get the graph shown in figure 2, with the original normal distribution in blue, and a new normal distribution to its right of N(2,1) shown in red.  The two graphs exhibit the same “spread” about the mean, but the mean itself has shifted to the right by 2.
Figure 2
On the other hand, if we keep μ fixed, but increase σ to 3 (giving us N(0,9)), we get the graph shown in figure 3, with the new distribution in red, and the original N(0,1) distribution shown in blue.  Note that this time both bell-curves are centered in the same place (at μ=0), but they exhibit a different amount of “spread” about that mean.  Even as the graph becomes wider, however, the general shape of the curve remains recognizable.
Figure 3
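For readers who'd like to experiment with these curves themselves, here is a small Python sketch of the density function behind figures 1 through 3 (the choice of Python is mine, not anything essential; any language with an exponential function would do):

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Each of the three densities from the figures, evaluated at x = 0.
print(normal_pdf(0.0, mu=0.0, sigma=1.0))  # N(0,1): about 0.399, the peak of figure 1
print(normal_pdf(0.0, mu=2.0, sigma=1.0))  # N(2,1): about 0.054, since x=0 is far from the shifted mean
print(normal_pdf(0.0, mu=0.0, sigma=3.0))  # N(0,9): about 0.133, the lower, wider peak of figure 3

Plotting normal_pdf over a range of x values for each parameter setting reproduces the three figures above.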
So much for parameters.  What does the practice of parametric statistics actually look like?  Consider, in detail, the following example.  Suppose that we have a coin, and we wish to know what will happen when we flip it.

Step 1: Defining the Parameters

The behavior of the coin is governed by one parameter: the probability of getting heads on any given toss. For now, call this parameter π (by convention, parameters are generally denoted by Greek letters). Now suppose that we wish to determine if the coin is “fair.” That is to say, we wish to know if the probability that the coin shows heads on any toss is 0.5. We could test this by flipping the coin 1000 times. Suppose, then, that we observe 482 heads in 1000 tosses.
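As an aside, those who want to play along at home can simulate the whole experiment in a few lines of Python. The "true" value of the parameter below is made up purely for illustration; in a real experiment it is, of course, precisely the thing we don't know:

import random

random.seed(2011)  # an arbitrary seed, fixed so that the run is reproducible
true_pi = 0.48     # hypothetical "true" heads probability, unknown to the analyst

tosses = [random.random() < true_pi for _ in range(1000)]
heads = sum(tosses)
print(heads)  # a count that should land reasonably close to 480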

Step 2: Estimating the Parameters

Without doing any math whatsoever, it seems pretty obvious that the most logical estimate for π is 482/1000=0.482. If we wish to be mathematically rigorous about it (those who are uncomfortable with math should feel free to skip to step 3 here), we do the following. Since, for any toss, the probability of heads is π, and the probability of tails is (1-π), and the tosses are independent (i.e. the outcome of one toss has no effect whatsoever on the outcome of the next toss), we can say that the probability (or “likelihood” in statistical terminology) of our observed results is:

L(π;y)=π^482(1-π)^518 (where "y" represents the observed data, and 518 is the number of tails observed)

We now wish to find the value of π for which the likelihood function takes the largest possible value, which is equivalent to finding the value of π under which the observed results would have been most probable. In practice, it is almost always easier to maximize the natural logarithm of the likelihood function (or "log likelihood"), i.e.:
l(π;y)=482log(π)+518log(1-π)
For anyone who's had a little calculus, it should be obvious that this can easily be maximized by taking the derivative, and setting it equal to 0:
dl/dπ=0
0=482/π-518/(1-π)
482-482π=518π
and so: π̂=482/1000=0.482, which is the same result that our initial intuition indicated.

Statisticians call the quantity π̂ the Maximum Likelihood Estimator (or MLE for short) for π. And because of the way it's written, we sometimes call it “pi hat” to boot (yeah, I know...). There are other methods of finding plausible estimates of a parameter (each called a “point estimator,” since it occupies a single point on the number line rather than a range of values), but maximum likelihood is by far the most often used, in part because it is logical, and also because it's often - though by no means always - mathematically simple to calculate.
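For the computationally inclined, here's a quick numerical sanity check of the derivation above, sketched in Python: instead of doing calculus, we can simply evaluate the log likelihood over a fine grid of candidate values and keep the winner.

import math

heads, tails = 482, 518

def log_likelihood(p):
    # l(p) = 482*log(p) + 518*log(1 - p), from the derivation above
    return heads * math.log(p) + tails * math.log(1 - p)

# Evaluate on a fine grid of points strictly inside (0, 1).
candidates = [i / 100000 for i in range(1, 100000)]
pi_hat = max(candidates, key=log_likelihood)
print(pi_hat)  # 0.482, matching the closed-form answer heads/(heads + tails)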

Step 3: Testing our Estimates

Here's where it gets interesting. It should be fairly obvious that we can never know the true value of π unless we have the time and the patience to toss the coin an infinite number of times (we obviously have neither). Due to randomness, tossing the coin another 1000 times could yield a different result (don't believe me? try tossing a coin 20 times, and then another 20, recording the number of heads for each 20-flip experiment). Furthermore, although we've decided that 0.482 is the most plausible estimate for π, we haven't ruled out other possibilities, including the possibility of a fair coin (our impetus for conducting the whole experiment in the first place); note that the probability of a fair coin showing 482 or fewer heads in 1000 tosses is actually about 0.1342, which I suspect is larger than our intuition would lead most of us to expect. So we do what statisticians call hypothesis testing. In this case, we want to know how far our observed results deviate from the expected results if the coin were indeed fair (500 heads and 500 tails in 1000 tosses). Or, as statisticians put it, we are testing the “null hypothesis” (H0) of a fair coin against an “alternative hypothesis” (HA) of an unfair coin (though we do not necessarily say how unfair at this point). In statisticians' shorthand this is abbreviated as testing:

H0: π=0.5        v.        HA: π≠0.5
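Incidentally, the tail probability of 0.1342 that I quoted above is easy to verify in Python, assuming the SciPy library is installed:

from scipy.stats import binom

# Probability that a fair coin shows 482 or fewer heads in 1000 tosses.
print(binom.cdf(482, 1000, 0.5))  # about 0.134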

The mathematics behind the derivation of the standard tests that I'm going to use here is a little involved (those who are interested in knowing more and know at least a little calculus, matrix algebra and probability can try following the notes that I've included at the end of this blog post, which were adapted from a lecture given by one of my graduate school professors, though I would recommend reading Wikipedia's entry on Fisher Information before attempting to tackle the notes themselves). Suffice it to say that one such test utilizes the Wald statistic, W (where n denotes the number of trials, which here is obviously 1000, and π0 denotes the value of π under the null hypothesis, here 0.5):

W=n(π0-π̂)^2/(π̂(1-π̂))=1000(0.5-0.482)^2/(0.482*0.518)≈1.2977
Now we ask the following (seemingly odd) question: What is the probability of observing an even larger value of this number if the null hypothesis is true? In this case the probability that W could be even larger given a fair coin is 0.2546371. (For those who've had some probability, this can be computed by utilizing the fact that as the sample size gets increasingly large, W converges in distribution to a chi-squared distribution on one degree of freedom, so we just let p=1-Pr(χ²₁≤1.2977)≈0.2546371.) We call this probability a “p-value.” At the outset of the experiment, we choose a number between 0 and 1, called the “significance level,” designated by the Greek letter α (by convention, the level is usually chosen to be 0.05). If the p-value is less than α, we reject the null hypothesis (i.e. conclude that it cannot plausibly explain the observed data) and call the result "statistically significant at level α" (in case you've ever heard the term “statistically significant” in the news, here's where it really comes from). On the other hand, if p≥α, we fail to reject the null hypothesis: the data are consistent with it, though they don't prove it. Note that the larger α is chosen to be, the easier it becomes to reject the null hypothesis, and so the pickier we're being about how closely our estimate of π must match the hypothesized value.
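Here is the same calculation written out as a Python sketch, again assuming SciPy is available for the chi-squared tail probability:

from scipy.stats import chi2

n, pi_hat, pi_0 = 1000, 0.482, 0.5

# Wald statistic: W = n*(pi_0 - pi_hat)^2 / (pi_hat*(1 - pi_hat))
W = n * (pi_0 - pi_hat) ** 2 / (pi_hat * (1 - pi_hat))

# Under the null hypothesis, W is asymptotically chi-squared on one
# degree of freedom, so the p-value is the upper-tail probability at W.
p_value = chi2.sf(W, df=1)
print(W, p_value)  # roughly 1.2977 and 0.2546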

Before moving on, I just want to note something rather important. A common misconception among beginning students of statistics is that the p-value is the probability that the null hypothesis is true. It isn't. Frequentist statistics cannot, in fact, attach probabilities of this kind to parameters (which is something often pointed out by critics of frequentist methods, but more on that another time). The p-value is better thought of as a statement of how likely results at least as extreme as those we observed would be, were the null hypothesis true.

What may be more useful than testing a hypothesis about a single value, however, is a confidence interval. A (1-α)*100% confidence interval for π is the entire set of hypothesized values for which we would find that p≥α. In other words, the confidence interval is the range of values that a level-α test would fail to reject. Generally, we try to obtain a 95% confidence interval: the set of all possible values of π that would give a p-value of more than 0.05 if tested. In this case, the range of values is:
π̂±1.96(π̂(1-π̂)/n)^(1/2)
Don't worry too much about how this is derived. Mainly, I've included this expression here to point out that as n (the sample size) gets progressively larger, the range of values that π can reasonably assume within our framework gets smaller and smaller. This makes intuitive sense: the more times we flip the coin, the better the proportion of heads should approximate the true value of the parameter. I.e. the larger the sample size, the more precise any kind of statistical inference will be. (For a mathematical explanation of why this is so, see the law of large numbers.) Anyway, given our data, we find that the 95% confidence interval for π is:
0.4510298≤π≤0.5129702
In other words, were we to test any of the values in this range as a null hypothesis, we would fail to reject it. So while a lot of the evidence points to the coin being fair, we can't necessarily rule out the possibility that this isn't the case (though even then, the coin is probably not too far from being fair). In fact, the coin could be biased towards either heads or tails, as surprising as the former possibility probably seems given our data.
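For completeness, here's the interval computed in Python; no special libraries are needed this time:

import math

n, pi_hat = 1000, 0.482
z = 1.96  # standard normal quantile giving 95% coverage

half_width = z * math.sqrt(pi_hat * (1 - pi_hat) / n)
print(pi_hat - half_width, pi_hat + half_width)  # roughly 0.4510 and 0.5130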

A few more points. A common misconception is that a 95% confidence interval is the range of values within which the parameter lies with 95% probability. It isn't. It's something a little more subtle, and maybe somewhat disappointing: the range of hypothesized values that a level-0.05 test would fail to reject given our data. One could also think of the interval as follows: Were this procedure to be repeated on multiple samples, the calculated confidence interval (which would differ for each sample) would encompass the true parameter 95% of the time. Those interested in learning more about the philosophical issues of confidence intervals are advised to read more at this link.

Another Coin: Rejecting Hypotheses

Now suppose we have a second coin, and we wish to conduct the same experiment. Call the probability of heads φ, and flip the coin 1,000 times. Assume that we observe 319 heads. Using the same logic as before, we get φ̂=0.319 as the MLE for φ. Is it still possible that this coin is fair?  Again, let's test the null hypothesis of a fair coin versus the alternative hypothesis of a coin that is biased in favor of heads or tails (again, without saying just how biased).  Once again using the Wald statistic, we get:

W=1000(0.5-0.319)^2/(0.319(1-0.319))=150.8063

Note that the statistic is bigger this time. Much bigger. This is because our results deviate much further than previously from the expected observations for a fair coin. In this case, the probability of observing an even larger value of W for a fair coin is so small that it is essentially 0 (this is sensible, since the probability of observing less than 370 or so heads in 1000 tosses of a fair coin is essentially 0). We can therefore reject the hypothesis that the coin is fair.

What are the plausible values of φ for this coin? Again, we use a 95% confidence interval, which we find to be:

0.2901≤φ≤0.3479

In other words, we may not know the exact probability of the coin showing heads on any given toss, but we have found that the coin is heavily biased in favor of tails, and the null hypothesis of a fair coin would be rejected at virtually any significance level.
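The same few lines of Python handle the second coin; only the numbers change:

from scipy.stats import chi2

n, phi_hat, phi_0 = 1000, 0.319, 0.5

# Wald test of the null hypothesis phi = 0.5.
W = n * (phi_0 - phi_hat) ** 2 / (phi_hat * (1 - phi_hat))
print(W, chi2.sf(W, df=1))  # roughly 150.81, with a p-value of essentially 0

# 95% confidence interval for phi, using the same normal approximation as before.
half_width = 1.96 * (phi_hat * (1 - phi_hat) / n) ** 0.5
print(phi_hat - half_width, phi_hat + half_width)  # roughly 0.2901 and 0.3479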


Some Closing Thoughts:

Some people are probably asking “Why talk about flipping coins? Surely the real world is more complicated!” Well yes, it is. However, even the most complicated statistical models use the same principles and procedures I've just outlined. The coin example just happens to be one that I think is user friendly for a layperson who's new to this way of thinking. But just to sketch how the same principles apply to more complicated models, let's talk about one of the most commonly used statistical methods: linear regression. Suppose that you have a set of data points that look like figure 4.
Figure 4

It seems pretty clear that some line of the form y=mx+b could be plausibly (but not perfectly) drawn here to explain the relationship between x and y. Maybe something like the one shown in figure 5, perhaps?
Figure 5

But how do we get from figure 4 to figure 5? Generally in these situations, we can think of x as being deterministic, and y as being random. It is therefore often plausible to think of the following model (where i=1,...,n, n being the number of data points): yi = β0 + β1xi + εi

Under the assumptions of classical linear regression, we say that for all values of i, εi~N(0,σ²) (where σ is unknown), and hence that yi~N(β0+β1xi, σ²). The parameters for this model, then, are β0, β1 and σ. This model is more complicated than the model we used for the coin, but as before, we proceed by estimating the values of the parameters from the data, and then testing those parameters to see if they're sensible and how well they explain the observed data. For example, it's very common for this type of statistical work to test: H0: β1=0 v. HA: β1≠0. This is done in order to ascertain whether or not positing the existence of a linear relationship between x and y is actually justified by the data.
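To make this a little more concrete, here is a minimal Python sketch of the standard least-squares estimates of β0 and β1. The five data points are made up solely for illustration, and real statistical software would also report standard errors and the test of β1=0 just mentioned:

# Made-up data points for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates:
#   beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)  # intercept 0.15 and slope 1.95 for these made-up points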

This is a condensation of quite a lot of material. Probability theory, statistical inference, and linear regression are each taught as semester courses at the master's level in American universities (not including the necessary prerequisites of multivariable calculus and matrix algebra). How the standard tests are derived, and how to interpret them requires a fair bit of mathematical machinery that I've skipped here in the interest of clarity and brevity. Hopefully, however, I've offered a useful window into the daily work of a practicing statistician, as well as a decent introduction to the next blog post.

Appendix:  Derivation of the Wald and other tests

Asymptotic Tests
